Optimized way of finding similar strings
Suppose I have a large list of words. (about 4-5 thousands and increasing). Someone searched for a string. Unfortunately, The string was not found on the wordlist. Now what would be the best and optimized way to find words similar to the input string? The first thing that came to my mind was calculating Levenshtein distance between each entry of the wordlist and the input string. But is that the optimized way to do that?
(Note that, this is not language-specific question)
EDIT: new solution
Yes, calculating Levenshtein distances between your input and the word list can be a reasonable approach, but takes a lot of time. BK-trees can improve this, but they become slow quickly as the Levenshtein distance becomes bigger. It seems we can speed up the Levenshtein distance calculations using a trie, as described in this excellent blog post:
It relies on the fact that the dynamic programming lookup table for Levenshtein distance has common rows in different invocations i.e. levenshtein(kate,cat) and levenshtein(kate,cats).
Running the Python program given on that page with the TWL06 dictionary gives:
> python dict_lev.py HACKING 1 Read 178691 words into 395185 nodes ('BACKING', 1) ('HACKING', 0) ('HACKLING', 1) ('HANKING', 1) ('HARKING', 1) ('HAWKING', 1) ('HOCKING', 1) ('JACKING', 1) ('LACKING', 1) ('PACKING', 1) ('SACKING', 1) ('SHACKING', 1) ('RACKING', 1) ('TACKING', 1) ('THACKING', 1) ('WHACKING', 1) ('YACKING', 1) Search took 0.0189998 s
That's really fast, and would be even faster in other languages. Most of the time is spent in building the trie, which is irrelevant as it needs to be done just once and stored in memory.
The only minor downside to this is that tries take up a lot of memory (which can be reduced with a DAWG, at the cost of some speed).
Another approach: Peter Norvig has an great article (with complete source code) on spelling correction.
The idea is to build possible edits of the words, and then choose the most likely spelling correction of that word.
I think that something better than this exists, but BK trees are a good optimization from brute force at least.
It uses the property of Levenshtein distance being a metric space, and hence if you get a Levenshtein distance of d between your query and an arbitrary string s from the dict, then all your results must be at a distance from (d+n) to (d-n) to s. Here n is the maximum Levenshtein distance from the query you want to output.
It's explained in detail here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
If you are interested in the very code itself I implemented an algorithm for finding optimal alignment between two strings. It basically shows how to transform one string into the other with k operations (where k is Levenstein/Edit Distance of the strings). It could be bit simplified for your needs (as you only need the distance itself). By the way it works in O(mn) where m and n are lengths of the strings. My implementation is based on: this and this.
#optimalization: using int instead of strings: #1 ~ "left", "insertion" #7 ~ "up", "deletion" #17 ~ "up-left", "match/mismatch" def GetLevenshteinDistanceWithBacktracking(sequence1,sequence2): distances = [[0 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)] backtracking = [[1 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)] for i in range(1, len(sequence1)+1): distances[i]=i for i in range(1, len(sequence2)+1): distances[i]=i for j in range(1, len(sequence2)+1): for i in range(1, len(sequence1)+1): if sequence1[i-1] == sequence2[j-1]: distances[i][j]=distances[i-1][j-1] backtracking[i][j] = 17 else: deletion = distances[i-1][j]+1 substitution = distances[i-1][j-1]+1 insertion = distances[i][j-1] + 1 distances[i][j]=min( deletion, substitution, insertion) if distances[i][j] == deletion: backtracking[i][j] = 7 elif distances[i][j] == insertion: backtracking[i][j] = 1 else: backtracking[i][j] = 17 return (distances[len(sequence1)][len(sequence2)], backtracking) def Alignment(sequence1, sequence2): cost, backtracking = GetLevenshteinDistanceWithBacktracking(sequence1, sequence2) alignment1 = alignment2 = "" i = len(sequence1) j = len(sequence2) #from backtracking-matrix we get optimal-alignment while(i > 0 or j > 0): if i > 0 and j > 0 and backtracking[i][j] == 17: alignment1 = sequence1[i-1] + alignment1 alignment2 = sequence2[j-1] + alignment2 i -= 1 j -= 1 elif i > 0 and backtracking[i][j] == 7: alignment1 = sequence1[i-1] + alignment1 alignment2 = "-" + alignment2 i -= 1 elif j > 0 and backtracking[i][j]==1: alignment2 = sequence2[j-1] + alignment2 alignment1 = "-" + alignment1 j -= 1 elif i > 0: alignment1 = sequence1[i-1] + alignment1 alignment2 = "-" + alignment2 i -= 1 return (cost, (alignment1, alignment2))
It depends on broader context and how accurate you want to be. But what I would (probably) start with:
- Only consider the subset which starts with the same character as the query word. It would decrease the amount of work by ~20 for a single query.
- I would categorized words according to their lengths and for each category would allow the maximal distance to be a different number. In case of 4 categories e.g.: 0 -- if length is between 0 and 2; 1 -- if length is between 3 and 5; 2 -- if length is between 6 and 8; 3 -- if length is 9+. Then based on the query length you could just check the words from given category. Moreover it should not be hard to implement the algorithm to stop when the max. distance has been exceeded.
- If needed would start to think about implementing some machine learning approach.