Random geekery since 2005.

Husband, father, programmer, speaker, writer, blogger, podcaster, collector, traveler, golfer.

Obscure Knowledge: The Levenshtein Distance

Recently while trying to complete my goal of reading the entire Interwebs, I stumbled upon something I thought the rest of you might be intrigued by: The Levenshtein Distance.

According to Wikipedia, it is “a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is given by the minimum number of operations needed to transform one string into the other.”

An operation, in this case, is an insertion, deletion, or substitution of a single character.

It’s named after Vladimir Iosifovich Levenshtein, the Russian scientist who developed it in 1965.

Jeff, why should I care?

You should care because this is generally how spell-checkers recommend words to you. Have you ever wondered how they figure out what you meant?

So how’s it work?

Let’s look at an example to see how this measurement works. Let’s figure out the distance from the word “developer” to the word “development.”

1) developer –> developen (substituted an ‘n’ for the ‘r’)
2) developen –> developmen (inserted the ‘m’)
3) developmen –> development (inserted the ‘t’)

So, the Levenshtein distance for those two words is: 3

Are there any other special rules?

Why yes, there are. Here are a few things to keep in mind about this measurement:

1) It can’t be less than the difference between the lengths of the two strings.
2) It can’t be greater than the length of the longer string.
3) It’s zero if, and only if, the strings are identical.

So what does this look like in code?

I’ve put together a quick little C# console application to demonstrate this logic. You can download it here. Just change out the two strings I used above for anything.
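If you’d rather see the logic inline, here’s a minimal sketch of the standard dynamic-programming approach (often called the Wagner–Fischer algorithm). This is my reconstruction of what such a console app would contain, not necessarily the actual download:

```csharp
using System;

class LevenshteinDemo
{
    // Fills an (a.Length + 1) x (b.Length + 1) table where d[i, j] is the
    // distance between the first i chars of a and the first j chars of b.
    static int Levenshtein(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // i deletions
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // j insertions

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1; // substitution cost
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // substitution (or match)
            }
        }
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        // The example from above: three operations apart.
        Console.WriteLine(Levenshtein("developer", "development")); // prints 3
    }
}
```

Each cell only depends on its three neighbors (left, above, and upper-left), which is why the answer falls out at the bottom-right corner of the table.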

So there you go!

Now you have a simple way to find words that are close matches to a user-entered string. Just zip through your term dictionary, picking out the words with the smallest distances. You’ll probably find their match. Certainly there are many applications for this other than spell checking. Think about recommending customer names as a user tries to search for one. Or account numbers.
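That “zip through your term dictionary” idea might look something like this. It’s a toy sketch with a hard-coded word list (a real spell-checker would use its full dictionary), using a compact two-row version of the same distance calculation:

```csharp
using System;
using System.Linq;

class SuggestDemo
{
    // Same algorithm as before, but keeping only two rows of the table
    // since each row depends only on the previous one.
    static int Distance(string a, string b)
    {
        int[] prev = Enumerable.Range(0, b.Length + 1).ToArray();
        int[] curr = new int[b.Length + 1];

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1, curr[j - 1] + 1),
                    prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // roll the rows forward
        }
        return prev[b.Length];
    }

    static void Main()
    {
        // Toy "term dictionary" -- swap in your own word list.
        string[] dictionary = { "developer", "development", "envelope", "deliver" };
        string input = "develper"; // the user's misspelling

        // Pick the dictionary word with the smallest distance to the input.
        string best = dictionary.OrderBy(w => Distance(input, w)).First();
        Console.WriteLine(best); // prints "developer" (distance 1)
    }
}
```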

Let me know how YOU would use this!



6 responses to “Obscure Knowledge: The Levenshtein Distance”

  1. Darrell Hawley

    I stumbled across this a while ago and had a short-lived obsession with it. This algorithm is WAY cool and I keep looking for a spot to use it but just haven’t been able to find the right situation. Great post.

  2. Anonymous

    Soundex is not appropriate for matching character-based data. It is only useful for matching sounds. The Levenshtein Distance is more appropriate. I’ve had a lot of success with an n-gram method I wrote 10 years ago: http://sqlblindman.googlepages.com/fuzzysearchalgorithm – sqlblindman

  3. David

    I wrote an app several years ago that uses n-grams and computes the Euclidean distance to determine if database records were duplicates. I only had strings to compare – no concrete information like an ID, address, etc. that would help. It worked quite well, and it’s fairly simple to implement. I used tri-grams. Assuming you already know how to tokenize strings into tri-grams, the algorithm is simple to implement. The numbers I used that worked the best for my data set were from Jeremy Hylton’s thesis: http://www.python.org/~jeremy/pubs/thesis/MIT-LCS-TR-678.ps.gz Cheers, Dave

  4. Brandon Joyce

    Jeff, very cool post. I’ve done some work in the past with the fuzzy logic built into SQL Integration Services. It’s amazing. Also, you should check out this open source project: SimMetrics. It’s a class library with all sorts of fuzzy matching algorithms. I really like the Jaro-Winkler one. There are many token-based functions in this library as well. I believe SSIS uses the token-based stuff, which seems to work much better for me. I think the token-based ones pick up on first/last name transpositions much better than the Levenshtein or Jaro-Winkler distance will. Seeya!

  5. Bruce

    Jeff, we recently used a variant of the Levenshtein distance for strings and a Soundex implementation to help prevent the duplicate entry of patient information in a medical system. It is kind of a “fuzzy” match that can be tuned to be quite helpful.

  6. Shawn Domer

    Great post Jeff! I always wondered how a spell checker determined what words to recommend.
