For a recent project I needed to dig into “fuzzy matching” to try and find duplicate names and addresses in a MySQL table. I decided using the Levenshtein Distance would be very useful for what I needed. I thought I was home free when I came across a Levenshtein U.D.F. (User Defined Function). However, my excitement was short lived when I discovered my MySQL database was going to be on a Windows server and that compiling U.D.F’s on Windows is quite the process. So, I decided to implement the Levenshtein algorithm using just SQL as a stored function. Obviously, it won’t be near so fast as the U.D.F. but it’s sure a lot easier to install on multiple Windows machines.
There are a few Levenshtein implementations for SQL out there, but I decided to use one I found written for SQL server. I chose it mostly because it uses a binary variable for creating the matrix instead of a temporary table (which, I think, would be quite a bit slower). Anyhow, I fixed a few bugs it had and spent too many hours figuring out how to convert it, but it’s done and it works (at least for all the tests I’ve thrown at it). The code for those interested:
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
To note:
- Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.
- I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
- I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.
And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
If anyone finds a bug or way to improve the functions I’d be happy to know!
March 7th, 2007 at 10:39 pm
It’s nice of you to make this code available.
I think it might be more nice if CHAR_LENGTH were used in place of LENGTH (for multibyte charsets support).
March 8th, 2007 at 10:17 am
Thanks for that fix, orangeful. I tested it out (with english characters anyhow — not sure how to test with multibyte characters) and it worked fine, so I updated the code.
March 12th, 2007 at 11:14 pm
It pasted the code into the editor, but it did not work. it gave errors on
WHILE j c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
lines.
March 13th, 2007 at 2:27 pm
Thanks nagmat, it looks like some spaces got inserted by wordpress when I posted. I’ve updated the code and made sure it works when pasted into MySQL.
April 5th, 2007 at 11:39 am
Hi,
Thought folks interested in this page might also be interested in two alternative compiled implementations (included a precompiled version for Windows).
Source version:
http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz
I successfully compiled and installed this on CentOS 4/MySQL 5. The README has compilation notes. I used the BSD instructions, changing the paths from /usr/local/* to /usr/*. The .so library file needs to be installed in /usr/lib for MySQL to find it.
You could possibly compile this on Windows too, but why, since there’s a precompiled binary available from another kind soul.
Windows DLL:
http://www.sherlocksoftware.org/page.php?id=58
After copying the dll to your MySql/bin folder, create with: CREATE FUNCTION Levenshtein RETURNS INT SONAME ‘LevenshteinUDF.dll’
I’ve tested both versions and they seem to work correctly, and are hundreds (if not thousands) of times faster than the version implemented in pure SQL.
May 21st, 2007 at 5:45 pm
Thanks for the stored function… I’m trying to figure out how you are manipulating the binary into a matrix though…
I want to incorporate 1 change… basically implementing: Damerau-Levenshtein distance… so transpositions only cost 1 instead of 2.
Eg:
Levenshtein: DOG vs DGO = 2
Damerau-Levenshtein: DOG vs DGO = 1
The algorithm is almost identical except for the check with the previous char to check if it was transposed… (i’d do it if i could figure out the binary part.
Thanks!
May 21st, 2007 at 5:45 pm
Forgot the link to the algorithm..
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance#The_algorithm
May 22nd, 2007 at 10:30 am
Hi Eric,
The way these VARBINARY variables act as matrixes is as follows:
1. Each binary variable (i.e. cv0) is length 256, which is the same as 256 bytes.
2. Because we will never be storing a number bigger than 255 (length requirement of s1 and s2) we can store up to 256 numbers in each of our binary variables. It’s easier to visualize in hex. Take, for example a binary variable of length 2. Empty it would look like: 0×0000. If we stored the number 255 in position 1 and the number 1 in position 2 it would look like: 0xff01. So, in essence we can treat these binary variables as an array of numbers, and we can access any element in the array using the SUBSTRING() function.
3. In your link to the modified algorithm, there is a variable d which is a two dimensional table. That is what cv0 and cv1 act as in the function I posted. cv0 is the horizontal part of the table and cv1 is the vertical part of the table.
Hopefully that makes sense. I haven’t looked close enough to know how exactly you would implement the Damerau version, but it doesn’t look too difficult.
May 31st, 2008 at 4:24 pm
I was stumped getting this to work until I realised you needed to add a “DELIMITER //” line to the top of the code and a corresponding “//” at the bottom when loading this code into MySQL in batch mode.
eg:
DELIMITER //
CREATE FUNCTION[...]
[...code...]
END
//
August 4th, 2008 at 5:59 am
To compile and run mysql_levenshtein_udf suggested by David Weingart, I had to do following steps
g++ -Wall -I/usr/local/mysql/include -dynamiclib -o mysqllevenshtein.dylib -lstdc++ -lc -I`/usr/local/mysql/bin/mysql_config –cflags` mysqllevenshtein.cc
sudo cp mysqllevenshtein.dylib /usr/lib/mysqllevenshtein.dylib
DROP FUNCTION IF EXISTS levenshtein;
CREATE FUNCTION levenshtein RETURNS INT SONAME ‘mysqllevenshtein.dylib’;
Cheers Tobi
August 4th, 2008 at 6:00 am
Forgot to say: works on Mac OS X 10.5
August 27th, 2008 at 11:35 am
I’ve just published a few Damerau-Levenshtein versions of Josh Drew’s MySQL Levenshtein User Defined Function on my blog at:
http://blog.lolyco.com/sean/2008/08/27/damerau-levenshtein-algorithm-levenshtein-with-transpositions/
They’re Damerau-Levenshtein, so produce lower results for comparisons with transpositions in. They haven’t had the most intensive of tests, but seem to work reliably on one of my projects at the moment. There’s a vanilla version and two with successive performance hacks.