For a recent project I needed to dig into “fuzzy matching” to try and find duplicate names and addresses in a MySQL table. I decided using the Levenshtein Distance would be very useful for what I needed. I thought I was home free when I came across a Levenshtein U.D.F. (User Defined Function). However, my excitement was short lived when I discovered my MySQL database was going to be on a Windows server and that compiling U.D.F’s on Windows is quite the process. So, I decided to implement the Levenshtein algorithm using just SQL as a stored function. Obviously, it won’t be near so fast as the U.D.F. but it’s sure a lot easier to install on multiple Windows machines.
There are a few Levenshtein implementations for SQL out there, but I decided to use one I found written for SQL server. I chose it mostly because it uses a binary variable for creating the matrix instead of a temporary table (which, I think, would be quite a bit slower). Anyhow, I fixed a few bugs it had and spent too many hours figuring out how to convert it, but it’s done and it works (at least for all the tests I’ve thrown at it). The code for those interested:
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
To note:
- Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.
- I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
- I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.
And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
If anyone finds a bug or way to improve the functions I’d be happy to know!
It’s nice of you to make this code available.
I think it might be more nice if CHAR_LENGTH were used in place of LENGTH (for multibyte charsets support).
Thanks for that fix, orangeful. I tested it out (with english characters anyhow — not sure how to test with multibyte characters) and it worked fine, so I updated the code.
It pasted the code into the editor, but it did not work. it gave errors on
WHILE j c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
lines.
Thanks nagmat, it looks like some spaces got inserted by wordpress when I posted. I’ve updated the code and made sure it works when pasted into MySQL.
Hi,
Thought folks interested in this page might also be interested in two alternative compiled implementations (included a precompiled version for Windows).
Source version:
http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz
I successfully compiled and installed this on CentOS 4/MySQL 5. The README has compilation notes. I used the BSD instructions, changing the paths from /usr/local/* to /usr/*. The .so library file needs to be installed in /usr/lib for MySQL to find it.
You could possibly compile this on Windows too, but why, since there’s a precompiled binary available from another kind soul.
Windows DLL:
http://www.sherlocksoftware.org/page.php?id=58
After copying the dll to your MySql/bin folder, create with: CREATE FUNCTION Levenshtein RETURNS INT SONAME ‘LevenshteinUDF.dll’
I’ve tested both versions and they seem to work correctly, and are hundreds (if not thousands) of times faster than the version implemented in pure SQL.
Thanks for the stored function… I’m trying to figure out how you are manipulating the binary into a matrix though…
I want to incorporate 1 change… basically implementing: Damerau-Levenshtein distance… so transpositions only cost 1 instead of 2.
Eg:
Levenshtein: DOG vs DGO = 2
Damerau-Levenshtein: DOG vs DGO = 1
The algorithm is almost identical except for the check with the previous char to check if it was transposed… (i’d do it if i could figure out the binary part.
Thanks!
Forgot the link to the algorithm..
http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance#The_algorithm
Hi Eric,
The way these VARBINARY variables act as matrixes is as follows:
1. Each binary variable (i.e. cv0) is length 256, which is the same as 256 bytes.
2. Because we will never be storing a number bigger than 255 (length requirement of s1 and s2) we can store up to 256 numbers in each of our binary variables. It’s easier to visualize in hex. Take, for example a binary variable of length 2. Empty it would look like: 0×0000. If we stored the number 255 in position 1 and the number 1 in position 2 it would look like: 0xff01. So, in essence we can treat these binary variables as an array of numbers, and we can access any element in the array using the SUBSTRING() function.
3. In your link to the modified algorithm, there is a variable d which is a two dimensional table. That is what cv0 and cv1 act as in the function I posted. cv0 is the horizontal part of the table and cv1 is the vertical part of the table.
Hopefully that makes sense. I haven’t looked close enough to know how exactly you would implement the Damerau version, but it doesn’t look too difficult.
I was stumped getting this to work until I realised you needed to add a “DELIMITER //” line to the top of the code and a corresponding “//” at the bottom when loading this code into MySQL in batch mode.
eg:
DELIMITER //
CREATE FUNCTION[...]
[...code...]
END
//
To compile and run mysql_levenshtein_udf suggested by David Weingart, I had to do following steps
g++ -Wall -I/usr/local/mysql/include -dynamiclib -o mysqllevenshtein.dylib -lstdc++ -lc -I`/usr/local/mysql/bin/mysql_config –cflags` mysqllevenshtein.cc
sudo cp mysqllevenshtein.dylib /usr/lib/mysqllevenshtein.dylib
DROP FUNCTION IF EXISTS levenshtein;
CREATE FUNCTION levenshtein RETURNS INT SONAME ‘mysqllevenshtein.dylib’;
Cheers Tobi
Forgot to say: works on Mac OS X 10.5
I’ve just published a few Damerau-Levenshtein versions of Josh Drew’s MySQL Levenshtein User Defined Function on my blog at:
http://blog.lolyco.com/sean/2008/08/27/damerau-levenshtein-algorithm-levenshtein-with-transpositions/
They’re Damerau-Levenshtein, so produce lower results for comparisons with transpositions in. They haven’t had the most intensive of tests, but seem to work reliably on one of my projects at the moment. There’s a vanilla version and two with successive performance hacks.
只需10元,就能让你的广告遍布各个网站!
支持图片/超链发送!支持分类/地域网站发送!
10元,就可能换来意想不到的效果,为什么不尝试下呢!
QQ:224549200(注明“广告”)
[...] Gehts dir also jetzt nurnoch um die Performance oder klappt das ganze Skript nicht? CSV in die MySQL druecken: MySQL :: MySQL 5.0 Reference Manual :: 12.2.6 LOAD DATA INFILE Syntax Levenshtein als STORED PROCEDURE: codejanitor Levenshtein Distance as a MySQL Stored Function [...]
Now everyone is talking about the American economy and eclections, nice to read something different. Eugene
I keep getting sintax errors! I really need this function!
PLease anyone? i need this mysql function working and I keep getting sintax errors, i am using WAMP (mysql5.0.51b)
please i need help
I keep getting this error
Error
SQL query:
CREATE FUNCTION LEVENSHTEIN(
s1 VARCHAR( 255 ) ,
s2 VARCHAR( 255 )
) RETURNS INTDETERMINISTIC BEGIN DECLARE s1_len,
s2_len,
i,
j,
c,
c_temp,
cost INT;
MySQL said: Documentation
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ” at line 5
Are you changing the delimiter before trying to create the function? You’ll need to do that so that the semicolons in the function don’t mess up creating the function. See here: http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
I get serious problems when trying to compare strings containing numbers only. MySQL server seems to crash!
query: …LEVENSHTEIN(`POSTALCODE`,’14482′) <= 2…
result: #2006 - MySQL server has gone away
What could be the problem?
Thanks for the code; it works fine, no problems so far.
But I wish to use it on a large amount of entries (~30000+), which takes quite some time. I wonder if there’s any possibility to limit the maximum difference to a constant value, e.g. 10, so differences greater than that don’t need to be calculated to reduce computing time.
Would that be possible?
Thanks for the code if works well but I would like to use it with the Damerau option included. Does anyone have a stored procedure available with the Damerau option included?
@Tobi: thanks for the compile tips. works nice.
but beware that mysql_config argument is minus minus cflags, this blog thing seems to modify the minus minus into some other kind of dash.
Good work. I add this part to check those null values. Thanks
DECLARE cv0, cv1 VARBINARY(256);
SET cv1 = 0×00, j = 1, i = 1, c = 0;
IF s1 IS NULL THEN SET s1_len =0;
ELSE SET s1_len = CHAR_LENGTH(s1);
END IF;
IF s2 IS NULL THEN SET s2_len =0;
ELSE SET s2_len = CHAR_LENGTH(s2);
END IF;
I have a zip code database for the US. When I do a query like this it took almost 30 seconds:
SELECT `zipcode` , `city` , `statecode` , (
( 5 * levenshtein(
‘85326′, LOWER( `zipcode` ) )
)
) AS ld
FROM `zipcode`
HAVING `ld` <50
ORDER BY `ld` , `city`
LIMIT 50
The same query after compiling and installing the UDF from http://joshdrew.com/ took .1 seconds. I don’t know if it’s related to the all number issue mentioned above or if it’s because it’s an 8.7M table with 70k+ rows, but it’s definitely too slow for me to use.
@Aaron ‘took 1 seconds’. That’s not bad for 70,000 rows. You should avoid doing table scans if you’re planning to do this sort of thing online. Your SQL is a bit odd - you seem to be allowing for an edit distance of 10 in a 5 character string. What I do with one of my projects is search the table on string length first, to limit the number of times the levenshtein function is invoked. I use a version of Levenshtein that returns early if an edit distance limit is met, so it doesn’t check whole strings when they’re already too distant from each other. Also the data is already LOWER()ed in the database, that would save you 70,000 LOWER()s in your application, each time you check a zip code.
I’d be interested to know if the hacks get you an appreciable speedup - the damlev UDFs (and some Java) versions are on my blog - linked above.
[...] MySQL implementiert soweit ich weiß. Daher könnte vielleicht das hier für dich interessant sein: codejanitor Levenshtein Distance as a MySQL Stored Function Die Datenbank normalisiert könnte nun so aussehen: author id | name book id | title publisher [...]
Good contribution!
Thanks!
This is my version:
[code]
CREATE PROC EDIT_DISTANCE
@string1 nvarchar(3999),
@string2 nvarchar(3999),
@edit_distance int output
AS
BEGIN
IF @string1 = @string2
BEGIN
SET @edit_distance = 0
RETURN
END
DECLARE @string1_len int,
@string2_len int,
@i int,
@j int,
@string1_char nchar,
@eval int,
@next_set varbinary(8000),
@current_set varbinary(8000)
SELECT @string1_len = LEN(@string1),
@string2_len = LEN(@string2),
@current_set = cast(0 as binary(2)),
@j = 1,
@i = 1,
@edit_distance = 0
IF @string1_len = 0
BEGIN
SET @edit_distance = @string2_len
RETURN
END
IF @string2_len = 0
BEGIN
SET @edit_distance = @string1_len
RETURN
END
IF Len(@string1) < Len(@string2) AND CharIndex(@string1, @string2) 0
BEGIN
SET @edit_distance = Len(@string2) - Len(@string1)
RETURN
END
IF Len(@string2) < Len(@string1) AND CharIndex(@string2, @string1) 0
BEGIN
SET @edit_distance = Len(@string1) - Len(@string2)
RETURN
END
WHILE @j <= @string2_len
SELECT @current_set = @current_set + CAST(@j AS binary(2)),
@j = @j + 1
WHILE @i <= @string1_len
BEGIN
SELECT @string1_char = SUBSTRING(@string1, @i, 1),
@edit_distance = @i,
@next_set = CAST(@i AS binary(2)),
@j = 1
WHILE @j @eval
SET @edit_distance = @eval
SET @eval = CAST(SUBSTRING(@current_set, 2 * @j + 1, 2) AS int)+1
IF @edit_distance > @eval
SET @edit_distance = @eval
SELECT @next_set = @next_set + CAST(@edit_distance AS binary(2)), @j = @j + 1
END
SELECT @current_set = @next_set, @i = @i + 1
END
END
GO
[/code]
looks like minor_equal operators get messed while posting. changed them to !>
–here is the code
CREATE PROC EDIT_DISTANCE
@string1 nvarchar(3999),
@string2 nvarchar(3999),
@edit_distance int output
AS
BEGIN
IF @string1 = @string2
BEGIN
SET @edit_distance = 0
RETURN
END
DECLARE @string1_len int,
@string2_len int,
@i int,
@j int,
@string1_char nchar,
@eval int,
@next_set varbinary(8000),
@current_set varbinary(8000)
SELECT @string1_len = LEN(@string1),
@string2_len = LEN(@string2),
@current_set = cast(0 as binary(2)),
@j = 1,
@i = 1,
@edit_distance = 0
IF @string1_len = 0
BEGIN
SET @edit_distance = @string2_len
RETURN
END
IF @string2_len = 0
BEGIN
SET @edit_distance = @string1_len
RETURN
END
IF Len(@string1) < Len(@string2) AND CharIndex(@string1, @string2) 0
BEGIN
SET @edit_distance = Len(@string2) - Len(@string1)
RETURN
END
IF Len(@string2) < Len(@string1) AND CharIndex(@string2, @string1) 0
BEGIN
SET @edit_distance = Len(@string1) - Len(@string2)
RETURN
END
WHILE @j !> @string2_len
SELECT @current_set = @current_set + CAST(@j AS binary(2)),
@j = @j + 1
WHILE @i !> @string1_len
BEGIN
SELECT @string1_char = SUBSTRING(@string1, @i, 1),
@edit_distance = @i,
@next_set = CAST(@i AS binary(2)),
@j = 1
WHILE @j !> @string2_len
BEGIN
SET @edit_distance = @edit_distance + 1
SET @eval = CAST(SUBSTRING(@current_set, 2 * @j - 1, 2) AS int) +
CASE
WHEN @string1_char = SUBSTRING(@string2, @j, 1)
THEN 0
ELSE
1
END
IF @edit_distance > @eval
SET @edit_distance = @eval
SET @eval = CAST(SUBSTRING(@current_set, 2 * @j + 1, 2) AS int)+1
IF @edit_distance > @eval
SET @edit_distance = @eval
SELECT @next_set = @next_set + CAST(@edit_distance AS binary(2)), @j = @j + 1
END
SELECT @current_set = @next_set, @i = @i + 1
END
END
GO
mmm, again, the different operator doesn’t work. changed it to !=
final code:
CREATE PROC EDIT_DISTANCE
@string1 nvarchar(3999),
@string2 nvarchar(3999),
@edit_distance int output
AS
BEGIN
IF @string1 = @string2
BEGIN
SET @edit_distance = 0
RETURN
END
DECLARE @string1_len int,
@string2_len int,
@i int,
@j int,
@string1_char nchar,
@eval int,
@next_set varbinary(8000),
@current_set varbinary(8000)
SELECT @string1_len = LEN(@string1),
@string2_len = LEN(@string2),
@current_set = cast(0 as binary(2)),
@j = 1,
@i = 1,
@edit_distance = 0
IF @string1_len = 0
BEGIN
SET @edit_distance = @string2_len
RETURN
END
IF @string2_len = 0
BEGIN
SET @edit_distance = @string1_len
RETURN
END
IF Len(@string1) < Len(@string2) AND CharIndex(@string1, @string2) != 0
BEGIN
SET @edit_distance = Len(@string2) - Len(@string1)
RETURN
END
IF Len(@string2) @string2_len
SELECT @current_set = @current_set + CAST(@j AS binary(2)),
@j = @j + 1
WHILE @i !> @string1_len
BEGIN
SELECT @string1_char = SUBSTRING(@string1, @i, 1),
@edit_distance = @i,
@next_set = CAST(@i AS binary(2)),
@j = 1
WHILE @j !> @string2_len
BEGIN
SET @edit_distance = @edit_distance + 1
SET @eval = CAST(SUBSTRING(@current_set, 2 * @j - 1, 2) AS int) +
CASE
WHEN @string1_char = SUBSTRING(@string2, @j, 1)
THEN 0
ELSE
1
END
IF @edit_distance > @eval
SET @edit_distance = @eval
SET @eval = CAST(SUBSTRING(@current_set, 2 * @j + 1, 2) AS int)+1
IF @edit_distance > @eval
SET @edit_distance = @eval
SELECT @next_set = @next_set + CAST(@edit_distance AS binary(2)), @j = @j + 1
END
SELECT @current_set = @next_set, @i = @i + 1
END
END
GO
mmmm i give up. is there any way i can post without getting messed up by this blog’s html parser?