For a recent project I needed to dig into “fuzzy matching” to try and find duplicate names and addresses in a MySQL table. I decided using the Levenshtein Distance would be very useful for what I needed. I thought I was home free when I came across a Levenshtein U.D.F. (User Defined Function). However, my excitement was short lived when I discovered my MySQL database was going to be on a Windows server and that compiling U.D.F’s on Windows is quite the process. So, I decided to implement the Levenshtein algorithm using just SQL as a stored function. Obviously, it won’t be near so fast as the U.D.F. but it’s sure a lot easier to install on multiple Windows machines.
There are a few Levenshtein implementations for SQL out there, but I decided to use one I found written for SQL server. I chose it mostly because it uses a binary variable for creating the matrix instead of a temporary table (which, I think, would be quite a bit slower). Anyhow, I fixed a few bugs it had and spent too many hours figuring out how to convert it, but it’s done and it works (at least for all the tests I’ve thrown at it). The code for those interested:
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
To note:
- Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.
- I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
- I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.
And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
If anyone finds a bug or way to improve the functions I’d be happy to know!
It may look quiet around here, but that’s because we’ve been working on a new project utilizing the FastFrame framework. Under the