codejanitor

Levenshtein Distance as a MySQL Stored Function

Feb. 10th, 2007

For a recent project I needed to dig into “fuzzy matching” to try and find duplicate names and addresses in a MySQL table. I decided using the Levenshtein Distance would be very useful for what I needed. I thought I was home free when I came across a Levenshtein U.D.F. (User Defined Function). However, my excitement was short lived when I discovered my MySQL database was going to be on a Windows server and that compiling U.D.F’s on Windows is quite the process. So, I decided to implement the Levenshtein algorithm using just SQL as a stored function. Obviously, it won’t be near so fast as the U.D.F. but it’s sure a lot easier to install on multiple Windows machines.

There are a few Levenshtein implementations for SQL out there, but I decided to use one I found written for SQL server. I chose it mostly because it uses a binary variable for creating the matrix instead of a temporary table (which, I think, would be quite a bit slower). Anyhow, I fixed a few bugs it had and spent too many hours figuring out how to convert it, but it’s done and it works (at least for all the tests I’ve thrown at it). The code for those interested:

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:

  • Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.
  • I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.
  • I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.

And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).

CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, max_len INT;
        SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
      END

If anyone finds a bug or way to improve the functions I’d be happy to know!

Filed under: Jason @ 7:09 pm

12 Responses to “Levenshtein Distance as a MySQL Stored Function”

  1. orangeful Says:

    It’s nice of you to make this code available.
    I think it might be more nice if CHAR_LENGTH were used in place of LENGTH (for multibyte charsets support).

  2. Jason Says:

    Thanks for that fix, orangeful. I tested it out (with english characters anyhow — not sure how to test with multibyte characters) and it worked fine, so I updated the code.

  3. nagmat Says:

    It pasted the code into the editor, but it did not work. it gave errors on
    WHILE j c_temp THEN SET c = c_temp; END IF;
    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
    IF c > c_temp THEN SET c = c_temp; END IF;
    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
    END WHILE;

    lines.

  4. Jason Says:

    Thanks nagmat, it looks like some spaces got inserted by wordpress when I posted. I’ve updated the code and made sure it works when pasted into MySQL.

  5. David Weingart Says:

    Hi,

    Thought folks interested in this page might also be interested in two alternative compiled implementations (included a precompiled version for Windows).

    Source version:
    http://joshdrew.com/mysql_levenshtein_udf-1.0.tar.gz

    I successfully compiled and installed this on CentOS 4/MySQL 5. The README has compilation notes. I used the BSD instructions, changing the paths from /usr/local/* to /usr/*. The .so library file needs to be installed in /usr/lib for MySQL to find it.

    You could possibly compile this on Windows too, but why, since there’s a precompiled binary available from another kind soul. :-)

    Windows DLL:
    http://www.sherlocksoftware.org/page.php?id=58

    After copying the dll to your MySql/bin folder, create with: CREATE FUNCTION Levenshtein RETURNS INT SONAME ‘LevenshteinUDF.dll’

    I’ve tested both versions and they seem to work correctly, and are hundreds (if not thousands) of times faster than the version implemented in pure SQL.

  6. Eric Says:

    Thanks for the stored function… I’m trying to figure out how you are manipulating the binary into a matrix though…

    I want to incorporate 1 change… basically implementing: Damerau-Levenshtein distance… so transpositions only cost 1 instead of 2.

    Eg:
    Levenshtein: DOG vs DGO = 2
    Damerau-Levenshtein: DOG vs DGO = 1

    The algorithm is almost identical except for the check with the previous char to check if it was transposed… (i’d do it if i could figure out the binary part. :)

    Thanks!

  7. Eric Says:

    Forgot the link to the algorithm..

    http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance#The_algorithm

  8. Jason Says:

    Hi Eric,
    The way these VARBINARY variables act as matrixes is as follows:
    1. Each binary variable (i.e. cv0) is length 256, which is the same as 256 bytes.
    2. Because we will never be storing a number bigger than 255 (length requirement of s1 and s2) we can store up to 256 numbers in each of our binary variables. It’s easier to visualize in hex. Take, for example a binary variable of length 2. Empty it would look like: 0×0000. If we stored the number 255 in position 1 and the number 1 in position 2 it would look like: 0xff01. So, in essence we can treat these binary variables as an array of numbers, and we can access any element in the array using the SUBSTRING() function.
    3. In your link to the modified algorithm, there is a variable d which is a two dimensional table. That is what cv0 and cv1 act as in the function I posted. cv0 is the horizontal part of the table and cv1 is the vertical part of the table.

    Hopefully that makes sense. I haven’t looked close enough to know how exactly you would implement the Damerau version, but it doesn’t look too difficult.

  9. Gareth Says:

    I was stumped getting this to work until I realised you needed to add a “DELIMITER //” line to the top of the code and a corresponding “//” at the bottom when loading this code into MySQL in batch mode.

    eg:
    DELIMITER //
    CREATE FUNCTION[...]
    [...code...]
    END
    //

  10. Toibi Says:

    To compile and run mysql_levenshtein_udf suggested by David Weingart, I had to do following steps

    g++ -Wall -I/usr/local/mysql/include -dynamiclib -o mysqllevenshtein.dylib -lstdc++ -lc -I`/usr/local/mysql/bin/mysql_config –cflags` mysqllevenshtein.cc

    sudo cp mysqllevenshtein.dylib /usr/lib/mysqllevenshtein.dylib

    DROP FUNCTION IF EXISTS levenshtein;
    CREATE FUNCTION levenshtein RETURNS INT SONAME ‘mysqllevenshtein.dylib’;

    Cheers Tobi

  11. Toibi Says:

    Forgot to say: works on Mac OS X 10.5

  12. Sean Says:

    I’ve just published a few Damerau-Levenshtein versions of Josh Drew’s MySQL Levenshtein User Defined Function on my blog at:
    http://blog.lolyco.com/sean/2008/08/27/damerau-levenshtein-algorithm-levenshtein-with-transpositions/
    They’re Damerau-Levenshtein, so produce lower results for comparisons with transpositions in. They haven’t had the most intensive of tests, but seem to work reliably on one of my projects at the moment. There’s a vanilla version and two with successive performance hacks.

Leave a Reply