Organizational Research By

Surprising Reserch Topic

levenshtein mysql php


levenshtein mysql php  using -'php,mysql,levenshtein-distance'

$word = strtolower($_GET['term']);

$lev = 0;

$q = mysql_query("SELECT `term` FROM `words`");
while($r = mysql_fetch_assoc($q))
{
    $r['term'] = strtolower($r['term']);

    $lev = levenshtein($word, $r['term']);

    if($lev >= 0 && $lev < 5)
    {
        $word = $r['term'];
    }
}


How can I move all that into just one query? Don't want to have to query through all terms and do the filtering in PHP.
    

asked Sep 15, 2015 by JFNLydakellz
0 votes
7 views



Related Hot Questions

8 Answers

0 votes

You need a levenshtein function in MySQL and query like

$word = mysql_real_escape_string($word);
mysql_query("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND 4");

Edit: Just in case the link above stops working some day, here's a copy of the MySQL levenshtein() implementation by Jason Rust:

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN 
            SET cost = 0; ELSE SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
            IF c > c_temp THEN 
              SET c = c_temp; 
            END IF;
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
  END;

And a helper function:

CREATE FUNCTION levenshtein_ratio( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
    IF s1_len > s2_len THEN 
      SET max_len = s1_len; 
    ELSE 
      SET max_len = s2_len; 
    END IF;
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
  END; 
answered Sep 15, 2015 by JorT73
0 votes

There are two ways to implement a Levenshtein function in MySQL. The first is to create a STORED FUNCTION which operates much like a STORED TRANSACTION, except it has distinct inputs and an output. This is fine for small datasets, but a little slow on anything approaching several thousand rows. You can find more info here: http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/

The second method is to implement a User Defined Function in C/C++ and link it into MySQL as a shared library (*.so file). This method also uses a STORED FUNCTION to call the library, which means the actual query for this or the first method may be identical (providing the inputs to both functions are the same). You can find out more about this method here: http://samjlevy.com/mysql-levenshtein-and-damerau-levenshtein-udfs/

With either of these methods, your query would be something like:

SELECT term FROM words WHERE levenshtein(term, 'term') < 5;

Also, remember that the 'threshold' value should change in relation to the original word length. It's better to think of it in terms of a percentage value, i.e. half your word = 50%, half of 'term' = 2.

answered Sep 15, 2015 by JanMuncy
0 votes

If you have a huge database, you can filter the words first using SOUNDEX:

$word = strtolower($_GET['term']);

$q = mysql_query("SELECT LOWER(`term`) FROM `words` WHERE SOUNDEX(term) = SOUNDEX(" . $word . ")");

while($r = mysql_fetch_assoc($q)) { 

    $lev = levenshtein($word, $r['term']);

    ....

}
answered Sep 15, 2015 by ScottyStower
0 votes

I had a cryptic error message with @rik 's solution, because I didn't know the concept of delimiters in MySQL:

    #1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near

This is a completely working example of creating the levenstein function, which also set the right delimiter beforehand:

Delimiter $$

CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
  RETURNS INT
  DETERMINISTIC
  BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- max strlen=255
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN 
            SET cost = 0; ELSE SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN SET c = c_temp; END IF;
            SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
            IF c > c_temp THEN 
              SET c = c_temp; 
            END IF;
            SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
END$$

DELIMITER ;
answered Sep 15, 2015 by RoseComoqsz
0 votes

If you are dealing with very large data sets I have found that it is much more efficient to handle the Levenshtein operations and sorting in PHP than it is in MySQL. e.g. query of about 1000 records:

MySQL( ~ 0.0050s) -> PHP Levenshtein( ~ 1.300s)

vs.

MySQL Levenshtein( >= 5.000s) -> PHP( ~ 0.250s)

There are also many other options for optimizing search engines but if you want to use Levenshtein just be aware of the data you'll be handling and the latencies you want.

answered Sep 15, 2015 by AdeJiminez
0 votes

You can make this code look a bit neater but @profitphp is right, you can't doing it in MySQL without a levenstein library.

$word = strtolower($_GET['term']);

$q = mysql_uqery("SELECT LOWER(`term`) FROM `words`");

while($r = mysql_fetch_assoc($q)) { 

    $lev = levenshtein($word, $r['term']);

    ....

}
answered Sep 15, 2015 by KirbyBryson
0 votes

I suggest you including the call of the levenshtein(link: http://www.artfulsoftware.com/infotree/queries.php#552) into your query.

You should use mysqli_query($q) because mysql_query($q) is deprecated and may be removed in future php versions!

$word = mysql_real_escape_string($word);
$query = "SELECT `term` FROM `words` WHERE levenshtein('$word', `term`)   BETWEEN 0 AND 4";
mysqli_qery($query);
answered Sep 15, 2015 by MaryelleOqui
0 votes

That is one query. If you're asking if you can move the levenshtein functionality to mysql, you can't.

Ok, well you can, but its not any easier than just doing it in php.

http://www.artfulsoftware.com/infotree/queries.php?&bw=1280#552

answered Sep 15, 2015 by MaurineZSDe

...