Levenshtein distance - use on web site search

Levenshtein distance is the minimum number of character deletion (D), insertion (I) or substitution(S) operations to transform a string to another.

I used this algorithm to give alternatives to an user when he gets no results after he submitted search by a specific keyword. Often happens when the keyword is misspelled or even there are no matches.

If an e-shop sells very expensive bracelets and has one called "Diamond Tennis Bracelet", it would be nice when the customer searched for "tenis" and he found nothing to give him a close suggestion like "tennis" and no random ones.

Here follows the Levenshtein algorithm implementation in C# in a O(mn) space

 public static int distanceMN(string source, string destination, bool ignoreCase, int threshold) {
if (ignoreCase)
{
source = source.ToLower();
destination = destination.ToLower();
}
if (source == destination)
return 0;
var m = destination.Length;
var n = source.Length;

if (System.Math.Abs(m - n) > threshold)
{
return Int32.MaxValue;
}

var d = new int[m + 1, n + 1];
for (int i = 0; i <= m; i++)
d[i, 0] = i;
for (int j = 0; j <= n; j++)
d[0, j] = j;

for (int i = 1; i <= m; i++)
{

for (int j = 1; j <= n; j++)
{
var cost = destination[i - 1] == source[j - 1] ? 0 : 1;
var insertion = d[i, j - 1] + 1;
var substitution = d[i - 1, j - 1] + cost;
var deletion = d[i - 1, j] + 1;
d[i, j] = minimum(deletion, insertion, substitution);
}
}
return d[m, n];
}


Threshold parameter is used when want to have an upper bound limit for the distance. Is useful when the difference between words will be high for sure.

Based on wikipedia, I rewrote the algorithm for an O(m) space, by using a matrix with 2 lines.


public static int distance2M(string source, string destination, bool ignoreCase, int threshold)
{
if (ignoreCase)
{
source = source.ToLower();
destination = destination.ToLower();
}
if (source == destination)
return 0;
var m = destination.Length;
var n = source.Length;

if (System.Math.Abs(m - n) > threshold) {
return Int32.MaxValue;
}

var d = new int[2, n + 1];

d[0, 0] = 0;
d[1, 0] = 1;

for (int j = 0; j <= n; j++)
d[0, j] = j;

for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
var cost = destination[i-1] == source[j-1] ? 0 : 1;
var insertion = d[1, j - 1] + 1;
var substitution = d[0, j - 1] + cost;
var deletion = d[0, j] + 1;
d[1, j] = minimum(deletion, insertion, substitution);
}
// move the new values on the first line
d[0, 0] = i;
d[0, 1] = i+1;
for (int j = 1; j <= n; j++) {
d[0, j] = d[1,j];
}
}

return d[1, n];
}

Can try this on the website urbanindustry.co.uk by searching "addidas" or anything weird.

Comments

  1. Addias doesn't return any results :)

    ReplyDelete
  2. Addias doesn't return any results :)

    ReplyDelete
    Replies
    1. Asa-i, pentru ca a suferit niste schimbari functia aia de search si am scos partea asta. Dar inca mai folosesc Levenshtein la unele chestii pe acolo
      Incearca URL-ul
      http://www.urbanindustry.co.uk/brands/addidas-originals.asp

      Tot cu Levenshtein e tratat.

      Delete

Post a Comment

Popular posts from this blog

IIS 7.5, HTTPS Bindings and ERR_CONNECTION_RESET

Table Per Hierarchy Inheritance with Column Discriminator and Associations used in Derived Entity Types

About the "The Controls collection cannot be modified because the control contains code blocks" symptom