### Levenshtein distance - use on web site search

Levenshtein distance is the minimum number of character deletion (D), insertion (I) or substitution(S) operations to transform a string to another.

I used this algorithm to give alternatives to an user when he gets no results after he submitted search by a specific keyword. Often happens when the keyword is misspelled or even there are no matches.

If an e-shop sells very expensive bracelets and has one called "Diamond Tennis Bracelet", it would be nice when the customer searched for "tenis" and he found nothing to give him a close suggestion like "tennis" and no random ones.

Here follows the Levenshtein algorithm implementation in C# in a O(mn) space

Based on wikipedia, I rewrote the algorithm for an O(m) space, by using a matrix with 2 lines.

Can try this on the website urbanindustry.co.uk by searching "addidas" or anything weird.

I used this algorithm to give alternatives to an user when he gets no results after he submitted search by a specific keyword. Often happens when the keyword is misspelled or even there are no matches.

If an e-shop sells very expensive bracelets and has one called "Diamond Tennis Bracelet", it would be nice when the customer searched for "tenis" and he found nothing to give him a close suggestion like "tennis" and no random ones.

Here follows the Levenshtein algorithm implementation in C# in a O(mn) space

public static int distanceMN(string source, string destination, bool ignoreCase, int threshold) {

if (ignoreCase)

{

source = source.ToLower();

destination = destination.ToLower();

}

if (source == destination)

return 0;

var m = destination.Length;

var n = source.Length;

if (System.Math.Abs(m - n) > threshold)

{

return Int32.MaxValue;

}

var d = new int[m + 1, n + 1];

for (int i = 0; i <= m; i++)

d[i, 0] = i;

for (int j = 0; j <= n; j++)

d[0, j] = j;

for (int i = 1; i <= m; i++)

{

for (int j = 1; j <= n; j++)

{

var cost = destination[i - 1] == source[j - 1] ? 0 : 1;

var insertion = d[i, j - 1] + 1;

var substitution = d[i - 1, j - 1] + cost;

var deletion = d[i - 1, j] + 1;

d[i, j] = minimum(deletion, insertion, substitution);

}

}

return d[m, n];

}

**Threshold**parameter is used when want to have an upper bound limit for the distance. Is useful when the difference between words will be high for sure.Based on wikipedia, I rewrote the algorithm for an O(m) space, by using a matrix with 2 lines.

public static int distance2M(string source, string destination, bool ignoreCase, int threshold)

{

if (ignoreCase)

{

source = source.ToLower();

destination = destination.ToLower();

}

if (source == destination)

return 0;

var m = destination.Length;

var n = source.Length;

if (System.Math.Abs(m - n) > threshold) {

return Int32.MaxValue;

}

var d = new int[2, n + 1];

d[0, 0] = 0;

d[1, 0] = 1;

for (int j = 0; j <= n; j++)

d[0, j] = j;

for (int i = 1; i <= m; i++) {

for (int j = 1; j <= n; j++) {

var cost = destination[i-1] == source[j-1] ? 0 : 1;

var insertion = d[1, j - 1] + 1;

var substitution = d[0, j - 1] + cost;

var deletion = d[0, j] + 1;

d[1, j] = minimum(deletion, insertion, substitution);

}

// move the new values on the first line

d[0, 0] = i;

d[0, 1] = i+1;

for (int j = 1; j <= n; j++) {

d[0, j] = d[1,j];

}

}

return d[1, n];

}

Can try this on the website urbanindustry.co.uk by searching "addidas" or anything weird.

Addias doesn't return any results :)

ReplyDeleteAddias doesn't return any results :)

ReplyDeleteAsa-i, pentru ca a suferit niste schimbari functia aia de search si am scos partea asta. Dar inca mai folosesc Levenshtein la unele chestii pe acolo

DeleteIncearca URL-ul

http://www.urbanindustry.co.uk/brands/addidas-originals.asp

Tot cu Levenshtein e tratat.