Levenshtein distance - use on web site search
Levenshtein distance is the minimum number of character deletion (D), insertion (I) or substitution(S) operations to transform a string to another.
I used this algorithm to give alternatives to an user when he gets no results after he submitted search by a specific keyword. Often happens when the keyword is misspelled or even there are no matches.
If an e-shop sells very expensive bracelets and has one called "Diamond Tennis Bracelet", it would be nice when the customer searched for "tenis" and he found nothing to give him a close suggestion like "tennis" and no random ones.
Here follows the Levenshtein algorithm implementation in C# in a O(mn) space
Threshold parameter is used when want to have an upper bound limit for the distance. Is useful when the difference between words will be high for sure.
Based on wikipedia, I rewrote the algorithm for an O(m) space, by using a matrix with 2 lines.
Can try this on the website urbanindustry.co.uk by searching "addidas" or anything weird.
I used this algorithm to give alternatives to an user when he gets no results after he submitted search by a specific keyword. Often happens when the keyword is misspelled or even there are no matches.
If an e-shop sells very expensive bracelets and has one called "Diamond Tennis Bracelet", it would be nice when the customer searched for "tenis" and he found nothing to give him a close suggestion like "tennis" and no random ones.
Here follows the Levenshtein algorithm implementation in C# in a O(mn) space
public static int distanceMN(string source, string destination, bool ignoreCase, int threshold) {
if (ignoreCase)
{
source = source.ToLower();
destination = destination.ToLower();
}
if (source == destination)
return 0;
var m = destination.Length;
var n = source.Length;
if (System.Math.Abs(m - n) > threshold)
{
return Int32.MaxValue;
}
var d = new int[m + 1, n + 1];
for (int i = 0; i <= m; i++)
d[i, 0] = i;
for (int j = 0; j <= n; j++)
d[0, j] = j;
for (int i = 1; i <= m; i++)
{
for (int j = 1; j <= n; j++)
{
var cost = destination[i - 1] == source[j - 1] ? 0 : 1;
var insertion = d[i, j - 1] + 1;
var substitution = d[i - 1, j - 1] + cost;
var deletion = d[i - 1, j] + 1;
d[i, j] = minimum(deletion, insertion, substitution);
}
}
return d[m, n];
}
Threshold parameter is used when want to have an upper bound limit for the distance. Is useful when the difference between words will be high for sure.
Based on wikipedia, I rewrote the algorithm for an O(m) space, by using a matrix with 2 lines.
public static int distance2M(string source, string destination, bool ignoreCase, int threshold)
{
if (ignoreCase)
{
source = source.ToLower();
destination = destination.ToLower();
}
if (source == destination)
return 0;
var m = destination.Length;
var n = source.Length;
if (System.Math.Abs(m - n) > threshold) {
return Int32.MaxValue;
}
var d = new int[2, n + 1];
d[0, 0] = 0;
d[1, 0] = 1;
for (int j = 0; j <= n; j++)
d[0, j] = j;
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
var cost = destination[i-1] == source[j-1] ? 0 : 1;
var insertion = d[1, j - 1] + 1;
var substitution = d[0, j - 1] + cost;
var deletion = d[0, j] + 1;
d[1, j] = minimum(deletion, insertion, substitution);
}
// move the new values on the first line
d[0, 0] = i;
d[0, 1] = i+1;
for (int j = 1; j <= n; j++) {
d[0, j] = d[1,j];
}
}
return d[1, n];
}
Can try this on the website urbanindustry.co.uk by searching "addidas" or anything weird.
Addias doesn't return any results :)
ReplyDeleteAddias doesn't return any results :)
ReplyDeleteAsa-i, pentru ca a suferit niste schimbari functia aia de search si am scos partea asta. Dar inca mai folosesc Levenshtein la unele chestii pe acolo
DeleteIncearca URL-ul
http://www.urbanindustry.co.uk/brands/addidas-originals.asp
Tot cu Levenshtein e tratat.