Post on 14-Dec-2015
Bitap Algorithm
Approximate string matching
Evlogi Hristov
Telerik Corporation
Student at Telerik Academy
Table of Contents
1. Levenshtein distance.
2. Bitap overview.
3. Bitap Exact search.
4. Bitap Fuzzy search.
5. Additional information.
Levenshtein distance
Edit distance: the minimum number of primitive operations needed to convert one string into an exact match of the other.
insertion: cot → coat
deletion: coat → cot
substitution: coat → cost
Example:
1. Set n to be the length of s = "GUMBO".
   Set m to be the length of t = "GAMBOL".
   If n = 0, return m and exit.
   If m = 0, return n and exit.

Completed matrix (columns follow s, rows follow t):

         G  U  M  B  O
      0  1  2  3  4  5
   G  1  0  1  2  3  4
   A  2  1  1  2  3  4
   M  3  2  2  1  2  3
   B  4  3  3  2  1  2
   O  5  4  4  3  2  1
   L  6  5  5  4  3  2
Levenshtein distance (2)
2. Initialize matrix M [m + 1, n + 1]: the first row holds 0..n and the first column holds 0..m.
3. Examine each character of s ( i from 1 to n )
4. Examine each character of t ( j from 1 to m )
5. If s[i] equals t[j], the cost is 0.
   If s[i] is not equal to t[j], the cost is 1.
6. Set cell M[j, i] equal to the minimum of:
a. The cell immediately above plus 1: M [j-1, i] + 1
b. The cell immediately to the left plus 1: M [j, i-1] + 1
c. The cell diagonally above and to the left plus the cost: M [j-1, i-1] + cost
7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the bottom-right cell M [m, n]
Levenshtein distance (3)
// Requires: using System;
private int Levenshtein(string source, string target)
{
    if (string.IsNullOrEmpty(source))
    {
        if (!string.IsNullOrEmpty(target))
        {
            return target.Length;
        }
        return 0;
    }

    if (string.IsNullOrEmpty(target))
    {
        if (!string.IsNullOrEmpty(source))
        {
            return source.Length;
        }
        return 0;
    }

    int[,] dist = new int[source.Length + 1, target.Length + 1];
    int min1, min2, min3, cost;

    // ...continues on next slide
Levenshtein distance (4)
    // First column and first row: distance from or to the empty prefix.
    for (int i = 0; i < dist.GetLength(0); i += 1)
    {
        dist[i, 0] = i;
    }
    for (int i = 0; i < dist.GetLength(1); i += 1)
    {
        dist[0, i] = i;
    }

    for (int i = 1; i < dist.GetLength(0); i++)
    {
        for (int j = 1; j < dist.GetLength(1); j++)
        {
            cost = source[i - 1] == target[j - 1] ? 0 : 1;
            min1 = dist[i - 1, j] + 1;        // cell above
            min2 = dist[i, j - 1] + 1;        // cell to the left
            min3 = dist[i - 1, j - 1] + cost; // diagonal cell
            dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
        }
    }
    return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
}
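The full matrix above is only needed if the edit path itself is wanted. When just the distance matters, two rows are enough, because each cell depends only on the current and the previous row. The following is a minimal sketch of that idea; the class name LevenshteinSketch and the method name Distance are made up for illustration and are not part of the original slides.

```csharp
using System;

public static class LevenshteinSketch
{
    // Two-row variant: keeps only the previous and the current matrix row.
    public static int Distance(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: building target[0..j) from an empty string takes j insertions.
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            // Column 0: reducing source[0..i) to an empty string takes i deletions.
            current[0] = i;

            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }

            // The current row becomes the previous row for the next iteration.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[target.Length];
    }
}
```

For the example strings, LevenshteinSketch.Distance("GUMBO", "GAMBOL") returns 2, the value in the bottom-right cell of the matrix.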
Bitap algorithm
Also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.
Approximate string matching algorithm.
Approximate equality is defined in terms of Levenshtein distance.
Often used for fuzzy search without indexing.
Does most of the work with bitwise operations.
Runs in O(mn) operations (m is the pattern length, n is the text length), no matter the structure of the text or the pattern.
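Bitap Exact search (2)
// Requires: using System.Collections.Generic;
// Assumes ASCII input and a pattern shorter than 63 characters (one bit per pattern position in a long).
public static List<int> ExactMatch(string text, string pattern)
{
    long[] alphabet = new long[128]; // ASCII range (0 – 127)
    for (int i = 0; i < pattern.Length; ++i)
    {
        int letter = (int)pattern[i];
        // 1L keeps the shift in 64 bits; a plain 1 is an int and wraps for i >= 32.
        alphabet[letter] = alphabet[letter] | (1L << i);
    }

    long result = 1; // 0000 0001
    List<int> indexes = new List<int>();
    for (int index = 0; index < text.Length; index++)
    {
        // Keep only the pattern prefixes that the current character extends;
        // if there are none, result becomes 0.
        result &= alphabet[text[index]];
        result = (result << 1) + 1;

        if ((result & (1L << pattern.Length)) > 0)
        {
            indexes.Add(index - pattern.Length + 1);
        }
    }
    return indexes;
}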
Bitap Exact search
Example: text = cbdabababc, pattern = ababc

index:          0 1 2 3 4
pattern:        a b a b c

Character masks (bit 0 is on the right, so the pattern reads right to left):

bits:           4 3 2 1 0
pattern:        c b a b a
alphabet[a] =   0 0 1 0 1   = 5
alphabet[b] =   0 1 0 1 0   = 10
alphabet[c] =   1 0 0 0 0   = 16
alphabet[d] =   0 0 0 0 0   = 0

Trace (res is shown right after result &= alphabet[text[i]]):

bits:           4 3 2 1 0
start res:      0 0 0 0 1   = 1

text[i]   last characters read   res
c         c                      0 0 0 0 0
b         c b                    0 0 0 0 0
d         c b d                  0 0 0 0 0
a         c b d a                0 0 0 0 1
b         c b d a b              0 0 0 1 0
a         b d a b a              0 0 1 0 1
b         d a b a b              0 1 0 1 0
a         a b a b a              0 0 1 0 1
b         b a b a b              0 1 0 1 0
c         a b a b c              1 0 0 0 0
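The masks and the trace in this example can be reproduced with a few lines of code. The sketch below is an illustration only: the class BitapTraceSketch and its methods BuildMasks and Trace are hypothetical names, and it uses a dictionary instead of the 128-element array so that characters outside the pattern simply get a mask of 0.

```csharp
using System;
using System.Collections.Generic;

public static class BitapTraceSketch
{
    // One mask per pattern character: bit i is set when pattern[i] is that character.
    public static Dictionary<char, long> BuildMasks(string pattern)
    {
        var masks = new Dictionary<char, long>();
        for (int i = 0; i < pattern.Length; i++)
        {
            long mask;
            masks.TryGetValue(pattern[i], out mask); // mask stays 0 for a new character
            masks[pattern[i]] = mask | (1L << i);
        }
        return masks;
    }

    // Returns the state right after the AND step for every text position.
    public static List<long> Trace(string text, string pattern)
    {
        Dictionary<char, long> masks = BuildMasks(pattern);
        var states = new List<long>();
        long result = 1;

        foreach (char c in text)
        {
            long mask;
            masks.TryGetValue(c, out mask); // 0 for characters not in the pattern
            result &= mask;
            states.Add(result);
            result = (result << 1) | 1;     // shift and re-arm bit 0 for a new match
        }
        return states;
    }
}
```

BuildMasks("ababc") gives 5, 10 and 16 for a, b and c. Trace("cbdabababc", "ababc") gives 0, 0, 0, 1, 2, 5, 10, 5, 10, 16; the final 16 has bit 4 set, the bit of the last pattern position, which marks the match ending at the last text character.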
Fuzzy searching
...
long[] result = new long[k + 1];
for (int i = 0; i <= k; i++)
{
    result[i] = 1;
}
...
for (int j = 1; j <= k; ++j)
{
    // Three operations of the Levenshtein distance
    long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
    long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
    long substitution = (previous | (result[j] & patternMask[text[i]])) << 1;

    current = result[j];
    result[j] = substitution | insertion | deletion | 1;
    previous = result[j];
}
...
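Instead of having a single bit array result that changes over the length of the text, we now keep a separate bit array result[j] for every error count j up to k. result[j] records which pattern prefixes match the text read so far with at most j errors.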
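Here is a complete version of this idea, as a hedged sketch. It follows the textbook Wu-Manber shift-and recurrence rather than the exact expressions in the fragment above, whose surrounding code is not shown. The class BitapFuzzySketch and the method FuzzyMatchEnds are hypothetical names. The method reports the index of the last text character of each match, because with insertions and deletions the start of a match is not unique.

```csharp
using System;
using System.Collections.Generic;

public static class BitapFuzzySketch
{
    // Returns every text index at which a match with at most k errors ends.
    public static List<int> FuzzyMatchEnds(string text, string pattern, int k)
    {
        int m = pattern.Length;
        if (m == 0 || m > 62) throw new ArgumentException("pattern length must be 1..62");
        if (k < 0) throw new ArgumentException("k must not be negative");

        var patternMask = new Dictionary<char, long>();
        for (int i = 0; i < m; i++)
        {
            long mask;
            patternMask.TryGetValue(pattern[i], out mask);
            patternMask[pattern[i]] = mask | (1L << i);
        }

        // result[d]: bit i is set when pattern[0..i] matches a suffix of the
        // text read so far with at most d errors.
        long[] result = new long[k + 1];
        for (int d = 1; d <= k; d++)
        {
            // Before any text is read, a prefix of length <= d matches by deleting it.
            result[d] = (1L << Math.Min(d, m)) - 1;
        }

        long matchBit = 1L << (m - 1);
        var ends = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long mask;
            patternMask.TryGetValue(text[i], out mask); // 0 if not in the pattern

            long previousOld = result[0]; // result[d - 1] before this character
            result[0] = ((result[0] << 1) | 1) & mask;

            for (int d = 1; d <= k; d++)
            {
                long old = result[d];
                long match = ((old << 1) | 1) & mask;
                long insertion = previousOld;               // extra character in the text
                long substitution = (previousOld << 1) | 1; // text character replaces a pattern character
                long deletion = (result[d - 1] << 1) | 1;   // pattern character missing from the text
                result[d] = match | insertion | substitution | deletion;
                previousOld = old;
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }
        return ends;
    }
}
```

With k = 0 this reduces to the exact search: FuzzyMatchEnds("cbdabababc", "ababc", 0) reports only index 9. With one error allowed, FuzzyMatchEnds("coat", "cot", 1) reports indexes 1, 2 and 3, since "co", "coa" and "coat" are each one edit away from "cot".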
Shift-and vs. Shift-or
Shift-and:
Uses bitwise & and 1’s for matches
More intuitive and easier to understand
Needs the extra step result |= 1 after every shift
Shift-or:
Uses bitwise | and 0’s for matches
A bit faster
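To make the difference concrete, here is a minimal shift-or sketch of the exact search. The class name ShiftOrSketch is hypothetical and the code is an illustration, not the implementation from the slides. All bits are inverted compared to shift-and: a 0 bit means "still matching", and the left shift brings in a 0 at bit 0 by itself, so the extra | 1 step disappears.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftOrSketch
{
    // Returns the start index of every exact occurrence of pattern in text.
    public static List<int> ExactMatch(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0 || m > 63) throw new ArgumentException("pattern length must be 1..63");

        // Bit i of a mask is 0 when pattern[i] is that character, 1 otherwise.
        var masks = new Dictionary<char, long>();
        for (int i = 0; i < m; i++)
        {
            long mask;
            if (!masks.TryGetValue(pattern[i], out mask))
            {
                mask = ~0L;
            }
            masks[pattern[i]] = mask & ~(1L << i);
        }

        long state = ~0L;               // all ones: nothing matched yet
        long matchBit = 1L << (m - 1);
        var indexes = new List<int>();

        for (int i = 0; i < text.Length; i++)
        {
            long mask;
            if (!masks.TryGetValue(text[i], out mask))
            {
                mask = ~0L;             // character does not occur in the pattern
            }

            // The shift inserts a 0 at bit 0, which starts a new candidate match.
            state = (state << 1) | mask;

            if ((state & matchBit) == 0)
            {
                indexes.Add(i - m + 1);
            }
        }
        return indexes;
    }
}
```

For the earlier example, ShiftOrSketch.ExactMatch("cbdabababc", "ababc") returns the single start index 5, and overlapping matches are found as well: ExactMatch("aaaa", "aa") returns 0, 1 and 2.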
Questions?
Bitap algorithm
http://algoacademy.telerik.com
Links for more information
Original paper of Baeza-Yates and Gonnet: http://www.akira.ruc.dk/~keld/teaching/algoritmedesign_f08/Artikler/09/Baeza92.pdf
Google implementation using bitap: https://code.google.com/p/google-diff-match-patch
Levenshtein algorithm: http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance
Free Trainings @ Telerik Academy
C# Programming @ Telerik Academy csharpfundamentals.telerik.com
Telerik Software Academy academy.telerik.com
Telerik Academy @ Facebook facebook.com/TelerikAcademy
Telerik Software Academy Forums forums.academy.telerik.com