
Post on 14-Dec-2015


Bitap Algorithm

Approximate string matching

Evlogi Hristov

Telerik Corporation

Student at Telerik Academy

Table of Contents

1. Levenshtein distance.

2. Bitap overview.

3. Bitap Exact search.

4. Bitap Fuzzy search.

5. Additional information.


Levenshtein distance

Edit distance

Levenshtein distance

Edit distance: the minimum number of primitive operations necessary to convert one string into an exact match of another.

insertion: cot → coat

deletion: coat → cot

substitution: coat → cost

Example:

1. Set n to be the length of s = "GUMBO".
   Set m to be the length of t = "GAMBOL".
   If n = 0, return m and exit.
   If m = 0, return n and exit.

         G  U  M  B  O

      0  1  2  3  4  5

   G  1  0  1  2  3  4

   A  2  1  1  2  3  4

   M  3  2  2  1  2  3

   B  4  3  3  2  1  2

   O  5  4  4  3  2  1

   L  6  5  5  4  3  2

Levenshtein distance (2)

2. Initialize matrix M [m + 1, n + 1]: the first row holds 0..n and the first column holds 0..m.

3. Examine each character of s ( i from 1 to n )

4. Examine each character of t ( j from 1 to m )

5. If s[i] equals t[j], the cost is 0.
   If s[i] is not equal to t[j], the cost is 1.

6. Set cell M[j, i] equal to the minimum of:

a. The cell immediately above plus 1: M [j-1, i] + 1

b. The cell immediately to the left plus 1: M [j, i-1] + 1

c. The cell diagonally above and to the left plus the cost: M [j-1, i-1] + cost

7. After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the bottom-right cell M [m, n]

Levenshtein distance (3)

private int Levenshtein(string source, string target)
{
    // An empty (or null) string is as far from the other string as that string is long.
    if (string.IsNullOrEmpty(source))
    {
        if (!string.IsNullOrEmpty(target))
        {
            return target.Length;
        }
        return 0;
    }

    if (string.IsNullOrEmpty(target))
    {
        // source is known to be non-empty at this point
        return source.Length;
    }

    int[,] dist = new int[source.Length + 1, target.Length + 1];
    int min1, min2, min3, cost;

    // ...continues on the next slide

Levenshtein distance (4)

    // First column and first row: distance from/to the empty prefix
    for (int i = 0; i < dist.GetLength(0); i += 1)
    {
        dist[i, 0] = i;
    }
    for (int i = 0; i < dist.GetLength(1); i += 1)
    {
        dist[0, i] = i;
    }

    for (int i = 1; i < dist.GetLength(0); i++)
    {
        for (int j = 1; j < dist.GetLength(1); j++)
        {
            cost = Convert.ToInt32(!(source[i - 1] == target[j - 1]));
            min1 = dist[i - 1, j] + 1;        // cell above + 1
            min2 = dist[i, j - 1] + 1;        // cell to the left + 1
            min3 = dist[i - 1, j - 1] + cost; // diagonal cell + cost
            dist[i, j] = Math.Min(Math.Min(min1, min2), min3);
        }
    }
    return dist[dist.GetLength(0) - 1, dist.GetLength(1) - 1];
}
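The matrix only ever reads the row directly above the one being filled, so the same recurrence can be run with two rows instead of the full table. The following is a minimal sketch under that assumption, not part of the original slides; the class name EditDistance and the variable names are hypothetical.

```csharp
using System;

public static class EditDistance
{
    // Hypothetical helper: same recurrence as the full-matrix version,
    // but only the previous and the current row are kept in memory.
    public static int Levenshtein(string source, string target)
    {
        source = source ?? string.Empty;
        target = target ?? string.Empty;

        int[] previous = new int[target.Length + 1];
        int[] current = new int[target.Length + 1];

        // Row 0: the empty prefix of source needs j insertions to become target[0..j).
        for (int j = 0; j <= target.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= source.Length; i++)
        {
            current[0] = i; // i deletions turn source[0..i) into the empty string
            for (int j = 1; j <= target.Length; j++)
            {
                int cost = source[i - 1] == target[j - 1] ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(deletion, insertion), substitution);
            }

            // The row just filled becomes the "previous" row of the next iteration.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }

        return previous[target.Length];
    }
}
```

For the strings used in the example, EditDistance.Levenshtein("GUMBO", "GAMBOL") evaluates to 2 (substitute U with A, insert L), which agrees with the bottom-right cell of the matrix.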


Bitap algorithm

shift-or / shift-and

Bitap algorithm

Also known as the shift-or, shift-and or Baeza–Yates–Gonnet algorithm.

Approximate string matching algorithm.

Approximate equality is defined in terms of Levenshtein distance.

Often used for fuzzy search without indexing.

Does most of the work with bitwise operations.

Runs in O(mn) operations, no matter the structure of the text or the pattern.


Bitap Exact search (2)

public static List<int> ExactMatch(string text, string pattern)
{
    // Bit i of alphabet[c] is set when pattern[i] == c.
    long[] alphabet = new long[128]; // ASCII range (0 - 127)
    for (int i = 0; i < pattern.Length; ++i)
    {
        int letter = (int)pattern[i];
        alphabet[letter] |= 1L << i; // 1L keeps the shift in 64 bits
    }

    long result = 1; // 0000 0001
    List<int> indexes = new List<int>();
    for (int index = 0; index < text.Length; index++)
    {
        // Keep only the prefixes that text[index] extends; 0 if none does
        result &= alphabet[text[index]];
        result = (result << 1) + 1;

        // Bit pattern.Length set => the whole pattern ends at index
        if ((result & (1L << pattern.Length)) > 0)
        {
            indexes.Add(index - pattern.Length + 1);
        }
    }
    return indexes;
}

Bitap Exact search

Example: text = cbdabababc, pattern = ababc

index:          0 1 2 3 4
pattern:        a b a b c

bits:           4 3 2 1 0
pattern:        c b a b a
alphabet[a] =   0 0 1 0 1   = 5
alphabet[b] =   0 1 0 1 0   = 10
alphabet[c] =   1 0 0 0 0   = 16
alphabet[d] =   0 0 0 0 0   = 0

start res:      0 0 0 0 1   = 1

text[i] (last five characters read)    res (after res &= alphabet[text[i]])
c                                      0 0 0 0 0
c b                                    0 0 0 0 0
c b d                                  0 0 0 0 0
c b d a                                0 0 0 0 1
c b d a b                              0 0 0 1 0
b d a b a                              0 0 1 0 1
d a b a b                              0 1 0 1 0
a b a b a                              0 0 1 0 1
b a b a b                              0 1 0 1 0
a b a b c                              1 0 0 0 0
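The res values in the example can be reproduced with a small sketch like the one below. It records the state right after the AND step for every text character, which is the value the example displays. The class BitapTrace and its two methods are hypothetical names introduced for illustration, and the code assumes ASCII input, like the 128-entry alphabet table.

```csharp
using System;
using System.Collections.Generic;

public static class BitapTrace
{
    // Returns the value of result right after "result &= alphabet[text[i]]"
    // for every position i of the text. Assumes ASCII text and pattern.
    public static List<long> Trace(string text, string pattern)
    {
        long[] alphabet = new long[128];
        for (int i = 0; i < pattern.Length; i++)
        {
            alphabet[pattern[i]] |= 1L << i; // bit i set when pattern[i] is this letter
        }

        var states = new List<long>();
        long result = 1;
        foreach (char c in text)
        {
            result &= alphabet[c];
            states.Add(result);
            result = (result << 1) | 1; // bit 0 is free after the shift, so | 1 equals + 1
        }
        return states;
    }

    // Formats a state as a fixed-width binary string, e.g. ToBits(5, 5) gives "00101".
    public static string ToBits(long value, int width)
    {
        return Convert.ToString(value, 2).PadLeft(width, '0');
    }
}
```

Calling BitapTrace.Trace("cbdabababc", "ababc") yields 0, 0, 0, 1, 2, 5, 10, 5, 10, 16. The last state has bit 4 set, so the whole pattern ends at text index 9 and starts at index 5.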

Fuzzy searching

...
long[] result = new long[k + 1];
for (int i = 0; i <= k; i++)
{
    result[i] = 1;
}
...
for (int j = 1; j <= k; ++j)
{
    // Three operations of the Levenshtein distance
    long insertion = current | ((result[j] & patternMask[text[i]]) << 1);
    long deletion = (previous | (result[j] & patternMask[text[i]])) << 1;
    long substitution = (previous | (result[j] & patternMask[text[i]])) << 1;

    current = result[j];
    result[j] = substitution | insertion | deletion | 1;
    previous = result[j];
}
...

Instead of having a single bit array result that changes over the length of the text, we now keep k additional bit arrays result[1..k], one for each allowed number of errors, next to result[0] for the exact match.
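Since the slide shows only a fragment, here is a self-contained sketch of one common way to write the shift-and fuzzy search (the Wu-Manber style recurrence). It is not the author's full listing: the class FuzzyBitap, the method FuzzyMatchEnds, and the decision to report end positions are assumptions made for illustration. Bit j of result[d] means "the first j pattern characters match a suffix of the text read so far with at most d errors".

```csharp
using System;
using System.Collections.Generic;

public static class FuzzyBitap
{
    // Returns every text index at which an occurrence of pattern
    // with at most k errors (Levenshtein distance) ends.
    // Assumes an ASCII pattern of 1..62 characters and 0 <= k < pattern.Length.
    public static List<int> FuzzyMatchEnds(string text, string pattern, int k)
    {
        if (pattern.Length == 0 || pattern.Length > 62)
            throw new ArgumentException("pattern must have 1..62 characters");
        if (k < 0 || k >= pattern.Length)
            throw new ArgumentException("k must satisfy 0 <= k < pattern.Length");

        long[] patternMask = new long[128];
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("ASCII pattern expected");
            patternMask[pattern[i]] |= 1L << i;
        }

        // With d errors allowed, prefixes of length 0..d match the empty text (d deletions).
        long[] result = new long[k + 1];
        for (int d = 0; d <= k; d++)
        {
            result[d] = (1L << (d + 1)) - 1;
        }

        long matchBit = 1L << pattern.Length;
        var ends = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            long mask = text[i] < 128 ? patternMask[text[i]] : 0;

            long previousOld = result[0];              // result[d - 1] before this character
            result[0] = ((result[0] & mask) << 1) | 1; // exact matching, as before
            for (int d = 1; d <= k; d++)
            {
                long currentOld = result[d];
                long match = (currentOld & mask) << 1;
                long insertion = previousOld;          // text[i] is an extra character
                long substitution = previousOld << 1;  // text[i] replaces a pattern character
                long deletion = result[d - 1] << 1;    // a pattern character is skipped
                result[d] = match | insertion | substitution | deletion | 1;
                previousOld = currentOld;
            }

            if ((result[k] & matchBit) != 0)
            {
                ends.Add(i);
            }
        }
        return ends;
    }
}
```

With the strings from the edit-distance slide, FuzzyBitap.FuzzyMatchEnds("xxcoatxx", "cost", 1) returns the single end position 5 (the substring "coat", one substitution), while the same call with k = 0 returns an empty list.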

Shift-and vs. Shift-or

Shift-and:

Uses bitwise & and 1s for matches

More intuitive and easier to understand

Needs to add result |= 1

Shift-or:

Uses bitwise | and 0s for matches

A bit faster
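For comparison with the shift-and listing, exact matching in the shift-or form might look like the sketch below. The class name ShiftOr is hypothetical and the code assumes an ASCII pattern of at most 63 characters. A 0 bit now means "match", so the left shift itself brings in the 0 for the empty prefix and no extra | 1 step is needed.

```csharp
using System;
using System.Collections.Generic;

public static class ShiftOr
{
    // Exact matching with inverted bits: bit i of the state is 0 when the
    // first i + 1 pattern characters match the text ending at the current index.
    public static List<int> ExactMatch(string text, string pattern)
    {
        if (pattern.Length == 0 || pattern.Length > 63)
            throw new ArgumentException("pattern must have 1..63 characters");

        long[] alphabet = new long[128];
        for (int c = 0; c < alphabet.Length; c++)
        {
            alphabet[c] = ~0L; // all ones: no position matches yet
        }
        for (int i = 0; i < pattern.Length; i++)
        {
            if (pattern[i] >= 128) throw new ArgumentException("ASCII pattern expected");
            alphabet[pattern[i]] &= ~(1L << i); // clear bit i where pattern[i] is this letter
        }

        long state = ~0L;
        long matchBit = 1L << (pattern.Length - 1);
        var indexes = new List<int>();
        for (int index = 0; index < text.Length; index++)
        {
            long mask = text[index] < 128 ? alphabet[text[index]] : ~0L;
            state = (state << 1) | mask; // the shift inserts a 0 at bit 0

            if ((state & matchBit) == 0)
            {
                indexes.Add(index - pattern.Length + 1);
            }
        }
        return indexes;
    }
}
```

ShiftOr.ExactMatch("cbdabababc", "ababc") returns the start index 5, the same occurrence the shift-and example finds.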


Questions?


Bitap algorithm

http://algoacademy.telerik.com

Free Trainings @ Telerik Academy

C# Programming @ Telerik Academy csharpfundamentals.telerik.com

Telerik Software Academy academy.telerik.com

Telerik Academy @ Facebook facebook.com/TelerikAcademy

Telerik Software Academy Forums forums.academy.telerik.com