
Post on 04-Jan-2016


Pattern Matching

Rhys Price Jones

Anne R. Haake


Pattern matching algorithms - Review

• Finding all occurrences of pattern p in text t
• p has length m, t has length n
• Naïve algorithm and Rabin-Karp both have worst-case O(mn) and expected-case O(m+n) behavior
• Automaton approach preprocesses p to yield an O(n) algorithm
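
As a reminder of what the naïve algorithm does, here is a minimal Python sketch (the function name `naive_find_all` is illustrative, not from the slides). It tries every alignment of p against t, which is where the worst-case O(mn) comes from.

```python
def naive_find_all(p, t):
    """Return every index i such that t[i:i+len(p)] == p."""
    m, n = len(p), len(t)
    hits = []
    for i in range(n - m + 1):
        # Compare p with the window of t starting at i, one character at a time.
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:
            hits.append(i)
    return hits


print(naive_find_all("ACT", "ACACTACT"))  # [2, 5]
```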


Suffix Tree

• Preprocess t
• To yield an O(m) algorithm
• Useful if t is fixed and there are lots of p’s that you want to search for.


Tries

• Are often used for retrieval of keywords to provide efficient indexing.

• Suppose you want to index:
– PATTERN
– MONKEY
– PATAPAN
– PROBOSCIS
– PATHETIC

Build a TRIE

[Figures: the trie after each keyword is added; edges are labeled with strings.]
• For PATTERN: a single edge PATTERN from the root
• For PATTERN MONKEY: root edges PATTERN and MONKEY
• For PATTERN MONKEY PATAPAN: root edges PAT and MONKEY; below PAT, edges APAN and TERN
• For PATTERN MONKEY PATAPAN PROBOSCIS: root edges P and MONKEY; below P, edges AT and ROBOSCIS; below AT, edges APAN and TERN
• For PATTERN MONKEY PATAPAN PROBOSCIS PATHETIC: root edges P and M (for MONKEY); below P, edges AT and ROBOSCIS; below AT, edges APAN, HETIC and TERN

Each keyword can be located in O(k) steps where k is the length of the keyword
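
The O(k) lookup can be sketched in Python as follows. This is a simplified version, not the slides' structure: it uses one character per edge (the figures merge chains into string-labeled edges), nested dicts as nodes, and a hypothetical marker key `END` to record where a keyword stops.

```python
END = "$end"  # hypothetical marker key: "a keyword ends at this node"


def trie_insert(root, word):
    """Walk or create one node per character, then mark the end of the word."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True


def trie_contains(root, word):
    """Follow one edge per character: O(k) steps for a keyword of length k."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


root = {}
for w in ["PATTERN", "MONKEY", "PATAPAN", "PROBOSCIS", "PATHETIC"]:
    trie_insert(root, w)

print(trie_contains(root, "PATAPAN"))  # True
print(trie_contains(root, "PAT"))      # False: PAT is only a prefix
```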


Applications of a Trie

• Dictionary
– Just need to check if you get to a leaf to know the word exists
– Or store a link to the word’s definition at the leaf
• Index for a book
– Store a list of all pages where the keyword appears at the leaf
• Finding reserved words or filtering unwanted words …
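
The book-index application might look like the Python sketch below. The function names, the `"pages"` key and the page numbers are all made up for illustration; the trie again uses one character per edge.

```python
def index_add(root, keyword, page):
    """Record that `keyword` appears on `page`."""
    node = root
    for ch in keyword:
        node = node.setdefault(ch, {})
    # The page list lives at the node where the keyword ends.
    node.setdefault("pages", []).append(page)


def index_lookup(root, keyword):
    """Return the pages recorded for `keyword`, or [] if it was never indexed."""
    node = root
    for ch in keyword:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("pages", [])


book = {}
index_add(book, "trie", 12)   # hypothetical page numbers
index_add(book, "trie", 40)
index_add(book, "tree", 7)

print(index_lookup(book, "trie"))  # [12, 40]
print(index_lookup(book, "tri"))   # []
```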


Suffix Tree

• For a text t
• Is a Trie for the set of suffixes of t
• BIOINFORMATICS IOINFORMATICS OINFORMATICS INFORMATICS … ICS CS S
• Build it on the board
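
The set of suffixes that goes into the trie can be listed with a one-line Python sketch (the name `suffixes_of` mirrors the Scheme helper shown later in the slides):

```python
def suffixes_of(t):
    """Return all non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]


s = suffixes_of("BIOINFORMATICS")
print(s[:3])   # ['BIOINFORMATICS', 'IOINFORMATICS', 'OINFORMATICS']
print(s[-3:])  # ['ICS', 'CS', 'S']
```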

Suffix Tree For ACACTACT

ACACTACT CACTACT ACTACT CTACT TACT ACT CT T

[Figures: the tree after each suffix is inserted; each leaf holds the suffix's starting position in the text.]
• ACACTACT: root edge ACACTACT to leaf 0
• CACTACT: root edges ACACTACT (leaf 0) and CACTACT (leaf 1)
• ACTACT: root edge AC, which splits into ACTACT (leaf 0) and TACT (leaf 2); root edge CACTACT (leaf 1)
• CTACT: root edge AC as before; root edge C, which splits into ACTACT (leaf 1) and TACT (leaf 3)
• TACT: new root edge TACT (leaf 4)

What’s the Problem

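ACACTACT CACTACT ACTACT CTACT TACT ACT CT T

[Figure: the tree after inserting ACACTACT, CACTACT, ACTACT, CTACT and TACT: root edge AC splitting into ACTACT (leaf 0) and TACT (leaf 2), root edge C splitting into TACT (leaf 3) and ACTACT (leaf 1), and root edge TACT (leaf 4).]

This suffix is a prefix of another suffix: ACT, the next suffix to insert, is a prefix of ACTACT, so it ends in the middle of an existing edge instead of at a leaf of its own.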


What’s the Fix?

• Add a new symbol to the end of the string
• A symbol $ that does not appear elsewhere
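
A quick way to see why the terminator helps is to list the pairs of suffixes where one is a proper prefix of the other. The Python sketch below (the function name is illustrative) does this by brute force; it assumes $ does not occur in the original text.

```python
def suffix_prefix_pairs(t):
    """Return (shorter, longer) pairs where one suffix of t is a proper prefix of another."""
    sufs = [t[i:] for i in range(len(t))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


print(suffix_prefix_pairs("ACACTACT"))
# [('ACT', 'ACTACT'), ('CT', 'CTACT'), ('T', 'TACT')]
print(suffix_prefix_pairs("ACACTACT$"))
# []
```

Because $ occurs only at the very end, no suffix can stop partway along another one, so every suffix ends at its own leaf.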

Suffix Tree For ACACTACT$

ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $

[Figures: the tree after each suffix is inserted; each leaf holds the suffix's starting position in the text.]
• ACACTACT$: root edge ACACTACT$ to leaf 0
• CACTACT$: root edges ACACTACT$ (leaf 0) and CACTACT$ (leaf 1)
• ACTACT$: root edge AC, which splits into ACTACT$ (leaf 0) and TACT$ (leaf 2); root edge CACTACT$ (leaf 1)
• CTACT$: root edge C, which splits into ACTACT$ (leaf 1) and TACT$ (leaf 3)
• TACT$: new root edge TACT$ (leaf 4)
• ACT$: below AC, the edge TACT$ splits into T followed by ACT$ (leaf 2) and $ (leaf 5). Suffix is prefix problem went away
• CT$: below C, the edge TACT$ splits into T followed by ACT$ (leaf 3) and $ (leaf 6)
• T$: the root edge TACT$ splits into T followed by ACT$ (leaf 4) and $ (leaf 7)
• $: new root edge $ (leaf 8)


Use for Testing Substrings

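ACT follow links, get 2,5
CAC follow links, get 1

[Figure: the completed suffix tree for ACACTACT$, with root edges AC, C, T and $. AC splits into ACTACT$ (leaf 0) and T, which splits into ACT$ (leaf 2) and $ (leaf 5). C splits into ACTACT$ (leaf 1) and T, which splits into ACT$ (leaf 3) and $ (leaf 6). T splits into ACT$ (leaf 4) and $ (leaf 7). $ leads to leaf 8.]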
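
The two lookups can be reproduced with the Python sketch below. It is a deliberate simplification rather than the tree drawn on the slides: edges carry one character each, and every node (not just each leaf) keeps the starting positions of the suffixes that pass through it, so a search only has to follow p and read off the list. The names `build_suffix_trie` and `find_all` are illustrative, and $ is assumed not to occur in the text.

```python
def build_suffix_trie(t):
    """Uncompressed suffix trie of t + '$'; takes O(n^2) time and space."""
    t += "$"
    root = {"children": {}, "starts": []}
    for start in range(len(t)):
        node = root
        for ch in t[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            # Remember which suffix reached this node.
            node["starts"].append(start)
    return root


def find_all(root, p):
    """Follow p from the root; the node reached lists every occurrence of p."""
    node = root
    for ch in p:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]


tree = build_suffix_trie("ACACTACT")
print(find_all(tree, "ACT"))  # [2, 5]
print(find_all(tree, "CAC"))  # [1]
print(find_all(tree, "GG"))   # []
```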


Reprise

• To review the procedure, let’s build a suffix tree for MISSISSIPPI

• On the board
• Don’t forget the $


Code

(define suffix-tree ; input string, output suffix tree
  (lambda (t)
    (trie (suffixes-of t) (string-length t))))

(define suffixes-of ; input string, output list of all its suffixes
  (lambda (t)
    (cond ((zero? (string-length t)) '())
          (else (cons t (suffixes-of (substring t 1 (string-length t))))))))

(define trie
  ; input list of strings and n
  ; output suffix-tree style trie with
  ; n-(length of keyword) at the leaves
  (lambda (l n) ; l: list of keywords to put in a trie
    (tries (sort (lambda (x y) (string<=? x y)) l) n)))


More code

; samestarts, singleton?, commonprefix, chop, nthcdr and the make-*
; constructors are helpers that are not shown on the slide
(define tries ; builds a trie from sorted list l. Leaves as above
  (lambda (l n)
    (cond ((null? l) (make-empty-trie))
          ((singleton? (samestarts l))
           (make-internal-node
            (make-edge (car l)
                       (make-leaf (- n (string-length (car l)))))
            (tries (cdr l) n)))
          (else
           (let ((childstrings (samestarts l)))
             (let ((label (commonprefix childstrings)))
               (let ((childnode
                      (trie (map (chop (string-length label)) childstrings)
                            (- n (string-length label)))))
                 (make-internal-node
                  (make-edge label childnode)
                  (tries (nthcdr l (length childstrings)) n)))))))))
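
For readers who want to run the construction, here is a rough Python transcription of the same idea. It is a sketch under assumptions, not a faithful port. The representation (a node is a list of `(label, child)` edges, a leaf is an int) and the name `build` are invented here. The grouping loop stands in for what `samestarts` appears to do, and `os.path.commonprefix` plays the role of `commonprefix`.

```python
import os


def suffixes_of(t):
    return [t[i:] for i in range(len(t))]


def build(strings, n):
    """Compressed trie of distinct strings, as a list of (label, child) edges.

    A child is either a nested edge list or an int leaf n - len(keyword),
    which for suffixes is the starting position in the text.
    """
    strings = sorted(set(strings))
    edges = []
    i = 0
    while i < len(strings):
        # Collect the run of strings that share a first character.
        j = i
        while j < len(strings) and strings[j][:1] == strings[i][:1]:
            j += 1
        group = strings[i:j]
        if len(group) == 1:
            # Only one string starts this way: the whole string labels a leaf edge.
            edges.append((group[0], n - len(group[0])))
        else:
            # Several strings: factor out their common prefix and recurse.
            label = os.path.commonprefix(group)
            rest = [s[len(label):] for s in group]
            edges.append((label, build(rest, n - len(label))))
        i = j
    return edges


def suffix_tree(t):
    return build(suffixes_of(t), len(t))


tree = suffix_tree("ACACTACT$")
print(tree[1])
# ('AC', [('ACTACT$', 0), ('T', [('$', 5), ('ACT$', 2)])])
```

The leaf numbers agree with the completed tree for ACACTACT$; only the order of sibling edges differs, because the strings are sorted and $ sorts before the letters.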


Analysis

• Building suffix tree: O(n^2)
• Searching for p: O(m+k)
– Where p appears k times


Improvement possible

• Suffix tree for a text of length n can be built in time O(n)

• Thereafter each search takes O(m) to test whether p occurs (O(m+k) to list all k occurrences)


Applications in Biology

• Suffix Trees in Computational Biology
• Link doesn’t work