Pattern Matching
Rhys Price Jones
Anne R. Haake
Pattern matching algorithms - Review
• Finding all occurrences of pattern p in text t
• p has length m, t has length n
• Naïve algorithm and Rabin-Karp both have worst-case O(mn) and expected-case O(m+n) behavior
• Automaton approach preprocesses p to yield an O(n) algorithm
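As a reminder of the baseline, here is a minimal Python sketch of the naïve algorithm. The function name `naive_find_all` is made up for illustration. It tries every alignment of p against t, which is where the worst-case O(mn) comes from.

```python
def naive_find_all(p: str, t: str) -> list[int]:
    """Return every index i at which p occurs in t."""
    m, n = len(p), len(t)
    hits = []
    for i in range(n - m + 1):
        # Compare p with t[i:i+m] one character at a time,
        # stopping at the first mismatch.
        j = 0
        while j < m and t[i + j] == p[j]:
            j += 1
        if j == m:
            hits.append(i)
    return hits


print(naive_find_all("ACT", "ACACTACT"))  # [2, 5]
```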
Suffix Tree
• Preprocess t
• To yield an O(m) algorithm
• Useful if t is fixed and there are lots of p’s that you want to search for
Tries
• Are often used for retrieval of keywords to provide efficient indexing.
• Suppose you want to index:
– PATTERN
– MONKEY
– PATAPAN
– PROBOSCIS
– PATHETIC
Build a TRIE
• For PATTERN
[Figure: trie with a single edge labeled PATTERN from the root]
Build a TRIE
• For PATTERN MONKEY
[Figure: trie with two edges from the root, labeled PATTERN and MONKEY]
Build a TRIE
• For PATTERN MONKEY PATAPAN
[Figure: root edges PAT and MONKEY; below PAT, edges APAN and TERN]
Build a TRIE
• For PATTERN MONKEY PATAPAN PROBOSCIS
[Figure: root edges MONKEY and P; below P, edges AT and ROBOSCIS; below AT, edges APAN and TERN]
Build a TRIE
• For PATTERN MONKEY PATAPAN PROBOSCIS PATHETIC
[Figure: root edges P and M; below P, edges AT and ROBOSCIS; below AT, edges APAN, HETIC and TERN]
Each keyword can be located in O(k) steps where k is the length of the keyword
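The O(k) lookup can be sketched in Python as follows. This is a simplified version that assumes one character per edge, rather than the merged edge labels drawn in the figures. The names `TrieNode`, `insert` and `contains` are made up for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a keyword ends at this node


def insert(root, word):
    """Add word to the trie, creating nodes as needed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True


def contains(root, word):
    """Follow one edge per character: k steps for a keyword of length k."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


root = TrieNode()
for w in ["PATTERN", "MONKEY", "PATAPAN", "PROBOSCIS", "PATHETIC"]:
    insert(root, w)

print(contains(root, "PATAPAN"))  # True
print(contains(root, "PAT"))      # False: a prefix, but not a stored keyword
```

The number of steps depends only on the length of the keyword, not on how many keywords are stored.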
Applications of a Trie
• Dictionary
– Just need to check if you get to a leaf to know the word exists
– Or store a link to the word’s definition at the leaf
• Index for a book
– Store a list of all pages where the keyword appears at the leaf
• Finding reserved words or filtering unwanted words …
Suffix Tree
• For a text t
• Is a Trie for the set of suffixes of t
• BIOINFORMATICS IOINFORMATICS OINFORMATICS INFORMATICS … ICS CS S
• Build it on the board
Suffix Tree For ACACTACT
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
[Figure: root with a single edge ACACTACT leading to leaf 0]
Suffix Tree For ACACTACT
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
[Figure: root edges ACACTACT (leaf 0) and CACTACT (leaf 1)]
Suffix Tree For ACACTACT
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
[Figure: root edges AC and CACTACT (leaf 1); below AC, edges ACTACT (leaf 0) and TACT (leaf 2)]
Suffix Tree For ACACTACT
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
[Figure: root edges AC and C; below AC, edges ACTACT (leaf 0) and TACT (leaf 2); below C, edges TACT (leaf 3) and ACTACT (leaf 1)]
Suffix Tree For ACACTACT
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
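[Figure: root edges AC, C and TACT (leaf 4); below AC, edges ACTACT (leaf 0) and TACT (leaf 2); below C, edges TACT (leaf 3) and ACTACT (leaf 1)]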
What’s the Problem
ACACTACT CACTACT ACTACT CTACT TACT ACT CT T
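[Figure: root edges AC, C and TACT (leaf 4); below AC, edges ACTACT (leaf 0) and TACT (leaf 2); below C, edges TACT (leaf 3) and ACTACT (leaf 1)]
The next suffix, ACT, is a prefix of another suffix (ACTACT), so it would not end at a leaf of its own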
What’s the Fix?
• Add a new symbol to the end of the string
• A symbol $ that does not appear elsewhere
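A small Python check illustrates why the terminator helps. The helper names `suffixes` and `suffixes_that_are_prefixes` are made up for illustration. It lists the suffixes that are a prefix of some other suffix, with and without the trailing $.

```python
def suffixes(t):
    """All non-empty suffixes of t, longest first."""
    return [t[i:] for i in range(len(t))]


def suffixes_that_are_prefixes(t):
    """Suffixes of t that are a proper prefix of some other suffix of t."""
    sufs = suffixes(t)
    return [s for s in sufs
            if any(other != s and other.startswith(s) for other in sufs)]


print(suffixes_that_are_prefixes("ACACTACT"))   # ['ACT', 'CT', 'T']
print(suffixes_that_are_prefixes("ACACTACT$"))  # []
```

Because $ occurs only at the very end, no suffix ending in $ can reappear at the start of a longer suffix, so every suffix gets its own leaf.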
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
[Figure: root with a single edge ACACTACT$ leading to leaf 0]
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
[Figure: root edges ACACTACT$ (leaf 0) and CACTACT$ (leaf 1)]
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
[Figure: root edges AC and CACTACT$ (leaf 1); below AC, edges ACTACT$ (leaf 0) and TACT$ (leaf 2)]
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
[Figure: root edges AC and C; below AC, edges ACTACT$ (leaf 0) and TACT$ (leaf 2); below C, edges TACT$ (leaf 3) and ACTACT$ (leaf 1)]
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
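[Figure: root edges AC, C and TACT$ (leaf 4); below AC, edges ACTACT$ (leaf 0) and TACT$ (leaf 2); below C, edges TACT$ (leaf 3) and ACTACT$ (leaf 1)]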
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
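[Figure: root edges AC, C and TACT$ (leaf 4); below AC, edges ACTACT$ (leaf 0) and T; below that T, edges ACT$ (leaf 2) and $ (leaf 5); below C, edges TACT$ (leaf 3) and ACTACT$ (leaf 1)]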
Suffix is prefix problem went away
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
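[Figure: root edges AC, C and TACT$ (leaf 4); below AC, edges ACTACT$ (leaf 0) and T, with ACT$ (leaf 2) and $ (leaf 5) below T; below C, edges ACTACT$ (leaf 1) and T, with ACT$ (leaf 3) and $ (leaf 6) below T]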
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
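[Figure: root edges AC, C and T; below AC, edges ACTACT$ (leaf 0) and T, with ACT$ (leaf 2) and $ (leaf 5) below T; below C, edges ACTACT$ (leaf 1) and T, with ACT$ (leaf 3) and $ (leaf 6) below T; below the root's T, edges ACT$ (leaf 4) and $ (leaf 7)]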
Suffix Tree For ACACTACT$
ACACTACT$ CACTACT$ ACTACT$ CTACT$ TACT$ ACT$ CT$ T$ $
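[Figure: root edges AC, C, T and $ (leaf 8); below AC, edges ACTACT$ (leaf 0) and T, with ACT$ (leaf 2) and $ (leaf 5) below T; below C, edges ACTACT$ (leaf 1) and T, with ACT$ (leaf 3) and $ (leaf 6) below T; below the root's T, edges ACT$ (leaf 4) and $ (leaf 7)]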
Use for Testing Substrings
• ACT: follow links, get 2, 5
• CAC: follow links, get 1
[Figure: the completed suffix tree for ACACTACT$, with leaves 0 to 8]
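The substring test can be sketched in Python. For brevity this version uses an uncompressed suffix trie with one character per edge, stored as nested dictionaries, rather than the compressed tree in the figure. The names `build_suffix_trie` and `occurrences` are made up for illustration, and the text is assumed not to contain $.

```python
def build_suffix_trie(t):
    """Trie of all suffixes of t + '$'; each leaf stores its suffix's start index."""
    text = t + "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # "leaf" is longer than one character, so it cannot clash with an edge.
        node["leaf"] = start
    return root


def occurrences(root, p):
    """Follow p from the root, then collect every leaf index below that point."""
    node = root
    for ch in p:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("ACACTACT")
print(occurrences(trie, "ACT"))  # [2, 5]
print(occurrences(trie, "CAC"))  # [1]
```

Following the pattern takes one step per character. The leaves below the point reached are exactly the positions where the pattern starts.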
Reprise
• To review the procedure, let’s build a suffix tree for MISSISSIPPI
• On the board
• Don’t forget the $
Code
(define suffix-tree
  ; input: string; output: suffix tree
  (lambda (t)
    (trie (suffixes-of t) (string-length t))))

(define suffixes-of
  ; input: string; output: list of all its suffixes
  (lambda (t)
    (cond ((zero? (string-length t)) '())
          (else (cons t (suffixes-of (substring t 1 (string-length t))))))))

(define trie
  ; input: list of strings and n
  ; output: suffix-tree style trie with
  ; n-(length of keyword) at the leaves
  (lambda (l n) ; list of keywords to put in a trie
    (tries (sort (lambda (x y) (string<=? x y)) l) n)))
More code
(define tries
  ; builds a trie from sorted list l. Leaves as above
  (lambda (l n)
    (cond ((null? l) (make-empty-trie))
          ((singleton? (samestarts l))
           (make-internal-node
             (make-edge (car l) (make-leaf (- n (string-length (car l)))))
             (tries (cdr l) n)))
          (else
           (let ((childstrings (samestarts l)))
             (let ((label (commonprefix childstrings)))
               (let ((childnode
                      (trie (map (chop (string-length label)) childstrings)
                            (- n (string-length label)))))
                 (make-internal-node
                   (make-edge label childnode)
                   (tries (nthcdr l (length childstrings)) n)))))))))
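For readers who want to run the idea, here is a rough Python counterpart of the construction above. It is a sketch under assumptions, not a translation of the Scheme. It groups strings by first character instead of sorting them, and it uses plain dictionaries in place of the node and edge constructors, with an integer standing for a leaf. The names `build` and `suffix_tree` are made up for illustration, and the text is assumed not to contain $.

```python
import os


def build(strings, n):
    """Return {edge_label: child}; a child is a nested dict or an int leaf index.

    As in the Scheme version, a leaf holds n minus the length of the remaining
    string, which works out to the start position of the suffix.
    """
    groups = {}
    for s in strings:
        groups.setdefault(s[0], []).append(s)  # strings sharing a first character
    node = {}
    for group in groups.values():
        if len(group) == 1:
            s = group[0]
            node[s] = n - len(s)               # whole remainder becomes one edge
        else:
            label = os.path.commonprefix(group)
            rest = [s[len(label):] for s in group]
            node[label] = build(rest, n - len(label))
    return node


def suffix_tree(t):
    text = t + "$"  # the terminator guarantees no remainder is ever empty
    return build([text[i:] for i in range(len(text))], len(text))


tree = suffix_tree("ACACTACT")
print(tree["AC"]["T"])  # {'ACT$': 2, '$': 5}
```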
Analysis
• Building suffix tree: O(n^2)
• Searching for p: O(m+k)
– Where p appears k times
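One way to see where a quadratic bound can come from: assume each suffix is inserted one character at a time from the root, and that following or creating an edge takes constant time. The suffixes of a text of length n have lengths 1 through n, so the total work is bounded by the sum of their lengths:

```latex
\sum_{i=1}^{n} i \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
```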
Improvement possible
• Suffix tree for a text of length n can be built in time O(n)
• Thereafter all searches are O(m)
Applications in Biology
• Suffix Trees in Computational Biology
• Link doesn’t work