String Sorts, Tries, Substring Search: KMP, BM, RK

Upload: benjamin-walsh

Post on 06-Jan-2018

ACKNOWLEDGEMENTS: http://algs4.cs.princeton.edu

Lecture 16: Strings
String Sorts, Tries, Substring Search: KMP, BM, RK

String Sorts: key-indexed counting, LSD radix sort, MSD radix sort

Review: sorting algorithms
Lower bound. ~N lg N compares are required by any compare-based algorithm.

Key-indexed counting assumptions
Assumption. Keys are integers between 0 and R - 1.
Implication. Can use the key as an array index.
Applications. Sort strings by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm.

Key-indexed counting demo
Goal. Sort an array a[] of N integers between 0 and R - 1.
Count frequencies of each letter using the key as index.
Compute frequency cumulates, which specify destinations.
Access cumulates using the key as index to move items.
Copy back into the original array.

Key-indexed counting analysis
Proposition. Key-indexed counting takes time proportional to N + R.
Proposition. Key-indexed counting uses extra space proportional to N + R.
Stable? Yes.
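
The four steps of the demo (count, cumulate, move, copy back) can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the function name `key_indexed_counting`, the `key` callback, and the sample roster are assumptions introduced here.

```python
def key_indexed_counting(items, key, R):
    """Stably sort items in place; key(item) must be an integer in 0..R-1."""
    count = [0] * (R + 1)
    # Step 1: count frequencies, offset by one so that the cumulates
    # computed below become starting positions.
    for item in items:
        count[key(item) + 1] += 1
    # Step 2: cumulates. count[r] = number of items with key smaller than r.
    for r in range(R):
        count[r + 1] += count[r]
    # Step 3: move each item to its destination in an auxiliary array.
    aux = [None] * len(items)
    for item in items:
        k = key(item)
        aux[count[k]] = item
        count[k] += 1
    # Step 4: copy back into the original array.
    items[:] = aux
    return items


# Hypothetical class roster: (name, section), sorted by section.
roster = [("Anderson", 2), ("Brown", 3), ("Davis", 3),
          ("Garcia", 4), ("Harris", 1), ("Jackson", 3)]
key_indexed_counting(roster, key=lambda student: student[1], R=5)
```

Items with equal keys keep their original relative order (Brown, Davis, Jackson in section 3), which is the stability property the radix sorts rely on.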

Least-significant-digit-first string sort
LSD string (radix) sort. Consider characters from right to left. Stably sort using the dth character as the key (using key-indexed counting).

LSD string sort: correctness proof
Proposition. LSD sorts fixed-length strings in ascending order.
Pf. [by induction on i] After pass i, strings are sorted by their last i characters. If two strings differ on the sort key, key-indexed sort puts them in proper relative order. If two strings agree on the sort key, stability keeps them in proper relative order.
Proposition. LSD sort is stable.
Pf. Key-indexed counting is stable.
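
The LSD procedure above, one stable key-indexed counting pass per character position from right to left, might look like this in Python. It is a sketch under stated assumptions: all strings have the same length `W`, every character code is below `R` (256 here, i.e. extended ASCII), and the name `lsd_sort` and the sample words are illustrative only.

```python
def lsd_sort(strings, W, R=256):
    """Sort a list of strings, all of length W, in place using LSD radix sort."""
    n = len(strings)
    for d in range(W - 1, -1, -1):          # characters from right to left
        count = [0] * (R + 1)
        for s in strings:                   # count frequencies of dth character
            count[ord(s[d]) + 1] += 1
        for r in range(R):                  # cumulates give destinations
            count[r + 1] += count[r]
        aux = [None] * n
        for s in strings:                   # stable move into aux
            c = ord(s[d])
            aux[count[c]] = s
            count[c] += 1
        strings[:] = aux                    # copy back
    return strings


words = ["dab", "add", "cab", "fad", "fee", "bad",
         "dad", "bee", "fed", "bed", "ebb", "ace"]
lsd_sort(words, W=3)
```

Each pass costs time proportional to N + R, so the whole sort takes time proportional to W(N + R).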

Reverse LSD
Consider characters from left to right. Stably sort using the dth character as the key (using key-indexed counting).

Most-significant-digit-first string sort
MSD string (radix) sort. Partition the array into R pieces according to the first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate the subarrays to sort).

Variable-length strings
Treat strings as if they had an extra char at the end (smaller than any char).
C strings. Have an extra char \0 at the end => no extra work needed.

MSD string sort problem
Observation 1. Much too slow for small subarrays. Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N = 2. Unicode (65536 counts): 32000x slower for N = 2.
Observation 2. Huge number of small subarrays because of recursion.
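
A possible Python rendering of MSD string sort with the end-of-string convention described above is given below. The names `msd_sort` and `char_at` are assumptions for illustration; the value -1 stands in for the extra character that is smaller than any real character, and character codes are assumed to be below `R`. The sketch does nothing about the small-subarray overhead from Observation 1: every recursive call still allocates its own `count` array.

```python
def msd_sort(strings, R=256):
    """Sort a list of variable-length strings in place using MSD radix sort."""
    aux = [None] * len(strings)

    def char_at(s, d):
        # -1 marks "past the end", smaller than any real character code.
        return ord(s[d]) if d < len(s) else -1

    def sort(lo, hi, d):
        # Sort strings[lo..hi], which all agree on their first d characters.
        if hi <= lo:
            return
        count = [0] * (R + 2)               # one count array per call
        for i in range(lo, hi + 1):
            count[char_at(strings[i], d) + 2] += 1
        for r in range(R + 1):
            count[r + 1] += count[r]
        for i in range(lo, hi + 1):
            c = char_at(strings[i], d) + 1
            aux[count[c]] = strings[i]
            count[c] += 1
        for i in range(lo, hi + 1):
            strings[i] = aux[i - lo]
        # After the move, count[r]..count[r+1]-1 delineates the strings whose
        # dth character has code r; strings that ended at d need no more work.
        for r in range(R):
            sort(lo + count[r], lo + count[r + 1] - 1, d + 1)

    sort(0, len(strings) - 1, 0)
    return strings


words = ["she", "sells", "seashells", "by", "the", "sea", "shore",
         "the", "shells", "she", "sells", "are", "surely", "seashells"]
msd_sort(words)
```

Because a string that has ended sorts before any longer string with the same prefix, "sea" comes out ahead of "seashells".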

Tries
Retrieve. DFA simulation.
Trie. Radix Tree. Suffix Trie (Tree).

Trie
Alphabet.
Words: {i, a, in, at, an, inn, int, ate, age, adv, ant}
Letters on the path correspond to a prefix of the word.
A common prefix corresponds to a common ancestor.
A leaf node corresponds to a longest prefix.

Construct
Node: Is Leaf? (Is End?) Edge.
Operations: Find word. Add word.

Find word
Edge of next letter? Exists: jump to the next node. Does not exist: return false.
At the last letter: is this node the end of a word?

Add word
Edge of next letter does not exist: add a new node and jump to it.
At the last letter: mark the node.

Example
Find "ant". Find "and". Add "and".

Other versions
Child-Brother Tree (binary tree). Double Array Trie. Ternary search tries.
How to save edges? Array. List. BST.

Analysis
Time complexity. Add: length of the string. Find: length of the string.
Space complexity. Total length of the strings.

Radix Tree
An internal node has at least two children.
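
The find and add operations on a plain trie, as outlined earlier, can be sketched as below. The class names `TrieNode` and `Trie` are hypothetical, and edges are stored in a Python dict, which is an assumption of this sketch (the text lists array, list, and BST as the options).

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # edges: letter -> child node
        self.is_end = False    # marks the end of a stored word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            # Edge missing: add a new node, then jump to it.
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True     # mark the last node

    def find(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:   # no edge for the next letter
                return False
        return node.is_end     # a path alone is only a prefix


trie = Trie()
for w in ["i", "a", "in", "at", "an", "inn", "int", "ate", "age", "adv", "ant"]:
    trie.add(w)

found_ant = trie.find("ant")          # stored word
found_and_before = trie.find("and")   # path stops at "an": no edge for "d"
trie.add("and")
found_and_after = trie.find("and")
```

Both operations visit one node per letter, which matches the "length of the string" time bound in the analysis.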

Leaf Node =2 oSoS TSST (4). S1S2 S1#S2$#$(#)

Suffix Tree: construct
Time complexity of construction. A trie: O(n^2). Esko Ukkonen's algorithm. Nodes [...]

[...] = 0; --i)
    if (suff[i] == i + 1)
        for (; j < m [...] i; ++j)
            if (bmGs[j] == m)
                bmGs[j] = m [...] i;
for (i = 0; i [...]

[...] =m) Output: s --> T[s+1, s+2, .. s+m] (s [...]
[...] 1: Hash will help.

Detailed Procedure (for convenience, numbers only)
Optimization (Dynamic Programming): 10*(31415 - 3*10000) + 2 (mod 13)

Pseudocode
RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^(m-1) mod q
    p = 0
    t[0] = 0
    for i = 1 to m
        p = (d*p + P[i]) mod q
        t[0] = (d*t[0] + T[i]) mod q
    for s = 0 to n-m
        if p == t[s]
            if P[1..m] == T[s+1..s+m]
                print "Find P with shift: " + s
        if s < n-m
            t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q

Space Complexity: O(n-m)
Time Complexity: O(m*(n-m))

Analysis
When q increases:
Pros. Reduces the chance of conflicts. Decreases the time spent on confirmation.
Cons. (May) increase the space requirement. (May) increase the time of the mod operation.
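
For comparison with the pseudocode, here is a hedged Python sketch of the same matcher. It differs in two labeled ways: indices are 0-based, and only the hash of the current window is kept instead of the whole t[] array, so the extra space is constant. The function name `rabin_karp` and the defaults `d=256`, `q=101` are assumptions made for this example. The last lines redo the numeric update from the text with d = 10 and q = 13.

```python
def rabin_karp(text, pattern, d=256, q=101):
    """Return all shifts s with text[s:s+m] == pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                 # d^(m-1) mod q
    p = t = 0
    for i in range(m):                   # hash of pattern and first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        # Equal hashes may be a conflict, so confirm character by character.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop the leading character, shift, append the next character.
            # Python's % keeps the result non-negative for positive q.
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


matches = rabin_karp("abracadabra", "abra")

# Rolling update on digits: window 31415 followed by the digit 2.
next_window = 10 * (31415 - 3 * 10000) + 2
next_hash = (10 * (31415 % 13 - 3 * (10000 % 13)) + 2) % 13
```

A larger q makes hash conflicts rarer, which means fewer confirmations by direct comparison, in line with the pros listed in the analysis.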