Source: Luc Devroye, luc.devroye.org/henrimertens-stringology.pdf
DATA STRUCTURES FOR STRINGS, COMPRESSION
(Notes by Gauri Mertens)
DATA STRUCTURES FOR STRINGS
Overview:
For collections of strings, words:
- Tries, PATRICIA trees, digital search trees
For texts, files:
- Suffix tries, suffix trees, suffix arrays
TRIES
From "retrieval" (Fredkin, 1960).
Data: $n$ strings from an alphabet $A$, e.g.,
$A = \{0, 1\}$ (binary)
$A = \{0, 1, \ldots, 9\}$ (decimal)
$A = \{A, C, G, T\}$ (DNA)
$A = \{a, b, c, \ldots, z, 0, 1, \ldots, 9, \ldots\}$ (text)
Every string is infinite (otherwise pad it).
If $|A| = k$, then a trie is a $k$-ary position tree.
Each string corresponds to a path .
For binary strings, a 0 indicates "go left" and a 1 indicates "go right".
[Figure: the path from the root for one binary string.]
ORDINARY TRIE
Cut each string minimally so that every string ends up
in its own leaf.
Example data: $x_1 = (0, 0, 0, \ldots)$, $x_2 = (0, 0, 1, 0, \ldots)$, ..., $x_6 = (1, 1, 1, 1, 1, \ldots)$
[Figure: the ordinary trie for $x_1, \ldots, x_6$.]
# Leaves = $n$ = # Strings.
Leaves point to the strings.
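The construction above can be sketched in Python. This is only an illustrative sketch, not code from the notes: it assumes distinct strings of equal length (a finite stand-in for the padded infinite strings), and the names `build_trie`, `leaves`, `lookup` and the six sample strings are made up for the example.

```python
def build_trie(strings, depth=0):
    """Ordinary trie: a leaf is the string itself, an internal node is a
    dict mapping the next symbol to a subtrie.
    Assumption: strings are distinct and of equal length."""
    if len(strings) == 1:
        return strings[0]              # cut here: the string is alone
    groups = {}
    for s in strings:
        groups.setdefault(s[depth], []).append(s)
    return {sym: build_trie(group, depth + 1)
            for sym, group in sorted(groups.items())}


def leaves(trie):
    """List the strings stored at the leaves, left to right."""
    if isinstance(trie, str):
        return [trie]
    found = []
    for child in trie.values():
        found.extend(leaves(child))
    return found


def lookup(trie, s):
    """Follow the path spelled by s; succeed if it ends in a leaf holding s."""
    node, depth = trie, 0
    while isinstance(node, dict):
        node = node.get(s[depth])
        depth += 1
    return node == s


# Hypothetical sample data (not the strings from the figure).
sample = ["00011", "00101", "01100", "10010", "11000", "11111"]
trie = build_trie(sample)
print(len(leaves(trie)))       # number of leaves equals number of strings
print(lookup(trie, "01100"), lookup(trie, "01000"))
```

Each string is identified as soon as it is alone in its subtree, so the leaf for `"01100"` sits at depth 2 even though the string has five symbols.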
Remark: For finite strings that can be prefixes of
other strings, one can store all strings in the
trie,
and mark the ends of strings.
[Figure: a trie in which the nodes where strings end are marked.]
Remark: Large alphabets.
Instead of having cells
or arrays of pointers to children
[Figure: a node with an array of $k$ child pointers.]
one can use linked lists of children (de la Briandais).
[Figure: a node whose children form a linked list.]
This saves space!
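A minimal sketch of the linked-list representation, assuming finite words with marked ends. The class and function names (`ListTrieNode`, `find_or_add_child`, `insert`, `contains`) are hypothetical. Each node keeps one pointer to its first child and one to its next sibling, so a node costs two pointers no matter how large the alphabet is.

```python
class ListTrieNode:
    """Trie node whose children form a singly linked list."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.child = None      # first child
        self.sibling = None    # next child of the same parent
        self.is_end = False    # marks the end of a stored word


def find_or_add_child(parent, symbol):
    """Scan the child list of parent; append a new node if symbol is absent."""
    prev, node = None, parent.child
    while node is not None:
        if node.symbol == symbol:
            return node
        prev, node = node, node.sibling
    new = ListTrieNode(symbol)
    if prev is None:
        parent.child = new
    else:
        prev.sibling = new
    return new


def insert(root, word):
    node = root
    for ch in word:
        node = find_or_add_child(node, ch)
    node.is_end = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.child
        while node is not None and node.symbol != ch:
            node = node.sibling
        if node is None:
            return False
    return node.is_end


root = ListTrieNode(None)
for w in ["tree", "trie", "try", "to"]:
    insert(root, w)
print(contains(root, "trie"), contains(root, "tr"))
```

The price for the saved space is search time: finding a child means scanning a list, which can take up to $k$ steps per node instead of one array lookup.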
Remark: To control the tree size,
one can collapse all unmarked one-child nodes,
as in the PATRICIA tree (Morrison, 1968).
[Figure: a chain of one-child nodes collapsed into a single edge.]
Associate a substring with each child.
In a PATRICIA tree, the # of nodes is $\le 2n - 1$.
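Proof. Let $N_i$ = # nodes of degree $i$ (number of children), so # Leaves = $N_0 = n$.
# Nodes = $N_0 + N_1 + N_2 + \cdots$, with $N_1 = 0$ (by collapse).
# Edges = # Nodes $- 1 = 2N_2 + 3N_3 + 4N_4 + \cdots \ge 2(N_2 + N_3 + N_4 + \cdots) = 2(\#\text{Nodes} - n)$.
$\Rightarrow$ # Nodes $\le 2n - 1$. $\square$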
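The bound can be checked on small inputs with a sketch of a collapsed trie. This is an illustration under assumptions, not the construction from the notes: strings are distinct and of equal length, there are no marked internal nodes, and `first_difference`, `build_patricia`, `count_nodes` and the sample strings are hypothetical names and data. Each node is a pair `(edge_label, children)`, where `edge_label` is the substring associated with the node.

```python
def first_difference(strings, start):
    """Smallest index >= start at which the strings do not all agree.
    Assumption: strings are distinct and of equal length."""
    i = start
    while len({s[i] for s in strings}) == 1:
        i += 1
    return i


def build_patricia(strings, start=0):
    """Return a node (edge_label, children); a leaf has no children."""
    if len(strings) == 1:
        return (strings[0][start:], [])
    # Skipping the positions where all strings agree is the collapse step.
    split = first_difference(strings, start)
    groups = {}
    for s in strings:
        groups.setdefault(s[split], []).append(s)
    children = [build_patricia(group, split)
                for _, group in sorted(groups.items())]
    return (strings[0][start:split], children)


def count_nodes(node):
    _, children = node
    return 1 + sum(count_nodes(child) for child in children)


binary = ["00011", "00101", "01100", "10010", "11000", "11111"]
dna = ["ACGT", "ACGA", "TTTT"]
for data in (binary, dna):
    nodes = count_nodes(build_patricia(data))
    print(nodes, "<=", 2 * len(data) - 1)
```

Every internal node built this way has at least two children, which is exactly the condition $N_1 = 0$ used in the proof.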
Note: We assumed that $N_1 = 0$ (no nodes with exactly one child).
However, in the presence of marked nodes,
or when the root has only one child,
$N_1 > 0$.
We still have # Nodes $\le 2n$ in that case. (Exercise)
DIGITAL SEARCH TREE (DST)
Add the strings one by one and associate each string with one node,
namely,
the first free node on its path.
Example. [Figure: the trie and the DST for the same strings $x_1, \ldots, x_6$.]
Size of DST = $n$.
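A minimal sketch of DST insertion and search, assuming distinct strings of equal length. The names `DSTNode`, `dst_insert`, `dst_contains`, `dst_size` and the sample strings are hypothetical. The first string occupies the root; every later string walks down along its own symbols and takes the first empty slot it meets.

```python
class DSTNode:
    def __init__(self, string):
        self.string = string   # every node stores exactly one string
        self.children = {}     # symbol -> DSTNode


def dst_insert(root, s):
    """Insert s and return the (possibly new) root.
    Assumption: strings are distinct and of equal length."""
    if root is None:
        return DSTNode(s)
    node, depth = root, 0
    while True:
        symbol = s[depth]
        if symbol not in node.children:
            node.children[symbol] = DSTNode(s)   # first free node on the path
            return root
        node = node.children[symbol]
        depth += 1


def dst_contains(root, s):
    node, depth = root, 0
    while node is not None:
        if node.string == s:
            return True
        node = node.children.get(s[depth])
        depth += 1
    return False


def dst_size(node):
    if node is None:
        return 0
    return 1 + sum(dst_size(child) for child in node.children.values())


sample = ["00011", "00101", "01100", "10010", "11000", "11111"]
dst = None
for s in sample:
    dst = dst_insert(dst, s)
print(dst_size(dst))   # one node per string
```

Unlike the trie, the shape of a DST depends on the insertion order, but its size is always the number of strings.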
THE SUFFIX TRIE:
A trie for all suffixes in a text:
Text $T[1..n]$
Its suffixes: $x_1 = T[1..n]$, $x_2 = T[2..n]$, ..., $x_n = T[n..n]$
SUFFIX TREE: Collapse all unary nodes as in
a PATRICIA tree;
edges refer to substrings in the text, e.g., by a pair of positions $[i, j]$,
and leaves (and some internal nodes) point to places in the text.
Text = 0 1 0 0 0 0 1 0 (positions 1 2 3 4 5 6 7 8)
[Figure: the suffix tree of this text. Edge labels are stored as position ranges that point back to the text, e.g., 00010 = [4,8] and 010 = [6,8].]
Storage = $O(n)$. (Exercise)
Exercise: Find a binary suffix trie of size $\Omega(n^2)$.
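Note: Searching for a string $P = p_1 p_2 \cdots p_k$ involves following the path $p_1 p_2 \cdots p_k$ from the root;
the occurrences of $P$ are given by all marked nodes in the subtree below the end of that path.
[Figure: the path from the root and the subtree hanging below it.]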
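The search just described can be sketched on an uncompressed suffix trie. This is an illustrative sketch only: the dictionary-based node layout and the names `build_suffix_trie` and `occurrences` are assumptions, and the trie is built naively in quadratic time and space. The node where suffix $T[i..n]$ ends is marked with its starting position $i$ (1-based, as in the notes).

```python
def new_node():
    return {"children": {}, "start": None}


def build_suffix_trie(text):
    """Insert every suffix; mark the node where suffix i ends with i."""
    root = new_node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, new_node())
        node["start"] = i + 1          # 1-based starting position
    return root


def occurrences(root, pattern):
    """Follow the path spelled by pattern, then collect all marked
    nodes in the subtree below it."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        node = stack.pop()
        if node["start"] is not None:
            found.append(node["start"])
        stack.extend(node["children"].values())
    return sorted(found)


trie = build_suffix_trie("01000010")
print(occurrences(trie, "00"))    # [3, 4, 5]
print(occurrences(trie, "010"))   # [1, 6]
```

Following the path costs time proportional to the pattern length; the remaining work is proportional to the size of the subtree that is traversed.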
SUFFIX ARRAY
An array of sorted suffixes.
Example: Text = HELLO
Sorted suffixes: ELLO, HELLO, LLO, LO, O
Search for a substring proceeds by binary search.
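A minimal sketch of a suffix array with binary search, using 0-based positions. The function names `suffix_array` and `find` are hypothetical, and the construction sorts the suffixes naively, which is fine for short texts but not how large suffix arrays are built in practice.

```python
def suffix_array(text):
    """Starting positions (0-based) of the suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return one starting position of pattern in text, or -1 if absent."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare the pattern with the first len(pattern) symbols of the suffix.
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # sa[lo] is now the first suffix that is not smaller than the pattern.
    if lo < len(sa) and text.startswith(pattern, sa[lo]):
        return sa[lo]
    return -1


text = "HELLO"
sa = suffix_array(text)
print(sa)                       # [1, 0, 2, 3, 4]: ELLO, HELLO, LLO, LO, O
print(find(text, sa, "LL"))     # 2
print(find(text, sa, "X"))      # -1
```

The search makes about $\log_2 n$ comparisons, each of which looks at no more than the length of the pattern.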