21st International Unicode Conference, Dublin, Ireland, May 2002
Folded Trie: Efficient Data Structure for All of Unicode
Vladimir Weinstein
vweinste@us.ibm.com
Globalization Center of Competency, San Jose, CA
-
Introduction
A lot of data for each code point
Need appropriate data structures
Unicode version 3.1 introduced code points into supplementary space: addressable range grew to more than a million
Repetitive data
Sparsely populated range, especially the supplementary space
-
Data Structures
Arrays
Advantages: very fast access time, fast write time
Disadvantage: unacceptable memory consumption
Hash tables
Advantages: easy to use, reasonably fast, general
Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
-
Data Structures (continued)
Inversion Maps
Advantages: simple, very compact, fast boolean operations
Disadvantages: worse access time than arrays and possibly hash tables
For more details see Bits of Unicode at http://www.macchiato.com/slides/Bits_of_Unicode.ppt
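A minimal sketch of an inversion map lookup, assuming the usual representation: a sorted list of range start code points with one value per range. The names STARTS, VALUES and lookup, and the sample values, are made up for this illustration and do not come from the slides.

```python
from bisect import bisect_right

# Hypothetical inversion map: STARTS[i] is the first code point of range i,
# and VALUES[i] applies up to (but not including) STARTS[i + 1].
STARTS = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
VALUES = ["other", "upper", "other", "lower", "other"]

def lookup(cp):
    """Return the value of the range that contains code point cp."""
    # bisect_right counts the range starts <= cp; the last of those
    # is the start of the range that cp falls into.
    return VALUES[bisect_right(STARTS, cp) - 1]
```

The lookup is a binary search over the range starts, which is why access is slower than a direct array read even though the table itself stays very small.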
-
Tries
A trie is a structure with one or more indexes and one data storage.
Name comes from Information Retrieval
Shares repetitive data
Good compaction
Not appropriate for frequently changing data
-
Single-Index Trie
A trie structure with an index array and a data array.
Advantages
Excellent size
Very good access performance (two array accesses, shift, mask and addition)
Disadvantages
Not appropriate for frequently changing data
Index array gets too big when dealing with supplementary code points
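The lookup described above (two array accesses, a shift, a mask and an addition) can be sketched as follows. This is an illustration only, not ICU code: the block size of 32 and the names build_single_index_trie and lookup are assumptions chosen for the example.

```python
BLOCK_SHIFT = 5                    # assumed LOWER_WIDTH: low bits select within a block
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1        # plays the role of LOWER_MASK

def build_single_index_trie(values, limit=0x10000, default=0):
    """Build (index, data) from a {code point: value} dict.

    Identical data blocks are stored once and shared by several index entries.
    """
    index, data, seen = [], [], {}
    for start in range(0, limit, BLOCK_SIZE):
        block = tuple(values.get(start + i, default) for i in range(BLOCK_SIZE))
        if block not in seen:      # first occurrence of this block: store it
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])  # index entry = offset of the block in data
    return index, data

def lookup(index, data, cp):
    """Two array accesses, one shift, one mask, one addition."""
    return data[index[cp >> BLOCK_SHIFT] + (cp & BLOCK_MASK)]
```

With mostly default values, nearly all index entries point at the same all-default block, which is where the size advantage comes from. The index itself still needs one entry per block of the covered range, so it grows when the range is extended to the supplementary code points.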
-
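Single-Index Trie Diagram
[Diagram: a BMP code point (bits 15..0) is split into an upper part (UPPER_WIDTH bits) and a lower part (LOWER_WIDTH bits, extracted with LOWER_MASK). The upper part selects an entry in the Index array, which points to a block in the Data array; the lower part selects the data within that block.]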
-
Double-Index Trie
Two index arrays and a data block
Compared to single-index trie:
1. Provides better compression of the index array
2. Worse performance, but still very fast
3. Feasible for supplementary code points
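A rough sketch of a double-index trie covering the whole code point range. The 11/5/5 bit split and all function names are assumptions made for this example; the slides do not state the widths.

```python
LOWER_WIDTH = 5                    # assumed width of the lower part
MIDDLE_WIDTH = 5                   # assumed width of the middle part
LOWER_MASK = (1 << LOWER_WIDTH) - 1
MIDDLE_MASK = (1 << MIDDLE_WIDTH) - 1
LIMIT = 0x110000                   # all of Unicode

def _share(block, store, seen):
    """Append block to store unless an identical block exists; return its offset."""
    if block not in seen:
        seen[block] = len(store)
        store.extend(block)
    return seen[block]

def build_double_index_trie(values, default=0):
    """Build (index1, index2, data) from a {code point: value} dict."""
    index1, index2, data = [], [], []
    seen2, seen_data = {}, {}
    data_len = 1 << LOWER_WIDTH
    span = 1 << (LOWER_WIDTH + MIDDLE_WIDTH)   # code points per index1 entry
    for top in range(0, LIMIT, span):
        block2 = []
        for start in range(top, top + span, data_len):
            block = tuple(values.get(start + i, default) for i in range(data_len))
            block2.append(_share(block, data, seen_data))
        # Identical index2 blocks are shared as well: this compresses the index.
        index1.append(_share(tuple(block2), index2, seen2))
    return index1, index2, data

def lookup(index1, index2, data, cp):
    """Three array accesses instead of two."""
    i2 = index1[cp >> (LOWER_WIDTH + MIDDLE_WIDTH)] + ((cp >> LOWER_WIDTH) & MIDDLE_MASK)
    return data[index2[i2] + (cp & LOWER_MASK)]
```

The extra level costs one more array access per lookup, but long empty stretches of the supplementary range collapse into a single shared second-level block.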
-
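Double-Index Trie Diagram
[Diagram: a code point (bits 20..0) is split into upper, middle and lower parts (UPPER_WIDTH, MIDDLE_WIDTH, LOWER_WIDTH; extracted with MIDDLE_MASK and LOWER_MASK). The upper part selects an entry in Index 1, which points to a block in Index 2; the middle part selects an entry in that block, which points to a block in Data; the lower part selects the data within that block.]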
-
Folded Trie
Fast access for BMP code points
Slower access for supplementary code points, but these are far less frequent
Compacts supplementary index
Needs additional build time processing
Fast addressing with UTF-16 code units: no need to construct the code point
-
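Folded Trie: Supplementary Access Diagram
[Diagram: the lead surrogate (110110.., bits 15..0) is looked up in the folded trie (index + data), which yields the lead surrogate data. If there is no data for the surrogate block, the data is the same for the whole surrogate block. Otherwise the lead surrogate data is combined with the low bits (9..0) of the trail surrogate (110111..) into a pseudo code point, which is looked up in the folded trie again to obtain the final data.]
BMP code points access same as with single-index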
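The supplementary access path can be sketched as follows. This is a simplified illustration under stated assumptions, not the UTrie implementation: the block size of 32, the default value 0, the function names, and the choice to store the folding offset directly as the lead surrogate's value are all assumptions made for the example.

```python
BLOCK_SHIFT = 5
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1
BMP_INDEX_LEN = 0x10000 >> BLOCK_SHIFT     # index entries that cover the BMP
BLOCKS_PER_LEAD = 0x400 >> BLOCK_SHIFT     # one lead surrogate covers 1024 code points

def build_folded_trie(values):
    """Build (index, data) from {code point: value}; 0 is the default value."""
    values = dict(values)
    # Folding: every lead surrogate with supplementary data gets its own run
    # of index entries right above the BMP index. The offset of that run is
    # stored as the lead surrogate's value (assumption: lead surrogates carry
    # no other data). Leads without data keep the default value 0.
    leads = sorted({0xD800 + ((cp - 0x10000) >> 10)
                    for cp, v in values.items() if cp >= 0x10000 and v != 0})
    for n, lead in enumerate(leads):
        values[lead] = BMP_INDEX_LEN + n * BLOCKS_PER_LEAD

    index, data, seen = [], [], {}

    def add_block(start):
        block = tuple(values.get(start + i, 0) for i in range(BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])

    for start in range(0, 0x10000, BLOCK_SIZE):        # BMP part of the index
        add_block(start)
    for lead in leads:                                 # folded supplementary part
        first = 0x10000 + ((lead - 0xD800) << 10)
        for start in range(first, first + 0x400, BLOCK_SIZE):
            add_block(start)
    return index, data

def lookup_unit(index, data, unit):
    """BMP access: same as with a single-index trie."""
    return data[index[unit >> BLOCK_SHIFT] + (unit & BLOCK_MASK)]

def lookup_pair(index, data, lead, trail):
    """Supplementary access straight from the two UTF-16 code units."""
    offset = lookup_unit(index, data, lead)
    if offset == 0:
        return 0                   # no data for this whole surrogate block
    i = offset + ((trail & 0x3FF) >> BLOCK_SHIFT)
    return data[index[i] + (trail & BLOCK_MASK)]
```

Note that lookup_pair never assembles the supplementary code point: it works on the lead and trail code units directly, and lead surrogates without data cost only a single BMP-style lookup.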
-
ICU Implementation: UTrie
ICU implementation is called UTrie
Stores either 16-bit or 32-bit wide data (extensible in the future)
Up to 256K different data elements
Can be frozen and reused as a memory-mapped image for fast startup
Using UTrie requires custom code
More about ICU at the end of presentation
-
Range Enumeration
Allows enumerating over a set of contiguous maximal ranges of same data elements
Elements can be preprocessed by additional callback
Saves time when processing the whole Unicode range by efficiently walking the trie structure
[Diagram: a run of identical values (Element 2) between Element 1 and Element 3, with the positions start-1, start, limit-1 and limit marked]
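One possible way to walk a single-index trie and report maximal ranges, skipping repeated blocks without reading them again. The function name enum_ranges and the tiny hand-built sample trie are hypothetical; the ICU enumeration API itself is not shown here.

```python
def enum_ranges(index, data, shift):
    """Yield (start, limit, value) for each maximal run of equal values.

    limit is exclusive. Assumes data values are never None.
    """
    block_size = 1 << shift
    start, prev_value = 0, None
    uniform_block = None      # offset of a block known to hold only prev_value
    cp = 0
    for block in index:
        if block == uniform_block:
            cp += block_size  # same block again: nothing can change, skip it
            continue
        uniform_block = block
        for j in range(block_size):
            value = data[block + j]
            if value != prev_value:
                if cp > start:
                    yield (start, cp, prev_value)
                if j > 0:
                    uniform_block = None   # value changed inside the block
                start, prev_value = cp, value
            cp += 1
    if cp > start:
        yield (start, cp, prev_value)

# Hypothetical trie with 4 code points per block, covering code points 0..19.
SAMPLE_DATA = [0, 0, 0, 0, 1, 1, 2, 2]
SAMPLE_INDEX = [0, 0, 4, 0, 0]
```

The optional preprocessing callback mentioned above would be applied to each value before the comparison; it is left out to keep the sketch short.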
-
Latin-1 Fast Path
Build time option
Allows direct array access for the Latin-1 range (0x00-0xFF)
Latin-1 range is not compressed if this option is used
Appropriate when access for the Latin-1 range is critical, e.g. collation
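A sketch of how such a fast path might look, assuming the uncompressed Latin-1 values sit at the very start of the data array. That placement, the block size and the function names are assumptions for illustration; the slides do not say where UTrie keeps this data.

```python
BLOCK_SHIFT = 5
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1
LATIN1_LIMIT = 0x100

def build_trie_with_latin1(values, limit=0x10000, default=0):
    """Build (index, data) with the Latin-1 range stored linearly, uncompressed."""
    # Latin-1 values go first and are never merged with other blocks,
    # so data[cp] is valid for every cp below 0x100.
    data = [values.get(cp, default) for cp in range(LATIN1_LIMIT)]
    index = list(range(0, LATIN1_LIMIT, BLOCK_SIZE))
    seen = {}
    for start in range(LATIN1_LIMIT, limit, BLOCK_SIZE):
        block = tuple(values.get(start + i, default) for i in range(BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, cp):
    if cp < LATIN1_LIMIT:
        return data[cp]            # fast path: a single array access
    return data[index[cp >> BLOCK_SHIFT] + (cp & BLOCK_MASK)]
```

The trade-off is the one stated above: the 256 Latin-1 entries are stored in full even when they repeat, in exchange for skipping the index step on the most frequent characters.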
-
Example: Normalization Data
Normalization data is stored using UTries
For example, main data has the following format (bit positions as labeled on the slide, 31 down to 0):
Bits 31..16: Extra data index
Bits 15..8: Combining class
Bit 7: BCK (combines back)
Bit 6: FWD (combines forward)
Bits 5..4: QC_MAYBE
Bits 3..0: QC_NO
Extra data index can be either:
- index to variable-length data
- first part of supplementary lookup value
- special handling indicator (Hangul, Jamo)
QC_MAYBE and QC_NO are values for normalization quick check
Variable-length data contains composition and decomposition info
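Unpacking such a 32-bit word could look like the sketch below. The field boundaries are read off the bit labels on the slide (31, 15, 7, 6, 5, 3, 0) and should be treated as an assumption; the constant and function names are made up for this example.

```python
# Assumed field layout, inferred from the bit labels on the slide.
QC_NO_MASK = 0x0F          # bits 3..0
QC_MAYBE_MASK = 0x30       # bits 5..4
COMBINES_FWD = 0x40        # bit 6
COMBINES_BACK = 0x80       # bit 7
CC_SHIFT = 8               # combining class in bits 15..8
EXTRA_SHIFT = 16           # extra data index in bits 31..16

def unpack_norm_word(word):
    """Split a 32-bit normalization data word into its fields."""
    return {
        "extra": (word >> EXTRA_SHIFT) & 0xFFFF,
        "combining_class": (word >> CC_SHIFT) & 0xFF,
        "combines_back": bool(word & COMBINES_BACK),
        "combines_forward": bool(word & COMBINES_FWD),
        "qc_maybe": (word & QC_MAYBE_MASK) >> 4,
        "qc_no": word & QC_NO_MASK,
    }
```

Packing everything into one word means a single trie lookup answers the quick check, the combining class and the location of the variable-length data at once.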
-
Example: Character Properties Data
The result of UTrie lookup is an index
Double indexing allows for even better compression, since many code points have the same property value
UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (few hundred unique data words).
[Diagram: Folded Trie (index + 16-bit data) pointing into Property data (32 bits)]
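The double indexing can be sketched like this: the trie stores a small slot number per code point, and the slot selects one of the unique 32-bit property words. The names and the block size are assumptions, and the sketch covers only the BMP with a plain single-index trie to stay short.

```python
BLOCK_SHIFT = 5
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1

def build_props(values, limit=0x10000):
    """values: {code point: 32-bit property word}. Returns (index, data, props)."""
    props, slot = [0], {0: 0}      # unique property words; slot 0 is the default
    index, data, seen = [], [], {}
    for start in range(0, limit, BLOCK_SIZE):
        block = []
        for cp in range(start, start + BLOCK_SIZE):
            word = values.get(cp, 0)
            if word not in slot:   # first time this property word is seen
                slot[word] = len(props)
                props.append(word)
            block.append(slot[word])   # the trie stores the slot, not the word
        block = tuple(block)
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data, props

def get_props(index, data, props, cp):
    """Trie lookup yields a slot; the slot selects the 32-bit property word."""
    return props[data[index[cp >> BLOCK_SHIFT] + (cp & BLOCK_MASK)]]
```

Because there are only a few hundred distinct property words, every slot fits comfortably in 16 bits, so the many trie data entries are half the width of the property words they refer to.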
-
International Components for Unicode
International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support
Several library services use the common UTrie implementation
Wide variety of supported platforms
Open source (X license, non-viral)
C/C++ and Java versions
http://oss.software.ibm.com/icu/
-
Conclusion
UTrie data structure provides good compression with fast access
The main constraint for usage is the nature of the data that needs to be stored
Designed for repetitive and sparse data
-
Q & A
-
Folding and Surrogate Access
Folding process compacts the index for supplementaries and moves it right above the BMP index
Access in ICU4C:
Define a C callback, invoked when special lead surrogate is detected
Manually detect special lead surrogates
In ICU4J, provide a subclass with a method that detects special lead surrogates
-
Summary
Introduction: Storing Unicode data
Types of data structures
Tries
Single-index trie
Double-index trie
Folded trie
Usage of folded trie in normalization
Usage of folded trie for character properties