8/8/2019 Bits of Unicode
1/36
Bits of Unicode
Data structures for a large character set
Mark Davis
IBM Emerging Technologies
-
Caution
"Character" is ambiguous; it can mean any of:
Graphemes: x̣ (also ch, ...)
Code points: 0078 0323
Code units: 0078 0323 (or UTF-8: 78 CC A3)
For programmers:
Unicode associates codepoints (or sequences of codepoints) with properties
See UTR#17
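A small Java sketch of the distinction above, counting code points, UTF-16 code units and UTF-8 bytes for the same text. The class and method names are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    /** Number of UTF-16 code units (what String.length() reports). */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes (8-bit code units) in the UTF-8 encoding. */
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String xDotBelow = "x\u0323";                             // one grapheme, two code points
        String deseret = new String(Character.toChars(0x10400));  // one supplementary code point
        for (String s : new String[] {xDotBelow, deseret}) {
            System.out.println(codePoints(s) + " code points, "
                    + utf16Units(s) + " UTF-16 units, "
                    + utf8Units(s) + " UTF-8 bytes");
        }
    }
}
```

For x + U+0323 the three counts are 2, 2 and 3 (the bytes 78 CC A3); for a supplementary code point such as U+10400 they are 1, 2 and 4.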
-
The Problem
Programs often have to do lookups
Look up properties by codepoint
Map codepoints to values
Test codepoints for inclusion in a set
e.g. value == true/false
Easy with 256 codepoints: just use an array
-
Size Matters
Not so easy with Unicode!
Unicode 3.0 subset (except PUA): up to FFFF (hex) = 65,535 (decimal)
Unicode 3.1 full range: up to 10FFFF (hex) = 1,114,111 (decimal)
-
Array Lookup
With ASCII
Simple
Fast
Compact
codepoint → bit: 32 bytes
codepoint → short: ½ K
With Unicode
Simple
Fast
Huge (esp. v3.1)
codepoint → bit: 136 K
codepoint → short: 2.2 M
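The sizes above follow from simple arithmetic, sketched here in Java. The class and method names are illustrative. The sketch assumes a 256-entry table for the small case and 0x110000 entries (0 through 10FFFF) for full Unicode, with K = 1024 bytes.

```java
class TableSizes {
    /** Bytes needed for a flat table with one bit per codepoint. */
    static long bitTableBytes(long codepoints) {
        return (codepoints + 7) / 8;
    }

    /** Bytes needed for a flat table with one 16-bit short per codepoint. */
    static long shortTableBytes(long codepoints) {
        return 2 * codepoints;
    }

    public static void main(String[] args) {
        long[] sizes = {256, 0x110000};
        for (long n : sizes) {
            System.out.println(n + " codepoints: "
                    + bitTableBytes(n) + " bytes as bits, "
                    + shortTableBytes(n) + " bytes as shorts");
        }
    }
}
```

With 0x110000 codepoints the bit table is 139,264 bytes (136 K) and the short table is 2,228,224 bytes, roughly 2.2 million.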
-
Further complications
Mappings, tests, properties often must be for sequences of codepoints.
Human languages don't just use single codepoints.
"ch" in Spanish, Slovak; etc.
-
First step: Avoidance
Properties from libraries often suffice
Test for (Character.getType(c) == Character.DECIMAL_DIGIT_NUMBER), i.e. category Nd, instead of a long list of codepoints
Easier
Automatically updated with new versions
Data structures from libraries often suffice
Java Hashtable
ICU (Java or C++) CompactArray
JavaScript properties
Consult http://www.unicode.org
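A minimal Java sketch of the "use the library property" advice. The class and method names are illustrative; Character.getType and Character.DECIMAL_DIGIT_NUMBER (general category Nd) are standard JDK API.

```java
class DigitCheck {
    /** True if the codepoint has general category Nd (decimal digit number). */
    static boolean isDecimalDigit(int codepoint) {
        return Character.getType(codepoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int[] samples = {'7', 0x0663, 'x', 0x00B2};
        for (int cp : samples) {
            System.out.println(Integer.toHexString(cp) + " " + isDecimalDigit(cp));
        }
    }
}
```

The test covers digits in every script the runtime knows about (for example U+0663 ARABIC-INDIC DIGIT THREE) without a hand-maintained list, and excludes lookalikes such as U+00B2 SUPERSCRIPT TWO, whose category is No.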
-
Data structures: criteria
Speed
Read (static)
Write (dynamic)
Startup
Memory footprint
RAM
Disk
Multi-threading
-
Hashtables
Advantages
Easy to use out-of-the-box
Reasonably fast
General
Disadvantages
High overhead
Discrete (no range lookup)
Much slower than array lookup
-
Overhead
[Figure: memory layout of one hashtable entry. Labels in the figure: hash, key, value, next, char1, char2, with several blocks marked "overhead".]
-
Trie
Advantages
Nearly as fast as array lookup
Much smaller than arrays or Hashtables
Take advantage of repetition
Disadvantages
Not suited for rapidly changing data
Best for static, preformed data
-
Trie structure
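[Figure: the codepoint is split into a high part M1 and a low part M2. M1 selects an entry in Index, which points to a block in Data; M2 selects the value within that block.]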
-
Trie code
5 Operations
Shift, Lookup, Mask, Add, Lookup
v = data[index[c >> S1] + (c & M2)]
[Figure: codepoint bits split into M1 (the high bits, obtained by shifting right by S1) and M2 (the low bits)]
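A minimal sketch of a single-index trie in Java, built by storing each distinct block of values once. It is an illustration under assumptions, not ICU's CompactArray: the class name, the block size (S1 = 7) and the builder are all hypothetical. Only the lookup line follows the formula on the slide.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class SimpleTrie {
    static final int S1 = 7;               // assumed: number of low bits per block
    static final int M2 = (1 << S1) - 1;   // mask for the low bits
    static final int BLOCK = 1 << S1;

    final int[] index;   // for each block of codepoints, its start offset in data
    final byte[] data;   // the distinct blocks, concatenated

    /** Builds from a flat table whose length is a multiple of BLOCK. */
    SimpleTrie(byte[] full) {
        index = new int[full.length >> S1];
        Map<String, Integer> offsetOfBlock = new HashMap<>();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int b = 0; b < index.length; b++) {
            // ISO-8859-1 maps each byte to one char, so the string is a faithful key.
            String key = new String(full, b << S1, BLOCK, StandardCharsets.ISO_8859_1);
            Integer offset = offsetOfBlock.get(key);
            if (offset == null) {          // first time this block content is seen
                offset = out.size();
                out.write(full, b << S1, BLOCK);
                offsetOfBlock.put(key, offset);
            }
            index[b] = offset;
        }
        data = out.toByteArray();
    }

    /** Shift, lookup, mask, add, lookup. */
    byte get(int c) {
        return data[index[c >> S1] + (c & M2)];
    }

    /** Demo table: the JDK's general category for every codepoint. */
    static SimpleTrie ofGeneralCategory() {
        byte[] full = new byte[0x110000];
        for (int c = 0; c < full.length; c++) {
            full[c] = (byte) Character.getType(c);
        }
        return new SimpleTrie(full);
    }

    public static void main(String[] args) {
        SimpleTrie trie = ofGeneralCategory();
        System.out.println("flat: " + 0x110000 + " bytes, trie data: " + trie.data.length
                + " bytes, index entries: " + trie.index.length);
    }
}
```

Repetition is what makes this pay off: long runs of unassigned or same-category codepoints collapse into a single shared block.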
-
Trie: double indexed
Double, for more compaction:
Slightly slower than single index
Smaller chunks of data, so more compaction
-
Trie: double indexed
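[Figure: the codepoint is split into three parts M1, M2, M3. M1 selects an entry in Index1; that entry plus M2 selects an entry in Index2; that entry plus M3 selects the value in Data.]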
-
Trie code: double indexed
b1 = index1[ c >> S1 ]
b2 = index2[ b1 + ((c >> S2) & M2)]
v = data[ b2 + (c & M3) ]
[Figure: codepoint bits split into M1, M2, M3, with the shift amounts S1 and S2 marking the boundaries]
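The three lookup lines above can be exercised with a small two-stage builder, sketched here in Java. The builder, the class name and the shift values (S1 = 10, S2 = 5) are assumptions chosen for illustration; the same block-deduplication step is applied first to the data and then to the resulting index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageTrie {
    static final int S1 = 10, S2 = 5;                 // assumed shift amounts
    static final int M2 = (1 << (S1 - S2)) - 1;       // mask for the middle bits
    static final int M3 = (1 << S2) - 1;              // mask for the low bits

    final int[] index1, index2, data;

    /** Builds from a flat table whose length is a multiple of 1 << S1. */
    TwoStageTrie(int[] full) {
        List<Integer> dataOut = new ArrayList<>();
        int[] flatIndex = dedupe(full, 1 << S2, dataOut);          // data blocks
        List<Integer> index2Out = new ArrayList<>();
        index1 = dedupe(flatIndex, 1 << (S1 - S2), index2Out);     // index blocks
        index2 = toArray(index2Out);
        data = toArray(dataOut);
    }

    int get(int c) {
        int b1 = index1[c >> S1];
        int b2 = index2[b1 + ((c >> S2) & M2)];
        return data[b2 + (c & M3)];
    }

    /** Stores each distinct block of src once in out; returns each block's offset in out. */
    static int[] dedupe(int[] src, int blockLen, List<Integer> out) {
        int[] offsets = new int[src.length / blockLen];
        Map<List<Integer>, Integer> seen = new HashMap<>();
        for (int b = 0; b < offsets.length; b++) {
            List<Integer> block = new ArrayList<>(blockLen);
            for (int i = 0; i < blockLen; i++) {
                block.add(src[b * blockLen + i]);
            }
            Integer offset = seen.get(block);
            if (offset == null) {
                offset = out.size();
                out.addAll(block);
                seen.put(block, offset);
            }
            offsets[b] = offset;
        }
        return offsets;
    }

    static int[] toArray(List<Integer> list) {
        int[] a = new int[list.size()];
        for (int i = 0; i < a.length; i++) {
            a[i] = list.get(i);
        }
        return a;
    }

    public static void main(String[] args) {
        int[] full = new int[0x110000];
        for (int c = 0; c < full.length; c++) {
            full[c] = Character.getType(c);
        }
        TwoStageTrie t = new TwoStageTrie(full);
        System.out.println("entries: " + t.index1.length + " + " + t.index2.length
                + " + " + t.data.length + " vs flat " + full.length);
    }
}
```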
-
Inversion List
Compaction of set of codepoints
Advantages
Simple
Very compact
Faster write than trie
Very fast boolean operations
Disadvantages
Slower read than trie or hashtable
-
Inversion List Structure
Structure
Index (optional)
List of codepoints in ascending order
Example Set
[ 0020-0061, 0135, 19A3-201B ]
Index  Codepoint  Meaning
0:     0020       in
1:     0062       out
2:     0135       in
3:     0136       out
4:     19A3       in
5:     201C       out
-
Inversion List Example
Find smallest i such that c < data[i]
If no i, i = length
Then c ∈ List ⇔ odd(i)
Examples:
In: 0023, 0135
Out: 001A, 0136, A357
Index  Codepoint  Meaning
0:     0020       in
1:     0062       out
2:     0135       in
3:     0136       out
4:     19A3       in
5:     201C       out
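The lookup rule on this slide can be sketched in Java as follows. The class name and the plain binary search are illustrative choices, not a particular library's implementation; the data is the example set [0020-0061, 0135, 19A3-201B].

```java
class InversionList {
    final int[] data;   // ascending codepoints at which membership flips

    InversionList(int... data) {
        this.data = data;
    }

    /** Smallest i such that c < data[i], or data.length if there is none. */
    int indexOf(int c) {
        int lo = 0, hi = data.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < data[mid]) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    /** c is in the set exactly when the index is odd. */
    boolean contains(int c) {
        return (indexOf(c) & 1) == 1;
    }

    public static void main(String[] args) {
        InversionList set = new InversionList(0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C);
        int[] samples = {0x0023, 0x0135, 0x001A, 0x0136, 0xA357};
        for (int c : samples) {
            System.out.println(Integer.toHexString(c) + " " + set.contains(c));
        }
    }
}
```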
-
Inversion List Operations
Fast Boolean Operations
Example: Negation
Original list:
0: 0020
1: 0062
2: 0135
3: 0136
4: 19A3
5: 201C
Negated list (0000 inserted at the front, so every index shifts by one):
0: 0000
1: 0020
2: 0062
3: 0135
4: 0136
5: 19A3
6: 201C
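Negation only has to toggle a leading 0000, which the following Java sketch shows. The class and method names are illustrative.

```java
import java.util.Arrays;

class InversionListOps {
    /** Complement of the set: drop a leading 0 if present, otherwise prepend one. */
    static int[] negate(int[] list) {
        if (list.length > 0 && list[0] == 0) {
            return Arrays.copyOfRange(list, 1, list.length);
        }
        int[] out = new int[list.length + 1];
        System.arraycopy(list, 0, out, 1, list.length);   // out[0] stays 0
        return out;
    }

    public static void main(String[] args) {
        int[] set = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};
        for (int boundary : negate(set)) {
            System.out.println(String.format("%04X", boundary));
        }
    }
}
```

Every boundary keeps its value; only its parity changes, so "in" and "out" swap everywhere.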
-
Inversion List: Binary Search
from Programming Pearls
Completely unrolled, precalculated parameters
int index = startIndex;
if (x >= data[auxStart]) {
    index += auxStart;
}
switch (power) {
    // No break statements: each case falls through to the next, smaller step.
    case 21: if (x < data[t = index-0x10000])
        index = t;
    case 20: if (x < data[t = index-0x8000])
        index = t;
    // ...
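For comparison, here is a loop-based Java sketch of a binary search that moves in power-of-two steps. It is not the unrolled code from the slide: it steps upward instead of downward, needs no precalculated parameters, and its names are hypothetical. It returns the same thing an inversion list lookup needs, the smallest i such that x < data[i].

```java
class PowerOfTwoSearch {
    /** Smallest i such that x < data[i], or data.length if there is none. */
    static int find(int[] data, int x) {
        int i = -1;   // invariant: every element at or before i is <= x
        // Try steps of decreasing powers of two, largest first.
        for (int step = Integer.highestOneBit(data.length); step > 0; step >>= 1) {
            if (i + step < data.length && data[i + step] <= x) {
                i += step;
            }
        }
        return i + 1;
    }

    public static void main(String[] args) {
        int[] data = {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C};
        int[] samples = {0x001A, 0x0023, 0x0135, 0x0136, 0xA357};
        for (int x : samples) {
            System.out.println(Integer.toHexString(x) + " -> " + find(data, x));
        }
    }
}
```

Because the sequence of step sizes depends only on the array length, a fixed-size table lets the loop be unrolled into a fall-through switch, which is the idea behind the slide's version.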
-
Inversion Map
Inversion List plus Associated Values
Lookup index just as in Inversion List
Take corresponding value
Index  Codepoint
0:     0020
1:     0062
2:     0135
3:     0136
4:     19A3
5:     201C
Index  Value
0:     0
1:     5
2:     3
3:     9
4:     8
5:     3
6:     0
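A minimal Java sketch of an inversion map: the index is found exactly as in an inversion list, then used to pick a value. The class name is illustrative, and the values array is one reading of the slide's figure (seven values for six boundaries), so treat the numbers as sample data.

```java
import java.util.Arrays;

class InversionMap {
    final int[] starts;   // ascending boundaries, as in an inversion list
    final int[] values;   // one more entry than starts: values[i] covers the i-th gap

    InversionMap(int[] starts, int[] values) {
        if (values.length != starts.length + 1) {
            throw new IllegalArgumentException("need starts.length + 1 values");
        }
        this.starts = starts;
        this.values = values;
    }

    int get(int c) {
        int k = Arrays.binarySearch(starts, c);
        // Found at k: c is the first codepoint of range k + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        int i = (k >= 0) ? k + 1 : -k - 1;
        return values[i];
    }

    public static void main(String[] args) {
        InversionMap map = new InversionMap(
                new int[] {0x0020, 0x0062, 0x0135, 0x0136, 0x19A3, 0x201C},
                new int[] {0, 5, 3, 9, 8, 3, 0});   // sample values (assumed)
        System.out.println(map.get(0x0023));
    }
}
```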
-
Key → String Value
Problem
Often almost all values are 1 codepoint
But, must map to strings in a few cases
Don't want overhead for strings always
Solution
Exception values indicate extra processing
Can use same solution for UTF-16 code units
-
Example
Get a character ch
Find its value v
If v is in [D800..E000], may be string
check v2 = valueException[v - D800]
if v2 not null, process it, continue
Process v
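The steps above can be sketched in Java with a toy uppercase table. Everything here is illustrative: the class name, the 256-entry table, and the single exception (U+00DF mapping to "SS"). The sketch treats D800 up to but not including E000 as the exception range, which is an assumption about the slide's bracket notation.

```java
class ExceptionValueMap {
    static final int EXC_START = 0xD800;   // surrogate values never occur as real results
    static final int EXC_LIMIT = 0xE000;   // assumed exclusive upper bound

    final char[] value = new char[256];          // toy table: first 256 codepoints only
    final String[] valueException = {"SS"};      // slot 0: the string for U+00DF

    ExceptionValueMap() {
        for (char c = 0; c < 256; c++) {
            value[c] = Character.toUpperCase(c);     // the common case: one codepoint
        }
        value[0x00DF] = (char) (EXC_START + 0);      // the rare case: points to slot 0
    }

    /** Maps ch (which must be below 256 in this toy) to its uppercase form. */
    String map(char ch) {
        char v = value[ch];
        if (v >= EXC_START && v < EXC_LIMIT) {
            int slot = v - EXC_START;
            String v2 = slot < valueException.length ? valueException[slot] : null;
            if (v2 != null) {
                return v2;                           // the string value
            }
        }
        return String.valueOf(v);                    // the ordinary single value
    }

    public static void main(String[] args) {
        ExceptionValueMap m = new ExceptionValueMap();
        System.out.println(m.map('a') + " " + m.map('\u00DF'));
    }
}
```

The table stays one char per entry; only the few exceptional keys pay for a string.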
-
String Key → Value
Problem
Often almost all keys are 1 codepoint
Must have string keys in a few cases
Don't want overhead for strings always
Solution
Exception values indicate possible follow-on codepoints
Can use same solution for UTF-16 code units
Use key closure!
-
Closure
If (X + Y) is a key, then X is a key
Before
s → x
sh → y
shch → z
c → w
After
s → x
sh → y
shc → yw
shch → z
c → w
-
Why Closure?
Input: s h c h a
s → x
sh → y
shc → yw
shch → z
shcha → not found, use last
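A Java sketch of why closure helps: with every prefix present as a key, a lookup can extend the match one character at a time and stop at the first miss. The class and method names are hypothetical. The value given to an added prefix (here shc → yw) is computed as the translation of that prefix on its own, which matches the slide's example but is an assumption about how such values are derived in general.

```java
import java.util.HashMap;
import java.util.Map;

class ClosureLookup {
    /** The slide's table before closure. */
    static Map<String, String> slideTable() {
        Map<String, String> t = new HashMap<>();
        t.put("s", "x");
        t.put("sh", "y");
        t.put("shch", "z");
        t.put("c", "w");
        return t;
    }

    /** Reference translation: tries every key length at each position, longest first. */
    static String translateSlow(String text, Map<String, String> table) {
        int maxLen = 1;
        for (String key : table.keySet()) {
            maxLen = Math.max(maxLen, key.length());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean found = false;
            for (int len = Math.min(maxLen, text.length() - i); len >= 1; len--) {
                String v = table.get(text.substring(i, i + len));
                if (v != null) {
                    out.append(v);
                    i += len;
                    found = true;
                    break;
                }
            }
            if (!found) {
                out.append(text.charAt(i++));   // unmapped characters pass through
            }
        }
        return out.toString();
    }

    /** Adds every missing proper prefix of every key (at least one character long). */
    static Map<String, String> close(Map<String, String> table) {
        Map<String, String> closed = new HashMap<>(table);
        for (String key : table.keySet()) {
            for (int n = 1; n < key.length(); n++) {
                String prefix = key.substring(0, n);
                if (!table.containsKey(prefix)) {
                    closed.put(prefix, translateSlow(prefix, table));
                }
            }
        }
        return closed;
    }

    /** Incremental translation: extend while the longer string is a key, else use last. */
    static String translate(String text, Map<String, String> closed) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1;
            String last = closed.get(text.substring(i, end));
            if (last == null) {
                out.append(text.charAt(i++));
                continue;
            }
            while (end < text.length()) {
                String next = closed.get(text.substring(i, end + 1));
                if (next == null) {
                    break;              // not found, use last
                }
                last = next;
                end++;
            }
            out.append(last);
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> closed = close(slideTable());
        System.out.println(translate("shcha", closed));
    }
}
```

Without the added key shc, the incremental scan of "shcha" would stop after sh and never reach shch.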
-
Bitpacking
Squeeze information into value
Example: Character Properties
category: 5 bits
bidi: 4 bits (+ exceptions)
canonical category: 6 bits + expansion
compressCanon = (bits >> SHIFT) & MASK;
canon = expansionArray[compressCanon];
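A Java sketch of packing the three fields into one int and unpacking them again. The field order, shift values and the contents of the expansion table are assumptions for illustration; only the bit widths (5, 4, 6) and the shift-mask-expand pattern come from the slide.

```java
class PackedProps {
    static final int CATEGORY_BITS = 5, BIDI_BITS = 4, CANON_BITS = 6;
    static final int BIDI_SHIFT = CATEGORY_BITS;                  // assumed layout
    static final int CANON_SHIFT = CATEGORY_BITS + BIDI_BITS;
    static final int CATEGORY_MASK = (1 << CATEGORY_BITS) - 1;
    static final int BIDI_MASK = (1 << BIDI_BITS) - 1;
    static final int CANON_MASK = (1 << CANON_BITS) - 1;

    // Hypothetical expansion table: compressed 6-bit index to full canonical class value.
    static final int[] EXPANSION = {0, 1, 7, 8, 9, 202, 220, 230};

    static int pack(int category, int bidi, int compressedCanon) {
        return (category & CATEGORY_MASK)
                | ((bidi & BIDI_MASK) << BIDI_SHIFT)
                | ((compressedCanon & CANON_MASK) << CANON_SHIFT);
    }

    static int category(int bits) {
        return bits & CATEGORY_MASK;
    }

    static int bidi(int bits) {
        return (bits >> BIDI_SHIFT) & BIDI_MASK;
    }

    /** Shift and mask to get the compressed index, then expand it via the table. */
    static int canon(int bits) {
        int compressCanon = (bits >> CANON_SHIFT) & CANON_MASK;
        return EXPANSION[compressCanon];
    }

    public static void main(String[] args) {
        int bits = pack(6, 13, 7);
        System.out.println(category(bits) + " " + bidi(bits) + " " + canon(bits));
    }
}
```

All three fields fit in 15 bits, so a trie of 16-bit values can carry them in a single lookup.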
-
Statetables
Classic:
do {
    // ch is the next input character
    entry = stateTable[state][ch];
    state = entry.state;
    doSomethingWith(entry.action);
} while (state >= 0);
-
Statetables
Unicode:
do {
    // ch is the next input character
    type = trie[ch];    // map the character to a small type number first
    entry = stateTable[state][type];
    state = entry.state;
    doSomethingWith(entry.action);
} while (state >= 0);
Also, String Key → Value
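A runnable Java sketch of the Unicode variant: characters are first reduced to a handful of types, so the state table needs one column per type instead of one per codepoint. The table, the three types and the token rule (a letter followed by letters or digits) are made up for illustration. The typeOf method stands in for the trie lookup, and the only action is "consume the character", so entries hold just the next state.

```java
class StateTableDemo {
    static final int OTHER = 0, LETTER = 1, DIGIT = 2;   // character types (columns)
    static final int STOP = -1;                          // a negative state ends the scan

    // Rows are states, columns are character types.
    static final int[][] NEXT = {
        /* state 0: start      */ {STOP, 1, STOP},
        /* state 1: in a token */ {STOP, 1, 1},
    };

    /** Stand-in for type = trie[ch]. */
    static int typeOf(int codepoint) {
        if (Character.isLetter(codepoint)) {
            return LETTER;
        }
        if (Character.getType(codepoint) == Character.DECIMAL_DIGIT_NUMBER) {
            return DIGIT;
        }
        return OTHER;
    }

    /** Length in UTF-16 code units of the token starting at start, or 0 if none. */
    static int tokenLength(String s, int start) {
        int state = 0;
        int i = start;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            state = NEXT[state][typeOf(cp)];
            if (state < 0) {
                break;
            }
            i += Character.charCount(cp);   // action: consume the character
        }
        return i - start;
    }

    public static void main(String[] args) {
        System.out.println(tokenLength("abc123 x", 0));
    }
}
```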
-
Sample Data Structures: ICU
Trie: CompactArray
Customized for each datatype
Automatic expansion
Compact after setting
Character Properties
use CompactArray, Bitpacking
Inversion List: UnicodeSet
Boolean Operations
-
Sample Usage #1: ICU
Collation
Trie lookup
Expanding character: Key → String Value
Contracting character: String Key → Value
Break Iterators
For grapheme, word, line, sentence break
Statetable
-
Sample Usage #2: ICU
Transliteration
Requires
Mapping codepoints in context to others
Rearranging codepoints
Controlling the choice of mapping
Character Properties
Inversion List
Exception values
-
Sample Usage #3: ICU
Character Conversion
From Unicode to bytes
Trie
From bytes to Unicode
Arrays for simple maps
Statetables for complex maps
recognizes valid / invalid mappings
provides compaction
Complications
Invalid vs. Valid mapped vs. Valid unmapped
Fallbacks
-
References
Unicode Open Source ICU
http://oss.software.ibm.com/icu
ICU4j: Java API
ICU4c: C and C++ APIs
Other references: see Mark's website: http://www.macchiato.com
-
Q & A