Folded Trie: Efficient Data Structure for All of Unicode

  • 8/3/2019 Folded Trie: Efficient Data Structure for All of Unicode

    1/21

    21st International Unicode Conference Dublin, Ireland, May 2002 1

    Folded Trie: Efficient Data Structure for All of Unicode

    Vladimir Weinstein

    [email protected]

    Globalization Center of Competency, San Jose, CA

    Introduction

    A lot of data for each code point

    Need appropriate data structures

    Unicode version 3.1 introduced code points into the supplementary space; the addressable range grew to more than a million code points

    Repetitive data

    Sparsely populated range, especially the supplementary space

    Data Structures

    Arrays

        Advantages: very fast access time, fast write time

        Disadvantage: unacceptable memory consumption

    Hash tables

        Advantages: easy to use, reasonably fast, general

        Disadvantages: high overhead, complicated sequential access, slower than array lookup, data within ranges is not shared
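
To put the array's memory cost in numbers, here is a small worked calculation. The 32-bit entry width is an assumption chosen for illustration; the slides do not state one.

```python
# One flat array entry per code point in U+0000..U+10FFFF.
CODE_POINTS = 0x110000
BYTES_PER_ENTRY = 4  # assumed 32-bit value per code point

flat_array_bytes = CODE_POINTS * BYTES_PER_ENTRY
flat_array_mib = flat_array_bytes / 2**20
```

Under that assumption a single property table takes 4,456,448 bytes (4.25 MiB), and most of it repeats the same default value.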

    Data Structures (continued)

    Inversion Maps

    Advantages: simple, very compact, fast boolean operations

    Disadvantages: worse access time than arrays and possibly hash tables

    For more details see Bits of Unicode at http://www.macchiato.com/slides/Bits_of_Unicode.ppt
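
An inversion list stores only the code points at which a value changes, so a lookup becomes a binary search. The sketch below shows the boolean (set membership) case; the `contains` helper and the example set are hypothetical names made up for illustration.

```python
from bisect import bisect_right

# Inversion list for the set {A-Z, a-z}: each entry is a code point where
# membership flips. Even positions start a range, odd positions end it.
LETTERS = [0x41, 0x5B, 0x61, 0x7B]

def contains(inversion_list, code_point):
    """Return True if code_point is in the set described by inversion_list."""
    # bisect_right counts the boundaries <= code_point; an odd count means
    # the code point lies inside a range.
    return bisect_right(inversion_list, code_point) % 2 == 1
```

The lookup is logarithmic in the number of ranges, which is where the access time falls behind a plain array.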

    Tries

    A trie is a structure with one or more indexes and one data storage.

    Name comes from information retrieval ("reTRIEval")

    Shares repetitive data

    Good compaction

    Not appropriate for frequently changing data

    Single-Index Trie

    A trie structure with an index array and a data array.

    Advantages

        Excellent size

        Very good access performance (two array accesses, shift, mask and addition)

    Disadvantages

        Not appropriate for frequently changing data

        Index array gets too big when dealing with supplementary code points

    Single-Index Trie Diagram

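    [Diagram: a BMP code point (bits 15-0) is split into an upper part (UPPER_WIDTH bits) and a lower part (LOWER_WIDTH bits, selected with LOWER_MASK). The upper part selects an entry in the index; that entry points to a block in the data array, and the lower part selects the element within the block.]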
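
A single-index trie can be sketched as follows. The 5-bit `LOWER_WIDTH` and the function names are assumptions for illustration, not values taken from the slides or from ICU; the builder simply stores each distinct block once.

```python
LOWER_WIDTH = 5                  # assumed: bits of the code point used inside a block
BLOCK_SIZE = 1 << LOWER_WIDTH
LOWER_MASK = BLOCK_SIZE - 1

def build_single_index_trie(values):
    """Build (index, data) from a flat list with one value per code point.

    len(values) must be a multiple of BLOCK_SIZE. Identical blocks are
    stored once in the data array and shared through the index.
    """
    index, data, seen = [], [], {}
    for start in range(0, len(values), BLOCK_SIZE):
        block = tuple(values[start:start + BLOCK_SIZE])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, code_point):
    # Two array accesses, a shift, a mask and an addition.
    return data[index[code_point >> LOWER_WIDTH] + (code_point & LOWER_MASK)]

# Example: mark A-Z with 1, everything else in the BMP with 0.
values = [0] * 0x10000
for c in range(0x41, 0x5B):
    values[c] = 1
index, data = build_single_index_trie(values)
```

In this example the 65,536 values collapse to 2,048 index entries plus two data blocks, because every all-zero block is shared.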

    Double-Index Trie

    Two index arrays and a data block

    Compared to single-index trie:

    1. Provides better compression of the index array

    2. Worse performance, but still very fast

    3. Feasible for supplementary code points

    Double-Index Trie Diagram

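    [Diagram: a code point (bits 20-0) is split into upper, middle and lower parts (UPPER_WIDTH, MIDDLE_WIDTH and LOWER_WIDTH bits; MIDDLE_MASK and LOWER_MASK select the middle and lower parts). The upper part selects an entry in Index 1, which points to a block in Index 2; the middle part selects an entry in that block, which points to a block in Data; the lower part selects the element within the data block.]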
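
The double-index lookup can be sketched by applying the same block-sharing step twice: once to the data, and once to the resulting index. The widths (4 and 6 bits) and all function names below are assumptions chosen for the example.

```python
LOWER_WIDTH = 4                  # assumed width of the lower part
MIDDLE_WIDTH = 6                 # assumed width of the middle part
LOWER_MASK = (1 << LOWER_WIDTH) - 1
MIDDLE_MASK = (1 << MIDDLE_WIDTH) - 1

def _compact(items, block_size):
    """Store each distinct block of items once.

    Returns (offsets, storage), where offsets[i] is the start of block i
    inside storage.
    """
    offsets, storage, seen = [], [], {}
    for start in range(0, len(items), block_size):
        block = tuple(items[start:start + block_size])
        if block not in seen:
            seen[block] = len(storage)
            storage.extend(block)
        offsets.append(seen[block])
    return offsets, storage

def build_double_index_trie(values):
    # First level of sharing: data blocks. Second level: blocks of index entries.
    index2_flat, data = _compact(values, 1 << LOWER_WIDTH)
    index1, index2 = _compact(index2_flat, 1 << MIDDLE_WIDTH)
    return index1, index2, data

def lookup2(index1, index2, data, c):
    # Three array accesses instead of two.
    i2 = index1[c >> (LOWER_WIDTH + MIDDLE_WIDTH)] + ((c >> LOWER_WIDTH) & MIDDLE_MASK)
    return data[index2[i2] + (c & LOWER_MASK)]

# Example: two non-default values across the whole code point range.
values = [0] * 0x110000
values[0x41] = 1
values[0x1D400] = 7
index1, index2, data = build_double_index_trie(values)
```

With this sparse input the three arrays together hold well under two thousand entries, against more than a million in the flat list.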

    Folded Trie

    Fast access for BMP code points

    Slower access for supplementary code points, but these are far less frequent

    Compacts supplementary index

    Needs additional build time processing

    Fast addressing with UTF-16 code units: no need to construct the code point

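    Folded Trie Supplementary Access Diagram

    [Diagram, steps 1-6: the lead surrogate (110110.., bits 15-0) is looked up in the folded trie (index + data) to obtain the lead surrogate data. If the trie has no data for that surrogate block, the data is the same for the whole surrogate block. Otherwise the lead surrogate data is combined with the low bits (9-0) of the trail surrogate (110111..) into a pseudo code point, which is looked up in the folded trie to obtain the final data.]

    BMP code points access same as with single-index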
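
The supplementary access path can be sketched in a few lines. This is a simplified illustration, not ICU's actual UTrie layout: the 5-bit block shift, the function names, and the convention that a lead surrogate's trie value is directly the index offset of its folded block (0 meaning "no data") are all assumptions made for the sketch.

```python
SHIFT = 5                                # assumed block shift
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
BMP_INDEX_LENGTH = 0x10000 >> SHIFT      # index entries covering 16-bit code units
LEAD_INDEX_ENTRIES = 0x400 >> SHIFT      # index entries per lead surrogate (1024 code points)

def build_folded_trie(values, default=0):
    """values: sparse dict of code point -> value. Returns (index, data)."""
    # Lead surrogates whose 1024 supplementary code points carry real data.
    leads = sorted({0xD800 + ((c - 0x10000) >> 10)
                    for c, v in values.items() if c >= 0x10000 and v != default})
    # Folded index blocks are placed right above the BMP index.
    fold_offset = {lead: BMP_INDEX_LENGTH + k * LEAD_INDEX_ENTRIES
                   for k, lead in enumerate(leads)}

    def unit_value(u):
        # A lead surrogate code unit stores its folding offset (0 = no data).
        if 0xD800 <= u <= 0xDBFF:
            return fold_offset.get(u, 0)
        return values.get(u, default)

    index, data, seen = [], [], {}

    def add_blocks(start, count, value_of):
        for b in range(start, start + count * BLOCK, BLOCK):
            block = tuple(value_of(b + i) for i in range(BLOCK))
            if block not in seen:
                seen[block] = len(data)
                data.extend(block)
            index.append(seen[block])

    add_blocks(0, BMP_INDEX_LENGTH, unit_value)
    for lead in leads:
        first = 0x10000 + ((lead - 0xD800) << 10)
        add_blocks(first, LEAD_INDEX_ENTRIES, lambda c: values.get(c, default))
    return index, data

def get_bmp(index, data, unit):
    # Same access as a single-index trie.
    return data[index[unit >> SHIFT] + (unit & MASK)]

def get_from_pair(index, data, lead, trail, default=0):
    offset = get_bmp(index, data, lead)  # look up the lead surrogate first
    if offset == 0:                      # no data for this surrogate block
        return default
    t = trail & 0x3FF                    # low 10 bits of the trail surrogate
    return data[index[offset + (t >> SHIFT)] + (t & MASK)]

# Example: one BMP value and two supplementary values.
index, data = build_folded_trie({0x41: 1, 0x1D400: 7, 0x10FFFF: 9})
```

U+1D400 is the pair D835 DC00, so `get_from_pair(index, data, 0xD835, 0xDC00)` reaches its value without ever assembling the code point. Only the two lead surrogates that have data add index entries above the BMP part.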

    ICU Implementation: UTrie

    ICU implementation is called UTrie

    Stores either 16-bit or 32-bit wide data (extensible in the future)

    Up to 256K different data elements

    Can be frozen and reused as a memory-mapped image for fast startup

    Using UTrie requires custom code

    More about ICU at the end of the presentation

    Range Enumeration

    Allows enumerating over a set of contiguous maximal ranges of same data elements

    Elements can be preprocessed by additional callback

    Saves time when processing the whole Unicode range by efficiently walking the trie structure

    [Diagram: a run of code points holding Element 2 from start to limit-1, preceded by Element 1 at start-1 and followed by Element 3 at limit.]
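
Range enumeration over a single-index trie might look like the sketch below. It is not ICU's enumeration API; `enum_ranges`, its parameters and the tiny hand-built trie are hypothetical. The walk skips a block entirely when it is the same shared, uniform block as the one just processed, which is where the time saving comes from.

```python
_UNSET = object()

def enum_ranges(index, data, shift, transform=lambda v: v):
    """Yield (start, limit, value) for maximal runs of equal values.

    transform is an optional callback applied to each stored element
    before values are compared.
    """
    block_size = 1 << shift
    start, current = 0, _UNSET
    prev_offset, prev_uniform = None, False
    for i, offset in enumerate(index):
        if offset == prev_offset and prev_uniform:
            continue  # same shared uniform block: the current run just continues
        base = i << shift
        block = data[offset:offset + block_size]
        for j, raw in enumerate(block):
            value = transform(raw)
            if value != current:
                if current is not _UNSET:
                    yield start, base + j, current
                start, current = base + j, value
        prev_offset = offset
        prev_uniform = len(set(block)) == 1
    if current is not _UNSET:
        yield start, len(index) << shift, current

# Tiny hand-built trie: 16 code points in blocks of 4; code points 9-11 hold 1.
INDEX = [0, 0, 4, 0]
DATA = [0, 0, 0, 0,  0, 1, 1, 1]
RANGES = list(enum_ranges(INDEX, DATA, shift=2))
```

Each yielded range is half-open, matching the start and limit labels in the diagram.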

    Latin-1 Fast Path

    Build time option

    Allows direct array access for the Latin-1 range (0x00-0xFF)

    Latin-1 range is not compressed if this option is used

    Appropriate when access for Latin-1 range is critical, e.g. for collation
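
One possible way to realize such a fast path is to keep the 256 Latin-1 values uncompressed at the start of the data array. The layout and names below are assumptions for illustration and are not necessarily how UTrie arranges its data.

```python
SHIFT = 5                    # assumed block shift
BLOCK = 1 << SHIFT
MASK = BLOCK - 1
LATIN1_LENGTH = 0x100

def build_with_latin1_linear(values):
    """Build (index, data) with the Latin-1 values stored linearly first."""
    data = list(values[:LATIN1_LENGTH])   # uncompressed, in code point order
    index, seen = [], {}
    for start in range(0, len(values), BLOCK):
        if start < LATIN1_LENGTH:
            index.append(start)           # Latin-1 blocks point into the linear part
            continue
        block = tuple(values[start:start + BLOCK])
        if block not in seen:
            seen[block] = len(data)
            data.extend(block)
        index.append(seen[block])
    return index, data

def get(index, data, c):
    if c < LATIN1_LENGTH:
        return data[c]                    # fast path: a single array access
    return data[index[c >> SHIFT] + (c & MASK)]

# Example input with a repeating pattern so that blocks above Latin-1 are shared.
values = [c % 3 for c in range(0x10000)]
index, data = build_with_latin1_linear(values)
```

The cost is that the Latin-1 blocks no longer take part in block sharing, which matches the note that the range is not compressed.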

    Example: Normalization Data

    Normalization data is stored using UTries

    For example, main data has the following format:

        Bits 31-16: Extra data index. Can be either:
            - index to variable-length data
            - first part of supplementary lookup value
            - special handling indicator (Hangul, Jamo)
        Bits 15-8: Combining class
        Bit 7: BCK (combines back)
        Bit 6: FWD (combines forward)
        Bits 5-4: QC_MAYBE (values for normalization quick check)
        Bits 3-0: QC_NO (values for normalization quick check)

    Variable-length data contains composition and decomposition info
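
Unpacking such a 32-bit word is a matter of shifts and masks. The field boundaries below are read off the bit positions shown on the slide (31, 15, 7, 6, 5, 3, 0); the function and key names are hypothetical.

```python
def unpack_norm32(word):
    """Split a 32-bit normalization data word into its fields."""
    return {
        "qc_no": word & 0xF,                      # bits 3-0
        "qc_maybe": (word >> 4) & 0x3,            # bits 5-4
        "combines_forward": bool(word & 0x40),    # bit 6 (FWD)
        "combines_back": bool(word & 0x80),       # bit 7 (BCK)
        "combining_class": (word >> 8) & 0xFF,    # bits 15-8
        "extra_index": (word >> 16) & 0xFFFF,     # bits 31-16
    }

# Example word: extra index 0x1234, combining class 230, combines back,
# one QC_MAYBE bit and one QC_NO bit set.
SAMPLE = (0x1234 << 16) | (230 << 8) | 0x80 | 0x10 | 0x4
FIELDS = unpack_norm32(SAMPLE)
```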

    Example: Character Properties Data

    The result of UTrie lookup is an index

    Double indexing allows for even better compression, since many code points have the same property value

    UTrie data width is 16 bits (thousands of data entries), while the property data width is 32 bits (a few hundred unique data words).

    [Diagram: the folded trie (index and data, 16 bits) yields an index into the property data (32 bits).]
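
The extra level of indirection can be sketched like this. A plain list stands in for the trie, and the function name and sample property words are made up for illustration.

```python
def build_property_lookup(props_by_code_point):
    """Replace each 32-bit property word with a small index into a table
    of unique words. Returns (indexes, table)."""
    table, position, indexes = [], {}, []
    for word in props_by_code_point:
        if word not in position:
            position[word] = len(table)
            table.append(word)
        indexes.append(position[word])
    return indexes, table

# Example: 160 code points sharing just two distinct property words.
props = [0x00010001] * 100 + [0x00020002] * 50 + [0x00010001] * 10
indexes, table = build_property_lookup(props)
```

The small indexes are what would be stored in the 16-bit trie; a lookup then reads `table[indexes[c]]` to recover the full 32-bit word.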

    International Components for Unicode

    International Components for Unicode (ICU) is a library that provides robust and full-featured Unicode support

    Several library services use the common UTrie implementation

    Wide variety of supported platforms

    Open source (X license, non-viral)

    C/C++ and Java versions

    http://oss.software.ibm.com/icu/

    Conclusion

    UTrie data structure provides good compression with fast access

    The main constraint for usage is the nature of the data that needs to be stored

    Designed for repetitive and sparse data

    Q & A

    Folding and Surrogate Access

    Folding process compacts the index for supplementaries and moves it right above the BMP index

    Access in ICU4C:

        Define a C callback, invoked when a special lead surrogate is detected

        Manually detect special lead surrogates

    In ICU4J, provide a subclass with a method that detects special lead surrogates

    Summary

    Introduction: Storing Unicode data

    Types of data structures

    Tries

    Single-index trie

    Double-index trie

    Folded trie

    Usage of folded trie in normalization

    Usage of folded trie for character properties