Camomile : A Unicode library for OCaml

Camomile : A Unicode library for OCaml Yoriyuki Yamagata National Institute of Advanced Science and Technology (AIST) ML Workshop, September 18, 2011

Page 1: Camomile : A Unicode library for OCaml

Camomile : A Unicode library for OCaml

Yoriyuki Yamagata

National Institute of Advanced Science and Technology (AIST)

ML Workshop, September 18, 2011

Page 2: Camomile : A Unicode library for OCaml

Outline

Overview

ASCII to Unicode : A challenge of multilingualization

Example : Unicode normal forms

ulib

Conclusion

Page 4: Camomile : A Unicode library for OCaml

Overview - functionality

Camomile - A Unicode library for OCaml

- Unicode character type
- UTF-8, UTF-16, UTF-32 strings
- Conversion to/from approx. 200 encodings
- Case mapping
- Collation (sort and search)

Page 11: Camomile : A Unicode library for OCaml

Overview - feature

- Only supports “logical” operations
- No support for rendering or formatting
- Written purely in OCaml
- Functors and lazy evaluation play crucial roles

Page 29: Camomile : A Unicode library for OCaml

ASCII to Unicode : challenge of multilingualization

Large number of characters

- code range 0x0 - 0x10ffff

Multiple representations of strings

- UTF-8, UTF-16 and UTF-32
- legacy encodings

Combining characters

- ä = a + ¨
- Nguyễn = Nguyê + ˜ + n = Nguye + ˆ + ˜ + n
- ậ = a + . + ˆ = a + ˆ + .

Diverse cultural conventions

- Case mapping OΣOΣ → oσoς (Greek)
- Sorting ... < H < CH < I < ... (Slovak)
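
As a rough illustration of "multiple representations of strings", the sketch below encodes a single code point in UTF-8, UTF-16 and UTF-32. It relies only on the OCaml standard library `Buffer` module (OCaml 4.08 or later is assumed), not on Camomile, and the helper names `encode`, `utf8`, `utf16be` and `utf32be` are made up for this example.

```ocaml
(* Encode one code point with a given Buffer encoder.
   Hypothetical helper; not part of Camomile. *)
let encode (add : Buffer.t -> Uchar.t -> unit) (cp : int) : string =
  let b = Buffer.create 4 in
  add b (Uchar.of_int cp);
  Buffer.contents b

(* The same code point in three encodings (big-endian byte order). *)
let utf8 = encode Buffer.add_utf_8_uchar
let utf16be = encode Buffer.add_utf_16be_uchar
let utf32be =
  encode (fun b u -> Buffer.add_int32_be b (Int32.of_int (Uchar.to_int u)))
```

For U+1EAD (ậ) the three results are 3, 2 and 4 bytes long; for U+10FFFF, the top of the code range, each is 4 bytes long.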

Page 31: Camomile : A Unicode library for OCaml

Unicode normal forms - what are they?

Unicode has multiple representations of the “same” string.

E.g. ậ = ạ + ˆ = a + . + ˆ = a + ˆ + . etc.

Normal forms give a unique representation.

There are 4 normal forms

1. NFD
2. NFC
3. NFKD
4. NFKC

We concentrate on NFD

Page 36: Camomile : A Unicode library for OCaml

Unicode normal form - NFD

1. Decompose characters as much as possible

   ậ ⇒ ạ + ˆ ⇒ a + . + ˆ

2. Do a stable sort on combining characters based on combining class

   a + . + ˆ ⇒ a + . + ˆ
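
The two steps above can be sketched on lists of code points. This is a toy model, not Camomile's implementation: the decomposition and combining-class tables below are hand-written for a handful of characters (a real library reads them from the Unicode Character Database), and the names `decomposition`, `combining_class`, `canonical_order` and `toy_nfd` are invented for this example.

```ocaml
(* Illustrative table: canonical decompositions of a few characters only. *)
let decomposition (cp : int) : int list option =
  match cp with
  | 0x00E4 -> Some [0x0061; 0x0308]  (* ä = a + diaeresis *)
  | 0x00EA -> Some [0x0065; 0x0302]  (* ê = e + circumflex *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* ạ = a + dot below *)
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* ậ = ạ + circumflex *)
  | 0x1EC5 -> Some [0x00EA; 0x0303]  (* ễ = ê + tilde *)
  | _ -> None

(* Illustrative table: canonical combining classes; 0 means "starter". *)
let combining_class (cp : int) : int =
  match cp with
  | 0x0323 -> 220                      (* dot below *)
  | 0x0302 | 0x0303 | 0x0308 -> 230    (* circumflex, tilde, diaeresis *)
  | _ -> 0

(* Step 1: decompose recursively, as far as the table allows. *)
let rec decompose (cp : int) : int list =
  match decomposition cp with
  | None -> [cp]
  | Some parts -> List.concat (List.map decompose parts)

(* Step 2: stable-sort every run of non-starters by combining class.
   [run] and [acc] are kept in reverse order while scanning. *)
let canonical_order (cps : int list) : int list =
  let flush run acc =
    let sorted =
      List.stable_sort
        (fun a b -> compare (combining_class a) (combining_class b))
        (List.rev run)
    in
    List.rev_append sorted acc
  in
  let rec go run acc = function
    | [] -> List.rev (flush run acc)
    | cp :: rest ->
      if combining_class cp = 0 then go [] (cp :: flush run acc) rest
      else go (cp :: run) acc rest
  in
  go [] [] cps

let toy_nfd (cps : int list) : int list =
  canonical_order (List.concat (List.map decompose cps))
```

With these tables, `toy_nfd [0x1EAD]` and `toy_nfd [0x61; 0x0302; 0x0323]` both give `[0x61; 0x0323; 0x0302]`: the dot below (class 220) is ordered before the circumflex (class 230). The sort must be stable so that marks of equal class keep their original order.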

Page 39: Camomile : A Unicode library for OCaml

Camomile strings - UTF8, UTF16, UCS4

UTF8: UTF-8 string as a string

UTF16: UTF-16 string as an unsigned 16-bit integer bigarray

UCS4: UTF-32 string as a 32-bit integer bigarray

UnicodeString.Type

- UTF-8/16 and UCS4 all conform to UnicodeString.Type
- String operations are functors over UnicodeString.Type
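
The idea of writing a string operation once, as a functor over a string signature, can be sketched as follows. The signature `TEXT` is a hypothetical, cut-down stand-in for `UnicodeString.Type` (the real signature is richer and is not shown in the slides); `Utf8_text`, `Ucs4_text` and `Stats` are likewise invented names, and the UTF-8 decoder assumes well-formed input.

```ocaml
(* Hypothetical minimal string signature: fold over code points. *)
module type TEXT = sig
  type t
  val fold : ('a -> int -> 'a) -> 'a -> t -> 'a
end

(* UTF-32-like representation: one array cell per code point. *)
module Ucs4_text : TEXT with type t = int array = struct
  type t = int array
  let fold = Array.fold_left
end

(* UTF-8 representation in an OCaml string; valid input is assumed. *)
module Utf8_text : TEXT with type t = string = struct
  type t = string
  let fold f init s =
    let n = String.length s in
    let byte i = Char.code s.[i] in
    let cont i = byte i land 0x3F in
    let rec go acc i =
      if i >= n then acc
      else
        let b = byte i in
        if b < 0x80 then go (f acc b) (i + 1)
        else if b < 0xE0 then
          go (f acc (((b land 0x1F) lsl 6) lor cont (i + 1))) (i + 2)
        else if b < 0xF0 then
          go (f acc (((b land 0x0F) lsl 12) lor (cont (i + 1) lsl 6)
                     lor cont (i + 2))) (i + 3)
        else
          go (f acc (((b land 0x07) lsl 18) lor (cont (i + 1) lsl 12)
                     lor (cont (i + 2) lsl 6) lor cont (i + 3))) (i + 4)
    in
    go init 0
end

(* Operations written once, usable with any representation. *)
module Stats (T : TEXT) = struct
  let length (s : T.t) : int = T.fold (fun n _ -> n + 1) 0 s
  let to_list (s : T.t) : int list =
    List.rev (T.fold (fun acc cp -> cp :: acc) [] s)
end

module U8 = Stats (Utf8_text)
module U32 = Stats (Ucs4_text)
```

Here `U8.length "a\xCC\xA3\xCC\x82"` and `U32.length [|0x61; 0x323; 0x302|]` both count three code points, although the first string occupies five bytes.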

Page 44: Camomile : A Unicode library for OCaml

Camomile modules - UNF

Module for Unicode normal forms

    module type Type = sig
      type text
      (* abridged on the slide; index is needed by the
         constraint on Make below *)
      type index

      (* Conversion to NFD *)
      val nfd : text -> text
      val nfkd : text -> text
      val nfc : text -> text
      val nfkc : text -> text

      (* Compare strings by semantic equivalence,
         by lazily building their NFDs and comparing them *)
      val canon_compare : text -> text -> int
    end

    (* Create a module for a given Unicode string type *)
    module Make (Text : UnicodeString.Type) :
      Type with type text = Text.t and type index = Text.index
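
The benefit of comparing lazily built normal forms can be sketched with the standard library `Seq` type (OCaml 4.07 or later is assumed). This is not Camomile's code: the inputs stand for code point sequences that would be normalised on demand, and `compare_lazy`, `forced` and `counted` are names invented for this example.

```ocaml
(* Compare two lazy sequences of code points lexicographically.
   Elements are forced only up to the first difference. *)
let rec compare_lazy (a : int Seq.t) (b : int Seq.t) : int =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, Seq.Cons _ -> -1
  | Seq.Cons _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    let c = compare x y in
    if c <> 0 then c else compare_lazy a' b'

(* Demo helper: count how many elements are actually produced. *)
let forced = ref 0

let counted (l : int list) : int Seq.t =
  Seq.map (fun x -> incr forced; x) (List.to_seq l)
```

Comparing `counted [0x61; 0x323; 0x302]` with `counted [0x62; 0x323; 0x302]` returns a negative number after producing only one element from each side, so strings that differ early never need to be normalised in full.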

Page 50: Camomile : A Unicode library for OCaml

ulib - yet another Unicode library

Now under development

Page 51: Camomile : A Unicode library for OCaml

ulib - yet another Unicode library

ulib is compact

- Minimal functionality
- No data file
- No initialization

Page 55: Camomile : A Unicode library for OCaml

ulib - yet another Unicode library

ulib is modern

- Rope for Unicode strings
- Zipper for indexing ropes
- Pluggable code converters using first-class modules
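
One way "pluggable code converters using first-class modules" could look is sketched below. The slides do not show ulib's actual interface, so everything here is an assumption made for illustration: the `CONVERTER` signature, the two toy decoders, and the `registry`, `register` and `decode_with` names.

```ocaml
(* Hypothetical converter interface: bytes in, code points out. *)
module type CONVERTER = sig
  val name : string
  val decode : string -> int list
end

(* ISO-8859-1: every byte value is the code point itself. *)
module Latin1 : CONVERTER = struct
  let name = "ISO-8859-1"
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

(* US-ASCII: bytes above 0x7F become U+FFFD (replacement character). *)
module Ascii : CONVERTER = struct
  let name = "US-ASCII"
  let decode s =
    List.init (String.length s) (fun i ->
      let c = Char.code s.[i] in
      if c < 0x80 then c else 0xFFFD)
end

(* Converters are stored as ordinary values and looked up by name. *)
let registry : (string, (module CONVERTER)) Hashtbl.t = Hashtbl.create 8

let register (m : (module CONVERTER)) : unit =
  let module M = (val m) in
  Hashtbl.replace registry M.name m

let decode_with (name : string) (s : string) : int list option =
  match Hashtbl.find_opt registry name with
  | None -> None
  | Some m ->
    let module M = (val m : CONVERTER) in
    Some (M.decode s)

let () =
  register (module Latin1);
  register (module Ascii)
```

Because modules are packed as values, a new encoding can be added at run time by calling `register` with another module of type `CONVERTER`, without touching `decode_with`.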

Page 60: Camomile : A Unicode library for OCaml

Conclusion

- Unicode is different from ASCII
- Camomile addresses a "logical" part of Unicode
- Functors and laziness play crucial roles
- A more simplified library, "ulib", is now under development

Page 65: Camomile : A Unicode library for OCaml

Project URL

Camomile https://github.com/yoriyuki/Camomile
ulib https://github.com/yoriyuki/ulib