Camomile : A Unicode library for OCaml
Yoriyuki Yamagata
National Institute of Advanced Industrial Science and Technology (AIST)
ML Workshop, September 18, 2011
Outline
Overview
ASCII to Unicode : A challenge of multilingualization
Example : Unicode normal forms
ulib
Conclusion
Overview - functionality
Camomile - A Unicode library for OCaml
- Unicode character type
- UTF-8, UTF-16, UTF-32 strings
- Conversion to/from approx. 200 encodings
- Case mapping
- Collation (sort and search)
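To give a feel for what the different string types have to store, here is a minimal sketch that encodes one Unicode code point as UTF-8 and as UTF-16. It uses only the OCaml standard library; utf8_encode and utf16_encode are illustrative names, not Camomile functions.

```ocaml
(* Reject values that are not Unicode scalar values. *)
let check name cp =
  if cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) then
    invalid_arg name

(* Encode one code point as a UTF-8 byte string (1 to 4 bytes). *)
let utf8_encode cp =
  check "utf8_encode" cp;
  let b = Buffer.create 4 in
  let add n = Buffer.add_char b (Char.chr n) in
  (if cp < 0x80 then add cp
   else if cp < 0x800 then begin
     add (0xC0 lor (cp lsr 6));
     add (0x80 lor (cp land 0x3F))
   end else if cp < 0x10000 then begin
     add (0xE0 lor (cp lsr 12));
     add (0x80 lor ((cp lsr 6) land 0x3F));
     add (0x80 lor (cp land 0x3F))
   end else begin
     add (0xF0 lor (cp lsr 18));
     add (0x80 lor ((cp lsr 12) land 0x3F));
     add (0x80 lor ((cp lsr 6) land 0x3F));
     add (0x80 lor (cp land 0x3F))
   end);
  Buffer.contents b

(* Encode one code point as a list of UTF-16 code units (1 or 2 units). *)
let utf16_encode cp =
  check "utf16_encode" cp;
  if cp < 0x10000 then [cp]
  else
    let v = cp - 0x10000 in
    [0xD800 lor (v lsr 10); 0xDC00 lor (v land 0x3FF)]
```

For example, U+00E4 (ä) takes two bytes in UTF-8 but a single code unit in UTF-16, while a code point above U+FFFF takes four bytes in UTF-8 and a surrogate pair in UTF-16.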
Overview - features
- Supports only “logical” operations
- No support for rendering or formatting
- Written purely in OCaml
- Functors and lazy evaluation play crucial roles
ASCII to Unicode : challenge of multilingualization
Large number of characters
  code range 0x0 - 0x10ffff
Multiple representations of strings
  UTF-8, UTF-16 and UTF-32
  legacy encodings
Combining characters
  ä = a + ¨
  Nguyễn = Nguyê + ˜ + n = Nguye + ˆ + ˜ + n
  ậ = a + . + ˆ = a + ˆ + .
Diverse cultural conventions
  Case mapping: OΣOΣ → oσoς (Greek)
  Sorting: ... < H < CH < I < ... (Slovak)
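The Slovak sorting convention can be sketched in a few lines. This is only a toy, assuming lowercase ASCII input and ignoring Slovak diacritics; units, weight and slovak_compare are hypothetical names and this is not how Camomile implements collation. The idea is to treat the digraph "ch" as a single collation unit whose weight lies between "h" and "i".

```ocaml
(* Split a lowercase ASCII word into collation units,
   treating the digraph "ch" as one unit. *)
let units s =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else if i + 1 < n && s.[i] = 'c' && s.[i + 1] = 'h' then
      go (i + 2) ("ch" :: acc)
    else go (i + 1) (String.make 1 s.[i] :: acc)
  in
  go 0 []

(* Give each letter an even weight so that "ch" can take the odd
   weight just after "h" and just before "i". *)
let weight u =
  if u = "ch" then 2 * (Char.code 'h' - Char.code 'a') + 1
  else 2 * (Char.code u.[0] - Char.code 'a')

(* Compare two words by their lists of weights (lexicographic order). *)
let slovak_compare a b =
  compare (List.map weight (units a)) (List.map weight (units b))

let words = ["chata"; "idea"; "hora"; "cena"]
let by_bytes = List.sort compare words         (* plain byte order *)
let by_slovak = List.sort slovak_compare words (* "ch" after "h" *)
```

With byte order, "chata" lands between "cena" and "hora"; with the toy Slovak comparison it moves after "hora" and before "idea".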
Unicode normal forms - what are they?
Unicode has multiple representations of the “same” string.
  E.g. ậ = ạ + ˆ = a + . + ˆ = a + ˆ + . etc.
Normal forms give a unique representation.
There are 4 normal forms:
1. NFD
2. NFC
3. NFKD
4. NFKC
We concentrate on NFD.
Unicode normal form - NFD
1. Decompose characters as much as possible
   ậ ⇒ ạ + ˆ ⇒ a + . + ˆ
2. Do a stable sort on combining characters based on combining class
   a + . + ˆ ⇒ a + . + ˆ
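The two NFD steps above can be sketched on lists of code points. This is a minimal illustration, not Camomile's implementation: the decomposition and combining_class tables below are tiny hand-written stand-ins for the Unicode Character Database and cover only the characters used in the example, and Hangul decomposition is ignored.

```ocaml
(* Toy canonical decomposition table (only the entries needed here). *)
let decomposition = function
  | 0x00E2 -> Some [0x0061; 0x0302]  (* â = a + combining circumflex *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* ạ = a + combining dot below *)
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* ậ = ạ + combining circumflex *)
  | _ -> None

(* Toy canonical combining classes; 0 means "starter". *)
let combining_class = function
  | 0x0323 -> 220  (* dot below *)
  | 0x0302 -> 230  (* circumflex *)
  | _ -> 0

(* Step 1: decompose as much as possible (recursively). *)
let rec decompose cp =
  match decomposition cp with
  | None -> [cp]
  | Some cps -> List.concat (List.map decompose cps)

(* Step 2: stable-sort each run of combining characters by class. *)
let reorder cps =
  let by_class a b = compare (combining_class a) (combining_class b) in
  (* [out] is the output so far, reversed; [run] is the pending run of
     combining characters, also reversed. *)
  let flush out run =
    List.rev_append (List.stable_sort by_class (List.rev run)) out in
  let rec go out run = function
    | [] -> List.rev (flush out run)
    | cp :: rest when combining_class cp = 0 ->
      go (cp :: flush out run) [] rest
    | cp :: rest -> go out (cp :: run) rest
  in
  go [] [] cps

let nfd cps = reorder (List.concat (List.map decompose cps))
```

With these tables, the three spellings of ậ from the earlier slide (U+1EAD alone, â followed by dot below, and a followed by circumflex then dot below) all normalize to the same list: a, dot below, circumflex.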
Camomile strings - UTF8, UTF16, UCS4
UTF8
  UTF-8 string as a string
UTF16
  UTF-16 string as an unsigned 16-bit integer bigarray
UCS4
  UTF-32 string as a 32-bit integer bigarray
UnicodeString.Type
  UTF8, UTF16 and UCS4 all conform to UnicodeString.Type
  String operations are functors over UnicodeString.Type
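The "operations are functors over a string signature" design can be sketched with the standard library alone. Everything here is an assumption made for illustration: TEXT is a much smaller signature than Camomile's UnicodeString.Type, the Utf8 module assumes well-formed input, and Ucs4 uses a plain int array instead of a bigarray.

```ocaml
(* A deliberately small signature: any text we can fold over,
   one code point at a time. *)
module type TEXT = sig
  type t
  val fold : ('a -> int -> 'a) -> 'a -> t -> 'a
end

(* UTF-8 text stored in an ordinary string (assumed well-formed). *)
module Utf8 : TEXT with type t = string = struct
  type t = string
  let fold f init s =
    let n = String.length s in
    let rec go acc i =
      if i >= n then acc
      else
        let b = Char.code s.[i] in
        (* Number of continuation bytes and payload bits of the lead byte. *)
        let extra, lead =
          if b < 0x80 then 0, b
          else if b < 0xE0 then 1, b land 0x1F
          else if b < 0xF0 then 2, b land 0x0F
          else 3, b land 0x07 in
        let rec cont cp k =
          if k > extra then cp
          else cont ((cp lsl 6) lor (Char.code s.[i + k] land 0x3F)) (k + 1) in
        go (f acc (cont lead 1)) (i + extra + 1)
    in
    go init 0
end

(* UTF-32 style text: one array cell per code point. *)
module Ucs4 : TEXT with type t = int array = struct
  type t = int array
  let fold = Array.fold_left
end

(* String operations written once, for every TEXT implementation. *)
module Ops (T : TEXT) = struct
  let length t = T.fold (fun n _ -> n + 1) 0 t
  let code_points t = List.rev (T.fold (fun acc cp -> cp :: acc) [] t)
end

module U8 = Ops (Utf8)
module U32 = Ops (Ucs4)
```

Both U8.length and U32.length count code points rather than bytes or array cells of a particular encoding, which is the point of writing the operation against the signature.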
Camomile modules - UNF
Module for Unicode normal forms

  module type Type =
  sig
    type text
    type index  (* referenced by Make below; other members are omitted here *)

    (* Conversion to each normal form, e.g. nfd converts to NFD *)
    val nfd : text -> text
    val nfkd : text -> text
    val nfc : text -> text
    val nfkc : text -> text

    (* Compare strings by semantic equivalence,
       by lazily building their NFDs and comparing them *)
    val canon_compare : text -> text -> int
  end

  (* Create a module for a given Unicode string type *)
  module Make (Text : UnicodeString.Type) :
    Type with type text = Text.t and type index = Text.index
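The idea of comparing strings by lazily building their NFDs can be sketched with the standard Seq module (OCaml 4.11 or later is assumed for Seq.append). This is an illustration under toy assumptions, not Camomile's code: the two tables cover only a handful of characters, and nfd_seq, compare_seq and this canon_compare are names made up for the sketch. Each normalized segment (a starter plus its combining characters) is produced only when the comparison asks for it, so two long strings that differ early are never fully normalized.

```ocaml
(* Toy stand-ins for the Unicode Character Database. *)
let decomposition = function
  | 0x00E2 -> Some [0x0061; 0x0302]  (* â *)
  | 0x1EA1 -> Some [0x0061; 0x0323]  (* ạ *)
  | 0x1EAD -> Some [0x1EA1; 0x0302]  (* ậ *)
  | _ -> None

let combining_class = function
  | 0x0323 -> 220
  | 0x0302 -> 230
  | _ -> 0

let rec decompose cp =
  match decomposition cp with
  | None -> [cp]
  | Some cps -> List.concat (List.map decompose cps)

let by_class a b = compare (combining_class a) (combining_class b)

(* Collect the combining characters at the front of a sequence. *)
let rec take_marks acc (s : int Seq.t) =
  match s () with
  | Seq.Cons (cp, rest) when combining_class cp <> 0 ->
    take_marks (cp :: acc) rest
  | _ -> (List.rev acc, s)

(* Lazily produce the NFD of a code point list, one segment at a time. *)
let nfd_seq (cps : int list) : int Seq.t =
  let decomposed =
    Seq.flat_map (fun cp -> List.to_seq (decompose cp)) (List.to_seq cps) in
  let rec go s () =
    match s () with
    | Seq.Nil -> Seq.Nil
    | Seq.Cons (cp, rest) ->
      let head, tail = if combining_class cp = 0 then ([cp], rest) else ([], s) in
      let marks, rest' = take_marks [] tail in
      let segment = head @ List.stable_sort by_class marks in
      Seq.append (List.to_seq segment) (go rest') ()
  in
  go decomposed

(* Compare two sequences element by element, stopping at the first difference. *)
let rec compare_seq a b =
  match a (), b () with
  | Seq.Nil, Seq.Nil -> 0
  | Seq.Nil, _ -> -1
  | _, Seq.Nil -> 1
  | Seq.Cons (x, a'), Seq.Cons (y, b') ->
    let c = compare (x : int) y in
    if c <> 0 then c else compare_seq a' b'

let canon_compare s1 s2 = compare_seq (nfd_seq s1) (nfd_seq s2)
```

Under these tables, canon_compare treats the precomposed ậ and the sequence a, circumflex, dot below as equal, even though their code point lists differ.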
ulib - yet another Unicode library
Now under development
ulib is compact
- Minimal functionality
- No data files
- No initialization
ulib - yet another Unicode library
ulib is modern
- Rope for Unicode strings
- Zipper for indexing ropes
- Pluggable code converters using first-class modules
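As a rough picture of what "pluggable code converters using first-class modules" can look like, here is a small sketch. It does not reproduce ulib's API: the DECODER signature, the registry table, and the register and decode_with functions are all hypothetical names chosen for the example. Converters are packed as values of type (module DECODER), stored in a hash table, and looked up by name at run time.

```ocaml
(* What every pluggable decoder must provide. *)
module type DECODER = sig
  val name : string
  val decode : string -> int list  (* bytes to Unicode code points *)
end

(* ISO-8859-1: every byte is the code point with the same number. *)
module Latin1 : DECODER = struct
  let name = "ISO-8859-1"
  let decode s = List.init (String.length s) (fun i -> Char.code s.[i])
end

(* US-ASCII: bytes above 0x7F become U+FFFD (replacement character). *)
module Ascii : DECODER = struct
  let name = "US-ASCII"
  let decode s =
    List.init (String.length s) (fun i ->
      let c = Char.code s.[i] in
      if c < 0x80 then c else 0xFFFD)
end

(* Registry of decoders, stored as first-class modules. *)
let registry : (string, (module DECODER)) Hashtbl.t = Hashtbl.create 16

let register (module D : DECODER) =
  Hashtbl.replace registry D.name (module D : DECODER)

let () =
  register (module Latin1);
  register (module Ascii)

(* Look a decoder up by name and run it, if it is registered. *)
let decode_with name s =
  match Hashtbl.find_opt registry name with
  | Some (module D : DECODER) -> Some (D.decode s)
  | None -> None
```

A new encoding can then be added by writing one more module and calling register on it, without touching the lookup code.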
Conclusion
- Unicode is different from ASCII
- Camomile addresses the "logical" part of Unicode
- Functors and laziness play crucial roles
- A simpler library, "ulib", is now under development
Project URLs
Camomile https://github.com/yoriyuki/Camomile
ulib https://github.com/yoriyuki/ulib