Building Blocks for Accessing Multilingual Data: CLDR

Posted on 22-Jan-2018


Steven R. Loomis, IBM GFTT


Access available handouts at ala.15.ala.org/sessions/handouts.

About Me

• Senior Software Engineer, IBM Global Foundations Technology Team
• IBM’s technical lead for the ICU4C/C++ software library, and primary voting representative to Unicode
• Member of CLDR-TC, lead of ULI-TC


Agenda

• About CLDR
• Focus Areas:
  • Language Identification
  • Transliteration
  • Searching and Sorting
  • Keyboards/Entry
• Q&A


What is CLDR?

• Common Locale Data Repository
• Language and region-specific data
• Covers hundreds of language/region pairs
• Open data (like Unicode itself), XML/JSON format
• Community input, carefully curated


Who is CLDR?

• CLDR’s Technical Committee, the CLDR-TC, is part of the Unicode Consortium

• Active participation by industry, academia, open source projects, national standards bodies, and individuals


Who uses CLDR?

• Apple, Google, IBM, Microsoft…
• Wikimedia Foundation, jQuery, …
• Java, Node.js, PHP, …
• Many users via the ICU C/C++/Java libraries


Locale Data

• Data required for respecting the linguistic, cultural, geopolitical requirements of specific users

• Example: "What day is it?"


XML / JSON

• XML: “es-US”
  • <month type="6">Junio</month>
• JSON: “es-US”
  • { … "6": "Junio", …}
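
To make the two formats concrete, here is a minimal Python sketch that reads the same month name from an XML fragment and a JSON fragment. The wrapper element `months` and the JSON key `"months"` are assumptions made for illustration; real CLDR files nest these entries much more deeply, and the function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed fragments modeled on the slide.
# Real CLDR data nests month names several levels deeper.
XML_FRAGMENT = '<months><month type="6">Junio</month></months>'
JSON_FRAGMENT = '{"months": {"6": "Junio"}}'


def month_from_xml(fragment: str, number: int) -> str:
    """Return the text of the <month> element whose type matches number."""
    root = ET.fromstring(fragment)
    for month in root.iter("month"):
        if month.get("type") == str(number):
            return month.text
    raise KeyError(number)


def month_from_json(fragment: str, number: int) -> str:
    """Return the month name stored under the string key for number."""
    return json.loads(fragment)["months"][str(number)]


print(month_from_xml(XML_FRAGMENT, 6))
print(month_from_json(JSON_FRAGMENT, 6))
```

Both lookups key on the month number as a string, which mirrors how the `type="6"` attribute and the `"6"` key appear on the slide.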


CLDR Coverage

• Coverage vs. number of languages


CLDR site and SurveyTool (DEMO)

• DEMO:
  • http://unicode.org/cldr
  • http://st.unicode.org/cldr-apps


Locale Identifiers: BCP 47

• Example: sr-Latn-RS
  • sr : ISO 639 "Serbian"
  • Latn : ISO 15924 "Latin Script" (vs Cyrillic)
  • RS : ISO 3166 / UN M.49 "Serbia"
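
The subtag breakdown above can be sketched in a few lines of Python. This is a simplified illustration, not a full BCP 47 parser: it assumes the tag has the shape language[-script][-region] and ignores variants, extensions, and private-use subtags. `LocaleId` and `parse_tag` are hypothetical names.

```python
from typing import NamedTuple, Optional


class LocaleId(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def _is_alpha(subtag: str, length: int) -> bool:
    """True if subtag is exactly `length` ASCII letters."""
    return len(subtag) == length and subtag.isascii() and subtag.isalpha()


def parse_tag(tag: str) -> LocaleId:
    """Split a tag such as 'sr-Latn-RS' into language, script, and region.

    Simplification: subtags are recognized by shape only, and anything
    after the region (variants, extensions) is ignored.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    if not (_is_alpha(language, 2) or _is_alpha(language, 3)):
        raise ValueError(f"unsupported language subtag: {subtags[0]!r}")

    rest = subtags[1:]
    script = region = None
    # A four-letter subtag is a script code (ISO 15924), e.g. Latn.
    if rest and _is_alpha(rest[0], 4):
        script = rest.pop(0).title()
    # Two letters (ISO 3166) or three digits (UN M.49) is a region.
    if rest and (_is_alpha(rest[0], 2)
                 or (len(rest[0]) == 3 and rest[0].isascii() and rest[0].isdigit())):
        region = rest.pop(0).upper()
    return LocaleId(language, script, region)


print(parse_tag("sr-Latn-RS"))
print(parse_tag("es-US"))
```

Recognizing subtags by length and character class is what lets the script be optional: "es-US" has no four-letter subtag, so "US" is read directly as the region.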



Language/Territory/Script info

Facts:

• “The Cyrillic script can be used to write Mongolian, Russian, Serbian…”
• “Italian is spoken in Italy, San Marino, Switzerland…”


Language Identification: Exemplars

English (Latin)

a b c d e f g h i j k l m n o p q r s t u v w x y z

Serbian (Latin)

a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž

Serbian (Cyrillic)

а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш

Russian (Cyrillic)

а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я
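
One way the exemplar sets above could narrow down a language is a simple subset test: a language stays a candidate only if every letter of the text appears in its exemplar set. The Python sketch below uses exactly the four sets from this slide; the function name `candidates` is hypothetical, and real language identification would also weigh frequency and other evidence.

```python
import unicodedata

# Exemplar sets copied from the slide. Digraphs such as "dž" contribute
# their individual characters to the set.
EXEMPLARS = {
    "English (Latin)": "a b c d e f g h i j k l m n o p q r s t u v w x y z",
    "Serbian (Latin)": "a b c ć č d đ dž e f g h i j k l lj m n nj o p r s š t u v z ž",
    "Serbian (Cyrillic)": "а б в г д ђ е ж з и ј к л љ м н њ о п р с т ћ у ф х ц ч џ ш",
    "Russian (Cyrillic)": "а б в г д е ё ж з и й к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я",
}

EXEMPLAR_SETS = {
    name: set(unicodedata.normalize("NFC", letters).replace(" ", ""))
    for name, letters in EXEMPLARS.items()
}


def candidates(text: str) -> list:
    """Return every language whose exemplar set covers all letters in text."""
    normalized = unicodedata.normalize("NFC", text).lower()
    letters = {ch for ch in normalized if ch.isalpha()}
    return [name for name, chars in EXEMPLAR_SETS.items() if letters <= chars]


print(candidates("Србија"))   # contains ј, which the Russian set lacks
print(candidates("queue"))    # contains q, which the Serbian Latin set lacks
print(candidates("more"))     # ambiguous: covered by more than one set
```

Short texts often stay ambiguous, as the last call shows, so exemplars are best treated as a filter that rules languages out rather than a classifier that picks one.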


Transliteration

• Existing data for rule sets.
• ALA-LC format could be included.
• Rule-based engine.


Transliteration Rule Example: Greek

• <tRule>Σ ↔ S ;</tRule>
• <tRule>τ ↔ t ;</tRule>
• <tRule>Τ ↔ T ;</tRule>
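
To show how bidirectional rules like these can drive an engine, here is a toy Python sketch that parses the three rules and applies them in either direction. It assumes every rule maps exactly one character to one character; the real rule syntax also supports context, multi-character sequences, and one-way rules, none of which are handled here. `parse_rules` and `transliterate` are hypothetical names.

```python
# The rule bodies from the slide, without the <tRule> wrappers.
RULES = """
Σ ↔ S ;
τ ↔ t ;
Τ ↔ T ;
"""


def parse_rules(text: str) -> list:
    """Turn 'X ↔ Y ;' lines into (source, target) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip().rstrip(";").strip()
        if not line:
            continue
        source, target = (part.strip() for part in line.split("↔"))
        pairs.append((source, target))
    return pairs


def transliterate(text: str, pairs: list, reverse: bool = False) -> str:
    """Apply single-character rules left-to-right, or right-to-left if reverse."""
    if reverse:
        table = {target: source for source, target in pairs}
    else:
        table = {source: target for source, target in pairs}
    # Characters without a rule pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


pairs = parse_rules(RULES)
print(transliterate("Στ", pairs))
print(transliterate("St", pairs, reverse=True))
```

Because each rule uses ↔, the same table serves both Greek-to-Latin and Latin-to-Greek; only the direction of the lookup changes.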


Demo: ICU transliterator demo

• http://demo.icu-project.org/icu-bin/translit


Searching and Sorting

• The Unicode Collation Algorithm (UCA) provides the base ordering
• CLDR “tailors” it per language: English vs. Danish vs. French
• German: Mueller = Müller = MUELLER
• Multiple stages and options:
  • blackbird vs black-bird vs BlackBird
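
The equivalences on this slide can be approximated with a loose search key, sketched below in Python. This is not the UCA and not CLDR's tailoring data: it is a hand-written stand-in that assumes three things for illustration, namely that case is ignored, that hyphens are ignored, and that German umlauts match their "e" spellings. A real collator compares multi-level weights instead of rewriting strings. `search_key` is a hypothetical name.

```python
import unicodedata

# Assumed German-style expansions, written out by hand for this sketch.
UMLAUT_EXPANSIONS = (("ä", "ae"), ("ö", "oe"), ("ü", "ue"))


def search_key(text: str) -> str:
    """Build a loose matching key: no case, no hyphens, umlauts expanded."""
    key = unicodedata.normalize("NFC", text).casefold()
    for umlaut, expansion in UMLAUT_EXPANSIONS:
        key = key.replace(umlaut, expansion)
    return key.replace("-", "")


def loosely_equal(a: str, b: str) -> bool:
    """True if both strings reduce to the same search key."""
    return search_key(a) == search_key(b)


print(loosely_equal("Mueller", "Müller"))
print(loosely_equal("Müller", "MUELLER"))
print(loosely_equal("blackbird", "black-bird"))
print(loosely_equal("black-bird", "BlackBird"))
```

Each ignored distinction here (case, punctuation, umlaut spelling) corresponds to an option that a collator would let the caller switch on or off, which is the point of the "multiple stages and options" bullet.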


Demo: Collator

• http://demo.icu-project.org/icu-bin/collation.html


Keyboards / Entry

• Standardized identifier for keyboard tables

• Allows comparison between keyboard providers


Demo: MARC processor

CLDR data

Script: Armn (Armenian)
Exemplar text matches hy “Armenian”
Transliterate to Latin: “Hayastaneayc‘ ekeġec‘i”
Regions where spoken: Armenia, Russia, Georgia, Syria, Lebanon, Iran, Turkey, Cyprus


uses: CLDR, ICU4J, MARC4J


Thank You / Q&A

• srloomis@us.ibm.com
• @srl295 (Twitter, GitHub, Freenode)
• ibm.biz/srloomis
