Unicode for Under Resourced Languages

Daniel Yacob, The Ge’ez Frontier Foundation. SALTMIL 5: Genoa, Italy 2006


Transcript of Unicode for Under Resourced Languages

Page 1: Unicode for Under Resourced Languages

Unicode for Under Resourced Languages

Daniel Yacob

The Ge’ez Frontier Foundation

SALTMIL 5: Genoa, Italy 2006

Page 2: Unicode for Under Resourced Languages

Overview

• What is “Unicode”?
– More than Just Encoded Letters!

• Working with Unicode
– How Unicode can help you.
– Resources and how to apply them.

• Working for Unicode
– How you can help Unicode.
– How Unicode can help your U-RL.

Page 3: Unicode for Under Resourced Languages

My Background

• Started Ethiopic software work in 1993
– transliterator, keyboard, fonts

• Amharic Computational Linguistics in 1994

• “Extended Ethiopic” Unicode Standardization 1995-2004

• Corpus Collection 1997 – Present

• Began Using Unicode in 1995 for Ethiopic
– but no Unicode standard for Ethiopic existed until 2000!

Page 4: Unicode for Under Resourced Languages

My Background

• Few or no Unicode-based resources existed in 1993-1997
– Today there is almost always an OpenSource project that you can start with and extend.
– Minimize the time and labour you put into developing basic resources.
– Avoid the maintenance trap.

• We will assume the worst case scenario
– You work on a language, using a script, with no pre-existing software resources at all.

Page 5: Unicode for Under Resourced Languages

What Unicode is

Unicode …
– is a consortium
– is a process
– is a community
– is a conference
– is a database
– is a standard
– is a collection of standards

Page 6: Unicode for Under Resourced Languages

What Unicode is not

Unicode …
– is not a font
– is not a keyboard system
– is not a transliteration system
– is not the ISO
– is not perfect
– is not complete

Page 7: Unicode for Under Resourced Languages

Over 80 Scripts not Encoded!

India, Nepal, Bangladesh:

• Chakma

• Meithei / Manipuri

• Newari

• Sorang Sompeng

• Varang Kshiti

Southeast Asia (excluding China):

• Batak

• Cham

• Javanese

• Pahawh Hmong

• Viet Thai

China:

• Lanna

• Naxi Geba

• Naxi Tomba

• Pollard

Africa:

• Bamum

• Bassa

• Mende

Courtesy of Michael Everson: http://evertype.com

Page 8: Unicode for Under Resourced Languages

Over 80 Scripts not Encoded!

• Ahom
• Alpine
• Aramaic
• Avestan
• Aztec Pictograms
• Balti
• Brahmi
• Büthakukye
• Byblos
• Chalukya
• Chola
• Cypro-Minoan
• Egyptian Hieroglyphs
• Elbasan
• Elymaic
• Grantha
• Hatran
• Iberian
• Indus Valley
• Jurchin
• Kaithi
• Kawi
• Khotanese
• Kitan Large Script
• Kitan Small Script
• Landa
• Linear A
• Luwian
• Mandaic
• Manichaean
• Mayan Hieroglyphs
• Meroitic
• Modi
• Nabataean
• North Arabic
• Numidian
• Old Hungarian
• Old Permic
• Orkhon
• Pahlavi
• Palmyrene
• Proto-Elamite
• Pyu
• Rongorongo
• Samaritan
• Satavahana
• Sharada
• Siddham
• South Arabian
• Soyombo
• Takri
• Tangut Ideograms
• Uighur
• Vedic accents

Courtesy of Michael Everson: http://evertype.com

Page 9: Unicode for Under Resourced Languages

Current State of the Unicode Standard: New Script Additions

For Unicode 5.0 (2006):

N’Ko (West Africa)

Balinese (Indonesia)

Phags-pa (historical)

Phoenician (historical)

Cuneiform (historical)

For Unicode 5.1 (2008):

Lepcha (India)

Ol Chiki (India)

Vai (Liberia)

Saurashtra (India)

Myanmar minorities (Myanmar)

Kayah Li (Myanmar)

Rejang (Indonesia)

Sundanese (Indonesia)

Carian, Lycian, Lydian (historical)

Courtesy of Michael Everson: http://evertype.com

Page 10: Unicode for Under Resourced Languages

Working with Unicode

Unicode is all About Text
• Most applicable to problems where language is represented by text.
• Unicode addresses some vocabulary, but only under the scope of localization (CLDR).
• May not be the solution if you are not working with text represented in written form
– Although Unicode can be used for symbol processing

Page 11: Unicode for Under Resourced Languages

Working with Unicode

Operating Systems

• Almost anything from this millennium.

• Apple MacOS Version ≥ 9.2

• Microsoft Windows CE, NT, XP, 2000

• Solaris ≥ 2.8

• Any GNU/Linux (for console use)
– GNOME 2.0 or KDE 2.0 and Later

Page 12: Unicode for Under Resourced Languages

Working with Unicode

The International Phonetic Alphabet (IPA)

Page 13: Unicode for Under Resourced Languages
Page 14: Unicode for Under Resourced Languages

Working with Unicode

The International Phonetic Alphabet (IPA)

• SIL Charis, Doulos, Gentium
– free and most complete
– matches “Times New Roman” style
– http://scripts.sil.org/IPAhome

Page 15: Unicode for Under Resourced Languages

Working with Unicode

If you need more letters…

• Create Your own Fonts!

• Use the Unicode Private Use Area (PUA)
– this is Unicode’s extension mechanism.
– does not break compatibility with Unicode software.
– you must send your fonts with your work.
– encode non-letter symbols (tokens, tags), no need for fonts.

Page 16: Unicode for Under Resourced Languages

Working with Unicode

The PUA

• 6,400 code points in the range E000-F8FF

• 131,068 additional code points available in “planes” 15 & 16 (65,534 in each)

• Work in Plane 0 first (0000 – FFFF)

• Intended for company logos, ligatures used by typesetting software, etc.
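
The ranges above are easy to check in code. A minimal Python sketch: the names `PUA_RANGES`, `is_private_use` and `pua_size` are made up for this illustration, and the plane 15 and 16 ranges are taken from the Unicode Standard rather than from the slide.

```python
# Private Use Area ranges (inclusive). The last two code points of
# planes 15 and 16 are noncharacters, so they are left out.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (the Basic Multilingual Plane)
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in a Private Use Area."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_size(lo: int, hi: int) -> int:
    """Number of code points in an inclusive range."""
    return hi - lo + 1
```

A check like `is_private_use(ch)` can flag text that will only display correctly when your own font travels with it.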

Page 17: Unicode for Under Resourced Languages

Working with Unicode

Creating Your Own Fonts
• Bitmap (BDF)
– Faster to create
– One size per font, not so scalable
– Works best with X-Windows (Unix)

• Outline (TrueType, PostScript, OpenType)
– Takes more time
– Scalable
– MS Windows, Mac, Modern Unixes

Page 18: Unicode for Under Resourced Languages

Working with Unicode

Bitmap Editors

• Each letter is a matrix of pixels, like tiles

• You toggle them on or off to shape your letters

• GBDFED for recent GNOME/Linux

• XBDFED for general Unix

• Or search for “BDF Editor”
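
To make the "matrix of pixels" idea concrete: a BDF file stores each glyph row as hexadecimal digits, most significant bit leftmost. The following Python sketch renders such rows as text. The function names and the sample glyph are invented for illustration and are not taken from any of the editors listed above.

```python
def render_row(hex_row: str, width: int) -> str:
    """Turn one BDF BITMAP row (hex digits) into '#' and '.' pixels."""
    bits = bin(int(hex_row, 16))[2:].zfill(len(hex_row) * 4)
    return "".join("#" if b == "1" else "." for b in bits[:width])


def render_glyph(hex_rows, width: int) -> str:
    """Render every row of a glyph, one text line per pixel row."""
    return "\n".join(render_row(row, width) for row in hex_rows)


# A hypothetical 8x5 glyph shaped like a plus sign.
PLUS = ["18", "18", "FF", "18", "18"]
```

Calling `print(render_glyph(PLUS, 8))` draws the glyph in a terminal, which is roughly what a bitmap editor shows in its edit window.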

Page 19: Unicode for Under Resourced Languages

Working with Unicode

Page 20: Unicode for Under Resourced Languages

Working with Unicode

Bitmap Editors

Zoom View Within Edit Window

Page 21: Unicode for Under Resourced Languages

Working with Unicode

Outline Editors

• Create Bezier curves to outline scalable shapes

• Here traced around a scanned image

• FontForge http://fontforge.sf.net

Page 22: Unicode for Under Resourced Languages

Working with Unicode

Creating Your Own Keyboards

• No standard formats

• Different on every operating system

• May require some painful programming
– transliteration may be a better alternative.

• For small amounts of typing try: Ctrl+Shift+X1X2X3X4, where X1 to X4 are the four hexadecimal digits of the code point, e.g. Ctrl+Shift+1234

Page 23: Unicode for Under Resourced Languages

Working with Unicode

Creating Your Own Keyboards

Linux
• Migration Toward Smart Common Input Method (SCIM)
– simple table based
– more complex as needed
– http://scim.sf.net
– or Yudit, Emacs for older Unixes, but you can only type in these applications.

Page 24: Unicode for Under Resourced Languages

Working with Unicode

Creating Your Own Keyboards

Windows
• Keyman, most mature & robust
• Keyboards created with Keyman Developer
– $59 academic and developing world license
– worth every cent
– compiled keyboards also run under Linux with a SCIM module
– http://tavultesoft.com

Page 25: Unicode for Under Resourced Languages

Working with Unicode

Text Processing
• International Components for Unicode (ICU)
– http://icu.sf.net
– Java, C/C++
– Bindings in: Python, Ruby, C#, Perl 6 (some Perl 5)
– started by IBM, is OpenSource
– managed by the Unicode president
– check with ICU before

• 700+ Encoding Conversions
– convert legacy systems to and from Unicode
– migrate corpora to Unicode
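
The conversion step itself is small. The sketch below uses Python's built-in codecs as a stand-in for ICU's converters. It assumes the legacy data is ISO-8859-1, and the function names are illustrative. Font-based legacy encodings, as were common for Ethiopic, generally need a custom mapping table instead of a stock codec.

```python
def legacy_to_utf8(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    text = data.decode(legacy_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")          # Unicode string -> UTF-8 bytes


def utf8_to_legacy(data: bytes, legacy_encoding: str = "iso-8859-1") -> bytes:
    """Convert back; raises UnicodeEncodeError on unmappable characters."""
    return data.decode("utf-8").encode(legacy_encoding, errors="strict")
```

Running a corpus through a round trip such as `utf8_to_legacy(legacy_to_utf8(raw)) == raw` is a cheap way to confirm that nothing was lost in migration.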

Page 26: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

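ICU: Normalization
• Equate letters and diacritical symbols

n (006E) + combining tilde (0303) = ñ (00F1)

u (0075) + combining diaeresis (0308) = ü (00FC)

A (0041) + combining ring above (030A) = Å (00C5), which is also equated with the Angstrom sign Å (212B)

e (0065) + combining circumflex (0302) + combining dot below (0323) = ệ (1EC7)

e (0065) + combining dot below (0323) + combining circumflex (0302) = ệ (1EC7)

ê (00EA) + combining dot below (0323) = ệ (1EC7)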
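
The same equivalences can be tried without ICU. This is a small Python sketch using the standard `unicodedata` module rather than ICU's normalizer. The helper names `nfc` and `canonically_equal` and the list `VARIANTS` are made up for the example.

```python
import unicodedata


def nfc(text: str) -> str:
    """Compose base letters and combining marks (Normalization Form C)."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a: str, b: str) -> bool:
    """True when two strings are canonically equivalent."""
    return nfc(a) == nfc(b)


# Three spellings of "e with circumflex and dot below".
VARIANTS = [
    "e\u0302\u0323",   # e + circumflex + dot below
    "e\u0323\u0302",   # e + dot below + circumflex
    "\u00EA\u0323",    # precomposed e-circumflex + dot below
]
```

Normalizing both sides before comparing or indexing text means a search for one spelling also finds the others.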

Page 27: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Regular Expressions
• Applies the Unicode Character Database
• Categorize every character as one of
– Letter
– Number
– Separator
– Punctuation
– Marks
– Symbols
– Others

• Subcategories within each. Examples
– Letter, Uppercase, lowercase, Other, …
– Symbols, Math, Currency, Modifiers, …
– Mark, spacing, non-spacing, enclosing

• Defines 80 character property types
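
The categories listed above are available in most languages. As a rough illustration, Python's standard `unicodedata` module (not ICU) reports the two-letter General Category code, whose first letter is the major class. The table `MAJOR_CLASSES` and the function `describe` are names invented for this sketch.

```python
import unicodedata

# First letter of the General Category code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter",
    "N": "Number",
    "Z": "Separator",
    "P": "Punctuation",
    "M": "Mark",
    "S": "Symbol",
    "C": "Other",
}


def describe(ch: str):
    """Return (category code, major class) for one character."""
    code = unicodedata.category(ch)
    return code, MAJOR_CLASSES[code[0]]
```

For example, `describe("ች")` reports an Ethiopic syllable as `Lo`, a Letter of the subcategory Other.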

Page 28: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Regular Expressions

Set Operations
• [^\p{Letter}] Negation
• [\p{Letter}\p{Number}] Union
• [\p{Letter}&\p{script=Cyrillic}] Intersection
• [\p{Letter}-\p{Latin}] Difference

• Important for a character set the size of Unicode.
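
Python's built-in `re` module has no `\p{…}` properties, so the sketch below imitates the four set operations with ordinary Python sets over a small slice of Unicode. Two assumptions to note: only code points below U+0530 are sampled, and the script property is approximated by the prefix of the character name, which is cruder than the real Script property ICU uses. All names are illustrative.

```python
import unicodedata

# A small sample: Basic Latin through the Cyrillic Supplement block.
SAMPLE = [chr(cp) for cp in range(0x0000, 0x0530)]


def with_category(prefix: str) -> set:
    """Characters whose General Category starts with prefix (e.g. 'L')."""
    return {ch for ch in SAMPLE if unicodedata.category(ch).startswith(prefix)}


def with_name_prefix(prefix: str) -> set:
    """Approximate a script by character name, e.g. 'CYRILLIC'."""
    return {ch for ch in SAMPLE if unicodedata.name(ch, "").startswith(prefix)}


letters = with_category("L")
numbers = with_category("N")
cyrillic = with_name_prefix("CYRILLIC")
latin = with_name_prefix("LATIN")

not_letters = set(SAMPLE) - letters      # [^\p{Letter}]
letters_or_numbers = letters | numbers   # [\p{Letter}\p{Number}]
cyrillic_letters = letters & cyrillic    # [\p{Letter}&\p{script=Cyrillic}]
non_latin_letters = letters - latin      # [\p{Letter}-\p{Latin}]
```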

Page 29: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Regular Expressions

• Enhanced Word Boundaries:

Hello There. G’day 123.456 (Classic RE)

Hello There. G’day 123.456 (Unicode Word Boundaries)
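
The difference shows up when the sample sentence is split into words. In this Python sketch, `classic_words` uses the traditional `\w+` notion of a word, while `unicode_like_words` is only a rough approximation of the Unicode word boundary rules (UAX #29): it keeps a full stop or apostrophe that sits between word characters. It is not ICU's implementation, and the function names are made up.

```python
import re

SAMPLE = "Hello There. G\u2019day 123.456"


def classic_words(text: str):
    """Classic regex words: runs of word characters only."""
    return re.findall(r"\w+", text)


# Approximation: allow . ' or the right single quote between word characters.
UNICODE_LIKE = re.compile(r"\w+(?:[.'\u2019]\w+)*")


def unicode_like_words(text: str):
    """Words that keep internal apostrophes and decimal points."""
    return UNICODE_LIKE.findall(text)
```

With the classic rule "G’day" and "123.456" each fall apart into two pieces, while the approximation keeps them whole.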

Page 30: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Regular Expressions

• Equivalence Classes
– [=e=] matches all “e” [eèéêëēĕėęě]
– not yet implemented
– use Perl instead
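
Where an engine lacks `[=e=]`, the class can be generated. The Python sketch below builds it by decomposing each candidate character and keeping those whose base letter is "e". It assumes the alphabet of interest is Basic Latin through Latin Extended-A, and the helper names are invented for the example.

```python
import re
import unicodedata


def base_letter(ch: str) -> str:
    """First character of the canonical decomposition, i.e. marks stripped."""
    return unicodedata.normalize("NFD", ch)[0]


def equivalence_class(base: str, alphabet) -> str:
    """Regex character class of all alphabet members sharing one base letter."""
    members = [ch for ch in alphabet if base_letter(ch) == base]
    return "[" + "".join(re.escape(ch) for ch in members) + "]"


# Assumed alphabet: U+0041 up to the end of Latin Extended-A.
ALPHABET = [chr(cp) for cp in range(0x0041, 0x0180)]
E_CLASS = equivalence_class("e", ALPHABET)
```

Over this alphabet `E_CLASS` comes out as the same ten-letter class shown on the slide, and it can be dropped into any larger pattern.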

Page 31: Unicode for Under Resourced Languages

Working with Unicode

Simple Plurals: [#7#]ች

vs

[ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች

Overloading Perl Regex with Regexp::Ethiopic

Page 32: Unicode for Under Resourced Languages

Working with Unicode

• /[#3#]ያ/
– አንባቢያን
– ሚያዚያ
– ኢትዮጵያዊያን

• /[#3,6#]ያ/
– አንባቢያን አንባብያን
– ሚያዚያ ሚያዝያ
– ኢትዮጵያዊያን ኢትዮጵያውያን

Overloading Perl Regex with Regexp::Ethiopic
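
The idea behind the `[#n#]` classes can be imitated in other languages. The Python sketch below is not how Regexp::Ethiopic works internally. It relies on an assumption about the Unicode Ethiopic block: each consonant row starts at a multiple of eight, so a syllable's order is its code point modulo 8, plus one. The labialized rows do not follow this layout exactly, so treat the result as approximate. The name `form_class` is made up.

```python
import re
import unicodedata

ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x1357  # syllables of the main block


def form_class(*orders: int) -> str:
    """Regex character class of assigned syllables in the given orders (1-8)."""
    offsets = {order - 1 for order in orders}
    chars = [
        chr(cp)
        for cp in range(ETHIOPIC_START, ETHIOPIC_END + 1)
        if cp % 8 in offsets and unicodedata.name(chr(cp), "")  # skip unassigned
    ]
    return "[" + "".join(chars) + "]"


PLURAL = re.compile(form_class(7) + "\u127D")                 # like [#7#]ች
THIRD_YA = re.compile(form_class(3) + "\u12EB")               # like [#3#]ያ
THIRD_OR_SIXTH_YA = re.compile(form_class(3, 6) + "\u12EB")   # like [#3,6#]ያ
```

As on the slide, `THIRD_YA` finds only the spelling አንባቢያን, while `THIRD_OR_SIXTH_YA` also finds the variant አንባብያን.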

Page 33: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Transliteration

• Defined by “transform rules”
– One to one mappings:
 • α <> a;
 • β <> b;
– Context Rules:
 • β } [aeiou] > b;
 • β } [^aeiou] > v;
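
To show what the context rules do, here is an emulation in plain Python with the `re` module. It is not ICU and does not parse ICU rule syntax. It covers only the α mapping and the two β context rules, and it assumes a β at the end of the text counts as "not before a vowel". The function name is illustrative.

```python
import re


def greek_to_latin(text: str) -> str:
    """Apply the context rules for beta, then the one-to-one rule for alpha."""
    # β } [aeiou] > b : beta directly before a Latin vowel becomes b
    text = re.sub(r"β(?=[aeiou])", "b", text)
    # β } [^aeiou] > v : every remaining beta becomes v
    text = text.replace("β", "v")
    # α <> a : plain one-to-one mapping
    return text.replace("α", "a")
```

The more specific rule is applied first, mirroring the way rule order matters in a transform rule set.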

Page 34: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

ICU: Transliteration

• Defined by “transform rules”
– Applying UCD Properties
 • Θ } [:LowercaseLetter:] <> Th;
 • Θ <> TH;
– Reverse Transliteration Context Rules
 • σ < [:^Letter:] { s } [:^Letter:] ;
 • ς < s } [:^Letter:] ;
 • σ < s ;
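
The three reverse rules for sigma translate into ordered substitutions. This is again a Python `re` emulation rather than ICU itself, under the assumption that the start and end of the text behave like non-letters. `LETTER` and `latin_s_to_sigma` are names chosen for the sketch.

```python
import re

LETTER = r"[^\W\d_]"  # any Unicode letter, in Python's re syntax


def latin_s_to_sigma(text: str) -> str:
    """Choose between σ and final ς when converting Latin s to Greek."""
    # σ < [:^Letter:] { s } [:^Letter:] : an isolated s stays a regular sigma
    text = re.sub(rf"(?<!{LETTER})s(?!{LETTER})", "σ", text)
    # ς < s } [:^Letter:] : s at the end of a word becomes final sigma
    text = re.sub(rf"s(?!{LETTER})", "ς", text)
    # σ < s : everything else
    return text.replace("s", "σ")
```

Only the s is converted here, so the other letters pass through unchanged.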

Page 35: Unicode for Under Resourced Languages

Working with Unicode

Text Processing

• ICU: Transliteration
– Gets much more sophisticated

• See also Perl’s Text::Transliterate

Page 36: Unicode for Under Resourced Languages

Working for Unicode

Taking Your Work a Step Further
• You’ve helped create an orthography
– now make it official.
• You’ve worked with a pre-existing un-encoded script using the PUA
– now formalize it.
• You’ve created a transliteration system
– make it an ISO standard.
• You’ve identified a dialect
– encode it in ISO 639.
• You’ve developed a keyboard
– make it a national standard.
• etc.

Page 37: Unicode for Under Resourced Languages

Working for Unicode

Why go the extra kilometer?
• Ethnic pride and identity are promoted.
• Literacy efforts can be encouraged.
• The study of historic scripts is kept alive.
• Communication between and amongst members of the community is promoted.
• Government communication in times of emergency (disease, war, natural disaster).
• Leads to localization, greater access to ICT.
• …and you become the expert!

Page 38: Unicode for Under Resourced Languages

Working for Unicode

What to Consider
• The work will be more social than technical.
• The work will take years (at least two).
• Review Encoding History
– Has this been attempted before and failed? Why?
– Are there any non-Unicode encodings?
• Determine the Stakeholders
– The Government: will they support you, oppose you, jail you?
– Political Parties, Religious, Education, Cultural Groups
 • does anyone have something to lose by the encoding?
• Communicate, Communicate, Communicate…
– and be transparent.
– the perception of being closed breeds suspicion and opposition.
 • …even 11 years after the fact, trust me on this.

Page 39: Unicode for Under Resourced Languages

Working for Unicode

New Keyboard?

• No international standardization working groups

• Contribute Keyboard back to main project

• Contact Local ICT Professionals Organization

• Contact Local University CS Department

• Contact Local Standards Body

Page 40: Unicode for Under Resourced Languages

Working for Unicode

New Language or Dialect?

• Contact the ISO/DIS 639-3 Registration Authority
– http://sil.org/iso639-3/
– [email protected]

• Contact Language or Cultural Authority

• Contact Local University Linguistics Department

Page 41: Unicode for Under Resourced Languages

Working for Unicode

New Orthography? Or Un-encoded?
• Contact the ISO 15924 Registration Authority
– http://unicode.org/iso15924/
• Contact Language or Cultural Authority
• Contact Local ICT Professionals Organization
• Contact Local University CS Department
• Contact Local University Linguistics Department
• Contact Local Standards Body
• Contact the Script Encoding Initiative

Page 42: Unicode for Under Resourced Languages

Working for Unicode

The Script Encoding Initiative
• http://linguistics.berkeley.edu/sei
• Works with users on script proposals.
• Helps raise money for script proposals to be written and free fonts to be created.
• Works collaboratively with other groups (e.g. SIL) to avoid duplication of effort.
• Helps seek experts to review proposals.
• Participates at standards meetings on behalf of minority groups and scholars.

Page 43: Unicode for Under Resourced Languages

~fini~

• Conclusion
– Use Unicode Now!
– You can do it!
– Yes you can do it!
– There are no excuses anymore…
– …it’s 2006 already, I’m telling you, you can do this!
– and when you do (remember I have faith in you!) consider feeding back into the system via standardization.
– Be a good citizen of earth, always ☺.

Thank You for Listening.

Are There Any Questions?

This presentation: http://yacob.org/papers/