Unicode for Under Resourced Languages

Post on 15-Jan-2016

Unicode for Under Resourced Languages. Daniel Yacob, The Ge’ez Frontier Foundation. SALTMIL 5: Genoa, Italy 2006. Overview. What is “Unicode”? More than Just Encoded Letters! Working with Unicode: How Unicode can help you. Resources and how to apply them. Working for Unicode.

Daniel Yacob

The Ge’ez Frontier Foundation

SALTMIL 5: Genoa, Italy 2006

Overview

• What is “Unicode”?
  – More than Just Encoded Letters!

• Working with Unicode
  – How Unicode can help you.
  – Resources and how to apply them.

• Working for Unicode
  – How you can help Unicode.
  – How Unicode can help your U-RL.

My Background

• Started Ethiopic software work in 1993
  – transliterator, keyboard, fonts

• Amharic Computational Linguistics in 1994

• “Extended Ethiopic” Unicode Standardization 1995-2004

• Corpus Collection 1997 – Present

• Began Using Unicode in 1995 for Ethiopic
  – but no Unicode standard for Ethiopic existed until 2000!

My Background

• Little or no Unicode based resources in 1993-1997
  – Today there is almost always an OpenSource project that you can start with and extend.
  – Minimize the time and labour you put into developing basic resources.
  – Avoid the maintenance trap.

• We will assume the worst case scenario
  – You work on a language, using a script, with no pre-existing software resources at all.

What Unicode is

Unicode …
  – is a consortium
  – is a process
  – is a community
  – is a conference
  – is a database
  – is a standard
  – is a collection of standards

What Unicode is not

Unicode …
  – is not a font
  – is not a keyboard system
  – is not a transliteration system
  – is not the ISO
  – is not perfect
  – is not complete

Over 80 Scripts not Encoded!

India, Nepal, Bangladesh:

• Chakma

• Meithei / Manipuri

• Newari

• Sorang Sompeng

• Varang Kshiti

Southeast Asia (excluding China):

• Batak

• Cham

• Javanese

• Pahawh Hmong

• Viet Thai

China:

• Lanna

• Naxi Geba

• Naxi Tomba

• Pollard

Africa:

• Bamum

• Bassa

• Mende

Courtesy of Michael Everson: http://evertype.com

Over 80 Scripts not Encoded!

•Ahom

•Alpine

•Aramaic

•Avestan

•Aztec Pictograms

•Balti

•Brahmi

•Büthakukye

•Byblos

•Chalukya

•Chola

•Cypro-Minoan

•Egyptian Hieroglyphs

•Elbasan

•Elymaic

•Grantha

•Hatran

•Iberian

•Indus Valley

•Jurchin

•Kaithi

•Kawi

•Khotanese

•Kitan Large Script

•Kitan Small Script

•Landa

•Linear A

•Luwian

•Mandaic

•Manichaean

•Mayan Hieroglyphs

•Meroitic

•Modi

•Nabataean

•North Arabic

•Numidian

•Old Hungarian

•Old Permic

•Orkhon

•Pahlavi

•Palmyrene

•Proto-Elamite

•Pyu

•Rongorongo

•Samaritan

•Satavahana

•Sharada

•Siddham

•South Arabian

•Soyombo

•Takri

•Tangut Ideograms

•Uighur

•Vedic accents

Courtesy of Michael Everson: http://evertype.com

Current State of the Unicode Standard: New Script Additions

For Unicode 5.0 (2006):

N’Ko (West Africa)

Balinese (Indonesia)

Phags-pa (historical)

Phoenician (historical)

Cuneiform (historical)

For Unicode 5.1 (2008):

Lepcha (India)

Ol Chiki (India)

Vai (Liberia)

Saurashtra (India)

Myanmar minorities (Myanmar)

Kayah Li (Myanmar)

Rejang (Indonesia)

Sundanese (Indonesia)

Carian, Lycian, Lydian (historical)

Courtesy of Michael Everson: http://evertype.com

Working with Unicode

Unicode is all About Text

• Most applicable to problems where language is represented by text.

• Unicode addresses some vocabulary but under the scope of localization (CLDR).

• May not be the solution if you are not working with text represented in written form
  – Although, Unicode can be used for symbol processing

Working with Unicode

Operating Systems

• Most anything from this millennium.

• Apple MacOS Version ≥ 9.2

• Microsoft Windows CE, NT, XP, 2000

• Solaris ≥ 2.8

• Any GNU/Linux (for console use)
  – GNOME 2.0 or KDE 2.0 and Later

Working with Unicode

The International Phonetic Alphabet (IPA)

• SIL Charis, Doulos, Gentium
  – free and most complete
  – matches “Times New Roman” style
  – http://scripts.sil.org/IPAhome

Working with Unicode

If you need more letters…

• Create Your own Fonts!

• Use the Unicode Private Use Area (PUA)
  – this is Unicode’s extension mechanism.
  – does not break compatibility with Unicode software.
  – you must send your fonts with your work.
  – encode non-letter symbols (tokens, tags), no need for fonts.

Working with Unicode

The PUA

• 6,400 code points in the range E000-F8FF

• 131,068 additional (2 × 65,534, roughly 2¹⁷) available in “planes” 15 & 16

• Work in Plane 0 first (0000 – FFFF)

• Intended for company logos, ligatures used by typesetting software, etc.
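The PUA ranges above can be checked programmatically. The following is a minimal Python sketch, not part of the original slides: the helper names (`is_private_use`, `pua_size`) and the corpus tags are hypothetical and chosen only for illustration. It tests whether a code point is private use and shows how non-letter tokens could be assigned to the start of the Plane 0 PUA.

```python
import unicodedata

# The three Private Use ranges defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (the Basic Multilingual Plane)
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def is_private_use(code_point):
    """Return True if the code point falls in any Private Use range."""
    return any(lo <= code_point <= hi for lo, hi in PUA_RANGES)


def pua_size(lo, hi):
    """Number of code points in an inclusive range."""
    return hi - lo + 1


# Hypothetical corpus tags, assigned to consecutive PUA code points
# starting at U+E000. No font is needed because they are never displayed.
TAGS = ["<noun>", "<verb>", "<unknown-glyph>"]
TAG_TO_CHAR = {tag: chr(0xE000 + i) for i, tag in enumerate(TAGS)}

if __name__ == "__main__":
    for lo, hi in PUA_RANGES:
        print(f"U+{lo:04X}..U+{hi:04X}: {pua_size(lo, hi)} code points")
    for tag, ch in TAG_TO_CHAR.items():
        print(tag, f"U+{ord(ch):04X}", unicodedata.category(ch))
```

Any such assignment is a private agreement: other software will treat these code points as opaque, so the mapping has to travel with the data.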

Working with Unicode

Creating Your Own Fonts

• Bitmap (BDF)
  – Faster to create
  – One size per font, not so scalable
  – Works best with X-Windows (Unix)

• Outline (TrueType, PostScript, OpenType)
  – Takes more time
  – Scalable
  – MS Windows, Mac, Modern Unixes

Working with Unicode

Bitmap Editors

• Each letter is a matrix of pixels, like tiles

• You toggle them on or off to shape your letters

• GBDFED for recent GNOME/Linux

• XBDFED for general Unix

• Or search for “BDF Editor”

Working with Unicode

Bitmap Editors

Zoom View Within Edit Window

Working with Unicode

Outline Editors

• Create Bezier curves to outline scalable shapes

• Here traced around a scanned image

• FontForge http://fontforge.sf.net

Working with Unicode

Creating Your Own Keyboards

• No standard formats

• Different on every operating system

• May require some painful programming
  – transliteration may be a better alternative.

• For small amounts of typing try: Ctrl+Shift+X₁X₂X₃X₄, where X₁X₂X₃X₄ are the four hexadecimal digits of the character’s code point, e.g. Ctrl+Shift+1234
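To use hexadecimal entry you need the code point of each character. A small Python sketch (the function names are illustrative, not from the slides) converts in both directions:

```python
def hex_digits_for(ch):
    """Return the code point of one character as at least four hex digits."""
    return f"{ord(ch):04X}"


def char_from_hex(digits):
    """Return the character for a hexadecimal code point string."""
    return chr(int(digits, 16))


if __name__ == "__main__":
    for ch in "ñ\u1200\u1234":
        print(ch, hex_digits_for(ch))
```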

Working with Unicode

Creating Your Own Keyboards: Linux

• Migration Toward Smart Common Input Method (SCIM)
  – simple table based
  – more complex as needed
  – http://scim.sf.net
  – or Yudit, Emacs for older Unixes, but you can only type in these applications.

Working with Unicode

Creating Your Own Keyboards: Windows

• Keyman, most mature & robust

• Keyboards created with Keyman Developer
  – $59 academic and developing world license
  – worth every cent
  – compiled keyboards also run under Linux with a SCIM module
  – http://tavultesoft.com

Working with Unicode

Text Processing

• International Components for Unicode (ICU)
  – http://icu.sf.net
  – Java, C/C++
  – Bindings in: Python, Ruby, C#, Perl 6 (some Perl 5)
  – started by IBM, is OpenSource
  – managed by the Unicode president
  – check with ICU before

• 700+ Encoding Conversions
  – convert legacy systems to and from Unicode
  – migrate corpora to Unicode
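The conversion step can be illustrated without ICU itself. The sketch below uses Python's built-in codecs as a stand-in (an assumption made for brevity; ICU's converters cover far more encodings), and the function names are hypothetical. It decodes legacy bytes to Unicode and re-encodes them as UTF-8, which is the core of migrating a corpus.

```python
def to_utf8(raw, source_encoding):
    """Convert bytes in a legacy encoding to UTF-8 bytes."""
    text = raw.decode(source_encoding)  # legacy bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


def from_utf8(raw, target_encoding):
    """Convert UTF-8 bytes back to a legacy encoding."""
    return raw.decode("utf-8").encode(target_encoding)


if __name__ == "__main__":
    legacy = b"ma\xf1ana"  # "mañana" in ISO 8859-1 (Latin-1)
    print(to_utf8(legacy, "latin-1"))
```

A font-based "hack" encoding with no registered codec would need its own mapping table first; the decode and encode steps stay the same.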

Working with Unicode

Text Processing

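ICU: Normalization

• Equate letters and diacritical symbols

n (006E) + ˜ (0303) = ñ (00F1)

u (0075) + ¨ (0308) = ü (00FC)

A (0041) + ° (030A) = Å (00C5, canonically equivalent to 212B)

e (0065) + ^ (0302) + . (0323) = ệ (1EC7)

e (0065) + . (0323) + ^ (0302) = ệ (1EC7)

ê (00EA) + . (0323) = ệ (1EC7)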
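The equivalences on this slide can be reproduced with any Unicode normalization implementation. The sketch below uses Python's standard `unicodedata` module rather than ICU (an assumption made to keep the example self-contained); the helper names are illustrative.

```python
import unicodedata


def nfc(text):
    """Compose a string into Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a, b):
    """Two strings are canonically equivalent if their NFC forms match."""
    return nfc(a) == nfc(b)


if __name__ == "__main__":
    samples = {
        "n + 0303": "n\u0303",
        "u + 0308": "u\u0308",
        "A + 030A": "A\u030a",
        "e + 0302 + 0323": "e\u0302\u0323",
        "e + 0323 + 0302": "e\u0323\u0302",
        "00EA + 0323": "\u00ea\u0323",
    }
    for label, text in samples.items():
        composed = nfc(text)
        print(label, "->", " ".join(f"{ord(c):04X}" for c in composed))
```

Normalizing both the corpus and the search pattern to the same form before comparison avoids missing matches that differ only in how the diacritics were typed.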

Working with Unicode

Text Processing

ICU: Regular Expressions

• Applies the Unicode Character Database

• Categorize every character as one of
  – Letter
  – Number
  – Separator
  – Punctuation
  – Marks
  – Symbols
  – Others

• Subcategories within each. Examples
  – Letter: Uppercase, lowercase, Other, …
  – Symbols: Math, Currency, Modifiers, …
  – Mark: spacing, non-spacing, enclosing

• Defines 80 character property types
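The general categories listed above are exposed by most Unicode libraries. As a minimal sketch, Python's `unicodedata.category` returns the two-letter code from the Unicode Character Database: the first letter is the major class and the second the subcategory. The `MAJOR_CLASSES` table and `describe` helper are illustrative names, not part of ICU.

```python
import unicodedata

# First letter of the general category code -> major class from the slide.
MAJOR_CLASSES = {
    "L": "Letter",
    "N": "Number",
    "Z": "Separator",
    "P": "Punctuation",
    "M": "Mark",
    "S": "Symbol",
    "C": "Other",
}


def describe(ch):
    """Return (two-letter category, major class name) for one character."""
    category = unicodedata.category(ch)
    return category, MAJOR_CLASSES[category[0]]


if __name__ == "__main__":
    for ch in ["A", "a", "\u1200", "5", " ", ".", "\u0303", "$", "+"]:
        print(f"U+{ord(ch):04X}", *describe(ch))
```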

Working with Unicode

Text Processing

ICU: Regular Expressions

Set Operations

• [^\p{Letter}] Negation

• [\p{Letter}\p{Number}] Union

• [\p{Letter}&\p{script=Cyrillic}] Intersection

• [\p{Letter}-\p{Latin}] Difference

• Important for a character set the size of Unicode.
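The four set operations map directly onto ordinary set algebra over characters. The sketch below imitates them with Python sets over a small slice of the code space. Two assumptions are worth stating loudly: Python's standard `re` module has no `\p{...}` syntax, so properties are looked up through `unicodedata`, and the standard library has no script property, so the script test is approximated by the character name prefix. All helper names are hypothetical.

```python
import unicodedata


def chars_where(predicate, start=0x0020, stop=0x0530):
    """Collect the characters in [start, stop) that satisfy a predicate."""
    return {chr(cp) for cp in range(start, stop) if predicate(chr(cp))}


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def is_number(ch):
    return unicodedata.category(ch).startswith("N")


def in_script(ch, script):
    # Rough stand-in for a script property: character names in a script
    # block usually begin with the script name, e.g. "CYRILLIC CAPITAL ...".
    return unicodedata.name(ch, "").startswith(script + " ")


everything = chars_where(lambda ch: True)
letters = chars_where(is_letter)
numbers = chars_where(is_number)
cyrillic = chars_where(lambda ch: in_script(ch, "CYRILLIC"))
latin = chars_where(lambda ch: in_script(ch, "LATIN"))

not_letters = everything - letters      # like [^\p{Letter}]
letters_or_numbers = letters | numbers  # like [\p{Letter}\p{Number}]
cyrillic_letters = letters & cyrillic   # like the intersection example
non_latin_letters = letters - latin     # like the difference example

if __name__ == "__main__":
    print(len(letters), len(cyrillic_letters), len(non_latin_letters))
```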

Working with Unicode

Text Processing

ICU: Regular Expressions

• Enhanced Word Boundaries:

Classic RE: [Hello] [There]. [G]’[day] [123].[456]

Unicode Word Boundaries: [Hello] [There]. [G’day] [123.456]
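The difference between the two boundary styles can be seen with a short experiment. The sketch below uses Python's standard `re` module; the second pattern is only a rough approximation of Unicode word segmentation (it joins word characters across an apostrophe or full stop when more word characters follow) and is an assumption for illustration, not ICU's actual algorithm.

```python
import re

TEXT = "Hello There. G\u2019day 123.456"

# Classic behaviour: a word is a maximal run of \w characters, so the
# apostrophe and the decimal point both split their tokens.
CLASSIC_WORD = re.compile(r"\w+")

# Approximation of Unicode word boundaries: allow ' or ’ or . inside a
# word, but only when word characters continue on the other side.
JOINED_WORD = re.compile(r"\w+(?:['\u2019.]\w+)*")


def classic_words(text):
    return CLASSIC_WORD.findall(text)


def unicode_like_words(text):
    return JOINED_WORD.findall(text)


if __name__ == "__main__":
    print(classic_words(TEXT))
    print(unicode_like_words(TEXT))
```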

Working with Unicode

Text Processing

ICU: Regular Expressions

• Equivalence Classes
  – [=e=] matches all “e” [eèéêëēĕėęě]
  – not yet implemented
  – use Perl instead
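Where a regex engine lacks equivalence classes, one can be derived from canonical decomposition. This is a minimal Python sketch under that assumption: a character belongs to the class of "e" if its NFD form starts with "e". The function names and the candidate range are hypothetical choices for illustration.

```python
import re
import unicodedata

# Candidate characters to classify: Basic Latin letters through Latin Extended-B.
CANDIDATES = [chr(cp) for cp in range(0x0041, 0x0250)]


def base_letter(ch):
    """First code point of the canonical decomposition of a character."""
    return unicodedata.normalize("NFD", ch)[0]


def equivalence_class(base, candidates=CANDIDATES):
    """Compile a character class of all candidates that decompose to base."""
    members = [ch for ch in candidates if base_letter(ch) == base]
    return re.compile("[" + "".join(re.escape(ch) for ch in members) + "]")


E_CLASS = equivalence_class("e")

if __name__ == "__main__":
    print(E_CLASS.findall("r\u00e9sum\u00e9 \u011blegant ease"))
```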

Working with Unicode

Simple Plurals: [#7#]ች

vs

[ሆሎሖሞሦሮሶሾቆቦቮቶቾኆኖኞኦኮዖዞዦዮዶጆጎጦጮጶጾፆፎፖ]ች

Overloading Perl Regex with Regexp::Ethiopic

Working with Unicode

• /[#3#]ያ/
  – አንባቢያን
  – ሚያዚያ
  – ኢትዮጵያዊያን

• /[#3,6#]ያ/
  – አንባቢያን አንባብያን
  – ሚያዚያ ሚያዝያ
  – ኢትዮጵያዊያን ኢትዮጵያውያን

Overloading Perl Regex with Regexp::Ethiopic
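The `[#7#]`, `[#3#]` and `[#3,6#]` classes select Ethiopic syllables by their order (vowel form). The idea can be approximated outside Perl because the Ethiopic block lays most syllable series out in rows of eight. The Python sketch below relies on that layout and is an assumption for illustration, not how Regexp::Ethiopic is implemented: it also lumps the labialized series into the regular columns, which the Perl module may treat differently. The function name is hypothetical.

```python
import re
import unicodedata

ETHIOPIC_START = 0x1200  # first code point of the Ethiopic block
ETHIOPIC_END = 0x135A    # last syllable in the basic Ethiopic block


def syllables_of_order(*orders):
    """Build a regex character class of syllables in the given orders (1-8)."""
    wanted = {order - 1 for order in orders}
    chars = []
    for cp in range(ETHIOPIC_START, ETHIOPIC_END + 1):
        ch = chr(cp)
        # Skip unassigned code points and non-syllable characters.
        if not unicodedata.name(ch, "").startswith("ETHIOPIC SYLLABLE"):
            continue
        # Position within the row of eight gives the order.
        if (cp - ETHIOPIC_START) % 8 in wanted:
            chars.append(ch)
    return "[" + "".join(chars) + "]"


# Seventh-order syllable followed by U+127D, like /[#7#]ች/.
PLURAL = re.compile(syllables_of_order(7) + "\u127d")
# Third-order syllable followed by U+12EB, like /[#3#]ያ/.
THIRD = re.compile(syllables_of_order(3) + "\u12eb")
# Third- or sixth-order syllable followed by U+12EB, like /[#3,6#]ያ/.
THIRD_OR_SIXTH = re.compile(syllables_of_order(3, 6) + "\u12eb")

if __name__ == "__main__":
    word_a = "\u12a0\u1295\u1263\u1262\u12eb\u1295"  # spelling with third order
    word_b = "\u12a0\u1295\u1263\u1265\u12eb\u1295"  # spelling with sixth order
    for word in (word_a, word_b):
        print(word, bool(THIRD.search(word)), bool(THIRD_OR_SIXTH.search(word)))
```

A single pattern over orders 3 and 6 finds both spelling variants of the same word, which is the point of the second slide.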

Working with Unicode

Text Processing

ICU: Transliteration

• Defined by “transform rules”
  – One to one mappings:
    • α <> a;
    • β <> b;
  – Context Rules:
    • β } [aeiou] > b;
    • β } [^aeiou] > v;
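The way ordered rules with a following context behave can be sketched in a few lines. This is a toy Python engine, not ICU's transliterator: the rule table is hypothetical, and it assumes the context is tested against the untransliterated Greek vowels that follow, since the text to the right has not been converted yet when a rule fires.

```python
import re

GREEK_VOWELS = "αεηιουω"

# Ordered rules: at each position the first pattern that matches wins.
RULES = [
    (re.compile(f"β(?=[{GREEK_VOWELS}])"), "b"),  # β before a vowel > b
    (re.compile("β"), "v"),                       # β anywhere else > v
    (re.compile("α"), "a"),                       # one-to-one mapping
]


def transliterate(text):
    """Apply the first matching rule at each position, left to right."""
    out = []
    pos = 0
    while pos < len(text):
        for pattern, replacement in RULES:
            match = pattern.match(text, pos)
            if match:
                out.append(replacement)
                pos = match.end()
                break
        else:
            out.append(text[pos])  # no rule applies: copy unchanged
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("βα"), transliterate("αβ"))
```

Rule order matters: the context-sensitive rule must come before the unconditional fallback, otherwise the fallback would always win.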

Working with Unicode

Text Processing

ICU: Transliteration

• Defined by “transform rules”
  – Applying UCD Properties
    • Θ } [:LowercaseLetter:] <> Th;
    • Θ <> TH;
  – Reverse Transliteration Context Rules
    • σ < [:^Letter:] { s } [:^Letter:] ;
    • ς < s } [:^Letter:] ;
    • σ < s ;
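The three reverse rules for sigma amount to a small decision on the neighbours of each Latin "s". The Python sketch below restates them as code; it is an illustration rather than ICU behaviour, and it assumes that the start and end of the string count as non-letters. The function names are hypothetical.

```python
import unicodedata


def is_letter(ch):
    """True if the character has a Letter general category."""
    return unicodedata.category(ch).startswith("L")


def reverse_sigma(text):
    """Turn each Latin 's' into the appropriate form of Greek sigma."""
    out = []
    for i, ch in enumerate(text):
        if ch != "s":
            out.append(ch)
            continue
        prev_is_letter = i > 0 and is_letter(text[i - 1])
        next_is_letter = i + 1 < len(text) and is_letter(text[i + 1])
        if not prev_is_letter and not next_is_letter:
            out.append("σ")  # isolated s: non-letters on both sides
        elif not next_is_letter:
            out.append("ς")  # word-final s: final sigma
        else:
            out.append("σ")  # everywhere else: medial sigma
    return "".join(out)


if __name__ == "__main__":
    print(reverse_sigma("sos"), reverse_sigma("a s b"))
```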

Working with Unicode

Text Processing

• ICU: Transliteration
  – Gets much more sophisticated

• See also Perl’s Text::Transliterate

Working for Unicode

Taking Your Work a Step Further

• You’ve helped create an orthography
  – now make it official.

• You’ve worked with a pre-existing un-encoded script using the PUA
  – now formalize it.

• You’ve created a transliteration system
  – make it an ISO standard.

• You’ve identified a dialect
  – encode it in ISO 639.

• You’ve developed a keyboard
  – make it a national standard.

• etc.

Working for Unicode

Why go the extra mile (kilometer)?

• Ethnic pride and identity is promoted.

• Literacy efforts can be encouraged.

• The study of historic scripts is kept alive.

• Communication between and amongst members of the community is promoted.

• Government communication in times of emergency (disease, war, natural disaster).

• Leads to localization, greater access to ICT.

• …and you become the expert!

Working for Unicode

What to Consider

• The work will be more social than technical.

• The work will take years (at least two).

• Review Encoding History
  – Has this been attempted before and failed? Why?
  – Are there any non-Unicode encodings?

• Determine the Stakeholders
  – The Government: will they support you, oppose you, jail you?
  – Political Parties, Religious, Education, Cultural Groups
    • does anyone have something to lose by the encoding?

• Communicate, Communicate, Communicate…
  – and be transparent.
  – the perception of being closed breeds suspicion and opposition.
    • …even 11 years after the fact, trust me on this.

Working for Unicode

New Keyboard?

• No international standardization working groups

• Contribute Keyboard back to main project

• Contact Local ICT Professionals Organization

• Contact Local University CS Department

• Contact Local Standards Body

Working for Unicode

New Language or Dialect?

• Contact the ISO/DIS 639-3 Registration Authority
  – http://sil.org/iso639-3/
  – iso639-3@sil.org

• Contact Language or Cultural Authority

• Contact Local University Linguistics Department

Working for Unicode

New Orthography? Or Un-encoded?

• Contact the ISO 15924 Registration Authority
  – http://unicode.org/iso15924/

• Contact Language or Cultural Authority

• Contact Local ICT Professionals Organization

• Contact Local University CS Department

• Contact Local University Linguistics Department

• Contact Local Standards Body

• Contact the Script Encoding Initiative

Working for Unicode

The Script Encoding Initiative

• http://linguistics.berkeley.edu/sei

• Works with users on script proposals.

• Helps raise money for script proposals to be written and free fonts to be created.

• Works collaboratively with other groups (e.g. SIL) to avoid duplication of effort.

• Helps seek experts to review proposals.

• Participates at standards meetings on behalf of minority groups and scholars.

~fini~

• Conclusion
  – Use Unicode Now!
  – You can do it!
  – Yes you can do it!
  – There are no excuses anymore…
  – …it’s 2006 already, I’m telling you, you can do this!
  – and when you do (remember I have faith in you!) consider feeding back into the system via standardization.
  – Be a good citizen of earth, always ☺.

Thank You for Listening.

Are There Any Questions?

This presentation: http://yacob.org/papers/