Computing in Vietnamese: Progress & Challenges
{James} ĐỖ Bá Phước
杜 伯 福
IMUG 2005-05-19
Copyright©2005 by JDo. All rights reserved.
Overview
Vietnamese writing
  Latin: Quốc ngữ
  Ideographic: Chữ Nôm
Considerations
  Repertoire
  Character encoding
  Input methods
  Fonts
Quốc ngữ {National script}
Orthographic units
Vowels: a ă â e ê i o ô ơ u ư y
Consonants: b c d đ g h k l m n p q r s t v x
Tone marks: ò ỏ õ ó ọ
A vowel can combine with one tone mark
192 characters, 6 “too many”
A total of 192 upper- and lower-case “pre-composed” characters
  6 characters beyond 8-bit character set
6 characters missing from original ISO 10646 repertoire (in 1988)
  Restored through Unicode-10646 merger
43 8-bit character sets
Pre-composed
  Dual fonts
    TCVN 5712:1995, aka “ABC”
    (TCVN = Tiêu chuẩn Việt Nam {Vietnam Standard})
  Glyph overlap
    VNI
Combining
  Windows Vietnamese (cp-1258)
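The "combining" approach of cp-1258 can be illustrated with Python's standard `cp1258` codec. This is only a sketch: the helper name `fits_in_cp1258` is made up for the example, and the word used is the "viết" example from later in the talk. It shows that the tone mark is stored as a separate byte, and that the fully pre-composed Unicode letter has no cp-1258 byte of its own.

```python
import unicodedata

# "viết" spelled the cp-1258 way: ê is a single character and the acute
# tone mark is a separate combining character (U+0301).
combining_form = "vi\u00ea\u0301t"

legacy_bytes = combining_form.encode("cp1258")   # one byte per code point
round_trip = legacy_bytes.decode("cp1258")       # still the combining form
precomposed = unicodedata.normalize("NFC", round_trip)


def fits_in_cp1258(text):
    """Return True if every character of `text` has a cp1258 byte."""
    try:
        text.encode("cp1258")
        return True
    except UnicodeEncodeError:
        return False


print(len(legacy_bytes), len(round_trip), len(precomposed))
print(fits_in_cp1258(combining_form), fits_in_cp1258(precomposed))
```

Normalizing to NFC after decoding is one way to bring such legacy text in line with pre-composed Unicode data.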
Unicode
Encodes both:
  Combining characters (from Unicode)
  Pre-composed characters (from ISO)
Getting more widely supported
  Wide acceptance after and for the Web
TCVN 6909:2001
  Pre-composed characters only
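Because Unicode encodes both forms, the same word can exist as two different code point sequences. A small sketch using Python's standard `unicodedata` module shows the difference and how normalization reconciles it; the helper name `same_text` is an assumption for this example.

```python
import unicodedata

precomposed = "vi\u1ebft"         # ế as a single code point (U+1EBF)
combining = "vie\u0302\u0301t"    # e + combining circumflex + combining acute


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == combining)          # raw comparison
print(len(precomposed), len(combining))  # number of code points in each form
print(same_text(precomposed, combining))
```

Software that searches, sorts, or compares Vietnamese text generally needs a step like this, since an input method may emit either form.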
Vietnamese writing
Handwriting
  v i e ^ t ´ → viết {write}
  v i e ^ ´ t → viết {write}
Telex
Typewriter
  Dead keys: the carriage stops while the combination of diacritics is typed, then advances with the base letter
Computer
Telex convention
Vowels: a ă â e ê i o ô ơ u ư y
  ă = aw, â = aa, ê = ee, ô = oo, ơ = ow, ư = uw
Consonants: b c d đ g h k l m n p q r s t v x
  đ = dd
Tone marks: ò ỏ õ ó ọ
  Keys: f r x s j
Example: vieets → viết
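A minimal sketch of how a converter could apply the Telex key assignments above. It assumes lowercase input, one word at a time, and a simplified rule that places the tone on the last vowel typed so far; real input methods use more careful placement rules. The names `telex_word`, `MODIFIERS`, and `TONES` are illustrative and not taken from any existing tool.

```python
import unicodedata

VOWELS = set("aăâeêioôơuưy")
# Two-key sequences that produce a modified letter.
MODIFIERS = {"aw": "ă", "aa": "â", "ee": "ê", "oo": "ô",
             "ow": "ơ", "uw": "ư", "dd": "đ"}
# Tone keys mapped to combining grave, hook above, tilde, acute, dot below.
TONES = {"f": "\u0300", "r": "\u0309", "x": "\u0303",
         "s": "\u0301", "j": "\u0323"}


def telex_word(word):
    """Convert one lowercase Telex-typed word to pre-composed Unicode."""
    out = []  # one entry per letter, possibly followed by a combining tone
    for key in word:
        last_vowel = max(
            (i for i, unit in enumerate(out) if unit[0] in VOWELS),
            default=None,
        )
        if key in TONES and last_vowel is not None:
            out[last_vowel] += TONES[key]       # tone goes on the last vowel
        elif out and out[-1] + key in MODIFIERS:
            out[-1] = MODIFIERS[out[-1] + key]  # e.g. "e" + "e" becomes "ê"
        else:
            out.append(key)                     # ordinary letter
    return unicodedata.normalize("NFC", "".join(out))


print(telex_word("vieets"))
print(telex_word("muwa"))
```

A tone key typed before any vowel is treated as an ordinary consonant, which is why words such as "ra" pass through unchanged.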
Computer input methods
Telex
VNI
VIQR (VIetnamese Quoted-Readable)
  Mnemonic
  Internet RFC 1456
TCVN 6064:1995
  Orthographic units
VIQR, aka VietNet
Vowels: a ă â e ê i o ô ơ u ư y
  ă = a(, â = a^, ê = e^, ô = o^, ơ = o+, ư = u+
Consonants: b c d đ g h k l m n p q r s t v x
  đ = dd
Tone marks: ò ỏ õ ó ọ
  Keys: ` ? ~ ' .
Example: vie^'t → viết
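The VIQR mnemonics above lend themselves to a simple left-to-right decoder. The sketch below is an illustration only: it assumes the ASCII apostrophe for the acute tone, applies a tone only when it directly follows a vowel, and does not implement the escaping that real VIQR text needs to tell a tone mark from ordinary punctuation. The function name `viqr_to_unicode` is hypothetical.

```python
import unicodedata

VOWELS = set("aăâeêioôơuưyAĂÂEÊIOÔƠUƯY")
# Letter mnemonics (lowercase only, to keep the sketch short).
MODIFIERS = {"a(": "ă", "a^": "â", "e^": "ê", "o^": "ô",
             "o+": "ơ", "u+": "ư", "dd": "đ"}
# Tone mnemonics mapped to combining grave, hook above, tilde, acute, dot below.
TONES = {"`": "\u0300", "?": "\u0309", "~": "\u0303",
         "'": "\u0301", ".": "\u0323"}


def viqr_to_unicode(text):
    """Decode VIQR mnemonics into pre-composed Unicode (no escape handling)."""
    out = []
    for ch in text:
        prev = out[-1] if out else ""
        if ch in TONES and prev and prev[0] in VOWELS:
            out[-1] = prev + TONES[ch]        # tone directly after a vowel
        elif prev + ch in MODIFIERS:
            out[-1] = MODIFIERS[prev + ch]    # e.g. "e" + "^" becomes "ê"
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


print(viqr_to_unicode("vie^'t"))
print(viqr_to_unicode("Vie^.t Nam"))
```

Note the limitation this exposes: a sentence-ending period right after a vowel would be read as a dot-below tone, which is the kind of ambiguity a full implementation has to resolve.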
Are diacritics necessary?
Ma: ghost
Mà: but
Mả: tomb
Mã: code
Má: mother
Mạ: rice seedling
Mua: to buy
Mưa: rain
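The word list above can be turned into a quick demonstration of what is lost when diacritics are dropped. This sketch uses Python's standard `unicodedata` module; the function name `strip_diacritics` is an assumption for the example.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose the word and drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


tone_words = ["ma", "mà", "mả", "mã", "má", "mạ"]
print({strip_diacritics(w) for w in tone_words})      # six words, one spelling
print(strip_diacritics("mua"), strip_diacritics("mưa"))
```

Six distinct words collapse into a single spelling, and "to buy" becomes indistinguishable from "rain".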
TCVN 6064:1995
Corresponds to orthographic units
  Closer to handwriting
Native to:
  Windows XP
    Emits combining character sequences
  Mac OS X
    Emits pre-composed characters in OS X 10.4 Tiger
    Earlier versions emitted combining character sequences
TCVN 6064:1995
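Base (US-style) layout
  ` 1 2 3 4 5 6 7 8 9 0 - = BkSp
  Tab q w e r t y u i o p [ ]
  CapsLock a s d f g h j k l ; ' \ Enter
  Shift z x c v b n m , . / Shift
  Control Alt AltGr Control
TCVN 6064:1995 layout
  ` ă â ê ô ◌̀ ◌̉ ◌̃ ◌́ ◌̣ đ - ₫ BkSp
    (keys 5 to 9: grave, hook above, tilde, acute, dot below)
  Tab q w e r t y u i o p ư ơ
  CapsLock a s d f g h j k l ; ' \ Enter
  Shift z x c v b n m , . / Shift
  Control Alt AltGr Control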
Input software
http://unikey.sourceforge.net
  Different input conventions
  Multiple character encodings
  Clipboard converter
    Extremely convenient for handling documents in legacy encodings
  Free, lightweight, powerful
    Use from your USB memory device!
Localization
Locale
Translation of computer terminology
Windows XP and Office 2003 SE
  Using LIP (Language Interface Pack)
Progress
Vietnamese web sites are almost universally in Unicode
  Exception: http://www.vietmercury.com
    This site never shows up in Vietnamese web searches!
Search
  Web: Google, Yahoo!, MSN
  Desktop: Google, MSN
Blogs, wikis, ...
Desktop & server applications
Progress & Challenges
Unicode-savvy?
  Yahoo! Mail, AOL, AIM Mail: charset “iso8859-1”
  Eudora
Unicode-savvy!
  Outlook Express, Outlook, Thunderbird
  Gmail, Netscape Mail
  Yahoo! Messenger, MSN Messenger, Skype
Not enough Unicode fonts with Vietnamese
  Example: Trebuchet MS (which reverts to Arial)
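The effect of a mail service labelling Vietnamese text with charset “iso8859-1” can be reproduced in a few lines. This is a sketch of the general failure mode, assuming the message bytes are actually UTF-8; it does not describe how any particular mail product behaves internally.

```python
original = "vi\u1ebft"                      # "viết"
utf8_bytes = original.encode("utf-8")       # what the sender transmits

# A receiver that trusts a wrong "iso-8859-1" label decodes byte by byte.
garbled = utf8_bytes.decode("iso-8859-1")

# The damage is reversible only while the wrongly decoded text is intact.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired == original)
```

The single letter ế turns into three unrelated Latin-1 characters, which is the garbage the reader ends up seeing.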
Challenges
User education
  GIGO (garbage in, garbage out)
    Any non-Unicode string
Legacy
  Encodings
    Remove all non-Unicode fonts
No physical standard keyboard
Chữ Nôm {Demotic script}
Chữ Nôm
Started to appear in the 10th century, after a thousand years of Chinese rule
Based on Chinese characters
In use for the next thousand years
Now replaced by Quốc ngữ
Nôm example
English: pillar
Vietnamese, Quốc ngữ (Latin script): cột
Vietnamese, Nôm (ideographic script): [character not preserved]
  Sound: cốt
  Meaning: mộc {tree}
Nôm dictionaries
1971 ~ Tự điển chữ Nôm {Nôm Dictionary}, Nguyễn Quang Xỹ & Vũ Văn Kính
1988 ~ Chu-Nomu Jiten 字 字 , Takeuchi Yonosuke
1999 ~ Đại từ điển chữ Nôm {Nôm Super-Dictionary}, Vũ Văn Kính
2004 ~ Giúp đọc Nôm và Hán-Việt {Nôm & Hán-Việt Reading Guide}, Father Anthony Trần Văn Kiệm
Soon ~ Từ điển chữ Nôm tiếng Việt {Vietnamese Nôm Dictionary}, Nguyễn Quang Hồng
Nôm standard encoding
First proposed in 1992
Unicode 3.1
  9,299 characters, of which
    5,067 characters in BMP (CJKV Extension A)
    4,232 “Nôm proper” in Plane 2 (CJKV Extension B)
IRG: CJKV Extension C
  About 2,200 additional characters
(IRG = Ideographic Rapporteur Group)
(CJKV = Chinese, Japanese, Korean, Vietnamese)
Nôm input methods
http://www.viethoc.com/hannom/bango_intro.php
  Source
    6 dictionaries
    Other databases
  Currently available for input
    16,638 Chinese characters (from 22,975 possible)
    11,600 Nôm characters (from 20,732 possible)
HanoKey
  HanoSoft
Nôm fonts
9,299 characters
  Mojikyo Institute, Tokyo
  Dynalab, Taipei: DFSong Light Vietnam
30,000+ characters
  Đạo Uyển, Viên Chiếu Monastery: HanNom A & B
17,000+ characters
  Nôm Na Group, Hà Nội: Nôm Na Tong Light
  4,415 basic Hán-Nôm components
Nôm online
http://www.viethoc.com/hannom/tdnom_beta.php
  Nôm Annotated Dictionary
  Uses HanNom fonts, Java applet
http://nomfoundation.org/nomdb/lookup.php
  Nôm Lookup Tool
  Uses .GIF images (was SVG)
http://www.huesoft.com.vn/hannom/
http://sager-pc.cs.nyu.edu/~huesoft/
  Việt-Hán-Nôm Dictionary
Challenges
Repertoire
  Newly discovered characters
  No coordination between active groups
Character encodings
  Slow international standardization process
  Private Use Area
  No coordination between active groups
Challenges
Input methods
  Large character repertoire
Fonts
  Unicode surrogates
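The surrogate issue arises because Plane 2 characters do not fit in a single 16-bit UTF-16 code unit. The sketch below computes the surrogate pair for a code point; the function name `utf16_code_units` is illustrative, and U+20000 is used only as an arbitrary Plane 2 code point, not as a specific Nôm character.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as integers) for one character."""
    cp = ord(ch)
    if cp < 0x10000:
        return [cp]                      # BMP: one 16-bit unit
    offset = cp - 0x10000                # 20 bits split into two halves
    high = 0xD800 + (offset >> 10)       # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)      # low (trail) surrogate
    return [high, low]


bmp_char = "\u1ebf"           # ế, inside the BMP
plane2_char = "\U00020000"    # first code point of Plane 2

print([hex(u) for u in utf16_code_units(bmp_char)])
print([hex(u) for u in utf16_code_units(plane2_char)])
```

Fonts and rendering software that assume one 16-bit unit per character will mishandle the second case, which is why Plane 2 support was a practical hurdle.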
Other options
Presentation
  SVG (Scalable Vector Graphics)
  CDL (Character Description Language)
    http://www.wenlin.com/cdl/
Wrapup
Brief history
10th century ~ first Nôm writing
1651 ~ first Latin-based dictionary
1910 ~ Quốc ngữ adopted nationally
1991 ~ Quốc ngữ orthographic units in Unicode 1.0
1993 ~ RFC 1456 (VIQR)
1993 ~ Quốc ngữ pre-composed in Unicode 1.1, ISO/IEC 10646-1
1995 ~ TCVN 5712 (8-bit), 5773 (Chữ Nôm)
1995 ~ TCVN 6064 (keyboard)
2000 ~ Chữ Nôm in Unicode 3.1
2001 ~ TCVN 6909 (Unicode)
2004 ~ First International Nôm Conference
2005 ~ Vietnamese Windows XP and Office 2003 SE
Dichotomies (& synergy?)
Latin, ideographic
Different encodings
Combining, pre-composed
Telex, VNI, TCVN 6064
Fonts
Active working groups
Challenges
Standardization (columns: Qn, Nôm)
  Repertoire ☺
  Character encoding ☺
  Input methods
  Fonts ☺
Challenges
Usage (columns: Qn, Nôm)
  Legacy ☺
  Application support ☺
  User education
Ultimately
To make Vietnamese like any other language (such as English) in computers
Goal: an ordinary user of Vietnamese on computers should not have to know about UTF-8 or character encodings at all
Thanks!
http://vietual.blogspot.com
Acknowledgements
With thanks to:
  Roger Sherman, David Murphy, James Turley
    for enabling the presentation at IMUG and on the web
  Hồ Văn Tiến, Ngô Thanh Nhàn, Ken Lunde, Tex Texin
    for comments and corrections
  The IMUG (International Macintosh Users Group) audience
    for interesting questions and a very lively exchange
Q & A