Making Sense of Language Tags: 10th Metadata Open Forum
Making Sense of Language Tags
10th Metadata Open Forum
Presenter
Addison Phillips
Globalization Architect, Yahoo!
Chair, W3C Internationalization Core Working Group
Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 4646, RFC 4647, RFC 4646bis)
Languages, Language Tags, and Locales (oh my!)
Identifying language (and locale): the challenge
ISO 639
IETF BCP 47
– RFC 4646, RFC 4647
– RFC 4646bis
Challenges for users
Human Language as Metadata
Some data is just data, but some data is human-readable text.
Text processing depends on language:
– spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation.
IT systems depend on language negotiation:
– localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation
Human Language
"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"
(Mark Twain, Pudd'nhead Wilson)
Identifying Languages
Languages don’t form nice hierarchies
– “splitters” vs “lumpers”
– dialects, subdialects, regional and stylistic differences, patois
Differing communities with different needs
– terminology, librarians, computer systems, translators, etc.
In the Beginning (ca. 1980 CE)
Received Wisdom from the Dark Ages
Locales:
– japanese, french, german, C
– ENU, FRA, JPN
– ja_JP.PCK
– AMERICAN_AMERICA.WE8ISO8859P1
Languages…
… looked a lot like locales (and vice versa)
ISO 639
Defines language identifier codes
Multiple parts:
– ISO 639-1 (alpha-2 codes; 676 possible values) (136 codes)
– ISO 639-2 (alpha-3 codes; 17,576 possible values) (about 500)
– ISO 639-3 (alpha-3 codes) (about 7000)
– ISO 639-4 (principles for encoding)
– ISO 639-5 (language families)
– ISO 639-6 (alpha-4 codes) (under development)
Impact of ISO 639-3
ISO 639-2 and 639-3 share a codespace
– all 639-2 codes are also 639-3 codes
– Macrolanguages
Human Language
"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"
(Mark Twain, Pudd'nhead Wilson)
en
Parallel Efforts
ISO 639
– ISO 639-1 (early 1980s)
– ISO 639-2 (alpha-3)
– ISO 639-3
IETF BCP 47
– RFC 1766 (1995)
– RFC 3066 (2001)
– RFC 4646 (2006)
– RFC 4646bis (2007)
BCP 47
Internet Engineering Task Force (IETF) “Best Current Practice” (BCP)
Enable presentation, selection, and negotiation of content in protocols and formats
– Widely used!
XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, Perl, Apache, IE, Mozilla…
Adds Granularity
Need to identify language on varying levels of mutual intelligibility and granularity
"Ole Missus, de house of plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"
(Mark Twain, Pudd'nhead Wilson)
en
en-US
What’s a Locale?
– “a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences.”
java.util.Locale
.NET Culture
LANG (setlocale in C, C++)
NLS_LANG in Oracle
… and so on…
Locales? Huh?
Theatre Center News: The date of the last version of this document was 2003 年 3 月 20. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.
Locale Identifiers
Different ideas:
– “Accept-Locale” vs. Accept-Language
– URIs/URNs, etc.
– CLDR/LDML
And Requirements:
– Operating environments and harmonization
– App Servers
– Web Services
New Solution? Cost of Adoption:
– UTF-8 to the browser: 8 long years
Locales and Language Tags meet
We really need locale identifiers.
Language tags are being (ab)used as locale identifiers anyway…
Not going to need a big new thing…
… we can do this really fast…
Yeah, we’ll write an RFC
IUC23, March 2003
BCP 47 (Historic) Basic Structure
Alphanumeric (ASCII only) subtags
Up to eight characters long
Separated by hyphens
Case not important (i.e. zh = ZH = zH = Zh)
1*8alphanum *( "-" 1*8alphanum )
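The historic grammar on this slide is small enough to check with one regular expression. The following is a minimal sketch in Python, assuming the slide's grammar as written (hyphen-separated ASCII alphanumeric subtags of one to eight characters); the function names `is_historic_wellformed` and `same_tag` are illustrative and do not come from any RFC.

```python
import re

# Slide grammar: hyphen-separated ASCII alphanumeric subtags,
# each between 1 and 8 characters long.
HISTORIC_TAG = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_historic_wellformed(tag: str) -> bool:
    """Return True if the whole string fits the historic tag shape."""
    return HISTORIC_TAG.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Case is not significant, so compare tags after lowercasing."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for candidate in ["zh-TW", "i-klingon", "en_US", "extremelylong-US"]:
        print(candidate, is_historic_wellformed(candidate))
    print(same_tag("zh", "ZH"))
```

This only checks shape. It says nothing about whether the subtags mean anything, which is one of the gaps the later slides describe.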
RFC 1766
zh-TW
– zh: ISO 639-1 (alpha-2)
– TW: ISO 3166 (alpha-2)
i-klingon
– Registered value
RFC 3066
sco-GB
– sco: ISO 639-2 (alpha-3 codes)
But use alpha-2 codes when they exist
– en-GB, not eng-GB
Problems
Script Variation:
– zh-Hant/zh-Hans
– (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)
Obsolescence of registrations:
– art-lojban (now jbo), i-klingon (now tlh)
Instability in underlying standards:
– sr-CS (CS used to be Czechoslovakia)
Lack of a single authoritative, stable source
And More Problems
Lack of scripts
Little support for registered values in software
Reassignment of values by ISO 3166
Lack of consistent tag formation (Chinese dialects?)
Standards not readily available, bad references
Bad implementation assumptions
– 1*8alphanum *( "-" 1*8alphanum )
– 2*3ALPHA [ "-" 2ALPHA ]
Many registrations to cover small variations
– 8 German registrations to cover two variations
LTRU and RFC 4646
Defines a generative syntax
– machine readable
– future proof, extensible
Defines a single source (IANA Language Subtag Registry)
– Stable subtags, no conflicts
– Machine readable
Defines when to use subtags
– (sometimes)
Anatomy of a Language Tag
sl-Latn-IT-rozaj-1994-x-mine
– sl: ISO 639-1/2 (alpha-2/3)
– Latn: ISO 15924 script codes (alpha-4)
– IT: ISO 3166 (alpha-2) or UN M.49
– rozaj-1994: Registered variants
– x-mine: Private Use and Extension
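Because subtags are identified by length and content, a tag such as the one on this slide can be split into its parts without consulting the registry. Below is a minimal sketch in Python of a well-formedness parser. It is deliberately simplified: it assumes a two- or three-letter primary language subtag and ignores extended language subtags, grandfathered registrations, and tags made up only of private-use subtags. The names `LANGTAG`, `EXTENSION`, and `parse_tag` are illustrative.

```python
import re

# Simplified tag shape: language, then optional script, region,
# variants, extensions, and private use, in that order.
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3})                                # ISO 639 alpha-2/3
    (?:-(?P<script>[a-z]{4}))?                              # ISO 15924 alpha-4
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # ISO 3166 or UN M.49
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # registered variants
    (?P<extensions>(?:-[a-wyz0-9](?:-[a-z0-9]{2,8})+)*)     # singleton + subtags
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?                 # private use
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Picks individual extensions out of the matched extension run.
EXTENSION = re.compile(r"[a-wyz0-9](?:-[a-z0-9]{2,8})+", re.IGNORECASE)


def parse_tag(tag: str) -> dict:
    """Split a tag into named parts, using the conventional casing."""
    m = LANGTAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"not well-formed under this simplified grammar: {tag!r}")
    return {
        "language": m["language"].lower(),
        "script": m["script"].title() if m["script"] else None,
        "region": m["region"].upper() if m["region"] else None,
        "variants": [v.lower() for v in m["variants"].split("-") if v],
        "extensions": [e.lower() for e in EXTENSION.findall(m["extensions"])],
        "private_use": m["private"].lower() if m["private"] else None,
    }


if __name__ == "__main__":
    print(parse_tag("sl-Latn-IT-rozaj-1994-x-mine"))
```

A parser like this only establishes that a tag is well-formed. Checking that each subtag is actually registered would need the IANA Language Subtag Registry.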
More Examples
fr, de, nl, en, ja
fr-FR, fr-CA, de-DE, de-CH…
es-419 (Spanish for Americas)
en-US (English for USA)
de-CH-1996 (Old tags are all valid)
sl-rozaj-1994 (Multiple variants)
zh-t-wadegile (Extensions)
Solves the Script problem
zh-Hant (!= zh-TW)
zh-Hans (!= zh-CN)
Azerbaijani (az)
– Arab, Cyrl, Latn
Serbian (sr)
– Cyrl, Latn
Yiddish (yi)
– Hebr, Latn
Mongolian (mn)
– Cyrl, Latn, Hani
Belarusian (be)
– Cyrl, Latn
Etc.
Benefits
Subtag registry in one place: one source, machine-readable
Subtags identified by length/content
Extensible
Compatible with RFC 3066 tags
Stable: subtags are forever
Tag Choice
“Tag Content Wisely”
– use the shortest tag reasonable
– use as many subtags as necessary to disambiguate
– don’t invent things; use the registry
– map deprecated values to modern equivalents
Specialized Codes
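zxx (no linguistic content)
und (undetermined language)
mis (miscellaneous or uncoded languages)
Zxxx (script code for unwritten content)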
Problems
Matching
– Does “en-US” match “en-Latn-US”?
Tag Choices
– Users have more to choose from.
Implementations
– More to do, more to think about
– (easier to parse, process, support the good stuff)
Tag Matching (RFC 4647)
Uses “Language Ranges” in a “Language Priority List” to select sets of content according to the language tag
Three Schemes
– Basic Filtering
– Extended Filtering
– Lookup
Tags are not Tokens!
Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure
<span class="foo" xml:lang="en-US" />
.foo:lang(en) { color: red; }
Accept-Language: zh;q=1.0, de-DE;q=0.8
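An Accept-Language header is itself structured: it carries a list of language ranges, each with an optional quality weight. The sketch below, in Python, turns such a header into a priority list. It assumes the standard HTTP form in which ranges are separated by commas and the weight is given as a `q` parameter after a semicolon; the function name `parse_accept_language` is illustrative, and treating an unparseable weight as 0 is a choice made for this example.

```python
def parse_accept_language(header: str) -> list:
    """Return (range, weight) pairs ordered from most to least preferred."""
    ranges = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue  # skip empty list elements
        weight = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    weight = float(value)
                except ValueError:
                    weight = 0.0  # assumption: treat a bad weight as "not wanted"
        ranges.append((parts[0], weight))
    # sorted() is stable, so ranges with equal weight keep header order.
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(parse_accept_language("zh;q=1.0, de-DE;q=0.8"))
```

The resulting list can be fed, most preferred range first, into one of the matching schemes from RFC 4647.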
Filtering
Ranges specify the least specific item
– “en” matches “en”, “en-US”, “en-Brai”, “en-boont”
Basic matching uses plain prefixes
– “en-US” matches “en-US” or “en-US-boont” but not “en-Latn-US”
Extended matching can match “inside bits”
– “en-*-US”
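The difference between the two filtering schemes is easiest to see side by side. The following Python sketch follows the basic and extended filtering procedures as described in RFC 4647; it is an illustration, not a reference implementation, and the function names `basic_filter` and `extended_filter` are made up for this example.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Basic filtering: the range must equal the tag or be a prefix of it
    that ends on a subtag boundary. The range "*" matches everything."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_filter(language_range: str, tag: str) -> bool:
    """Extended filtering: subtags of the range must appear in the tag in
    order, but the tag may carry extra subtags in between."""
    r = language_range.lower().split("-")
    t = tag.lower().split("-")
    if r[0] != "*" and r[0] != t[0]:
        return False  # the first subtags must agree
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1           # a wildcard matches any run of subtags
        elif ti >= len(t):
            return False      # the tag ran out before the range did
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False      # never skip past a singleton (extension, x)
        else:
            ti += 1           # skip an extra subtag in the tag
    return True


if __name__ == "__main__":
    print(basic_filter("en-US", "en-Latn-US"))       # plain prefix test
    print(extended_filter("en-*-US", "en-Latn-US"))  # matches "inside bits"
```

Both functions return a yes/no answer for a single tag, so filtering a collection means keeping every tag for which the answer is yes. That is why filtering can return many items, in contrast to lookup.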
Lookup
Range specifies the most specific tag in a match.
Returns exactly one item.
– “en-US” might return either “en” or “en-US” but not “en-US-boont”
Mirrors the locale fallback mechanism and many language negotiation schemes.
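Lookup can be sketched as repeated truncation: try the range as given, then drop subtags from the end until something available matches. The Python below follows the lookup procedure described in RFC 4647 under a few stated assumptions: the available tags are given as a plain list, a bare "*" range is skipped, and the caller supplies the default. The name `lookup` and its parameters are illustrative.

```python
def lookup(priority_list, available, default=None):
    """Return the single best available tag for a language priority list."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # wildcards carry no information for lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()  # fall back by removing the last subtag
            # Do not leave a dangling singleton such as "x" at the end.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default


if __name__ == "__main__":
    resources = ["zh", "zh-Hans", "en"]
    print(lookup(["zh-Hans-SG"], resources, default="root"))
```

Unlike filtering, this never returns a tag more specific than the range, so a request for "en-US" cannot be answered with "en-US-boont".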
Lookup and Language Negotiation
Resources “fall back” to find the best match
Global Binary
Resources (falling back from top to bottom):
zh-Hans-SG (Chinese, Simplified script, Singapore)
zh-Hans (Chinese, Simplified script)
zh (Chinese)
(root)
What Do I Do (Content Author)?
Not much.
– Existing tags are all still valid: tagging is mostly unchanged.
– Resist temptation to (ab)use the private use subtags.
Unless your language has script variations:
– Tag content with the appropriate script subtag(s)
Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.
What Do I Do (Programmer)?
Check code for compliance with 4646
– Decide on well-formed or validating
– Implement suppress-script
– Change to using the registry
– Bother infrastructure folks (Java, MS, Mozilla, etc.) to implement the standard
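One item on this list, suppress-script, can be sketched briefly. The registry marks some languages with a Suppress-Script value, meaning the script is assumed and should normally be left out of the tag. The Python below is a rough illustration, assuming a tiny hand-copied table; a real implementation would load the values from the IANA Language Subtag Registry. The names `SUPPRESS_SCRIPT` and `drop_suppressed_script` are hypothetical.

```python
# Assumption: a three-entry excerpt for illustration only. The full set of
# Suppress-Script values lives in the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn", "ja": "Jpan", "ru": "Cyrl"}


def drop_suppressed_script(tag: str) -> str:
    """Remove the script subtag when it only repeats the language's default."""
    subtags = tag.split("-")
    has_script = (
        len(subtags) >= 2
        and len(subtags[1]) == 4
        and subtags[1].isalpha()  # a four-letter second subtag is a script
    )
    if has_script:
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if default is not None and default.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


if __name__ == "__main__":
    for tag in ["en-Latn-US", "ja-Jpan", "sr-Latn", "zh-Hant-HK"]:
        print(tag, "->", drop_suppressed_script(tag))
```

Languages with real script variation, such as "sr" or "zh", have no entry in the table, so their script subtags pass through untouched.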
I need a new subtag…
Register new subtags with [email protected]
– only primary language or variant subtags
– read RFC 4646 for instructions
– two-week review period with expert approval
LTRU Milestone Dates
RFC 4646
– Registry went live in December 2005
RFC 4647
(Anticipated) RFC 4646bis
– This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6
RFC 4646bis (Internet-Draft)
Currently taking shape
– Adds about 7000 additional primary language subtags from ISO 639-3
– Extended language subtags for Chinese and other languages being debated
– … and some cleanup work on processes and procedures
Macrolanguages and Extlang
zh-Hant-HK: Chinese, Traditional Script, Hong Kong SAR
yue-Hant-HK: Cantonese, Traditional Script, Hong Kong SAR
zh-yue-Hant-HK: Chinese, Cantonese, Traditional Script, Hong Kong SAR (“yue” is the extlang)
Things to Do (languages)
Get involved in LTRU
Get involved in W3C I18N Activity
Write implementations
Work on adoption of BCP 47: understand the impact
Then get involved with Locale identifiers…
Back to Locales…
IUC 20 Round Table
Suzanne Topping’s Multilingual Article
Tex Texin and the Locales list…
Locale Identifiers and Web Services
W3C and Unicode
W3C
– Identifiers and cross-over with language tags
– Web services
– XML, HTML
Unicode Consortium
– LDML
– CLDR
– Standards for content
Language Tags and Locale Identifiers REC (LTLI)
Working Draft developed by W3C I18N Architecture WG
– effort currently moribund: needs community participation
– defines standards and guidelines for using language tags in W3C technologies
– defines relationship of language tags to locale identifiers
– basis for efforts such as WS-I18N
Things to Read
Tag and Registry RFC: http://www.ietf.org/rfc/rfc4646.txt
Matching RFC: http://www.ietf.org/rfc/rfc4647.txt
4646bis Draft: http://www.ietf.org/internet-drafts/draft-ltru-4646bis-06.txt
References: http://www.langtag.net and http://www.inter-locale.com
LTRU Mailing List: https://www1.ietf.org/mailman/listinfo/ltru
Ideas and Questions