Making Sense of Language Tags, 10th Metadata Open Forum.


Page 1: Making Sense of Language Tags, 10th Metadata Open Forum.

Making Sense of Language Tags

10th Metadata Open Forum

Page 2

Presenter

Addison Phillips

Globalization Architect, Yahoo!

Chair, W3C Internationalization Core Working Group

Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 4646, RFC 4647, RFC 4646bis)

Page 3

Languages, Language Tags, and Locales (oh my!)

Identifying language (and locale): the challenge

ISO 639

IETF BCP 47
– RFC 4646, RFC 4647
– RFC 4646bis

Challenges for users

Page 4

Human Language as Metadata

Some data is just data, but some data is human-readable text.

Text processing depends on language:
– spelling, stemming, tokenization, word/line/sentence boundaries, thesauri, terminology, morphological analysis, font and stylistic traditions, collation

IT systems depend on language negotiation:
– localization, message selection, user interface, presentation, number/date/time/etc. formatting, list presentation

Page 5

Human Language

"Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

Page 6

Identifying Languages

Languages don’t form nice hierarchies
– “splitters” vs. “lumpers”
– dialects, subdialects, regional and stylistic differences, patois

Differing communities with different needs
– terminology, librarians, computer systems, translators, etc.

Page 7

In the Beginning (ca. 1980 CE)

Received Wisdom from the Dark Ages

Locales:
– japanese, french, german, C
– ENU, FRA, JPN
– ja_JP.PCK
– AMERICAN_AMERICA.WE8ISO8859P1

Languages…
– … looked a lot like locales (and vice versa)

Page 8

ISO 639

Defines language identifier codes

Multiple parts:
– ISO 639-1 (alpha2 codes; 676 possible) (136 codes)
– ISO 639-2 (alpha3 codes; 17,576 possible) (about 500)
– ISO 639-3 (alpha3 codes) (about 7000)
– ISO 639-4 (principles for encoding)
– ISO 639-5 (language families)
– ISO 639-6 (alpha4 codes) (under development)

Page 9

Impact of ISO 639-3

ISO 639-2 and 639-3 share a codespace
– all 639-2 codes are also 639-3 codes
– Macrolanguages

Page 10

Human Language

"Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

en

Page 11

Parallel Efforts

ISO 639
– ISO 639-1 (early 1980s)
– ISO 639-2 (alpha3)
– ISO 639-3

IETF BCP 47
– RFC 1766 (1995)
– RFC 3066 (2001)
– RFC 4646 (2006)
– RFC 4646bis (2007)

Page 12

BCP 47

Internet Engineering Task Force (IETF) “Best Current Practice” (BCP)

Enable presentation, selection, and negotiation of content in protocols and formats
– Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, perl, Apache, IE, Mozilla…

Page 13

Adds Granularity

Need to identify language on varying levels of mutual intelligibility and granularity

"Ole Missus, de house is plum' jam full o' people, en dey's jes a-spi'lin' to see de gen'lemen!"

(Mark Twain, Pudd'nhead Wilson)

en

en-US

Page 14

What’s a Locale

– “a concept or identifier used by programmers to represent a particular collection of cultural, regional, or linguistic preferences.”

java.util.Locale

.NET Culture

LANG (setlocale in C, C++)

NLS_LANG in Oracle

… and so on…

Page 15

Locales? Huh?

Theatre Center News: The date of the last version of this document was 2003 年 3 月 20. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug Felt.

Page 16

Locale Identifiers

Different ideas:
– “Accept-Locale” vs. Accept-Language
– URIs/URNs, etc.
– CLDR/LDML

And Requirements:
– Operating environments and harmonization
– App Servers
– Web Services

New Solution? Cost of Adoption:
– UTF-8 to the browser: 8 long years

Page 17

Locales and Language Tags meet

We really need locale identifiers.

Language tags are being (ab)used as locale identifiers anyway…

Not going to need a big new thing…

… we can do this really fast…

Yeah, we’ll write an RFC

IUC23, March 2003

Page 18

BCP 47 (Historic) Basic Structure

Alphanumeric (ASCII only) subtags

Up to eight characters long

Separated by hyphens

Case not important (i.e. zh = ZH = zH = Zh)

1*8alphanum *[ "-" 1*8alphanum ]
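The historic structure above can be checked with a single regular expression. The following is a minimal sketch, assuming the slide's pattern means "one or more hyphen-separated subtags, each 1 to 8 ASCII letters or digits"; the function names are illustrative and do not come from any RFC.

```python
import re

# Assumed reading of the slide's pattern: 1*8alphanum *( "-" 1*8alphanum )
SUBTAG = r"[A-Za-z0-9]{1,8}"
WELL_FORMED = re.compile(rf"{SUBTAG}(?:-{SUBTAG})*")


def is_well_formed(tag: str) -> bool:
    """True if every hyphen-separated subtag is 1-8 ASCII alphanumerics."""
    return WELL_FORMED.fullmatch(tag) is not None


def same_tag(a: str, b: str) -> bool:
    """Compare two tags ignoring case (zh = ZH = zH = Zh)."""
    return a.lower() == b.lower()
```

This only checks the shape of a tag. It says nothing about whether the subtags are registered or meaningful.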

Page 19

RFC 1766

zh-TW
– zh: ISO 639-1 (alpha2)
– TW: ISO 3166 (alpha2)

i-klingon
– Registered value

Page 20

RFC 3066

sco-GB
– sco: ISO 639-2 (alpha3 codes)

But use alpha2 codes when they exist
– en-GB (not eng-GB)

Page 21

Problems

Script Variation:
– zh-Hant/zh-Hans
– (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)

Obsolescence of registrations:
– art-lojban (now jbo), i-klingon (now tlh)

Instability in underlying standards:
– sr-CS (CS used to be Czechoslovakia)

Lack of a single authoritative, stable source

Page 22

And More Problems

Lack of scripts

Little support for registered values in software

Reassignment of values by ISO 3166

Lack of consistent tag formation (Chinese dialects?)

Standards not readily available, bad references

Bad implementation assumptions
– 1*8alphanum *[ "-" 1*8alphanum ]
– 2*3ALPHA [ "-" 2ALPHA ]

Many registrations to cover small variations
– 8 German registrations to cover two variations

Page 23

LTRU and RFC 4646

Defines a generative syntax
– machine readable
– future proof, extensible

Defines a single source (IANA Language Subtag Registry)
– Stable subtags, no conflicts
– Machine readable

Defines when to use subtags
– (sometimes)

Page 24

Anatomy of a Language Tag

sl-Latn-IT-rozaj-1994-x-mine

– sl: ISO 639-1/2 (alpha2/3)
– Latn: ISO 15924 script codes (alpha4)
– IT: ISO 3166 (alpha2) or UN M.49
– rozaj, 1994: Registered variants
– x-mine: Private Use and Extension
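Because subtags are identified by their length and content, a tag like the one above can be taken apart without consulting the registry. The sketch below is a simplified reading of the RFC 4646 syntax, not a complete parser: it assumes a 2 or 3 letter primary language and deliberately rejects extended language subtags and grandfathered tags such as i-klingon. All function names are hypothetical.

```python
def _alpha(s):
    return s.isascii() and s.isalpha()


def _digit(s):
    return s.isascii() and s.isdigit()


def _alnum(s):
    return s.isascii() and s.isalnum()


def parse_tag(tag):
    """Split a language tag into its parts using only length and content."""
    parts = tag.split("-")
    out = {"language": None, "script": None, "region": None,
           "variants": [], "extensions": {}, "private_use": []}

    # Primary language: 2 or 3 letters (ISO 639-1/2).
    if not (_alpha(parts[0]) and 2 <= len(parts[0]) <= 3):
        raise ValueError(f"unsupported primary language subtag in {tag!r}")
    out["language"] = parts[0].lower()
    i = 1

    # Script: exactly 4 letters (ISO 15924), conventionally title case.
    if i < len(parts) and _alpha(parts[i]) and len(parts[i]) == 4:
        out["script"] = parts[i].title()
        i += 1

    # Region: 2 letters (ISO 3166) or 3 digits (UN M.49).
    if i < len(parts) and ((_alpha(parts[i]) and len(parts[i]) == 2)
                           or (_digit(parts[i]) and len(parts[i]) == 3)):
        out["region"] = parts[i].upper()
        i += 1

    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
    while i < len(parts) and _alnum(parts[i]) and (
            5 <= len(parts[i]) <= 8
            or (len(parts[i]) == 4 and parts[i][0].isdigit())):
        out["variants"].append(parts[i].lower())
        i += 1

    # Extensions: a singleton other than "x", then 2-8 character subtags.
    while (i < len(parts) and len(parts[i]) == 1 and _alnum(parts[i])
           and parts[i].lower() != "x"):
        key = parts[i].lower()
        i += 1
        values = []
        while i < len(parts) and _alnum(parts[i]) and 2 <= len(parts[i]) <= 8:
            values.append(parts[i].lower())
            i += 1
        if not values:
            raise ValueError(f"empty extension {key!r} in {tag!r}")
        out["extensions"][key] = values

    # Private use: "x" followed by 1-8 character subtags, to the end.
    if i < len(parts) and parts[i].lower() == "x":
        rest = parts[i + 1:]
        if not rest or not all(_alnum(p) and 1 <= len(p) <= 8 for p in rest):
            raise ValueError(f"malformed private use section in {tag!r}")
        out["private_use"] = [p.lower() for p in rest]
        i = len(parts)

    if i != len(parts):
        raise ValueError(f"unexpected subtag {parts[i]!r} in {tag!r}")
    return out
```

For the slide's example, `parse_tag("sl-Latn-IT-rozaj-1994-x-mine")` yields language `sl`, script `Latn`, region `IT`, variants `rozaj` and `1994`, and the private use subtag `mine`.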

Page 25

More Examples

fr, de, nl, en, ja

fr-FR, fr-CA, de-DE, de-CH…

es-419 (Spanish for Americas)

en-US (English for USA)

de-CH-1996 (Old tags are all valid)

sl-rozaj-1994 (Multiple variants)

zh-t-wadegile (Extensions)

Page 26

Solves the Script problem

zh-Hant (!= zh-TW)

zh-Hans (!= zh-CN)

Azerbaijani (az) – Arab, Cyrl, Latn

Serbian (sr) – Cyrl, Latn

Yiddish (yi) – Hebr, Latn

Mongolian (mn) – Cyrl, Latn, Hani

Belarusian (be) – Cyrl, Latn

Etc.

Page 27

Benefits

Subtag registry in one place: one source, machine-readable

Subtags identified by length/content

Extensible

Compatible with RFC 3066 tags

Stable: subtags are forever

Page 28

Tag Choice

“Tag Content Wisely”
– use the shortest tag reasonable
– use as many subtags as necessary to disambiguate
– don’t invent things; use the registry
– map deprecated values to modern equivalents
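Mapping deprecated values to modern equivalents can be as small as a table lookup. This is a minimal sketch using only the two replacements named on the earlier "Problems" slide; the table and function name are illustrative, and a real implementation would take its mappings from the IANA Language Subtag Registry.

```python
# Deprecated registrations and their replacements, as listed on the
# "Problems" slide. Illustrative subset only, not the full registry.
DEPRECATED = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
}


def modernize(tag: str) -> str:
    """Return the modern equivalent of a deprecated tag, else the tag itself."""
    return DEPRECATED.get(tag.lower(), tag)
```

Tags that are not in the table pass through unchanged, so the function is safe to apply to every tag before storage or matching.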

Page 29

Specialized Codes

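zxx (no linguistic content)

und (undetermined language)

mis (miscellaneous, i.e. uncoded, languages)

Zxxx (script code for unwritten content)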

Page 30

Problems

Matching
– Does “en-US” match “en-Latn-US”?

Tag Choices
– Users have more to choose from.

Implementations
– More to do, more to think about
– (easier to parse, process, support the good stuff)

Page 31

Tag Matching (RFC 4647)

Uses “Language Ranges” in a “Language Priority List” to select sets of content according to the language tag

Three Schemes
– Basic Filtering
– Extended Filtering
– Lookup

Page 32

Tags are not Tokens!

Many technologies would like language tags (attributes, etc.) to be atomic, but language tags have structure

<span class="foo" xml:lang="en-US" />

.foo:lang(en) { color: red; }

Accept-Language: zh;q=1.0, de-DE;q=0.8
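Even the header value is structured: it is a list of language ranges, each with an optional quality weight. The following sketch turns a value such as `zh;q=1.0, de-DE;q=0.8` into an ordered priority list. It assumes the usual HTTP conventions (comma-separated items, `q` defaulting to 1.0) and is lenient rather than strict; the function name is hypothetical.

```python
def parse_accept_language(value):
    """Return [(range, q), ...] sorted from most to least preferred."""
    ranges = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # default weight when no q parameter is present
        for param in params.split(";"):
            name, _, val = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat an unreadable weight as "not wanted"
        ranges.append((tag.strip(), q))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(ranges, key=lambda pair: -pair[1])
```

The resulting list is what RFC 4647 calls a language priority list, and it can feed the filtering and lookup schemes directly.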

Page 33

Filtering

Ranges specify the least specific item
– “en” matches “en”, “en-US”, “en-Brai”, “en-boont”

Basic matching uses plain prefixes
– “en-US” matches “en-US” or “en-US-boont” but not “en-Latn-US”

Extended matching can match “inside bits”
– “en-*-US”
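Both filtering schemes are short enough to sketch in a few lines. The code below follows the basic and extended filtering rules as described in RFC 4647; the function names are illustrative, and input is assumed to be well-formed tags and ranges.

```python
def basic_match(range_, tag):
    """Basic filtering: the range is a case-insensitive prefix of the tag,
    ending on a subtag boundary. The range "*" matches everything."""
    r, t = range_.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


def extended_match(range_, tag):
    """Extended filtering: range subtags must appear in order in the tag,
    and non-matching tag subtags in between may be skipped."""
    r = range_.lower().split("-")
    t = tag.lower().split("-")
    # The first subtags must match unless the range starts with a wildcard.
    if r[0] != "*" and r[0] != t[0]:
        return False
    ri, ti = 1, 1
    while ri < len(r):
        if r[ri] == "*":
            ri += 1                # a wildcard matches zero or more subtags
        elif ti >= len(t):
            return False           # range wants more than the tag has
        elif r[ri] == t[ti]:
            ri += 1
            ti += 1
        elif len(t[ti]) == 1:
            return False           # never skip past a singleton such as "x"
        else:
            ti += 1                # skip an "inside bit" of the tag
    return True


def filter_tags(range_, tags, match=basic_match):
    """Return every tag that the range matches, in the original order."""
    return [tag for tag in tags if match(range_, tag)]
```

With these definitions, `basic_match("en-US", "en-Latn-US")` is false while `extended_match("en-US", "en-Latn-US")` is true, which is the difference the slide illustrates.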

Page 34

Lookup

Range specifies the most specific tag in a match.

Returns exactly one item.
– “en-US” might return either “en” or “en-US” but not “en-US-boont”

Mirrors the locale fallback mechanism and many language negotiation schemes.
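Lookup can be sketched as progressive truncation: each range in the priority list is shortened from the right until it equals an available tag. The code follows the RFC 4647 lookup description, including the rule that a single-character subtag is removed together with the subtag after it. The function name and the `default` parameter are assumptions made for illustration.

```python
def lookup(priority_list, available, default=None):
    """Return the single best available tag for a language priority list."""
    by_lower = {tag.lower(): tag for tag in available}
    for range_ in priority_list:
        if range_ == "*":
            continue  # the wildcard range is skipped during lookup
        subtags = range_.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A range may not end in a singleton such as "x"; drop it too.
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

Because the range is only ever shortened, a request for `en-US` can resolve to `en-US` or `en`, never to the more specific `en-US-boont`.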

Page 35

Lookup and Language Negotiation

Resources “fall back” to find the best match

Global binary plus resources, falling back:
– zh-Hans-SG (Chinese, Simplified script, Singapore)
– zh-Hans (Chinese, Simplified script)
– zh (Chinese)
– (root)
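Resource fallback of this kind resolves each string separately, so a specific bundle only needs the strings that differ from its parent. The sketch below assumes resources are plain dictionaries keyed by tag, with the empty string standing in for the root bundle; both that layout and the function names are hypothetical, and bundle keys are compared case-sensitively for brevity.

```python
def fallback_chain(tag):
    """List the tag, each shorter prefix of it, and finally the root ("")."""
    subtags = tag.split("-")
    chain = ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]
    chain.append("")  # the root bundle
    return chain


def find_resource(key, tag, bundles):
    """Return the first value for key found while walking the chain."""
    for name in fallback_chain(tag):
        bundle = bundles.get(name, {})
        if key in bundle:
            return bundle[key]
    raise KeyError(key)
```

For example, `fallback_chain("zh-Hans-SG")` gives `zh-Hans-SG`, `zh-Hans`, `zh`, and then the root.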

Page 36

What Do I Do (Content Author)?

Not much.
– Existing tags are all still valid: tagging is mostly unchanged.
– Resist temptation to (ab)use the private use subtags.

Unless your language has script variations:
– Tag content with the appropriate script subtag(s)

Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.

Page 37

What Do I Do (Programmer)?

Check code for compliance with RFC 4646
– Decide on well-formed or validating
– Implement suppress-script
– Change to using the registry
– Bother infrastructure folks (Java, MS, Mozilla, etc.) to implement the standard
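Suppress-script is one of the smaller items on this list. The registry marks some languages with a script that is assumed by default and therefore should be left out of the tag. The sketch below uses a tiny hand-copied table for illustration (the entries are believed to match the registry, but real code should load the Suppress-Script fields from the IANA Language Subtag Registry); the function name is hypothetical.

```python
# Illustrative subset only; load the real values from the registry.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn", "de": "Latn", "ru": "Cyrl"}


def drop_suppressed_script(tag):
    """Remove the script subtag when it is the language's default script."""
    subtags = tag.split("-")
    has_script = (len(subtags) >= 2 and len(subtags[1]) == 4
                  and subtags[1].isascii() and subtags[1].isalpha())
    if has_script:
        default = SUPPRESS_SCRIPT.get(subtags[0].lower(), "")
        if default.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)
```

Normalizing tags this way before matching avoids the situation where content tagged `en-Latn-US` fails to match a basic-filtering request for `en-US`.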

Page 38

I need a new subtag…

Register new subtags with [email protected]
– only primary language or variant subtags
– read RFC 4646 for instructions
– two-week review period with expert approval

Page 39

LTRU Milestone Dates

RFC 4646
– Registry went live in December 2005

RFC 4647

(Anticipated) RFC 4646bis
– This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6

Page 40

RFC 4646bis (Internet-Draft)

Currently taking shape
– Adds about 7000 additional primary language subtags from ISO 639-3
– Extended language subtags for Chinese and other languages being debated
– … and some cleanup work on processes and procedures

Page 41

Macrolanguages and Extlang

zh-Hant-HK: Chinese, Traditional Script, Hong Kong SAR

yue-Hant-HK: Cantonese, Traditional Script, Hong Kong SAR

zh-yue-Hant-HK: Chinese, Cantonese, Traditional Script, Hong Kong SAR

extlang: the “yue” in zh-yue-Hant-HK

Page 42

Things to Do (languages)

Get involved in LTRU

Get involved in W3C I18N Activity

Write implementations

Work on adoption of BCP 47: understand the impact

Then get involved with Locale identifiers…

Page 43

Back to Locales…

IUC 20 Round Table

Suzanne Topping’s Multilingual Article

Tex Texin and the Locales list…

Page 44

Locale Identifiers and Web Services

Page 45

W3C and Unicode

W3C
– Identifiers and cross-over with language tags
– Web services
– XML, HTML

Unicode Consortium
– LDML
– CLDR
– Standards for content

Page 46

Language Tags and Locale Identifiers REC (LTLI)

Working Draft developed by W3C I18N Architecture WG
– effort currently moribund: needs community participation
– defines standards and guidelines for using language tags in W3C technologies
– defines relationship of language tags to locale identifiers

Basis for efforts such as WS-I18N

Page 47

Things to Read

Tag and Registry RFC
http://www.ietf.org/rfc/rfc4646.txt

Matching RFC
http://www.ietf.org/rfc/rfc4647.txt

4646bis Draft
http://www.ietf.org/internet-drafts/draft-ltru-4646bis-06.txt

References
http://www.langtag.net
http://www.inter-locale.com

LTRU Mailing List
https://www1.ietf.org/mailman/listinfo/ltru

Page 48

Ideas and Questions