Panlingual Lexical Collaboration

PanLex Panlingual Lexical Collaboration Jonathan Pool University of Washington Computational Linguistics Laboratory 22 April 2008


Page 1

PanLex

Panlingual Lexical Collaboration

Jonathan Pool

University of Washington Computational Linguistics Laboratory

22 April 2008

Page 2

Pre-Demo

PanLex (http://panlex.org/cgi-bin/panlex13.cgi)

Task 1.

You encounter the word “प्रधानाध्यापक”.

What language is it in?

What does it mean?

Task 2.

You encounter the word list at http://www.geonames.de/peace.html.

Is its content already in PanLex?

If not, how can you contribute it?

Page 3

Goals

Facilitate panlingual:

1. Lexical collaboration.

2. Lexical intertranslatability.

3. Discursive intertranslatability.

4. Vigor.

Page 4

Goal 1: Facilitate panlingual lexical collaboration

How?

Strategy

1. Assemble valuable panlingual data.

2. Make the data accessible.

3. Invite contributions to the data.

4. Localize the interface panlingually.

Page 5

Tactic 1: Assemble valuable panlingual data.

How?

Tactics

Borrow data from TransGraph.

Expression (lexeme) equivalences from 357 dictionaries.

13 multilingual, 344 bilingual.

1050 languages.

2.5 million expressions.

8 million expression tokens.

Accept (mainly) TransGraph’s lightweight schema.

An expression is just a string in a language.

A meaning is just a source-specific ID.

A denotation is just a source assigning a meaning to an expression.

A translation is just 2+ denotations with the same meaning.
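The four definitions above can be sketched as a small data model. This is a minimal illustration in Python, not TransGraph's or PanLex's actual schema: the class names, the source label "dict-A", and the meaning IDs are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    lang: str  # language code, e.g. "eng"
    text: str  # the string itself


@dataclass(frozen=True)
class Meaning:
    source: str  # hypothetical source label
    ident: int   # ID that is only meaningful within that source


@dataclass(frozen=True)
class Denotation:
    meaning: Meaning        # the meaning a source assigns...
    expression: Expression  # ...to this expression


def translations(denotations):
    """Group expressions by meaning; keep meanings with 2+ denotations."""
    by_meaning = defaultdict(list)
    for d in denotations:
        by_meaning[d.meaning].append(d.expression)
    return {m: exprs for m, exprs in by_meaning.items() if len(exprs) >= 2}


m = Meaning("dict-A", 1)
denotations = [
    Denotation(m, Expression("eng", "linguistics")),
    Denotation(m, Expression("tur", "dilbilim")),
    # A meaning with a single denotation yields no translation.
    Denotation(Meaning("dict-A", 2), Expression("est", "keeleteadus")),
]
result = translations(denotations)
```

Because a meaning carries its source, two sources never share a meaning ID, which keeps the schema lightweight at the cost of leaving cross-source meaning alignment to the query layer.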

Page 6

TransGraph Data

Example

eng: linguistics

ell: γλωσσολογία

tur: dilbilim

est: keeleteadus

6290415713

Page 7

Tactic 2: Make the data accessible.

How?

Tactics

Open-source (PostgreSQL) database (vs. TransGraph).

Perl CGI-DBI application to query and modify the data.

Domain "panlex.org" to access the application.

All data exposed (vs. PanImages).

Data retrievable interactively and by plain-text or XML file export.
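A rough sketch of the query-and-export path, assuming a single hypothetical `denotation` table. The real application is Perl CGI-DBI over PostgreSQL; here Python's built-in `sqlite3` stands in for the database, and the table layout, source labels, and XML element names are all invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.execute(
    "CREATE TABLE denotation (source TEXT, meaning INTEGER, lang TEXT, expr TEXT)"
)
conn.executemany(
    "INSERT INTO denotation VALUES (?, ?, ?, ?)",
    [
        ("dict-A", 1, "eng", "linguistics"),
        ("dict-A", 1, "tur", "dilbilim"),
        ("dict-B", 7, "eng", "linguistics"),
        ("dict-B", 7, "est", "keeleteadus"),
    ],
)


def translate(lang, expr):
    """Return (lang, expr) pairs that share a source-specific meaning with the query."""
    return conn.execute(
        """
        SELECT DISTINCT b.lang, b.expr
        FROM denotation a
        JOIN denotation b ON a.source = b.source AND a.meaning = b.meaning
        WHERE a.lang = ? AND a.expr = ?
          AND NOT (b.lang = a.lang AND b.expr = a.expr)
        ORDER BY b.lang, b.expr
        """,
        (lang, expr),
    ).fetchall()


def export_text(rows):
    """Plain-text export: one tab-separated line per expression."""
    return "\n".join(f"{lang}\t{expr}" for lang, expr in rows)


def export_xml(rows):
    """XML export with hypothetical element and attribute names."""
    root = ET.Element("translations")
    for lang, expr in rows:
        ET.SubElement(root, "expression", lang=lang).text = expr
    return ET.tostring(root, encoding="unicode")


rows = translate("eng", "linguistics")
print(export_text(rows))
print(export_xml(rows))
```

The self-join mirrors the definition of a translation: two denotations from the same source with the same meaning.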

Page 8

Tactic 3: Invite contributions to the data.

How?

Tactics

User contributions nondestructive.

Not a Wiki, not moderated.

Contributable data:

[Language varieties (vs. TransGraph languages).]

Expressions.

Sources.

Denotations.

Contribution modes:

Batch (file upload; plain-text or XML).

Incremental (interactive editing).
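One way to read "nondestructive" batch contribution is an append-only store: an upload registers a new source and adds its denotations, and nothing already stored is changed. The sketch below assumes a hypothetical plain-text format of `meaning_id<TAB>lang<TAB>expression` per line; the actual PanLex upload formats are not specified here.

```python
def parse_batch(text):
    """Parse lines of 'meaning_id<TAB>lang<TAB>expression' (hypothetical format)."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 tab-separated fields")
        meaning, lang, expr = parts
        rows.append((int(meaning), lang, expr.strip()))
    return rows


class Store:
    """Append-only store: contributions add data and never overwrite it."""

    def __init__(self):
        self.sources = []
        self.denotations = []  # tuples of (source, meaning, lang, expr)

    def contribute(self, source, text):
        if source in self.sources:
            raise ValueError(f"source {source!r} already exists")
        rows = parse_batch(text)  # validate the whole file before storing anything
        self.sources.append(source)
        self.denotations.extend((source, m, lang, e) for m, lang, e in rows)
        return len(rows)


store = Store()
added = store.contribute("peace-list", "1\teng\tpeace\n1\tepo\tpaco\n")
```

Validating the full file before storing any row means a malformed upload leaves the store exactly as it was.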

Page 9

Tactic 4: Localize the interface panlingually.

How?

Tactics

In vivo localization.

Interface entirely lemmatic.

Therefore, PanLex can translate the interface.

Translation core: developer-attested translations.

Translation periphery: election with sources voting.
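The core/periphery split can be sketched as a lookup with a fallback vote. This is one possible reading of "election with sources voting", not a description of the PanLex implementation: it assumes interface lemmas are English expressions, gives each source one vote per candidate it pairs with the lemma, and breaks ties alphabetically. The function name, data layout, and sample entries are hypothetical.

```python
from collections import Counter


def localize(lemma, target_lang, core, denotations, source_lang="eng"):
    """Choose a target-language rendering of an interface lemma."""
    # Core: a developer-attested translation wins outright.
    attested = core.get((lemma, target_lang))
    if attested is not None:
        return attested
    # Periphery: find the source-specific meanings assigned to the lemma...
    meanings = {
        (src, mid)
        for src, mid, lang, expr in denotations
        if lang == source_lang and expr == lemma
    }
    # ...and let each source cast one ballot per candidate expression.
    ballots = {
        (src, expr)
        for src, mid, lang, expr in denotations
        if lang == target_lang and (src, mid) in meanings
    }
    votes = Counter(expr for _, expr in ballots)
    if not votes:
        return None  # no source offers a translation
    top = max(votes.values())
    return min(expr for expr, n in votes.items() if n == top)


denotations = [
    ("dict-A", 1, "eng", "language"), ("dict-A", 1, "deu", "Sprache"),
    ("dict-B", 4, "eng", "language"), ("dict-B", 4, "deu", "Sprache"),
    ("dict-C", 9, "eng", "language"), ("dict-C", 9, "deu", "Zunge"),
]
core = {("language", "fra"): "langue"}

german = localize("language", "deu", core, denotations)
french = localize("language", "fra", core, denotations)
```

Because every interface string is a lemma, the same lookup that serves users also serves the interface itself.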

Page 10

Evaluation

Test 1 (expert user):

15 query and modification tasks with test questions.

Failures and comments inspired interface changes.

Test 2 (expert user):

Found, formatted, checked, and uploaded data from:

Nepali-Esperanto dictionary.

English-Yiddish dictionary.

Eight-language medical glossary.

Page 11

Future Work

Coverage

Add dictionaries.

Recruit user-added dictionaries.

Add source types:

Thesauri.

WordNets.

Library subject headings.

Locale repositories.

Monolingual resources.

Export additions to TransGraph.

Page 12

[Bar chart by language; horizontal axis 0 to 450,000. Bar labels, then bar values in ascending order.]

*eng: English
hun: magyar
*fra: français
*deu: Deutsch
ces: čeština
hrv: hrvatski
*tur: Türkçe
spa: español
est: eesti
ita: italiano
*epo: Esperanto
ara: العربية
fin: suomi
jpn: 日本語
nld: Nederlands
por: português
*rus: русский
bre: brezhoneg
srp: српски
ron: română
kur: kurdî
swe: svenska

24,495; 25,315; 26,516; 27,072; 29,436; 36,779; 38,237; 46,921; 50,247; 54,439; 56,122; 62,928; 72,861; 73,628; 82,503; 92,146; 96,735; 110,623; 135,505; 172,435; 264,927; 428,550

Page 13

[Bar chart by language; horizontal axis 0 to 24,000. Bar labels, then bar values in ascending order.]

isl: íslenska
sqi: shqipe
pol: polski
*nob: bokmål
nci: Classical Nahuatl
nah: nawatlahtolli
bel: беларуская
cat: català
lat: latine
gle: Gaeilge
dan: dansk
oci: lenga occitana
plt: Plateau Malagasy
tuk: türkmen
slk: slovenčina
slv: slovenščina
oji: ᐊᓂᔑᓇᐯ
chy: Tsétsêhéstaestse
lij: lengua lígure
zho: 漢語
eus: euskara
frp: lenga arpitana
ell: ελληνικά
glv: chengey Vannin
glg: galego
cym: Cymraeg
mlt: Malti
art: ISO 639
afr: Afrikaans
nep: नेपाली
yid: ייִדיש
heb: עברית
kor: 한국어
ltz: Lëtzebuergesch Sprooch
ang: Englisce sprǣc
ind: bahasa Indonesia
fao: føroyskt
aym: aymar aru
pap: Papiamentu

4,798; 4,960; 5,181; 5,365; 5,979; 6,370; 6,750; 6,867; 6,868; 7,005; 7,518; 7,655; 7,675; 7,876; 7,903; 8,224; 8,780; 8,850; 8,902; 9,092; 9,240; 9,380; 9,554; 9,819; 10,049; 10,051; 10,593; 12,562; 13,003; 13,595; 14,891; 15,614; 17,132; 17,245; 18,012; 18,378; 19,356; 20,032; 20,513

Page 14

[Bar chart by language; horizontal axis 0 to 5,000. Bar labels, then bar values in ascending order.]

lit: lietuvių
hbs: Serbo-Croatian
bul: български
gla: Gàidhlig na h-Alba
yua: yukatek
yor: èdè Yorùbá
fro: Old French
quz: Cusco Quechua
cor: yeth Kernewek
pqm: Malecite-Passamaquoddy
fry: Frysk
nds: Plattdüütsche Sprook
vie: tiếng Việt
hmn: Hmoob
qul: North Bolivian Quechua
ido: Ido
lav: latviešu
bos: bosanski
tel: తెలుగు
roh: lingua rumantscha
ina: interlingua
urd: اردو
ary: Moroccan Arabic
ukr: українська
kab: ثاقبايليث
pcd: langue picarde
fas: فارسی
tgl: Tagalog
cos: lingua corsa
got: gutiska razda
mly: Bahasa Melayu
tpi: Tok Pisin
swa: kiswahili
msa: bahasa Melayu
tha: ภาษาไทย
nap: lengua nnapulitana
pes: فارسی
hin: हिन्दी
qus: Santiago del Estero Quichua

1,525; 1,543; 1,587; 1,616; 1,689; 1,737; 1,809; 1,958; 1,968; 2,012; 2,083; 2,185; 2,239; 2,271; 2,296; 2,391; 2,485; 2,536; 2,649; 2,666; 2,758; 2,857; 2,870; 2,912; 2,919; 3,008; 3,121; 3,204; 3,340; 3,463; 3,463; 3,757; 3,776; 3,960; 4,103; 4,321; 4,335; 4,482; 4,747

Page 15

Future Work

Features

More query functions.

User SQL entry.

Usability

Test and improve interface.

Non-expert interface.

Standards

Lemmatic forms (e.g., English “to”).

Multiword lexemes.