Language Identification Related Work Open Issues Task Description Dataset Baseline Results Conclusion References
ALTW 2010 Shared Task:
Multilingual Language Identification
Marco Lui & Tim Baldwin
NICTA VRL
Department of Computer Science and Software Engineering
University of Melbourne, VIC 3010, Australia
[email protected], [email protected]
University of Melbourne
10 December 2010
What is Language Identification?
Source(s): Wikipedia
Can you LangID?
Source(s): Wikipedia
Basic Assumptions
Monolingual
Homogeneous
Closed World
Narrow Scope
Cavnar & Trenkle - Dataset
• 3478 samples from the soc.culture newsgroup hierarchy
• 8 languages:
    English     1208
    Spanish      697
    German       481
    Italian      316
    French       273
    Dutch        235
    Portuguese   151
    Polish       117
Reference(s): Cavnar and Trenkle, 1994
Cavnar & Trenkle - Techniques
Data Representation
• keep only letters, apostrophes, whitespace
• union over byte-level N-grams (N = 1 . . . 5)
Examples
language identification
1-gram l, a, n, g, u . . .
2-gram la, an, ng, gu, ua, ag . . .
3-gram lan, ang, ngu, gua, uag, age . . .
4-gram lang, angu, ngua, guag, uage, age_ . . .
5-gram langu, angua, nguag, guage, uage_, age_i . . .
(_ marks a space)
Reference(s): Cavnar and Trenkle, 1994
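The data representation above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `clean` and `byte_ngrams` are hypothetical, and the sketch assumes UTF-8 input and does not pad token boundaries.

```python
from collections import Counter

def clean(text: str) -> str:
    """Keep only letters, apostrophes and whitespace."""
    return "".join(ch for ch in text if ch.isalpha() or ch == "'" or ch.isspace())

def byte_ngrams(text: str, n_min: int = 1, n_max: int = 5) -> Counter:
    """Count every byte-level n-gram for n_min <= n <= n_max in one profile."""
    data = clean(text).encode("utf-8")  # assumption: UTF-8 encoded bytes
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

profile = byte_ngrams("language identification")
print(profile[b"la"], profile[b"lang"], profile[b"age i"])
```

Keeping all orders in a single counter mirrors the "union over N-grams" on the slide: unigrams and 5-grams compete in the same frequency table.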
Cavnar & Trenkle - Techniques
Feature Selection
• N-Gram Frequency Profile
• Top X (X = 100 . . . 400)
Examples
X = 3
from a:20 b:15 c:10 ab:12 ac:8 . . .
select a, b, ab
Reference(s): Cavnar and Trenkle, 1994
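The top-X selection can be reproduced with the slide's own toy counts. The helper name `top_x` is hypothetical; the sketch simply ranks n-grams by frequency and keeps the first X.

```python
from collections import Counter

def top_x(counts: Counter, x: int) -> list[str]:
    """Return the x most frequent n-grams, most frequent first."""
    return [gram for gram, _ in counts.most_common(x)]

counts = Counter({"a": 20, "b": 15, "c": 10, "ab": 12, "ac": 8})
print(top_x(counts, 3))  # ['a', 'b', 'ab']
```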
Cavnar & Trenkle - Techniques
Classification Algorithm
• nearest prototype
• 1 prototype per language
• sum of term frequencies across all instances
• out-of-place distance metric
Examples
doc1 a:10 b:15 c:2
doc2 a:2 b:3 c:1
doc3 a:25 b:20 c:15
prototype a:37 b:38 c:18
Reference(s): Cavnar and Trenkle, 1994
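Building one prototype per language amounts to summing term frequencies over that language's documents. A minimal sketch using the three toy documents above (the name `build_prototype` is hypothetical):

```python
from collections import Counter

def build_prototype(docs: list[Counter]) -> Counter:
    """Sum term frequencies across all documents of one language."""
    prototype = Counter()
    for doc in docs:
        prototype.update(doc)  # Counter.update adds counts
    return prototype

docs = [
    Counter(a=10, b=15, c=2),   # doc1
    Counter(a=2, b=3, c=1),     # doc2
    Counter(a=25, b=20, c=15),  # doc3
]
prototype = build_prototype(docs)
print(prototype["a"], prototype["b"], prototype["c"])  # 37 38 18
```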
Cavnar & Trenkle - Techniques
Out-of-Place distance metric
Reference(s): Cavnar and Trenkle, 1994
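As a rough sketch of the out-of-place metric: each profile is a list of n-grams ordered from most to least frequent, and the distance is the sum of rank differences between the document profile and a language profile. The penalty for n-grams missing from the language profile is an assumption here (the profile length), and the function names and toy profiles are hypothetical.

```python
def out_of_place(doc_profile: list[str], lang_profile: list[str]) -> int:
    """Sum of rank displacements between two ranked n-gram profiles."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)  # assumption: fixed cost for unseen n-grams
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            total += abs(rank - lang_rank[gram])
        else:
            total += max_penalty
    return total

def classify(doc_profile: list[str], prototypes: dict[str, list[str]]) -> str:
    """Nearest prototype: pick the language with the smallest distance."""
    return min(prototypes, key=lambda lang: out_of_place(doc_profile, prototypes[lang]))

# Toy ranked profiles, invented for illustration only.
prototypes = {
    "lang_a": ["th", "on", "er", "le"],
    "lang_b": ["er", "ch", "ei", "th"],
}
doc = ["th", "er", "on"]
print(out_of_place(doc, prototypes["lang_a"]))  # 2
print(out_of_place(doc, prototypes["lang_b"]))  # 8
print(classify(doc, prototypes))                # lang_a
```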
Cavnar & Trenkle - Results
• 98.6% accuracy for articles ≤300 bytes
• 99.8% accuracy for articles > 300 bytes
• A solved problem?
Reference(s): Cavnar and Trenkle, 1994
Baldwin & Lui - Task Description
Corpus Docs Langs Encs Document Length (bytes)
EuroGOV 1500 10 1 17460.5±39353.4
TCL 3174 60 12 2623.2±3751.9
Wikipedia 4963 67 1 1480.8±4063.9
Reference(s): Baldwin and Lui, 2010
Baldwin & Lui - Method
• 10-fold cross-validation on each dataset
• 42 distinct classifiers
model (×7): nearest-neighbour (Cos1NN, Skew1NN, OOP1NN), nearest-prototype (CosAM, SkewAM), Naive Bayes, SVM
tokenisation (×2): byte, codepoint
n-gram (×3): 1-gram, 2-gram, 3-gram
Reference(s): Baldwin and Lui, 2010
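The count of 42 is the Cartesian product of the three factors (7 × 2 × 3). A small sketch that enumerates the grid; the string labels are just the names from the slide, not identifiers from any released code.

```python
from itertools import product

models = ["Cos1NN", "Skew1NN", "OOP1NN", "CosAM", "SkewAM", "NaiveBayes", "SVM"]
tokenisations = ["byte", "codepoint"]
ngram_orders = [1, 2, 3]

# One configuration per (model, tokenisation, n-gram order) combination.
configs = list(product(models, tokenisations, ngram_orders))
print(len(configs))  # 42
```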
Baldwin & Lui - Techniques
Skew Divergence
$$D(x \,\|\, y) = \sum_i x_i \left(\log_2 x_i - \log_2 y_i\right)$$
$$\mathrm{skew}_\alpha(x, y) = D\left(x \,\|\, \alpha y + (1 - \alpha) x\right)$$
• variant of Kullback-Leibler divergence
• linear interpolation between x and y with smoothing factor α
• α typically 0.99
Reference(s): Lee, 2001
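A minimal sketch of the two formulas, assuming x and y are probability distributions given as equal-length lists. The function names are hypothetical, and terms with x_i = 0 are treated as contributing nothing, which is the usual convention.

```python
import math

def kl_divergence(x: list[float], y: list[float]) -> float:
    """D(x || y) in bits; terms with x_i == 0 are skipped."""
    return sum(xi * (math.log2(xi) - math.log2(yi)) for xi, yi in zip(x, y) if xi > 0)

def skew_divergence(x: list[float], y: list[float], alpha: float = 0.99) -> float:
    """KL divergence from x to y after mixing a little of x into y."""
    mixed = [alpha * yi + (1 - alpha) * xi for xi, yi in zip(x, y)]
    return kl_divergence(x, mixed)

x = [0.5, 0.5, 0.0]
y = [0.5, 0.0, 0.5]
print(skew_divergence(x, y))
```

In this example y assigns zero probability to an event that x considers likely, so plain KL divergence is undefined; the interpolation keeps the result finite.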
Baldwin & Lui - Results
Tokenization: Choice of n-gram order (Wikipedia)
Reference(s): Baldwin and Lui, 2010
Baldwin & Lui - Results
Tokenization: Bytes vs Codepoints (2-gram)
Reference(s): Baldwin and Lui, 2010
Baldwin & Lui - Results
Performance vs Time Taken
Reference(s): Baldwin and Lui, 2010
Baldwin & Lui - Results
The Long Tail
• Wikipedia
• byte bigram
• Skew Divergence
• Nearest Prototype
Language      N      P      R      F
Tamil         6  1.000  1.000  1.000
Japanese    219  0.990  0.992  0.955
English    1629  0.972  0.899  0.934
. . .
Italian     202  0.735  0.906  0.812
Danish       37  0.710  0.595  0.647
Icelandic    10  0.188  0.300  0.231
Reference(s): Baldwin and Lui, 2010
Baldwin & Lui - Results
Confusion Pairs
• Wikipedia
• byte bigram
• Skew Divergence
• Nearest Prototype
From        To          Proportion
Indonesian  Malay       0.405
Malay       Indonesian  0.214
Danish      Norwegian   0.270
Norwegian   Danish      0.043
Russian     Ukrainian   0.090
Ukrainian   Russian     0.043
Reference(s): Baldwin and Lui, 2010
Open Issues
Supporting Minority Languages
Open Class Language Identification
Sparse or Impoverished Training Data
Multilingual Documents
Standard Evaluation Corpora
Performance Evaluation Criteria
Reference(s): Hughes et al., 2006
ALTW 2010 Shared Task
• multiclass text categorization task
• select 2 languages from a closed set of 74
• addresses a number of open issues:
    • Sparse or Impoverished Training Data
    • Multilingual Documents
    • Standard Evaluation Corpora
    • Performance Evaluation Criteria
ALTW 2010 Shared Task Dataset
• 10000 synthetic bilingual documents in 74 languages
• randomly partitioned into
    • 8000 training documents
    • 1000 development documents
    • 1000 test documents
• compiled from static dumps of language-specific Wikipedias
• downloaded between 9 June and 1 August 2008
• selected languages with > 1000 articles
Generating a synthetic bilingual document
• semantic linkage
• language-links: [[<language-prefix>:<page title>]]
1. select primary document
2. select secondary document via language-link
3. normalize: remove redirects, language-links and templates
4. chunk: split on two consecutive paragraphs
5. retain top 50% of paragraphs from primary, bottom 50% from secondary
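Steps 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the task's actual generation script: paragraphs are assumed to be separated by blank lines, odd paragraph counts are rounded in favour of keeping more text, the normalisation in step 3 is omitted, and the function names are hypothetical.

```python
def split_paragraphs(text: str) -> list[str]:
    """Assumption: paragraphs are separated by a blank line."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def make_bilingual(primary: str, secondary: str) -> str:
    """Join the top half of the primary document to the bottom half of the secondary."""
    first = split_paragraphs(primary)
    second = split_paragraphs(secondary)
    top = first[: (len(first) + 1) // 2]   # top 50%, rounded up (assumption)
    bottom = second[len(second) // 2:]     # bottom 50%
    return "\n\n".join(top + bottom)

primary = "E1\n\nE2\n\nE3\n\nE4"
secondary = "D1\n\nD2\n\nD3\n\nD4"
print(make_bilingual(primary, secondary))
```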
Evaluation Metrics
Multi-class Text Categorization
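• IR-style performance metrics:
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$\text{f-score} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$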
• macroaveraging vs microaveraging
• competition metric: micro-averaged f-score
Reference(s): Sebastiani, 2002
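The difference between the two averaging schemes is easiest to see in code. The sketch below assumes each document carries a set of gold labels and a set of predicted labels; micro-averaging pools the TP/FP/FN counts over all classes, while macro-averaging here takes the unweighted mean of per-class scores (one common convention). The function names and toy labels are hypothetical.

```python
from collections import Counter

def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and f-score; undefined ratios are treated as 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold: list[set], pred: list[set]):
    """Return (micro, macro) triples of (precision, recall, f-score)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in p & g:
            tp[label] += 1
        for label in p - g:
            fp[label] += 1
        for label in g - p:
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    per_class = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(vals) / len(labels) for vals in zip(*per_class))
    return micro, macro

gold = [{"en", "de"}, {"en", "fr"}, {"ja", "en"}]
pred = [{"en", "de"}, {"en", "de"}, {"en", "fr"}]
micro, macro = evaluate(gold, pred)
print(micro, macro)
```

On this toy data the frequent class "en" is always right, so the micro scores are higher than the macro scores, which are pulled down by the rare classes.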
Majority-class baseline
• most common classes
    • en(3330) de(747) fr(747) ja(442)
• most common pairs
    • en-de(1283) en-fr(1053) en-ja(606) en-it(479)
Baseline  PM    RM    FM    Pµ    Rµ    Fµ
en        .011  .015  .012  .701  .350  .467
en+de     .014  .030  .018  .458  .458  .458
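A constant-guess baseline of this kind can be sketched as below. The four gold label sets are invented toy data, not the shared-task corpus, and `constant_baseline` is a hypothetical helper that reports only the micro-averaged scores.

```python
from collections import Counter

def constant_baseline(gold: list[set], guess: set) -> tuple[float, float, float]:
    """Micro-averaged P/R/F when every document receives the same label set."""
    tp = sum(len(g & guess) for g in gold)
    fp = sum(len(guess - g) for g in gold)
    fn = sum(len(g - guess) for g in gold)
    if tp == 0:
        return 0.0, 0.0, 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Toy gold standard: every document has exactly two languages.
gold = [{"en", "de"}, {"en", "fr"}, {"en", "ja"}, {"de", "fr"}]
top_class = Counter(label for g in gold for label in g).most_common(1)[0][0]

print(top_class)                               # en
print(constant_baseline(gold, {top_class}))    # (0.75, 0.375, 0.5)
print(constant_baseline(gold, {"en", "de"}))   # (0.625, 0.625, 0.625)
```

Guessing a single label for two-label documents caps recall at half of precision, while guessing a pair makes precision and recall equal; the same pattern is visible in the table above.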
Nearest-prototype benchmark
Skew Divergence with Arithmetic Mean Language Prototypes
N-Gram  Multiclass  PM    RM    FM    Pµ    Rµ    Fµ
1       single      .440  .274  .295  .264  .132  .176
2       single      .540  .376  .413  .583  .291  .389
3       single      .564  .412  .453  .814  .407  .543
1       stratified  .412  .458  .414  .629  .622  .625
2       stratified  .460  .448  .435  .775  .768  .771
3       stratified  .497  .467  .464  .833  .826  .829
1       binarised   .115  .786  .155  .057  .878  .107
2       binarised   .171  .705  .221  .114  .885  .202
3       binarised   .227  .686  .292  .259  .903  .402
Source(s): Google Translate
Questions?
References
Timothy Baldwin and Marco Lui. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA, 2010.
William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA, 1994.
Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy, 2006.
Lillian Lee. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics 2001 (AISTATS 2001), pages 65–72, Key West, USA, 2001.
Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.