Cross Language Concept Mining

Post on 22-Jan-2018


{Motaz.Saad, David.Langlois, Kamel.Smaili}@loria.fr

1. OVERVIEW

Journalist Review System (JRS)

Objective: Build a Journalist Review System (JRS) that enables media trackers (journalists) to collect multilingual comparable articles concerning a given topic and to:
• Explore and review opinions.
• Automatically detect splits in public opinion (e.g., for vs. against an issue or person).
• Identify and review more detailed opinions (joy, sadness, anger, ...).

Requirements:
• Comparable corpora for training and testing.
• Comparability Measure (CM): to compare multilingual articles.
• Sentiment Based Comparability Measure (SCM): to compare the opinions of comparable articles.

2. COMPARABLE CORPORA

• Sources: Wikipedia and the Euronews website.
• Aligning Wikipedia articles
  ⇒ Use interlanguage links
  ⇒ [[ar:مطر]] [[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[en:Rain]]
• Aligning Euronews articles
  ⇒ Parse the HTML links of each English article and fetch the corresponding Arabic/French articles.
• Corpora information: publicly available at http://sf.net/projects/crlcl/

                          AFEWC                        eNews
                          English  French  Arabic      English  French  Arabic
Articles                  40290    40290   40290       34442    34442   34442
Sentences                 4.8M     2.7M    1.2M        744K     746K    622K
Avg #sentences/article    119      69      30          21       21      17
Avg #words/article        2266     1435    548         198      200     161
Words                     91.3M    57.8M   22M         6.8M     6.9M    5.5M
Vocabulary                2.8M     1.9M    1.5M        232K     256K    373K
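The interlanguage-link alignment described in this section can be sketched as follows. This is a minimal illustration, assuming the links appear in the raw wikitext in the `[[lang:Title]]` form shown above; the function name `interlanguage_links` and the regular expression are illustrative choices, not the authors' tooling.

```python
import re

# Matches links such as [[de:Regen]]: a lowercase language code, a colon, a title.
# Assumption: language codes are 2-3 lowercase letters, optionally with
# hyphenated suffixes; namespaces such as File: or Category: start uppercase
# and are therefore skipped.
LINK = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")

def interlanguage_links(wikitext):
    """Return a {language code: article title} mapping found in the wikitext."""
    return {lang: title.strip() for lang, title in LINK.findall(wikitext)}

sample = "[[de:Regen]] [[es:Lluvia]] [[fr:Pluie]] [[en:Rain]]"
links = interlanguage_links(sample)
print(links["fr"])
```

Given such a mapping for an English article, the French and Arabic titles point to the articles to pair with it.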

3. COMPARABILITY MEASURE (CM)

• CM is based on the cosine similarity between comparable articles.
• Word weights are represented either as binary values or as word frequencies.
• Cosine similarity (cosineCM) performs better for CM than the binary measure (binCM).

[Figure: recall at ranks R1, R5 and R10 for binCM and cosineCM; data labels 0.36, 0.81, 1 and 0.49, 0.86, 1.]
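A cosine-based comparability score can be sketched as below. This is only an illustration under stated assumptions: both articles are assumed to be already tokenized and mapped into a shared vocabulary (for example through a bilingual dictionary), and the helper names are hypothetical. The poster does not spell out how binCM is computed, so the binary variant here simply plugs binary weights into the same cosine function.

```python
import math
from collections import Counter

def frequency_vector(tokens):
    """Bag of words weighted by word frequency."""
    return Counter(tokens)

def binary_vector(tokens):
    """Bag of words weighted by presence (1) only."""
    return {t: 1 for t in set(tokens)}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy articles, assumed to be already in a common vocabulary.
a = ["rain", "rain", "cloud"]
b = ["rain", "wind"]
freq_score = cosine(frequency_vector(a), frequency_vector(b))
bin_score = cosine(binary_vector(a), binary_vector(b))
print(round(freq_score, 3), round(bin_score, 3))
```

To retrieve the comparable article for a given source article, one would score every candidate target article and rank them; recall at R1, R5 and R10 then counts how often the true pair appears in the top 1, 5 or 10.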

4. SENTIMENT BASED COMPARABILITY MEASURE (SCM)

\[
\mathrm{scm}(c) = \left| \frac{\sum_{C(S_x)=c} P(S_x \mid c)}{N_x} - \frac{\sum_{C(S_y)=c} P(S_y \mid c)}{N_y} \right|
\]
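The formula can be sketched in code under one reading of its symbols, which the poster does not define: S_x and S_y are taken to be the sentences of the source and target articles, C(S) the class a sentiment classifier assigns to a sentence, P(S|c) the classifier's probability for that sentence, and N_x, N_y the number of sentences in each article. All of these, and the function names, are assumptions made for illustration.

```python
def class_score(sentences, c):
    """Sum of P(S|c) over sentences labelled c, divided by the sentence count.

    `sentences` is a list of (label, probability) pairs, one per sentence
    (an assumed representation of the classifier output).
    """
    if not sentences:
        return 0.0
    return sum(p for label, p in sentences if label == c) / len(sentences)

def scm(source, target, c):
    """Absolute difference between the class-c scores of two articles."""
    return abs(class_score(source, c) - class_score(target, c))

# Toy classifier output for a source article and a target article.
src = [("subjective", 0.9), ("objective", 0.8), ("subjective", 0.7)]
tgt = [("subjective", 0.6), ("objective", 0.9)]
print(round(scm(src, tgt, "subjective"), 3))
print(round(scm(src, tgt, "objective"), 3))
```

Under this reading, a value near 0 means the two articles carry class c to a similar degree, and larger values mean they diverge.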

5. SCM RESULTS

Corpora       scm(ō)   scm(o)   scm(p̄)   scm(p)

Parallel
AFP           0.02     0.02     0.1      0.12
ANN           0.05     0.06     0.1      0.1
ASB           0.07     0.1      0.12     0.14
TED           0.06     0.06     0.08     0.07
UN            0.05     0.02     0.07     0.08

Comparable
eNews         0.07     0.15     0.11     0.15
AFEWC         0.11     0.19     0.11     0.16

ō = subjective, o = objective, p̄ = negative, p = positive

AFP: Agence France-Presse, ANN: Annahar newspaper, ASB: Assabah newspaper, TED: talks from ted.com, UN: United Nations resolutions.

- Comparing CM results for parallel and comparable corpora ⇒ CM can capture comparability.
- Comparable articles do not have the same opinions ⇒ they vary in their objectivity and positivity.

6. MORPHOLOGICAL ANALYSIS

Transliteration   Arabic   English     French

katab             كتب      to write    écrire
kitab             كتاب     book        livre
maktab            مكتب     office      bureau
maktaba           مكتبة    library     bibliothèque

tair              طير      to fly      voler
ta-iar            طيار     pilot       pilote
matar             مطار     airport     aéroport
ta-ira            طائرة    airplane    avion
ta-ir             طائر     bird        oiseau

• Stemming and lemmatization for English and French.
• Rooting and light stemming for Arabic
  ⇒ Light stemming removes suffixes and prefixes.
  ⇒ Rooting removes suffixes and prefixes and reduces the word to its root.
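Light stemming as described above can be illustrated with a toy affix stripper. The prefix and suffix lists below are a small hypothetical sample chosen for the example, not the affix set used by the authors or by any particular Arabic stemmer, and the minimum-length guard is likewise an assumption.

```python
# Hypothetical, deliberately tiny affix lists (illustration only).
PREFIXES = ("ال", "و")   # definite article "al-", conjunction "wa-"
SUFFIXES = ("ات", "ة")   # feminine plural "-at", feminine marker "-a"

def light_stem(word, min_len=3):
    """Strip at most one listed prefix and one listed suffix,
    keeping at least `min_len` characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

print(light_stem("المكتبة"))   # "the library"
print(light_stem("الطائرات"))  # "the airplanes"
```

Rooting goes one step further and would map the remaining stem to its root (for instance to كتب for the first example), which needs pattern matching beyond simple affix removal and is not attempted in this sketch.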

7. COVERAGE RATE OF THE BILINGUAL DICTIONARY

Configuration         Coverage
morphAr-lemma         57%
morphAr-stemEn        50%
root-lemma            40%
root-stemEn           39%
lightStem-lemma       41%
lightStem-stemEn      41%
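The poster does not define how the coverage rate is computed. One plausible reading, used here purely as an assumption, is the share of distinct (morphologically normalized) corpus words that have an entry in the bilingual dictionary. The function name and the toy data are hypothetical.

```python
def coverage_rate(vocabulary, dictionary_entries):
    """Fraction of distinct vocabulary items found in the dictionary."""
    vocab = set(vocabulary)
    if not vocab:
        return 0.0
    return len(vocab & set(dictionary_entries)) / len(vocab)

# Toy example: normalized corpus words vs. dictionary headwords.
corpus_vocab = ["write", "book", "fly", "pilot"]
dictionary = {"write", "book", "bird"}
rate = coverage_rate(corpus_vocab, dictionary)
print(f"{rate:.0%}")
```

Under this reading, the choice of normalization (morphological analysis, rooting or light stemming on the Arabic side; lemmas or stems on the English side) changes which forms are looked up, and therefore the rate.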

8. FUTURE WORK

• Build a multilingual document representation model based on Latent Semantic Indexing to enhance CM.

• Refine SCM by improving sentiment detection and by covering more detailed sentiments, i.e., emotions expressed in words (joy, anger, pleasure, ...). This will be done by exploiting annotated lexicons and semantic networks.

• Develop an interface for journalists to review comparable articles.

9. REFERENCES

• Saad, M.; Langlois, D. & Smaili, K. (2013), Comparing Multilingual Comparable Articles Based On Opinions, in 'Proceedings of the Sixth Workshop on Building and Using Comparable Corpora', Association for Computational Linguistics, Sofia, Bulgaria, pp. 105-111.

• Saad, M.; Langlois, D. & Smaili, K. (2013), Extracting Comparable Articles from Wikipedia and Measuring Their Comparabilities, in '5th International Conference on Corpus Linguistics', University of Alicante, Spain.