
Some Interesting Data

Francis Bond

NTU

NICT

<[email protected]>

HG-251

Overview

➣ Japanese WordNet

➢ 60% coverage of the English WordNet (with Pictures)

➣ Tanaka Corpus

➣ NICT Multilingual Corpus

➣ NTU Multilingual Corpus

➣ Bracket Dic


WordNet Overview

➣ We are building an open Japanese WordNet,

inspired by the Princeton WordNet of English

➣ Version 1.1 now available

➢ nlpwww.nict.go.jp/wn-ja

∗ 49,190 Synsets

∗ 85,966 Words

∗ 156,684 Senses

∗ Illustrations for 541 Synsets

➢ Semantic structure based on Princeton WordNet


Overview

➣ Recently Added

➢ Japanese Definitions and Examples

➢ Links to other resources

➣ Still being extended

➢ Revised Structure

➢ Sense Tagged Corpora

➣ Imperfect Version Available now

Release Early, Release Often


Example Synset


Release Formats

➣ SynSet Word pairs (TAB separated)

➣ English and Japanese combined in sqlite3 database

➢ Includes sense links and ancestor table

➢ Perl module for manipulating it

➢ Python: http://subtech.g.hatena.ne.jp/y_yanbe/20090314/p2

➣ SynSet Illustration Pairs

➣ WordNet-LMF (xml)

➣ Online lookup (NICT, MLSN, LangGrid, Kyoto Project, . . . )
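A minimal sketch of how the combined English and Japanese sqlite3 release could be queried from Python. The two-table schema (`word`, `sense`), the synset identifiers, and the function names are assumptions made for illustration; the real database may be laid out differently, so check its schema before reusing the query.

```python
import sqlite3

def build_toy_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE word  (wordid INTEGER PRIMARY KEY, lang TEXT, lemma TEXT);
        CREATE TABLE sense (synset TEXT, wordid INTEGER, lang TEXT);
    """)
    con.executemany("INSERT INTO word VALUES (?, ?, ?)", [
        (1, "eng", "seal"), (2, "jpn", "アザラシ"),
        (3, "eng", "bat"), (4, "jpn", "蝙蝠"),
    ])
    # Synset ids below are made up for the example.
    con.executemany("INSERT INTO sense VALUES (?, ?, ?)", [
        ("seal-n-1", 1, "eng"), ("seal-n-1", 2, "jpn"),
        ("bat-n-1", 3, "eng"), ("bat-n-1", 4, "jpn"),
    ])
    return con

def translations(con, lemma, src="eng", tgt="jpn"):
    """Return target-language lemmas that share a synset with `lemma`."""
    rows = con.execute("""
        SELECT DISTINCT w2.lemma
        FROM word w1
        JOIN sense s1 ON s1.wordid = w1.wordid
        JOIN sense s2 ON s2.synset = s1.synset
        JOIN word  w2 ON w2.wordid = s2.wordid
        WHERE w1.lemma = ? AND w1.lang = ? AND w2.lang = ?
        ORDER BY w2.lemma
    """, (lemma, src, tgt)).fetchall()
    return [row[0] for row in rows]
```

Because both languages hang off the same synset identifiers, a translation lookup is a self-join on the sense table.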


Illustrations

➣ 849 illustrations (541 synsets)

➢ Tagged as OK (default), weird or best (iff more than one illustration)

➢ An illustration illustrates its hypernyms

➣ SVG images (include metadata)

➣ From the Open ClipArt Library (public domain)

➣ Many more untagged images (11,209 in the 2009-01-10 snapshot)


Illustration Example

Field | Illustration 1 | Illustration 2
dir | animals/mammals/ | recreation/sports/
basename | bat_orlando_karam | cricket_bat
title | bat | Cricket Bat
tags | bat, mammal, animal | sports, cricket, recreation
synset | bat#n#1 | cricket bat#n#1, bat#n#4
Ja | 蝙蝠 | バット
match | hypernym (bat ⊂ mammal) | monosemous
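The two match types in the example can be sketched as follows. This is an illustrative guess at the logic, not the procedure actually used: a title with a single sense is accepted as monosemous, and an ambiguous title is resolved when one of the image tags names a hypernym of a candidate sense. The `SENSES` and `HYPERNYMS` dictionaries are toy stand-ins for WordNet lookups, and the hypernym listed for bat#n#4 is invented.

```python
def match_illustration(title, tags, senses, hypernyms):
    """Return (synset, match_type) for an image, or (None, None) if undecided."""
    candidates = senses.get(title.lower(), [])
    if len(candidates) == 1:
        # Only one sense for this title: nothing to disambiguate.
        return candidates[0], "monosemous"
    tag_set = {tag.strip().lower() for tag in tags}
    for synset in candidates:
        # A tag that names a hypernym of the sense confirms that sense.
        if tag_set & hypernyms.get(synset, set()):
            return synset, "hypernym"
    return None, None

# Toy data (hypothetical stand-ins for WordNet lookups).
SENSES = {
    "bat": ["bat#n#1", "bat#n#4"],
    "cricket bat": ["cricket bat#n#1"],
}
HYPERNYMS = {
    "bat#n#1": {"mammal"},
    "bat#n#4": {"club"},
}
```

With the tags from the table, the animal image resolves to bat#n#1 through the tag "mammal", while "Cricket Bat" needs no tags at all.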


Text Annotation

➣ Base the sense inventory on actual usage

➣ Obtain sense frequencies

➣ Annotate data for WSD

Corpus | Sentences | Words | Content Words | Trans
Semcor | 12,842 | 224,260 | 120,000 | En, It, Ja
Glosses | 165,977 | 1,468,347 | 459,000 | En, Ja, ...
Kyoto | 38,383 | 969,558 | 527,000 | Ja, En, Zh

Table 1: Corpora to be Sense Tagged


Translating Glosses

➣ Translated Glosses and Examples in the Princeton WordNet

➢ Can be used for the Japanese WordNet

➢ Useful for unsupervised WSD (LESK)

➢ Freely redistributable (aligned to: En, Ko, Es, . . . )

➣ Sense Tagged as the Princeton WordNet Gloss Corpus

➣ Definition for Seal-アザラシ

➢ “any of numerous marine mammals that come on shore to breed; chiefly of cold regions”

➢ 「繁殖のために岸に上がる海洋性哺乳動物の各種;主に寒帯地域に」
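To show why glosses are useful for unsupervised WSD, here is a minimal sketch of a simplified Lesk procedure: pick the sense whose gloss shares the most content words with the context. The glosses are the four "party" definitions quoted in the Multilingual WSD slide of this deck; the stop-word list and function names are assumptions chosen for the example, and real systems use richer overlap measures.

```python
import re

# Small ad hoc stop-word list (an assumption for this example).
STOP = {"a", "an", "the", "of", "to", "in", "on", "for", "which", "some"}

def tokens(text):
    """Lower-cased content words of a text."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}

def simplified_lesk(context, glosses):
    """Return the sense whose gloss overlaps most with the context."""
    ctx = tokens(context)
    # Ties go to the sense listed first.
    return max(glosses, key=lambda sense: len(ctx & tokens(glosses[sense])))

PARTY = {
    "party1": "an organization to gain political power",
    "party2": "a group of people gathered together for pleasure",
    "party3": "a band of people associated temporarily in some activity",
    "party4": "an occasion on which people can assemble for social interaction",
}
```

The same function works on translated glosses, provided the context and the glosses are tokenized in the same language.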


Tanaka Corpus

➣ http://tatoeba.org/

➣ Aligned short sentences

➢ 150,000 English

➢ 150,000 Japanese

➢ 19,000 Chinese

➣ Can be used to find correspondences

➣ Petter is using it to find translation rules

http://172.21.171.235/~petterha/comp-mrs/overview.html


Trilingual Data

➣ Use NICT multilingual corpus (JEC)

➢ crosslingual links narrow the interpretations

➣ The result is a cheaply tagged corpus

Ja: 委員長として党の結束を大切にしたい

En: As the chairperson, | I | would like to | regard | the unity of | the party | as important.

Zh: 作为 | 委员长 | , | 我 | 希望 | 维护 | 党内 | 团结。


Multilingual WSD

➣ English

➢ party1 “an organization to gain political power”

➢ party2 “a group of people gathered together for pleasure”

➢ party3 “a band of people associated temporarily in some activity”

➢ party4 “an occasion on which people can assemble for social interaction”

➣ Japanese

➢ 党1 “an organization to gain political power”
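The narrowing effect of crosslingual links can be sketched as a set intersection: each aligned word contributes its candidate senses, and only senses shared by all of them survive. The sense labels follow the slide; the function name and the data layout are assumptions for illustration.

```python
def shared_senses(*sense_sets):
    """Intersect the candidate senses of words aligned across languages."""
    sets = [set(s) for s in sense_sets]
    return set.intersection(*sets) if sets else set()

# Candidate senses per (language, word), using the labels from the slide.
CANDIDATES = {
    ("en", "party"): {"party1", "party2", "party3", "party4"},
    ("ja", "党"): {"party1"},
}

remaining = shared_senses(CANDIDATES[("en", "party")], CANDIDATES[("ja", "党")])
```

Here the four-way ambiguity of English "party" collapses to the single political sense once it is aligned with 党, which is what makes the tagging cheap.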


Conclusion

➣ Created the Japanese WordNet

➢ Usable

➢ Accessible

➣ Similar information available for Chinese

➣ Can be used in assignment two


NICT multilingual corpus

➣ Aiming to have 10 million sentences of parallel text

➢ 2-3 million Ja-Zh

➢ Remainder Ja-En

➢ Small amount of other languages

➣ Make as free as copyright allows us

➢ Used for SMT, EBMT

➢ MASTAR - tourism, manga, JC - scientific

➣ Cathedral and Bazaar test corpus (Language Grid)

➢ En, Zh, Ja, Ko, Fr, Es, De, It, Pt


NTU Multilingual Corpus

➣ Aligned corpus of Chinese, English, Malay, Tamil and other languages if possible

➢ Get data from national and local government publications

➢ Many data cleansing issues: pdf2txt, pictures, ...


NTU Multilingual Corpus Sample

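➣ So practise the 10-Minute Mozzie Wipeout everyday to ensure that you and your loved ones stay safe, healthy and happy all year long.

➣ Berita baiknya ialah, oleh kerana cara penularan adalah serupa, dengan mempraktikkan kebiasaan anti-nyamuk secara berterusan, kita boleh, dengan secara efektif, menglindungi diri dari ancaman berkembar Chikungunya dan Denggi. (Malay; roughly: "The good news is that, because the mode of transmission is similar, by consistently practising anti-mosquito habits we can effectively protect ourselves from the twin threats of Chikungunya and Dengue.")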


Bracket-Dic

➣ Translation quality is getting better, but unknown words and combinations remain a problem

➣ Dictionaries have incomplete coverage

➣ Bilingual corpora are still relatively small

➣ Look for examples in basically monolingual data

➢ Text with English glosses in brackets

Pustejovskyの生成的辞書(generative lexicon)の記述方式を利用して . . . (roughly: "using the notation of Pustejovsky's generative lexicon . . .")


➢ Extract the English and the text before it

Pustejovskyの生成的辞書(generative lexicon)の記述方式を利用して . . .

generative lexicon ⇔ Pustejovskyの生成的辞書

➢ Problems

∗ How much text before it should we extract?

∗ Is the bracketed text really a gloss?


Previous Research

➣ Several earlier works:

➢ Using the Web as a Bilingual Dictionary (Nagata et al., 2001)

➢ Using Bilingual Web Data to Mine and Rank Translations (Li et al., 2003)

➢ Acquiring Compound Word Translations Both Automatically and Dynamically (Zhang and Isahara, 2004)

➣ Why redo it? We have new corpora – not all on the web

➢ Possible to improve by looking at many terms

➢ Possible to add domain info

➢ We want the translations


Corpora Examined

Lang | Name | Size (MB) | Comment
Ja | WWW | 514,212 |
Ja | J-STAGE | 604 | some English
Ja | NLP | 43 | some duplicates
Zh | BLCU | 80,000 | OCR errors
Zh | SohuTechNews | 974 | XML
Zh | GigaWord | 4,444 | LDC (traditional)

Table 2: Size and types of Corpora Used

A lot of raw data — how many terms can we find?


Two kinds of patterns

➣ Fully Bracketed Examples

(1) 「収穫逓減の法則(the law of diminishing return)」

(2) 《德拉吉报道》(DrudgeReport)

(3) “魔兽世界”(World of Warcraft)

➣ Partly Bracketed Examples

(4) 図1に,明瞭性 (Clarity)・新奇性 (Novelty)

(5) 目标递归策略 (GoalRecursionStrategy),这是一种内部指导的策略。


Regular Expressions

full1 = tlbr(term+)trbr lbr(gloss{3,})rbr
    e.g. 《德拉吉报道》(DrudgeReport)

full2 = tlbr(term+)lbr(gloss{3,})rbr trbr
    e.g. 「収穫逓減の法則(the law of diminishing return)」

part = (term+)lbr(gloss{3,})rbr
    e.g. 図1に,明瞭性 (Clarity)

term = any non punctuation (1+ nonlatin (CJK))

gloss = latin, connector punctuation, full space latin, whitespace

lbr = ((

rbr = ))

tlbr = Unicode: Punctuation, Open

trbr = Unicode: Punctuation, Close
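A minimal Python sketch of the three patterns, under stated simplifications. Python's built-in `re` module has no Unicode property classes, so bracket types are checked with `unicodedata.category` ("Ps" for Punctuation, Open and "Pe" for Punctuation, Close). The gloss class here covers only ASCII Latin letters, underscore, and whitespace; the names `GLOSS` and `extract_pairs` are hypothetical, and this is not the expression actually run over the corpora.

```python
import re
import unicodedata

# A gloss in ASCII or full-width parentheses: at least three characters,
# starting with a Latin letter (simplified gloss class).
GLOSS = re.compile(r"[(\uFF08]([A-Za-z][A-Za-z_\s]{2,})[)\uFF09]")

def _is_cjk(ch):
    return unicodedata.name(ch, "").startswith(("CJK", "HIRAGANA", "KATAKANA"))

def _cat(text, i):
    """Unicode general category of text[i], or '' when i is out of range."""
    return unicodedata.category(text[i]) if 0 <= i < len(text) else ""

def extract_pairs(text):
    """Return (pattern, term, gloss) triples found in text."""
    pairs = []
    for m in GLOSS.finditer(text):
        end = m.start()
        while end > 0 and text[end - 1].isspace():
            end -= 1                      # allow a space before the bracket
        kind = "part"
        if _cat(text, end - 1) == "Pe":   # term closed before the gloss: full1
            kind, end = "full1", end - 1
        start = end
        # The term is the run of non-space, non-punctuation characters.
        while (start > 0 and not text[start - 1].isspace()
               and not _cat(text, start - 1).startswith("P")):
            start -= 1
        term = text[start:end]
        if not any(_is_cjk(c) for c in term):
            continue                      # require at least one CJK character
        opened = _cat(text, start - 1) == "Ps"
        if kind == "full1" and not opened:
            continue
        if kind == "part" and opened and _cat(text, m.end()) == "Pe":
            kind = "full2"                # brackets enclose term and gloss
        pairs.append((kind, term, m.group(1).strip()))
    return pairs
```

On the bracketed Pustejovsky sentence this sketch returns the whole left context "Pustejovskyの生成的辞書" as the term, which is exactly the "how much text before it should we extract?" problem.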


Stop Words

Roman Numerals: xii, iii, . . .

Units: MPa, Kmh, . . .

Smilies: T T, o , m m, x x . . .

Week Days: mon, wed, fri, . . .

Other: pdf, PDF . . .
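A small sketch of how such a stop list might be applied to extracted glosses. The roman-numeral pattern and the contents of `STOP_GLOSSES` are assumptions: the slide names only a few examples per class, so the remaining weekdays are filled in here for illustration, and smilies are left out.

```python
import re

# Roman numerals up to 3999, case-insensitive; the lookahead rejects "".
ROMAN = re.compile(
    r"(?=[ivxlcdm])m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})",
    re.IGNORECASE,
)

# Units, week days, and other non-gloss strings (partly assumed).
STOP_GLOSSES = {"mpa", "kmh", "mon", "tue", "wed", "thu", "fri", "sat", "sun", "pdf"}

def is_stop_gloss(gloss):
    """True if the bracketed text is unlikely to be a translation."""
    g = gloss.strip()
    return g.lower() in STOP_GLOSSES or ROMAN.fullmatch(g) is not None
```

A filter like this also rejects real words that happen to be valid roman numerals (for example "mix"), so it trades a little recall for cleaner pairs.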


Distribution of Bracketed Terms

Lang | Name | Fully | Partly
Ja | WWW | 896,000 | 14,861,000
Ja | J-STAGE | 552 | 45,000
Ja | NLP | 64 | 1,300
Zh | BLCU | 151,000 | 6,563,000
Zh | SohuTechNews | 5,400 | 123,000

➣ A lot of hits!

➣ Terms the authors think are important

➣ Many terms not found in lexicons

➢ 生成的辞書 ≡ generative lexicon


世の中はそう甘くない (Life is not that easy)

# | English | Chinese
10 | World of Warcraft | 魔兽世界
5 | WOW | 魔兽世界
3 | WoW | 魔兽世界
2 | WorldofWarcraft | 魔兽世界
1 | World orWarcraft | 魔兽世界
1 | World of WarcraftTM | 魔兽世界
1 | Warcraft | 魔兽世界

➣ Errors in the source corpus

➢ OCR errors

➢ Mistyping


Partially bracketed is harder

➣ Discard unshared left hand contexts

➢ Macaca fuscata 特にニホンザル

➢ Macaca fuscata 日本産哺乳類の中でこのような動作が可能な ニホンザル

➣ Discard non-term left hand contexts

➢ s/(^.*を)//;

➢ s/^や//;

➢ s/^対//;

➣ Merge whitespace variations:

World of Warcraft ≈ WorldofWarcraft
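The three cleaning steps can be sketched in Python as follows. This is an approximation under assumptions, not the script actually used: shared left context is taken as the longest common suffix of the candidate terms, the prefix rules mirror the three substitutions listed above, and whitespace variants are merged under the most frequent spelling. All function names are hypothetical.

```python
import os
import re
from collections import Counter, defaultdict

# Python equivalents of the three substitutions on the slide.
LEFT_CONTEXT_RULES = [re.compile(r"^.*を"), re.compile(r"^や"), re.compile(r"^対")]

def shared_suffix(terms):
    """Longest suffix common to all candidate terms for one gloss."""
    reversed_terms = [t[::-1] for t in terms]
    return os.path.commonprefix(reversed_terms)[::-1]

def strip_left_context(term):
    """Drop non-term material from the start of a candidate term."""
    for rule in LEFT_CONTEXT_RULES:
        term = rule.sub("", term, count=1)
    return term

def merge_whitespace_variants(rows):
    """rows: (gloss, term, freq) triples.

    Returns {(gloss without whitespace, term): (most frequent spelling, total)}.
    """
    groups = defaultdict(Counter)
    for gloss, term, freq in rows:
        groups[(re.sub(r"\s+", "", gloss), term)][gloss] += freq
    return {
        key: (spellings.most_common(1)[0][0], sum(spellings.values()))
        for key, spellings in groups.items()
    }
```

Applied to the two Macaca fuscata contexts, `shared_suffix` keeps only ニホンザル, and merging the World of Warcraft rows adds the 2 hits for "WorldofWarcraft" to the 10 for "World of Warcraft".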


Results after Merging

Lang | Name | Raw # | Merged
Ja | WWW | 14,861,000 | 1,635,000
Ja | J-STAGE | 45,000 | 20,000
Ja | NLP | 1,300 | 372
Zh | BLCU | 6,563,000 | 964,000
Zh | Sohu | 123,000 | 33,000

➣ Fewer, better pairs


More Examples (Table 5)

Corpus | Rank | English | J/C | Freq
BLCU | 1 | SOD | 超氧化物歧化酶 | 18,000
BLCU | 1001 | quercetin | 槲皮素 | 121
BLCU | 2001 | Alcan | 加拿大铝业公司 | 55
BLCU | 3001 | CSTC | 中国软件评测中心 | 34
BLCU | 4001 | John | 约翰 | 18
JST | 1 | Bunseki Kagaku | 分析化学 | 517
JST | 1001 | STEM | 走査型TEM | 2
JST | 2001 | structural factor | 構造係数 | 2
JST | 3001 | explicit attitude | *的態度 | 1
JST | 4001 | Lake Magadi | マガディ湖 | 1


Evaluation: 言語処理学会誌 (Journal of the Association for Natural Language Processing)

Known: good terms already in our lexicons

文法機能 grammatical function

(EDR, JMDict, CICC and lingdic)

New: good terms not in any of our lexicons

生成的辞書 generative lexicon

Now in lingdic

General: good translations but not NLP terms

すべての学生 all of the students (Example)

Other: the remainder

似テイル ohxap ketidu (Mongolian)

形態素解析システム JUMAN (Description)


Results for the NLP Corpus

Status # %

Known 61 16%

New 138 37%

General 74 20%

Other 99 27%

Total 372 100%

➣ Many new useful terms (73% ok: Known + New + General)

➣ Useful as-is for alignment

➣ Could still be cleaned further


ToDo: Other Languages vs English

➣ Extract Data from Other Languages

➢ Thai, Korean, Russian, Greek, . . .

➣ Test the English as English

➢ Re-space

➢ Compare to a language model


ToDo: Internal Structure

➣ Is it compositional? (if so do we need it?)

➢ 複合述部 ≡ complex predicate

➣ Are the lengths roughly equivalent?

➢ One en word ≈ two characters (we can measure)

➢ What about TLAs (three letter acronyms)?

GTF:http://www.xs4all.nl/~jtv/gtf/

➣ Is it a transliteration?

ペンシルバニア大学 University of Pennsylvania

德拉吉报道 Drudge Report
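The length check can be measured directly. The sketch below computes the average number of Japanese or Chinese characters per English word over a few term pairs taken from these slides; the function name and the choice of a simple mean are assumptions for illustration.

```python
def chars_per_word(pairs):
    """Mean number of term characters per English word, or None if no data."""
    ratios = [
        len(term) / len(gloss.split())
        for gloss, term in pairs
        if gloss.split()
    ]
    return sum(ratios) / len(ratios) if ratios else None

# (English gloss, Japanese term) pairs that appear elsewhere in the slides.
PAIRS = [
    ("generative lexicon", "生成的辞書"),
    ("complex predicate", "複合述部"),
    ("grammatical function", "文法機能"),
    ("structural factor", "構造係数"),
]

average = chars_per_word(PAIRS)
```

Pairs whose ratio falls far from the corpus-wide average are candidates for misaligned left context or for acronyms, which need separate handling.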


Link Japanese-Chinese Results

➣ Sohu-JST (1,695 terms)

Japanese | English | Chinese
ウェブログ | blog | 博客
ファイアーウォール | firewall | 防火墙
分解能 | resolution | 分辨率
ヒューマンインターフェース | human interface | 人机界面

➣ Should evaluate with JC dic

➣ Can do internal and external confirmation

➣ Give the data to Tsunakawa and Erdenebat


ToDo: Knowledge Extraction

➣ System Names: (look for /システム$/)

➢ GREEN 選択していく論説文要約システム

➣ Better name handling

写信给当时的大数学家欧拉 Euler

大数学家 欧拉 Euler

➣ General text mining


Conclusions: Bracket-Dic

➣ Currently extracted for Japanese and Chinese

➣ Some cleaning/merging

⋆ Will release cleaned high frequency data

⋆ Will also release RAW data (as far as possible)

➢ let other people clean it

➢ ask for (but don’t expect) feedback


Meta Conclusions

➣ There is a lot of interesting data

➣ You now know enough to do things with it

➣ Let’s try to do something fun for the second assignment

➢ Adding illustrations

➢ Finding illustrations

➢ Making translation rules

➢ Disambiguating text
