Building a Dictionary from WWW
![Page 1: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/1.jpg)
Building a Dictionary from WWW
Virach Sornlertlamvanich
Information Research and Development Division
National Electronics and Computer Technology Center
Languages and Cultures of the East and West (LACEW)
July 25-27, 2001, Tsukuba University
![Page 2: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/2.jpg)
25-27/7/2001 National Electronics and Computer Technology Center
Motivations
• The WWW is the only huge, up-to-date online resource with thorough coverage of languages and subject areas
• Lexicon and terminology databases are needed in
  – Electronic dictionaries
  – Machine translation
  – Text summarization, etc.
• Lack of open, sharable resources
  – No standard formats
  – Legal issues
Collaborative Open Lexicon Development
![Page 3: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/3.jpg)
Concepts
• Open Format
  – XML-based
• Open Protocol
  – XML-based request/response
  – DICT (RFC 2229)
• Open Participation
  – Data entry
  – Approver
• Open Source
  – Software Tools
  – Dictionary Content
• Corpus-based
  – To reflect the uses and meanings of terms in real life
  – To assist human thinking process
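DICT (RFC 2229), named above as one candidate open protocol, is a line-based text protocol. The following is a minimal sketch of how a client might format a DEFINE request and read a status line. The helper names `define_command` and `parse_status` are hypothetical, no network connection is made, and nothing here is taken from the project's own tools.

```python
def define_command(word, database="*"):
    """Build a DICT DEFINE command; "*" asks the server to search all databases."""
    # Quote the word so that embedded spaces are kept together as one argument.
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{escaped}"\r\n'.encode("utf-8")


def parse_status(line):
    """Split a DICT status line such as '552 no match' into (code, text)."""
    code, _, text = line.strip().partition(" ")
    return int(code), text


if __name__ == "__main__":
    print(define_command("island"))
    print(parse_status("552 no match\r\n"))
```

A real client would send the command over a TCP connection to a DICT server and read the definitions that follow a 150/151 status; that part is left out of the sketch.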
![Page 4: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/4.jpg)
Format : Standards Survey
• Models
  – ISO 12620:1999 : Terminology data categories
  – ISO 12200:1999 : Machine-Readable Terminology Interchange Format (MARTIF)
• Variations
  – OTELO : OLIF (Open Lexicon Interchange Format), a format for MT dictionary interchange
  – OSCAR (LISA) : TMX (Translation Memory eXchange format)
  – SALT (Standards-based Access to multilingual Lexicons and Terminologies) : XLT (XML representation of Lexicons and Terminologies)
![Page 5: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/5.jpg)
Development Procedure
[Figure: processing pipeline. Labels: WWW, Robot, Text Corpus, Sample Texts, Term Candidate Extraction, Terms, Syntactic Structure Analysis, Concept Classification, Term-Concepts, Ontology, Concept Correlation Discovery, Annotated Concepts]
• Document collection with robot (w/ language identification)
• Term candidate extraction (C4.5 on MI, Entropy, etc.)
• Syntactic structure extraction (POS tagger)
• Semantic correlation discovery (Ontology)
• Context-based concept classification (text classification)
![Page 6: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/6.jpg)
A Thai Running Text
สวัสดีครับ ผมชื่อวิรัช ศรเลิศล้ำวาณิช ปัจจุบันเป็นผู้อำนวยการฝ่ายวิจัยและพัฒนาสาขาสารสนเทศ ศูนย์เทคโนโลยีอิเล็กทรอนิกส์และคอมพิวเตอร์แห่งชาติ ผมเริ่มสนใจงานวิจัยในสาขาการประมวลผลภาษาธรรมชาติตั้งแต่ที่ได้มีโอกาสเข้าร่วมโครงการวิจัยและพัฒนาระบบแปลภาษาในปี 1989
(English: Hello, my name is Virach Sornlertlamvanich. I am currently the director of the Information Research and Development Division, National Electronics and Computer Technology Center. I have been interested in natural language processing research since I had the opportunity to join a machine translation research and development project in 1989.)
Word/Sentence Segmentation
สวัสดี ครับ ผม ชื่อ วิรัช ศรเลิศล้ำวาณิช ปัจจุบัน เป็น ผู้อำนวยการ ฝ่าย วิจัย และ พัฒนา สาขา สารสนเทศ ศูนย์ เทคโนโลยี อิเล็กทรอนิกส์ และ คอมพิวเตอร์ แห่ง ชาติ ผม เริ่ม สนใจ งาน วิจัย ใน สาขา การ ประมวลผล ภาษา ธรรมชาติ ตั้งแต่ ที่ ได้ มี โอกาส เข้าร่วม โครงการ วิจัย และ พัฒนา ระบบ แปลภาษา ใน ปี 1989
![Page 7: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/7.jpg)
Writing System
• 46 consonants; 18 vowels; 4 tones; 9 symbols; 10 digits; written on 4 levels
• No punctuation
• No word/sentence marker
• No upper/lower case letter
• No inflection
[Figure: a Thai syllable drawn on four levels. Labels: consonant, tone, vowel, vowel, baseline]
Hard to identify (single/compound) word/phrase/sentence
![Page 8: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/8.jpg)
Sentence Extraction
[Figure: sentence extraction flow. Labels: Input paragraph; Word segmentation and POS tagging; Word sequence with tagged POS; Winnow (feature-based ML); Training POS-tagged corpus; Trained network; Paragraph with sentence break]
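The slide names Winnow as the feature-based learner for sentence-break detection. Below is a minimal sketch of the basic Winnow update rule (multiplicative promotion and demotion over binary features). The four features and the toy training data are assumptions made for illustration and are not the features used in the original system.

```python
# Hypothetical binary features describing a candidate break position:
# 0 = previous word is a sentence-final particle, 1 = next word is a conjunction,
# 2 = a space follows, 3 = next word is a pronoun.
EXAMPLES = [
    ({0, 2}, True), ({1, 2}, False), ({3}, True),
    ({1}, False), ({2}, False), ({0, 1}, True),
]


def predict(weights, threshold, active):
    """Predict a sentence break when the summed weights of active features reach the threshold."""
    return sum(weights[i] for i in active) >= threshold


def train_winnow(examples, n_features, alpha=2.0, epochs=10):
    """Basic Winnow: weights start at 1 and change only on mistakes."""
    weights = [1.0] * n_features
    threshold = n_features / 2
    for _ in range(epochs):
        for active, label in examples:
            guess = predict(weights, threshold, active)
            if guess and not label:        # false positive: demote active features
                for i in active:
                    weights[i] /= alpha
            elif label and not guess:      # false negative: promote active features
                for i in active:
                    weights[i] *= alpha
    return weights, threshold


if __name__ == "__main__":
    weights, threshold = train_winnow(EXAMPLES, n_features=4)
    print(weights, predict(weights, threshold, {3, 2}))
```

Winnow suits this kind of task because only a few of many candidate features are usually relevant, and the multiplicative updates drive the weights of irrelevant features down quickly.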
![Page 9: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/9.jpg)
Accuracy in Word/Sentence Segmentation
• Word Segmentation
  – Longest matching (92%)
  – Maximal matching (93%)
  – POS tri-gram (96%)
  – Machine learning (97%)
• Sentence Segmentation
  – POS tri-gram (85%)
  – Machine learning (89%)
Supervised approaches
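As an illustration of the simplest method in the list, longest matching, here is a minimal sketch of greedy dictionary-based segmentation. The five-word lexicon is a toy assumption; a real system would use a full dictionary, and this sketch makes no claim about how the reported accuracies were obtained.

```python
def longest_matching(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the longest lexicon entry."""
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon entry is found.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no entry matched: emit a single character as an unknown token
        words.append(text[i:j])
        i = j
    return words


if __name__ == "__main__":
    lexicon = {"ผม", "สนใจ", "งาน", "วิจัย", "งานวิจัย"}
    print(longest_matching("ผมสนใจงานวิจัย", lexicon))
```

Because the choice at each position is greedy, an early long match can block the correct analysis of what follows, which is one reason the statistical methods in the list score higher.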
![Page 10: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/10.jpg)
Term Candidate Extraction
• Virach Sornlertlamvanich et al. (COLING 2000) : Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm
  – C4.5-trained decision tree for determining potential word boundaries from MI, entropy, and linguistic information
  – Capable of discovering new words in documents without assistance from a static dictionary
![Page 11: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/11.jpg)
Mutual Information
High mutual information implies that xy and z co-occur more often than expected by chance. If xyz is a word, its left and right mutual information (Lm and Rm) must be high.
…E function… and ...Function...
$$ Lm(xyz) = \frac{p(xyz)}{p(x)\,p(yz)}, \qquad Rm(xyz) = \frac{p(xyz)}{p(xy)\,p(z)} $$
where
  x is the leftmost character of string xyz
  y is the middle substring of xyz
  z is the rightmost character of string xyz
  p( ) is the probability function.
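A minimal sketch of how the two scores might be computed, using the ratio form from the cited COLING 2000 paper: Lm(xyz) = p(xyz) / (p(x) p(yz)) and Rm(xyz) = p(xyz) / (p(xy) p(z)). Estimating p( ) as the relative frequency of a substring among all substrings of the same length is an assumption of this sketch, and the function names are hypothetical.

```python
def prob(corpus, s):
    """Relative frequency of substring s among all substrings of the same length."""
    positions = len(corpus) - len(s) + 1
    hits = sum(corpus.startswith(s, i) for i in range(positions))
    return hits / positions


def left_mi(corpus, s):
    """Lm(xyz) = p(xyz) / (p(x) * p(yz)); s must occur in the corpus and have length >= 2."""
    return prob(corpus, s) / (prob(corpus, s[0]) * prob(corpus, s[1:]))


def right_mi(corpus, s):
    """Rm(xyz) = p(xyz) / (p(xy) * p(z)); s must occur in the corpus and have length >= 2."""
    return prob(corpus, s) / (prob(corpus, s[:-1]) * prob(corpus, s[-1]))


if __name__ == "__main__":
    corpus = "abcabcabd"
    print(left_mi(corpus, "abc"), right_mi(corpus, "abc"))
```

Values well above 1 mean the parts appear together more often than independent occurrence would predict.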
![Page 12: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/12.jpg)
Entropy
Entropy shows the variety of characters before and after a word. If y is a word, its left and right entropy must be high.
...?function... , ...?unction...
$$ Le(y) = -\sum_{x \in A} p(xy \mid y)\,\log_2 p(xy \mid y), \qquad Re(y) = -\sum_{z \in A} p(yz \mid y)\,\log_2 p(yz \mid y) $$
where
  A is the set of characters
  x is the leftmost character of string xyz
  y is the middle substring of xyz
  z is the rightmost character of string xyz
  p( ) is the probability function.
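A minimal sketch of left and right entropy, assuming the usual definition as the entropy of the distribution of characters immediately before (or after) each occurrence of y. The helper name `side_entropy` and the spaceless toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def side_entropy(corpus, y, side):
    """Entropy (in bits) of the characters adjacent to y on the given side ("left" or "right")."""
    neighbours = Counter()
    start = corpus.find(y)
    while start != -1:
        k = start - 1 if side == "left" else start + len(y)
        if 0 <= k < len(corpus):          # skip occurrences at the corpus edge
            neighbours[corpus[k]] += 1
        start = corpus.find(y, start + 1)
    total = sum(neighbours.values())
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


if __name__ == "__main__":
    corpus = "thefunction,afunction.onefunction"
    print(side_entropy(corpus, "function", "left"))  # varied left context
    print(side_entropy(corpus, "unction", "left"))   # always preceded by "f"
```

In the toy corpus, "function" is preceded by different characters, so its left entropy is positive, while "unction" is always preceded by "f", so its left entropy is zero; this mirrors the "?function" versus "?unction" contrast on the slide.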
![Page 13: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/13.jpg)
Other Features
• Frequency
  Words tend to be used more often than non-word string sequences.
• Length
  Short strings are likely to happen by chance. The long and short strings should be treated differently.
• Functional Words
  Functional words are used mostly in phrases. They are useful to disambiguate words and phrases.
Result of subjective test :
  Word precision 85%
  Word recall 56%
![Page 14: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/14.jpg)
Evaluation Result of Word Extraction
| | Extracted words | Existing in RID | Not existing in RID |
|---|---|---|---|
| Training set (2933) | 1643 | 1082 (65.9%) | 561 (34.1%) |
| Test set (2720) | 1526 | 1046 (68.5%) | 480 (31.5%) |

RID : Royal Institute Dictionary (a Thai-Thai dictionary of 30,000 words)
![Page 15: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/15.jpg)
Concept Classification
• Words and their contexts in the corpora
• Manual word-sense disambiguation
• Unsupervised word sense disambiguation (Yarowsky 1995)
เกาะ 1 (sense : to attach)
  … มัน เกาะ ตัวเองกับกิ่งไม้ … (It clings to a tree branch.)
  … ผู้โดยสารไม่จำเป็นต้องยืน เกาะ ห่วงอีกต่อไปแล้ว … (Passengers don't have to stand holding the straps anymore.)
เกาะ 2 (sense : an island)
  … บ้านผมอยู่ที่ เกาะ สมุย … (I live on Samui island.)
  … ญี่ปุ่นประกอบด้วย เกาะ ใหญ่ 4 เกาะ … (There are four big islands in Japan.)
![Page 16: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/16.jpg)
Concept Classification
![Page 17: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/17.jpg)
Syntactic Structure Analysis
• Sentence/word segmentation by POS trigram tagger
  – POS assignment
  – Word co-occurrence
• Parser
  – Pattern of usages
![Page 18: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/18.jpg)
Ontologies
• EDR
  – Approach: Word description as employed in dictionaries
  – Problem: Ambiguities and incomputability
• WordNet
  – Approach: Synonym set and simple semantic relations to other words
  – Problem: Ambiguities
• UW
  – Approach: Headwords and semantic restrictions
  – Advantage: Computability and no ambiguity
![Page 19: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/19.jpg)
Ontologies
Representation of concept “tired” in different schemes
• EDR
  – having or displaying a need for rest
  – having lost interest
  – lack of imagination
• WordNet 1.5
  – A1 : tired (vs. rested)
  – A2 : bromidic, commonplace, hackneyed, …
  – V1 : tire, pall, grow weary, fatigue
  – V2 : tire, wear upon, fag out
  – V3 : run down, exhaust, sap, …
  – V4 : …
• UW
  – tired
  – tired(icl>physical)
  – tired(icl>mental)
![Page 20: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/20.jpg)
Universal Word (UW)
• UW format : <headword> ( <list of restrictions> )
  e.g. book (icl > do, obj > room)
• Headword : An English word that roughly describes the UW sense.
• Restrictions :
  – Inclusion (icl) indicates the class of the sense
    e.g. car (icl > movable thing)
![Page 21: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/21.jpg)
Universal Word (UW)
• Restrictions (continued)
  – UNL semantic relations
    e.g. eat (agt > volitional thing, obj > food)
    The agent of this UW is restricted to be a volitional thing.
    The object of this UW is restricted to be food.
UW Class Hierarchy
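The UW notation described on these two slides, a headword followed by an optional parenthesised list of `relation > value` restrictions, is regular enough to parse mechanically. The sketch below is a hypothetical helper that handles only flat restriction lists like the examples shown; nested restrictions, which full UNL allows, are out of its scope.

```python
import re

# Headword, then an optional "( ... )" restriction list.
UW_PATTERN = re.compile(r"^\s*([^()]+?)\s*(?:\((.*)\))?\s*$")


def parse_uw(text):
    """Return (headword, [(relation, value), ...]) for a flat UW expression."""
    match = UW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a UW expression: {text!r}")
    headword, body = match.group(1), match.group(2)
    restrictions = []
    for item in (body or "").split(","):
        if item.strip():
            relation, _, value = item.partition(">")
            restrictions.append((relation.strip(), value.strip()))
    return headword, restrictions


if __name__ == "__main__":
    print(parse_uw("eat ( agt > volitional thing, obj > food )"))
    print(parse_uw("tired"))
```

Keeping the restrictions as an ordered list of pairs, rather than a dictionary, allows the same relation to appear more than once.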
![Page 22: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/22.jpg)
Architecture
• Centralized activities
  – For data integrity & consistency
• Distributed sites
  – For open participation
  – For backing up
• Job-based
  – Jobs generated by corpus analysis tools
  – Participants download jobs to work off-line and submit back when done
![Page 23: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/23.jpg)
Job-based Working
[Figure: job-based workflow around a Central DB and a Job Billboard. For each job type (Job A, Job B) the labels are: Job Generator, Job Pool, Participant, Job Submittal, Job Acceptor, Job Approval Pool, Approver, Job Approval Acceptor, Job Approved Pool]
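One way to read the job-based workflow is as a small state machine in which a job moves from the pool, through submission and review, to the approved pool. The sketch below encodes that reading. The state names follow the diagram labels, but the exact ordering, the absence of a rejection path, and the names `JobState` and `advance` are assumptions of this sketch rather than details given on the slide.

```python
from enum import Enum


class JobState(Enum):
    POOL = "job pool"
    SUBMITTED = "job submittal"
    APPROVAL_POOL = "job approval pool"
    APPROVED = "job approved pool"


# Assumed linear flow; the slide does not show what happens to rejected work.
NEXT_STATE = {
    JobState.POOL: JobState.SUBMITTED,           # participant works off-line and submits
    JobState.SUBMITTED: JobState.APPROVAL_POOL,  # acceptor queues the result for review
    JobState.APPROVAL_POOL: JobState.APPROVED,   # approver accepts the result
}


def advance(state):
    """Move a job to its next state, or fail if it is already approved."""
    if state not in NEXT_STATE:
        raise ValueError(f"no transition from {state.name}")
    return NEXT_STATE[state]


if __name__ == "__main__":
    state = JobState.POOL
    while state in NEXT_STATE:
        state = advance(state)
        print(state.value)
```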
![Page 24: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/24.jpg)
Network Connection
• Committee nodes
  – Replicate same database
  – Closely synchronized
  – Provide service to neighbor participants
• Agents (optional)
  – Propagate communications between committee and participants
[Figure: network diagram with nodes labeled Committee, Agent, and Participant]
![Page 25: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/25.jpg)
Contents to Develop
• Thai word list
• Corpus-based Thai lexicon
• Co-occurrence dictionary
• Thai ontology
![Page 26: Building a Dictionary from WWW](https://reader035.fdocuments.us/reader035/viewer/2022062309/5681502b550346895dbe1c84/html5/thumbnails/26.jpg)
SNLP+O-COCOSDA
The Fifth Symposium on Natural Language Processing + Oriental COCOSDA Workshop 2002
9-11 May 2002
Hua Hin, Prachuapkirikhan, Thailand
http://kind.siit.tu.ac.th/snlp-o-cocosda2002/ or
http://www.links.nectec.or.th/itech/snlp-o-cocosda2002/