How can we capture multiword expressions?
Seongmin Mun1, Guillaume Desagulier2, Anne Lacheret3, Kyungwon Lee4
1 Lifemedia Interdisciplinary Program, Ajou University, South Korea
1,3 UMR 7114 MoDyCo - CNRS, University Paris Nanterre, France
2 UMR 7114 MoDyCo - University Paris 8, CNRS, University Nanterre
4 Department of Digital Media, Ajou University, South Korea
Introduction
The topics of a text corpus capture its characteristic features and information.
Analyzing these topics can improve a user’s understanding of the corpus.
Previous studies
Weiwei Cui, Shixia Liu, Z. W., H. W.: How hierarchical topics evolve in large text corpora. IEEE Transactions on Visualization and Computer Graphics, vol. 20 (2014), pp. 2281–2290.
Research background and purpose
Topics can be broadly divided into two categories.
“With profound gratitude and great humility, I accept your nomination for the presidency of the United States.”
“Gratitude”: a meaning that can be expressed in one word.
“United States”: a meaning that must be expressed by a combination of words.
How can we capture multiword expressions?
To this end, we designed an algorithm.
Data processing
Raw corpus → Processing → Topic candidate → Topic validation → Generate topics
Raw corpus (U.S. presidential speeches)
https://millercenter.org/the-presidency/presidential-speeches
Processing
• N-grams
• POS tagging
Pre-processing
• Cleaning with RegExp
• Lemmatization
• Tokenization
• Lowercasing
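The pre-processing steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the order of the steps, and the tiny `LEMMAS` lookup table are all assumptions made so the example runs with the standard library only.

```python
import re

# Hypothetical lemma table for illustration only; a real system would use a
# full lemmatizer rather than a hand-written lookup.
LEMMAS = {"flies": "fly", "speeches": "speech", "states": "state"}

def preprocess(text):
    """Clean, lowercase, tokenize, and lemmatize a raw string."""
    # Cleaning with a regular expression: replace non-letters with spaces.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercasing, then whitespace tokenization.
    tokens = cleaned.lower().split()
    # Table-based lemmatization; unknown tokens are kept unchanged.
    return [LEMMAS.get(tok, tok) for tok in tokens]

print(preprocess("Time flies like an arrow."))
# ['time', 'fly', 'like', 'an', 'arrow']
```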
An n-gram is a contiguous sequence of N items from a given sequence of text.
“Time flies like an arrow.”
Unigram: Time, flies, like, an, arrow.
Bigram: Time flies, flies like, like an, an arrow.
Trigram: Time flies like, flies like an, like an arrow.
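The same lists can be produced with a few lines of Python. This is a small sketch; the function name `ngrams` and the whitespace tokenization are illustrative choices, not taken from the slides.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Time flies like an arrow".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence of 5 tokens yields 5 unigrams, 4 bigrams, and 3 trigrams, matching the lists above.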
Topic candidate extraction & filtering
• Frequency counting
• Filters:
  ✓ Stopwords
  ✓ Thresholds
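One possible reading of this step is sketched below. The stopword list, the threshold value, and the rule of dropping n-grams that begin or end with a stopword are assumptions for illustration; the slides do not specify them.

```python
from collections import Counter

# Hypothetical stopword list and threshold, chosen only for illustration.
STOPWORDS = {"the", "of", "and", "a", "an", "for", "your", "with", "i"}
MIN_COUNT = 2

def topic_candidates(ngram_list, stopwords=STOPWORDS, min_count=MIN_COUNT):
    """Count n-grams, then drop those that start or end with a stopword
    or that occur fewer than min_count times."""
    counts = Counter(ngram_list)
    kept = {}
    for gram, freq in counts.items():
        words = gram.split()
        if words[0] in stopwords or words[-1] in stopwords:
            continue  # stopword filter
        if freq >= min_count:  # frequency threshold
            kept[gram] = freq
    return kept

grams = ["united states", "of the", "united states", "the united", "great humility"]
print(topic_candidates(grams))  # {'united states': 2}
```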
Topic validation
• Human annotation
• Matching with dictionaries
English dictionaries
• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993)
• Easton's 1897 Bible Dictionary
• Elements database 20001107
• The Free On-line Dictionary of Computing (27 SEP 03)
• U.S. Gazetteer (1990)
• The Collaborative International Dictionary of English v.0.44
• Hitchcock's Bible Names Dictionary (late 1800's)
• Jargon File (4.3.1, 29 June 2001)
• Virtual Entity of Relevant Acronyms (Version 1.9, June 2002)
• WordNet (r) 2.0
• CIA World Factbook 2002
• User Dictionary
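Dictionary matching could look something like the sketch below. The `DICTIONARY_HEADWORDS` set is a hypothetical stand-in for the headwords of the dictionaries listed above, and routing unmatched candidates to human annotation is an assumption about how the two validation steps interact.

```python
# Hypothetical headword set standing in for the dictionaries listed above.
DICTIONARY_HEADWORDS = {"united states", "gratitude", "wake up"}

def validate(candidates, headwords=DICTIONARY_HEADWORDS):
    """Split candidates into those found in the dictionary and those
    left over (here assumed to go to human annotation)."""
    matched = [c for c in candidates if c.lower() in headwords]
    unmatched = [c for c in candidates if c.lower() not in headwords]
    return matched, unmatched

matched, unmatched = validate(["United States", "profound gratitude", "gratitude"])
print(matched)    # ['United States', 'gratitude']
print(unmatched)  # ['profound gratitude']
```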
Visual system
http://ressources.modyco.fr/sm/MultiwordVis/
Ambiguous sentence
“Shall I wake him up?”
With n-grams alone, we cannot extract “wake up” from this sentence, because “him” separates the two words.
Dependency tag
Dependency tags provide a simple description of the grammatical relationships in a sentence.
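To illustrate, the sketch below pairs a verb with its particle in “Shall I wake him up?” even though the two words are not adjacent. The parse is written by hand and is an assumption for illustration (a real system would obtain it from a dependency parser); the relation labels follow Universal Dependencies naming.

```python
# Hand-written dependency parse of "Shall I wake him up?" (an assumption for
# illustration, not parser output).
# Each token: (index, word, head index, relation). The root's head is -1.
PARSE = [
    (0, "Shall", 2, "aux"),
    (1, "I", 2, "nsubj"),
    (2, "wake", -1, "root"),
    (3, "him", 2, "obj"),
    (4, "up", 2, "compound:prt"),
]

def particle_verbs(parse, relation="compound:prt"):
    """Pair each particle with its head verb, adjacent or not."""
    words = {idx: word for idx, word, _, _ in parse}
    return [f"{words[head]} {word}"
            for _, word, head, rel in parse if rel == relation]

print(particle_verbs(PARSE))  # ['wake up']
```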
Improving algorithm
N-gram + Dependency tag
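One way the two sources of candidates might be combined is to take their union, as in the sketch below. This is only a guess at the idea behind the slide: the function names, the hand-written heads and relations, and the choice of set union are all assumptions.

```python
def contiguous_bigrams(words):
    """Candidates from the n-gram side: adjacent word pairs."""
    return {f"{a} {b}" for a, b in zip(words, words[1:])}

def dependency_pairs(words, heads, relations, wanted=("compound:prt",)):
    """Candidates from the dependency side: head + dependent for the
    wanted relations, regardless of distance."""
    return {f"{words[h]} {w}"
            for w, h, r in zip(words, heads, relations) if r in wanted}

# Hand-annotated example (an assumption, not parser output).
words = ["shall", "i", "wake", "him", "up"]
heads = [2, 2, -1, 2, 2]
relations = ["aux", "nsubj", "root", "obj", "compound:prt"]

candidates = contiguous_bigrams(words) | dependency_pairs(words, heads, relations)
print(sorted(candidates))
```

Here “wake up” enters the candidate set only through the dependency side; the bigram side contributes “wake him” and “him up” instead.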
Data processing
Raw corpus → Processing → Topic candidate → Topic validation → Generate topics
Distinguish sentence
Storing results
Processing
• N-grams
• Dependency tag
• POS tagging
Pre-processing
• Cleaning with RegExp
• Lemmatization
• Tokenization
• Lowercasing