How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15...

25
How can we capture multiword expressions? Seongmin Mun 1 , Guillaume Desagulier 2 , Kyungwon Lee 3 1 Lifemedia Interdisciplinary Program, Ajou University, South Korea 1 UMR 7114 MoDyCo - CNRS, University Paris Nanterre, France 2 UMR 7114 MoDyCo - University Paris 8, CNRS, University Nanterre 3 Department of Digital Media, Ajou University, South Korea

Transcript of How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15...

Page 1: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

How can we capture multiword expressions?

Seongmin Mun1, Guillaume Desagulier2, Kyungwon Lee3

1 Lifemedia Interdisciplinary Program, Ajou University, South Korea1 UMR 7114 MoDyCo - CNRS, University Paris Nanterre, France

2 UMR 7114 MoDyCo - University Paris 8, CNRS, University Nanterre3 Department of Digital Media, Ajou University, South Korea

Page 2: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Introduction

Words in a text corpus include features and information.

Analyzing these words can improve a user’s understanding of the corpus.

2/25

Page 3: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Previous studies

WEIWEI CUI SHIXIA LIU Z. W. H. W.: How hierarchical topics evolve in large text corpora. In IEEE Transactions on Visualization and Computer Graphics (2014), vol. 20, pp. 2281–2290.

3/25

Page 4: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Research background and purpose

Words can be broadly divided into two categories.

4/25

Page 5: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Research background and purpose

“With profound gratitude and great humility, I accept your nomination for the presidency of the United States.”

5/25

Page 6: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Research background and purpose

“With profound gratitude and great humility, I accept your nomination for the presidency of the United States.”

Gratitude meaning that can be expressed in one word

6/25

Page 7: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Research background and purpose

“With profound gratitude and great humility, I accept your nomination for the presidency of the United States.”

United States meaning must be described using a combination of words.

7/25

Page 8: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Research background and purpose

How can we capture multiword expressions?

To this aim, we designed an algorithm.

8/25

Page 9: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

9/25

Page 10: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

10/25

InputCorpus

ü Java Code

ü Out Put

Page 11: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

11/25

ü MongoDB & JAVA

DistinguishSentence

Page 12: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

12/25

ü MongoDB & JAVA

ü Out Put

DistinguishSentence

Page 13: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

13/25

ü N-gram

Processing

ü Dependency Parser

N-gram method is a contiguous sequence of N items from a given sequence of text.

Dependency parser can provide a simple description of the grammatical relationships in a sentence.

Page 14: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

14/25

ü N-gram

Processing

ü Java Code

Page 15: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

15/25

ü N-gram

Processing

“Shall I wake him up?”

Unigram : Shall, I, wake, him, up.Bigram : Shall I, I wake, wake him, him up.Trigram : Shall I wake, I wake him, wake him up.

Page 16: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

16/25

ü Dependency parser

Processing

ü Java Code # Stanford_CoreNLP

Page 17: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

17/25

ü Dependency parser

Processing

“Shall I wake him up?”

Page 18: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

18/25

ü Dependency parser

Processing

“Shall I wake him up?”

“ Shall I wake him up ? ”(root)

(dobj)(compound:prt)

(punct)

(dep)

(nsubj)

Page 19: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

19/25

ü N-gram Sentence : “Shall I wake him up?”

Words candidate

Page 20: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

20/25

ü Dependency parser Sentence : “Shall I wake him up?”

Words candidate

Page 21: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

21/25

ü English Dictionaries

API : http://services.aonaware.com/DictService/

Words validation

English dictionaries

• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993)• Easton's 1897 Bible Dictionary• Elements database 20001107• The Free On-line Dictionary of Computing (27 SEP 03)• U.S. Gazetteer (1990)• The Collaborative International Dictionary of English v.0.44• Hitchcock's Bible Names Dictionary (late 1800's)• Jargon File (4.3.1, 29 June 2001)• Virtual Entity of Relevant Acronyms (Version 1.9, June 2002)• WordNet (r) 2.0• CIA World Factbook 2002• User Dictionary

Page 22: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

22/25

ü User Dictionary

Words validation

ü MongoDB & JAVA

ü Detail :

{ "_id" : { "$oid" : ”Unique number"} , "word" : "" , "meaning" : ""}

Page 23: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

23/25

ü Accuracy

Words validation

ü N-gram & Dependency parser

Sentence : “Shall I wake him up?”

N-gram Dependency graph + N-gram

Page 24: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Data processing

24/25

ü Data Base : MongoDB & JAVA

ü Dictionary Collection

StoringResults

ü Sentence Collection

ü Stopwords Collection

Page 25: How can we capture multiword expressions?...• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993) • Easton's 1897 Bible Dictionary • Elements database 20001107 • The Free

Q&A

Thank you for [email protected]

25/25