Collocation Extraction Based on Syntactic Criteria
Violeta Seretan
Department of Translation Technology, Faculty of Translation and Interpreting
University of Geneva
September 2013
Recent Advances in Natural Language Processing, 7-13 September 2013, Hissar, Bulgaria
Acknowledgment
Language Technology Laboratory, Department of Linguistics, University of Geneva
Eric Wehrli, Luka Nerima, Paola Merlo
Outline On collocations Importance Features Need for analysis Syntax-based extractors Evaluation
1 On collocations
2 Importance of collocations
3 Distinguishing features
4 The need for morphosyntactic analysis
5 Syntax-based extractors
6 Method, results, evaluation
7 Conclusion
Violeta Seretan Collocation Extraction RANLP 2013, Hissar 3 / 100
I chose to run for the presidency at this moment in history because I believe deeply that we cannot solve the challenges of our time unless we solve them together
Source: https://my.barackobama.com/page/content/hisownwords/
Language is made up of collocations
I tender my heartfelt gratitude to all of them, while taking full responsibility for all errors . . .
[Mel’čuk 1998, 23]
Collocations
“In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made”
[Kjellmer 1987, 10]
Collocations
“Collocation is the way words combine in a language to produce natural-sounding speech and writing”
[Lea and Runcie 2002, vii]
‘Appendix to the Grammar’
“Knowledge that will account for speakers’ ability to construct and understand phrases and expressions in their language which are not covered by the grammar, the lexicon, and the principles of compositional semantics”
[Fillmore et al. 1988, 504]
Only available to native speakers
“Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.”
[Ellis 2008]
More examples
EN ask a question
IT fare una domanda, ES hacer una pregunta ‘make’
RO a pune o întrebare, FR poser une question ‘put’
More examples
EN reach an agreement
FR parvenir à un accord ‘arrive, get to’
IT trovare un accordo ‘find’
Even more examples
EN narrow majority
FR courte majorité ‘short’
EN bring to justice
FR traduire en justice ‘translate’
RO urmări în justiție ‘follow track’
EN story breaks, strike a deal, in sharp contrast, draw criticism, entertain hope, experience difficulty, foot the bill, meet requirement, fine weather, deep impression, serious injury
Importance
“Collocations make up the lion’s share of the phraseme inventory, and thus they deserve our special attention.”
[Mel’čuk 1998, 24]
Importance for Machine Translation
“collocations are the key to producing more acceptable output”
[Orliac and Dillinger 2003, 292]
Importance for Natural Language Generation
“collocations are not only considered useful, but also a problem”
[Heylen et al. 1994, 1240]
Importance for Language Analysis
1 Collocations act as lexical disambiguators
to break a record
break (V) - about 50 senses
record (N) - about 10 senses
break + record, taken independently - about 500 potential interpretations (50 × 10)
break - record (V-O collocation): 1 sense
“a polysemous word exhibits essentially only one sense per collocation” [Yarowsky 1993, 266]
Importance for Language Analysis
2 Collocations act as structural disambiguators
human resource development
[Figure: competing parse trees for “human resource development”: [NP [AP human] [NP resource [NP development]]] vs. [NP [NP [AP human] resource] development] vs. ...]
The number of potential parses is exponential in sentence length.
Collocational knowledge guides the syntactic parsing (e.g., [Wehrli et al. 2010]).
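To make the growth concrete, here is a minimal sketch that counts the binary bracketings of an n-word sequence (the Catalan numbers). Using the number of binary bracketings as a stand-in for the number of potential parses is an assumption made for illustration only; a real grammar rules out some analyses and adds others through lexical ambiguity. The function name `count_bracketings` is hypothetical.

```python
from math import comb


def count_bracketings(n_words: int) -> int:
    """Number of distinct binary bracketings of a sequence of n_words items.

    This is the Catalan number C(n-1) = comb(2(n-1), n-1) / n.
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)


# Growth of the count with sequence length (illustrative lengths only).
for n in (2, 3, 5, 10, 20):
    print(n, count_bracketings(n))
```

Under this simplification, a three-word compound such as human resource development has two binary bracketings, and the count grows very quickly from there.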
Importance for Corpus Linguistics
Source: http://www.linguistik-online.de/31_07/danielsson.html
Importance for Lexicography
Source: Collins COBUILD online
NLP applications
machine translation
syntactic parsing
natural language generation
word sense disambiguation
topic segmentation
text summarization
text classification
information retrieval
OCR, speech recognition
Distinguishing features
1 Collocations are partly compositional
3 Collocations are morphosyntactically flexible
Syntactic flexibility
make proposal
A proposal for the financing of the variable costs will be made to the Committee . . .
submit proposal
A joint proposal which addressed such elements as notification, consultations, conciliation and mediation, arbitration, panel procedures, technical assistance, adoption of panel reports and GATT’s surveillance of their implementation was submitted on behalf of fourteen participants.
Syntactic flexibility
serious problem
We all need to renounce the use of arms in order to be able to address the country’s serious political, social and economic problems . . .
important issue
The issue of new technologies and their application in education naturally generates considerable interest and is extremely important.
Syntactic flexibility
paragraphe dispose ‘paragraph stipulates’
Notant en outre que le paragraphe 5 de l’Acte final reprenant les résultats des Négociations commerciales multilatérales du Cycle d’Uruguay (ci-après dénommés respectivement l’“Acte final” et le “Cycle d’Uruguay”) dispose que . . .
Identification approaches
1 Approaches based on linear proximity
Definition:
“Collocation is the cooccurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening.” [Sinclair 1991, 170]
2 Approaches based on structural proximity
Definition:
“lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other” [Bartsch 2004, 76]
Inflection
jouer ‘play’: ∼ 25 forms
rôle ‘role’: 2 forms
jouer – rôle: ∼ 50 forms (25 × 2)
1 type: jouer - rôle
50 forms: jouent rôle, rôle joue, joue rôles, ...
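A minimal sketch of why lemmatisation matters here: surface pairs are mapped to a single order-independent pair of lemmas, so that all inflected variants count towards one type. The lookup table `LEMMAS` and the function `group_by_type` are hypothetical and cover only the forms quoted above; a real system would rely on a morphological analyser.

```python
from collections import defaultdict

# Toy lemma lookup limited to the forms on the slide (an assumption for illustration).
LEMMAS = {
    "joue": "jouer",
    "jouent": "jouer",
    "rôle": "rôle",
    "rôles": "rôle",
}


def group_by_type(pairs):
    """Group surface pairs under an order-independent pair of lemmas."""
    types = defaultdict(set)
    for w1, w2 in pairs:
        key = tuple(sorted((LEMMAS[w1], LEMMAS[w2])))
        types[key].add((w1, w2))
    return dict(types)


observed = [("jouent", "rôle"), ("rôle", "joue"), ("joue", "rôles")]
groups = group_by_type(observed)
for lemma_pair, forms in groups.items():
    print(lemma_pair, len(forms), "surface forms")
```

Without the lemma step, each of the three surface pairs would be counted as a separate candidate, which dilutes the frequency evidence for the single type.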
Lexical ambiguity
human: Noun, Adjective
rights: Noun, Adjective, Verb, Adverb
human – rights: Noun – Noun, Noun – Verb, Adjective – Noun, ...
Inversion
set – record
Michael Phelps sets all-time Olympic record.
A new world record has been set.
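The sketch below illustrates how syntactic relations neutralise inversion: once the passive subject is recognised as the underlying object, both sentences yield the same verb-object pair. The analyses are written by hand, and the relation labels (`subj`, `obj`, `passive_subj`) are hypothetical; they are not the output of Fips or of any other parser.

```python
# Hand-written analyses: (head lemma, relation, dependent lemma).
# Relation labels are hypothetical and chosen only for this illustration.
PARSES = {
    "Michael Phelps sets all-time Olympic record.": [
        ("set", "subj", "Michael Phelps"),
        ("set", "obj", "record"),
    ],
    "A new world record has been set.": [
        ("set", "passive_subj", "record"),
    ],
}


def verb_object_pairs(triples):
    """Return (verb, object) pairs, treating a passive subject as the underlying object."""
    return [(head, dep) for head, rel, dep in triples if rel in ("obj", "passive_subj")]


for sentence, triples in PARSES.items():
    print(sentence, "->", verb_object_pairs(triples))
```

A purely linear method would instead see sets ... record in one sentence and record ... set in the other, and would treat them as different word sequences.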
Long-distance dependencies
donner – exemple ‘give – example’
Le visionnaire a donné, lors de sa conférence magistrale, à l’occasion de la remise du Prix Latsis 2011 aux différents lauréats, le mercredi 30 novembre 2011, dans la salle Piaget de l’Université de Genève à Uni-Dufour, devenu trop petite pour accueillir le monde scientifique et le public venus de tous les coins et recoins de la Suisse, l’exemple du professeur et ancien président sénégalais le poète Léopold Sédar Senghor qui maîtrisait autant la culture du pays de Marianne que la langue de Molière avec perfection pour devenir le premier Noir membre de l’Académie française.
Long-distance dependencies
put – off
It is still wise to find a tactful way of putting those who have a virus or the flu, or any other problem off for a little while longer.
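As a quick check of how far apart the two items are, the sketch below counts the words between putting and off in the sentence above and compares the result with a limit of four intervening words, the figure quoted from Sinclair earlier. The tokenisation (a simple regular expression) and the function name `intervening_words` are assumptions made for illustration.

```python
import re

SENTENCE = (
    "It is still wise to find a tactful way of putting those who have a virus or "
    "the flu, or any other problem off for a little while longer."
)

MAX_INTERVENING = 4  # "a maximum of four words intervening"


def intervening_words(sentence: str, first: str, second: str) -> int:
    """Count the word tokens between the first occurrences of two words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return tokens.index(second) - tokens.index(first) - 1


gap = intervening_words(SENTENCE, "putting", "off")
print(gap, "intervening words; within window:", gap <= MAX_INTERVENING)
```

With this tokenisation the gap is 12 words, so a window-based method with the usual span would never pair the two items.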
Structural ambiguity
question - ask
Any question asked during the selection and interview process must be related to the job and the performance of that job.
The question asked if the grant funding could be used as start-up capital to develop this project.
Typical extraction method
Sliding window (Baseline)
1 Take into account all possible combinations within a 5-word collocational span
2 Apply an association measure to filter out noise and retain the best candidates at the top of the output list
WARNING
MANUALLY VALIDATE RESULTS BEFORE USE
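The two steps of the baseline can be sketched as follows. Several choices here are assumptions, not part of the slides: the span is read as five consecutive tokens including the first word, pointwise mutual information (PMI) stands in for "an association measure", probabilities are estimated in a simplified way from raw counts, and the toy corpus and the names `window_candidates` and `rank_by_pmi` are hypothetical.

```python
import math
import re
from collections import Counter


def window_candidates(tokens, span=5):
    """Yield ordered word pairs that fall within the same span-word window."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + span]:
            yield (w1, w2)


def rank_by_pmi(sentences, span=5, min_count=2):
    """Step 1: collect window pairs. Step 2: score them with PMI, best first."""
    word_freq, pair_freq = Counter(), Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        word_freq.update(tokens)
        pair_freq.update(window_candidates(tokens, span))
    n_words = sum(word_freq.values())
    n_pairs = sum(pair_freq.values())
    scored = []
    for (w1, w2), count in pair_freq.items():
        if count < min_count:  # drop rare pairs, for which PMI is unreliable
            continue
        p_pair = count / n_pairs
        p_independent = (word_freq[w1] / n_words) * (word_freq[w2] / n_words)
        scored.append(((w1, w2), math.log2(p_pair / p_independent)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


corpus = [
    "A new world record has been set.",
    "Michael Phelps sets all-time Olympic record.",
    "She set a new record yesterday.",
    "The record was set in 2008.",
]
for pair, score in rank_by_pmi(corpus):
    print(pair, round(score, 2))
```

Because the window keeps surface order and surface forms, record ... set and set ... record are counted as different pairs, and sets is not merged with set. This is the kind of noise that the warning above refers to.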
The multilingual challenge
In English, many syntactic relations are realised in a 5-word window.
But what about freer word order languages?
German
“Some properties of the German language make the task of extracting V-N collocations from German text corpora more difficult than for English corpora.” [Breidt 1993, 77]
“the assumption that a “semantic agent [...] is principally used before the verb” and a “semantic object [...] is used after it” as described in Smadja (1991a:180) does not hold for German. Therefore, complicated parsing is necessary to distinguish subject-verb from object-verb combinations.” [Breidt 1993, 77]
Korean
“The free order of Korean makes it hard to identify collocations.” [Kim et al. 1999, 71]
“Unfortunately, the approach for English has several limitations to work on Korean structure” [Kim et al. 1999, 71]
Solution
“Ideally, in order to identify lexical relations in a corpus one would need to first parse it to verify that the words are used in a single phrase structure. However, in practice, free-style texts contain a great deal of nonstandard features over which automatic parsers would fail. [...] This fact is being seriously challenged by current research [...] and might not be true in the near future.”
[Smadja 1993, 151]
“with recent significant increases in parsing efficiency and accuracy, there is no reason why explicit parse information should not be used” [Pearce 2002, 1530]
“Ideally, a full syntactic analysis of the source corpus would allow us to extract the cooccurrence directly from parse trees” [Evert 2004, 31]
Lin (1998, 1999) – English
dependency parser (unspecified)
sentences shorter than 25 words
9.7% errors in top results
6 types: N-D, N-A, N-N, V-N, N-V, V-Adv
Wu and Zhou (2003); Lu and Zhou (2004) – English, Chinese
syntactic parser (NLPWin, Microsoft Research)
7.85% errors in top results
3 types: V-O, N-A, V-Adv
Villada Moirón (2005) – Dutch
dependency parser (Alpino)
sentences shorter than 20 words
many PP-attachment errors; parser only used for chunking
2 types: P-N-P and PP-V
Orliac and Dillinger (2003) – English
deep parser (Logos)
limited grammatical coverage (does not handle relatives)
3 types: S-V, V-O, V-P-N
Others
Zinsmeister and Heid (2003), Schulte im Walde (2003) – German, statistical parser (LoPar)
Charest et al. (2007) – French, dependency parser (Antidote)
The LATL syntax-based collocation extractor: FipsCo
Goldman et al. (2001) – English, French
deep parser (Fips)
broad grammatical coverage
many types of collocation configurations
FipsCo precedes many syntax-based extractors, and overcomes limitations pertaining to parsing robustness, precision, coverage, as well as limitations regarding the list of supported syntactic types.
It has been developed mainly as a CAT tool for WTO translators in the project “Linguistic Analysis and Collocation Extraction”.
Initially available for English and French, it was further extended to Spanish, Italian, Greek, German and Romanian.
Experiments
Languages
English, French, Italian, Spanish [Seretan and Wehrli 2009], German, Greek [Michou and Seretan 2009], Romanian [Seretan and Wehrli 2010]
Corpora
WTO translation archives, The Economist, Canadian Hansard, Le Monde,Europarl, Mesagerul, WWW (Web as a corpus) ...
Exploitation
Lexical acquisition
for the Fips parser
for the Its-2 MT system [Wehrli et al. 2009]
Text Summarization [Seretan 2011]
Syntactic Parsing [Seretan and Wehrli 2011]
Terminology assistance [Wehrli 2003, Wehrli 2006]
Teaching
Research
Method
Fips parser [Wehrli 2007] – Sample output
Fips - Key facts
Constituent
simplified X-bar structure [XP L X R] (no intermediate level)
X – lexical head (A, N, V, D, P, Conj, ...)
L/R – lists of left/right subconstituents
Manually-built lexica
detailed morphosyntactic and semantic information: selectional properties, subcategorization information, syntactico-semantic features likely to influence the syntactic analysis
Algorithm - main operations
Project: assignment of constituent structures to lexical entries
Merge: combination of adjacent constituents
Move: creation of chains by linking surface positions of “moved” constituents to their corresponding canonical positions.
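The [XP L X R] structure and the Project and Merge operations can be pictured with a small data structure. This is only an illustrative sketch, not the Fips implementation: the class `Constituent` and the functions `project`, `merge_left` and `merge_right` are hypothetical names, the lexicon is reduced to a lemma plus a category, and Move is not modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Constituent:
    """Simplified X-bar constituent [XP L X R], with no intermediate level."""
    category: str  # category of the lexical head X: "A", "N", "V", "D", ...
    head: str      # the lexical head itself (here just a lemma)
    left: List["Constituent"] = field(default_factory=list)   # L subconstituents
    right: List["Constituent"] = field(default_factory=list)  # R subconstituents

    def bracketed(self) -> str:
        """Render the constituent in labelled bracket notation."""
        parts = [c.bracketed() for c in self.left]
        parts.append(self.head)
        parts.extend(c.bracketed() for c in self.right)
        return "[" + self.category + "P " + " ".join(parts) + "]"


def project(lemma: str, category: str) -> Constituent:
    """Project: assign a constituent structure to a lexical entry."""
    return Constituent(category, lemma)


def merge_left(host: Constituent, sub: Constituent) -> Constituent:
    """Merge: attach an adjacent constituent to the left of the host's head."""
    host.left.append(sub)
    return host


def merge_right(host: Constituent, sub: Constituent) -> Constituent:
    """Merge: attach an adjacent constituent to the right of the host's head."""
    host.right.append(sub)
    return host


# Toy analysis of "ask a simple question" (hypothetical, hand-built).
np = merge_left(project("question", "N"), project("simple", "A"))
dp = merge_right(project("a", "D"), np)
vp = merge_right(project("ask", "V"), dp)
print(vp.bracketed())  # [VP ask [DP a [NP [AP simple] question]]]
```

Keeping L and R as plain lists mirrors the "no intermediate level" simplification: every subconstituent hangs directly off the head's projection.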
Method – Stage 1: Candidate selection
1 Lexical filter: rule out auxiliary and modal verbs, proper nouns, common nouns representing titles (Mr.)
2 Structural filter:
predicate-argument relation in the arguments table of predicates
combinations <X, head of item in L/R> in a given syntactic relation, e.g., head-modifier, noun-adjective in FP (functional phrase)
Syntactic patterns
adjective-noun                 heavy smoker
noun-[predicate]-adjective     effort [be] devoted
noun-noun                      suicide attack
noun-preposition-noun          round of negotiations
noun-preposition               inquiry into
adjective-preposition          crazy about
subject-verb                   war breaks
verb-object                    meet requirement
verb-preposition-argument      bring to boil
verb-preposition               point out
adverb-verb                    fully support
adverb-adjective               highly important
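The two filters of Stage 1 can be sketched as a traversal that pairs each head X with the heads of its L/R subconstituents. Everything here is an assumption made for illustration: the `Node` class, the `excluded` flag standing in for the lexical filter, the small `PATTERNS` table (a subset of the patterns listed above), and the flattened toy tree, in which noun phrases attach directly to the verb.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    cat: str     # category of the head: "N", "V", "A", "Adv", ...
    lemma: str   # lemma of the head
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)
    excluded: bool = False  # lexical filter: auxiliary, modal, proper noun, title


# (head category, subconstituent category, side) -> syntactic pattern
PATTERNS = {
    ("N", "A", "L"): "adjective-noun",
    ("N", "N", "L"): "noun-noun",
    ("V", "N", "L"): "subject-verb",
    ("V", "N", "R"): "verb-object",
    ("V", "Adv", "L"): "adverb-verb",
    ("A", "Adv", "L"): "adverb-adjective",
}


def candidates(node: Node) -> Iterator[Tuple[str, str, str]]:
    """Yield (pattern, first lemma, second lemma) for every allowed pair."""
    for side, subs in (("L", node.left), ("R", node.right)):
        for sub in subs:
            pattern = PATTERNS.get((node.cat, sub.cat, side))  # structural filter
            if pattern and not (node.excluded or sub.excluded):  # lexical filter
                first, second = (sub, node) if side == "L" else (node, sub)
                yield (pattern, first.lemma, second.lemma)
            yield from candidates(sub)  # descend into the subconstituent


# Toy tree for "Canada fully meets the strict requirement".
tree = Node(
    "V", "meet",
    left=[Node("N", "Canada", excluded=True), Node("Adv", "fully")],
    right=[Node("N", "requirement", left=[Node("A", "strict")])],
)
print(list(candidates(tree)))
```

The proper noun is dropped by the lexical filter, so no subject-verb pair is produced for "Canada", while the other three pairs are kept regardless of how far apart the words are in the sentence.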
Method – Stage 2: Candidate ranking
Method – Summing up:
Baseline
1 Candidate identification: Take into account all possible combinations within a 5-word collocational span
2 Candidate ranking: Apply an association measure to filter out noise and retain the best candidates at the top of the output list
Syntax-based extraction
1 Candidate identification: Take into account syntactically bound combinations, according to the parse tree built by Fips for the input sentence
2 Candidate ranking: Apply an association measure to retain the best candidates at the top of the output list
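The two-stage scheme summarised above can be sketched as follows. The slides do not name the association measure on this slide, so the log-likelihood ratio is used here purely as an assumed example. The function names, the reading of the "5-word span" as "up to five following words", and the toy corpus are also assumptions.

```python
import math
from collections import Counter
from typing import Iterable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def window_candidates(tokens: Sequence[str], span: int = 5) -> Iterator[Pair]:
    """Baseline: pair each word with every word among the next `span` words."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + span]:
            yield (w1, w2)


def log_likelihood(o11: int, f1: int, f2: int, n: int) -> float:
    """Log-likelihood ratio (G2) of a 2x2 contingency table.

    o11: joint frequency of the pair; f1, f2: frequencies of the first and
    second item among all candidate pairs; n: total number of candidate pairs.
    """
    observed = (o11, f1 - o11, f2 - o11, n - f1 - f2 + o11)
    expected = (
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def rank(pairs: Iterable[Pair]) -> List[Tuple[float, str, str]]:
    """Score each candidate pair type and sort with the best scores first."""
    pairs = list(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    n = len(pairs)
    scored = [
        (log_likelihood(c, first[a], second[b], n), a, b)
        for (a, b), c in joint.items()
    ]
    return sorted(scored, reverse=True)


tokens = "we meet the requirement and they meet the requirement too".split()
for score, a, b in rank(window_candidates(tokens))[:3]:
    print(f"{score:6.2f}  {a} {b}")
```

In this sketch the syntax-based method would differ only in its input: `rank` would receive the pairs read off the parse trees instead of the output of `window_candidates`.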
Results – Tokens identified
a very simple question which everyone in this country would like to ask
No doubt that will be partly due to the great contribution you and your colleagues in the Chair will make.
The provincial government made a very difficult but well balanced decision that enhances environmental, economic and social values for the area.
It is the new government’s responsibility to tackle with conviction and fairness the complex problems facing Canada ...
at a cost of $5 billion that is chiefly being met e by South Korea and Japan
Results – Syntactic environments
passivization:
I see that amendments to the report by Mr Mendez de Vigo and Mr Leinen have been tabled on this subject.
relativization:
The communication devotes no attention to the impact the newly announced policy measures will have on the candidate countries.
interrogation:
What impact do you expect this to have on reducing our deficit and our level of imports?
cleft constructions:
It is a very pressing issue that Mr Sacredeus is addressing.
coordinated clauses:
This motion implies that somehow the current income tax laws on alimony and maintenance payments are unfair, contribute to the problem and therefore should be amended.
Evaluation experiment 1: n-best evaluation
Corpus: Hansard (Canadian Parliament debates)
Language: French
Size: ∼1.2 M words
Methods: Baseline, Syntax-based
Results: Significance list
Evaluation: 500 types, top level; total: 1000 types; 3 evaluators/method
Annotation
(-gram) erroneous
(+gram)
    (-lex) regular
    (+lex) interesting
Fleiss κ = 0.50 (Baseline)
Fleiss κ = 0.39 (Syntax-based)
(moderate/fair agreement)
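For readers unfamiliar with the agreement statistic reported here, the standard Fleiss κ formula can be sketched as below. The function name and the toy rating tables are illustrative assumptions; they are not the annotation data of the experiment.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    ratings[i][j] is the number of raters who put item i in category j;
    every row must sum to the same number of raters.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)

    return (p_bar - p_e) / (1 - p_e)


# Toy example: 3 raters, categories (erroneous, regular, interesting).
toy = [
    [3, 0, 0],
    [0, 2, 1],
    [0, 1, 2],
    [0, 0, 3],
]
print(round(fleiss_kappa(toy), 3))
```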
Annotation examples
(-gram) erroneous petite entreprise (*V-O)
(+gram)
    (-lex) regular aide à rénovation
    (+lex) interesting aborder sujet
Confusion examples
attention particulière, bref délai, grave problème, lésion corporelle, offrir service, avoir droit, avoir droit, avoir honneur, créer emploi (14.0% of the cases)
an prochain, dernier année, écouter discours, fin de semaine, monde entier, offre finale, relation de travail (19.3% of the cases)
Evaluation results
Outer ring – baseline; inner ring – syntax-based method
Statistical significance: +gram t(982) = 10.78, p < 0.001; +lex t(982) = 2.90, p < 0.01
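The slide reports only the t statistics and does not say which variant of the test was used. As a hedged illustration, the sketch below computes a pooled-variance two-sample t statistic over per-item 0/1 judgments (1 = the pair received the label, 0 = it did not). The function name and the toy data are assumptions, and the p-value lookup is left out because it needs a t-distribution table or a statistics library.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance of the two samples.
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df


# Toy judgments for two methods (not the experiment's data).
method_a = [1, 1, 1, 0]
method_b = [1, 0, 0, 0]
t, df = pooled_t(method_a, method_b)
print(f"t({df}) = {t:.3f}")
```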
Evaluation results
Grammatical precision by sets of 50 pairs
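Precision "by sets of 50 pairs" can be read as cutting the ranked output into consecutive blocks of 50 and computing the share of grammatical pairs in each block. The sketch below assumes that reading; the function name and the toy judgments are illustrative only.

```python
from typing import List, Sequence


def precision_by_sets(judgments: Sequence[bool], set_size: int = 50) -> List[float]:
    """Precision of each consecutive block of `set_size` ranked pairs.

    judgments[i] is True if the pair at rank i was judged grammatical.
    """
    blocks = (
        judgments[i : i + set_size] for i in range(0, len(judgments), set_size)
    )
    return [sum(block) / len(block) for block in blocks]


# Toy ranked output: first block fully correct, second block half correct.
toy_judgments = [True] * 50 + [True] * 25 + [False] * 25
print(precision_by_sets(toy_judgments))  # [1.0, 0.5]
```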
Evaluation experiment 2: stratified evaluation
Corpus: Europarl (Koehn, 2005)
Languages: French, English, Italian, Spanish
Size: on average, ∼3.7 M words/language
Methods: Baseline, Syntax-based
Results: Significance list
Evaluation: 50 types, 5 levels (0-10%); total: 2000 types; 2 evaluators/method
Annotation
(-gram) erroneous
(+gram)
    (-lex) regular
    (+lex)
        named entity
        compound
        idiom
        collocation
Cohen’s κ = 0.61 (significant agreement)
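With two evaluators per method, agreement is measured with Cohen's κ. A minimal sketch of the standard formula follows; the function name and the toy labels are assumptions and do not reproduce the experiment's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Toy labels from two annotators (illustrative only).
annotator_1 = ["collocation", "collocation", "regular", "regular"]
annotator_2 = ["collocation", "regular", "regular", "regular"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```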
Annotation examples
(-gram) erroneous human development
(+gram)
    (-lex) regular next item
    (+lex)
        named entity European Union
        compound point of order
        idiom same umbrella
        collocation table amendment
Confusion matrix
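                regular   named entity   collocation   compound   idiom
regular             422              6           222         51       7
named entity                        26             2         11       0
collocation                                      315         63      11
compound                                                     64       5
idiom                                                                11

Examples of disagreement:
regular / named entity: European research
regular / collocation: economic stability
regular / compound: sea fleet
regular / idiom: short route
named entity / collocation: emisfero sud
named entity / compound: cour de compte
collocation / compound: wheel vehicle
collocation / idiom: open door
compound / idiom: titre (de) exemple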
Evaluation results
Outer ring – baseline; inner ring – syntax-based method
Statistical significance:
+gram t(1436) = 26.7, p < .001
+lex t(1436) = 11, p < .001
collocation t(1436) = 9.2, p < .001
Conclusion
Parsing technologies, traditionally seen as inappropriate for large-scale processing of corpora, are today the main ingredient for accurate collocation extraction.
The strong syntactic filter applied on the source text reduces the amount of data to process in the subsequent step to almost one quarter.
Parsing is the solution to the combinatorial explosion problem in the task of identifying longer collocations in text (e.g., be a major turning point, to stand in stark contrast).
Sabine Bartsch. 2004. Structural and Functional Properties of Collocations in English. A Corpus Study of Lexical and Pragmatic Constraints on Lexical Cooccurrence. Gunter Narr Verlag, Tübingen.
Elisabeth Breidt. 1993. Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, pages 74–83, Columbus, USA.
Simon Charest, Eric Brunelle, Jean Fontaine, and Bertrand Pelletier. 2007. Élaboration automatique d’un dictionnaire de cooccurrences grand public. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), pages 283–292, Toulouse, France, June.
Nick Ellis. 2008. Phraseology: The periphery and the heart of language. In Fanny Meunier and Sylviane Granger, editors, Phraseology in Foreign Language Learning and Teaching, pages 1–13. John Benjamins, Amsterdam/Philadelphia.
Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
Charles Fillmore, Paul Kay, and Catherine O’Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language, 64(3):501–538.
Jean-Philippe Goldman, Luka Nerima, and Eric Wehrli. 2001. Collocation extraction using a syntactic parser. In Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation, pages 61–66, Toulouse, France.
Dirk Heylen, Kerry G. Maxwell, and Marc Verhagen. 1994. Lexical functions and machine translation. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 1994), pages 1240–1244, Kyoto, Japan.
Seonho Kim, Zooil Yang, Mansuk Song, and Jung-Ho Ahn. 1999. Retrieving collocations from Korean text. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 71–81, Maryland, USA.
Göran Kjellmer. 1987. Aspects of English collocations. In Willem Meijs, editor, Corpus Linguistics and Beyond, pages 133–140. Rodopi, Amsterdam.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of The Tenth Machine Translation Summit (MT Summit X), pages 79–86, Phuket, Thailand, September.
Diana Lea and Moira Runcie, editors. 2002. Oxford Collocations Dictionary for Students of English. Oxford University Press, Oxford.
Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.
Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 317–324, Morristown, NJ, USA.
Yajuan Lü and Ming Zhou. 2004. Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), pages 167–174, Barcelona, Spain.
Igor Mel’čuk. 1998. Collocations and lexical functions. In Anthony P. Cowie, editor, Phraseology. Theory, Analysis, and Applications, pages 23–53. Clarendon Press, Oxford.
Athina Michou and Violeta Seretan. 2009. A tool for multi-word expression extraction in Modern Greek using syntactic parsing. In Proceedings of the Demonstrations Session at EACL 2009, pages 45–48, Athens, Greece, April. Association for Computational Linguistics.
Brigitte Orliac and Mike Dillinger. 2003. Collocation extraction for machine translation. In Proceedings of Machine Translation Summit IX, pages 292–298, New Orleans, Louisiana, USA.
Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.
Violeta Seretan and Eric Wehrli. 2007. Collocation translation based on sentence alignment and parsing. In Proceedings of TALN 2007, Toulouse, France.
Violeta Seretan and Eric Wehrli. 2009. Multilingual collocation extraction with a syntactic parser. Language Resources and Evaluation, 43(1):71–85.
Violeta Seretan and Eric Wehrli. 2010. Extending a multilingual symbolic parser to Romanian. In Dan Tufis and Corina Forascu, editors, Multilinguality and Interoperability in Language Processing with Emphasis on Romanian. Romanian Academy Publishing House.
Violeta Seretan and Eric Wehrli. 2011. FipsCoView: On-line visualisation of collocations extracted from multilingual parallel corpora. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 125–127, Portland, Oregon, USA, June. Association for Computational Linguistics.
Violeta Seretan. 2011. A collocation-driven approach to text summarization. In Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2011), pages 9–14, Montpellier, France.
John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
María Begoña Villada Moirón. 2005. Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.
Eric Wehrli, Luka Nerima, and Yves Scherrer. 2009. Deep linguistic multilingual translation and bilingual dictionaries. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 90–94, Athens, Greece.
Eric Wehrli, Violeta Seretan, and Luka Nerima. 2010. Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), pages 27–35, Beijing, China.
Eric Wehrli. 2003. Translation of words in context. In Proceedings of Machine Translation Summit IX, pages 502–504, New Orleans, Louisiana, USA.
Eric Wehrli. 2006. TwicPen: Hand-held scanner and translation software for non-native readers. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pages 61–64, Sydney, Australia.
Eric Wehrli. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.
Hua Wu and Ming Zhou. 2003. Synonymous collocation extraction using translation information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), pages 120–127, Sapporo, Japan.
David Yarowsky. 1993. One sense per collocation. In Proceedings of ARPA Human Language Technology Workshop, pages 266–271, Princeton.