Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas
Collocations
• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
• Technical and non-technical
Examples of collocations
• The Dow Jones average of industrials
• The Dow average
• The Dow industrials
• *The Jones industrials
• The Dow Jones industrial
• *The industrial Dow
• *The Dow industrial
Collocation properties
• Arbitrary (dialect dependent)
  – ride a bike, set the table
• Domain dependent
  – dry suit, wet suit
• Recurrent
• Cohesive
  – Part of a collocation primes for the rest
Applications
• Lexicography
• Grammatical restrictions (compare with/to but associate with)
• Generation
• Translation
Types of collocations
• Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)
• Rigid word groups
  – over the counter market
• Phrases with open slots
  – fluency in a domain
Issues in finding collocations
• Possibly more than two words
  – Need measure that extends beyond the binary case
• Possibly intervening words
• Possibly morphological and syntactic variation
• Semantic constraints (cf. doctors-dentists and doctors-hospitals)
Xtract stage one
• For a given word, find all collocates at positions -5 to +5
• Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
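The three criteria above can be sketched roughly as follows. This is a minimal illustration, not Xtract itself: the function names, the input format (a list of tokenized sentences), and the threshold defaults `k0`, `u0`, and `k1` are assumptions, and the exact statistics in Smadja (1993) may differ in detail.

```python
from collections import defaultdict
from statistics import mean, pstdev

def position_histograms(sentences, target, window=5):
    """Count each collocate of `target` at relative positions -window..+window."""
    hist = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for i, w in enumerate(sent):
            if w != target:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sent):
                    hist[sent[j]][off] += 1
    return hist

def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return {collocate: [peak offsets]} for collocates passing all three tests.

    k0, u0 and k1 are assumed thresholds chosen for illustration.
    """
    hist = position_histograms(sentences, target, window)
    offsets = [o for o in range(-window, window + 1) if o != 0]
    freqs = {w: sum(h.values()) for w, h in hist.items()}
    if not freqs:
        return {}
    f_bar, sigma = mean(freqs.values()), pstdev(freqs.values())
    result = {}
    for w, h in hist.items():
        counts = [h.get(o, 0) for o in offsets]
        # Criterion 1: strength, the frequency normalized as a z-score.
        strength = (freqs[w] - f_bar) / sigma if sigma > 0 else 0.0
        # Criterion 2: spread, the variance of the position histogram
        # (a flat histogram has a spread close to zero).
        p_bar = mean(counts)
        spread = sum((c - p_bar) ** 2 for c in counts) / len(counts)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep positions whose count exceeds the histogram
        # mean by at least k1 standard deviations.
        cutoff = p_bar + k1 * spread ** 0.5
        peaks = [o for o, c in zip(offsets, counts) if c >= cutoff]
        if peaks:
            result[w] = peaks
    return result
```

Calling `stage_one(sentences, "market")` on a corpus where "stock" almost always appears directly before "market" would be expected to return "stock" with the single peak offset -1.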
Xtract stage two
• Start from word pairs
• Look at each position in between, to the left, and to the right
• Keep words that appear very often
• If that fails, keep parts of speech that satisfy this criterion
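One way to picture the steps above is the sketch below. It assumes the occurrences of a word pair have already been collected and aligned into equal-length lists of (word, part-of-speech) tuples; the function name `stage_two`, the `threshold` default of 0.75, and the `<TAG>` and `*` output notation are illustrative assumptions rather than details taken from the slides.

```python
from collections import Counter

def stage_two(contexts, threshold=0.75):
    """Build one slot per relative position around a word pair.

    contexts: equal-length lists of (word, pos_tag) tuples, one list per
    occurrence of the pair, aligned so that index i is always the same
    relative position.
    """
    if not contexts:
        return []
    n = len(contexts)
    template = []
    for slot in zip(*contexts):
        words = Counter(w for w, _ in slot)
        tags = Counter(t for _, t in slot)
        word, word_count = words.most_common(1)[0]
        tag, tag_count = tags.most_common(1)[0]
        if word_count / n >= threshold:
            template.append(word)        # a single word dominates this position
        elif tag_count / n >= threshold:
            template.append(f"<{tag}>")  # fall back to the part of speech
        else:
            template.append("*")         # nothing dominates: leave the slot open
    return template
```

With four aligned contexts of the form "the dow jones average rose/fell/of/gained", the first four slots keep their words and the last slot falls back to the verb tag, since three of the four fillers share it.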
Xtract stage three
• Applied to pairs of words
• Requires (partial) parsing
• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
Evaluation
• Ask lexicographer to evaluate output
• 40% precision after stages one and two
• 80% precision after stage three
• 94% conditional recall
Terminology
• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
• Terms refer to concepts
• Terms key for populating a domain ontology
• Terms are typically nominal compounds of certain structure, e.g., NN, N of N
Defining terms
• Unique reference
• Unique translation
• Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination
Algorithm
• Apply syntactic constraints to match pairs of words in a candidate term
• Filter by application of an association measure
• Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
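The three measures can be computed from a 2×2 contingency table of pair counts. The sketch below uses the standard textbook formulas; the function name `association_scores` and the cell naming (a, b, c, d) are assumptions for illustration, and the paper's exact formulations may be written differently.

```python
import math

def association_scores(a, b, c, d):
    """Score a candidate pair (w1, w2) from its contingency table.

    a: pairs with w1 and w2     b: pairs with w1 but not w2
    c: pairs with w2 but not w1 d: pairs with neither word
    """
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    # Pointwise mutual information: log of observed over expected joint count.
    pmi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")
    # Phi-squared: the chi-square statistic divided by the sample size n.
    denom = r1 * r2 * c1 * c2
    phi2 = (a * d - b * c) ** 2 / denom if denom else 0.0
    # Log-likelihood ratio: 2 * sum over cells of O * ln(O / E).
    llr = 0.0
    for obs, row, col in ((a, r1, c1), (b, r1, c2), (c, r2, c1), (d, r2, c2)):
        if obs > 0:
            llr += obs * math.log(obs * n / (row * col))
    return {"pmi": pmi, "phi2": phi2, "llr": 2.0 * llr}
```

When the two words are independent (for example, all four cells equal), all three scores are zero; candidates are then ranked by how far they rise above that baseline.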
Observations
• Compare with reference list
• Frequency a strong predictor
• Log-likelihood ratio works best
• Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)
Justeson and Katz
• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.
Analysis
• Examined association measures
• Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?
Observations
• Frequency works well
• But a stronger predictor is repetition: P(k>1) compared to P(k≥1), where k is the number of times a candidate occurs in the same document
• Use syntactic patterns to propose terms, then check if they reappear in the same document
• Require this across multiple documents
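The propose-then-check-repetition idea can be sketched as below. The input is assumed to be one document already tagged with one-letter tags (A = adjective, N = noun, P = preposition, O = anything else); the regular expression is a noun-phrase pattern of the kind the slides describe, and `TERM_PATTERN`, `repeated_terms`, `min_count`, and `max_len` are hypothetical names and defaults. The cross-document requirement from the last bullet is not covered here.

```python
import re
from collections import Counter

# Assumed noun-phrase pattern over one-letter tags: a run of adjectives and
# nouns, optionally containing one noun-preposition link, ending in a noun.
TERM_PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def repeated_terms(tagged_doc, min_count=2, max_len=4):
    """Return multi-word candidates that occur at least min_count times.

    tagged_doc: list of (word, tag) pairs for a single document, where
    every tag is exactly one character.
    """
    tags = "".join(t for _, t in tagged_doc)
    words = [w.lower() for w, _ in tagged_doc]
    counts = Counter()
    for i in range(len(words)):
        # Try every span of 2..max_len tokens starting at position i.
        for j in range(i + 2, min(i + max_len, len(words)) + 1):
            if TERM_PATTERN.fullmatch(tags[i:j]):
                counts[" ".join(words[i:j])] += 1
    return {term: c for term, c in counts.items() if c >= min_count}
```

In a document that mentions "neural network" twice and "hidden layer" once, only "neural network" survives the repetition filter.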
Term Expansion
• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.
• Need to expand a given list of terms, especially for scientific domains
Term variation
• Syntactic (same words, different structure)
• Morphosyntactic (derivational forms of words)
• Semantic (synonyms are used)
• In IR, normalization through stemming and removal of stop words
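The IR-style normalization in the last bullet can be illustrated with a toy example. The stop-word list and the `toy_stem` function are deliberately crude assumptions standing in for a real stop list and stemmer; the point is only to show how two syntactic variants collapse to the same key.

```python
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "for", "and"})

def toy_stem(word):
    """Strip a plural -s; a crude stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(term):
    """Map a term to an order-insensitive set of stemmed content words."""
    return frozenset(
        toy_stem(w) for w in term.lower().split() if w not in STOP_WORDS
    )
```

Here `normalize("measurement of blood pressure")` and `normalize("Blood pressure measurements")` yield the same set. Because word order and structure are discarded, this kind of normalization can also merge phrases that are not true variants.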
Approach
• Process corpus matching new candidate terms to old ones via unification
• Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words
Results
• Manual inspection of several thousand proposed terms
• Precision of 89%
• Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)