Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...

Empirical Approaches to

Multilingual Lexical Acquisition

Lecturer: Timothy Baldwin

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Lecture 4

Unsupervised Approaches to Lexical

Acquisition: Word Segmentation and MWE

Extraction

Word Segmentation

Basic Task Description

• Take an unsegmented string and dynamically insert the word

boundaries:

spotthebreaksinthisstring : spot the breaks in this string

• Applications in:

⋆ the segmentation of non-segmenting languages (e.g. Japanese,

Chinese, Thai)

⋆ segmenting up a speech signal into words

⋆ modelling child word acquisition

Difficulties in Word Detection

• Interdepedence between word and boundary detection

• Segmentation of frequently-occurring strings

ad hoc, 東京都 tou·kyou·to “Tokyo prefecture”

• Inherent boundary ambiguities

大学長 dai·gaku·chou “(university) president”

• Data sparseness

Unsupervised Discovery ofMorphemes

(Creutz and Lagus 2002)

Types of morphological linkage

• Inflectional (table : tables, bus : buses/busses, phenomenon

: phenomena)

• Derivational (central : centralise)

• Compositional (house + boat : houseboat)

Semantic linkage

• employ : employer , employee

• Some of these are lost in the mists of time, e.g. ploy :

employ/deploy?

Basic tasks in morphology

• Discovering the morphemes

• Determining the interaction between the morphemes in a given

unfathomable

un·fathom·ablel

unNEG·fathomV ·ableV →N

Unsupervised Discovery of Morphemes

• Segment each word token into morphemes without using prior

knowledge of morphological processes or morphemes:

misinformed : mis·inform·edreconstruction : re·construct·ionkahvinjuojallekin : kahvi·n·juo·ja·lle·kin [FI]

• Assume agglutinative or simple inflectional morphology

• Based on orthographic regularities

Creutz and Lagus (2002) 9

Applications

• Language modelling (speech recognition)

• Language-independent lexicographic tool for determining

stems/word boundaries

• N.B. assumes a corpus of raw text up-front

Minimum Description Length

• Minimum Description Length (MDL, a.k.a. Minimum Message

Length or MML) is a method for simultaneously evaluating the

compactness of a particular encoding and the compactness of the

data using that encoding

Description length = model DL + data DL

‘Simplicity vs. Accuracy’

• Too much ‘accuracy’ : overfitting

• How can we compress / generalise the data model?

MDL in word segmentation

• In morpheme discovery terms, minimise C:

C = Cost(Source text) + Cost(Codebook)

tokens

− log p(mi) +∑

k × l(mj)

Example: Word splitting

howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood

mi − log p(mi) freq(mi) l(mi)

howmuchwoodcoulda... 0.00 1 57

— 0.00 57

C = 285.00 (k = 5)

a 2.70 2 1

chuck 2.70 2 5

could 2.70 2 5

how 3.70 1 3

if 3.70 1 2

much 3.70 1 4

wood 2.70 2 4

woodchuck 2.70 2 9

— 38.11 33 C = 203.11 (k = 5)

a 2.91 2 1

chuck 1.91 4 5

could 2.91 2 5

how 3.91 1 3

if 3.91 1 2

much 3.91 1 4

wood 1.91 4 4

— 38.60 24 C = 158.60 (k = 5)

a 4.83 2 1

c 2.37 11 1

d 3.25 6 1

f 5.83 1 1

h 3.25 6 1

i 5.83 1 1

k 3.83 4 1

l 4.83 2 1

m 5.83 1 1

o 2.37 11 1

u 3.03 7 1

w 3.51 5 1

— 182.09 12 C = 242.09 (k = 5)

Greedy search algorithm

• Read in each word at a time, and recursively select the binary

segmentation of a word which minimises C

• Continue until no C-optimal segmentations remain

• Dynamically segment even if the word has been observed previously

in the data (removing any previous segmentations from the model)

Dreaming

• At regular intervals, randomly iterate over words already processed

to remove any global sub-optimal segmentations

• Dream for a fixed time (over a fixed number of words) OR until no

significant change in C is observed

• Demo:

http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi

Mostly-UnsupervisedStatistical Segmentation ofJapanese Kanji Sequences

(Kubota Ando and Lee 2003)

Outline

• Unsupervised method for segmenting Japanese kanji sequences

社長兼業務部長l

社長 |兼 |業務 |部長 sha·chou|ken|gyou·mu|bu·chou “president

and general business manager”

• Based on character n-grams (sequences of n characters)

Kubota Ando and Lee (2003) 21

Why the Interest in Kanji?

• Japanese consists of three basic orthographies:

⋆ hiragana: 46 elements (functional words, onomatapaea,

verb/adjective suffixes)

⋆ katakana: 46 elements (words of foreign origin, scientific names)

⋆ kanji: 1,945 “official” elements, more in practice (content words)

• Kanji the most productive and sparse of the three

Basic Intuition

• N -gram “windows” over a kanji string can be used as predictors of

word boundaries in that:

1. frequent n-grams suggest the lack of a boundary

2. infrequent n-grams suggest the presence of a boundary

spotthebreaks with 4-gram window:s p o t t h e b r e a k s

s p o t

p o t t

o t t h...

Example: spotthebreaks

Word N-gram Freq

spot 119

pott 24

otth 120

tthe 6075

theb 2118

hebr 218

ebre 49

brea 328

reak 246

eaks 57

Method

• For each potential boundary point:

⋆ For each n-gram order (e.g. 4):

∗ For each of the n-grams to the left and right of the boundary:

◦ What is the proportion of n-grams straddling the boundary

that are less frequent than the n-gram adjoining it?

• vn = 12(n−1)

d∈{L,R}

∑n−1j=1 I>(#(sn

d), #(tnj ))

Returning to our Example

• For the boundary spot|theb:

v4 = 12(4−1)(1 + 2) = 0.5

• For the boundary pott|hebr :

v4 = 12(4−1)(0 + 1) ≈ 0.17

Combining n-gram Orders

• So as to balance up the effects of data sparseness, the method

advocates “voting” across different n-gram orders:

vN(k) = 1|N |

n∈N vn(k)

• Place boundaries at locations l where EITHER:

1. vN(l) > vN(l − 1) and vN(l) > vN(l + 1) OR

2. vN(l) > t

t = threshold constant

Evaluation

• Extract n-gram statistics from 150MB of newswire data (N.B.

unannotated)

• Hand-annotated 5 held-out datasets containing 500 kanji

sequences each

• For each held-out dataset, use 450 sequences for parameter training

(N, t) and the remaining 50 sequences for evaluation (hence

mostly-unsupervised)

Words vs. Morphemes

• Two separate evaluations were carried out, based on:

⋆ words: a stem and its affixes are considered as a single unit

[電話器] [deN·wa·ki] “telephone (device)”

⋆ morphemes: stems and affixes are each individual units

[電話][器] [deN·wa][ki] “[telephone][device]”

Parameter Optimisation

• Run the algorithm over each set of 450 sequences using the different

values of N and t

• Optimise according to one of:

⋆ word precision = correct proposed bracketsproposed brackets

⋆ word recall = correct proposed bracketsactual brackets

⋆ word F(-measure/score) = precision×recallprecision+recall

• Evaluate according to the optimal parameter setting

Results (Briefly)

• Word performance better than established Japanese morphological

analyzers (JUMAN, Chasen)

• Word performance considerably better than morpheme performance

• Single-character morphemes greatest source of errors

Summary

• Method proposed for segmenting up Japanese kanji sequences

based on simple n-gram statistics and voting

• Basic intuition that n-grams straddling word boundaries will be

less-frequent than those adjacent to word boundaries

• Impressive results achieved for such a conceptually simple algorithm

Multiword ExpressionDiscovery

Basic Task Description

• Identify the multiword expression (MWE) types in tagged (or raw)

text from observation of the token distribution

• Recall: a MWE is made up of multiple words and is lexically,

syntactically, semantically, pragmatically and/or statistically

idiosyncratic

• Considerable body of literature on collocation extraction

Lexico-syntactic Idiomaticityby and large

conjP Adj

by and large

ad hoc

wine and dine

V[trans]

conjV[intrans] V[intrans]

wine and dine

Semantic Idiomaticity

kick the bucket spill the beans

die’ reveal’(secret’)

kindle excitement

kindle’(excitement’)

Pragmatic Idiomaticity

• The Wheel of Fortune factor — how to represent the jumble of

phrases stored in the mental lexicon?

• The Monty Python factor — mish-mash of language fragments

which evoke particular events/individuals/memories

Statistical Idiomaticityunblemished spotless flawless immaculate impeccable

eye – – – – +gentleman – – ? – +home ? + – + ?lawn – – ? + –memory – – + – ?quality – – – – +record + + + + +reputation + – – + +taste – – – – +

Adapted from Cruse (1986)

Complications in MWE Discovery

• Working out the extent of the collocation (phrase boundary

detection)

trip the light X

trip the light fantastic V

trip the light fantastic at X

• Fine line between collocations and simple default lexical

combinations

buy a car/purchase power

Collocations vs. MWEs

• A collocation (≈ institutionalised phrase) is an arbitrary and

recurrent word combination

• Collocations tend to stand in opposition to anti-collocations =

lexically-marked synonym-substitutes

• Collocations can be semantically-marked (e.g. dark horse) but tend

to be compositional (e.g. strong coffee)

• Predicative collocations vs. rigid NPs vs. phrasal templates

Retrieving Collocations fromText: Xtract

(Smadja 1993)

Outline

• Automatic method for extracting collocations from raw text based

on n-gram statistics

• Basic intuition: collocations are more rigid syntactically and more

frequent than other word combinations

• Method: attempt to capture this intuition using the basic statistics

of word combinations

Smadja (1993) 42

Stage 1: Extract Significant Bigrams

• w and wi co-occur (wi is a collocate of w) if they are found in a

single sentence separated by fewer than 5 words

• a bigram (w,wi) is significant iff:

⋆ w and wi co-occur more frequently than chance

⋆ w and wi appear in a relatively rigid configuration

• Divide up the set of collocates according to POS

Smadja (1993) 43

Example Corpus

... multiword(−1) expressions ...

... dialect(−1) expressions ...

... dialect(−2) and expressions ...

... expressions of interest(2) ...

... collocation extraction ...

... expressions dialect(1) ...

Smadja (1993) 44

Noun Co-occurrence Table

w = expressions, D = {−2,−1, 1, 2}

multiword dialect collocation interestcollocate

w1 w2 w3 w4

p−2i 0 1 0 0

p−1i 5 1 0 0

p1i 0 1 0 0

p2i 0 0 0 1

freq i 5 3 0 1

wi ∈ C yes yes no yes

Smadja (1993) 45

Statistics of Expectation

• f =P

wi∈C freqi

|C| = 5+3+13 = 3 (frequency average)

• σ =

wi∈C(freqi−f)2

(5−3)2+(3−3)2+(1−3)2

3 ≈ 1.63

• ki = freqi−f

σ(strength)

• pi =P

j∈D pji

|D| (pair count average)

• Ui =P

j∈D(pji−pi)

|D| (pair count variance)

Smadja (1993) 46

Collocation Filters

• Strength: ki > kα(= 1)

: select frequent collocates

• Spread: Ui > U0(= 1)

: select spiky distributions

• Peakiness: pji ≥ pi + (kβ ×

√Ui) kβ = 0.51

: identify interesting spikes

1Value of 10 suggested for kβ in Smadja (1993)

Smadja (1993) 47

Back to our Example: Strength

• w1 (multiword)

⋆ k1 = 5−31.63 = 1.22 > 1 V

• w2 (dialect)

⋆ k2 = 3−31.63 < 1 X

• w4 (interest)

⋆ k3 = 1−31.63 < 1 X

Smadja (1993) 48

Spread and Peakiness (w1 = multiword)

• U1 = (0−1.25)2+(5−1.25)2+(0−1.25)2+(0−1.25)2

4 ≈ 20.31 > 1 V

⋆ p1 = 0+5+0+04 = 1.25

⋆ p1 + (kβ ×√

U1) = 1.25 + (0.5 ×√

20.31) ≈ 3.50

j pi + (kβ ×√

p−21 0 < 3.50 X

p−11 5 ≥ 3.50 V

p+11 0 < 3.50 X

p+21 0 < 3.50 X

Smadja (1993) 49

Stage 2: Bigrams to n-grams

• Independent filter to detect larger n-grams

• Method: for each fixed-distance collocate (w,wji ), extract out

contiguous word sequences where max(p(word[i])) > T (= 0.75)

Smadja (1993) 50

Example

w = resistance, w−3i = path

Concordances from the BNC:

... trod the path(−3) of least resistance , ...

... finding the path(−3) of least resistance will ...

... along the path(−3) of least resistance .

... the safest path(−3) of least resistance through ...

... took the path(−3) of least resistance and ...

: the path of least resistance is a rigid noun phrase

Smadja (1993) 51

Stage 3: Add Syntax

• Use cass dependency parser to assign syntactic labels to bigrams

• Same basic thresholding filter as used in Stage 2

Smadja (1993) 52

Summary

• Two methods proposed for extracting collocations based on

distributional statistics

⋆ 1st method based on the intuition that collocations occur

frequently in relatively fixed configurations

⋆ 2nd method based on the predictability of contiguous lexical

contexts for given concordances

• Basic method illustrated for determining the syntax of extracted

collocations

Smadja (1993) 53

Reflections

• (At the time) groundbreaking research on collocation extraction

• Not effective at extracting out low-frequency words

• Difficulties in evaluating the results of collocation extraction

(applies to this day)

• Difficulties in extracting non-contiguous (predicative) collocations

such as verb particles

Smadja (1993) 54

ReferencesCreutz, Mathias, and Krista Lagus. 2002. Unsupervised discovery of morphemes. In

Proc. of the 6th Meeting of the ACL Special Interest Group in Computational Phonology ,

Philadelphia, USA.

Cruse, D. Alan. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.

Kubota Ando, Rie, and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of

Japanese kanji sequences. Natural Language Engineering 9.127–49.

Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics

19.143–77.

Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...

Documents

Transcript of Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...

Unsupervised Approaches to Sequence Tagging, …nschneid/ls2lit_slides.pdf · · 2010-04-16Unsupervised Approaches to Sequence Tagging, Morphology Induction, and Lexical Resource

Collocations in Multilingual Natural Language Generation: Lexical … · 2011. 12. 20. · Lexical Functional Grammar (LFG) (Bresnan, 2001), Head-drivenPhraseStructureGrammar(HPSG)(Pol-lard

Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture07.pdf · Empirical Approaches to Multilingual Lexical Acquisition Lecture 7 (18/7/2008) Basic Method

1 BioChain : Using Lexical Chaining Approaches for Biomedical Text Summarization Lawrence Reeve INFO780 - Final Report – Summer 2005.

Semantic Frames as an Interlingua for Multilingual Lexical Databases

1 / 25Sat. 31 Aug. 2002SEMANET Workshop Frameworks, Implementation & Open Problems for the Collaborative Building of a Multilingual Lexical Database Mathieu.

Chapter 4 · 2016. 8. 30. · Reasons to Separate Lexical and Syntax Analysis • Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies

COMBINING LEXICAL CHAIN AND DOMAIN DRIVEN APPROACHES TO ENHANCE LEXICAL ... lexical chain and domain... · Domain Driven approach is proposed to integrate with Lexical Chain approach

Statistics-based Approaches to Lexical Semantics - · PDF fileStatistics-based Approaches to Lexical ... Statistics-based Approaches to Lexical Semantics. 5 ... Harris, Zellig Sabbettai.

Multilingual Access in Digital Libraries - Dublin Core€¦ · 1 Approaches to Multilingual Access in Digital Libraries: The Case of TELDAP DC 2008 Berlin 2008.09.25 Hsueh-Hua Chen

Lexical Access in Monolingual, Bilingual and Multilingual ... · Lexical Access in Monolingual, Bilingual and Multilingual Children: A Comparison Study 49 bilinguals, then the question

Diversity-aware Multilingual Lexical Semantic Resources ... · • In contrast to traditional alphabetic dictionaries: • They are conceptual dictionaries • Divided into POS-categories

Dating Egyptian Literary Texts: Lexical Approaches

Chapter 4 - unimi.it · Reasons to Separate Lexical and Syntax Analysis •Simplicity- less complex approaches can be used for lexical analysis; separating them simplifies the parser

Roots & Patterns vs. Stems plus Grammar-Lexis Specifications: on what basis should a multilingual lexical database centred on Arabic be built? Joseph Dichy.

The ACQUILEX Lexical Database System: System Description ...users.sussex.ac.uk/~johnca/papers/acquilex-ldb-92.pdf · collaborative multilingual research project using both mono- and

Panlingual Lexical CollaborationTactic 1: Assemble valuable panlingual data. How? Tactics Borrow data from TransGraph. Expression (lexeme) equivalences from 357 dictionaries. 13 multilingual,

Hans C. Boas Lexical and phrasal approaches to argument ...sites.la.utexas.edu/hcb/files/2014/09/Boas-TheoreticalLinguistics... · Hans C. Boas Lexical and phrasal approaches to argument

Activating learning using multilingual CALL lexical ... · dictionary with entries presented in three languages: Japanese, English, and Chinese (mainland and Taiwanese). All ... compilation

Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture01.pdf · Empirical Approaches to Multilingual Lexical Acquisition Lecture 1 (16/7/2008) Words and Multiword