Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...

Post on 03-Sep-2019

8 views 0 download

Transcript of Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...

Empirical Approaches to

Multilingual Lexical Acquisition

Lecturer: Timothy Baldwin

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Lecture 4

Unsupervised Approaches to Lexical

Acquisition: Word Segmentation and MWE

Extraction

1

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Word Segmentation

2

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Basic Task Description

• Take an unsegmented string and dynamically insert the word

boundaries:

spotthebreaksinthisstring : spot the breaks in this string

• Applications in:

⋆ the segmentation of non-segmenting languages (e.g. Japanese,

Chinese, Thai)

⋆ segmenting up a speech signal into words

⋆ modelling child word acquisition

3

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Difficulties in Word Detection

• Interdepedence between word and boundary detection

• Segmentation of frequently-occurring strings

ad hoc, 東京都 tou·kyou·to “Tokyo prefecture”

• Inherent boundary ambiguities

大学長 dai·gaku·chou “(university) president”

• Data sparseness

4

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Unsupervised Discovery ofMorphemes

(Creutz and Lagus 2002)

5

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Types of morphological linkage

• Inflectional (table : tables, bus : buses/busses, phenomenon

: phenomena)

• Derivational (central : centralise)

• Compositional (house + boat : houseboat)

6

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Semantic linkage

• employ : employer , employee

• Some of these are lost in the mists of time, e.g. ploy :

employ/deploy?

7

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Basic tasks in morphology

• Discovering the morphemes

• Determining the interaction between the morphemes in a given

word

unfathomable

l

un·fathom·ablel

unNEG·fathomV ·ableV →N

8

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Unsupervised Discovery of Morphemes

• Segment each word token into morphemes without using prior

knowledge of morphological processes or morphemes:

misinformed : mis·inform·edreconstruction : re·construct·ionkahvinjuojallekin : kahvi·n·juo·ja·lle·kin [FI]

• Assume agglutinative or simple inflectional morphology

• Based on orthographic regularities

Creutz and Lagus (2002) 9

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Applications

• Language modelling (speech recognition)

• Language-independent lexicographic tool for determining

stems/word boundaries

• N.B. assumes a corpus of raw text up-front

Creutz and Lagus (2002) 10

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Minimum Description Length

• Minimum Description Length (MDL, a.k.a. Minimum Message

Length or MML) is a method for simultaneously evaluating the

compactness of a particular encoding and the compactness of the

data using that encoding

Description length = model DL + data DL

‘Simplicity vs. Accuracy’

• Too much ‘accuracy’ : overfitting

• How can we compress / generalise the data model?

Creutz and Lagus (2002) 11

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

MDL in word segmentation

• In morpheme discovery terms, minimise C:

C = Cost(Source text) + Cost(Codebook)

=∑

tokens

− log p(mi) +∑

types

k × l(mj)

Creutz and Lagus (2002) 12

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Example: Word splitting

howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood

Creutz and Lagus (2002) 13

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

mi − log p(mi) freq(mi) l(mi)

howmuchwoodcoulda... 0.00 1 57

— 0.00 57

C = 285.00 (k = 5)

Creutz and Lagus (2002) 14

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

mi − log p(mi) freq(mi) l(mi)

a 2.70 2 1

chuck 2.70 2 5

could 2.70 2 5

how 3.70 1 3

if 3.70 1 2

much 3.70 1 4

wood 2.70 2 4

woodchuck 2.70 2 9

— 38.11 33 C = 203.11 (k = 5)

Creutz and Lagus (2002) 15

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

mi − log p(mi) freq(mi) l(mi)

a 2.91 2 1

chuck 1.91 4 5

could 2.91 2 5

how 3.91 1 3

if 3.91 1 2

much 3.91 1 4

wood 1.91 4 4

— 38.60 24 C = 158.60 (k = 5)

Creutz and Lagus (2002) 16

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

mi − log p(mi) freq(mi) l(mi)

a 4.83 2 1

c 2.37 11 1

d 3.25 6 1

f 5.83 1 1

h 3.25 6 1

i 5.83 1 1

k 3.83 4 1

l 4.83 2 1

m 5.83 1 1

o 2.37 11 1

u 3.03 7 1

w 3.51 5 1

— 182.09 12 C = 242.09 (k = 5)

Creutz and Lagus (2002) 17

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Greedy search algorithm

• Read in each word at a time, and recursively select the binary

segmentation of a word which minimises C

• Continue until no C-optimal segmentations remain

• Dynamically segment even if the word has been observed previously

in the data (removing any previous segmentations from the model)

Creutz and Lagus (2002) 18

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Dreaming

• At regular intervals, randomly iterate over words already processed

to remove any global sub-optimal segmentations

• Dream for a fixed time (over a fixed number of words) OR until no

significant change in C is observed

• Demo:

http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi

19

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Mostly-UnsupervisedStatistical Segmentation ofJapanese Kanji Sequences

(Kubota Ando and Lee 2003)

20

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Outline

• Unsupervised method for segmenting Japanese kanji sequences

社長兼業務部長l

社長 |兼 |業務 |部長 sha·chou|ken|gyou·mu|bu·chou “president

and general business manager”

• Based on character n-grams (sequences of n characters)

Kubota Ando and Lee (2003) 21

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Why the Interest in Kanji?

• Japanese consists of three basic orthographies:

⋆ hiragana: 46 elements (functional words, onomatapaea,

verb/adjective suffixes)

⋆ katakana: 46 elements (words of foreign origin, scientific names)

⋆ kanji: 1,945 “official” elements, more in practice (content words)

• Kanji the most productive and sparse of the three

Kubota Ando and Lee (2003) 22

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Basic Intuition

• N -gram “windows” over a kanji string can be used as predictors of

word boundaries in that:

1. frequent n-grams suggest the lack of a boundary

2. infrequent n-grams suggest the presence of a boundary

spotthebreaks with 4-gram window:s p o t t h e b r e a k s

s p o t

p o t t

o t t h...

Kubota Ando and Lee (2003) 23

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Example: spotthebreaks

Word N-gram Freq

spot 119

pott 24

otth 120

tthe 6075

theb 2118

hebr 218

ebre 49

brea 328

reak 246

eaks 57

Kubota Ando and Lee (2003) 24

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Method

• For each potential boundary point:

⋆ For each n-gram order (e.g. 4):

∗ For each of the n-grams to the left and right of the boundary:

◦ What is the proportion of n-grams straddling the boundary

that are less frequent than the n-gram adjoining it?

• vn = 12(n−1)

d∈{L,R}

∑n−1j=1 I>(#(sn

d), #(tnj ))

Kubota Ando and Lee (2003) 25

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Returning to our Example

• For the boundary spot|theb:

v4 = 12(4−1)(1 + 2) = 0.5

• For the boundary pott|hebr :

v4 = 12(4−1)(0 + 1) ≈ 0.17

Kubota Ando and Lee (2003) 26

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Combining n-gram Orders

• So as to balance up the effects of data sparseness, the method

advocates “voting” across different n-gram orders:

vN(k) = 1|N |

n∈N vn(k)

• Place boundaries at locations l where EITHER:

1. vN(l) > vN(l − 1) and vN(l) > vN(l + 1) OR

2. vN(l) > t

t = threshold constant

Kubota Ando and Lee (2003) 27

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Evaluation

• Extract n-gram statistics from 150MB of newswire data (N.B.

unannotated)

• Hand-annotated 5 held-out datasets containing 500 kanji

sequences each

• For each held-out dataset, use 450 sequences for parameter training

(N, t) and the remaining 50 sequences for evaluation (hence

mostly-unsupervised)

Kubota Ando and Lee (2003) 28

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Words vs. Morphemes

• Two separate evaluations were carried out, based on:

⋆ words: a stem and its affixes are considered as a single unit

[電話器] [deN·wa·ki] “telephone (device)”

⋆ morphemes: stems and affixes are each individual units

[電話][器] [deN·wa][ki] “[telephone][device]”

Kubota Ando and Lee (2003) 29

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Parameter Optimisation

• Run the algorithm over each set of 450 sequences using the different

values of N and t

• Optimise according to one of:

⋆ word precision = correct proposed bracketsproposed brackets

⋆ word recall = correct proposed bracketsactual brackets

⋆ word F(-measure/score) = precision×recallprecision+recall

• Evaluate according to the optimal parameter setting

Kubota Ando and Lee (2003) 30

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Results (Briefly)

• Word performance better than established Japanese morphological

analyzers (JUMAN, Chasen)

• Word performance considerably better than morpheme performance

• Single-character morphemes greatest source of errors

Kubota Ando and Lee (2003) 31

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Summary

• Method proposed for segmenting up Japanese kanji sequences

based on simple n-gram statistics and voting

• Basic intuition that n-grams straddling word boundaries will be

less-frequent than those adjacent to word boundaries

• Impressive results achieved for such a conceptually simple algorithm

32

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Multiword ExpressionDiscovery

33

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Basic Task Description

• Identify the multiword expression (MWE) types in tagged (or raw)

text from observation of the token distribution

• Recall: a MWE is made up of multiple words and is lexically,

syntactically, semantically, pragmatically and/or statistically

idiosyncratic

• Considerable body of literature on collocation extraction

34

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Lexico-syntactic Idiomaticityby and large

???

conjP Adj

by and large

ad hoc

Adj

? ?

ad hoc

wine and dine

V[trans]

conjV[intrans] V[intrans]

wine and dine

35

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Semantic Idiomaticity

kick the bucket spill the beans

die’ reveal’(secret’)

kindle excitement

kindle’(excitement’)

36

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Pragmatic Idiomaticity

• The Wheel of Fortune factor — how to represent the jumble of

phrases stored in the mental lexicon?

• The Monty Python factor — mish-mash of language fragments

which evoke particular events/individuals/memories

37

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Statistical Idiomaticityunblemished spotless flawless immaculate impeccable

eye – – – – +gentleman – – ? – +home ? + – + ?lawn – – ? + –memory – – + – ?quality – – – – +record + + + + +reputation + – – + +taste – – – – +

Adapted from Cruse (1986)

38

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Complications in MWE Discovery

• Working out the extent of the collocation (phrase boundary

detection)

trip the light X

trip the light fantastic V

trip the light fantastic at X

• Fine line between collocations and simple default lexical

combinations

buy a car/purchase power

39

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Collocations vs. MWEs

• A collocation (≈ institutionalised phrase) is an arbitrary and

recurrent word combination

• Collocations tend to stand in opposition to anti-collocations =

lexically-marked synonym-substitutes

• Collocations can be semantically-marked (e.g. dark horse) but tend

to be compositional (e.g. strong coffee)

• Predicative collocations vs. rigid NPs vs. phrasal templates

40

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Retrieving Collocations fromText: Xtract

(Smadja 1993)

41

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Outline

• Automatic method for extracting collocations from raw text based

on n-gram statistics

• Basic intuition: collocations are more rigid syntactically and more

frequent than other word combinations

• Method: attempt to capture this intuition using the basic statistics

of word combinations

Smadja (1993) 42

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Stage 1: Extract Significant Bigrams

• w and wi co-occur (wi is a collocate of w) if they are found in a

single sentence separated by fewer than 5 words

• a bigram (w,wi) is significant iff:

⋆ w and wi co-occur more frequently than chance

⋆ w and wi appear in a relatively rigid configuration

• Divide up the set of collocates according to POS

Smadja (1993) 43

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Example Corpus

... multiword(−1) expressions ...

... multiword(−1) expressions ...

... dialect(−1) expressions ...

... dialect(−2) and expressions ...

... expressions of interest(2) ...

... multiword(−1) expressions ...

... collocation extraction ...

... expressions dialect(1) ...

... multiword(−1) expressions ...

... multiword(−1) expressions ...

Smadja (1993) 44

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Noun Co-occurrence Table

w = expressions, D = {−2,−1, 1, 2}

multiword dialect collocation interestcollocate

w1 w2 w3 w4

p−2i 0 1 0 0

p−1i 5 1 0 0

p1i 0 1 0 0

p2i 0 0 0 1

freq i 5 3 0 1

wi ∈ C yes yes no yes

Smadja (1993) 45

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Statistics of Expectation

• f =P

wi∈C freqi

|C| = 5+3+13 = 3 (frequency average)

• σ =

P

wi∈C(freqi−f)2

|C| =

(5−3)2+(3−3)2+(1−3)2

3 ≈ 1.63

• ki = freqi−f

σ(strength)

• pi =P

j∈D pji

|D| (pair count average)

• Ui =P

j∈D(pji−pi)

2

|D| (pair count variance)

Smadja (1993) 46

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Collocation Filters

• Strength: ki > kα(= 1)

: select frequent collocates

• Spread: Ui > U0(= 1)

: select spiky distributions

• Peakiness: pji ≥ pi + (kβ ×

√Ui) kβ = 0.51

: identify interesting spikes

1Value of 10 suggested for kβ in Smadja (1993)

Smadja (1993) 47

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Back to our Example: Strength

• w1 (multiword)

⋆ k1 = 5−31.63 = 1.22 > 1 V

• w2 (dialect)

⋆ k2 = 3−31.63 < 1 X

• w4 (interest)

⋆ k3 = 1−31.63 < 1 X

Smadja (1993) 48

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Spread and Peakiness (w1 = multiword)

• U1 = (0−1.25)2+(5−1.25)2+(0−1.25)2+(0−1.25)2

4 ≈ 20.31 > 1 V

⋆ p1 = 0+5+0+04 = 1.25

⋆ p1 + (kβ ×√

U1) = 1.25 + (0.5 ×√

20.31) ≈ 3.50

j pi + (kβ ×√

Ui)

p−21 0 < 3.50 X

p−11 5 ≥ 3.50 V

p+11 0 < 3.50 X

p+21 0 < 3.50 X

Smadja (1993) 49

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Stage 2: Bigrams to n-grams

• Independent filter to detect larger n-grams

• Method: for each fixed-distance collocate (w,wji ), extract out

contiguous word sequences where max(p(word[i])) > T (= 0.75)

Smadja (1993) 50

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Example

w = resistance, w−3i = path

Concordances from the BNC:

... trod the path(−3) of least resistance , ...

... finding the path(−3) of least resistance will ...

... along the path(−3) of least resistance .

... the safest path(−3) of least resistance through ...

... took the path(−3) of least resistance and ...

: the path of least resistance is a rigid noun phrase

Smadja (1993) 51

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Stage 3: Add Syntax

• Use cass dependency parser to assign syntactic labels to bigrams

• Same basic thresholding filter as used in Stage 2

Smadja (1993) 52

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Summary

• Two methods proposed for extracting collocations based on

distributional statistics

⋆ 1st method based on the intuition that collocations occur

frequently in relatively fixed configurations

⋆ 2nd method based on the predictability of contiguous lexical

contexts for given concordances

• Basic method illustrated for determining the syntax of extracted

collocations

Smadja (1993) 53

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

Reflections

• (At the time) groundbreaking research on collocation extraction

• Not effective at extracting out low-frequency words

• Difficulties in evaluating the results of collocation extraction

(applies to this day)

• Difficulties in extracting non-contiguous (predicative) collocations

such as verb particles

Smadja (1993) 54

Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)

ReferencesCreutz, Mathias, and Krista Lagus. 2002. Unsupervised discovery of morphemes. In

Proc. of the 6th Meeting of the ACL Special Interest Group in Computational Phonology ,

Philadelphia, USA.

Cruse, D. Alan. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.

Kubota Ando, Rie, and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of

Japanese kanji sequences. Natural Language Engineering 9.127–49.

Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics

19.143–77.

55