Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities

Page 1:

2002-05-02 1 Växjö: Statistical Methods I

Finding Word Groups …

Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities

Leif Grönqvist & Magnus Gunnarsson

Presentation for the GSLT course: Statistical Methods 1

Växjö University, 2002-05-02, 16:00

Page 2:

Background

NordTalk and SweDanes: Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus

Comparable Danish and Swedish corpora

1.3 million tokens each, natural spoken interaction

We are mainly working with spoken language, not written

Page 3:

Peter Juel Henrichsen's ideas

Words with similar context distributions are called Siblings

Some pairs (seed pairs) of Swedish and Danish words with "the same" meaning are carefully selected: Cousins

Groups of siblings in each corpus, together with the seed pairs, give new probable cousins.

Page 4:

Siblings as word groups

Drop the Cousins for now and focus on Siblings

Traditional parts of speech are not necessarily valid

What we have is the corpus. Only the corpus

We will take information from the 1+1 word context (one word on each side)

Nothing else, such as morphology or lexica
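
As a minimal sketch of what "only the corpus, 1+1 word context" could look like in practice, the following counts the immediate left and right neighbour of every word type. The function name `narrow_contexts`, the toy sentence, and the decision to ignore utterance boundaries are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter, defaultdict

def narrow_contexts(tokens):
    """Count the direct left and right neighbours of every word type."""
    left = defaultdict(Counter)   # left[w][v]: how often v occurs directly before w
    right = defaultdict(Counter)  # right[w][v]: how often v occurs directly after w
    for prev, word in zip(tokens, tokens[1:]):
        left[word][prev] += 1
        right[prev][word] += 1
    return left, right

# Invented toy corpus; a real run would read the transcribed dialogue.
tokens = "i want a red car and you want a blue car".split()
left, right = narrow_contexts(tokens)

print(dict(left["red"]), dict(right["red"]))
print(dict(left["blue"]), dict(right["blue"]))
```

In this toy corpus "red" and "blue" end up with identical left and right neighbours, which is the kind of evidence a context-based similarity can use.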

Page 5:

The original Sibling formula

Page 6:

Improvements of the Sibling measure

Symmetry: sib(x_1, x_2) = sib(x_2, x_1)

Similarity should be possible even if the context on one of the sides is different
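
The two requirements above can be illustrated with a stand-in measure. The sketch below is not the Sibling or ggsib formula, which this transcript does not preserve; it assumes cosine similarity over neighbour counts and takes the better of the left-side and right-side scores. The name `toy_sim` and the counts are invented for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_sim(x1, x2, left, right):
    """Stand-in similarity (NOT ggsib): the better of the two one-sided scores."""
    empty = Counter()
    left_score = cosine(left.get(x1, empty), left.get(x2, empty))
    right_score = cosine(right.get(x1, empty), right.get(x2, empty))
    return max(left_score, right_score)

# Invented neighbour counts: similar left contexts, unrelated right contexts.
left = {"red": Counter({"a": 3}), "blue": Counter({"a": 2, "the": 1})}
right = {"red": Counter({"car": 3}), "blue": Counter({"sky": 2})}

print(toy_sim("red", "blue", left, right))
print(toy_sim("blue", "red", left, right))
```

Because both `cosine` and `max` are symmetric in their arguments, swapping the two words gives the same score, and the score stays high although the right-hand contexts share nothing.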

Page 7:

Trees instead of groups

Iterative use of the ggsib similarity measure

1. Calculate ggsib between all word pairs above a frequency threshold

2. Pairs with similarity above a rather high score threshold S_th are collected in a list L

3. For each pair in L: replace the less frequent of the words with the other, in the corpus

Page 8:

Trees instead of groups (cont.)

4. If L is empty: decrement S_th slightly

5. Run from step 1 again if S_th is above a lowest score threshold.

The result may be interpreted as trees
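
Steps 1 to 5 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' program: `sim` is a stand-in for ggsib (the mean of left and right cosine similarity), and the function names, threshold values, and toy corpus are all hypothetical. Merges are recorded as child-to-parent links, which is one way of reading the result as trees.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def contexts(tokens):
    """Left and right neighbour counts for every word type."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        left[word][prev] += 1
        right[prev][word] += 1
    return left, right

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def sim(x, y, left, right):
    # Stand-in for ggsib (assumption): mean of left and right cosine similarity.
    return (cosine(left[x], left[y]) + cosine(right[x], right[y])) / 2

def find(parent, word):
    """Follow merge links up to the word that currently represents `word`."""
    while word in parent:
        word = parent[word]
    return word

def build_trees(tokens, min_freq=2, s_start=0.9, s_min=0.5, s_step=0.1):
    tokens = list(tokens)
    parent = {}                      # merged word -> word that replaced it
    s_th = s_start
    while s_th > s_min:              # step 5: stop at the lowest score threshold
        freq = Counter(tokens)
        left, right = contexts(tokens)
        words = sorted(w for w, n in freq.items() if n >= min_freq)
        # Steps 1-2: score all pairs, keep those above the current threshold.
        pairs = [(x, y) for x, y in combinations(words, 2)
                 if sim(x, y, left, right) > s_th]
        if not pairs:
            s_th = round(s_th - s_step, 6)   # step 4: lower the threshold slightly
            continue
        # Step 3: replace the less frequent word of each pair with the other.
        for x, y in pairs:
            x, y = find(parent, x), find(parent, y)
            if x == y:
                continue
            keep, drop = (x, y) if freq[x] >= freq[y] else (y, x)
            parent[drop] = keep
            freq[keep] += freq[drop]
        tokens = [find(parent, t) for t in tokens]
    return parent

# Invented toy corpus.
toy = "a red car a blue car a red house a blue house".split()
parent = build_trees(toy)
print(parent)
```

Every pass recomputes all pair scores on the rewritten corpus, which is why a single step is expensive on a realistic vocabulary.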

Page 9:

An example tree

Page 10:

Implementation

Easy to implement: Peter made a Perl script

But one step in the iteration with ~5000 word types took 100 hours

Our heavily optimized C program ran in less than 60 minutes, and 100 iterations in less than 100 hours
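
To put these figures in perspective, a small back-of-the-envelope calculation follows. It assumes that one step compares every unordered pair among the ~5000 word types and treats 60 minutes as an upper bound for the C program; the variable names are illustrative only.

```python
from math import comb

word_types = 5000
pairs = comb(word_types, 2)          # unordered word pairs to compare in one step
print(pairs)                         # 12497500

perl_seconds = 100 * 3600            # 100 hours for one step
c_seconds_max = 60 * 60              # under 60 minutes for one step

min_speedup = perl_seconds / c_seconds_max
perl_ms_per_pair = 1000 * perl_seconds / pairs
c_ms_per_pair_max = 1000 * c_seconds_max / pairs

print(min_speedup)
print(round(perl_ms_per_pair, 1), round(c_ms_per_pair_max, 2))
```

Under these assumptions the C program is at least 100 times faster per step, spending well under a millisecond on each pair.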

Page 11:

Most important optimizations

Starting point: we have enough memory but not enough time

A compiled low-level language instead of an interpreted high-level one

Frequencies for words and word pairs are stored in letter trees instead of hash tables

Try to move computation and counting outwards in the loop hierarchy

Page 12:

Optimizations (letter trees)

Retrieving information from the letter trees takes constant time with respect to the size of the lexicon (compared to log(n) for hash tables)

Retrieval is linear in the length of the word, but the average word length stays constant as the lexicon grows.

Another drawback: our example needs 1 GB to run (each node in the tree is an array of all possible characters), but who cares.
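
A letter tree of the kind described can be sketched as below. This is an assumption-laden illustration rather than the authors' C data structure: words are indexed byte by byte in UTF-8, each node holds one slot for every possible byte value, and the names `Node`, `add`, and `lookup` are invented here.

```python
ALPHABET_SIZE = 256  # assumption: one child slot per possible byte value

class Node:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * ALPHABET_SIZE  # fixed-size array, mostly empty
        self.count = 0                          # frequency of the word ending here

def add(root, word, n=1):
    """Add n occurrences of word and return its final node."""
    node = root
    for byte in word.encode("utf-8"):
        if node.children[byte] is None:
            node.children[byte] = Node()
        node = node.children[byte]
    node.count += n
    return node

def lookup(root, word):
    """Return the stored frequency; the cost depends only on the word length."""
    node = root
    for byte in word.encode("utf-8"):
        node = node.children[byte]
        if node is None:
            return 0
    return node.count

root = Node()
for w in ["och", "och", "också", "och"]:
    add(root, w)

print(lookup(root, "och"), lookup(root, "också"), lookup(root, "oc"))
```

Each lookup follows one array slot per byte, regardless of how many words are stored. The price is the memory cost mentioned above: every node carries a full array of slots even when only a few are used.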

Page 13:

Optimizations (more)

An example of moving computation to an outer loop is to calculate the set of all context words once, and use it for comparisons with all other words

The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree
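
The hoisting idea can be shown on a simplified case. The sketch below assumes left contexts only, uses a plain Python set of `(previous, word)` pairs in place of the letter tree, and counts shared context words as a stand-in for the real score. All names are hypothetical.

```python
def context_set(word, bigram_set):
    """All words seen directly before `word`."""
    return {prev for prev, w in bigram_set if w == word}

def shared_naive(words, bigram_set):
    out = {}
    for x in words:
        for y in words:
            if x != y:
                ctx_x = context_set(x, bigram_set)      # recomputed for every y
                out[x, y] = sum((c, y) in bigram_set for c in ctx_x)
    return out

def shared_hoisted(words, bigram_set):
    out = {}
    for x in words:
        ctx_x = context_set(x, bigram_set)              # computed once per x
        for y in words:
            if x != y:
                # Only a pair lookup per context word remains in the inner loop.
                out[x, y] = sum((c, y) in bigram_set for c in ctx_x)
    return out

# Invented toy data.
bigram_set = {("a", "red"), ("a", "blue"), ("the", "blue"), ("red", "car")}
words = ["red", "blue", "car"]

result = shared_hoisted(words, bigram_set)
print(result[("red", "blue")], result == shared_naive(words, bigram_set))
```

Both versions return the same scores; the hoisted one builds each context set once per word instead of once per word pair.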

Page 14:

Personal pronouns

Page 15:

Page 16:

Colours

Page 17:

Problems

Sparse data

Homonyms

When to stop

Memory and time complexity

Page 18:

Conclusions

Our method is an interesting way of finding word groups

It works for all kinds of words (syncategorematic as well as categorematic)

Low-frequency words and homonyms are difficult to handle
