2002-05-02 1Växjö: Statistical Methods I Finding Word Groups … Finding Word Groups in Spoken...


Finding Word Groups in Spoken Dialogue with Narrow Context Based Similarities

Leif Grönqvist & Magnus Gunnarsson

Presentation for the GSLT course: Statistical Methods 1

Växjö University, 2002-05-02, 16:00


Background

NordTalk and SweDanes:

Jens Allwood, Elisabeth Ahlsén, Peter Juel Henrichsen, Leif & Magnus

Comparable Danish and Swedish corpora

1.3 MToken each, natural spoken interaction

We are mainly working with spoken language, not written


Peter Juel Henrichsen’s ideas

Words with similar context distributions are called Siblings

Some pairs (seed pairs) of Swedish and Danish words with ”the same” meaning are carefully selected: Cousins

Groups of siblings in each corpus, together with the seed pairs, give new probable cousins.


Siblings as word groups

Drop the Cousins for now; focus on Siblings

Traditional parts-of-speech are not necessarily valid

What we have is the corpus. Only the corpus

We will take information from the 1+1 word context (one word on each side)

Nothing else, such as morphology or lexica
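
To make the 1+1 word context concrete, here is a minimal sketch of collecting left and right neighbour counts from a token sequence. The function name `context_counts` and the toy token list are illustrative assumptions, not taken from the presentation.

```python
from collections import Counter, defaultdict

def context_counts(tokens):
    """Count, for every word type, its immediate left and right neighbours."""
    left = defaultdict(Counter)   # left[w][p]: how often p occurs directly before w
    right = defaultdict(Counter)  # right[w][n]: how often n occurs directly after w
    for prev, word in zip(tokens, tokens[1:]):
        left[word][prev] += 1
        right[prev][word] += 1
    return left, right

# Made-up toy input, only to show the shape of the data.
tokens = "i want a red car and you want a blue car".split()
left, right = context_counts(tokens)
```

In this toy input, "red" and "blue" end up with identical left and right neighbour counts, which is the kind of evidence a Sibling measure would use.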


The original Sibling formula


Improvements of the Sibling measure

Symmetry: sib(x_1, x_2) = sib(x_2, x_1)

Similarity should be possible even if the context on one of the sides is different
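
The two requirements can be illustrated with a stand-in measure. The sketch below is not the Sibling formula or ggsib (neither is reproduced in this transcript); it assumes cosine similarity over neighbour counts and takes the better of the two sides, which is one simple way to get a symmetric score that tolerates a mismatch on one side. The names `cosine` and `side_tolerant_sim` are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two neighbour-count dictionaries."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def side_tolerant_sim(left1, right1, left2, right2):
    """Assumed stand-in, NOT ggsib: the better of the left-side and right-side scores.

    Symmetric because both cosine and max are symmetric; a complete
    mismatch on one side does not force the score to zero.
    """
    return max(cosine(left1, left2), cosine(right1, right2))

# Two words whose left contexts differ entirely but whose right contexts agree.
x1_left, x1_right = Counter({"a": 2}), Counter({"car": 2})
x2_left, x2_right = Counter({"the": 1}), Counter({"car": 1})
score = side_tolerant_sim(x1_left, x1_right, x2_left, x2_right)
```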


Trees instead of groups

Iterative use of the ggsib similarity measure

1. Calculate ggsib between all word pairs above a frequency threshold

2. Pairs with similarity above a rather high score threshold S_th are collected in a list L

3. For each pair in L: replace the less frequent of the two words with the other throughout the corpus


Trees instead of groups (cont.)

4. If L is empty: decrement S_th slightly

5. Run from step 1 again if S_th is above a lowest score threshold.

The result may be interpreted as trees
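
Steps 1 to 5 can be sketched as the loop below. This is an illustration under stated assumptions, not the authors' Perl or C program: `stand_in_sim` is a placeholder for ggsib (cosine over neighbour counts, best side wins), and the threshold values, step size and all function names are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

def contexts(tokens):
    """Left and right neighbour counts for every word type."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        left[word][prev] += 1
        right[prev][word] += 1
    return left, right

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def stand_in_sim(left, right, a, b):
    # Hypothetical placeholder for ggsib, which is not given in the slides.
    return max(cosine(left[a], left[b]), cosine(right[a], right[b]))

def build_tree(tokens, s_start=0.9, s_min=0.5, s_step=0.1, min_freq=2):
    """Return a dict mapping each replaced word to the word that replaced it."""
    tokens = list(tokens)
    parent = {}
    s_th = s_start
    while s_th > s_min:                              # step 5
        freq = Counter(tokens)
        left, right = contexts(tokens)
        frequent = sorted(w for w, c in freq.items() if c >= min_freq)
        # Steps 1-2: score all frequent pairs, keep those above S_th.
        pairs = [(a, b) for a, b in combinations(frequent, 2)
                 if stand_in_sim(left, right, a, b) > s_th]
        if not pairs:                                # step 4
            s_th = round(s_th - s_step, 10)
            continue
        # Step 3: replace the less frequent word of each pair in the corpus.
        replaced = {}
        def resolve(w):
            while w in replaced:
                w = replaced[w]
            return w
        for a, b in pairs:
            a, b = resolve(a), resolve(b)
            if a == b:
                continue
            keep, drop = (a, b) if freq[a] >= freq[b] else (b, a)
            replaced[drop] = keep
        tokens = [resolve(t) for t in tokens]
        parent.update(replaced)
    return parent

tree = build_tree("a red car a blue car a red car".split(), min_freq=1)
```

Reading `parent` as child-to-parent links gives the tree interpretation: on this toy input the only link is from "blue" to "red".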


An example tree


Implementation

Easy to implement: Peter made a Perl script

But… One step in the iteration with ~5000 word types took 100 hours

Our heavily optimized C program ran one step in less than 60 minutes, and 100 iterations in less than 100 hours


Most important optimizations

Starting point: we have enough memory but not enough time

A compiled low-level language instead of an interpreted high-level one

Frequencies for words and word pairs are stored in letter trees instead of hash tables

Try to move computation and counting outward in the loop hierarchy


Optimizations (letter trees)

Retrieving information from the letter trees takes constant time with respect to the size of the lexicon (compared to log(n) for hash tables)

But it takes linear time in the length of the word; the average word length stays constant as the lexicon grows.

Another drawback: our example needs 1 GB to run (each node in the tree is an array of all possible characters), but who cares.
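
A letter tree of this kind can be sketched as follows. The alphabet, the use of a space to join the two words of a pair into one key, and the names `Node`, `add` and `lookup` are assumptions made for the illustration; the original C data structure is not shown in the slides.

```python
# Assumed alphabet; the space is used here to separate the words of a pair.
ALPHABET = "abcdefghijklmnopqrstuvwxyzåäöéü "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

class Node:
    """One tree node: a slot for every possible character, plus a frequency."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = [None] * len(ALPHABET)
        self.count = 0

def add(root, key, n=1):
    """Increase the frequency of key, creating nodes along the path as needed."""
    node = root
    for ch in key:
        i = INDEX[ch]
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]
    node.count += n
    return node

def lookup(root, key):
    """Follow one array slot per character: cost depends on len(key) only."""
    node = root
    for ch in key:
        node = node.children[INDEX[ch]]
        if node is None:
            return 0
    return node.count

root = Node()
add(root, "bil")
add(root, "bil")
add(root, "bil och")   # a word pair stored as a single key
```

The fixed-size child array is what makes each step a plain index operation, and it is also why memory use grows quickly: every node reserves a slot for every character whether it is used or not.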


Optimizations (more)

An example of moving computation to an outer loop is to calculate the set of all context words once, and use it for comparisons with all other words

The set may be stored as an array of pointers to nodes (between words in word pairs) in the letter tree
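
The idea of hoisting work into the outer loop can be sketched like this. The shared-context count stands in for the real comparison, and the function name `all_pair_overlaps` and the input layout (a dict from word to its context words) are assumptions for the example; the original stores pointers into the letter tree rather than Python sets.

```python
def all_pair_overlaps(right_contexts):
    """Count shared context words for every word pair.

    The context set of w1 is built once per outer iteration and then
    reused for every w2, instead of being rebuilt inside the inner loop.
    """
    words = sorted(right_contexts)
    scores = {}
    for i, w1 in enumerate(words):
        ctx1 = frozenset(right_contexts[w1])      # computed once per w1
        for w2 in words[i + 1:]:
            scores[w1, w2] = sum(1 for c in right_contexts[w2] if c in ctx1)
    return scores

# Made-up toy data: word -> words seen directly after it.
toy = {"red": ["car", "bus"], "blue": ["car"], "a": ["red", "blue"]}
scores = all_pair_overlaps(toy)
```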


Personal pronouns


Colours


Problems

Sparse data

Homonyms

When to stop

Memory and time complexity


Conclusions

Our method is an interesting way of finding word groups

It works for all kinds of words (syncategorematic as well as categorematic)

Low-frequency words and homonyms are difficult to handle
