Random Forests for Language Modeling


Transcript of Random Forests for Language Modeling

Page 1: Random Forests for Language Modeling


Random Forests for Language Modeling

Peng Xu and Frederick Jelinek

IPAM: January 24, 2006

Page 2: Random Forests for Language Modeling


What Is a Language Model?

A probability distribution over word sequences

Based on conditional probability distributions: the probability of a word given its history (the past words)
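Written out, the chain-rule decomposition behind this definition (standard notation, not copied from the slide) is:

$$
P(W) = P(w_1 w_2 \cdots w_N) = \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1})
$$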

Page 3: Random Forests for Language Modeling


What Is a Language Model For?

Speech recognition

Source-channel model: find the most likely word sequence W* given the acoustic signal A:

$$
W^{*} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
$$

where P(A | W) is the acoustic model and P(W) is the language model.

Page 4: Random Forests for Language Modeling


n-gram Language Models

A simple yet powerful solution to LM

(n-1) items in history: n-gram model

Maximum Likelihood (ML) estimate (written out below):

Sparseness Problem: training and test mismatch, most n-grams are never seen; need for smoothing
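The ML estimate referred to above is the standard relative-frequency estimate (standard notation, not copied from the slide):

$$
P_{ML}(w_i \mid w_{i-n+1}^{i-1}) = \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})}
$$

where C(·) counts occurrences in the training data. It assigns zero probability to every n-gram not seen in training, which is exactly the sparseness problem noted above.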

Page 5: Random Forests for Language Modeling


Sparseness Problem

Example: UPenn Treebank portion of WSJ; 1 million words of training data, 82 thousand words of test data, 10-thousand-word open vocabulary

n-gram      3      4      5      6
% unseen    54.5   75.4   83.1   86.0

Sparseness makes language modeling a difficult regression problem: an n-gram model needs at least |V|^n words to cover all n-grams
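As an illustration of how the "% unseen" figures above can be measured, here is a minimal Python sketch (my own, not the authors' code); the toy corpora are placeholders:

```python
# Count the fraction of test n-grams that never occur in the training data.
from collections import Counter

def unseen_ngram_rate(train_tokens, test_tokens, n):
    """Fraction of test n-grams that do not appear in the training data."""
    train_ngrams = Counter(zip(*[train_tokens[i:] for i in range(n)]))
    test_ngrams = list(zip(*[test_tokens[i:] for i in range(n)]))
    unseen = sum(1 for g in test_ngrams if g not in train_ngrams)
    return unseen / len(test_ngrams)

# Toy data; on the 1M-word WSJ setup this kind of count yields numbers
# like the 54.5% (trigram) to 86.0% (6-gram) shown in the table above.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()
print(unseen_ngram_rate(train, test, 3))
```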

Page 6: Random Forests for Language Modeling


More Data

More data as a solution to data sparseness

The web has "everything": but web data is noisy.

The web does NOT have everything: language models built from web data still have a data sparseness problem.

[Zhu & Rosenfeld, 2001] In 24 random web news sentences, 46 out of 453 trigrams were not covered by Altavista.

In-domain training data is not always easy to get.

Page 7: Random Forests for Language Modeling


Dealing With Sparseness in n-gram

Smoothing: take out some probability mass from seen n-grams and distribute it among unseen n-grams

Interpolated Kneser-Ney: consistently the best performance [Chen & Goodman, 1998]
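For reference, interpolated Kneser-Ney has the standard form below (standard notation, not reproduced from the slide):

$$
P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\bigl(C(w_{i-n+1}^{i}) - D,\, 0\bigr)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})
$$

where D is a discount and λ(·) is chosen so the distribution normalizes; the lower-order P_KN is built from modified ("continuation") counts.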

Page 8: Random Forests for Language Modeling


Our Approach

Extend the appealing idea of clustering histories via decision trees.

Overcome problems in decision tree construction…

… by using Random Forests!

Page 9: Random Forests for Language Modeling


Decision Tree Language Models

Decision trees: equivalence classification of histories

Each leaf is specified by the answers to a series of questions (posed to the "history") which lead from the root to that leaf.

Each leaf corresponds to a subset of the histories. Thus the histories are partitioned (i.e., classified).

Page 10: Random Forests for Language Modeling


Decision Tree Language Models: An Example

Training data (trigram events, history = first two words): aba, aca, bcb, bbb, ada

Root node {ab, ac, bc, bb, ad}: a:3, b:2
  Is the first word in {a}?  → leaf {ab, ac, ad}: a:3, b:0
  Is the first word in {b}?  → leaf {bc, bb}:     a:0, b:2

New events ‘bdb’ and ‘adb’ in test can still be classified by these questions.

New event ‘cba’ in test: Stuck! (its first history word ‘c’ is in neither {a} nor {b})
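A tiny Python sketch (my own illustration, not from the slides) of this toy tree makes the "stuck" case concrete:

```python
# Toy decision tree from the example above: histories are the first two
# characters of each event, and the questions look only at the first one.
def classify(history):
    if history[0] == "a":
        return {"a": 3, "b": 0}      # leaf {ab, ac, ad}
    if history[0] == "b":
        return {"a": 0, "b": 2}      # leaf {bc, bb}
    return None                      # no question applies: the tree is stuck

for event in ("bdb", "adb", "cba"):
    history, word = event[:2], event[2]
    print(event, "->", classify(history))   # 'cba' maps to no leaf
```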

Page 11: Random Forests for Language Modeling


Decision Tree Language Models: An Example

Example: trigrams $(w_{-2}, w_{-1}, w_0)$

Questions about positions: "Is $w_{-i} \in S$?" and "Is $w_{-i} \in S^c$?" There are two history positions for a trigram. Each pair (S, S^c) defines a possible split of a node, and therefore of the training data. S and S^c are complements with respect to the training data.

A node gets less data than its ancestors. (S, S^c) are obtained by an exchange algorithm (a sketch follows below).
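The slides do not give the exchange algorithm itself; the following is a minimal Python sketch of the usual greedy exchange procedure, under my own assumptions about the data representation (events are (history, word) pairs with indexable histories): start from a random S, then repeatedly move single words between S and S^c whenever the move increases the training-data likelihood of the induced two-way split.

```python
import math
import random
from collections import defaultdict

def split_log_likelihood(events, S, pos):
    """Training-data log-likelihood of the split induced by the question
    'Is the history word at position `pos` in S?' (ML estimates, no smoothing)."""
    counts = [defaultdict(int), defaultdict(int)]   # next-word counts per child
    totals = [0, 0]
    for history, word in events:
        side = 0 if history[pos] in S else 1
        counts[side][word] += 1
        totals[side] += 1
    ll = 0.0
    for side in (0, 1):
        for c in counts[side].values():
            ll += c * math.log(c / totals[side])
    return ll

def exchange_split(events, pos, sweeps=10, seed=0):
    """Greedy exchange: find a set S of words for history position `pos`
    that approximately maximizes the likelihood of the resulting split."""
    vocab = sorted({h[pos] for h, _ in events})
    rng = random.Random(seed)                        # random initialization
    S = {v for v in vocab if rng.random() < 0.5}
    best = split_log_likelihood(events, S, pos)
    for _ in range(sweeps):
        improved = False
        for v in vocab:
            S ^= {v}                                 # tentatively move v across
            ll = split_log_likelihood(events, S, pos)
            if ll > best and S and (set(vocab) - S): # keep both sides non-empty
                best, improved = ll, True
            else:
                S ^= {v}                             # undo the move
        if not improved:
            break
    return S, best

# Toy usage with the earlier example; histories are (w_-2, w_-1) pairs.
events = [(("a", "b"), "a"), (("a", "c"), "a"), (("b", "c"), "b"),
          (("b", "b"), "b"), (("a", "d"), "a")]
print(exchange_split(events, pos=0))
```

Both the greedy moves and the random initialization matter later: the initialization is one of the two sources of randomness exploited by the forest.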

Page 12: Random Forests for Language Modeling


Construction of Decision Trees

Data driven: decision trees are constructed on the basis of training data.

The construction requires:

1. The set of possible questions
2. A criterion evaluating the desirability of questions
3. A construction stopping rule or post-pruning rule

Page 13: Random Forests for Language Modeling


Construction of Decision Trees: Our Approach

Grow a decision tree until maximum depth using training data

Use training data likelihood to evaluate questions

Perform no smoothing during growing

Prune fully grown decision tree to maximize heldout data likelihood

Incorporate KN smoothing during pruning
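As a rough illustration of the growing phase described above (the heldout-data pruning and KN smoothing steps are omitted), here is a Python sketch; it is my own approximation, not the authors' code, and it assumes the exchange_split helper sketched on the earlier slide is in scope:

```python
def grow_tree(events, positions=(0, 1), min_events=2):
    """Grow a decision tree over trigram histories, greedily choosing at each
    node the history position and exchange split with the best training-data
    likelihood. `events` are (history, word) pairs; exchange_split as above."""
    node = {"events": events, "split": None}
    if len(events) < min_events:
        return node
    # Pick the best question among the available history positions.
    best = None
    for pos in positions:
        S, ll = exchange_split(events, pos)
        if best is None or ll > best[2]:
            best = (pos, S, ll)
    pos, S, _ = best
    left  = [(h, w) for h, w in events if h[pos] in S]
    right = [(h, w) for h, w in events if h[pos] not in S]
    if left and right:                    # only split if the data really divides
        node["split"] = (pos, S)
        node["children"] = (grow_tree(left, positions, min_events),
                            grow_tree(right, positions, min_events))
    return node
```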

Page 14: Random Forests for Language Modeling


Smoothing Decision Trees

Using ideas similar to interpolated Kneser-Ney smoothing:

Note: the histories in one node are not all smoothed in the same way. Only the leaves are used as equivalence classes.
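The smoothing formula itself is not reproduced in this transcript. Following the interpolated KN template, a plausible form (my reconstruction, with Φ(h) the leaf that history h falls into, D a discount, and λ a normalizing interpolation weight) is:

$$
P\bigl(w_i \mid \Phi(w_{i-n+1}^{i-1})\bigr) = \frac{\max\bigl(C(\Phi(w_{i-n+1}^{i-1}), w_i) - D,\, 0\bigr)}{C\bigl(\Phi(w_{i-n+1}^{i-1})\bigr)} + \lambda\bigl(\Phi(w_{i-n+1}^{i-1})\bigr)\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})
$$

Because the lower-order backoff term depends on the actual words $w_{i-n+2}^{i-1}$ rather than on the leaf, two histories in the same leaf can receive different smoothed probabilities, which is what the note above points out.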

Page 15: Random Forests for Language Modeling


Problems with Decision Trees

Training data fragmentation: as the tree is developed, the questions are selected on the basis of less and less data.

Lack of optimality: the exchange algorithm is a greedy algorithm, and so is the tree growing algorithm.

Overtraining and undertraining: deep trees fit the training data well but will not generalize well to new test data; shallow trees are not sufficiently refined.

Page 16: Random Forests for Language Modeling


Amelioration: Random Forests

Breiman applied the idea of random forests to relatively small problems [Breiman 2001].

Using different random samples of the data and randomly chosen subsets of questions, construct K decision trees.

Apply a test datum x to all the different decision trees; they produce classes y1, y2, …, yK.

Accept the plurality decision:
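The plurality rule that the slide's (not transcribed) formula expresses is simply the majority vote over the K tree outputs:

$$
\hat{y} = \arg\max_{y}\; \bigl|\{\, i : y_i = y \,\}\bigr|
$$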

Page 17: Random Forests for Language Modeling


Example of a Random Forest

[Figure: three decision trees T1, T2, T3; the example x is classified by each tree, and the forest assigns the class chosen by the plurality of the three outputs.]

Page 18: Random Forests for Language Modeling


Random Forests for Language Modeling

Two kinds of randomness:

Random selection of the positions to ask about (alternatives: position 1, position 2, or the better of the two)

Random initialization of the exchange algorithm

100 decision trees: the i-th tree estimates $P_{DT}^{(i)}(w_0 \mid w_{-2}, w_{-1})$

The final estimate is the average over all trees
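Written out, the averaged forest estimate described above is

$$
P_{RF}(w_0 \mid w_{-2}, w_{-1}) = \frac{1}{100} \sum_{i=1}^{100} P_{DT}^{(i)}(w_0 \mid w_{-2}, w_{-1})
$$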

Page 19: Random Forests for Language Modeling


Experiments

Perplexity (PPL):

UPenn Treebank part of WSJ: about 1 million words for training and heldout (90%/10%), 82 thousand words for test
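The perplexity referred to above is defined in the standard way (standard notation, not copied from the slide): for a test set of N words,

$$
\mathrm{PPL} = P(w_1 \cdots w_N)^{-1/N} = \exp\Bigl(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_1, \ldots, w_{i-1})\Bigr)
$$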

Page 20: Random Forests for Language Modeling


Experiments: trigram

Baseline: KN-trigram
No randomization: DT-trigram
100 random DTs: RF-trigram

Model         heldout PPL   Gain     test PPL   Gain
KN-trigram    160.1         -        145.0      -
DT-trigram    158.6         0.9%     163.3      -12.6%
RF-trigram    126.8         20.8%    129.7      10.5%

Page 21: Random Forests for Language Modeling


Experiments: Aggregating

Considerable improvement already with 10 trees!

Page 22: Random Forests for Language Modeling


Experiments: Analysis

Seen events:
KN-trigram: $(w_{i-n+1}^{i-1}, w_i)$ observed in the training data
DT-trigram: $(\Phi_{DT}(w_{i-n+1}^{i-1}), w_i)$ observed in the training data

Analyze test data events by the number of times they are seen in the 100 DTs

Page 23: Random Forests for Language Modeling


Experiments: Stability

The PPL results of different realizations vary, but the differences are small.

Page 24: Random Forests for Language Modeling


Experiments: Aggregation vs. Interpolation

Aggregation: uniform average over the 100 decision trees

Weighted average: interpolate the trees with tree-specific weights

Estimate the weights so as to maximize the heldout data log-likelihood
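Spelled out (the slide's formulas are not in this transcript), the two estimates being compared are presumably

$$
P_{agg}(w \mid h) = \frac{1}{M} \sum_{i=1}^{M} P_{DT}^{(i)}(w \mid h)
\qquad\text{vs.}\qquad
P_{int}(w \mid h) = \sum_{i=1}^{M} \lambda_i\, P_{DT}^{(i)}(w \mid h),\quad \sum_i \lambda_i = 1,
$$

with the weights $\lambda_i$ estimated on heldout data (here M = 100).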

Page 25: Random Forests for Language Modeling


Experiments: Aggregation vs. Interpolation

Optimal interpolation gains almost nothing!

Page 26: Random Forests for Language Modeling


Experiments: Higher-Order n-gram Models

Baseline: KN n-gram
100 random DTs: RF n-gram

n-gram   3       4       5       6
KN       145.0   140.0   138.8   138.6
RF       129.7   126.4   126.0   126.3

Page 27: Random Forests for Language Modeling


Applying Random Forests to Other Models: SLM

Structured Language Model (SLM): [Chelba & Jelinek, 2000]

Approximation: use tree triples

Model   SLM PPL
KN      137.9
RF      122.8

Page 28: Random Forests for Language Modeling


Speech Recognition Experiments (I)

Word Error Rate (WER) by N-best rescoring:

WSJ text: 20 or 40 million words of training data

WSJ DARPA’93 HUB1 test data: 213 utterances, 3,446 words

N-best rescoring: the baseline WER is 13.7%

The N-best lists were generated by a trigram baseline using Katz backoff smoothing. The baseline trigram used 40 million words for training. The oracle error rate is around 6%.

Page 29: Random Forests for Language Modeling


Speech Recognition Experiments (I)

Baseline: KN smoothing
100 random DTs for the RF 3-gram
100 random DTs for the PREDICTOR in the SLM
Approximation in the SLM

Model      3-gram (20M)   3-gram (40M)   SLM (20M)
KN         14.0%          13.0%          12.8%
RF         12.9%          12.4%          11.9%
p-value    <0.001         <0.05          <0.001

Page 30: Random Forests for Language Modeling


Speech Recognition Experiments (II)

Word Error Rate by lattice rescoring

IBM 2004 Conversational Telephony System for Rich Transcription: 1st place in the RT-04 evaluation

Fisher data: 22 million words

WEB data: 525 million words, collected using frequent Fisher n-grams as queries

Other data: Switchboard, Broadcast News, etc.

Lattice language model: 4-gram with interpolated Kneser-Ney smoothing, pruned to have 3.2 million unique n-grams, WER is 14.4%

Test set: DEV04, 37,834 words

Page 31: Random Forests for Language Modeling


Speech Recognition Experiments (II)

Baseline: KN 4-gram
110 random DTs for the EB-RF 4-gram
Sampling data without replacement
Fisher and WEB models are interpolated

Model      Fisher 4-gram   WEB 4-gram   Fisher+WEB 4-gram
KN         14.1%           15.2%        13.7%
RF         13.5%           15.0%        13.1%
p-value    <0.001          -            <0.001

Page 32: Random Forests for Language Modeling


Practical Limitations of the RF Approach

Memory: decision tree construction uses much more memory.

Little performance gain when the training data is really large.

Because we have 100 trees, the final model becomes too large to fit into memory. Effective language model compression or pruning remains an open question.

Page 33: Random Forests for Language Modeling


Conclusions: Random Forests

New RF language modeling approach

A more general LM: RF generalizes DT, which generalizes the n-gram model

Randomized history clustering

Good generalization: better n-gram coverage, less biased toward the training data

Extension of Breiman’s random forests to the data sparseness problem

Page 34: Random Forests for Language Modeling


Conclusions: Random Forests

Improvements in perplexity and/or word error rate over interpolated Kneser-Ney smoothing for different models:

n-gram (up to n = 6)
Class-based trigram
Structured Language Model

Significant improvements in the best-performing large-vocabulary conversational telephony speech recognition system