An Empirical Study on Language Model Adaptation


Transcript of An Empirical Study on Language Model Adaptation

Page 1: An Empirical Study on Language Model Adaptation

An Empirical Study on Language Model Adaptation

Jianfeng Gao, Hisami Suzuki, Microsoft Research

Wei Yuan, Shanghai Jiao Tong University

Presented by Patty Liu

Page 2: An Empirical Study on Language Model Adaptation


Outline

• Introduction
• The Language Model and the Task of IME
• Related Work
• LM Adaptation Methods
• Experimental Results
• Discussion
• Conclusion and Future Work

Page 3: An Empirical Study on Language Model Adaptation


Introduction

• Language model adaptation attempts to adjust the parameters of an LM so that it will perform well on a particular domain of data.

• In particular, we focus on the so-called cross-domain LM adaptation paradigm, that is, adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.

• The LM adaptation methods investigated here can be grouped into two categories:

(1) Maximum a posteriori (MAP): linear interpolation

(2) Discriminative training: boosting, perceptron, and minimum sample risk

Page 4: An Empirical Study on Language Model Adaptation


The Language Model and the Task of IME

• IME (Input Method Editor): users first input phonetic strings, which are then converted into appropriate word strings by software.

• Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is provided directly by users. Moreover, we can assume a unique mapping from W to A in IME, that is, P(A | W) = 1.

• From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.

W* = argmax_{W ∈ GEN(A)} P(W | A) = argmax_{W ∈ GEN(A)} P(A | W) P(W) / P(A) = argmax_{W ∈ GEN(A)} P(A | W) P(W)

P(A | W) = 1
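As a concrete illustration of the decision rule above, here is a minimal Python sketch of the conversion step. The names `gen` and `lm_logprob` are hypothetical stand-ins (not from the paper) for the candidate generator GEN(A) and a trigram LM score.

```python
def convert(phonetic_string, gen, lm_logprob):
    """Return W* = argmax_{W in GEN(A)} P(W), a sketch of the IME decision rule.

    Hypothetical pieces: gen(A) returns candidate word strings for the
    phonetic input A, and lm_logprob(W) returns log P(W) under a trigram LM.
    Because P(A | W) = 1 in IME, ranking candidates by P(W) alone is
    equivalent to ranking by P(W) * P(A | W).
    """
    return max(gen(phonetic_string), key=lm_logprob)
```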

Page 5: An Empirical Study on Language Model Adaptation


Related Work (1/3)

I. Measuring Domain Similarity:

• L: a language

• p: the true underlying probability distribution of L

• q: another distribution (e.g., an SLM) which attempts to model p

• H(L, q): the cross entropy of the language L with respect to the model q

• w_1 ... w_n: a word string in L

H(L, q) = -lim_{n→∞} (1/n) Σ_{w_1...w_n} p(w_1 ... w_n) log q(w_1 ... w_n)

Page 6: An Empirical Study on Language Model Adaptation


Related Work (2/3)

• However, in reality, the underlying p is never known and the corpus size is never infinite. We therefore make the assumption that L is an ergodic and stationary process, and approximate the cross entropy by calculating it for a sufficiently large n instead of taking the limit.

• The cross entropy takes into account both the similarity between two distributions (given by KL divergence) and the entropy of the corpus in question.

H(L, q) ≈ -(1/n) log q(w_1 ... w_n)

H(L, q) = H(L) + D(p || q)

where D(p || q) = Σ_{w_1...w_n} p(w_1 ... w_n) log [ p(w_1 ... w_n) / q(w_1 ... w_n) ]
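The approximation above is straightforward to compute in practice. Here is a minimal sketch, assuming a hypothetical `log2_prob(w, history)` that returns log2 q(w | history) for a smoothed trigram model q (so probabilities are never zero).

```python
def approx_cross_entropy(corpus_words, log2_prob):
    """Approximate H(L, q) as -(1/n) * log2 q(w_1 ... w_n) on a large corpus.

    corpus_words is the test corpus as a flat list of words; log2_prob is a
    hypothetical callable giving log2 q(w | history) for a trigram model q.
    The result is in bits per word; 2 ** result gives the perplexity.
    """
    n = len(corpus_words)
    total_logprob = 0.0
    for i, w in enumerate(corpus_words):
        history = tuple(corpus_words[max(0, i - 2):i])  # two preceding words
        total_logprob += log2_prob(w, history)
    return -total_logprob / n
```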

Page 7: An Empirical Study on Language Model Adaptation


Related Work (3/3)

II. LM Adaptation Methods

• MAP: adjust the parameters of the background model → maximize the likelihood of the adaptation data

• Discriminative training methods: use the adaptation data → directly minimize the errors the background model makes on it

• These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.

Page 8: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ LI

I. The Linear Interpolation Method

• P_B: the probability of the background model

• P_A: the probability of the adaptation model

• h: the history, which corresponds to the two preceding words

• λ: for simplicity, we chose a single λ for all histories and tuned it on held-out data

P(w_i | h) = λ P_B(w_i | h) + (1 - λ) P_A(w_i | h)
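A minimal sketch of LI and of tuning the single λ on held-out data follows. The callables `p_b` and `p_a` are hypothetical stand-ins for the background and adaptation trigram probabilities (assumed smoothed, so they never return zero); the grid of candidate λ values is an illustrative choice, not from the paper.

```python
import math

def interpolated_prob(w, h, p_b, p_a, lam):
    """P(w | h) = lam * P_B(w | h) + (1 - lam) * P_A(w | h)."""
    return lam * p_b(w, h) + (1.0 - lam) * p_a(w, h)

def tune_lambda(heldout_ngrams, p_b, p_a, grid=None):
    """Pick the single lambda (shared by all histories) with the lowest
    held-out cross entropy. heldout_ngrams is a list of (history, word) pairs."""
    grid = grid or [i / 20 for i in range(1, 20)]
    def cross_entropy(lam):
        return -sum(math.log(interpolated_prob(w, h, p_b, p_a, lam))
                    for h, w in heldout_ngrams) / len(heldout_ngrams)
    return min(grid, key=cross_entropy)
```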

Page 9: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (1/3)

II. Discriminative Training Methods

◎ Problem Definition

Page 10: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (2/3)

The problem is formulated with a linear model, Score(W, λ) = Σ_d λ_d f_d(W), which views IME as a ranking problem: the model gives a ranking score, not probabilities. We therefore do not evaluate the LMs obtained using discriminative training via perplexity.

Page 11: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (3/3)

• W^R: the reference transcript

• Er(.): an error function, which is an edit distance function in this case

• SR(.): the sample risk, the sum of error counts over the training samples:

SR(λ) = Σ_{i=1..M} Er(W_i^R, W*(A_i, λ)),    λ* = argmin_λ SR(λ)

• Discriminative training methods strive to minimize SR(.) by optimizing the model parameters. However, SR(.) cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined.

• Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate SR(.) by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize SR(.) directly.
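A minimal sketch of evaluating SR(λ) for a given λ follows. The names `gen`, `score`, and `char_edit_distance` are hypothetical stand-ins for GEN(.), the linear model Score(., λ), and the error function Er(.).

```python
def sample_risk(lam, training_samples, gen, score, char_edit_distance):
    """SR(lam) = sum over i of Er(W_i^R, W*(A_i, lam)).

    training_samples is a list of (A_i, W_i_ref) pairs; gen(A) yields
    candidate conversions, score(W, lam) is the linear model score, and
    char_edit_distance plays the role of Er(.). All are hypothetical
    stand-ins for the components described above.
    """
    total = 0
    for a, w_ref in training_samples:
        w_star = max(gen(a), key=lambda w: score(w, lam))  # W*(A_i, lam)
        total += char_edit_distance(w_ref, w_star)
    return total
```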

Page 12: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Boosting Algorithm (1/2)

(i) The Boosting Algorithm

• margin: M(W^R, W) = Score(W^R, λ) - Score(W, λ)

• a ranking error: an incorrect candidate conversion gets a higher score than the correct conversion

• RLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} I[ M(W_i^R, W) ≤ 0 ], where I[x] = 1 if x is true, and 0 otherwise

• Optimizing RLoss directly is NP-complete → the boosting algorithm instead optimizes its upper bound, ExpLoss:

ExpLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} exp( -M(W_i^R, W) )

• ExpLoss is convex
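A minimal sketch of evaluating ExpLoss for a given λ, the quantity the boosting algorithm minimizes. The names `gen` and `score` are the same hypothetical stand-ins as in the earlier sample-risk sketch.

```python
import math

def exp_loss(lam, training_samples, gen, score):
    """ExpLoss(lam) = sum_i sum_{W in GEN(A_i)} exp(-M(W_i^R, W)),
    with margin M(W_R, W) = Score(W_R, lam) - Score(W, lam)."""
    loss = 0.0
    for a, w_ref in training_samples:
        ref_score = score(w_ref, lam)
        for w in gen(a):
            loss += math.exp(-(ref_score - score(w, lam)))
    return loss
```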

Page 13: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Boosting Algorithm (2/2)

• C_d^+: a value increasing exponentially with the sum of the margins of (W^R, W) pairs over the set where f_d is seen in W^R but not in W

• C_d^-: the corresponding value related to the sum of margins over the set where f_d is seen in W but not in W^R

• ε: a smoothing factor (whose value is optimized on held-out data)

• Z: a normalization constant

The update for the selected feature f_d is δ_d = (1/2) log( (C_d^+ + εZ) / (C_d^- + εZ) ).

Page 14: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Perceptron Algorithm (1/2)

(ii) The Perceptron Algorithm

• delta rule: δλ_d = η · G_d(λ), with

MSELoss(λ) = (1/2) Σ_{i=1..M} ( Score(W_i^R, λ) - Score(W_i, λ) )²

G_d(λ) = Σ_{i=1..M} ( Score(W_i^R, λ) - Score(W_i, λ) ) ( f_d(W_i^R) - f_d(W_i) )

• stochastic approximation (updating on one training sample at a time):

MSELoss_i(λ) = (1/2) ( Score(W_i^R, λ) - Score(W_i, λ) )²

G_{d,i}(λ) = ( Score(W_i^R, λ) - Score(W_i, λ) ) ( f_d(W_i^R) - f_d(W_i) )

Page 15: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ The Perceptron Algorithm (2/2)

• averaged perceptron algorithm

λ_d(avg) = ( Σ_{t=1..T} Σ_{i=1..M} λ_d^{t,i} ) / (T · M)

where λ_d^{t,i} denotes the value of λ_d after processing the i-th training sample in the t-th pass over the M training samples.
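A minimal sketch of an averaged perceptron training loop follows. It uses the standard structured-perceptron update (move the weights toward the reference features when the top-scoring candidate is wrong), which is one common instantiation of the stochastic update above, and averages the weights over all T · M steps as in the formula. The names `gen` and `features` are hypothetical stand-ins (features(W) returns a {feature_id: value} dict).

```python
def averaged_perceptron(training_samples, gen, features, num_features,
                        epochs=3, eta=1.0):
    """Train lambda with a perceptron and return the averaged weights."""
    lam = [0.0] * num_features
    lam_sum = [0.0] * num_features
    steps = 0

    def score(w):
        # Current linear model score Score(W, lam) = sum_d lam_d * f_d(W).
        return sum(lam[d] * v for d, v in features(w).items())

    for _ in range(epochs):                      # T passes over the data
        for a, w_ref in training_samples:        # M samples per pass
            w_best = max(gen(a), key=score)
            if w_best != w_ref:                  # ranking error: move toward W^R
                for d, v in features(w_ref).items():
                    lam[d] += eta * v
                for d, v in features(w_best).items():
                    lam[d] -= eta * v
            for d in range(num_features):        # accumulate for averaging
                lam_sum[d] += lam[d]
            steps += 1
    return [s / steps for s in lam_sum]          # lambda_avg over T * M steps
```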

Page 16: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ MSR(1/7)

(iii) The Minimum Sample Risk (MSR) Method

• Conceptually, MSR operates like any multidimensional function optimization approach:

- The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, adjusting the parameter of the selected feature while keeping all other parameters fixed.

- Then, from that point, SR is minimized along the second direction to its minimum, and so on.

- Cycling through the whole set of directions as many times as necessary, until SR stops decreasing (a sketch of this outer loop follows below).
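The outer loop can be sketched as coordinate descent over the selected features. The names `line_search(d, lam)` and `sample_risk(lam)` are hypothetical stand-ins for the grid line search described on the following slides and for SR(λ).

```python
def msr_outer_loop(lam, selected_features, line_search, sample_risk,
                   max_passes=10):
    """Cycle through the selected features, minimizing SR along one direction
    at a time, until SR stops decreasing (a sketch of the MSR outer loop)."""
    best = sample_risk(lam)
    for _ in range(max_passes):
        for d in selected_features:
            lam[d] = line_search(d, lam)   # 1-D minimization of SR along d
        current = sample_risk(lam)
        if current >= best:                # SR stopped decreasing
            break
        best = current
    return lam
```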

Page 17: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(2/7)

• This simple method can work properly under two assumptions.

- First, there exists an implementation of line search that efficiently optimizes the function along one direction.

- Second, the number of candidate features is not too large, and they are not highly correlated.

• However, neither of the assumptions holds in our case.

- First of all, Er(.) in SR(λ) = Σ_{i=1..M} Er(W_i^R, W*(A_i, λ)) is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures; a grid search has to be used instead. However, there are problems with simple grid search: using a large grid could miss the optimal solution, whereas using a fine-grained grid would lead to a very slow algorithm.

- Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.

Page 18: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(3/7)

◎ Active candidate of a group:

• W: a candidate word string, W ∈ GEN(A)

• When searching along direction d, the score decomposes as Score(W, λ) = λ_d f_d(W) + Σ_{d'≠d} λ_{d'} f_{d'}(W).

• Since in our case f_d(W) takes integer values (f_d(W) is the count of a particular n-gram in W), we can group the candidates using f_d(W) so that candidates in each group have the same value of f_d(W).

• In each group, we define the candidate with the highest value of Σ_{d'≠d} λ_{d'} f_{d'}(W) as the active candidate of the group, because no matter what value λ_d takes, only this candidate could be selected according to:

W*(A, λ) = argmax_{W ∈ GEN(A)} Score(W, λ)
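A minimal sketch of grouping GEN(A) by f_d(W) and keeping one active candidate per group. The names `f_d` and `score_without_d` are hypothetical stand-ins for the feature count being tuned and the partial score Σ_{d'≠d} λ_{d'} f_{d'}(W).

```python
from collections import defaultdict

def active_candidates(candidates, f_d, score_without_d):
    """Map each value of f_d(W) to the single candidate that can win for it.

    candidates is GEN(A); f_d(W) is the integer n-gram count being tuned;
    score_without_d(W) is the part of Score(W, lam) that does not depend on
    lambda_d. Within a group sharing the same f_d value, only the candidate
    with the largest partial score can ever be selected by the argmax,
    whatever value lambda_d takes.
    """
    groups = defaultdict(list)
    for w in candidates:
        groups[f_d(w)].append(w)
    return {v: max(ws, key=score_without_d) for v, ws in groups.items()}
```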

Page 19: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(4/7)

◎ Grid Line Search

• By finding the active candidates, we can reduce GEN(A) to a much smaller list of active candidates. We can find a set of intervals for λ_d, within each of which a particular active candidate will be selected as W*.

• As a result, for each training sample, we obtain a sequence of intervals and their corresponding Er(.) values. The optimal value λ_d* can then be found by traversing the sequence and taking the midpoint of the interval with the lowest Er(.) value.

• By merging the sequences of intervals of all training samples in the training set, we obtain a global sequence of intervals as well as their corresponding sample risk. We can then find the optimal value λ_d* as well as the minimal sample risk by traversing the global interval sequence.
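A minimal sketch of the final merge-and-pick step, assuming each training sample has already been reduced to a list of (left, right, error) triples, i.e., the Er(.) value incurred when λ_d falls in [left, right). That input representation is an assumption made here for illustration.

```python
def grid_line_search(interval_lists):
    """Return the lambda_d minimizing the merged sample risk (a sketch).

    interval_lists[i] is the list of (left, right, error) triples for training
    sample i. The breakpoints of all samples define a global sequence of
    intervals on which SR is constant; we evaluate SR at each midpoint and
    return the best one.
    """
    points = sorted({p for intervals in interval_lists
                       for left, right, _ in intervals
                       for p in (left, right)})

    def risk_at(lam):
        total = 0
        for intervals in interval_lists:
            for left, right, err in intervals:
                if left <= lam < right:
                    total += err
                    break
        return total

    midpoints = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    return min(midpoints, key=risk_at)
```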

Page 20: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(5/7)

◎ Feature Subset Selection

• Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.

• Effectiveness E(f_i) of a candidate feature f_i: the reduction in sample risk obtained by adding f_i with its optimal weight, normalized by the largest such reduction over all candidate features.

• The cross-correlation coefficient between two features f_i and f_j:

C(i, j) = Σ_{m=1..M} x_{i,m} x_{j,m} / sqrt( (Σ_{m=1..M} x_{i,m}²) (Σ_{m=1..M} x_{j,m}²) ),    C(i, j) ∈ [0, 1]

where x_{i,m} is the value of feature f_i on the m-th training sample.
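A minimal sketch of the cross-correlation computation, with x_i and x_j taken to be the value vectors of features f_i and f_j over the M training samples (an assumption about what the x's range over).

```python
import math

def cross_correlation(x_i, x_j):
    """C(i, j) = sum_m x_i[m]*x_j[m] / sqrt(sum_m x_i[m]^2 * sum_m x_j[m]^2).

    For non-negative feature counts the result lies in [0, 1]; highly
    correlated features (C close to 1) are avoided during feature selection.
    """
    num = sum(a * b for a, b in zip(x_i, x_j))
    den = math.sqrt(sum(a * a for a in x_i) * sum(b * b for b in x_j))
    return num / den if den else 0.0
```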

Page 21: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(6/7)

Page 22: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(7/7)

• D: the number of all candidate features

• K: the number of features in the resulting model, K << D

• According to the feature selection method:

- step 1: E(.) must be estimated for each of the D candidate features

- step 4: O(K · D) estimates of C(.) are required

• Therefore, we only estimate the value of C(.) between each of the selected features and each of the top N remaining features with the highest value of E(.). This reduces the number of estimates of C(.) to O(K · N).

Page 23: An Empirical Study on Language Model Adaptation


Experimental Results (1/3)

I. Data

• The data used in our experiments stems from five distinct sources of text.

• Different sizes of adaptation training data were also used to show how the amount of adaptation training data affects the performance of the various adaptation methods.

Nikkei: newspaper
Yomiuri: newspaper
TuneUp: balanced corpus (newspaper and other sources)
Encarta: encyclopedia
Shincho: novels

Page 24: An Empirical Study on Language Model Adaptation


Experimental Results (2/3)

II. Computing Domain Characteristics

(i) The similarity between two domains: cross entropy

- not symmetric

- self entropy (the diversity of the corpus) increases in the following order: N → Y → E → T → S

Page 25: An Empirical Study on Language Model Adaptation


Experimental Results (3/3)

III. Results of LM Adaptation

• We trained our baseline trigram model on our background (Nikkei) corpus.

Page 26: An Empirical Study on Language Model Adaptation


Discussion (1/6)

I. Domain Similarity and CER

• The more similar the adaptation domain is to the background domain, the better the CER results.

Page 27: An Empirical Study on Language Model Adaptation


Discussion (2/6)

II. Domain Similarity and the Robustness of Adaptation Methods

• The discriminative methods outperform LI in most cases.

• The performance of LI is greatly influenced by domain similarity. Such a limitation is not observed with the discriminative methods.

Page 28: An Empirical Study on Language Model Adaptation


Discussion (3/6)

III. Adaptation Data Size and CER Reduction

• X-axis: self entropy

• Y-axis: the improvement in CER reduction

• a positive correlation between the diversity of the adaptation corpus and the benefit of having more training data available

• An intuitive explanation: The less diverse the adaptation data, the fewer distinct training examples will be included for discriminative training.

Page 29: An Empirical Study on Language Model Adaptation


Discussion (4/6)

IV. Domain Characteristics and Error Ratios

• error ratio (ER) metric, which measures the side effects of a new model:

ER = |E_A| / |E_B|

• |E_A|: the number of errors found only in the new (adaptation) model

• |E_B|: the number of errors corrected by the new model

• ER = 0 if the adapted model introduces no new errors

• ER < 1 if the adapted model makes CER improvements

• ER = 1 if the CER improvement is zero (i.e., the adapted model makes as many new mistakes as it corrects old mistakes)

• ER > 1 when the adapted model has worse CER performance than the baseline model
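A minimal sketch of computing ER from the two error sets, assuming errors are represented as sets of (sentence_id, position) pairs on a shared test set; that representation is an illustrative choice, not from the paper.

```python
def error_ratio(baseline_errors, adapted_errors):
    """ER = |E_A| / |E_B|.

    E_A: errors found only in the adapted model (new errors it introduces).
    E_B: baseline errors that the adapted model corrects.
    """
    e_a = adapted_errors - baseline_errors
    e_b = baseline_errors - adapted_errors
    return len(e_a) / len(e_b) if e_b else (0.0 if not e_a else float("inf"))
```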

Page 30: An Empirical Study on Language Model Adaptation


Discussion (5/6)

• RER: relative error rate reduction, i.e., the CER difference between the background and adapted models in %

• A discriminative method (in this case MSR) is superior to linear interpolation, not only in terms of CER reduction but also in having fewer side effects.

Page 31: An Empirical Study on Language Model Adaptation


Discussion (6/6)

• Although the boosting and perceptron algorithms have the same CER for Yomiuri and TuneUp from Table III, the perceptron is better in terms of ER. This may be due to the use of an exponential loss function in the boosting algorithm, which is less robust against noisy data.

• Corpus diversity: the less stylistically diverse a corpus is, the more consistent it is within its domain.

Page 32: An Empirical Study on Language Model Adaptation


Conclusion and Future Work

• Conclusion:

(1) cross-domain similarity (cross entropy) correlates with the CER of all models

(2) diversity (self entropy) correlates with the utility of more adaptation training data for discriminative training methods

• Future Work: an online learning scenario