An Empirical Study on Language Model Adaptation


Transcript of An Empirical Study on Language Model Adaptation

Page 1: An Empirical Study on Language Model Adaptation

An Empirical Study on Language Model Adaptation

Jianfeng Gao, Hisami Suzuki, Microsoft Research

Wei Yuan, Shanghai Jiao Tong University

Presented by Patty Liu

Page 2: An Empirical Study on Language Model Adaptation


Outline

• Introduction
• The Language Model and the Task of IME
• Related Work
• LM Adaptation Methods
• Experimental Results
• Discussion
• Conclusion and Future Work

Page 3: An Empirical Study on Language Model Adaptation


Introduction

• Language model adaptation attempts to adjust the parameters of an LM so that it will perform well on a particular domain of data.

• In particular, we focus on the so-called cross-domain LM adaptation paradigm, that is, adapting an LM trained on one domain (the background domain) to a different domain (the adaptation domain), for which only a small amount of training data is available.

• The LM adaptation methods investigated here can be grouped into two categories:

(1) Maximum a posteriori (MAP): linear interpolation

(2) Discriminative training: boosting, perceptron, and minimum sample risk

Page 4: An Empirical Study on Language Model Adaptation


The Language Model and the Task of IME

• IME (Input Method Editor): users first input phonetic strings, which are then converted into appropriate word strings by software.

• Unlike speech recognition, there is no acoustic ambiguity in IME, since the phonetic string is provided directly by users. Moreover, we can assume a unique mapping from W to A in IME, that is, P(A | W) = 1.

• From the perspective of LM adaptation, IME faces the same problem that speech recognition does: the quality of the model depends heavily on the similarity between the training data and the test data.

W* = argmax_{W ∈ GEN(A)} P(W | A) = argmax_{W ∈ GEN(A)} P(A | W) P(W) / P(A) = argmax_{W ∈ GEN(A)} P(A | W) P(W)

P(A | W) = 1
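As a concrete illustration of the decision rule above, here is a minimal Python sketch of the conversion step. The names `gen` and `lm_logprob` are hypothetical stand-ins (not from the paper) for the candidate generator GEN(A) and a trigram LM score.

```python
def convert(phonetic_string, gen, lm_logprob):
    """Return W* = argmax_{W in GEN(A)} P(W), a sketch of the IME decision rule.

    Hypothetical pieces: gen(A) returns candidate word strings for the
    phonetic input A, and lm_logprob(W) returns log P(W) under a trigram LM.
    Because P(A | W) = 1 in IME, ranking candidates by P(W) alone is
    equivalent to ranking by P(W) * P(A | W).
    """
    return max(gen(phonetic_string), key=lm_logprob)
```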

Page 5: An Empirical Study on Language Model Adaptation


Related Work (1/3)

I. Measuring Domain Similarity:

• L: a language

• p: the true underlying probability distribution of L

• q: another distribution (e.g., an SLM) which attempts to model p

• H(L, q): the cross entropy of the language L with respect to the model q

• w_1 ... w_n: a word string in L

H(L, q) = -lim_{n→∞} (1/n) Σ_{w_1...w_n} p(w_1 ... w_n) log q(w_1 ... w_n)

Page 6: An Empirical Study on Language Model Adaptation


Related Work (2/3)

• However, in reality, the underlying p is never known and the corpus size is never infinite. We therefore make the assumption that L is an ergodic and stationary process, and approximate the cross entropy by calculating it for a sufficiently large n instead of taking the limit.

• The cross entropy takes into account both the similarity between two distributions (given by KL divergence) and the entropy of the corpus in question.

H(L, q) ≈ -(1/n) log q(w_1 ... w_n)

H(L, q) = H(L) + D(p || q)

where D(p || q) = Σ_{w_1...w_n} p(w_1 ... w_n) log [ p(w_1 ... w_n) / q(w_1 ... w_n) ]
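The approximation above is straightforward to compute in practice. Here is a minimal sketch, assuming a hypothetical `log2_prob(w, history)` that returns log2 q(w | history) for a smoothed trigram model q (so probabilities are never zero).

```python
def approx_cross_entropy(corpus_words, log2_prob):
    """Approximate H(L, q) as -(1/n) * log2 q(w_1 ... w_n) on a large corpus.

    corpus_words is the test corpus as a flat list of words; log2_prob is a
    hypothetical callable giving log2 q(w | history) for a trigram model q.
    The result is in bits per word; 2 ** result gives the perplexity.
    """
    n = len(corpus_words)
    total_logprob = 0.0
    for i, w in enumerate(corpus_words):
        history = tuple(corpus_words[max(0, i - 2):i])  # two preceding words
        total_logprob += log2_prob(w, history)
    return -total_logprob / n
```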

Page 7: An Empirical Study on Language Model Adaptation


Related Work (3/3)

II. LM Adaptation Methods

• MAP: adjust the parameters of the background model → maximize the likelihood of the adaptation data

• Discriminative training methods: use the adaptation data → directly minimize the errors the background model makes on it

• These techniques have been applied successfully to language modeling in non-adaptation as well as adaptation scenarios for speech recognition.

Page 8: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ LI

I. The Linear Interpolation Method

• P_B: the probability of the background model

• P_A: the probability of the adaptation model

• h: the history, which corresponds to the two preceding words

• λ: for simplicity, we chose a single λ for all histories and tuned it on held-out data

P(w_i | h) = λ P_B(w_i | h) + (1 - λ) P_A(w_i | h)
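A minimal sketch of LI and of tuning the single λ on held-out data follows. The callables `p_b` and `p_a` are hypothetical stand-ins for the background and adaptation trigram probabilities (assumed smoothed, so they never return zero); the grid of candidate λ values is an illustrative choice, not from the paper.

```python
import math

def interpolated_prob(w, h, p_b, p_a, lam):
    """P(w | h) = lam * P_B(w | h) + (1 - lam) * P_A(w | h)."""
    return lam * p_b(w, h) + (1.0 - lam) * p_a(w, h)

def tune_lambda(heldout_ngrams, p_b, p_a, grid=None):
    """Pick the single lambda (shared by all histories) with the lowest
    held-out cross entropy. heldout_ngrams is a list of (history, word) pairs."""
    grid = grid or [i / 20 for i in range(1, 20)]
    def cross_entropy(lam):
        return -sum(math.log(interpolated_prob(w, h, p_b, p_a, lam))
                    for h, w in heldout_ngrams) / len(heldout_ngrams)
    return min(grid, key=cross_entropy)
```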

Page 9: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (1/3)

II. Discriminative Training Methods

◎ Problem Definition

Page 10: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (2/3)

The problem is formulated with a linear model, Score(W, λ) = Σ_d λ_d f_d(W), which views IME as a ranking problem: the model gives a ranking score, not probabilities. We therefore do not evaluate the LMs obtained using discriminative training via perplexity.

Page 11: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ Problem Definition of Discriminative Training Methods (3/3)

• W^R: the reference transcript

• Er(.): an error function, which is an edit distance function in this case

• SR(.): the sample risk, the sum of error counts over the training samples:

SR(λ) = Σ_{i=1..M} Er(W_i^R, W*(A_i, λ)),    λ* = argmin_λ SR(λ)

• Discriminative training methods strive to minimize SR(.) by optimizing the model parameters. However, SR(.) cannot be optimized easily, since Er(.) is a piecewise constant (or step) function of λ and its gradient is undefined.

• Therefore, discriminative methods apply different approaches that optimize it approximately. The boosting and perceptron algorithms approximate SR(.) by loss functions that are suitable for optimization, while MSR uses a simple heuristic training procedure to minimize SR(.) directly.
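A minimal sketch of evaluating SR(λ) for a given λ follows. The names `gen`, `score`, and `char_edit_distance` are hypothetical stand-ins for GEN(.), the linear model Score(., λ), and the error function Er(.).

```python
def sample_risk(lam, training_samples, gen, score, char_edit_distance):
    """SR(lam) = sum over i of Er(W_i^R, W*(A_i, lam)).

    training_samples is a list of (A_i, W_i_ref) pairs; gen(A) yields
    candidate conversions, score(W, lam) is the linear model score, and
    char_edit_distance plays the role of Er(.). All are hypothetical
    stand-ins for the components described above.
    """
    total = 0
    for a, w_ref in training_samples:
        w_star = max(gen(a), key=lambda w: score(w, lam))  # W*(A_i, lam)
        total += char_edit_distance(w_ref, w_star)
    return total
```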

Page 12: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Boosting Algorithm (1/2)

(i) The Boosting Algorithm

• margin: M(W^R, W) = Score(W^R, λ) - Score(W, λ)

• a ranking error: an incorrect candidate conversion gets a higher score than the correct conversion

• RLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} I[ M(W_i^R, W) ≤ 0 ], where I[x] = 1 if x is true, and 0 otherwise

• Optimizing RLoss directly is NP-complete → the boosting algorithm instead optimizes its upper bound, ExpLoss:

ExpLoss(λ) = Σ_i Σ_{W ∈ GEN(A_i)} exp( -M(W_i^R, W) )

• ExpLoss is convex
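A minimal sketch of evaluating ExpLoss for a given λ, the quantity the boosting algorithm minimizes. The names `gen` and `score` are the same hypothetical stand-ins as in the earlier sample-risk sketch.

```python
import math

def exp_loss(lam, training_samples, gen, score):
    """ExpLoss(lam) = sum_i sum_{W in GEN(A_i)} exp(-M(W_i^R, W)),
    with margin M(W_R, W) = Score(W_R, lam) - Score(W, lam)."""
    loss = 0.0
    for a, w_ref in training_samples:
        ref_score = score(w_ref, lam)
        for w in gen(a):
            loss += math.exp(-(ref_score - score(w, lam)))
    return loss
```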

Page 13: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Boosting Algorithm (2/2)

• C_d^+: a value increasing exponentially with the sum of the margins of (W^R, W) pairs over the set where f_d is seen in W^R but not in W

• C_d^-: the corresponding value related to the sum of margins over the set where f_d is seen in W but not in W^R

• ε: a smoothing factor (whose value is optimized on held-out data)

• Z: a normalization constant

The update for the selected feature f_d is δ_d = (1/2) log( (C_d^+ + εZ) / (C_d^- + εZ) ).

Page 14: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ The Perceptron Algorithm (1/2)

(ii) The Perceptron Algorithm

• delta rule: δλ_d = η · G_d(λ), with

MSELoss(λ) = (1/2) Σ_{i=1..M} ( Score(W_i^R, λ) - Score(W_i, λ) )²

G_d(λ) = Σ_{i=1..M} ( Score(W_i^R, λ) - Score(W_i, λ) ) ( f_d(W_i^R) - f_d(W_i) )

• stochastic approximation (updating on one training sample at a time):

MSELoss_i(λ) = (1/2) ( Score(W_i^R, λ) - Score(W_i, λ) )²

G_{d,i}(λ) = ( Score(W_i^R, λ) - Score(W_i, λ) ) ( f_d(W_i^R) - f_d(W_i) )

Page 15: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ The Perceptron Algorithm (2/2)

• averaged perceptron algorithm

λ_d(avg) = ( Σ_{t=1..T} Σ_{i=1..M} λ_d^{t,i} ) / (T · M)

where λ_d^{t,i} denotes the value of λ_d after processing the i-th training sample in the t-th pass over the M training samples.
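A minimal sketch of an averaged perceptron training loop follows. It uses the standard structured-perceptron update (move the weights toward the reference features when the top-scoring candidate is wrong), which is one common instantiation of the stochastic update above, and averages the weights over all T · M steps as in the formula. The names `gen` and `features` are hypothetical stand-ins (features(W) returns a {feature_id: value} dict).

```python
def averaged_perceptron(training_samples, gen, features, num_features,
                        epochs=3, eta=1.0):
    """Train lambda with a perceptron and return the averaged weights."""
    lam = [0.0] * num_features
    lam_sum = [0.0] * num_features
    steps = 0

    def score(w):
        # Current linear model score Score(W, lam) = sum_d lam_d * f_d(W).
        return sum(lam[d] * v for d, v in features(w).items())

    for _ in range(epochs):                      # T passes over the data
        for a, w_ref in training_samples:        # M samples per pass
            w_best = max(gen(a), key=score)
            if w_best != w_ref:                  # ranking error: move toward W^R
                for d, v in features(w_ref).items():
                    lam[d] += eta * v
                for d, v in features(w_best).items():
                    lam[d] -= eta * v
            for d in range(num_features):        # accumulate for averaging
                lam_sum[d] += lam[d]
            steps += 1
    return [s / steps for s in lam_sum]          # lambda_avg over T * M steps
```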

Page 16: An Empirical Study on Language Model Adaptation


LM Adaptation Methods─ MSR(1/7)

(iii) The Minimum Sample Risk (MSR) Method

• Conceptually, MSR operates like any multidimensional function optimization approach:

- The first direction (i.e., feature) is selected and SR is minimized along that direction using a line search, that is, adjusting the parameter of the selected feature while keeping all other parameters fixed.

- Then, from that point, SR is minimized along the second direction to its minimum, and so on.

- Cycling through the whole set of directions as many times as necessary, until SR stops decreasing (a sketch of this outer loop follows below).
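The outer loop can be sketched as coordinate descent over the selected features. The names `line_search(d, lam)` and `sample_risk(lam)` are hypothetical stand-ins for the grid line search described on the following slides and for SR(λ).

```python
def msr_outer_loop(lam, selected_features, line_search, sample_risk,
                   max_passes=10):
    """Cycle through the selected features, minimizing SR along one direction
    at a time, until SR stops decreasing (a sketch of the MSR outer loop)."""
    best = sample_risk(lam)
    for _ in range(max_passes):
        for d in selected_features:
            lam[d] = line_search(d, lam)   # 1-D minimization of SR along d
        current = sample_risk(lam)
        if current >= best:                # SR stopped decreasing
            break
        best = current
    return lam
```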

Page 17: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(2/7)

• This simple method can work properly under two assumptions.

- First, there exists an implementation of line search that efficiently optimizes the function along one direction.

- Second, the number of candidate features is not too large, and they are not highly correlated.

• However, neither of the assumptions holds in our case.

- First of all, Er(.) in SR(λ) = Σ_{i=1..M} Er(W_i^R, W*(A_i, λ)) is a step function of λ, and thus cannot be optimized directly by regular gradient-based procedures; a grid search has to be used instead. However, there are problems with simple grid search: using a large grid could miss the optimal solution, whereas using a fine-grained grid would lead to a very slow algorithm.

- Second, in the case of LM, there are millions of candidate features, some of which are highly correlated with each other.

Page 18: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(3/7)

◎ Active candidate of a group:

• W: a candidate word string, W ∈ GEN(A)

• When searching along direction d, the score decomposes as Score(W, λ) = λ_d f_d(W) + Σ_{d'≠d} λ_{d'} f_{d'}(W).

• Since in our case f_d(W) takes integer values (f_d(W) is the count of a particular n-gram in W), we can group the candidates using f_d(W) so that candidates in each group have the same value of f_d(W).

• In each group, we define the candidate with the highest value of Σ_{d'≠d} λ_{d'} f_{d'}(W) as the active candidate of the group, because no matter what value λ_d takes, only this candidate could be selected according to:

W*(A, λ) = argmax_{W ∈ GEN(A)} Score(W, λ)
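A minimal sketch of grouping GEN(A) by f_d(W) and keeping one active candidate per group. The names `f_d` and `score_without_d` are hypothetical stand-ins for the feature count being tuned and the partial score Σ_{d'≠d} λ_{d'} f_{d'}(W).

```python
from collections import defaultdict

def active_candidates(candidates, f_d, score_without_d):
    """Map each value of f_d(W) to the single candidate that can win for it.

    candidates is GEN(A); f_d(W) is the integer n-gram count being tuned;
    score_without_d(W) is the part of Score(W, lam) that does not depend on
    lambda_d. Within a group sharing the same f_d value, only the candidate
    with the largest partial score can ever be selected by the argmax,
    whatever value lambda_d takes.
    """
    groups = defaultdict(list)
    for w in candidates:
        groups[f_d(w)].append(w)
    return {v: max(ws, key=score_without_d) for v, ws in groups.items()}
```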

Page 19: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(4/7)

◎ Grid Line Search

• By finding the active candidates, we can reduce GEN(A) to a much smaller list of active candidates. We can find a set of intervals for λ_d, within each of which a particular active candidate will be selected as W*.

• As a result, for each training sample, we obtain a sequence of intervals and their corresponding Er(.) values. The optimal value λ_d* can then be found by traversing the sequence and taking the midpoint of the interval with the lowest Er(.) value.

• By merging the sequences of intervals of all training samples in the training set, we obtain a global sequence of intervals as well as their corresponding sample risk. We can then find the optimal value λ_d* as well as the minimal sample risk by traversing the global interval sequence.
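A minimal sketch of the final merge-and-pick step, assuming each training sample has already been reduced to a list of (left, right, error) triples, i.e., the Er(.) value incurred when λ_d falls in [left, right). That input representation is an assumption made here for illustration.

```python
def grid_line_search(interval_lists):
    """Return the lambda_d minimizing the merged sample risk (a sketch).

    interval_lists[i] is the list of (left, right, error) triples for training
    sample i. The breakpoints of all samples define a global sequence of
    intervals on which SR is constant; we evaluate SR at each midpoint and
    return the best one.
    """
    points = sorted({p for intervals in interval_lists
                       for left, right, _ in intervals
                       for p in (left, right)})

    def risk_at(lam):
        total = 0
        for intervals in interval_lists:
            for left, right, err in intervals:
                if left <= lam < right:
                    total += err
                    break
        return total

    midpoints = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    return min(midpoints, key=risk_at)
```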

Page 20: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(5/7)

◎ Feature Subset Selection

• Reducing the number of features is essential for two reasons: to reduce computational complexity and to ensure the generalization property of the linear model.

• Effectiveness E(f_i) of a candidate feature f_i: the reduction in sample risk obtained by adding f_i with its optimal weight, normalized by the largest such reduction over all candidate features.

• The cross-correlation coefficient between two features f_i and f_j:

C(i, j) = Σ_{m=1..M} x_{i,m} x_{j,m} / sqrt( (Σ_{m=1..M} x_{i,m}²) (Σ_{m=1..M} x_{j,m}²) ),    C(i, j) ∈ [0, 1]

where x_{i,m} is the value of feature f_i on the m-th training sample.
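A minimal sketch of the cross-correlation computation, with x_i and x_j taken to be the value vectors of features f_i and f_j over the M training samples (an assumption about what the x's range over).

```python
import math

def cross_correlation(x_i, x_j):
    """C(i, j) = sum_m x_i[m]*x_j[m] / sqrt(sum_m x_i[m]^2 * sum_m x_j[m]^2).

    For non-negative feature counts the result lies in [0, 1]; highly
    correlated features (C close to 1) are avoided during feature selection.
    """
    num = sum(a * b for a, b in zip(x_i, x_j))
    den = math.sqrt(sum(a * a for a in x_i) * sum(b * b for b in x_j))
    return num / den if den else 0.0
```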

Page 21: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(6/7)

Page 22: An Empirical Study on Language Model Adaptation


LM Adaptation Methods ─ MSR(7/7)

• D: the number of all candidate features

• K: the number of features in the resulting model, K << D

• According to the feature selection method:

- step 1: E(.) must be estimated for each of the D candidate features

- step 4: O(K · D) estimates of C(.) are required

• Therefore, we only estimate the value of C(.) between each of the selected features and each of the top N remaining features with the highest value of E(.). This reduces the number of estimates of C(.) to O(K · N).

Page 23: An Empirical Study on Language Model Adaptation


Experimental Results (1/3)

I. Data

• The data used in our experiments stems from five distinct sources of text.

• Different sizes of adaptation training data were also used to show how the amount of adaptation training data affects the performance of the various adaptation methods.

Nikkei: newspaper
Yomiuri: newspaper
TuneUp: balanced corpus (newspaper and other sources)
Encarta: encyclopedia
Shincho: novels

Page 24: An Empirical Study on Language Model Adaptation


Experimental Results (2/3)

II. Computing Domain Characteristics

(i) The similarity between two domains: cross entropy

- not symmetric

- self entropy (the diversity of the corpus) increases in the following order: N → Y → E → T → S

Page 25: An Empirical Study on Language Model Adaptation


Experimental Results (3/3)

III. Results of LM Adaptation

• We trained our baseline trigram model on our background (Nikkei) corpus.

Page 26: An Empirical Study on Language Model Adaptation


Discussion (1/6)

I. Domain Similarity and CER

• The more similar the adaptation domain is to the background domain, the better the CER results.

Page 27: An Empirical Study on Language Model Adaptation


Discussion (2/6)

II. Domain Similarity and the Robustness of Adaptation Methods

• The discriminative methods outperform LI in most cases.

• The performance of LI is greatly influenced by domain similarity. Such a limitation is not observed with the discriminative methods.

Page 28: An Empirical Study on Language Model Adaptation


Discussion (3/6)

III. Adaptation Data Size and CER Reduction

• X-axis: self entropy

• Y-axis: the improvement in CER reduction

• a positive correlation between the diversity of the adaptation corpus and the benefit of having more training data available

• An intuitive explanation: The less diverse the adaptation data, the fewer distinct training examples will be included for discriminative training.

Page 29: An Empirical Study on Language Model Adaptation


Discussion (4/6)

IV. Domain Characteristics and Error Ratios

• error ratio (ER) metric, which measures the side effects of a new model:

ER = |E_A| / |E_B|

• |E_A|: the number of errors found only in the new (adaptation) model

• |E_B|: the number of errors corrected by the new model

• ER = 0 if the adapted model introduces no new errors

• ER < 1 if the adapted model makes CER improvements

• ER = 1 if the CER improvement is zero (i.e., the adapted model makes as many new mistakes as it corrects old mistakes)

• ER > 1 when the adapted model has worse CER performance than the baseline model
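A minimal sketch of computing ER from the two error sets, assuming errors are represented as sets of (sentence_id, position) pairs on a shared test set; that representation is an illustrative choice, not from the paper.

```python
def error_ratio(baseline_errors, adapted_errors):
    """ER = |E_A| / |E_B|.

    E_A: errors found only in the adapted model (new errors it introduces).
    E_B: baseline errors that the adapted model corrects.
    """
    e_a = adapted_errors - baseline_errors
    e_b = baseline_errors - adapted_errors
    return len(e_a) / len(e_b) if e_b else (0.0 if not e_a else float("inf"))
```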

Page 30: An Empirical Study on Language Model Adaptation


Discussion (5/6)

• RER: relative error rate reduction, i.e., the CER difference between the background and adapted models in %

• A discriminative method (in this case MSR) is superior to linear interpolation, not only in terms of CER reduction but also in having fewer side effects.

Page 31: An Empirical Study on Language Model Adaptation


Discussion (6/6)

• Although the boosting and perceptron algorithms have the same CER for Yomiuri and TuneUp from Table III, the perceptron is better in terms of ER. This may be due to the use of an exponential loss function in the boosting algorithm, which is less robust against noisy data.

• Corpus diversity: the less stylistically diverse a corpus is, the more consistent it is within its domain.

Page 32: An Empirical Study on Language Model Adaptation


Conclusion and Future Work

• Conclusion:

(1) cross-domain similarity (cross entropy) correlates with the CER of all models

(2) diversity (self entropy) correlates with the utility of more adaptation training data for discriminative training methods

• Future Work: an online learning scenario