Transcript of: A new online training method ... ADF...
Xu SUN, Houfeng WANG, Wenjie LI
The Hong Kong Polytechnic University · Peking University
ACL 2012, July 2012
Outline
Introduction
Method
  Joint modeling: word segmentation + new word detection
  New features
  A new online training method: feature-frequency adaptive online training; finish the training in 10 passes
Experiments
Conclusions
Introduction
3 proposals in this work:
1) Joint modeling: word segmentation + new word detection
2) New features: high-dimensional edge features on CRFs
3) Very fast online training (major proposal): finish the training in 10 passes
1) Joint modeling
Prior work on word segmentation & new word detection:
  Separate: the two tasks are independent
  Pipeline: first word segmentation, then new word detection
[Diagram: separate processing vs. a pipeline of word segmentation followed by new word detection]
1) Joint modeling
Our work: joint modeling of word segmentation + new word detection.
For simplicity, we use a single CRF model.
[Diagram: a single CRF performing word segmentation and new word detection jointly]
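To make the single-model idea concrete, here is a minimal sketch of one way segmentation output can feed new word detection (an illustration of the idea only, not necessarily the authors' exact scheme); the B/M/E/S tag set and the helper names are assumptions for this example.

```python
# Illustrative only: recover words from B/M/E/S segmentation tags and flag
# out-of-lexicon words as new-word candidates. The tag set and helper names
# are assumptions, not the authors' exact formulation.

def tags_to_words(chars, tags):
    """Group characters into words using B/M/E/S segmentation tags."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):           # end of a word
            words.append("".join(current))
            current = []
    if current:                          # flush a dangling partial word, if any
        words.append("".join(current))
    return words

def detect_new_words(chars, tags, train_lexicon):
    """Segmented words that never occurred in training are new-word candidates."""
    return [w for w in tags_to_words(chars, tags) if w not in train_lexicon]

# Tiny usage example with made-up data:
lexicon = {"我们", "喜欢"}
chars = list("我们喜欢微博")
tags = ["B", "E", "B", "E", "B", "E"]
print(detect_new_words(chars, tags, lexicon))   # -> ['微博']
```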
2) New features
Traditional features are character n-grams.
New features: word-based n-grams
  The word lexicon is collected from the training data
  Word unigram features
  Word bigram features
[Diagram: linear-chain CRF with tags y1 ... yn over characters x1 ... xn]
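A rough sketch of how word-based n-gram features could be generated from the lexicon collected from the training data; the feature templates and function names below are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative word-based n-gram features (assumed templates, not the paper's
# exact ones). `lexicon` is the word set collected from the training data.

MAX_WORD_LEN = 5   # assumed cap on lexicon-word length

def longest_word_ending_at(chars, i, lexicon):
    """Longest lexicon word whose last character is chars[i], or None."""
    for length in range(min(MAX_WORD_LEN, i + 1), 0, -1):
        cand = "".join(chars[i - length + 1:i + 1])
        if cand in lexicon:
            return cand
    return None

def longest_word_starting_at(chars, i, lexicon):
    """Longest lexicon word whose first character is chars[i], or None."""
    if i >= len(chars):
        return None
    for length in range(min(MAX_WORD_LEN, len(chars) - i), 0, -1):
        cand = "".join(chars[i:i + length])
        if cand in lexicon:
            return cand
    return None

def word_features(chars, i, lexicon):
    """Word unigram / bigram features for the tag at character position i."""
    feats = []
    prev_w = longest_word_ending_at(chars, i, lexicon)
    next_w = longest_word_starting_at(chars, i + 1, lexicon)
    if prev_w:
        feats.append("W_UNI_END=" + prev_w)                 # word unigram
    if next_w:
        feats.append("W_UNI_NEXT=" + next_w)                # word unigram
    if prev_w and next_w:
        feats.append("W_BI=" + prev_w + "|" + next_w)       # word bigram
    return feats
```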
2) New edge (transition) features
Traditional edge features avoid observations, to avoid a feature explosion.
Proposal: new edge features involving observations.
The feature explosion stays tractable: the number of features increases 9 times, to about 10 million features.
[Diagram: CRF where the edge features between adjacent tags also involve the observations x1 ... xn]
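A sketch of the difference between label-only edge features and observation-dependent edge features (the feature strings below are assumptions for illustration, not the paper's exact templates): a traditional edge feature fires on the tag pair alone, while the new edge features also conjoin local character n-grams, which multiplies the number of edge features by the number of observation templates used.

```python
# Illustrative edge-feature templates (assumed feature strings, not the
# paper's exact templates).

def traditional_edge_features(prev_tag, tag):
    # Label-only edge feature: one feature per (previous tag, current tag) pair.
    return ["T=%s|%s" % (prev_tag, tag)]

def observation_edge_features(chars, i, prev_tag, tag):
    # Edge features that also involve the observation: the tag pair is
    # conjoined with character unigrams and bigrams around position i.
    pair = "%s|%s" % (prev_tag, tag)
    feats = ["T=" + pair]                                    # label-only feature kept
    feats.append("T=%s|C0=%s" % (pair, chars[i]))            # current character
    if i > 0:
        feats.append("T=%s|C-1=%s" % (pair, chars[i - 1]))   # previous character
        feats.append("T=%s|B=%s%s" % (pair, chars[i - 1], chars[i]))  # character bigram
    return feats
```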
3) Very fast online learning ***
Large-scale word segmentation training is already slow.
With the high-dimensional new features (10 million), training becomes even slower than before.
We need a fast training method, even with 10 million features.
3) Very fast online learning ***
Batch training: limited-memory BFGS (LBFGS) is too slow; it needs 300 passes for training.
Existing online training methods: stochastic gradient descent (SGD) is moderately fast; it needs 50 passes for training.
Main proposal: a very fast online training method that finishes the training in 10 passes.
3) Very fast online learning ***
A potential problem of existing online training: an over-simplified setting.
Learning rate: only 1 scalar is used for all weights.
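For contrast, here is a plain SGD update where a single scalar learning rate is shared by every weight, as in the existing setting the slide describes (a minimal sketch; the decay schedule and the sparse-gradient representation are assumptions):

```python
# Plain SGD: one scalar learning rate shared by all weights (illustrative sketch).
def sgd_update(weights, gradient, step, eta0=0.1, decay=1e-4):
    """Ascent step on a log-likelihood-style objective with a sparse gradient dict."""
    eta = eta0 / (1.0 + decay * step)      # the same rate for every feature
    for k, g in gradient.items():          # gradient: feature index -> value
        weights[k] = weights.get(k, 0.0) + eta * g
    return weights
```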
3) Very fast online learning ***
Is it better to use different learning rates for different features?
Yes. We use a vector of learning rates.
Learning of learning rates: how do we learn the values of the learning rates?
  The learning rates are learned from feature frequency information.
  A higher-frequency feature gets a lower learning rate.
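A minimal sketch of the feature-frequency-adaptive idea: every weight keeps its own learning rate, and a feature's rate decays with how often that feature has fired, so frequent features take small steps while rare features keep large ones. The concrete schedule eta0 / (1 + decay * freq) is an assumption for illustration, not the exact ADF update from the paper.

```python
# Feature-frequency-adaptive learning rates (illustrative schedule, not the
# paper's exact ADF formula): higher-frequency features get lower rates.

class FrequencyAdaptiveSGD:
    def __init__(self, eta0=0.1, decay=0.05):
        self.eta0 = eta0          # initial learning rate
        self.decay = decay        # controls how fast rates shrink with frequency
        self.freq = {}            # feature index -> number of updates so far
        self.weights = {}         # feature index -> weight

    def rate(self, k):
        # Per-feature learning rate: shrinks as the feature's frequency grows.
        return self.eta0 / (1.0 + self.decay * self.freq.get(k, 0))

    def update(self, gradient):
        # `gradient` is a sparse dict (feature index -> gradient of a
        # log-likelihood-style objective), so we take an ascent step.
        for k, g in gradient.items():
            self.weights[k] = self.weights.get(k, 0.0) + self.rate(k) * g
            self.freq[k] = self.freq.get(k, 0) + 1
```

Compared with the scalar-rate update sketched earlier, a frequent feature no longer drags every other rate down with it: rare but informative features keep taking large steps, which is one intuition for why far fewer passes can suffice.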
Convergence Analysis
Good convergence properties of the proposed method.
The ADF training is convergent.
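As a point of reference, per-feature schedules of the kind sketched above satisfy the classical Robbins-Monro conditions, which are the usual route to convergence guarantees for stochastic-approximation methods (a general textbook condition, not the specific theorem proved in the paper):

```latex
% Classical sufficient conditions on a learning-rate sequence (general
% stochastic-approximation result, not the paper's specific ADF theorem):
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^2 < \infty .
% A schedule of the form \eta_t = \eta_0 / (1 + c\,t) satisfies both.
```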
Experiments
Data: SIGHAN Bakeoff 2004
  Microsoft Research data (MSR)
  Peking University data (PKU)
  City University of Hong Kong data (CU)
Experiments
Results
Baseline: CRF with SGD training
Experiments
Results
The new features (word features + high-dimensional edge features) help.
Experiments
Results
Joint modeling (word segmentation + new word detection) helps slightly on word segmentation, but largely on new word detection recall.
Experiments
Results
ADF training greatly reduces training time, yet with an even higher F-score.
We achieved the best reported accuracy on the MSR & PKU datasets.
Experiments
New training: ADF vs. SGD vs. LBFGS
ADF training reaches the optimum in 10 passes.
ADF training is faster in both the number of passes and wall-clock time.
Conclusions
3 proposals:
  Joint modeling: segmentation + new word detection
  New features: word features + high-dimensional edge features
  A very fast online training method (main proposal)
Experiments:
  Joint modeling helps
  The new features help a lot on model accuracy
  The new training method finishes the training in 10 passes
  The final results beat the existing best reports on the datasets
More accurate & faster!
Thanks! Any questions?