The Hong Kong Polytechnic University Peking University · A new online training method ... ADF...

Xu SUN, Houfeng WANG, Wenjie LI
The Hong Kong Polytechnic University, Peking University
July 2012, ACL 2012

Outline

- Introduction
- Method
  - Joint modeling: word segmentation + new word detection
  - New features
  - A new online training method
    - Feature-frequency adaptive online training
    - Finish the training in 10 passes
- Experiments
- Conclusions

Introduction

3 proposals in this work

1) Joint modeling: word segmentation + new word detection
2) New features: high dimensional edge features on CRFs
3) Very fast online training (major proposal): finish the training in 10 passes

1) Joint modeling

Prior work on word seg & new word detection
- Separate: the two tasks are independent
- Pipeline: first word seg, then new word detection

[Figure: diagrams of the two settings, with boxes labeled "Word seg" and "New word detection"]

1) Joint modeling

Our work
- Joint modeling: word seg + new word detection
- For simplicity, we use a single CRF model

[Figure: diagram with boxes labeled "Word seg" and "New word detection"]

2) New features

- Traditional features are character n-grams
- New features: word-based n-grams
  - Word lexicon is collected from training data
  - Word unigram features
  - Word bigram features

[Figure: chain of labels y1, y2, y3, y4, ..., yn over observations x1, x2, x3, x4, ..., xn]
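The word-based n-gram idea can be sketched roughly as follows. This is only an illustration under assumptions: the toy training sentences, the helper names, the maximum word length, and the feature-string format are all hypothetical and are not taken from the slides, which only state that a lexicon is collected from the training data and that word unigram and bigram features are used.

```python
def build_lexicon(segmented_sentences):
    """Collect every word seen in the (already segmented) training data."""
    return {word for sentence in segmented_sentences for word in sentence}


def words_ending_at(chars, i, lexicon, max_len=4):
    """Lexicon words whose last character is chars[i]."""
    found = []
    for n in range(1, max_len + 1):
        if i - n + 1 >= 0:
            w = "".join(chars[i - n + 1 : i + 1])
            if w in lexicon:
                found.append(w)
    return found


def words_starting_at(chars, i, lexicon, max_len=4):
    """Lexicon words whose first character is chars[i]."""
    found = []
    for n in range(1, max_len + 1):
        if i + n <= len(chars):
            w = "".join(chars[i : i + n])
            if w in lexicon:
                found.append(w)
    return found


def word_unigram_features(chars, i, lexicon):
    """Hypothetical unigram templates: lexicon words ending or starting at i."""
    feats = [f"W_end={w}" for w in words_ending_at(chars, i, lexicon)]
    feats += [f"W_start={w}" for w in words_starting_at(chars, i, lexicon)]
    return feats


def word_bigram_features(chars, i, lexicon):
    """Hypothetical bigram template: a lexicon word ending at i followed
    directly by a lexicon word starting at i + 1."""
    return [
        f"W_bi={a}/{b}"
        for a in words_ending_at(chars, i, lexicon)
        for b in words_starting_at(chars, i + 1, lexicon)
    ]


# Toy data (assumed): two segmented training sentences.
train = [["北京", "大学"], ["大学", "生活"]]
lexicon = build_lexicon(train)
chars = list("北京大学")
unigram_feats = word_unigram_features(chars, 2, lexicon)
bigram_feats = word_bigram_features(chars, 1, lexicon)
```

Because the lexicon is built only from training data, words unseen in training never trigger these features, which is why character n-gram features are still needed alongside them.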

2) New edge (transition) features

- Traditional edge features avoid observations, to avoid feature explosion
- Proposal: new edge features involving observations
- Tractable feature explosion
  - Number of features increased 9 times
  - 10 million features

[Figure: chain of labels y1, y2, y3, y4, ..., yn over observations x1, x2, x3, x4, ..., xn]
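To make the difference between the two kinds of edge feature concrete, here is a small sketch. The B/I/E/S tag set, the observation strings, and the function names are assumptions made for illustration; the slides do not specify them, and the toy growth factor below is unrelated to the 9 times increase reported above.

```python
from itertools import product

LABELS = ["B", "I", "E", "S"]  # assumed tag set, not stated on the slides


def traditional_edge_features(labels=LABELS):
    """Edge features that look only at the label pair (y_{i-1}, y_i)."""
    return {f"E:{a}->{b}" for a, b in product(labels, repeat=2)}


def observation_edge_features(observations, labels=LABELS):
    """Edge features that also condition on an observation at position i."""
    return {
        f"E:{a}->{b}|{obs}"
        for a, b in product(labels, repeat=2)
        for obs in observations
    }


# Hypothetical observation templates: current character and a character bigram.
observations = ["c0=北", "c0=京", "c-1c0=北京"]
n_traditional = len(traditional_edge_features())
n_with_obs = len(observation_edge_features(observations))
```

Each distinct observation multiplies the number of label-pair features, which is why traditional CRF edge features leave observations out and why the enlarged feature set calls for a faster training method.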

3) Very fast online learning ***

- Large-scale word segmentation is already slow
- With high dimensional new features (10 million), training is even slower than before
- Need a fast training method, even with 10 million features

3) Very fast online learning ***

- Batch training
  - Limited-memory BFGS (LBFGS)
  - Too slow, need 300 passes for training
- Existing online training methods
  - Stochastic gradient descent (SGD)
  - Moderately fast, need 50 passes for training
- Main proposal: a very fast online training method
  - Finish the training in 10 passes

3) Very fast online learning ***

Potential problem of existing online training
- Over-simplified setting
- Learning rate: use only 1 scalar for all weights
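For reference, the single-scalar setting corresponds to the usual SGD step. The notation is assumed here for illustration and does not come from the slides: w_t is the weight vector, g_t is the stochastic gradient of the objective being maximized (estimated on one sample), and gamma_t is the learning rate.

```latex
\mathbf{w}_{t+1} = \mathbf{w}_t + \gamma_t \, \mathbf{g}_t ,
\qquad \gamma_t \in \mathbb{R}
```

Every weight is scaled by the same gamma_t, regardless of how often its feature occurs in the data.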

3) Very fast online learning ***

- Better to use different learning rates for different features?
- YES. We use a vector of learning rates
- Learning of learning rates: how to learn the values of the learning rates?
  - Learning rates are learned from feature frequency information
  - Higher frequency feature, lower learning rate
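One possible way to realize "higher frequency feature, lower learning rate" is sketched below. It is an illustration under assumptions, not necessarily the authors' exact algorithm: the window size, the decay bounds `alpha` and `beta`, the initial rate, the toy samples, and the `grad_fn` callback are all hypothetical choices.

```python
def adf_update_rates(gamma, counts, window, alpha=0.995, beta=0.6):
    """Decay each learning rate according to how often its feature fired.

    counts[i] is the number of samples in the last window in which feature i
    was active. A feature active in every sample is decayed by beta, a feature
    that never fired by alpha (beta < alpha < 1), so frequent features end up
    with lower learning rates. alpha and beta are illustrative defaults.
    """
    new_gamma = []
    for rate, count in zip(gamma, counts):
        u = count / window                   # relative frequency in [0, 1]
        eta = alpha - u * (alpha - beta)     # interpolate between alpha and beta
        new_gamma.append(rate * eta)
    return new_gamma


def train(samples, grad_fn, n_features, window=2, init_rate=0.1, passes=1):
    """Online training with one learning rate per feature."""
    w = [0.0] * n_features
    gamma = [init_rate] * n_features
    counts = [0] * n_features
    seen = 0
    for _ in range(passes):
        for sample in samples:
            grad = grad_fn(w, sample)        # sparse dict: feature index -> gradient
            for i, g in grad.items():
                w[i] += gamma[i] * g         # per-feature step size
                counts[i] += 1
            seen += 1
            if seen % window == 0:           # refresh the rates once per window
                gamma = adf_update_rates(gamma, counts, window)
                counts = [0] * n_features
    return w, gamma


# Toy setup (assumed): a sample is the list of its active feature indices and
# the gradient is 1.0 for every active feature.
def toy_grad(w, sample):
    return {i: 1.0 for i in sample}


samples = [[0, 1], [0], [0, 2], [0]]
w, gamma = train(samples, toy_grad, n_features=3)
```

In this toy run feature 0 fires in every sample, so its rate shrinks fastest, while the rarely seen features 1 and 2 keep larger rates and can still move quickly when they do appear.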

Convergence Analysis

- Good convergence properties of the proposed method
- The ADF training is convergent

Experiments

Data: Sighan bakeoff 2004
- Microsoft Research data (MSR)
- Peking University data (PKU)
- City University of Hongkong data (CU)

Experiments

Results
- Baseline: CRF with SGD training
- New features (word features + high dimensional edge features) help
- Joint modeling (word seg + new word detection) helps slightly on word segmentation, but largely on new word detection (NWD) recall
- ADF training greatly reduced training time, yet with even higher F-score!
- We achieved the best reported accuracy on the MSR & PKU datasets
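The slides report F-scores without defining them. As a reminder, the sketch below computes the word-level F-score commonly used in segmentation evaluation, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The function names and the toy example are assumptions for illustration, not taken from the slides.

```python
def to_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def f_score(gold_words, pred_words):
    """Balanced F-score over word spans: 2PR / (P + R)."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (assumed): only the first word is segmented correctly.
gold = ["北京", "大学", "生"]
pred = ["北京", "大", "学生"]
score = f_score(gold, pred)
```

Recall on new words can be measured the same way by restricting the gold spans to words that do not occur in the training data.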

Experiments

New training: ADF vs. SGD vs. LBFGS
- ADF training achieved optimum in 10 passes
- ADF training is faster both in number of passes and in wall-clock time

Conclusions

3 proposals
- Joint modeling: segmentation + new word detection
- New features: word features + high dimensional edge features
- A very fast online training method (main proposal)

Experiments
- Joint modeling helps
- New features help a lot on model accuracy
- The new training method finishes the training in 10 passes
- Final results beat the existing best reports on the datasets

More accurate & faster!

Thanks! Any questions?
