Transcript of: A new online training method ... ADF...
Xu SUN, Houfeng WANG, Wenjie LI
The Hong Kong Polytechnic University · Peking University
ACL 2012, July 2012
Outline
Introduction
Method
  Joint modeling: word segmentation + new word detection
  New features
  A new online training method: feature-frequency adaptive online training; finish the training in 10 passes
Experiments
Conclusions
Introduction
3 proposals in this work:
1) Joint modeling: word segmentation + new word detection
2) New features: high-dimensional edge features on CRFs
3) Very fast online training (major proposal): finish the training in 10 passes
1) Joint modeling
Prior work on word segmentation & new word detection:
  Separate: the two tasks are independent
  Pipeline: first word segmentation, then new word detection
[Diagram: separate processing vs. a pipeline of word segmentation followed by new word detection]
1) Joint modeling
Our work: joint modeling of word segmentation + new word detection.
For simplicity, we use a single CRF model.
[Diagram: a single CRF performing word segmentation and new word detection jointly]
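To make the single-model idea concrete, here is a minimal sketch of one way segmentation output can feed new word detection (an illustration of the idea only, not necessarily the authors' exact scheme); the B/M/E/S tag set and the helper names are assumptions for this example.

```python
# Illustrative only: recover words from B/M/E/S segmentation tags and flag
# out-of-lexicon words as new-word candidates. The tag set and helper names
# are assumptions, not the authors' exact formulation.

def tags_to_words(chars, tags):
    """Group characters into words using B/M/E/S segmentation tags."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):           # end of a word
            words.append("".join(current))
            current = []
    if current:                          # flush a dangling partial word, if any
        words.append("".join(current))
    return words

def detect_new_words(chars, tags, train_lexicon):
    """Segmented words that never occurred in training are new-word candidates."""
    return [w for w in tags_to_words(chars, tags) if w not in train_lexicon]

# Tiny usage example with made-up data:
lexicon = {"我们", "喜欢"}
chars = list("我们喜欢微博")
tags = ["B", "E", "B", "E", "B", "E"]
print(detect_new_words(chars, tags, lexicon))   # -> ['微博']
```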
2) New features
Traditional features are character n-grams.
New features: word-based n-grams
  The word lexicon is collected from the training data
  Word unigram features
  Word bigram features
[Diagram: linear-chain CRF with tags y1 ... yn over characters x1 ... xn]
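A rough sketch of how word-based n-gram features could be generated from the lexicon collected from the training data; the feature templates and function names below are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative word-based n-gram features (assumed templates, not the paper's
# exact ones). `lexicon` is the word set collected from the training data.

MAX_WORD_LEN = 5   # assumed cap on lexicon-word length

def longest_word_ending_at(chars, i, lexicon):
    """Longest lexicon word whose last character is chars[i], or None."""
    for length in range(min(MAX_WORD_LEN, i + 1), 0, -1):
        cand = "".join(chars[i - length + 1:i + 1])
        if cand in lexicon:
            return cand
    return None

def longest_word_starting_at(chars, i, lexicon):
    """Longest lexicon word whose first character is chars[i], or None."""
    if i >= len(chars):
        return None
    for length in range(min(MAX_WORD_LEN, len(chars) - i), 0, -1):
        cand = "".join(chars[i:i + length])
        if cand in lexicon:
            return cand
    return None

def word_features(chars, i, lexicon):
    """Word unigram / bigram features for the tag at character position i."""
    feats = []
    prev_w = longest_word_ending_at(chars, i, lexicon)
    next_w = longest_word_starting_at(chars, i + 1, lexicon)
    if prev_w:
        feats.append("W_UNI_END=" + prev_w)                 # word unigram
    if next_w:
        feats.append("W_UNI_NEXT=" + next_w)                # word unigram
    if prev_w and next_w:
        feats.append("W_BI=" + prev_w + "|" + next_w)       # word bigram
    return feats
```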
2) New edge (transition) features
Traditional edge features avoid observations, to avoid a feature explosion.
Proposal: new edge features involving observations.
The feature explosion stays tractable: the number of features increases 9 times, to about 10 million features.
[Diagram: CRF where the edge features between adjacent tags also involve the observations x1 ... xn]
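A sketch of the difference between label-only edge features and observation-dependent edge features (the feature strings below are assumptions for illustration, not the paper's exact templates): a traditional edge feature fires on the tag pair alone, while the new edge features also conjoin local character n-grams, which multiplies the number of edge features by the number of observation templates used.

```python
# Illustrative edge-feature templates (assumed feature strings, not the
# paper's exact templates).

def traditional_edge_features(prev_tag, tag):
    # Label-only edge feature: one feature per (previous tag, current tag) pair.
    return ["T=%s|%s" % (prev_tag, tag)]

def observation_edge_features(chars, i, prev_tag, tag):
    # Edge features that also involve the observation: the tag pair is
    # conjoined with character unigrams and bigrams around position i.
    pair = "%s|%s" % (prev_tag, tag)
    feats = ["T=" + pair]                                    # label-only feature kept
    feats.append("T=%s|C0=%s" % (pair, chars[i]))            # current character
    if i > 0:
        feats.append("T=%s|C-1=%s" % (pair, chars[i - 1]))   # previous character
        feats.append("T=%s|B=%s%s" % (pair, chars[i - 1], chars[i]))  # character bigram
    return feats
```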
3) Very fast online learning ***
Large-scale word segmentation training is already slow.
With the high-dimensional new features (10 million), training becomes even slower than before.
We need a fast training method, even with 10 million features.
3) Very fast online learning ***
Batch training: limited-memory BFGS (LBFGS) is too slow; it needs 300 passes for training.
Existing online training methods: stochastic gradient descent (SGD) is moderately fast; it needs 50 passes for training.
Main proposal: a very fast online training method that finishes the training in 10 passes.
3) Very fast online learning ***
A potential problem of existing online training: an over-simplified setting.
Learning rate: only 1 scalar is used for all weights.
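For contrast, here is a plain SGD update where a single scalar learning rate is shared by every weight, as in the existing setting the slide describes (a minimal sketch; the decay schedule and the sparse-gradient representation are assumptions):

```python
# Plain SGD: one scalar learning rate shared by all weights (illustrative sketch).
def sgd_update(weights, gradient, step, eta0=0.1, decay=1e-4):
    """Ascent step on a log-likelihood-style objective with a sparse gradient dict."""
    eta = eta0 / (1.0 + decay * step)      # the same rate for every feature
    for k, g in gradient.items():          # gradient: feature index -> value
        weights[k] = weights.get(k, 0.0) + eta * g
    return weights
```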
3) Very fast online learning ***
Is it better to use different learning rates for different features?
Yes. We use a vector of learning rates.
Learning of learning rates: how do we learn the values of the learning rates?
  The learning rates are learned from feature frequency information.
  A higher-frequency feature gets a lower learning rate.
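A minimal sketch of the feature-frequency-adaptive idea: every weight keeps its own learning rate, and a feature's rate decays with how often that feature has fired, so frequent features take small steps while rare features keep large ones. The concrete schedule eta0 / (1 + decay * freq) is an assumption for illustration, not the exact ADF update from the paper.

```python
# Feature-frequency-adaptive learning rates (illustrative schedule, not the
# paper's exact ADF formula): higher-frequency features get lower rates.

class FrequencyAdaptiveSGD:
    def __init__(self, eta0=0.1, decay=0.05):
        self.eta0 = eta0          # initial learning rate
        self.decay = decay        # controls how fast rates shrink with frequency
        self.freq = {}            # feature index -> number of updates so far
        self.weights = {}         # feature index -> weight

    def rate(self, k):
        # Per-feature learning rate: shrinks as the feature's frequency grows.
        return self.eta0 / (1.0 + self.decay * self.freq.get(k, 0))

    def update(self, gradient):
        # `gradient` is a sparse dict (feature index -> gradient of a
        # log-likelihood-style objective), so we take an ascent step.
        for k, g in gradient.items():
            self.weights[k] = self.weights.get(k, 0.0) + self.rate(k) * g
            self.freq[k] = self.freq.get(k, 0) + 1
```

Compared with the scalar-rate update sketched earlier, a frequent feature no longer drags every other rate down with it: rare but informative features keep taking large steps, which is one intuition for why far fewer passes can suffice.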
Convergence Analysis
Good convergence properties of the proposed method.
The ADF training is convergent.
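As a point of reference, per-feature schedules of the kind sketched above satisfy the classical Robbins-Monro conditions, which are the usual route to convergence guarantees for stochastic-approximation methods (a general textbook condition, not the specific theorem proved in the paper):

```latex
% Classical sufficient conditions on a learning-rate sequence (general
% stochastic-approximation result, not the paper's specific ADF theorem):
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^2 < \infty .
% A schedule of the form \eta_t = \eta_0 / (1 + c\,t) satisfies both.
```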
Experiments
Data: SIGHAN Bakeoff 2004
  Microsoft Research data (MSR)
  Peking University data (PKU)
  City University of Hong Kong data (CU)
Experiments
Results
Baseline: CRF with SGD training
Experiments
Results
The new features (word features + high-dimensional edge features) help.
Experiments
Results
Joint modeling (word segmentation + new word detection) helps slightly on word segmentation, but largely on new word detection recall.
Experiments
Results
ADF training greatly reduces training time, yet with an even higher F-score.
We achieved the best reported accuracy on the MSR & PKU datasets.
Experiments
New training: ADF vs. SGD vs. LBFGS
ADF training reaches the optimum in 10 passes.
ADF training is faster in both the number of passes and wall-clock time.
Conclusions
3 proposals:
  Joint modeling: segmentation + new word detection
  New features: word features + high-dimensional edge features
  A very fast online training method (main proposal)
Experiments:
  Joint modeling helps
  The new features help a lot on model accuracy
  The new training method finishes the training in 10 passes
  The final results beat the existing best reports on the datasets
More accurate & faster!
Thanks! Any questions?