
CS 562: Empirical Methods in Natural Language Processing

Nov 2010

Liang Huang (lhuang@isi.edu)

Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF


A Quiz on Forest


Generative models

$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$

• more work: we must learn to generate $w$

• but can handle incomplete data: $P(w) = \sum_t P(w, t)$


The generative story

SOURCE-CHANNEL (HMM):

$$P(w, t) = P(t) \times P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

Direct conditional model:

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

DEFICIENT combination (you generated $t_i$ twice!):

$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n)$$


Perceptron is ...

• an extremely simple algorithm

• almost universally applicable

• and works very well in practice


vanilla perceptron (Rosenblatt, 1958) → structured perceptron (Collins, 2002)

the man bit the dog
DT  NN  VBD  DT  NN


Generic Perceptron

• online learning: one example at a time

• learning by doing

• find the best output under the current weights

• update weights at mistakes

[diagram: x_i → inference → z_i; compare z_i with gold y_i; update weights w, looped]
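In code, the loop is as small as the slide suggests. A minimal sketch in Python, assuming an inference routine `decode(x, w)` and a sparse feature extractor `phi(x, y)` (both names are illustrative, not from the slides):

    from collections import defaultdict

    def perceptron_train(data, decode, phi, epochs=5):
        """Generic (structured) perceptron: learning by doing.

        data:   list of (x, y) pairs (input, gold output)
        decode: inference routine, z = argmax over GEN(x) under weights w
        phi:    feature extractor, sparse dict {feature: count}
        """
        w = defaultdict(float)
        for _ in range(epochs):
            for x, y in data:
                z = decode(x, w)              # best output under current weights
                if z != y:                    # update weights at mistakes
                    for f, v in phi(x, y).items():
                        w[f] += v             # reward features of the gold output
                    for f, v in phi(x, z).items():
                        w[f] -= v             # penalize features of the wrong guess
        return w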


Example: POS Tagging

• gold-standard y: DT NN VBD DT NN
  (the man bit the dog)

• current output z: DT NN NN DT NN
  (the man bit the dog)

• assume only two feature classes:

• tag bigrams (t_{i-1}, t_i)

• word/tag pairs (t_i → w_i)

• update: w ← w + Φ(x, y) − Φ(x, z)

• weights ++: (NN, VBD), (VBD, DT), (VBD → bit)

• weights --: (NN, NN), (NN, DT), (NN → bit)
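To make the arithmetic concrete, here is the same update computed with Python Counters; `phi` is a toy extractor for exactly these two feature classes (in a real tagger the two classes would be namespaced to avoid collisions). Shared features cancel, leaving exactly the ++/-- lists above:

    from collections import Counter

    def phi(words, tags):
        """Tag-bigram features (t_{i-1}, t_i) and tag->word features (t_i, w_i)."""
        feats = Counter(zip(tags, tags[1:]))   # tag bigrams
        feats.update(zip(tags, words))         # word/tag pairs
        return feats

    x = "the man bit the dog".split()
    y = ["DT", "NN", "VBD", "DT", "NN"]   # gold-standard
    z = ["DT", "NN", "NN", "DT", "NN"]    # current output

    print(phi(x, y) - phi(x, z))   # ++: (NN,VBD) (VBD,DT) (VBD,bit)
    print(phi(x, z) - phi(x, y))   # --: (NN,NN) (NN,DT) (NN,bit)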


Structured Perceptron

[diagram: the same inference/update loop, now over structured inputs x_i and structured outputs y_i, z_i, with weights w]


Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or it factors into local decisions

• features must be local to y (but can be global to x)

• e.g., a bigram tagger factors over tag pairs, yet may look at all input words (cf. CRFs)

[diagram: inference as z = argmax over y ∈ GEN(x); mistakes update the weights w]
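With factored (bigram) features, the argmax is exact Viterbi. A minimal sketch, reusing the weight keys from the POS example earlier (the `viterbi` interface is illustrative):

    def viterbi(words, tags, w):
        """Exact argmax for a bigram tagger: the score factors over positions,
        so dynamic programming searches GEN(x) in O(n * |tags|^2)."""
        # best[t] = (score, sequence) of the best tagging ending in tag t
        best = {t: (w.get((t, words[0]), 0.0), [t]) for t in tags}
        for word in words[1:]:
            step = {}
            for t in tags:
                score, seq = max(
                    (s + w.get((tp, t), 0.0), sq) for tp, (s, sq) in best.items()
                )
                step[t] = (score + w.get((t, word), 0.0), seq + [t])
            best = step
        return max(best.values())[1]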


Averaged Perceptron

• more stable and accurate results

• approximation of the voted perceptron (Freund & Schapire, 1999)

• output the average of all intermediate weight vectors:

$$\bar{w} = \frac{1}{j+1} \sum_{j'=0}^{j} W_{j'}$$


Averaging Tricks

• efficient averaging without an extra pass over the weights per example: Daume (2006, PhD thesis); a sketch follows below
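The figure is lost in this transcript, but the flavor of the trick can be sketched (this is the idea, not Daume's exact pseudocode): alongside w, keep an accumulator u that records each update scaled by the example counter c; the average of all intermediate weight vectors is then recovered in one shot at the end.

    from collections import defaultdict

    def lazy_update(w, u, delta, c):
        """Apply perceptron update `delta` at example counter `c`."""
        for f, v in delta.items():
            w[f] += v
            u[f] += c * v        # remember *when* this update happened

    def averaged(w, u, c):
        """Average of the intermediate weight vectors: w - u / c
        (up to indexing conventions, which vary)."""
        return {f: w[f] - u[f] / c for f in w}

    # usage: w, u = defaultdict(float), defaultdict(float)
    # call lazy_update on each mistake; call averaged once at the end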


Do we need smoothing?

• smoothing is much easier in discriminative models

• just make sure that for each feature template, its subset templates are also included

• e.g., to include (t0, w0, w-1) you must also include (t0, w0), (t0, w-1), (w0, w-1) (see the sketch below)

• and maybe also (t0, t-1), because t is less sparse than w
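As a sketch, the template set with its subsets spelled out (template names are illustrative):

    def features(words, tags, i):
        """Feature template (t0, w0, w-1) plus its backoff subsets."""
        t0, w0 = tags[i], words[i]
        w1 = words[i - 1] if i > 0 else "<s>"
        t1 = tags[i - 1] if i > 0 else "<s>"
        return [
            ("t0_w0_w-1", t0, w0, w1),   # the full template
            ("t0_w0", t0, w0),           # its subset templates ...
            ("t0_w-1", t0, w1),
            ("w0_w-1", w0, w1),
            ("t0_t-1", t0, t1),          # less sparse: tags instead of words
        ]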


Comparison with Other Models


From HMM to MEMM

• HMM: joint distribution $P(w, t)$

• MEMM: locally normalized (per-state conditional), $P(t \mid w) = \prod_i P(t_i \mid t_{i-1}, w_i)$


Label Bias Problem

• bias towards states with fewer outgoing transitions

• extreme case: a state with a single outgoing transition passes all its probability mass along, regardless of the observation

• a problem with all locally normalized models


Conditional Random Fields

• globally normalized (no label bias problem)

• but training requires expected feature counts

• (related to the fractional counts in EM)

• need to use the Inside-Outside algorithm (sum)

• Perceptron just needs Viterbi (max)


Experiments


Experiments: Tagging

• (almost) identical features from (Ratnaparkhi, 1996)

• trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}

• current word w_i and its spelling features

• surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...


Experiments: NP Chunking

• B-I-O scheme

• features:

• unigram model

• surrounding words and POS tags

Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
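A minimal sketch of reading NP chunks back off the B-I-O tags (the inverse direction works the same way):

    def bio_to_chunks(words, tags):
        """Collect the NP spans encoded by B-I-O tags."""
        chunks, start = [], None
        for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last chunk
            if tag != "I" and start is not None:      # a chunk just ended
                chunks.append(" ".join(words[start:i]))
                start = None
            if tag == "B":
                start = i
        return chunks

    words = "Rockwell International Corp. 's Tulsa unit said it signed".split()
    tags = ["B", "I", "I", "B", "I", "I", "O", "B", "O"]
    print(bio_to_chunks(words, tags))
    # ['Rockwell International Corp.', "'s Tulsa unit", 'it']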


Experiments: NP Chunking (results)

• (Sha and Pereira, 2003) trigram tagger

• voted perceptron: 94.09% vs. CRF: 94.38%


Other NLP Applications

• dependency parsing (McDonald et al., 2005)

• parse reranking (Collins)

• phrase-based translation (Liang et al., 2006)

• word segmentation

• ... and many many more ...


Theory


Vanilla Perceptron

[three figure-only slides]

Convergence Theorem

• data is separable if and only if perceptron converges

• the number of updates is bounded by $(R/\gamma)^2$

• $\gamma$ is the margin; $R = \max_i \|x_i\|$

• this result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$

• also in the Collins paper: theorems for non-separable cases and generalization bounds
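The argument behind the $(R/\gamma)^2$ bound is short; here is the standard sketch for the vanilla binary case (not spelled out on the slide), where $u$ is a unit vector separating the data with margin $\gamma$ and each mistake on $(x, y)$, $y \in \{-1, +1\}$, triggers the update $w \leftarrow w + yx$:

$$w_k \cdot u \ge k\gamma \qquad \text{(each update gains at least } \gamma \text{ along } u\text{)}$$

$$\|w_k\|^2 \le kR^2 \qquad \text{(a mistake means } y(w \cdot x) \le 0\text{, so the squared norm grows by at most } \|x\|^2 \le R^2\text{)}$$

$$k\gamma \;\le\; w_k \cdot u \;\le\; \|w_k\| \;\le\; \sqrt{k}\,R \quad\Longrightarrow\quad k \le (R/\gamma)^2$$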


Perceptron Conclusion

• a very simple framework that works with many structured problems, and works very well in practice

• all you need is (fast) 1-best inference

• much simpler than CRFs and SVMs

• can be applied to parsing, translation, etc.

• generalization bounds depend on separability

• not on the (exponential) size of the search space

• limitation 1: features must be local on y (though they can be non-local on x)

• limitation 2: no probabilistic interpretation (vs. CRF)


From HMM to CRF

(Sutton and McCallum, 2007) -- recommended reading


Maximum likelihood

$$L = \log \prod_{j=1}^{N} P(t^{j} \mid w^{j}) = \log \prod_{j=1}^{N} \frac{1}{Z(w^{j})} \exp \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j})$$

$$= \log \prod_{j=1}^{N} \frac{\exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j})\right)}{\sum_{t} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(w^{j}, t)\right)}$$

$$= \sum_{j=1}^{N} \left[ \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t^{j}) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w^{j}, t) \right]$$


Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$

Gradient ascent:

$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$

Stochastic gradient ascent: for each $(w^{j}, t^{j})$,

$$\lambda_i \leftarrow \lambda_i + \alpha \left[ f_i(w^{j}, t^{j}) - E[f_i(w^{j}, t)] \right]$$
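A sketch of the stochastic update in code, assuming a routine `expected_counts(x, lambdas)` that computes $E[f_i]$ by forward-backward (see the next slide) and `feats(x, t)` that counts features on the gold analysis (both names are illustrative):

    def sgd_epoch(data, lambdas, expected_counts, feats, alpha=0.1):
        """One pass of stochastic gradient ascent on the CRF log-likelihood."""
        for x, t in data:
            observed = feats(x, t)                    # f_i(w^j, t^j)
            expected = expected_counts(x, lambdas)    # E[f_i(w^j, t)]
            for f in set(observed) | set(expected):
                lambdas[f] = lambdas.get(f, 0.0) + alpha * (
                    observed.get(f, 0.0) - expected.get(f, 0.0)
                )
        return lambdas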


Expected feature counts

$$E[f_{V,\text{loves}}] = \sum \frac{1}{Z} \, \alpha[F] \, \mu(V, \text{loves}) \, \beta[G], \qquad Z = \alpha[\text{END}]$$

[diagram: trellis edge from state F to state G, scored by $\mu(V, \text{loves})$, with forward weight $\alpha[F]$ and backward weight $\beta[G]$]
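A sketch of this computation in log space for a linear chain; `score(tp, t, word)` plays the role of $\log \mu$ (a toy interface, not from the slides). Summing the returned edge posteriors over the edges where a feature fires gives its expected count $E[f]$:

    import math

    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    def edge_posteriors(words, tags, score):
        """p(t' -> t at position i) = alpha[t'] * mu * beta[t] / Z."""
        n = len(words)
        # forward: alpha[i][t] = total log weight of paths ending in t at i
        alpha = [{t: score(None, t, words[0]) for t in tags}]
        for i in range(1, n):
            alpha.append({t: logsumexp([alpha[i - 1][tp] + score(tp, t, words[i])
                                        for tp in tags]) for t in tags})
        # backward: beta[i][t] = total log weight of paths from t at i to the end
        beta = [None] * n
        beta[n - 1] = {t: 0.0 for t in tags}
        for i in range(n - 2, -1, -1):
            beta[i] = {tp: logsumexp([score(tp, t, words[i + 1]) + beta[i + 1][t]
                                      for t in tags]) for tp in tags}
        logZ = logsumexp(list(alpha[n - 1].values()))   # Z = alpha[END]
        return {(i, tp, t): math.exp(alpha[i - 1][tp] + score(tp, t, words[i])
                                     + beta[i][t] - logZ)
                for i in range(1, n) for tp in tags for t in tags}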


CRF : SGD : Perceptron

• CRF = conditional random fields = structured maxent

• SGD = stochastic gradient descent = online CRF

• CRF => (online) => SGD => (viterbi) => Perceptron

• update rules (very similar):

• CRF: batch, expectation E[...]

• SGD: one example at a time, expectation E[...]

• Perceptron: one example at a time, Viterbi 1-best in place of the expectation
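Restating the three update rules side by side, in the notation of the "Optimizing likelihood" slide (the perceptron line substitutes the Viterbi best $\hat{t}^j$ for the expectation, which is the standard reading of the chain above):

$$\text{CRF (batch):}\qquad \lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N}\Big[f_i(w^j, t^j) - E[f_i(w^j, t)]\Big]$$

$$\text{SGD (online):}\qquad \lambda_i \leftarrow \lambda_i + \alpha \Big[f_i(w^j, t^j) - E[f_i(w^j, t)]\Big]$$

$$\text{Perceptron (online, Viterbi):}\qquad \lambda_i \leftarrow \lambda_i + \Big[f_i(w^j, t^j) - f_i(w^j, \hat{t}^j)\Big],\quad \hat{t}^j = \arg\max_t \sum_i \lambda_i f_i(w^j, t)$$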
