Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Date posted: 24-Jan-2016

Transcript of Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Page 1: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Boosting Applied to Tagging and PP Attachment

By Aviad Barzilai

Page 2: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Introduction to AdaBoost

AdaBoost & NLP

Tagging & PP Attachment

Page 3: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Binary classification (introduction to AdaBoost)

Weak classifier – A classifier that is better than chance

Strong classifier – A classifier with a low error rate

Weak classifiers can be easier to find

Page 4: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Binary classification (introduction to AdaBoost)

X – Examples

Y – Labels ({-1, +1})

D – Weights

H(x) – Classifiers

Goal – Building a strong classifier using weak classifiers

Page 5: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Binary classification (introduction to AdaBoost)

Find the “best” weak classifier*

Update the weights to reflect the errors

REPEAT!

*A formal definition of “best” will be given later
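The find/update/repeat loop can be sketched in a few lines of Python. This is a minimal illustration of textbook discrete AdaBoost, not the presenter's code: the names `train_adaboost`, `predict` and `stumps` are made up, "best" is assumed to mean lowest weighted error, and the `alpha` formula is the standard one rather than something stated on the slides.

```python
import math

def train_adaboost(xs, ys, stumps, rounds):
    """Discrete AdaBoost sketch: xs are examples, ys are labels in {-1, +1},
    stumps is a list of candidate weak classifiers h(x) -> -1 or +1."""
    n = len(xs)
    weights = [1.0 / n] * n          # D: start from uniform weights
    ensemble = []                    # list of (alpha, weak classifier)
    for _ in range(rounds):
        # "Best" is assumed here to mean lowest weighted error.
        def weighted_error(h):
            return sum(w for w, x, y in zip(weights, xs, ys) if h(x) != y)
        best = min(stumps, key=weighted_error)
        err = weighted_error(best)
        if err >= 0.5:               # no better than chance: stop
            break
        err = max(err, 1e-10)        # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best))
        # Update the weights to reflect the errors: mistakes get heavier.
        weights = [w * math.exp(-alpha * y * best(x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Strong classifier: sign of the weighted sum of weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single threshold stump classifies perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
stumps = [lambda x, t=t, s=s: s if x < t else -s
          for t in (0.5, 2.5, 4.5) for s in (1, -1)]
ensemble = train_adaboost(xs, ys, stumps, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy data every individual stump misclassifies at least two points, while the three-round combination labels all six correctly.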

Page 8: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Binary classification (introduction to AdaBoost)

Create a strong classifier (a combination of the weak classifiers)

Page 10: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

The idea of boosting is to combine many simple “rules of thumb”, such as “the current word is a noun if the previous word is the.”

Such rules often give incorrect classifications. The main idea of boosting is to combine many such rules in a principled manner to produce a single highly accurate classification rule.

The “rules of thumb” (e.g. “the current word is a noun if the previous word is the”) are called weak hypotheses.

Page 11: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

A weak hypothesis h_t (e.g. “the current word is a noun if the previous word is the”) maps each example x to a real number h_t(x).

The sign of this number is interpreted as the predicted class (-1 or +1) of example x.

The magnitude |h_t(x)| is interpreted as the level of confidence in the prediction.

[Diagram: example x → weak hypothesis h_t → ±confidence]

Page 12: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

AdaBoost outputs a final hypothesis which makes predictions using a simple vote of the weak hypotheses’ predictions, taking into account the varying confidences of the different predictions.
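A confidence-weighted vote of this kind might look like the sketch below. The three weak hypotheses and their output values are invented for illustration (+1 stands for "noun", -1 for "not a noun"); only the rule "sum the real-valued outputs and take the sign" reflects the description above.

```python
def final_hypothesis(weak_hypotheses, x):
    """Sum the real-valued outputs h_t(x); the sign is the predicted class
    and the magnitude reflects the combined confidence."""
    score = sum(h(x) for h in weak_hypotheses)
    label = 1 if score >= 0 else -1
    return label, abs(score)

# Hypothetical weak hypotheses over a dict describing a word in context.
# The numbers are made up; a real learner would choose them during training.
def after_the(x):
    return 1.2 if x["previous_word"] == "the" else -0.1

def ends_in_ly(x):
    return -0.9 if x["word"].endswith("ly") else 0.2

def capitalized(x):
    return 0.4 if x["word"][0].isupper() else -0.05

example = {"word": "family", "previous_word": "the"}
label, confidence = final_hypothesis([after_the, ends_in_ly, capitalized], example)
```

Here one confident vote for "noun" outweighs a less confident vote against it, so the final label is +1 with a small combined confidence.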

Page 13: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

Training set

Importance weights (initially uniform)

Get a weak hypothesis

Update the importance weights

Repeat (“In our experiments, we used cross validation to choose the number of rounds T”)

Final hypothesis

Page 14: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

For each hypothesis:

W_{b,s} is the sum of weights of examples for which y_i = s and the result of the predicate is b.

Choose the “best” hypothesis (minimal error)

Page 15: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

Our default approach to multiclass problems is to use Schapire and Singer’s (1998) AdaBoost.MH algorithm.

The main idea of this algorithm is to regard each example with its multiclass label as several binary-labeled examples.

Suppose that the possible classes are 1,…,k. We map each original example x with label y to k binary labeled derived examples (x; 1),…,(x; k) where example (x; c) is labeled +1 if c = y and -1 otherwise.
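The mapping from one multiclass example to k binary-labeled derived examples can be written out directly. This is a small sketch; the function name `expand_multiclass` and the tag names are illustrative only.

```python
def expand_multiclass(examples, classes):
    """Map each (x, y) to one binary-labeled derived example per class:
    ((x, c), +1) if c == y, otherwise ((x, c), -1)."""
    derived = []
    for x, y in examples:
        for c in classes:
            derived.append(((x, c), 1 if c == y else -1))
    return derived

classes = ["NN", "VB", "DT"]
examples = [("dog", "NN"), ("the", "DT")]
derived = expand_multiclass(examples, classes)
```

Each original example contributes exactly one positive derived example and k - 1 negative ones.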

Page 17: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

“In corpus linguistics, part-of-speech tagging is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context.

A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.”

Page 18: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

The authors used the UPenn Treebank corpus (Marcus et al., 1993). The corpus uses 80 labels, which comprise 45 parts of speech properly so-called, and 35 indeterminate tags, representing annotator uncertainty. They introduced an 81st label, ##, for paragraph separators.

Page 19: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

The straightforward way of applying boosting to tagging is to use AdaBoost.MH. Each word token represents an example, and the classes are the 81 part-of-speech tags.

Page 20: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

Weak hypotheses are identified with “attribute=value” predicates.

The article uses three types of attributes: lexical attributes, contextual attributes and morphological attributes.

Page 21: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

Lexical attributes: The current word as a downcased string (S); its capitalization (C); and its most-frequent tag in training (T).

Contextual attributes: The string (LS), capitalization (LC), and most-frequent tag (LT) of the preceding word, and similarly for the following word (RS, RC, RT).

Morphological attributes: The inflectional suffix (I) of the current word, as provided by an automatic stemmer; also the last two (S2) and last three (S3) letters of the current word.
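As a rough sketch, the attributes listed above could be collected into an attribute=value map like this. Several details are assumptions: the three capitalization classes, the `most_frequent_tag` lookup table, and the "UNKNOWN" fallback are invented for illustration, and the stemmer-provided suffix (I) is left out because it needs an external stemmer.

```python
def capitalization(word):
    """Coarse capitalization class; the slides do not spell out the classes,
    so these three are an assumption."""
    if word.isupper():
        return "all_caps"
    if word[:1].isupper():
        return "initial_cap"
    return "lower"

def extract_attributes(tokens, i, most_frequent_tag):
    """Build the attribute=value map for token i. most_frequent_tag is a
    hypothetical dict from downcased word to its most frequent training tag."""
    def lexical(word, prefix):
        return {
            prefix + "S": word.lower(),
            prefix + "C": capitalization(word),
            prefix + "T": most_frequent_tag.get(word.lower(), "UNKNOWN"),
        }
    word = tokens[i]
    attrs = lexical(word, "")                      # S, C, T
    if i > 0:
        attrs.update(lexical(tokens[i - 1], "L"))  # LS, LC, LT
    if i + 1 < len(tokens):
        attrs.update(lexical(tokens[i + 1], "R"))  # RS, RC, RT
    # Morphological attributes; the stemmer-provided suffix (I) is omitted.
    attrs["S2"] = word[-2:]
    attrs["S3"] = word[-3:]
    return attrs

tags = {"the": "DT", "cat": "NN", "sleeps": "VBZ"}
attrs = extract_attributes(["The", "cat", "sleeps"], 1, tags)
```

Each entry of the resulting map, such as LS = the, is one candidate “attribute=value” predicate for a weak hypothesis.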

Page 22: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

The authors conducted four experiments of tagging the UPenn Treebank corpus using AdaBoost.

Page 23: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with AdaBoost)

“The 3.28% error rate is not significantly different (at p = 0.05) from the error rate of the best-known single tagger, Ratnaparkhi’s Maxent tagger, which achieves 3.11% error on our data.”

The results are not as good as those achieved by Brill and Wu’s voting scheme, yet the experiments described in the article use very simple features.

Page 25: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with MaxEnt)

“According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default”

Page 26: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with MaxEnt)

In MaxEnt we “model all that is known and assume nothing about what is unknown”.

Model all that is known: satisfy a set of constraints that must hold.

Assume nothing about what is unknown: choose the most “uniform” distribution, i.e. the one with maximum entropy.
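To make "most uniform means maximum entropy" concrete, here is a small sketch comparing the Shannon entropy of two distributions over four outcomes. The distributions are invented and no constraints are modelled; it only illustrates that the uniform one scores highest.

```python
import math

def entropy(distribution):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in distribution if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
h_uniform = entropy(uniform)   # 2 bits, the maximum for four outcomes
h_skewed = entropy(skewed)     # strictly smaller
```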

Page 27: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with MaxEnt)

A feature (a.k.a. feature function or indicator function) is a binary-valued function on events:

$$f_j : A \times B \to \{0, 1\}$$

A: the set of possible classes

B: the space of contexts (e.g. neighboring words/tags)

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET} \text{ and } \mathrm{curWord}(b) = \text{“that”} \\ 0 & \text{otherwise} \end{cases}$$
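The example feature translates directly into code. In this sketch the context b is assumed to be a dict with a `cur_word` key, which is a representation chosen for illustration.

```python
def f_j(a, b):
    """Indicator feature: 1 if the class is DET and the current word of the
    context is "that", otherwise 0. The context b is assumed to be a dict."""
    return 1 if a == "DET" and b["cur_word"] == "that" else 0

fires = f_j("DET", {"cur_word": "that"})        # both conditions hold
wrong_class = f_j("IN", {"cur_word": "that"})   # class differs
wrong_word = f_j("DET", {"cur_word": "the"})    # word differs
```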

Page 29: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with MaxEnt)

Observed expectation (using the observed probability $\tilde{p}$):

$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x)$$

Model expectation (using the model’s probability $p$):

$$E_p f_j = \sum_x p(x) f_j(x)$$

Constraints:

$$P = \{\, p \mid E_p f_j = E_{\tilde{p}} f_j,\ j \in \{1, \dots, k\} \,\}$$

The task: find $p^*$

$$p^* = \arg\max_{p \in P} H(p)$$

The solution is obtained iteratively.
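The observed expectation is simply the relative frequency with which a feature fires in the training sample. A minimal sketch, assuming events are (class, current word) pairs and using a made-up four-event sample:

```python
from collections import Counter

def observed_expectation(feature, events):
    """Sum over distinct events x of p~(x) * f(x), where p~ is the relative
    frequency of each (class, word) event in the training sample."""
    counts = Counter(events)
    total = len(events)
    return sum((n / total) * feature(a, b) for (a, b), n in counts.items())

def det_that(a, b):
    # Hypothetical feature: class is DET and the current word is "that".
    return 1 if a == "DET" and b == "that" else 0

events = [("DET", "that"), ("IN", "that"), ("DET", "the"), ("DET", "that")]
expectation = observed_expectation(det_that, events)
```

The model expectation has the same form with the model's probability in place of the relative frequency; the constraints require the two to agree for every feature.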

Page 30: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

Tagging (with MaxEnt)

“The 3.28% error rate is not significantly different (at p = 0.05) from the error rate of the best-known single tagger, Ratnaparkhi’s Maxent tagger, which achieves 3.11% error on our data.”

Page 31: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

“In grammar, a preposition is a part of speech that introduces a prepositional phrase. For example, in the sentence ‘The cat sleeps on the sofa’, the word ‘on’ is a preposition, introducing the prepositional phrase ‘on the sofa’. Simply put, a preposition indicates a relation between things mentioned in a sentence.”

Page 32: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

Example of PP attachment

“Congress accused the president of peccadillos” is classified according to the attachment site of the prepositional phrase.

Attachment to N: accused [the president of peccadillos]

Attachment to V: accused [the president] [of peccadillos]

Page 33: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

The cases of PP-attachment that the article addresses define a binary classification problem.

Examples have binary labels: positive represents attachment to noun, and negative represents attachment to verb.

Page 34: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

The authors used the same training and test data as Collins and Brooks (1995).

Page 35: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

Each PP-attachment example is represented by its value for four attributes: the main verb (V), the head word of the direct object (N1), the preposition (P), and the head word of the object of the preposition (N2).

For instance, in the previous example, V = accused, N1 = president, P = of, and N2 = peccadillos. Examples have binary labels: positive represents attachment to noun, and negative represents attachment to verb.

Page 36: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

The weak hypotheses used correspond to “attribute=value” predicates and conjunctions thereof.

There are 16 predicates that are considered for each example. For the previous example, three of these 16 predicates are (V = accused ^ N1 = president ^ N2 = peccadillos), (P = with), and (V = accused ^ P = of).
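One way to arrive at 16 candidate predicates per example is to take one conjunction for every subset of the four attributes (2^4 = 16, counting the empty conjunction). That reading is an assumption, not something the slides state; the sketch below enumerates the conjunctions that hold for the example sentence under it.

```python
from itertools import combinations

ATTRIBUTES = ("V", "N1", "P", "N2")

def candidate_predicates(example):
    """Return every conjunction of attribute=value tests that holds for the
    example, one per subset of the four attributes (2**4 = 16, counting the
    empty conjunction, which is an assumption about how 16 is reached)."""
    predicates = []
    for size in range(len(ATTRIBUTES) + 1):
        for subset in combinations(ATTRIBUTES, size):
            predicates.append(tuple((a, example[a]) for a in subset))
    return predicates

example = {"V": "accused", "N1": "president", "P": "of", "N2": "peccadillos"}
predicates = candidate_predicates(example)
```

Each predicate is represented as a tuple of (attribute, value) pairs, so (V = accused ^ P = of) appears as `(("V", "accused"), ("P", "of"))`.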

Page 38: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

PP attachment (Prepositional phrase attachment with AdaBoost)

After 20,000 rounds of boosting the test error was down to 14.6 ± 0.6%. This is indistinguishable from the best known results for this problem, namely 14.5 ± 0.6%, reported by Collins and Brooks on exactly the same data.

Page 39: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

FIN

Page 41: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

The article uses very simple weak hypotheses that test the value of a Boolean predicate and make a prediction based on that value.

The predicates used are of the form “a = v”, for a an attribute and v a value; for example, “PreviousWord = the”.

If, on a given example x, the predicate holds, the weak hypothesis outputs prediction p1, otherwise p0.
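Such a predicate-based weak hypothesis is a two-valued function. In this sketch examples are assumed to be dicts of attribute values, and the numbers chosen for p0 and p1 are placeholders rather than trained values.

```python
def make_weak_hypothesis(attribute, value, p0, p1):
    """Return h(x) that outputs p1 if x[attribute] == value, else p0."""
    def h(x):
        return p1 if x.get(attribute) == value else p0
    return h

# The prediction values below are made up for illustration.
h = make_weak_hypothesis("PreviousWord", "the", p0=-0.2, p1=1.5)
holds = h({"PreviousWord": "the"})   # predicate holds: outputs p1
fails = h({"PreviousWord": "a"})     # predicate fails: outputs p0
```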

Page 42: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

Schapire and Singer (1998) prove that the training error of the final hypothesis is at most

$$\prod_{t=1}^{T} Z_t, \qquad Z_t = \sum_i D_t(i) \exp(-y_i h_t(x_i))$$

This suggests that the training error can be greedily driven down by designing a weak learner which, on round t of boosting, attempts to find a weak hypothesis h that minimizes Z_t.

Page 43: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

Given a predicate (e.g. “PreviousWord = the”), we choose p0 and p1 to minimize Z.

Schapire and Singer (1998) show that Z is minimized when

$$p_b = \frac{1}{2} \ln\left(\frac{W_{b,+1}}{W_{b,-1}}\right)$$

W_{b,s} is the sum of weights of examples for which y_i = s and the result of the predicate is b.
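The weight sums and the resulting prediction values can be computed as in the sketch below. It uses the closed form from Schapire and Singer's confidence-rated boosting, p_b = ½ ln(W_{b,+1} / W_{b,-1}), for which Z equals 2 Σ_b √(W_{b,+1} W_{b,-1}). The function names and the eight-example data set are invented for illustration.

```python
import math

def sum_weights(examples, weights, predicate):
    """W[b][s]: total weight of examples with label s on which the
    predicate evaluates to b (b in {0, 1}, s in {-1, +1})."""
    W = {0: {-1: 0.0, 1: 0.0}, 1: {-1: 0.0, 1: 0.0}}
    for (x, y), w in zip(examples, weights):
        W[1 if predicate(x) else 0][y] += w
    return W

def optimal_predictions(W):
    """p_b = 0.5 * ln(W[b][+1] / W[b][-1]); assumes all four sums are > 0."""
    return {b: 0.5 * math.log(W[b][1] / W[b][-1]) for b in (0, 1)}

def z_value(W):
    """Z at the optimal p_0, p_1: 2 * sum_b sqrt(W[b][+1] * W[b][-1])."""
    return 2.0 * sum(math.sqrt(W[b][1] * W[b][-1]) for b in (0, 1))

# Eight equally weighted examples: after "the", 3 positives and 1 negative;
# after "a", 1 positive and 3 negatives.
examples = ([({"PreviousWord": "the"}, 1)] * 3 + [({"PreviousWord": "the"}, -1)]
            + [({"PreviousWord": "a"}, 1)] + [({"PreviousWord": "a"}, -1)] * 3)
weights = [0.125] * 8
W = sum_weights(examples, weights, lambda x: x["PreviousWord"] == "the")
p = optimal_predictions(W)
Z = z_value(W)
```

A predicate that separates the labels better produces a smaller Z, so a weak learner can compare candidate predicates by this value.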

Page 44: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

In practice, very large values of p0 and p1 can cause numerical problems and may also lead to overfitting. Therefore, we usually “smooth” these values.
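The slide does not show the smoothing formula. One plausible scheme, assumed here, adds a small constant ε to both weight sums before taking the log ratio; the value of ε below is arbitrary.

```python
import math

def smoothed_prediction(w_pos, w_neg, epsilon):
    """0.5 * ln((w_pos + epsilon) / (w_neg + epsilon)): stays finite even
    when one of the weight sums is zero."""
    return 0.5 * math.log((w_pos + epsilon) / (w_neg + epsilon))

# A predicate that only ever fires on positive examples: w_neg is 0, so the
# unsmoothed value would be infinite.
p1 = smoothed_prediction(w_pos=0.3, w_neg=0.0, epsilon=0.01)
balanced = smoothed_prediction(w_pos=0.2, w_neg=0.2, epsilon=0.01)
```

Larger values of ε pull the predictions toward zero, which trades some fit on the training data for stability.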

Page 45: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

For each hypothesis:

W_{b,s} is the sum of weights of examples for which y_i = s and the result of the predicate is b.

Choose the “best” hypothesis (max confidence)

Page 48: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

In AdaBoost.MH, we maintain a distribution over pairs (x; c), treating each such pair as a separate example.

Weak hypotheses are identified with predicates over (x; c) pairs, though they now ignore c.

The prediction weights p_{c,0} and p_{c,1}, however, are chosen separately for each class c.

Page 49: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

An alternative is to use binary AdaBoost to train separate discriminators (binary classifiers) for each class, and combine their output by choosing the class c that maximizes f_c(x), where f_c(x) is the final confidence-weighted prediction of the discriminator for class c.

The above is called AdaBoost.MI (multiclass, independent discriminators).
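Combining independent discriminators amounts to an argmax over their scores. In this sketch the per-class scorers are hand-written stand-ins for trained binary AdaBoost discriminators, and their scores are made up.

```python
def predict_mi(discriminators, x):
    """Pick the class c whose discriminator f_c gives the largest
    confidence-weighted prediction f_c(x)."""
    return max(discriminators, key=lambda c: discriminators[c](x))

# Hypothetical per-class scorers standing in for trained binary AdaBoost
# discriminators; the scores are made up.
discriminators = {
    "NN": lambda x: 0.8 if x["previous_word"] == "the" else -0.3,
    "VB": lambda x: 0.5 if x["previous_word"] == "to" else -0.4,
    "DT": lambda x: -1.0,
}
after_the = predict_mi(discriminators, {"previous_word": "the"})
after_to = predict_mi(discriminators, {"previous_word": "to"})
```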

Page 50: Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.

AdaBoost & NLP

AdaBoost.MI differs from AdaBoost.MH in that predicates are selected independently for each class; we do not require that the weak hypothesis at round t be the same for all classes.