Boosting Applied to Tagging and PP Attachment By Aviad Barzilai.


Introduction to AdaBoost


AdaBoost & NLP


Tagging & PP Attachment


Introduction to AdaBoost

Binary classification

Weak classifier – a classifier that performs better than chance

Strong classifier – a classifier with a low error rate

Weak classifiers can be easier to find


X – Examples
Y – Labels ({-1, +1})
D – Weights
H(x) – Classifiers

Goal – build a strong classifier out of weak classifiers



Find the “best” weak classifier*

Update the weights to reflect the errors

Repeat!

*A formal definition of “best” will be given later


Create a strong classifier by combining the weak classifiers



AdaBoost & NLP


The idea of boosting is to combine many simple “rules of thumb”, such as “the current word is a noun if the previous word is the.”

Such rules often give incorrect classifications. Boosting combines many of them in a principled manner to produce a single highly accurate classification rule.

These rules of thumb are called weak hypotheses.


A weak hypothesis h_t maps each example x to a real number h_t(x).

The sign of this number is interpreted as the predicted class (-1 or +1) of example x.

The magnitude |h_t(x)| is interpreted as the level of confidence in the prediction.



AdaBoost outputs a final hypothesis which makes predictions using a simple vote of the weak hypotheses’ predictions, taking into account the varying confidences of the different predictions.


Training set

Importance weights (initially uniform)

Get a weak hypothesis

Update the importance weights

Repeat (“In our experiments, we used cross validation to choose the number of rounds T”)

Final hypothesis
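The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The weight update D(i) ← D(i)·exp(-y_i·h_t(x_i)), followed by normalization, is the standard confidence-rated AdaBoost update and is not spelled out on the slides. The `stump_learner` and the toy data are hypothetical, and the number of rounds is fixed at 3 here rather than chosen by cross validation.

```python
import math

def adaboost(examples, labels, weak_learner, rounds):
    """Run `rounds` of boosting and return the list of weak hypotheses."""
    n = len(examples)
    weights = [1.0 / n] * n                      # importance weights, initially uniform
    hypotheses = []
    for _ in range(rounds):
        h = weak_learner(examples, labels, weights)   # get a weak hypothesis
        hypotheses.append(h)
        # Update: correctly classified examples lose weight, mistakes gain weight.
        weights = [w * math.exp(-y * h(x))
                   for w, x, y in zip(weights, examples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return hypotheses

def final_hypothesis(hypotheses, x):
    """Confidence-weighted vote: the sign of the summed weak predictions."""
    score = sum(h(x) for h in hypotheses)
    return 1 if score >= 0 else -1

def stump_learner(examples, labels, weights):
    """Toy weak learner (hypothetical): best threshold rule on 1-D inputs."""
    best = None
    for threshold in sorted(set(examples)):
        for sign in (1, -1):
            rule = lambda x, t=threshold, s=sign: s if x >= t else -s
            err = sum(w for w, x, y in zip(weights, examples, labels)
                      if rule(x) != y)
            if best is None or err < best[0]:
                best = (err, rule)
    err, rule = best
    err = min(max(err, 1e-10), 1 - 1e-10)        # avoid log(0)
    alpha = 0.5 * math.log((1 - err) / err)      # confidence attached to this rule
    return lambda x: alpha * rule(x)

# Toy data: no single threshold rule fits this labeling.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
hyps = adaboost(xs, ys, stump_learner, rounds=3)
predictions = [final_hypothesis(hyps, x) for x in xs]
```

On this toy set no single threshold rule is correct everywhere, but the vote of three boosted rules classifies every training example correctly.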

For each hypothesis:

W_{b,s} is the sum of the weights of the examples for which y_i = s and the result of the predicate is b.

Choose the “best” hypothesis (minimal error)
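As a small illustration of the definition of W_{b,s}, the sketch below accumulates the four weight sums for one predicate. The function name `weight_sums`, the dictionary representation of examples, and the toy data are assumptions made for this example only.

```python
from collections import defaultdict

def weight_sums(examples, labels, weights, predicate):
    """Return W[(b, s)]: total weight of examples with predicate(x) == b and label s."""
    W = defaultdict(float)
    for x, y, w in zip(examples, labels, weights):
        b = 1 if predicate(x) else 0
        W[(b, y)] += w
    return W

# Hypothetical toy data: label +1 means "the current word is a noun".
examples = [{"PreviousWord": "the"}, {"PreviousWord": "the"},
            {"PreviousWord": "to"}, {"PreviousWord": "the"}]
labels = [+1, +1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]

W = weight_sums(examples, labels, weights,
                lambda x: x["PreviousWord"] == "the")
```

Here the predicate holds for three examples, two of them labeled +1, so most of the weight on the b = 1 side sits with the positive label.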

The authors’ default approach to multiclass problems is Schapire and Singer’s (1998) AdaBoost.MH algorithm.

The main idea of this algorithm is to regard each example with its multiclass label as several binary-labeled examples.

Suppose that the possible classes are 1,…,k. We map each original example x with label y to k binary-labeled derived examples (x; 1),…,(x; k), where example (x; c) is labeled +1 if c = y and -1 otherwise.
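The mapping from one multiclass example to k binary-labeled examples can be written directly. The function name and the three-tag class list below are hypothetical and serve only to show the shape of the derived examples.

```python
def derive_binary_examples(x, y, classes):
    """Map one multiclass example (x, y) to k binary-labeled examples ((x, c), +1 or -1)."""
    return [((x, c), +1 if c == y else -1) for c in classes]

# Hypothetical example: the word "dog" with true tag NN and three possible tags.
derived = derive_binary_examples("dog", "NN", ["DT", "NN", "VB"])
```

Exactly one derived example, the one paired with the true class, receives the label +1.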

Tagging (with AdaBoost)



“In corpus linguistics, part-of-speech tagging is the process of marking up the words in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context.

A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.”




The authors used the UPenn Treebank corpus (Marcus et al., 1993). The corpus uses 80 labels, which comprise 45 parts of speech properly so-called, and 35 indeterminate tags, representing annotator uncertainty. They introduced an 81st label, ##, for paragraph separators.

The straightforward way of applying boosting to tagging is to use AdaBoost.MH. Each word token represents an example, and the classes are the 81 part-of-speech tags.

Weak hypotheses are identified with “attribute=value” predicates.

The article uses three types of attributes: lexical attributes, contextual attributes and morphological attributes.

Lexical attributes: The current word as a downcased string (S); its capitalization (C); and its most-frequent tag in training (T).

Contextual attributes: The string (LS), capitalization (LC), and most-frequent tag (LT) of the preceding word, and similarly for the following word (RS, RC, RT).

Morphological attributes: The inflectional suffix (I) of the current word, as provided by an automatic stemmer; also the last two (S2) and last three (S3) letters of the current word.
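The attribute list above could be extracted along the following lines. This is a rough sketch under several assumptions: capitalization is reduced to a single boolean flag, `most_frequent_tag` is a hypothetical dictionary built from training data, and the inflectional suffix (I) is left out because it depends on a stemmer that the slides do not identify.

```python
def token_attributes(tokens, i, most_frequent_tag):
    """Build the attribute=value map for tokens[i]."""
    def lexical(word, prefix=""):
        return {
            prefix + "S": word.lower(),                          # downcased string
            prefix + "C": word[:1].isupper(),                    # capitalization (simplified)
            prefix + "T": most_frequent_tag.get(word.lower()),   # most-frequent tag in training
        }

    word = tokens[i]
    attrs = lexical(word)                                        # S, C, T
    if i > 0:
        attrs.update(lexical(tokens[i - 1], "L"))                # LS, LC, LT
    if i + 1 < len(tokens):
        attrs.update(lexical(tokens[i + 1], "R"))                # RS, RC, RT
    attrs["S2"] = word[-2:]                                      # last two letters
    attrs["S3"] = word[-3:]                                      # last three letters
    return attrs

# Hypothetical sentence and tag dictionary.
tokens = ["The", "dogs", "barked"]
tag_dict = {"the": "DT", "dogs": "NNS"}
attrs = token_attributes(tokens, 1, tag_dict)
```

Each (attribute, value) pair in the resulting map corresponds to one candidate predicate such as “LS = the”.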

The authors conducted four tagging experiments on the UPenn Treebank corpus using AdaBoost.

“The 3.28% error rate is not significantly different (at p = 0.05) from the error rate of the best-known single tagger, Ratnaparkhi’s Maxent tagger, which achieves 3.11% error on our data.”

The results are not as good as those achieved by Brill and Wu’s voting scheme, yet the experiments described in the article use very simple features.




Tagging (with MaxEnt)

“According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default.”



In MaxEnt we “model all that is known and assume nothing about what is unknown”.

Model all that is known: satisfy a set of constraints that must hold.

Assume nothing about what is unknown: choose the most “uniform” distribution, i.e. the one with maximum entropy.




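A feature (a.k.a. feature function or indicator function) is a binary-valued function on events:

$$f_j : A \times B \to \{0, 1\}$$

A: the set of possible classes
B: the space of contexts (e.g. neighboring words/tags)

$$f_j(a, b) = \begin{cases} 1 & \text{if } a = \text{DET and } \mathrm{curWord}(b) = \text{“that”} \\ 0 & \text{otherwise} \end{cases}$$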





Observed expectation (using the observed probability):

$$E_{\tilde{p}} f_j = \sum_x \tilde{p}(x) f_j(x)$$

Model expectation (using the model’s probability):

$$E_p f_j = \sum_x p(x) f_j(x)$$

The task: find p*

$$p^* = \arg\max_{p \in P} H(p)$$

Constraints:

$$P = \{\, p \mid E_p f_j = E_{\tilde{p}} f_j,\ j \in \{1, \dots, k\} \,\}$$

The solution is obtained iteratively.
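To make the quantities above concrete, the sketch below evaluates a feature expectation, the entropy H(p), and the constraint check on a tiny hand-made distribution. It does not implement the iterative solver. The event representation (a class paired with the current word), the probabilities, and all function names are assumptions chosen for illustration.

```python
import math

def f_det_that(a, b):
    """Indicator feature: 1 if the class is DET and the current word is "that"."""
    return 1 if a == "DET" and b == "that" else 0

def expectation(p, feature):
    """E_p f = sum over events x = (a, b) of p(x) * f(x)."""
    return sum(prob * feature(a, b) for (a, b), prob in p.items())

def entropy(p):
    """H(p) = -sum_x p(x) * ln p(x)."""
    return -sum(prob * math.log(prob) for prob in p.values() if prob > 0)

def satisfies_constraints(model, observed, features, tol=1e-9):
    """Check E_model f_j == E_observed f_j for every feature f_j."""
    return all(abs(expectation(model, f) - expectation(observed, f)) <= tol
               for f in features)

# Hypothetical observed distribution over (class, current word) events.
observed = {("DET", "that"): 0.5, ("CONJ", "that"): 0.25, ("NOUN", "dog"): 0.25}
# A uniform candidate model over the same events.
uniform = {event: 1.0 / 3 for event in observed}

observed_expectation = expectation(observed, f_det_that)
uniform_ok = satisfies_constraints(uniform, observed, [f_det_that])
```

The uniform model has the highest entropy of any distribution over these three events, but its expectation for the feature is 1/3 rather than the observed 1/2, so it falls outside the constraint set P.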


PP attachment (prepositional phrase attachment with AdaBoost)



“In grammar, a preposition is a part of speech that introduces a prepositional phrase. For example, in the sentence ‘The cat sleeps on the sofa’, the word ‘on’ is a preposition, introducing the prepositional phrase ‘on the sofa’. Simply put, a preposition indicates a relation between things mentioned in a sentence.”




Example of PP attachment

“Congress accused the president of peccadillos” is classified according to the attachment site of the prepositional phrase.

Attachment to N: accused [the president of peccadillos]

Attachment to V: accused [the president] [of peccadillos]



The cases of PP-attachment that the article addresses define a binary classification problem.

Examples have binary labels: positive represents attachment to noun, and negative represents attachment to verb.




The authors used the same training and test data as Collins and Brooks (1995).




Each PP-attachment example is represented by its values for four attributes: the main verb (V), the head word of the direct object (N1), the preposition (P), and the head word of the object of the preposition (N2).

For instance, in the earlier example, V = accused, N1 = president, P = of, and N2 = peccadillos.



The weak hypotheses used correspond to “attribute=value” predicates and conjunctions thereof.

There are 16 predicates that are considered for each example. For the previous example, three of these 16 predicates are (V = accused ^ N1 = president ^ N2 = peccadillos), (P = with), and (V = accused ^ P = of).
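One way to arrive at 16 predicates per example is to take every subset of the four attributes, including the empty conjunction, which holds for every example. That reading is an assumption: the slides give the count but not the enumeration. The sketch below generates the conjunctions for the running example under that assumption.

```python
from itertools import combinations

ATTRIBUTES = ("V", "N1", "P", "N2")

def conjunction_predicates(example):
    """Return every conjunction of attribute=value tests that holds for `example`.

    Assumption: all 2**4 subsets of the attributes are used, including the
    empty conjunction (an always-true predicate).
    """
    predicates = []
    for size in range(len(ATTRIBUTES) + 1):
        for attrs in combinations(ATTRIBUTES, size):
            predicates.append(tuple((a, example[a]) for a in attrs))
    return predicates

example = {"V": "accused", "N1": "president", "P": "of", "N2": "peccadillos"}
predicates = conjunction_predicates(example)
```

Each tuple lists the attribute=value tests of one conjunction, for instance `(("V", "accused"), ("P", "of"))`.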




After 20,000 rounds of boosting, the test error was down to 14.6 ± 0.6%. This is indistinguishable from the best known results for this problem, namely 14.5 ± 0.6%, reported by Collins and Brooks on exactly the same data.

The article uses very simple weak hypotheses that test the value of a Boolean predicate and make a prediction based on that value.

The predicates used are of the form “a = v”, for a an attribute and v a value; for example, “PreviousWord = the”.

If, on a given example x, the predicate holds, the weak hypothesis outputs prediction p1; otherwise it outputs p0.
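A predicate-based weak hypothesis of this kind is easy to express as a closure. In the sketch below the function name and the values chosen for p1 and p0 are hypothetical; in the article these values come out of the boosting procedure.

```python
def make_weak_hypothesis(attribute, value, p1, p0):
    """Return h(x): p1 if the predicate "attribute = value" holds on x, else p0."""
    def h(x):
        return p1 if x.get(attribute) == value else p0
    return h

# Hypothetical prediction values, for illustration only.
h = make_weak_hypothesis("PreviousWord", "the", p1=1.2, p0=-0.3)

score = h({"PreviousWord": "the"})
predicted_class = 1 if score >= 0 else -1   # the sign gives the class
confidence = abs(score)                     # the magnitude gives the confidence
```

The same hypothesis returns p0 for any example whose previous word is not “the”, which here is a weak vote for the negative class.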

Schapire and Singer (1998) prove that the training error of the final hypothesis is at most

$$\prod_{t=1}^{T} Z_t, \qquad Z_t = \sum_i D_t(i)\, \exp(-y_i h_t(x_i))$$

This suggests that the training error can be greedily driven down by designing a weak learner which, on round t of boosting, attempts to find a weak hypothesis h that minimizes Z_t.

Given a predicate (e.g. “PreviousWord = the”), we choose p0 and p1 to minimize Z.

Schapire and Singer (1998) show that Z is minimized when

$$p_b = \frac{1}{2} \ln\left(\frac{W_{b,+1}}{W_{b,-1}}\right), \qquad b \in \{0, 1\}$$


In practice, very large values of p0 and p1 can cause numerical problems and may also lead to overfitting. Therefore, we usually “smooth” these values.
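A minimal sketch of one common smoothing scheme follows: a small constant ε is added to both weight sums inside the logarithm, which keeps the prediction finite when one of the sums is zero. The exact smoothing formula is not shown on the slides, so this form, the value of ε, and the function names are assumptions. The helper `z_value` uses the closed form Z = 2·Σ_b √(W_{b,+1}·W_{b,-1}), which follows from Schapire and Singer's (1998) analysis when each p_b takes its unsmoothed optimum.

```python
import math

def smoothed_predictions(W, epsilon):
    """p_b = 0.5 * ln((W[b,+1] + eps) / (W[b,-1] + eps)) for b in {0, 1}."""
    return {b: 0.5 * math.log((W[(b, +1)] + epsilon) / (W[(b, -1)] + epsilon))
            for b in (0, 1)}

def z_value(W):
    """Z = 2 * sum_b sqrt(W[b,+1] * W[b,-1]), evaluated at the unsmoothed optimum."""
    return 2 * sum(math.sqrt(W[(b, +1)] * W[(b, -1)]) for b in (0, 1))

# Hypothetical weight sums W[(b, s)] for one predicate.
W = {(1, +1): 0.5, (1, -1): 0.25, (0, +1): 0.0, (0, -1): 0.25}

p = smoothed_predictions(W, epsilon=0.01)   # epsilon is an arbitrary small value
z = z_value(W)
```

Without smoothing, p_0 would be the logarithm of zero because W_{0,+1} = 0; with smoothing it is a large but finite negative number. A weak learner can compute Z for every candidate predicate and keep the one with the smallest value.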

For each hypothesis:

Choose the “best” hypothesis (max confidence)

Recall that AdaBoost.MH (Schapire and Singer, 1998) maps each original example x with label y to k binary-labeled derived examples (x; 1),…,(x; k), where (x; c) is labeled +1 if c = y and -1 otherwise.

We maintain a distribution over pairs (x; c), treating each such pair as a separate example.

Weak hypotheses are identified with predicates over (x; c) pairs, though they now ignore c.

The prediction weights p_{c,0} and p_{c,1}, however, are chosen separately for each class c.

An alternative is to use binary AdaBoost to train a separate discriminator (binary classifier) for each class, and to combine their outputs by choosing the class c that maximizes f_c(x), where f_c(x) is the final confidence-weighted prediction of the discriminator for class c.

This approach is called AdaBoost.MI (multiclass, independent discriminators).
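The combination step of AdaBoost.MI amounts to an argmax over the per-class scores. In the sketch below the discriminators are hand-written stand-ins with made-up scores; in practice each f_c would be the summed output of a separately boosted binary classifier.

```python
def predict_class(discriminators, x):
    """Return the class c whose discriminator gives the highest score f_c(x)."""
    return max(discriminators, key=lambda c: discriminators[c](x))

# Hypothetical stand-ins for trained per-class discriminators f_c.
discriminators = {
    "NN": lambda x: 0.8 if x["S"] == "dog" else -0.5,
    "VB": lambda x: -0.2,
    "DT": lambda x: 1.5 if x["S"] == "the" else -1.0,
}

tag_for_dog = predict_class(discriminators, {"S": "dog"})
tag_for_the = predict_class(discriminators, {"S": "the"})
```

Because each discriminator is trained on its own, the scores are only compared at prediction time.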

AdaBoost.MI differs from AdaBoost.MH in that predicates are selected independently for each class; we do not require that the weak hypothesis at round t be the same for all classes.
