JHU MT class: Automated Evaluation

Human Assessment is Fast and Cheap (if that)

© 2010 IBM Corporation

IBM Research

What It Takes to Compete against Top Human Jeopardy! Players: Our Analysis Reveals the Winner’s Cloud

[Figure: Each dot is an actual historical human Jeopardy! game. Labels: Winning Human Performance; Grand Champion Human Performance; 2007 QA Computer System. Axis: More Confident to Less Confident.]

Computers? Not So Good.

DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010

[Figure: successive system versions: Baseline 12/06, v0.1 12/07, v0.2 05/08, v0.3 08/08, v0.4 12/08, v0.5 05/09, v0.6 10/09, v0.7 04/10, v0.8 11/10.]

IBM Watson: Playing in the Winners Cloud

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Edit distance = 16: 3 substitutions, 8 deletions, 5 insertions

\[
\mathrm{ed}(i, j) = \min
\begin{cases}
\mathrm{ed}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ed}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ed}(i-1, j-1) + \mathrm{sub}(w_i, w'_j)
\end{cases}
\]
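The recurrence can be filled in as a dynamic-programming table. The following is a minimal sketch, assuming unit costs for del, ins, and sub (the slide does not fix the costs) and a sub cost of zero when the two words are identical; the function and parameter names are illustrative.

```python
def edit_distance(hyp, ref, del_cost=1, ins_cost=1, sub_cost=1):
    """Word-level edit distance following the ed(i, j) recurrence.

    hyp and ref are lists of tokens. Unit costs are an assumption.
    """
    n, m = len(hyp), len(ref)
    # ed[i][j] = cheapest way to turn the first i hypothesis words
    # into the first j reference words.
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = ed[i - 1][0] + del_cost   # delete everything
    for j in range(1, m + 1):
        ed[0][j] = ed[0][j - 1] + ins_cost   # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching words cost nothing to "substitute".
            sub = 0 if hyp[i - 1] == ref[j - 1] else sub_cost
            ed[i][j] = min(
                ed[i - 1][j] + del_cost,      # delete w_i
                ed[i][j - 1] + ins_cost,      # insert w'_j
                ed[i - 1][j - 1] + sub,       # substitute w_i by w'_j
            )
    return ed[n][m]
```

The table has (n+1)(m+1) cells, so the cost is quadratic in sentence length. Because the recurrence returns the minimum cost, with unit costs it can come out lower than a count read off one particular alignment.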

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Precision: 7/15 tokens = 47%

Recall: 7/12 tokens = 58%
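These two numbers can be reproduced by counting shared tokens. A minimal sketch, assuming whitespace tokenization, case-sensitive matching, and that a repeated token only counts as often as it appears on both sides; the function name is illustrative.

```python
from collections import Counter


def token_precision_recall(hyp, ref):
    """Token-level precision and recall of hyp against a single reference."""
    # Counter intersection keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp)   # share of hypothesis tokens found in ref
    recall = overlap / len(ref)      # share of reference tokens found in hyp
    return precision, recall
```

On the sentence pair above, 7 tokens overlap, which gives 7/15 for precision and 7/12 for recall.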

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

Despite the strong northerly winds , the sky remains very clear .

The sky was still crystal clear , though the north wind was howling .

Although a north wind was howling , the sky remained clear and blue .

Precision: 11/15 tokens

sky very northern shrieked clear wind Although across the the , still was it .

Precision: 11/15 tokens

Although the northern wind shrieked across the sky , it was still very clear .

Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams

sky very northern shrieked clear wind Although across the the , still was it .

Precision: 11/15 tokens, 0/14 bigrams, 0/13 trigrams

very clear .

Precision: 3/3 tokens, 2/2 bigrams, 1/1 trigrams

very clear . shrieked was still Although wind , across it northern the the sky

Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams

a north . the was and was the the the though the , the sky

Precision: 11/15 tokens, 4/14 bigrams, 1/13 trigrams
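The counts on these slides can be reproduced with clipped n-gram precision against several references. A minimal sketch, assuming whitespace tokenization and case-sensitive matching, and assuming each hypothesis n-gram is credited at most as many times as it occurs in any single reference; the helper names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, refs, n):
    """Return (matched, total) n-gram counts for hyp against multiple refs."""
    hyp_counts = ngrams(hyp, n)
    # For each n-gram, the largest count seen in any one reference.
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip each hypothesis count by that maximum.
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total
```

With the four references above, the original hypothesis scores 11/15, 4/14, and 1/13 for n = 1, 2, 3, while the fully scrambled version keeps 11/15 tokens but drops to 0/14 bigrams. The three-word output "very clear ." scores 3/3, 2/2, and 1/1, which is why precision alone rewards very short outputs.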

BLEU

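\[
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{1 - r/c} & \text{if } c \le r
\end{cases}
\]

\[
\mathrm{Bleu} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
\]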
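The two formulas combine as follows. This is a minimal sketch, assuming c is the candidate length, r is the reference length, p_n are n-gram precisions already computed elsewhere, and the weights w_n default to 1/N. Returning 0 when any precision is zero is a convention chosen here to avoid log(0), not something the slide specifies.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(precisions, c, r, weights=None):
    """Bleu = BP * exp(sum_n w_n * log p_n)."""
    n_max = len(precisions)
    if weights is None:
        weights = [1.0 / n_max] * n_max   # uniform weights (assumption)
    if min(precisions) == 0:
        return 0.0                        # log(0) is undefined
    log_avg = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_avg)
```

For example, `bleu([11/15, 4/14, 1/13], c=15, r=12)` is the geometric mean of the three precisions, since the candidate is longer than the reference and BP is 1. For a candidate shorter than the reference, BP shrinks the score, which counteracts the short-output problem seen with "very clear .".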

Details matter

[Figure: BP plotted against length]

Influence of BLEU

BLEU-1, BLEU-4, BLEU-v11b, BLEU-v12, METEOR-v0.6, NIST-v11b, TER-v0.7.25, 4-GRR, ATEC1, Amber, ATEC3, ATEC4, Meteor-v0.7, TerrorCat, BEwT-E, Badger, BadgerLite, Bleu-sbp, BleuSP, CDer, DP-Or, DP-Orp, DR-Or, EDPM, LET, METEOR-ranking, MaxSim, RTE, Rose, SEPIA1, SEPIA2, SNR, SR-Or, SVM-Rank, TERp, ULCh, ULCopt, invWer, mBLEU, mTER

Although the northern wind shrieked across the sky , it was still very clear .

However , the sky remained clear under the strong north wind .

TER: Translation (Error|Edit) Rate

Basically edit distance with swaps.

How hard is it to compute this?

\[
\mathrm{ter}(i, j) = \min
\begin{cases}
\mathrm{ter}(i-1, j) + \mathrm{del}(w_i) \\
\mathrm{ter}(i, j-1) + \mathrm{ins}(w'_j) \\
\mathrm{ter}(i-1, j-1) + \mathrm{sub}(w_i, w'_j) \\
\max_k \mathrm{ter}(i-1, [1, \ldots, k-1, k+1, \ldots, j]) + 1
\end{cases}
\]
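To get a feel for what the swap term adds, here is a deliberately restricted sketch: it allows at most one block of hypothesis words to be moved, at a cost of 1, and tries every such move by brute force. This is not the full TER algorithm. Searching over arbitrary sequences of shifts is generally regarded as intractable, and practical implementations use a greedy search instead. The function names and the one-shift restriction are assumptions made for illustration.

```python
def edit_distance(hyp, ref):
    """Word-level edit distance with unit costs (row-by-row DP)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def one_shift_edits(hyp, ref):
    """Fewest edits when at most one block of hyp may be moved (cost 1)."""
    best = edit_distance(hyp, ref)                   # option: no shift at all
    n = len(hyp)
    for start in range(n):
        for end in range(start + 1, n + 1):
            block = hyp[start:end]
            rest = hyp[:start] + hyp[end:]
            for pos in range(len(rest) + 1):
                shifted = rest[:pos] + block + rest[pos:]
                best = min(best, 1 + edit_distance(shifted, ref))
    return best


def ter_one_shift(hyp, ref):
    """Edits divided by reference length, under the one-shift restriction."""
    return one_shift_edits(hyp, ref) / len(ref)
```

For the toy pair `b c a` against `a b c`, plain edit distance needs 2 edits (one insertion and one deletion), while a single shift of `a` to the front needs only 1. Even this restricted version is much more expensive than plain edit distance, because every candidate shift triggers a fresh table computation.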

Automatic evaluation is Fast and Cheap.

Why Not Use All Translations? (Dreyer & Marcu ’12)

el primer ministro italiano Silvio Berlusconi

〈PM〉 〈IT〉 〈SB〉

〈PM〉 → prime-minister
〈PM〉 → PM
〈PM〉 → prime minister
〈PM〉 → head of government
〈PM〉 → premier

〈IT〉 → Italian

〈SB〉 → Silvio Berlusconi
〈SB〉 → Berlusconi

〈S〉 → 〈SB〉 , 〈IT〉 〈PM〉
〈S〉 → 〈IT〉 〈PM〉 〈SB〉
〈S〉 → the 〈IT〉 〈PM〉 , 〈SB〉
〈S〉 → the 〈PM〉 of Italy

HyTER

•Entire set is exponential, but finite.

•Can be encoded as an FST.

•Then compute edit distance as FST composition!
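For a grammar as small as the Berlusconi example, the idea can be sketched without an FST by enumerating every translation and taking the closest one. This brute-force version is only an illustration of what the FST composition computes. It does not scale to the set sizes HyTER deals with, and the `<PM>`-style nonterminal markers and function names are assumptions.

```python
from itertools import product

# Grammar from the slide; nonterminals are written with angle brackets.
GRAMMAR = {
    "<PM>": [["prime-minister"], ["PM"], ["prime", "minister"],
             ["head", "of", "government"], ["premier"]],
    "<IT>": [["Italian"]],
    "<SB>": [["Silvio", "Berlusconi"], ["Berlusconi"]],
    "<S>": [["<SB>", ",", "<IT>", "<PM>"],
            ["<IT>", "<PM>", "<SB>"],
            ["the", "<IT>", "<PM>", ",", "<SB>"],
            ["the", "<PM>", "of", "Italy"]],
}


def expand(symbol):
    """Return every token sequence that a symbol can produce."""
    if symbol not in GRAMMAR:
        return [[symbol]]                      # terminal word
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results


def edit_distance(hyp, ref):
    """Word-level edit distance with unit costs."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1]


def distance_to_closest(hyp, start="<S>"):
    """Smallest edit distance from hyp to any translation in the grammar."""
    return min(edit_distance(hyp, ref) for ref in expand(start))
```

This toy grammar already yields 35 distinct translations from 12 rules, which shows how quickly the set grows. A hypothesis such as `Italian premier Berlusconi` is in the set and scores 0, while `the Italian PM Berlusconi` is one edit away from its closest member.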

HyTER statistics

•3-4 annotators per sentence.

•2-3 hours per annotator per sentence.

•>1M translations per annotator per sentence.

•>1B translations per sentence (combined).

•Shockingly low overlap between annotators (~10K).

Parting Thoughts

•Evaluating machine translation is really, really hard.

•Human evaluation: expensive, slow, unreproducible. But arguably what we want.

•Automatic evaluation: fast, cheap, consistent. But might not have anything to do with what we want.

•It’s also really, really important.

•It’s easier to improve what you measure.

•Research funding often driven by evaluation.

•What should we be measuring?