RANLP, Borovets 27-29 Sept. 2007 Evaluating Algorithms for GRE (Going beyond Toy Domains) Ielka van der Sluis Albert Gatt Kees van Deemter University of Aberdeen, Scotland, UK


RANLP, Borovets, 27-29 Sept. 2007

Evaluating Algorithms for GRE

(Going beyond Toy Domains)

Ielka van der Sluis, Albert Gatt, Kees van Deemter

University of Aberdeen, Scotland, UK


Outline

• GRE: Generation of Referring Expressions

• TUNA project: Corpus and Annotation

• Evaluation of Algorithms
  – Furniture Domain
  – People Domain

• [ Evaluation in the real world: STEC ]


TUNA project (ended Feb. 2007)

• TUNA: Towards a UNified Algorithm for Generating Referring Expressions.

1. Extend coverage of GRE algorithms (plurals, negation, gradable properties,…)

2. Improve empirical foundations of GRE

• Focus on
  – Content Determination
  – “First mention” NPs (no anaphora!)


TUNA results

• Elsewhere:
  – Reference to sets (e.g., Gatt 2006, 2007)
  – Gradable/vague properties (Van Deemter 2006)
  – Pointing (Van der Sluis & Krahmer 2007)
  – Large domains (Paraboni et al. 2007)

• This talk: empirical issues
  – Testing classic algorithms
  – Method: compute similarity to human-generated NPs


Method (overview)

• Elicitation experiment

• Leads to a transparent corpus of referring expressions:
  – Referent and distractors are known
  – Domain attributes are known

• Transparent corpora can be used for many purposes

• This talk: compare some classic algorithms
  – giving each algorithm the same input as the subjects
  – computing how similar the algorithm’s output is to the subjects’ output
  – counting semantic content only


Elicitation Experiment

• Furniture (simple domain)
  – TYPE, COLOUR, SIZE, ORIENTATION

• People (complex domain)
  – Nine annotated properties in total

• Location:
  – Vertical location (Y-DIMENSION)
  – Horizontal location (X-DIMENSION)

• Example descriptions:
  – the green desk facing backwards
  – the sofa and the desk which are red
  – the young man with a white shirt
  – the man with the funny haircut
  – the man on the left
  – the chair in the top right


Furniture trial


People trial


Corpus setup

• Each corpus was carefully balanced, e.g. between singulars and plurals.

• Between-subjects design:
  -Location: Subjects discouraged from using locative expressions.
  +Location: Subjects not discouraged.
  -FaultCritical: Subjects could correct their utterances.
  +FaultCritical: Subjects could not correct their utterances.

• After discounting outliers and (self-reported) non-fluent speakers, 45 subjects were left


• Experiment design: Furniture (-Location)

• 18 trials (C = Colour, O = Orientation, S = Size):
  – 1 referent: minimal identification uses {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
  – 2 “similar” referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
  – 2 “dissimilar” referents: {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]
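The 18-trial design above is a full cross of three referent conditions with six minimal attribute sets. A minimal sketch of that enumeration follows; the condition labels and the function name `furniture_trials` are illustrative assumptions, not names taken from the TUNA corpus.

```python
from itertools import product

# Minimal distinguishing attribute sets listed on the slide
# (c = colour, o = orientation, s = size).
MINIMAL_SETS = [("c",), ("o",), ("s",), ("c", "o"), ("c", "s"), ("o", "s")]

# Referent conditions; these labels are hypothetical names for the three
# conditions described on the slide.
CONDITIONS = ["one_referent", "two_similar_referents", "two_dissimilar_referents"]


def furniture_trials():
    """Cross every referent condition with every minimal attribute set."""
    return [
        {"condition": condition, "minimal": set(attrs)}
        for condition, attrs in product(CONDITIONS, MINIMAL_SETS)
    ]


trials = furniture_trials()
print(len(trials))  # 18
```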


Classic GRE Algorithms

• Full Brevity (FB; Dale 1989)
  – Generation of a minimal description

• Greedy Algorithm (GR; Dale 1989)
  – Always add the property that removes the most distractors

• Incremental Algorithm (IA; Dale and Reiter 1995)
  – Add the next useful property from an ordered list of properties (“Preference Order” = PO)
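The three strategies can be sketched on a toy domain as follows. This is an illustrative simplification, not the implementation evaluated in the talk: entities are assumed to be flat attribute-value dictionaries, the function names and the example domain are made up, and neither the TYPE handling nor the plural extension is modelled.

```python
from itertools import combinations


def rules_out(prop, entity):
    """True if the entity lacks the given (attribute, value) property."""
    attr, value = prop
    return entity.get(attr) != value


def distinguishes(props, distractors):
    """True if every distractor is ruled out by at least one property."""
    return all(any(rules_out(p, d) for p in props) for d in distractors)


def full_brevity(target, distractors):
    """FB: return a smallest property set that excludes all distractors."""
    props = sorted(target.items())
    for size in range(len(props) + 1):
        for combo in combinations(props, size):
            if distinguishes(combo, distractors):
                return set(combo)
    return None


def greedy(target, distractors):
    """GR: repeatedly add the property that removes the most distractors."""
    description, remaining = set(), list(distractors)
    candidates = sorted(target.items())
    while remaining and candidates:
        # Ties go to the alphabetically first attribute (arbitrary choice).
        best = max(candidates, key=lambda p: sum(rules_out(p, d) for d in remaining))
        if not any(rules_out(best, d) for d in remaining):
            break  # no remaining property removes any distractor
        description.add(best)
        candidates.remove(best)
        remaining = [d for d in remaining if not rules_out(best, d)]
    return description if not remaining else None


def incremental(target, distractors, preference_order):
    """IA: walk the preference order, keeping each property that helps."""
    description, remaining = set(), list(distractors)
    for attr in preference_order:
        if not remaining:
            break
        if attr not in target:
            continue
        prop = (attr, target[attr])
        if any(rules_out(prop, d) for d in remaining):
            description.add(prop)
            remaining = [d for d in remaining if not rules_out(prop, d)]
    return description if not remaining else None


# Hypothetical furniture scene: one target and three distractors.
target = {"colour": "red", "size": "large", "orientation": "front"}
distractors = [
    {"colour": "red", "size": "small", "orientation": "front"},
    {"colour": "blue", "size": "large", "orientation": "back"},
    {"colour": "blue", "size": "small", "orientation": "front"},
]

print(sorted(full_brevity(target, distractors)))
print(sorted(greedy(target, distractors)))
print(sorted(incremental(target, distractors, ["orientation", "colour", "size"])))
```

On this scene FB and GR both select colour and size, whereas the IA with orientation first in its preference order also keeps orientation, which illustrates how strongly the IA's output depends on the PO.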


Other evaluation studies

• Jordan 2000, Jordan & Walker 2005
  – More than just identification (Jordan 2000)

• Siddharthan & Copestake 2004
  – References in linguistic context

• Gupta & Stent 2005
  – Realisation mixed with Content Determination

• Viethen & Dale 2006
  – Only Colour and Location


Other evaluation studies

General limitations:

• Limited numbers of subjects/referents

• Few attempts at balancing the corpus. (E.g., Viethen & Dale 2006 let subjects decide what to refer to.)

• IA: no teasing apart of preference orders


Extensions to the classics

• Plurality (van Deemter 2002):
  – Extend each algorithm to search through disjunctions of increasing length

• Location (van Deemter 2006):
  – Locatives treated as gradable: “the leftmost table/person”
  – E.g., suppose the referent x is located in column 3
    => “x is left of column 4”, “x is left of column 5” …
    => “x is right of column 2”, “x is right of column 1” …

• Type:
  – People tend to use TYPE (Dale & Reiter 1995)
  – Here: all algorithms added TYPE.
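The treatment of locatives as gradable properties can be illustrated with a small sketch that turns a column position into “left of” and “right of” properties. The function name, the tuple encoding, and the assumption that columns are numbered 1..n from left to right (with n = 5 in the example) are not from the talk.

```python
def column_inequalities(column, n_columns):
    """Express 'x is in `column`' as left-of / right-of properties.

    Assumes columns are numbered 1..n_columns from left to right.
    """
    left_of = [("left_of", c) for c in range(column + 1, n_columns + 1)]
    right_of = [("right_of", c) for c in range(column - 1, 0, -1)]
    return left_of + right_of


# Referent in column 3 of a hypothetical five-column grid.
print(column_inequalities(3, 5))
# [('left_of', 4), ('left_of', 5), ('right_of', 2), ('right_of', 1)]
```

The same construction would apply to rows for the vertical dimension.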


Evaluation aims

• Hypothesis in Dale & Reiter 1995:
  – IA resembles human output most

• Our main questions:
  – Is this true?
  – How important are parameters (PO) for the IA?

• More generally:
  – assess ‘quality’ of classic GRE algorithms
  – calculate average match between the description generated by an algorithm and the descriptions produced by people (for the same referent)


Evaluation metric

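• Dice Coefficient, for the property sets A and B of two descriptions:

$$\mathrm{Dice}(A, B) = \frac{2 \times |A \cap B|}{|A| + |B|}$$

  where |A ∩ B| is the number of common properties and |A| + |B| is the total number of properties in the two descriptions.

• A coefficient of 1 indicates identical sets; 0 means no common properties.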

• We also used this to measure agreement between annotators of the corpus
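A minimal sketch of the metric, assuming each description is represented as a set of property labels. The function names, the string encoding of properties, and the convention that two empty descriptions score 1 are assumptions of this example.

```python
def dice(a, b):
    """Dice coefficient between two sets of properties."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty descriptions are identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_dice(algorithm_output, human_descriptions):
    """Average match between one algorithm output and several human descriptions."""
    scores = [dice(algorithm_output, human) for human in human_descriptions]
    return sum(scores) / len(scores)


generated = {"type:desk", "colour:green"}
human = [
    {"type:desk", "colour:green"},
    {"type:desk", "colour:green", "orientation:back"},
]
print(dice(generated, human[1]))    # 0.8
print(mean_dice(generated, human))  # 0.9
```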


(Assumptions behind DICE)

• Deletion of a property is slightly worse than addition of a property

• The discriminatory power of a description does not matter

• All properties are equidistant

(See Gatt & Van Deemter 2007, “Content Determination in GRE: evaluating the evaluator”)
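The first assumption can be checked with a short calculation. As an illustration, suppose the human description has n properties and the algorithm's output differs from it by exactly one property, which is either missing (deletion) or extra (addition); the symbol n is introduced here only for the example.

```latex
\begin{align*}
\text{deletion:}\quad & \mathrm{Dice} = \frac{2(n-1)}{n + (n-1)} = \frac{2n-2}{2n-1} \\
\text{addition:}\quad & \mathrm{Dice} = \frac{2n}{n + (n+1)} = \frac{2n}{2n+1} \\
\text{difference:}\quad & \frac{2n}{2n+1} - \frac{2n-2}{2n-1} = \frac{2}{4n^{2}-1} > 0 \\
\text{e.g. } n = 3:\quad & \tfrac{4}{5} = 0.80 \;<\; \tfrac{6}{7} \approx 0.86
\end{align*}
```

Deletion therefore always scores slightly lower than addition, and the gap shrinks as descriptions get longer.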


Evaluation (I): Furniture

• Which preference orders for the IA?
  – Psycholinguistic evidence:
    • COLOUR >> {ORIENTATION, SIZE} (Pechmann 89; Eikmeyer & Ahlsen 96; Belke & Meyer 02)
    • Y-DIMENSION >> X-DIMENSION (Bryant et al. 1992; Arts 2004)

• Split data: +LOCATION vs –LOCATION
  – This talk: focus on –LOCATION
  – The –LOCATION data comprise approx. 800 descriptions

• Compare algorithms to a randomized IA (RAND)


Furniture: -LOCATION

[Chart: results per algorithm for Furniture, -LOCATION. Surviving labels: “Significant” (twice) and “FB/GR”.]


Beyond Toy Domains

• More on Furniture corpus: Gatt et al. (ENLG-2007)

• With complex real-world objects:
  – Many different attributes can be used
  – Number of POs explodes
  – Few psycholinguistic precedents

• People domain attributes:
  – { hasBeard, hasGlasses, age, hasTie, hasSuit, hasShirt, hasHair, hairColour, orientation }
  – 9 attributes, so 9! = 362880 possible POs


IA: Preference Orders for People Domain

• Little psycholinguistic evidence for choosing between all 362880 possible PO’s

• Focus on the most frequent attributes: G = hasGlasses, B = hasBeard, H = hasHair, C = hairColour
  – Assumption: H and B must precede C
  – This leaves us with eight POs: { GBHC, GHBC, HBGC, HBCG, HGBC, BHGC, BHCG, BGHC }
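Both counts can be reproduced by brute-force enumeration. The sketch below is only a check of the arithmetic; the function name `constrained_orders` is made up for the example.

```python
from itertools import permutations
from math import factorial

# All orderings of the nine People attributes.
print(factorial(9))  # 362880


def constrained_orders(attrs="GBHC"):
    """Orderings of G, B, H, C in which both H and B precede C."""
    return [
        "".join(p)
        for p in permutations(attrs)
        if p.index("H") < p.index("C") and p.index("B") < p.index("C")
    ]


orders = constrained_orders()
print(len(orders))  # 8
print(orders)
```

Of the 24 orderings of the four attributes, one third place C after both H and B, which gives the eight POs listed above.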


Preference Orders and frequency

Attribute     Mean (std)   Sum
type          1.39         475
hasGlasses    0.68         231
hasBeard      0.66         226
hairColour    0.61         210
hasHair       0.46         158
orientation   0.21          73
age           0.10          34
hasTie        0.04          12
hasSuit       0.01           4
hasShirt      0.01           3

• For attributes other than {G,C,H,B}, we let corpus frequency determine the order

• E.g., IA-GBHC uses type, G, B, H, C, age, hasTie, hasSuit, hasShirt as its PO
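A sketch of how such a full PO could be assembled: TYPE first, then the chosen ordering of G, B, H and C, then the remaining attributes by descending corpus frequency. The function and variable names are hypothetical, and the tail is restricted to the four attributes that appear in the IA-GBHC example (which does not list orientation).

```python
# Corpus frequency (Sum column) of the tail attributes in the IA-GBHC example.
TAIL_FREQUENCY = {"hasTie": 12, "hasShirt": 3, "age": 34, "hasSuit": 4}

# One-letter codes used on the slides.
CODES = {"G": "hasGlasses", "B": "hasBeard", "H": "hasHair", "C": "hairColour"}


def full_preference_order(core):
    """Build a full PO from a core ordering such as 'GBHC'."""
    tail = sorted(TAIL_FREQUENCY, key=TAIL_FREQUENCY.get, reverse=True)
    return ["type"] + [CODES[code] for code in core] + tail


print(full_preference_order("GBHC"))
```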


Results People Domain

[Chart: results per algorithm for the People domain. Surviving labels: “IA-BASE”, “Significant”, “Significant by subjects” and “GR”.]


Results People domain

• IA_base performs very badly now

• So much for the best IAs that start with {B,H,G,C} and end with <age, hasTie, hasSuit, hasShirt>

• Some of these did much worse:
  – IA_BHCG had DICE = 0.6, making it significantly worse (by subjects) than GR!


Summary

• People domain gives much lower DICE scores than Furniture domain

• Difference between “good” and “bad” POs was enormous in People domain


Summary

• The “Incremental Algorithm” (IA):
  – not an algorithm but a class of algorithms

• The best IA beats all other algorithms, but the worst is very bad ...

• GR performs remarkably well.

• How to choose a suitable PO?
  – Furniture: few attributes; psycholinguistic precedent
    • Still, there is variation.
  – People: more attributes; no precedents
    • Variation even greater!


Discussion

• Suppose you want to build a GRE algorithm for a new and complex domain, for which no transparent corpus is available.

• Psycholinguistic principles are unlikely to help you much

• If corpus is also not balanced, then frequency doesn’t say much either …


Other uses of this method: STEC

• Summer 2007: First NLG Shared task Evaluation Challenge (STEC)

• STEC involved GRE only, focussing on Content Determination

• 22 GRE Algorithms were submitted and evaluated (6 teams)

• Reported in UCNLG+MT workshop, Copenhagen, Sept 2007


Other uses of this corpus: STEC

• Each algorithm was compared with the TUNA corpus (minus 40% training set)
  – Both Furniture and People domain
  – DICE measured “humanlikeness”
  – Singulars only

• Each algorithm was also tested in terms of identification time (by human reader)


Other uses of this corpus: STEC

• Future STEC:
  – beyond “first mention”
  – beyond Content Determination
  – more hearer-oriented experiments


STEC results

1. The more minimal the descriptions generated by these 22 systems were, the worse their DICE scores were


2. No relation between humanlikeness and identification time

– Best system in terms of DICE was worst-but-one in terms of identification time

• More research needed on the different criteria for judging NLG output


Thank you


Annotator agreement

• Semantic markup was applied manually to all descriptions in the corpus.

• 2 annotators were given a stratified random sample

• Comparison used Dice.

                 mean           mode
Furniture        0.89 (A/B)     1 (71.1%)
  Annotator A    0.93 (A/us)    1 (74.4%)
  Annotator B    0.92 (B/us)    1 (73%)
People           0.89 (A/B)     1 (70%)
  Annotator A    0.84 (A/us)    1 (41.1%)
  Annotator B    0.78 (B/us)    1 (36.3%)