Beyond TREC-QA II


Transcript of Beyond TREC-QA II

Page 1: Beyond TREC-QA II

Beyond TREC-QA II
Ling573: NLP Systems and Applications
May 30, 2013

Page 2: Beyond TREC-QA II

Reminder: New scoring script!!

compute_mrr_rev.py
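
For reference, the script scores mean reciprocal rank (MRR): each question contributes the reciprocal rank of its first correct answer. A minimal sketch of the computation (illustrative only; the function and argument names are not the script's actual interface):

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """MRR over questions.

    ranked_answers: dict qid -> list of answer strings, best first.
    gold: dict qid -> set of acceptable answer strings.
    """
    total = 0.0
    for qid, answers in ranked_answers.items():
        for rank, ans in enumerate(answers, start=1):
            if ans in gold.get(qid, set()):
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(ranked_answers) if ranked_answers else 0.0
```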

Page 3: Beyond TREC-QA II

Roadmap

Beyond TREC Question Answering

Distant supervision for web-scale relation extraction

Machine reading: Question Generation

Page 4: Beyond TREC-QA II

New Strategy: Distant Supervision

Supervision (examples) via a large semantic database.

Key intuition: if a sentence contains two entities from a Freebase relation, they should express that relation in the sentence.

Secondary intuition: many witness sentences expressing a relation can jointly contribute features to the relation classifier.

Advantages: avoids overfitting, uses named relations.
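
A minimal sketch of the distant-supervision labeling step under these intuitions. Everything here is illustrative: the relation table stands in for Freebase, and entity_pairs stands in for a named-entity tagger over a large parsed corpus.

```python
# Hypothetical stand-ins for Freebase and an NER-tagged corpus.
KB_RELATIONS = {
    ("Edwin Hubble", "Marshfield"): "/people/person/place_of_birth",
}

def distant_label(sentences, entity_pairs):
    """entity_pairs(sent) -> iterable of (e1, e2) mentions found in sent."""
    instances = []
    for sent in sentences:
        for e1, e2 in entity_pairs(sent):
            rel = KB_RELATIONS.get((e1, e2))
            if rel is not None:
                # Every witness sentence for the pair contributes features
                # toward one aggregated training instance for (e1, rel, e2).
                instances.append((e1, rel, e2, sent))
    return instances
```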


Page 8: Beyond TREC-QA II

Feature Extraction

Lexical features: conjuncts of:
Sequence of words between entities
POS tags of the sequence between entities
Flag for entity order
k words + POS before the 1st entity
k words + POS after the 2nd entity

Example: Astronomer Edwin Hubble was born in Marshfield, MO
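
A rough sketch of these lexical features, assuming the sentence is already tokenized and POS-tagged and the two entity mentions are given as token spans (feature naming here is illustrative, not the paper's exact encoding):

```python
def lexical_features(tokens, pos, e1, e2, k=2):
    """tokens/pos: parallel lists; e1, e2: (start, end) token spans, end exclusive."""
    order = "E1_E2" if e1[0] < e2[0] else "E2_E1"      # flag for entity order
    (s1, t1), (s2, t2) = sorted([e1, e2])
    between = tokens[t1:s2]                            # words between entities
    between_pos = pos[t1:s2]                           # their POS tags
    before = tokens[max(0, s1 - k):s1] + pos[max(0, s1 - k):s1]
    after = tokens[t2:t2 + k] + pos[t2:t2 + k]
    # Conjoin everything into one sparse feature (the slide's "conjuncts of").
    return "|".join([order,
                     " ".join(between),
                     " ".join(between_pos),
                     "B:" + " ".join(before),
                     "A:" + " ".join(after)])
```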


Page 11: Beyond TREC-QA II

Feature Extraction II

Syntactic features: conjuncts of:
Dependency path between entities, parsed by Minipar (chunks, dependencies, and directions)
A window node not on the dependency path
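
Minipar itself is hard to run today; as an illustration of the same idea, here is a sketch that extracts a dependency path between two entity head tokens using spaCy as a stand-in parser (assumes the en_core_web_sm model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_path(tok1, tok2):
    """Dependency relations on the path tok1 -> ... -> tok2 (direction-marked)."""
    anc1 = [tok1] + list(tok1.ancestors)
    anc2 = [tok2] + list(tok2.ancestors)
    lca = next(t for t in anc1 if t in anc2)           # lowest common ancestor
    up = [t.dep_ + "/up" for t in anc1[:anc1.index(lca)]]
    down = [t.dep_ + "/down" for t in reversed(anc2[:anc2.index(lca)])]
    return up + [lca.lemma_] + down

doc = nlp("Astronomer Edwin Hubble was born in Marshfield, MO.")
hubble = next(t for t in doc if t.text == "Hubble")
marshfield = next(t for t in doc if t.text == "Marshfield")
print(dependency_path(hubble, marshfield))
```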


Page 14: Beyond TREC-QA II

High Weight Features

Features are highly specific: is this a problem?
Not really: they are attested in a large text corpus.


Page 20: Beyond TREC-QA II

Evaluation Paradigm

Train on a subset of the data, test on the held-out portion.
Train on all relations using part of the corpus; test on new relations extracted from Wikipedia text.

How to evaluate newly extracted relations? Send them to human assessors.
Issue: 100s or 1000s of each type of relation.
Crowdsource: send to Amazon Mechanical Turk.
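
For the held-out portion, the basic measurement is precision over the top-ranked extractions at various cutoffs; a minimal sketch (names illustrative):

```python
def precision_at_k(ranked_triples, heldout_kb, ks=(100, 1000, 100000)):
    """ranked_triples: [(e1, relation, e2), ...] sorted by classifier confidence.
    heldout_kb: set of (e1, relation, e2) triples withheld from training."""
    out = {}
    for k in ks:
        top = ranked_triples[:k]
        out[k] = sum(t in heldout_kb for t in top) / len(top) if top else 0.0
    return out
```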


Page 22: Beyond TREC-QA II

Results

Overall, on the held-out set:
Best precision combines lexical and syntactic features.
Significant skew in identified relations: at 100,000 extractions, 60% location-contains, 13% person-birthplace.

Syntactic features are helpful in ambiguous, long-distance cases, e.g.:
Back Street is a 1932 film made by Universal Pictures, directed by John M. Stahl, …


Page 25: Beyond TREC-QA II

Human-Scored Results

At recall 100: combined lexical + syntactic features best.
At recall 1000: mixed.

Page 26: Beyond TREC-QA II

Distant Supervision

Uses a large database as the source of true relations.
Exploits co-occurring entities in a large text collection.
Scale of the corpus and richer syntactic features overcome limitations of earlier bootstrap approaches.

Yields reasonably good precision; drops somewhat with recall.
Skewed coverage of categories.

Page 27: Beyond TREC-QA II

Roadmap

Beyond TREC Question Answering

Distant supervision for web-scale relation extraction

Machine reading: Question Generation

Page 28: Beyond TREC-QA II

Question Generation

Mind the Gap: Learning to Choose Gaps for Question Generation, Becker, Basu, Vanderwende '12

The other side of question answering, related to "machine reading": generate questions based on arbitrary text.


Page 33: Beyond TREC-QA II

Motivation

Why generate questions? (Aside from playing Jeopardy!, of course.)

Educational (self-)assessment: testing aids in retention of studied concepts.
Reading and review are relatively passive; active retrieval and recall of concepts is more effective.

Shifts to less-structured learning settings: online study, reading, MOOCs.
Assessment there is difficult, so automatically generate it.


Page 35: Beyond TREC-QA II

Generating Questions

Prior work: a shared task on question generation, focused on:
Grammatical question generation
Creating distractors for multiple-choice questions

Here: generating good questions. Given a text, decompose into two steps:
What sentences form the basis of good questions?
What aspects of these sentences should we focus on?


Page 39: Beyond TREC-QA II

Overview

Goal: generate good gap-fill questions.
Allows focusing on content vs. form; later extend to wh-questions or multiple choice.

Approach:
Summarization-based sentence selection
Machine learning-based gap selection
Training data creation on Wikipedia data

Preview: 83% true positives vs. 19% false positives.


Page 44: Beyond TREC-QA II

Sentence Selection

Intuition: when studying a new subject, focus on main concepts first; nitty-gritty details later.

Insight: similar to an existing NLP task, summarization: pick out key content first.

Exploit a simple summarization approach based on SumBasic.
Best sentences are the most representative of the article: roughly, those containing the most frequent (non-stop) words in the article.
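
A rough sketch of a SumBasic-style selector: score each sentence by the average probability of its non-stop words, take the best, then down-weight the words just covered so later picks favor new content. The stopword list and tokenization here are simplified placeholders.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "on"}

def content_words(sentence):
    words = [w.lower().strip(".,;:()") for w in sentence.split()]
    return [w for w in words if w and w not in STOPWORDS]

def select_sentences(sentences, n=10):
    """SumBasic-style selection (simplified)."""
    toks = [content_words(s) for s in sentences]
    counts = Counter(w for sent in toks for w in sent)
    total = sum(counts.values())
    if total == 0:
        return sentences[:n]
    prob = {w: c / total for w, c in counts.items()}

    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < n:
        best = max(remaining,
                   key=lambda i: sum(prob[w] for w in toks[i]) / max(len(toks[i]), 1))
        selected.append(sentences[best])
        remaining.remove(best)
        for w in toks[best]:
            prob[w] = prob[w] ** 2   # squared update, as in SumBasic
    return selected
```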


Page 48: Beyond TREC-QA II

Gap Selection

How can we pick gaps? Heuristics? Possible, but unsatisfying.
Instead, an empirical approach using machine learning.

Methodology: generate and filter.
Generate many possible questions per sentence
Train a discriminative classifier to filter
Apply to new Q/A pairs


Page 54: Beyond TREC-QA II

Generate

What's a good gap? Any word position?
No: stopwords are only good for a grammar test.
No: bad for training, too many negative examples.

Who did what to whom… Items filling the main semantic roles:
Verb predicate, child NPs, APs
Based on a syntactic parser and semantic role labeling
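
The paper's generation step uses a full syntactic parse plus a semantic role labeler; as a rough stand-in, this sketch takes the main verb and its noun-chunk arguments from a spaCy parse as candidate gaps (illustrative only):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_gaps(sentence):
    """Generate (gap-fill question, answer) candidates from one sentence."""
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")          # main verb predicate
    spans = [root.text] + [chunk.text for chunk in doc.noun_chunks]
    candidates = []
    for gap in spans:
        candidates.append((sentence.replace(gap, "_____", 1), gap))
    return candidates

for question, answer in candidate_gaps("Edwin Hubble was born in Marshfield, Missouri."):
    print(question, "->", answer)
```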

Page 55: Beyond TREC-QA II

Processing Example

Page 56: Beyond TREC-QA II

Classification

Identify positive/negative examples:
Create a single score; above threshold → positive, otherwise negative
Extract a feature vector
Train a logistic regression learner
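
A minimal sketch of this filtering classifier using scikit-learn; the feature matrix would hold the count/lexical/syntactic/SRL/NE/link features described on the later Features slides, and the labeling threshold here is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_gap_filter(feature_matrix, mean_ratings, threshold=0.5):
    """mean_ratings: average crowd score per question (Good=1, else 0).
    Questions above the (illustrative) threshold count as positive examples."""
    X = np.asarray(feature_matrix)
    y = (np.asarray(mean_ratings) > threshold).astype(int)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    return clf

# At application time, filter new candidate Q/A pairs by predicted probability:
# keep = clf.predict_proba(X_new)[:, 1] > operating_point
```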


Page 61: Beyond TREC-QA II

Data

Documents: 105 Wikipedia articles listed as vital/popular, across a range of domains.

Sentences: 10 selected by the summarization measure, 10 random. Why? To avoid tuning.

Labels: human judgments via crowdsourcing.
Rate questions (Good/Okay/Bad). Alone? Not reliable; rated in the context of the original sentence and the other alternatives.


Page 64: Beyond TREC-QA II

HIT Set-up

Good: key concepts, fair to answer
Okay: key concepts, but odd/ambiguous/long to answer
Bad: unimportant, uninteresting

Label: Good = 1; otherwise 0.


Page 69: Beyond TREC-QA II

Another QA: Maintaining Quality in Crowdsourcing

Basic crowdsourcing issue: malicious or sloppy workers.

Validation:
Compare across annotators: if an annotator's average is > 2 std. dev. from the median, reject
Compare questions: exclude those with high rating variance (> 0.3)

Yield: 1821 questions, 700 “Good”; alpha = 0.51
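
A sketch of the two validation checks as described on the slide, over 0/1 (Good/not-Good) ratings; the data structures are illustrative:

```python
import statistics

def validate(ratings_by_worker, ratings_by_question):
    """ratings_by_worker: {worker_id: [0/1, ...]}; ratings_by_question: {qid: [0/1, ...]}."""
    means = {w: statistics.mean(r) for w, r in ratings_by_worker.items()}
    med = statistics.median(means.values())
    sd = statistics.pstdev(means.values())
    # Reject annotators whose average is more than 2 std. dev. from the median.
    rejected_workers = {w for w, m in means.items() if abs(m - med) > 2 * sd}
    # Exclude questions whose rating variance is above 0.3.
    kept_questions = {q for q, r in ratings_by_question.items()
                      if statistics.pvariance(r) <= 0.3}
    return rejected_workers, kept_questions
```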


Page 72: Beyond TREC-QA II

Features

Diverse features:
Count: 5
Lexical: 11
Syntactic: 112
Semantic role: 40
Named entity: 11
Link: 3

Characterizing the answer, the source sentence, and the relation between them.


Page 77: Beyond TREC-QA II

Features

Token count features:
Questions shouldn't be too short; answer gaps shouldn't be too long
Lengths, overlaps

Lexical features:
Specific words? Too specific.
Word class densities, good and bad: capitalized words in the answer; pronouns and stopwords in the answer.
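
A sketch of a few of these count and lexical features, given the source sentence and the answer gap as token lists (feature names and word lists are illustrative):

```python
def count_and_lexical_features(sentence_tokens, answer_tokens):
    """A handful of the count and lexical features described above."""
    pronouns = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}
    stop = {"the", "a", "an", "of", "in", "and", "to", "is", "was"}
    n_ans = max(len(answer_tokens), 1)
    return {
        "sentence_length": len(sentence_tokens),
        "answer_length": len(answer_tokens),
        "answer_to_sentence_ratio": len(answer_tokens) / max(len(sentence_tokens), 1),
        "answer_capitalized_density": sum(t[:1].isupper() for t in answer_tokens) / n_ans,
        "answer_pronoun_density": sum(t.lower() in pronouns for t in answer_tokens) / n_ans,
        "answer_stopword_density": sum(t.lower() in stop for t in answer_tokens) / n_ans,
    }
```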


Page 81: Beyond TREC-QA II

Features

Syntactic features:
Syntactic structure: e.g., answer depth, relation to the verb
POS: composition and context of the answer

Semantic role label features:
SRL spans and their relations to the answer
Primarily subject and object roles; the verb predicate has a strong association, but as a BAD gap

Named entity features:
NE density and type frequency in the answer; frequency in the sentence
Likely in good questions, but across types the majority are without them

Link features: a cue to the importance of linked spans
Density and ratio of links

Page 82: Beyond TREC-QA II

Results

Equal error rate: 83% TP, 19% FP
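
The equal error rate is the operating point where the false-positive and false-negative rates (roughly) match; a sketch of locating it from classifier scores with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_point(y_true, scores):
    """Return (threshold, true-positive rate, false-positive rate) where FPR ~= FNR."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    fnr = 1 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    return thresholds[i], tpr[i], fpr[i]
```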


Page 86: Beyond TREC-QA II

Observations

Performs well, and can be tuned to balance errors.
Is it Wikipedia-centric? No: little change without the Wikipedia features.
How much data does it need? The learning curve levels out around 1200 samples.
Which features matter? All types; feature weights are distributed across types.

Page 87: Beyond TREC-QA II

False Positives

Page 88: Beyond TREC-QA II

False Negatives


Page 90: Beyond TREC-QA II

Error Analysis

False positives: need some notion of predictability and repetition.

False negatives:


Page 94: Beyond TREC-QA II

Looking Back

TREC QA vs. Jeopardy!, web-scale relation extraction, question generation.

Obvious differences:
Scale; web vs. fixed documents; question-answer direction
Task-specific constraints: betting, timing, evidence, …

Core similarities:
Similar wide range of features applied
Deep + shallow approaches; learning & rules
Relations between question and answer (pattern/relation)