Evaluating Answer Validation in multi-stream Question
Answering
Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo
UNED NLP & IR group
nlp.uned.es
The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008)
Tokyo, 16 December 2008
Content
1. Context and motivation
   • Question Answering at CLEF
   • Answer Validation Exercise at CLEF
2. Evaluating the validation of answers
3. Evaluating the selection of answers
   • Correct selection
   • Correct rejection
4. Analysis and discussion
5. Conclusion
Evolution of the CLEF-QA Track
2003 → 2009:

• Target languages: 3 (2003) → 7 → 8 → 9 → 10 → 11
• Collections: News 1994; + News 1995; + Wikipedia Nov. 2006; JRC-Acquis (EU official documents)
• Type of questions: 200 factoid; + temporal restrictions; + definitions; - type of question; + lists; + linked questions; + closed lists; factoid / definition / motive / purpose / procedure
• Supporting information: document → snippet → paragraph
• Pilots and exercises: temporal restriction; lists; AVE, Real Time, WiQA; AVE, QAST; AVE, QAST, WSDQA; GikiCLEF, QAST
Evolution of Results
2003 - 2006 (Spanish):
• Overall: best result < 60%
• Definitions: best result > 80% (NOT an IR approach)
Pipeline Upper Bounds
Use Answer Validation to break the pipeline.

Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer

Stage accuracies multiply along the pipeline: 1.0 × 0.8 × 0.8 = 0.64

Answer Validation breaks the pipeline when there is not enough evidence for the answer.
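As a minimal sketch of this upper-bound arithmetic (the stage accuracies are the slide's illustrative values; the function name is ours):

```python
# Minimal sketch: a pipelined QA system's accuracy is bounded by the
# product of its stages' accuracies (values from the slide's example).

def pipeline_upper_bound(stage_accuracies):
    """Return the product of the per-stage accuracies."""
    bound = 1.0
    for accuracy in stage_accuracies:
        bound *= accuracy
    return bound

# Question analysis (1.0) x Passage retrieval (0.8) x Answer extraction (0.8)
print(pipeline_upper_bound([1.0, 0.8, 0.8]))  # ~0.64
```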
Results in CLEF-QA 2006 (Spanish)
• Perfect combination of systems: 81%
• Best single system: 52.5%
• Different systems were the best with ORGANIZATION, PERSON, and TIME questions
Collaborative architectures
Different systems answer different types of questions better:
• Specialisation
• Collaboration

QA sys1, QA sys2, QA sys3, ..., QA sysn receive the question and produce candidate answers; an Answer Validation & Selection module returns the final answer.

Evaluation Framework
Collaborative architectures
How to select the good answer?
• Redundancy
• Voting (a toy sketch follows below)
• Confidence score
• Performance history

Why not deeper analysis?
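A toy sketch of the voting strategy listed above (the function name and data shapes are ours, not part of AVE; confidence scores or per-system performance histories would enter as weights):

```python
# Toy sketch: select the candidate answer returned by the most QA streams.
from collections import Counter

def select_by_voting(candidate_answers):
    """candidate_answers: one answer string per QA stream."""
    counts = Counter(answer.strip().lower() for answer in candidate_answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

print(select_by_voting(["Rome", "rome", "Paris"]))  # -> "rome"
```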
Answer Validation Exercise (AVE)
Objective
Validate the correctness of the answers given by real QA systems (the participants at CLEF QA)
Answer Validation Exercise (AVE)
A QA system provides: question, candidate answer, supporting text.

• AVE 2006: validation cast as Textual Entailment
• AVE 2007 - 2008: automatic hypothesis generation (question + candidate answer → hypothesis), then Answer Validation

Decision: "answer is correct" vs. "answer is not correct or not enough evidence"
Techniques in AVE 2007
From the AVE 2007 overview (number of participant systems using each technique):

• Generates hypotheses: 6
• WordNet: 3
• Chunking: 3
• n-grams, longest common subsequences: 5
• Phrase transformations: 2
• NER: 5
• Numeric expressions: 6
• Temporal expressions: 4
• Coreference resolution: 2
• Dependency analysis: 3
• Syntactic similarity: 4
• Functions (subj, obj, etc.): 3
• Syntactic transformations: 1
• Word-sense disambiguation: 2
• Semantic parsing: 4
• Semantic role labeling: 2
• First-order logic representation: 3
• Theorem prover: 3
• Semantic similarity: 2
Evaluation linked to main QA task
• The Question Answering Track provides the questions, the systems' answers, and the systems' supporting texts.
• AVE participants validate each answer (YES, NO).
• Human judgements from the QA Track (R, W, X, U) are mapped to (YES, NO) and used as the gold standard.
• Evaluation against the QA Track results produces the AVE Track results: human assessments are reused.
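A sketch of that mapping. The slides only state that the QA judgements (R, W, X, U) are mapped to (YES, NO); reading R as right, W as wrong, X as inexact, U as unsupported, and sending only R to YES is our assumption:

```python
# Hypothetical sketch of the (R, W, X, U) -> (YES, NO) mapping.
# Assumption: only answers judged R (right) should be validated;
# W (wrong), X (inexact) and U (unsupported) should be rejected.

JUDGEMENT_TO_GOLD = {
    "R": "YES",
    "W": "NO",
    "X": "NO",
    "U": "NO",
}

def gold_validation(judgement: str) -> str:
    """Map a human QA judgement to the gold YES/NO validation label."""
    return JUDGEMENT_TO_GOLD[judgement]
```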
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Participant systems at CLEF QA (QA sys1, QA sys2, QA sys3, ..., QA sysn) receive the question and produce candidate answers; the Answer Validation & Selection module returns the final answer.

The evaluation proposed here targets this Answer Validation & Selection step.
Collections
<q id="116" lang="EN"><q_str> What is Zanussi? </q_str><a id="116_1" value="">
<a_str> was an Italian producer of home appliances </a_str><t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
</a><a id="116_2" value="">
<a_str> who had also been in Cassibile since August 31 </a_str><t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
</a><a id="116_4" value="">
<a_str> 3 </a_str><t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
</a></q>
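A minimal sketch of reading this collection format with the Python standard library; element and attribute names follow the excerpt above, while the file name is hypothetical:

```python
# Sketch: iterate over questions and candidate answers in an AVE-style
# collection. Element/attribute names follow the excerpt above.
import xml.etree.ElementTree as ET

root = ET.parse("ave_collection.xml").getroot()
for q in root.iter("q"):
    question = q.find("q_str").text.strip()
    for a in q.findall("a"):
        answer = a.find("a_str").text.strip()
        support = a.find("t_str").text.strip()
        print(q.get("id"), a.get("id"), answer[:40], "|", support[:40])
```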
Evaluating the Validation
Validation: decide whether each candidate answer is correct or not
• YES | NO

Collections are not balanced.

Approach: detect whether there is enough evidence to accept an answer
Measures: precision, recall and F over correct answers
Baseline system: accept all answers
Evaluating the Validation
                   Correct Answer   Incorrect Answer
Answer Accepted    n_CA             n_WA
Answer Rejected    n_CR             n_WR

recall = n_CA / (n_CA + n_CR)
precision = n_CA / (n_CA + n_WA)
F = 2 · precision · recall / (precision + recall)
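A minimal sketch of these measures in code, assuming nonzero denominators (the function name is ours):

```python
# Validation measures over correct answers, using the contingency-table
# counts above: n_ca/n_wa = correct/wrong answers accepted,
# n_cr/n_wr = correct/wrong answers rejected.

def validation_scores(n_ca, n_wa, n_cr, n_wr):
    recall = n_ca / (n_ca + n_cr)       # correct answers that were accepted
    precision = n_ca / (n_ca + n_wa)    # accepted answers that were correct
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# The "accept all answers" baseline has recall = 1.0 and precision equal
# to the proportion of correct answers in the collection.
```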
Evaluating the Selection
Quantify the potential gain of Answer Validation in Question Answering
• Compare AV systems with QA systems
Develop measures more comparable to QA accuracy
qa_accuracy = n_correctly_answered_questions / n_questions
Evaluating the Selection
Given a question with several candidate answers, there are two options:

Selection: select an answer ≡ try to answer the question
• Correct selection: the selected answer was correct
• Incorrect selection: the selected answer was incorrect

Rejection: reject all candidate answers ≡ leave the question unanswered
• Correct rejection: all candidate answers were incorrect
• Incorrect rejection: not all candidate answers were incorrect
Evaluating the Selection
n questions: n = n_CA + n_WA + n_WS + n_WR + n_CR

                                               Question with     Question without
                                               Correct Answer    Correct Answer
Question Answered Correctly (one selected)     n_CA              -
Question Answered Incorrectly                  n_WA              n_WS
Question Unanswered (all answers rejected)     n_WR              n_CR

qa_accuracy = n_CA / n
recall = n_CA / (n_CA + n_WA + n_WR)
precision = n_CA / (n_CA + n_WA)
% of best selection = recall · 100

Not comparable to qa_accuracy.
Evaluating the Selection
n questions: n = n_CA + n_WA + n_WS + n_WR + n_CR (same table as above)

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n
Evaluating the Selection
qa_accuracy = n_CA / n
rej_accuracy = n_CR / n
accuracy = (n_CA + n_CR) / n

Rewards rejection (collections are not balanced).
Interpretation for QA: all questions correctly rejected by AV will be answered correctly.
Evaluating the Selection
estimated = n_CA / n + (n_CR / n) · (n_CA / n) = (n_CA / n) · (1 + n_CR / n)

qa_accuracy = n_CA / n
rej_accuracy = n_CR / n

Interpretation for QA: questions correctly rejected by AV will be answered correctly in qa_accuracy proportion.
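A minimal sketch putting the selection measures together (counts as in the table above; the example numbers are invented for illustration):

```python
# Selection measures over n = n_ca + n_wa + n_ws + n_wr + n_cr questions:
# n_ca = answered correctly, n_wa/n_ws = answered incorrectly (with/without
# a correct candidate), n_wr/n_cr = unanswered (with/without a correct one).

def selection_scores(n_ca, n_wa, n_ws, n_wr, n_cr):
    n = n_ca + n_wa + n_ws + n_wr + n_cr
    qa_accuracy = n_ca / n
    rej_accuracy = n_cr / n
    accuracy = (n_ca + n_cr) / n              # rewards rejection
    estimated = qa_accuracy * (1 + n_cr / n)  # correctly rejected questions
                                              # re-answered in qa_accuracy proportion
    return qa_accuracy, rej_accuracy, accuracy, estimated

# Invented example: 100 questions.
print(selection_scores(40, 20, 10, 10, 20))
# -> approximately (0.4, 0.2, 0.6, 0.48)
```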
Content
1. Context and motivation
2. Evaluating the validation of answers
3. Evaluating the selection of answers
4. Analysis and discussion
5. Conclusion
Analysis and discussion (AVE 2007 English)
Validation
Selection
qa_accuracy is correlated with recall (R); the "estimated" measure adjusts for it.
Multi-stream QA performance (AVE 2007 English)
Analysis and discussion (AVE 2007 Spanish)
Validation
Selection
Comparing AV & QA
Conclusion
Evaluation framework for Answer Validation & Selection systems
Measures that reward not only Correct Selection but also Correct Rejection
• Promote improvement of QA systems
Allow comparison between AV and QA systems
• Under what conditions multi-stream QA performs better
• Room for improvement just using multi-stream QA
• Potential gain that AV systems can provide to QA
Thanks!
http://nlp.uned.es/clef-qa/ave
http://www.clef-campaign.org
Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)