Spanish Question Answering Evaluation
Anselmo Peñas, Felisa Verdejo and Jesús Herrera
UNED NLP Group, Distance Learning University of Spain
CICLing 2004, Seoul
Question Answering task
• Give an answer to a question
  – Approach: find (search) an answer in a document collection
  – A document must support the answer
• Where is Seoul?
  – South Korea (correct)
  – Korea (responsive?)
  – Asia (non-responsive)
  – Population of South Korea (inexact)
  – Oranges of China (incorrect)
QA system architecture
Question → Question analysis (answer type/structure, key terms) → Passage retrieval → Answer extraction → Answer validation/scoring → Answer
Documents → Pre-processing/indexing → Passage retrieval
Opportunity for natural language techniques throughout the pipeline
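The pipeline on this slide can be sketched in code. Everything below is illustrative — the stage functions and the toy heuristics are placeholders I am assuming for the sketch, not any participating system's actual implementation:

```python
# Minimal sketch of the QA pipeline on this slide.
# All function bodies are illustrative placeholders, not a real system.

def analyze_question(question):
    """Determine the expected answer type and extract key terms."""
    tokens = question.rstrip("?").split()
    answer_type = "LOCATION" if question.startswith("Where") else "OTHER"
    key_terms = [t for t in tokens if t[0].isupper() and t != tokens[0]]
    return answer_type, key_terms

def retrieve_passages(key_terms, collection):
    """Return passages from the (pre-indexed) collection containing all key terms."""
    return [doc for doc in collection if all(t in doc["text"] for t in key_terms)]

def extract_answers(passages, answer_type):
    """Extract candidate answers of the expected type from the passages."""
    candidates = []
    for p in passages:
        # Toy heuristic: the phrase after "is in" answers a LOCATION question.
        if answer_type == "LOCATION" and "is in" in p["text"]:
            candidates.append((p["text"].split("is in")[1].strip(" ."), p["id"]))
    return candidates

def answer_question(question, collection):
    """Run the full pipeline: analysis -> retrieval -> extraction -> scoring."""
    answer_type, key_terms = analyze_question(question)
    passages = retrieve_passages(key_terms, collection)
    candidates = extract_answers(passages, answer_type)
    # Validation/scoring step: here, just take the first supported candidate.
    return candidates[0] if candidates else ("NIL", None)

docs = [{"id": "d1", "text": "Seoul is in South Korea."}]
print(answer_question("Where is Seoul?", docs))  # ('South Korea', 'd1')
```

Note that the answer comes back paired with a document id: as the slide stresses, a document must support the answer.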
Overview
• Evaluation forums: objectives
• QA evaluation methodology
• The challenge of multilingualism
• QA at CLEF 2003
• QA at CLEF 2004
• Conclusion
Evaluation Forums: Objectives (CLEF, TREC, NTCIR)
• Stimulate research
• Establish shared lines of work
• Generate resources for evaluation and for training
• Compare different approaches and obtain some evidence
• Serve as a meeting point for collaboration and exchange
QA Evaluation Methodology
• Test suite production:
  – Document collection (hundreds of thousands of documents)
  – Questions (hundreds)
• Systems answer:
  – (Answer + document id)
  – Limited time
• Judgment of answers:
  – Human assessors
  – Correct, Inexact, Unsupported, Incorrect
• Measuring system behavior:
  – % of questions correctly answered
  – % of NIL questions correctly detected
  – Precision, Recall, F, MRR (Mean Reciprocal Rank), confidence-weighted score, ...
• Results comparison
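One of the ranking measures mentioned here, MRR, can be computed as follows (a self-contained sketch; the list-of-judgments input format is an assumption for illustration):

```python
def mean_reciprocal_rank(runs):
    """runs: one ranked list of answer judgments per question, where True
    marks a correct answer. MRR averages 1/rank of the first correct
    answer per question (0 when no answer is correct)."""
    total = 0.0
    for judgments in runs:
        for rank, correct in enumerate(judgments, start=1):
            if correct:
                total += 1.0 / rank
                break
    return total / len(runs)

# Three questions, up to three ranked answers each (as at CLEF 2003):
runs = [
    [True, False, False],   # correct at rank 1 -> contributes 1.0
    [False, True, False],   # correct at rank 2 -> contributes 0.5
    [False, False, False],  # no correct answer -> contributes 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```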
QA Evaluation Methodology: Considerations on task definition (I)
• Quantitative evaluation constrains the type of questions
  – Questions must be assessable in terms of correctness, completeness and exactness
  – e.g. “Which are the causes of the Iraq war?” is hard to assess
• Human resources available
  – Test suite generation
  – Assessment (# of questions, # of answers per question)
• Collection
  – Restricted vs. unrestricted domains
  – News vs. patents
  – Multilingual QA: are comparable collections available?
QA Evaluation Methodology: Considerations on task definition (II)
• Research direction
  – “Do it better” versus “How to get better results?”
  – Systems are tuned according to the evaluation task
  – e.g. the evaluation measure, external resources (the web)
• Roadmap versus state of the art
  – What should systems do in the future? (Burger, 2000-2002)
  – When is it realistic to incorporate new features into the evaluation? Types of questions, temporal restrictions, confidence in answers, encyclopedic knowledge and inference, different sources and languages, consistency between different answers, ...
The challenge of multilingualism
May I continue this talk in Spanish?
Then multilingualism still remains a challenge...
The challenge of multilingualism
• Feasible with the current QA state of the art? A challenge for systems, but also a challenge from the evaluation point of view:
• What is a possible roadmap toward fully multilingual systems?
  – QA at CLEF (Cross-Language Evaluation Forum)
  – Monolingual → bilingual → multilingual systems
• What tasks can be proposed given the current state of the art?
  – Monolingual other than English? Bilingual involving English?
  – Any bilingual pair? Fully multilingual?
• Which new resources are needed for the evaluation?
  – Comparable corpus? Unrestricted domain?
  – Parallel corpus? Domain-specific? Size?
  – Human resources: answers in any language make assessment by native speakers difficult
The challenge of multilingualism (cont.)
• How to ensure that fully multilingual systems receive a better evaluation?
  – Some answers available in just one language? How?
    » Hard pre-assessment?
    » Different languages for different domains?
    » Different languages for different dates or localities?
    » Parallel collections, extracting a controlled subset of documents different for each language?
  – How to balance the type and difficulty of questions across all languages?
250 questions, 50 per source language; in each set of 50, ten questions are answerable only in each of the five languages:

                 Answer only in:
  Questions in   Spanish  Italian  Dutch  German  French
  Spanish (50)     10       10      10     10      10
  Italian (50)     10       10      10     10      10
  Dutch (50)       10       10      10     10      10
  German (50)      10       10      10     10      10
  French (50)      10       10      10     10      10

Ouch!
The challenge of multilingualism
Fortunately (or unfortunately), with the current state of the art it is not realistic to plan such an evaluation...
Very few systems are able to deal with several target languages
...yet
While we try to answer these questions, planning a separate evaluation for each target language seems more realistic
This is the option followed by QA at CLEF in the short term
Overview
• Evaluation forums: objectives
• QA evaluation methodology
• The challenge of multilingualism
• QA at CLEF 2003
• QA at CLEF 2004
• Conclusion
QA at CLEF groups
• ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
• UNED, Universidad Nacional de Educación a Distancia, Madrid, Spain
• ILLC, Language and Inference Technology Group, U. of Amsterdam
• DFKI, Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
• ELDA/ELRA, Evaluations and Language Resources Distribution Agency, Paris, France
• Linguateca, Oslo (Norway), Braga, Lisbon & Porto (Portugal)
• BulTreeBank Project, CLPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
• University of Limerick, Ireland
• ISTI-CNR, Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Pisa, Italy
• NIST, National Institute of Standards and Technology, Gaithersburg, USA
QA at CLEF 2003
• Task
  – 200 factoid questions, up to 3 answers per question
  – Exact answer / answer in a 50-byte long string
• Document collection
  – [Spanish] >200,000 news articles (EFE, 1994)
• Questions
  – DISEQuA corpus (available on the web) (Magnini et al., 2003):
    » Coordinated work between ITC-irst (Italian), UNED (Spanish) and U. Amsterdam (Dutch)
    » 450 questions and answers translated into English, Spanish, Italian and Dutch
  – 200 questions from the DISEQuA corpus (20 NIL)
• Assessment
  – Incorrect, Unsupported, Non-exact, Correct
Multilingual pool of questions
Coordination between several groups:
• Each group contributes 100 questions with a known answer in its target language: Spanish (100), Italian (100), Dutch (100), German (100), French (100)
• Translation into English → English pool (500)
• Translation into the rest of the languages → multilingual pool (500 × 6: Spanish, Italian, Dutch, German, French, English)
• Final questions are selected from the pool, for each target language, after pre-assessment
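The pool-building procedure described here can be sketched as follows. The `translate` function is a hypothetical placeholder — in the real exercise, translation was done by the coordinating groups themselves — and the data is synthetic:

```python
LANGUAGES = ["Spanish", "Italian", "Dutch", "German", "French", "English"]
SOURCE_LANGUAGES = LANGUAGES[:-1]  # each contributes 100 questions

def translate(text, source, target):
    """Hypothetical placeholder: the real translations were produced by
    the participating groups, pivoting through English."""
    return f"[{target}] {text}"

def build_multilingual_pool(questions_by_language):
    """Each group contributes questions with a known answer in its own
    language; every question ends up in all six languages (500 x 6)."""
    pool = []
    for source, questions in questions_by_language.items():
        for q in questions:
            # First into English, then into the rest of the languages.
            versions = {source: q, "English": translate(q, source, "English")}
            for target in LANGUAGES:
                if target not in versions:
                    versions[target] = translate(versions["English"], "English", target)
            pool.append(versions)
    return pool

contributions = {lang: [f"{lang} question {i}" for i in range(100)]
                 for lang in SOURCE_LANGUAGES}
pool = build_multilingual_pool(contributions)
print(len(pool), len(pool[0]))  # 500 6
```

The final selection and pre-assessment steps are omitted, since they were manual.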
QA at CLEF 2003
QA at CLEF 2004: tasks
• Source languages (questions): English, Spanish, French, German, Italian, Portuguese, Dutch, Bulgarian, ... Korean?
• Target languages (answers & docs.): English, Spanish, French, German, Italian, Dutch, Portuguese?
• Six main tasks (one per target language)
  – e.g. Spanish target: EFE 1994-1995, 1086 Mb (453,045 docs)
QA at CLEF 2004
• 200 questions
  – Factual: person, object, measure, organization, ...
  – Definition: person, organization
  – How-to
• 1 answer per question (without manual intervention)
• Up to two runs
• Exact answers
• Assessment: correct, inexact, unsupported, incorrect
• Evaluation:
  – Fraction of correct answers
  – Measures based on systems' self-scoring
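A typical measure based on a system's self-scoring is the TREC-style confidence-weighted score, which rewards systems that rank the answers they are most confident about first. A minimal sketch (the pair-per-question input format is an assumption for illustration):

```python
def confidence_weighted_score(answers):
    """answers: one (confidence, is_correct) pair per question.
    Answers are sorted by the system's own confidence, and correct
    answers ranked early weigh more:
    CWS = (1/Q) * sum_i (#correct in first i) / i."""
    ranked = sorted(answers, key=lambda a: a[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for i, (confidence, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            correct_so_far += 1
        total += correct_so_far / i
    return total / len(ranked)

# A confident correct answer scores better than the same answers
# with the confidences reversed, although accuracy is identical:
print(confidence_weighted_score([(0.9, True), (0.1, False)]))  # 0.75
print(confidence_weighted_score([(0.9, False), (0.1, True)]))  # 0.25
```

This is why self-scoring matters: two systems with the same fraction of correct answers can differ widely in CWS.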
QA at CLEF 2004 Schedule
• Registration opens: January 15
• Corpora release: February
• Trial data: March
• Test sets release: May 10
• Submission of runs: May 17
• Release of results: from July 15
• Papers: August 15
• CLEF Workshop: 15-16 September
Conclusion
Information and resources:
• Cross-Language Evaluation Forum: http://clef-qa.itc.it/2004
• DISEQuA Corpus: Dutch, Italian, Spanish, English
• Spanish QA at CLEF: http://nlp.uned.es/QA