Page 1

Spanish Question Answering Evaluation

Anselmo Peñas, Felisa Verdejo and Jesús Herrera

UNED NLP Group, Distance Learning University of Spain

CICLing 2004, Seoul

Page 2

Question Answering task

Give an answer to a question
– Approach: find (search) an answer in a document collection
– A document must support the answer
– Example: Where is Seoul?
  • South Korea (correct)
  • Korea (responsive?)
  • Asia (non-responsive)
  • Population of South Korea (inexact)
  • Oranges of China (incorrect)

Page 3

QA system architecture

[Pipeline diagram: Question → Question Analysis (answer type/structure, key terms) → Passage Retrieval over the pre-processed/indexed Documents → Answer Extraction → Answer Validation/Scoring → Answer. Each stage is an opportunity for natural language techniques.]
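The pipeline above can be sketched end to end. This is a toy illustration only, assuming a tiny in-memory collection and a single extraction pattern; none of the function names, patterns or document ids come from a real CLEF system:

```python
import re

# Toy sketch of the pipeline on the slide: question analysis, passage
# retrieval, answer extraction, validation/scoring. All data, patterns
# and function names here are illustrative, not from a real system.

DOCS = {
    "efe-001": "Seoul is the capital of South Korea.",
    "efe-002": "Oranges are grown in China and Spain.",
}

def analyze(question):
    """Question analysis: guess the expected answer type and key terms."""
    answer_type = "LOCATION" if question.lower().startswith("where") else "OTHER"
    key_terms = [w for w in re.findall(r"\w+", question) if w[0].isupper()]
    return answer_type, key_terms

def retrieve(key_terms):
    """Passage retrieval: documents containing any key term."""
    return {doc_id: text for doc_id, text in DOCS.items()
            if any(t in text for t in key_terms)}

def extract(passages, answer_type):
    """Answer extraction: one toy pattern for LOCATION questions."""
    candidates = []
    for doc_id, text in passages.items():
        m = re.search(r"capital of ([A-Z][\w ]+?)\.", text)
        if m and answer_type == "LOCATION":
            candidates.append((m.group(1), doc_id))
    return candidates

def answer(question):
    """Run the full pipeline; keep the supporting document id."""
    answer_type, key_terms = analyze(question)
    candidates = extract(retrieve(key_terms), answer_type)
    # Validation/scoring: a real system would rank candidates; here we
    # simply take the first one, or NIL when nothing was extracted.
    return candidates[0] if candidates else ("NIL", None)

print(answer("Where is Seoul?"))  # → ('South Korea', 'efe-001')
```

Returning the supporting document id alongside the answer string mirrors the (answer + document id) response format used in the evaluation.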

Page 4

Overview

Evaluation forums: objectives
QA evaluation methodology
The challenge of multilingualism
QA at CLEF 2003
QA at CLEF 2004
Conclusion

Page 5

Evaluation Forums: Objectives

– Stimulate research
– Establish shared working lines
– Generate resources for evaluation and for training
– Compare different approaches and obtain some evidence
– Serve as a meeting point for collaboration and exchange

(CLEF, TREC, NTCIR)

Page 6

QA Evaluation Methodology

– Test suite production:
  • Document collection (hundreds of thousands of documents)
  • Questions (hundreds)
– Systems answering:
  • (Answer + document id)
  • Limited time
– Judgment of answers:
  • Human assessors
  • Correct, Inexact, Unsupported, Incorrect
– Measuring system behavior:
  • % of questions correctly answered
  • % of NIL questions correctly detected
  • Precision, Recall, F, MRR (Mean Reciprocal Rank), confidence-weighted score, ...
– Results comparison
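One of the measures listed above, MRR, is simple to make concrete. A minimal sketch, assuming each question comes with the assessors' judgments for its ranked answers ('R' marking a correct answer, 'W' anything else):

```python
def mrr(ranked_judgments):
    """Mean Reciprocal Rank: average over questions of 1/rank of the
    first correct answer, counting 0 when no returned answer is correct."""
    total = 0.0
    for judgments in ranked_judgments:
        for rank, judgment in enumerate(judgments, start=1):
            if judgment == "R":
                total += 1.0 / rank
                break
    return total / len(ranked_judgments)

# Three questions: correct at rank 1, correct at rank 3, none correct.
runs = [["R", "W", "W"], ["W", "W", "R"], ["W", "W", "W"]]
print(mrr(runs))  # → (1 + 1/3 + 0) / 3 ≈ 0.444
```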

Page 7

QA Evaluation Methodology
Considerations on task definition (I)

– Quantitative evaluation constrains the type of questions:
  • Questions must be evaluable in terms of correctness, completeness and exactness
  • e.g. "Which are the causes of the Iraq war?"
– Human resources available:
  • Test suite generation
  • Assessment (# of questions, # of answers per question)
– Collection:
  • Restricted vs. unrestricted domains
  • News vs. patents
  • Multilingual QA: comparable collections available

Page 8

QA Evaluation Methodology
Considerations on task definition (II)

– Research direction:
  • "Do it better" versus "How to get better results?"
  • Systems are tuned according to the evaluation task
  • e.g. evaluation measure, external resources (the web)
– Roadmap versus state of the art:
  • What should systems do in the future? (Burger, 2000-2002)
  • When is it realistic to incorporate new features in the evaluation?
    Type of questions, temporal restrictions, confidence in the answer, encyclopedic knowledge and inference, different sources and languages, consistency between different answers, ...

Page 9

The challenge of multilingualism

May I continue this talk in Spanish?

Then multilingualism still remains a challenge...

Page 10

The challenge of multilingualism

Feasible with the current QA state of the art: a challenge for systems, but also a challenge from the evaluation point of view.

• What is a possible roadmap to achieve fully multilingual systems?
  – QA at CLEF (Cross-Language Evaluation Forum)
  – Monolingual → Bilingual → Multilingual systems
• What tasks can be proposed according to the current state of the art?
  – Monolingual, other than English? Bilingual, considering English?
  – Any bilingual pair? Fully multilingual?
• Which new resources are needed for the evaluation?
  – Comparable corpus? Unrestricted domain?
  – Parallel corpus? Domain-specific? Size?
  – Human resources: answers in any language make assessment by native speakers difficult

Page 11

The challenge of multilingualism(cont.)

• How to ensure that fully multilingual systems receive a better evaluation?
  – Some answers in just one language? How?
    » Hard pre-assessment?
    » Different languages for different domains?
    » Different languages for different dates or localities?
    » Parallel collections, extracting a controlled subset of documents, different for each language?
  – How to balance the type and difficulty of questions across all languages?

For example, 250 questions, 50 per language, with 10 questions in each set answerable only in each of the five languages:

                 Answer only in:
                 Spanish  Italian  Dutch  German  French
   Spanish (50)    10       10      10     10      10
   Italian (50)    10       10      10     10      10
   Dutch   (50)    10       10      10     10      10
   German  (50)    10       10      10     10      10
   French  (50)    10       10      10     10      10

Ouch!

Page 12

The challenge of multilingualism

Fortunately (or unfortunately), with the current state of the art it is not realistic to plan such an evaluation...

Very few systems are able to deal with several target languages... yet.

While we try to answer these questions, planning a separate evaluation for each target language seems more realistic. This is the option followed by QA at CLEF in the short term.

Page 13

Overview

Evaluation forums: objectives
QA evaluation methodology
The challenge of multilingualism
QA at CLEF 2003
QA at CLEF 2004
Conclusion

Page 14

QA at CLEF groups

– ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
– UNED, Universidad Nacional de Educación a Distancia, Madrid, Spain
– ILLC, Language and Inference Technology Group, U. of Amsterdam
– DFKI, Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany
– ELDA/ELRA, Evaluations and Language Resources Distribution Agency, Paris, France
– Linguateca, Oslo (Norway), Braga, Lisbon & Porto (Portugal)
– BulTreeBank Project, CLPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
– University of Limerick, Ireland
– ISTI-CNR, Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Pisa, Italy
– NIST, National Institute of Standards and Technology, Gaithersburg, USA

Page 15

QA at CLEF 2003

– Task:
  • 200 factoid questions, up to 3 answers per question
  • Exact answer / answer in a 50-byte-long string
– Document collection:
  • [Spanish] >200,000 news articles (EFE, 1994)
– Questions:
  • DISEQuA corpus (available on the web) (Magnini et al. 2003):
    – Coordinated work between ITC-irst (Italian), UNED (Spanish) and U. of Amsterdam (Dutch)
    – 450 questions and answers translated into English, Spanish, Italian and Dutch
  • 200 questions from the DISEQuA corpus (20 NIL)
– Assessment:
  • Incorrect, Unsupported, Inexact, Correct
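The "% of NIL questions correctly detected" mentioned earlier is typically reported as precision and recall over the questions for which a system answered NIL. A minimal sketch with made-up data (the run below is hypothetical, not a real CLEF 2003 result):

```python
def nil_scores(predictions, gold_nil):
    """NIL-detection precision/recall. `predictions` maps a question id to
    the returned answer string; `gold_nil` is the set of question ids whose
    correct answer is NIL (no answer exists in the collection)."""
    returned_nil = {q for q, a in predictions.items() if a == "NIL"}
    tp = len(returned_nil & gold_nil)
    precision = tp / len(returned_nil) if returned_nil else 0.0
    recall = tp / len(gold_nil) if gold_nil else 0.0
    return precision, recall

# Made-up run over 200 questions, 20 of them NIL (as in the 2003 task):
# the toy system answers NIL for 25 questions and is right for 15 of them.
preds = {f"q{i}": "NIL" for i in range(25)}
preds.update({f"q{i}": "some answer" for i in range(25, 200)})
gold = {f"q{i}" for i in range(15)} | {f"q{i}" for i in range(195, 200)}
print(nil_scores(preds, gold))  # → (0.6, 0.75)
```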

Page 16

Multilingual pool of questions

Coordination between several groups:
– Questions with a known answer in each target language: Spanish (100), Italian (100), Dutch (100), German (100), French (100)
– Translation into English → English pool (500)
– Translation into the rest of the languages → Multilingual pool (500 × 6: Spanish, Italian, Dutch, German, French, English)
– Final questions are selected from the pool, for each target language, after pre-assessment
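The construction above can be sketched as a two-step translation. Here `translate()` is a hypothetical stand-in for the manual translation work done by the coordinating groups, and the questions are placeholder strings:

```python
LANGUAGES = ["Spanish", "Italian", "Dutch", "German", "French", "English"]

def translate(question, target):
    """Hypothetical stand-in for the manual translation step."""
    return f"[{target}] {question}"

def build_pool(source_groups):
    # Step 1: translate every source question into English -> English pool.
    english_pool = [translate(q, "English")
                    for questions in source_groups.values() for q in questions]
    # Step 2: translate the English pool into all six languages.
    return {lang: [translate(q, lang) for q in english_pool]
            for lang in LANGUAGES}

# 100 questions per source language -> English pool of 500 -> 500 x 6 pool.
groups = {lang: [f"{lang}-q{i}" for i in range(100)]
          for lang in ["Spanish", "Italian", "Dutch", "German", "French"]}
pool = build_pool(groups)
print(len(pool["Spanish"]), sum(len(v) for v in pool.values()))  # → 500 3000
```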

Page 17

QA at CLEF 2003

Page 18

QA at CLEF 2004: tasks

  Source languages   Target languages
  (questions)        (answers & docs)
  English            English
  Spanish            Spanish (EFE 1994-1995, 1086 Mb, 453,045 docs)
  French             French
  German             German
  Italian            Italian
  Portuguese         Dutch
  Dutch              Portuguese?
  Bulgarian
  Korean?

Six main tasks (one per target language), e.g. Spanish.

Page 19

QA at CLEF 2004

– 200 questions:
  • Factual: person, object, measure, organization, ...
  • Definition: person, organization
  • How-to
– 1 answer per question (without manual intervention)
– Up to two runs
– Exact answers
– Assessment: Correct, Inexact, Unsupported, Incorrect
– Evaluation:
  • Fraction of correct answers
  • Measures based on the systems' self-scoring
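One measure based on a system's self-scoring is the confidence-weighted score used at TREC 2002: answers are sorted by the system's own confidence, and correct answers ranked early count more. A minimal sketch:

```python
def confidence_weighted_score(answers):
    """Confidence-weighted score (as at TREC 2002): sort questions by the
    system's confidence, then average the running precision at each rank.

    `answers` is a list of (confidence, is_correct) pairs, one per question.
    """
    ranked = sorted(answers, key=lambda a: a[0], reverse=True)
    correct_so_far, total = 0, 0.0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += is_correct
        total += correct_so_far / i
    return total / len(ranked)

# A well-calibrated system (confident answers are the correct ones)
# scores higher than its plain accuracy of 0.5.
answers = [(0.9, True), (0.7, True), (0.4, False), (0.1, False)]
print(confidence_weighted_score(answers))  # → (1/1 + 2/2 + 2/3 + 2/4) / 4 ≈ 0.792
```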

Page 20

QA at CLEF 2004 Schedule

  Registration opens     January 15
  Corpora release        February
  Trial data             March
  Test sets release      May 10
  Submission of runs     May 17
  Release of results     from July 15
  Papers                 August 15
  CLEF Workshop          September 15-16

Page 21

Conclusion

Information and resources:
– Cross-Language Evaluation Forum
  • http://clef-qa.itc.it/2004
  • DISEQuA Corpus: Dutch, Italian, Spanish, English
– Spanish QA at CLEF
  • http://nlp.uned.es/QA

([email protected])