QA@CLEF 2006 Workshop, Alicante, September 22, 2006
Overview of the Multilingual Question Answering
Track
Danilo Giampiccolo
Outline
Tasks Test set preparation Participants Evaluation Results Final considerations Future perspectives
QA 2006: Organizing Committee
- ITC-irst (Bernardo Magnini): main coordinator
- CELCT (D. Giampiccolo, P. Forner): general coordination, Italian
- DFKI (B. Sacaleanu): German
- ELDA/ELRA (C. Ayache): French
- Linguateca (P. Rocha): Portuguese
- UNED (A. Peñas): Spanish
- U. Amsterdam (Valentin Jijkoun): Dutch
- U. Limerick (R. Sutcliffe): English
- Bulgarian Academy of Sciences (P. Osenova): Bulgarian
Only source languages:
- Depok University of Indonesia (M. Adriani): Indonesian
- IASI, Romania (D. Cristea): Romanian
- Wrocław University of Technology (J. Pietraszko): Polish
QA@CLEF-06: Tasks
Main task:
- Monolingual: the language of the question (source language) and the language of the news collection (target language) are the same
- Cross-lingual: the questions were formulated in a language different from that of the news collection

One pilot task:
- WiQA, coordinated by Maarten de Rijke

Two exercises:
- Answer Validation Exercise (AVE), coordinated by Anselmo Peñas
- Real Time, a "time-constrained" QA exercise coordinated by the University of Alicante (Fernando Llopis)
Data set: Question format
200 questions of three kinds:
- FACTOID (loc, mea, org, oth, per, tim; ca. 150): What party did Hitler belong to?
- DEFINITION (ca. 40): Who is Josef Paul Kleihues?
  - reduced in number (-25%)
  - two new categories added:
    - Object: What is a router?
    - Other: What is a tsunami?
- LIST (ca. 10): Name works by Tolstoy

Additionally:
- Temporally restricted (ca. 40): by date, by period, or by event
- NIL (ca. 20): questions that do not have any known answer in the target document collection

Input format: question type (F, D, L) not indicated.
Data set: run format

Multiple answers: from one to ten exact answers per question
- exact = neither more nor less than the information required
- each answer has to be supported by:
  - a docid
  - one to ten text snippets justifying the answer (substrings of the specified document giving the actual context)
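The supported-answer requirement above (a docid plus one to ten snippets that actually occur in the cited document) can be sketched as a small check. This is a minimal illustration assuming a toy in-memory collection and illustrative field names, not the official CLEF run syntax:

```python
# Minimal sketch of a run-format check: each answer must cite a document id
# and 1-10 snippets, and every snippet must be a substring of that document.
# Data structures and field names are illustrative, not the official format.

def validate_answer(answer, collection):
    """Return a list of problems found for one answer; empty list = OK."""
    problems = []
    doc = collection.get(answer["docid"])
    if doc is None:
        problems.append(f"unknown docid {answer['docid']!r}")
        return problems
    snippets = answer["snippets"]
    if not 1 <= len(snippets) <= 10:
        problems.append(f"expected 1-10 snippets, got {len(snippets)}")
    for s in snippets:
        if s not in doc:  # snippets must justify the answer in context
            problems.append(f"snippet not found in document: {s[:40]!r}")
    return problems

collection = {"LASTAMPA-1994-001": "Hitler belonged to the Nazi party ..."}
ans = {"docid": "LASTAMPA-1994-001",
       "snippets": ["Hitler belonged to the Nazi party"]}
print(validate_answer(ans, collection))  # → []
```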
Activated Tasks (at least one registered participant)

[Matrix of activated source (S) x target (T) language pairs; source languages: BG, DE, EN, ES, FR, IN, IT, NL, PT, PL, RO; target languages: BG, DE, EN, ES, FR, IT, NL, PT]

- 11 source languages (10 in 2005)
- 8 target languages (9 in 2005)
- No Finnish task; new languages: Polish and Romanian
Activated Tasks
            MONOLINGUAL   CROSS-LINGUAL   TOTAL
CLEF 2003        3              5            8
CLEF 2004        6             13           19
CLEF 2005        8             15           23
CLEF 2006        7             17           24

Questions were not translated into all the languages; the Gold Standard contains questions in multiple languages only for tasks where there was at least one registered participant.

More interest in cross-linguality.
Participants
            America  Europe  Asia  TOTAL        Registered   Newcomers  Veterans  Absent veterans
CLEF 2003      3        5      -      8
CLEF 2004      1       17      -     18 (+125%)     22           13         5           3
CLEF 2005      1       22      1     24 (+33%)      27            9        15           4
CLEF 2006      4       24      2     30 (+25%)      36           10        20           4
List of participants
ACRONYM            NAME                                                          COUNTRY
SYNAPSE            Synapse Développement                                         France
Ling-Comp          U. Rome La Sapienza                                           Italy
Alicante           U. Alicante, Informatica                                      Spain
Hagen              U. Hagen, Informatics                                         Germany
Daedalus           Daedalus Consortium                                           Spain
Jaen               U. Jaen, Intelligent Systems                                  Spain
ISLA               U. Amsterdam                                                  Netherlands
INAOE              Inst. Astrophysics, Optics & Electronics                      Mexico
DEPOK              U. Indonesia, Comp. Sci.                                      Indonesia
DFKI               DFKI, Language Technology                                     Germany
FURUI              Furui Lab., Tokyo Inst. Technology                            Japan
Linguateca         Linguateca-Sintef                                             Norway
LIC2M-CEA          Centre CEA Saclay                                             France
LINA               U. Nantes, LINA                                               France
Priberam           Priberam Informatica                                          Portugal
U.Porto            U. Porto, AI                                                  Portugal
U.Groningen        U. Groningen, Letters                                         Netherlands
Lab.Inf.D'Avignon  Laboratoire Informatique d'Avignon                            France
U.Sao Paulo        U. Sao Paulo, Math                                            Brazil
Vanguard           Vanguard Engineering                                          Mexico
LCC                Language Computer Corp.                                       USA
UAIC               U. "Al. I. Cuza" Iasi                                         Romania
Wroclaw U.         Wroclaw U. of Technology                                      Poland
RFIA-UPV           Univ. Politècnica de València                                 Spain
LIMSI              CNRS Lab, Orsay Cedex                                         France
U.Stuttgart        U. Stuttgart, NLP                                             Germany
ITC                ITC-irst                                                      Italy
JRC-ISPRA          Institute for the Protection and the Security of the Citizen  Italy
BTB                BulTreeBank Project, Sofia                                    Bulgaria
dltg               University of Limerick                                        Ireland

Industrial Companies
Submitted runs

            Total          # Monolingual   # Cross-lingual
CLEF 2003   17                   6               11
CLEF 2004   48 (+182%)          20               28
CLEF 2005   67 (+39.5%)         43               24
CLEF 2006   77 (+13%)           42               35
Number of answers and snippets per question

Number of RUNS with respect to number of answers:
- 1 answer: 44%
- more than 5 answers: 25%
- between 2 and 5 answers: 31%

Number of SNIPPETS for each answer:
- 1 snippet: 74%
- 2 snippets: 21%
- 3 snippets: 4%
- 4 or more snippets: 1%
Evaluation
As in previous campaigns:
- runs manually judged by native speakers
- each answer judged as Right, Wrong, ineXact, or Unsupported
- up to two runs for each participating group

Evaluation measures:
- Accuracy (for F, D): main evaluation score, calculated for the FIRST ANSWER only
  - excessive workload: some groups could manually assess only one answer (the first one) per question
  - 1 answer: Spanish and English; 3 answers: French; 5 answers: Dutch; all answers: Italian, German, Portuguese
- P@N for List questions

Additional evaluation measures:
- K1 measure
- Confidence Weighted Score (CWS)
- Mean Reciprocal Rank (MRR)
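The two measures that depend only on answer ranks can be sketched directly from the judgements. A minimal illustration, assuming per-question verdict lists where "R" marks a Right answer (the encoding is illustrative, not the official assessment files):

```python
# Sketch of first-answer accuracy and Mean Reciprocal Rank over judged runs.
# Each question contributes a list of verdicts in rank order; "R" = Right,
# anything else ("W", "X", "U") counts as not right. Ranks are 1-based.

def first_answer_accuracy(judgements):
    """Fraction of questions whose FIRST answer was judged Right."""
    return sum(j[0] == "R" for j in judgements) / len(judgements)

def mean_reciprocal_rank(judgements):
    """Average of 1/rank of the first Right answer (0 if none is Right)."""
    total = 0.0
    for j in judgements:
        for rank, verdict in enumerate(j, start=1):
            if verdict == "R":
                total += 1.0 / rank
                break
    return total / len(judgements)

# Three questions: Right at rank 1; Right at rank 2; no Right answer at all.
judged = [["R", "W"], ["W", "R", "X"], ["W", "U"]]
print(first_answer_accuracy(judged))  # → 0.333...
print(mean_reciprocal_rank(judged))   # → 0.5
```

Note how MRR rewards systems that rank a Right answer anywhere in the list, while the main accuracy score ignores everything past the first answer.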
Question Overlapping among Languages 2005-2006

[Bar chart: number of questions appearing in 1 to 9 languages, 2005 vs. 2006; y-axis from 0 to 450.]
Results: Best and Average scores

[Chart: best and average accuracy scores (%) for monolingual and bilingual tasks, CLEF 2003 to CLEF 2006; y-axis 0 to 100.]

* This result is still under validation.
Best results in 2004-2005-2006

[Chart: best accuracy (%) by target language (Bulgarian, German, English, Spanish, French, Italian, Dutch, Portuguese) for 2004, 2005 and 2006; y-axis 0 to 100.]

* This result is still under validation.
Participants in 2004-2005-2006: compared best results

[Chart: best results per campaign (2004, 2005, 2006) for DFKI, HAGEN, ALICANTE, INAOE, DAEDALUS, TALP, U.VALENCIA, ITC-irst, U.LIMERICK, GRONINGEN, LIMSI, LINGUATECA, PRIBERAM, LIC2M-CEA, LINA, SYNAPSE, U.INDONESIA, and BTB; y-axis 0 to 80.]
List questions
Best: 0.8333 (Priberam, Monolingual PT); Average: 0.138

Problems:
- Wrong classification of List questions in the Gold Standard
  - "Mention a Chinese writer" is not a List question!
- Definition of List questions:
  - "closed" List questions, asking for a finite number of answers:
    Q: What are the names of the two lovers from Verona separated by family issues in one of Shakespeare's plays?
    A: Romeo and Juliet.
  - "open" List questions, requiring a list of items as answer:
    Q: Name books by Jules Verne.
    A: Around the World in 80 Days.
    A: Twenty Thousand Leagues Under the Sea.
    A: Journey to the Centre of the Earth.
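For open List questions, the P@N measure used above reduces to precision over the first N returned items. A minimal sketch, with a hypothetical gold set and run (the Jules Verne titles here merely mirror the example):

```python
# Sketch of P@N for an open List question: the fraction of the first N
# returned items that appear in the set of known-correct answers.
# Gold set and run below are illustrative, not official assessment data.

def precision_at_n(returned, gold, n):
    """Precision over the top-N returned items; 0.0 for an empty run."""
    top = returned[:n]
    if not top:
        return 0.0
    return sum(item in gold for item in top) / len(top)

gold = {"Around the World in 80 Days",
        "Twenty Thousand Leagues Under the Sea",
        "Journey to the Centre of the Earth"}
run = ["Around the World in 80 Days", "War and Peace",
       "Journey to the Centre of the Earth"]
print(precision_at_n(run, gold, 3))  # → 0.666... (2 of the top 3 correct)
```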
Final considerations
- Increasing interest in multilingual QA:
  - More participants (30, +25%)
  - Two new languages as source (Romanian and Polish)
  - More activated tasks (24; there were 23 in 2005)
  - More submitted runs (77, +13%)
  - More cross-lingual tasks (35, +31.5%)
- Gold Standard: questions not translated into all languages
  - No possibility of activating tasks at the last minute
  - Useful as a reusable resource: available in the near future
Final considerations: 2006 main task innovations
- Multiple answers:
  - good response
  - limited capacity for assessing large numbers of answers
  - feedback welcome from participants
- Supporting snippets:
  - faster evaluation
  - feedback from participants
- "F/D/L" labels not given in the input format:
  - positive, as apparently there was no real impact on systems' performance
- List questions
Future perspective: main task
For discussion:
- Romanian as target
- Very hard questions (implying reasoning and answers drawn from multiple documents)
- Allowing collaboration among different systems
- Partially automated evaluation (right answers)