Departamento de Lenguajes y Sistemas Informáticos
Spoken Document Retrieval experiments with IR-n system
Fernando Llopis Pascual, Patricio Martínez-Barco
Index
IR-n System
Adapting IR-n System to SDR task
Evaluation
Conclusions and future work
Use short fragments of documents, instead of whole documents, to evaluate relevance or similarity
These fragments are called passages
Each document is divided into passages before relevance is calculated
IR-n System
Passage Retrieval Systems
Why does the IR-n system use sentences to define passages?
A sentence expresses an idea in the document
There are algorithms that detect sentence boundaries with high precision
Sentences are complete units, so they can present understandable information to users or pass it on to a subsequent system
IR-n System
Passage concept
General Custer was Civil War Union Major soldier. One of the most famous and controversial figures in United States Military history. Graduated last in his West Point Class (June 1861). Spent first part of the Civil War as a courier and staff officer. Promoted from Captain to Brigadier General of Volunteers just prior to the Battle of Gettysburg, and was given command of the Michigan "Wolverines" Cavalry brigade.
He helped defeat General Stuart's attempt to make a cavalry strike behind Union lines on the 3rd Day of the Battle (July 3, 1863), thus markedly contributing to the Army of the Potomac's victory (a large monument to his Brigade now stands in the East Cavalry Field in Gettysburg). Participated in nearly every cavalry action in Virginia from that point until the end of the war, always performing boldly, most often brilliantly, and always seeking publicity for himself and his actions. Ended the war as a Major General of Volunteers and a Brevet Major General in the Regular Army.
Upon Army reorganization in 1866, he was appointed Lieutenant Colonel of the soon to be renowned 7th United States Cavalry. Fought in the various actions against the Western Indians, often with a singular brutality (exemplified by his wiping out of a Cheyenne village on the Washita in November 1868).
His exploits on the Plains were romanticized by Eastern United States newspapermen, and he was elevated to legendary status in his time. The death of his friend, Lucarelli, changed his life.
1 – Obtains sentences from the document
2 – Defines passages according to a fixed number of sentences
IR-n System
Passage concept
The IR-n system defines passages in the following way:
[Figure: a document of 15 sentences (SENTENCE 1 to SENTENCE 15) grouped into Passage 1, Passage 2 and Passage 3]
Every passage has the same number of sentences
This number depends on:
The collection of documents
The size of the query
IR-n System
Passage concept
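The two steps described above (obtain the sentences, then group a fixed number of them into each passage) can be sketched as follows. This is only an illustration, not the IR-n implementation: the regular-expression splitter, the function names, the passage size of 5 and the `step` parameter are all assumptions. The slides do not say whether consecutive passages share sentences, so both a disjoint and an overlapping variant are shown.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'
    when followed by whitespace. Real splitters also handle abbreviations."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_passages(units, size, step=1):
    """Group consecutive units into passages of exactly `size` units.

    `step` is the distance between the starts of two consecutive passages:
    step == size gives disjoint passages, step < size gives overlapping ones.
    Trailing units that cannot fill a whole passage are dropped, so every
    passage has the same number of units.
    """
    if len(units) <= size:
        return [list(units)]
    return [units[i:i + size] for i in range(0, len(units) - size + 1, step)]


# The 15-sentence document from the figure, with an illustrative size of 5.
sentences = [f"SENTENCE {n}" for n in range(1, 16)]

disjoint = build_passages(sentences, size=5, step=5)
overlapping = build_passages(sentences, size=5, step=1)

print(len(disjoint))     # 3
print(len(overlapping))  # 11
```

With `step=5` the 15 sentences yield three passages, as in the figure; with `step=1` each sentence after the first starts a new passage until fewer than five sentences remain.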
As pointed out by Dahlback (1997), spoken input:
Is often incomplete and incorrect
Contains interruptions and repairs
Contains complete sentences only very occasionally
Conclusion: the sentence concept is not valid for spoken input
Therefore, new basic units for dialogue models must be proposed:
Utterances instead of sentences
Turns instead of paragraphs
Adapting IR-n system to SDR task
Spoken input
Utterance: a sequence of words produced by a speaker between two pauses.
Turn:
In dialogues, the set of utterances that a speaker produces between two speaker changes
In monologues, the set of utterances that a speaker produces about the same subject
(Each section of the TREC SDR collection is considered a turn)
Adapting IR-n system to SDR task
Definitions
The lack of punctuation marks impedes the recognition of utterance boundaries
Utterance boundaries must be estimated by detecting the longest pauses
Some turns have no semantic content, for example “Morning C.N.N. headline news I’m Sachi Koto”
Some turns are interrupted due to:
Overlaps
Speaker mistakes
Repetitions
Modifications of previous information
Noise is introduced by automatic transcribers
Adapting IR-n system to SDR task
SDR problems
Adapting IR-n system to SDR task
IR-n problems
The lack of sentences to define passages must be solved with the use of utterances:
An utterance splitter was developed
An overlapping passage technique was used to minimize the effect of utterance splitting failures
Noisy input:
How well the system tolerates it must be tested
The main goal of this experiment is to assess the robustness of the IR-n system:
How a system based on passages (and therefore on sentences) can be adapted to utterances
How well the system tolerates noise
Evaluation
Evaluation goal
Discovering the minimum time between words to consider a new utterance
…
<Word S_time=156.010 E_time=156.140> TO </Word>
<Word S_time=156.140 E_time=156.600> THWART </Word>
<Word S_time=156.600 E_time=156.830> THEIR </Word>
<Word S_time=156.830 E_time=157.480> ABILITY </Word>
<Word S_time=157.510 E_time=157.810> TO </Word>
<Word S_time=157.840 E_time=158.330> ACQUIRE </Word>
<Word S_time=158.330 E_time=158.450> AND </Word>
<Word S_time=158.450 E_time=158.890> DEVELOP </Word>
<Word S_time=158.920 E_time=159.350> WEAPONS </Word>
…
That is not a new utterance: the longest pause between consecutive words is 0.03 seconds
Evaluation
Training focus
Discovering the minimum time between words to consider a new utterance
…
<Word S_time=215.130 E_time=215.330> BUT </Word>
<Word S_time=215.350 E_time=215.470> FOR </Word>
<Word S_time=215.470 E_time=215.610> THE </Word>
<Word S_time=215.610 E_time=215.900> BAY'S </Word>
<Word S_time=215.900 E_time=216.190> CHIEF </Word>
<Word S_time=217.680 E_time=218.270> I </Word>
<Word S_time=219.780 E_time=219.950> WHAT </Word>
<Word S_time=220.010 E_time=220.160> WOULD </Word>
<Word S_time=220.160 E_time=220.340> THEY </Word>
<Word S_time=220.340 E_time=220.910> ACHIEVED </Word>
…
That is a new utterance: the pauses before I (1.49 seconds) and before WHAT (1.51 seconds) are far longer than the pauses between the other words
Evaluation
Training focus
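A pause-based splitter of the kind these two examples suggest could look like the sketch below. It is not the authors' utterance splitter: the regular expression, the function names and the rule "start a new utterance when the gap between one word's `E_time` and the next word's `S_time` reaches `min_pause`" are assumptions made for illustration. The sample data is the second transcript fragment shown above.

```python
import re

# Matches one <Word S_time=... E_time=...> TEXT </Word> element.
WORD_RE = re.compile(
    r"<Word S_time=(?P<start>[\d.]+) E_time=(?P<end>[\d.]+)>\s*(?P<text>.*?)\s*</Word>"
)

SAMPLE = """
<Word S_time=215.130 E_time=215.330> BUT </Word>
<Word S_time=215.350 E_time=215.470> FOR </Word>
<Word S_time=215.470 E_time=215.610> THE </Word>
<Word S_time=215.610 E_time=215.900> BAY'S </Word>
<Word S_time=215.900 E_time=216.190> CHIEF </Word>
<Word S_time=217.680 E_time=218.270> I </Word>
<Word S_time=219.780 E_time=219.950> WHAT </Word>
<Word S_time=220.010 E_time=220.160> WOULD </Word>
<Word S_time=220.160 E_time=220.340> THEY </Word>
<Word S_time=220.340 E_time=220.910> ACHIEVED </Word>
"""


def parse_words(transcript):
    """Return (start, end, text) tuples in transcript order."""
    return [
        (float(m["start"]), float(m["end"]), m["text"])
        for m in WORD_RE.finditer(transcript)
    ]


def split_utterances(words, min_pause):
    """Start a new utterance whenever the silence before a word
    lasts at least `min_pause` seconds."""
    utterances = []
    current = []
    previous_end = None
    for start, end, text in words:
        if current and start - previous_end >= min_pause:
            utterances.append(current)
            current = []
        current.append(text)
        previous_end = end
    if current:
        utterances.append(current)
    return utterances


for utterance in split_utterances(parse_words(SAMPLE), min_pause=0.2):
    print(" ".join(utterance))
# BUT FOR THE BAY'S CHIEF
# I
# WHAT WOULD THEY ACHIEVED
```

Applied to the first fragment (TO THWART THEIR ABILITY ...), where no pause exceeds 0.03 seconds, the same function returns a single utterance for any of the pause sizes considered in training.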
Discovering the best size for passages
[Figure: a list of 15 utterances (UTTERANCE 1 to UTTERANCE 15), repeated three times]
Evaluation
Training focus
Training corpus: TREC SDR-8 collection (according to the track specification)
Parameters to be evaluated:
Number of utterances per passage: from 1 to 9
Pause size considered for utterance splitting: 0.1, 0.2 or 0.3 seconds
Models:
With query expansion
Without query expansion
Evaluation
Training
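The training parameters listed above form a small grid of 9 × 3 × 2 = 54 configurations. A minimal sketch of enumerating that grid and keeping the best configuration follows. The `evaluate` callback is hypothetical: it stands for whatever procedure runs the system on the training collection and returns its AvgP, which the slides do not describe.

```python
from itertools import product

# Parameter values taken from the training setup.
UTTERANCES_PER_PASSAGE = range(1, 10)   # 1 to 9
PAUSE_SIZES = (0.1, 0.2, 0.3)           # seconds
QUERY_EXPANSION = (True, False)

grid = list(product(UTTERANCES_PER_PASSAGE, PAUSE_SIZES, QUERY_EXPANSION))
print(len(grid))  # 54


def select_best(configurations, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callback taking
    (utterances_per_passage, pause_size, use_query_expansion)
    and returning a number such as AvgP on the training collection.
    """
    return max(configurations, key=lambda config: evaluate(*config))
```

A call such as `select_best(grid, evaluate)` would then return one `(size, pause, expansion)` tuple, given a real `evaluate` function.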
Evaluation
Training
Training results
Best AvgP: 0.4620
Best pause estimation: 0.2 seconds
Best passage size: 5 utterances
Best model: with query expansion
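The slides report AvgP without defining it. Assuming it is the usual TREC-style non-interpolated average precision, averaged over all queries, it could be computed as in the sketch below. The function names and the toy document identifiers are illustrative only and are not taken from the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is measured at the rank of every relevant document retrieved,
    and the sum is divided by the total number of relevant documents, so
    relevant documents that are never retrieved count as zero.
    Assumes `ranked_ids` contains no duplicates.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values; `runs` is a list of
    (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy query: relevant documents found at ranks 1 and 3.
print(round(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}), 4))  # 0.8333
```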
Evaluation
Monolingual test
Rank  Organization   AvgP
1     ITC-irst       0.3944
2     Exeter         0.3824
3     IR-n Alicante  0.3637
Also listed, without a rank:
      JHU/APL        0.31844
      IR-n Alicante  0.35633
Monolingual results
Test corpus: TREC SDR-9 collection
Parameters:
Number of utterances per passage: 5
Pause size considered for utterance splitting: 0.2 seconds
Model: with query expansion
French queries were translated into English using machine translation:
Power translator
Free translator
Babel fish
Evaluation
Bilingual test (French-English)
Rank  Organization   AvgP
1     ITC-irst       0.3064
2     Exeter         0.3032
3     IR-n Alicante  0.2825
Also listed, without a rank:
      JHU/APL        0.19044
      IR-n Alicante  0.28462
Bilingual results
Evaluation
Bilingual (French-English)
Test corpus: TREC SDR-9 collection
Parameters:
Number of utterances per passage: 5
Pause size considered for utterance splitting: 0.2 seconds
Model: with query expansion
Conclusions:
The IR-n system is robust when working on the SDR task (+)
The IR-n system's performance must be improved (-)
Future work:
Reduce the noise produced by repetitions and modifications
Remove turns without semantic content
Evaluate and improve our utterance splitter
Conclusions and future work