Cross-Language Access to Recorded Speech in the MALACH Project
![Page 1: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/1.jpg)
Cross-Language Access to Recorded Speech
in the MALACH Project
Douglas Oard, Dina Demner-Fushman, Jan Hajic, Bhuvana Ramabhadran, Sam Gustman, Bill Byrne,
Dagobert Soergel, Bonnie Dorr, Philip Resnik, Michael Picheny, Josef Psutka
![Page 2: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/2.jpg)
Outline
• The MALACH project
• Searching speech
• A cross-language retrieval experiment
• Next steps
![Page 3: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/3.jpg)
The MALACH Project
• 52,000 interviews with Holocaust survivors
  – 116,000 hours (180 TB MPEG-1)
  – 32 languages, recorded in 67 countries
• Present: Manual indexing
  – 14,000 controlled vocabulary terms
• Future: Automatic indexing
  – Speech recognition
  – Translation
![Page 4: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/4.jpg)
Who Uses the Collection?
Discipline
• History
• Linguistics
• Journalism
• Material culture
• Education
• Psychology
• Political science
• Law enforcement
Products
• Book
• Documentary film
• Research paper
• CDROM
• Study guide
• Obituary
• Evidence
• Personal use
Based on analysis of 280 access requests
![Page 5: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/5.jpg)
Research Challenges
• Speech Recognition
  – Spontaneous, accented, elderly, language switching
• Computational Linguistics
  – Segmentation, classification, summarization, extraction
• Information Retrieval
  – Query formulation, search, selection, examination, use
Today
Tomorrow (Josef Psutka)
![Page 6: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/6.jpg)
Supporting Information Access
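[Diagram: the information access process. Stages: Source Selection, Query Formulation, Search (by the Search System), Selection, Examination, Delivery. Items passed between stages: Query, Ranked List, Recording. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection.]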
![Page 7: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/7.jpg)
Key Issues in Speech Retrieval
• Recognition accuracy
  – Content-based retrieval works when WER<40%
• Topic segmentation
  – Average MALACH interview is 2.3 hours!
• Multi-scale summarization
  – Brief summaries: selection from a ranked list
  – Detailed summaries: minimize audio replay
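The WER threshold above refers to word error rate, the standard measure of recognition accuracy. As a minimal sketch (not the project's scoring tool), WER can be computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The function name and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()   # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.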
![Page 8: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/8.jpg)
English Recognition Accuracy
• 60% WER for off-the-shelf systems!
  – 3 systems (broadcast news, dictation, telephone)
• MLLR adaptation helps
  – 33% WER for fluent speech
  – 46% WER for heavy accents/disfluent speech
• Next step: retrain on transcribed interviews
  – 200 hours from 800 speakers
![Page 9: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/9.jpg)
Cross-Language Search
• Query formulation
  – Spoken words (free text)
  – Thesaurus descriptors
• Segment selection
  – Speech-to-text translation
  – Multi-scale indicative summaries
• Use of retrieved segments
  – Query reformulation
  – Incorporation in projects
![Page 10: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/10.jpg)
Ranked Retrieval System Design
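[Diagram: ranked retrieval system design. Document side: Documents, Compute Term Weights, Build Index. Query side: Query, Translation Lexicon, Compute Term Weights. Both sides feed Compute Document Score, followed by Sort Scores, which produces the Ranked List.]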
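The components named on this slide can be sketched in a few lines. The slide does not say which term-weighting formula was used, so the tf-idf weights below are an assumption, as are the function names `build_index` and `rank` and the lexicon format (a dict from a source-language word to a list of English translations).

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Compute term weights and build an index: term -> {doc_id: weight}."""
    tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(docs)
    index = defaultdict(dict)
    for doc_id, counts in tf.items():
        for term, freq in counts.items():
            # Assumed weighting: term frequency times a smoothed idf.
            index[term][doc_id] = freq * math.log((n + 1) / df[term])
    return index

def rank(query_terms, index, lexicon):
    """Translate the query word by word, score documents, sort the scores."""
    # Keep every translation; words missing from the lexicon pass through unchanged.
    translated = [t for q in query_terms for t in lexicon.get(q, [q])]
    scores = defaultdict(float)
    for term in translated:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, with `index = build_index({"d1": "new architecture for berlin", "d2": "berlin travel guide"})` and a lexicon mapping `"architektura"` to `["architecture"]`, calling `rank(["architektura", "berlin"], index, lexicon)` places `d1` ahead of `d2`.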
![Page 11: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/11.jpg)
Evaluation Framework
[Diagram: Czech Queries and English Documents feed Ranked Retrieval, which uses a Czech/English Translation Lexicon and produces a Ranked List. The Ranked List and the Relevance Judgments feed Evaluation, which yields a Measure of Effectiveness.]
![Page 12: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/12.jpg)
Czech/English Test Collection
• 113,000 English newspaper stories
• Two sets of 33 Czech queries
  – S: Very short (1-3 words)
  – L: Sentence-length
• Human “ground truth” relevance judgments
  – Pooled assessment methodology (CLEF-2000)
![Page 13: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/13.jpg)
Translation Lexicon
• Machine-readable dictionary
  – Lemmatized Czech query words
  – Looked each up in “PC Translator”
• Bilingual term list
  – Downloaded 800 term pairs from Ergane
• Retained untranslatable terms
  – Stripped diacritics to match proper names
  – Optionally, made minor corrections (by hand)
    • e.g., “afrika” to “africa”
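The lookup-with-fallback procedure on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the lexicon is a plain dict standing in for the dictionary and term list, the input words are assumed to be lemmatized already, and the names `strip_diacritics`, `translate_query`, and `corrections` are hypothetical.

```python
import unicodedata

def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'berlín' -> 'berlin'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def translate_query(lemmas, lexicon, corrections=None):
    """Replace each lemma with all of its translations.

    Untranslatable lemmas are retained with diacritics stripped so that
    proper names can still match English text. `corrections` is an optional
    hand-made mapping such as {"afrika": "africa"}.
    """
    corrections = corrections or {}
    translated = []
    for lemma in lemmas:
        key = lemma.lower()
        if key in lexicon:
            translated.extend(lexicon[key])
        else:
            fallback = strip_diacritics(key)
            translated.append(corrections.get(fallback, fallback))
    return translated
```

Keeping every translation of an ambiguous word, rather than choosing one, is what produces long lists for function words such as the Czech preposition "v".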
![Page 14: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/14.jpg)
Example Query
• Original Czech query (S)
  – Architektura v Berlíně
• Word-by-word translation into English
  – architecture architecture
  – at below beneath by embattled in inside into on per under upon upstairs v within at below beneath by embattled in inside into on per under upon upstairs v within
  – berlin
![Page 15: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/15.jpg)
Example Search Results
• Creating a new architectural vocabulary for a democratic Berlin
• UCLA merges architecture and arts into a new school
• Best of Berlin for young travelers
• Who owns the Nazi paper trail?
• A commitment to change the world; No place like utopia: Modern Architecture and the Company we Kept …
• On the record: Sanderling's dark take on Sibelius
• Max Bill, 85; Controversial Swiss artist, sculptor and writer
• The week ahead: Berlin; Farewell to allies
• Roll over Beethoven; Jeff Berlin leaves the violin and classical …
• Californians had right stuff for airlift; Europe: former pilots …
![Page 16: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/16.jpg)
Precision-Recall Graph
[Figure: precision-recall curve. x-axis: Recall (0.0 to 1.0); y-axis: Interpolated Precision (0.0 to 1.0).]
Average Precision = 0.477
Czech title query 1, LA Times Documents, CLEF 2000 Relevance Assessments
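Both quantities on this slide can be computed from a ranked list and a set of relevance judgments. The sketch below uses the common definitions (uninterpolated average precision, and interpolated precision at the 11 standard recall levels); the slide does not state which variant was used, so treat that choice and the function names as assumptions.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def interpolated_precision(ranked, relevant, levels=None):
    """Interpolated precision at each recall level.

    The value at a level is the highest precision observed at any recall
    greater than or equal to that level (0.0 if that recall is never reached).
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    points = []  # (recall, precision) at each relevant document
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]
```

Plotting the output of `interpolated_precision` against the recall levels gives a curve of the kind shown in the figure.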
![Page 17: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/17.jpg)
Average Precision
[Figure: bar chart of average precision per query. x-axis: Query (1 3 4 5 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 26 28 29 30 31 32 33 34 36 37 38 39 40); y-axis: Average Precision (0.0 to 1.0). The value 0.477 is marked on the chart.]
Czech title queries, LA Times Documents, CLEF 2000 Relevance Assessments
Mean Average Precision = 0.188
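Mean average precision is the arithmetic mean of the per-query average precision values. A minimal sketch, assuming each query has a ranked list (`runs`) and a set of judged-relevant documents (`qrels`); these names, and the decision to skip queries with no relevant documents, are assumptions rather than details from the slides.

```python
def average_precision(ranked, relevant):
    """Uninterpolated average precision for a single query."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """Return (MAP, per-query AP).

    runs:  query id -> ranked list of document ids
    qrels: query id -> set of relevant document ids
    Queries with no relevant documents are skipped (assumption).
    """
    per_query = {
        query_id: average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
        if relevant
    }
    if not per_query:
        raise ValueError("no query has any relevant documents")
    return sum(per_query.values()) / len(per_query), per_query
```

Because each query counts equally, a few queries with high average precision can sit well above the mean, which is consistent with one query scoring 0.477 against a mean of 0.188.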
![Page 18: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/18.jpg)
Results
[Figure: bar chart. x-axis: No Translation, DQT, DQT + Names, Monolingual; y-axis: Mean Average Precision (0.0 to 0.5). Additional label: TTD.]
![Page 19: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/19.jpg)
Results
• Czech seems to pose no unusual problems
  – 55% of monolingual with simple techniques
• Suitable Czech/English resources exist
  – Czech morphology
  – Czech/English bilingual lexicon
• Multiword expression handling would help
  – Named entities, non-compositional phrases
![Page 20: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/20.jpg)
Some Next Steps
• Integrate Czech/English statistical MT
  – Johns Hopkins (Summer 2002 Workshop)
• Integrate with English and Czech ASR
  – IBM and Univ of West Bohemia/Charles Univ
• Integrate into an interactive retrieval system
  – University of Maryland and Shoah Foundation
![Page 21: Cross-Language Access to Recorded Speech in the MALACH Project](https://reader036.fdocuments.us/reader036/viewer/2022062409/56814e5b550346895dbbf6c1/html5/thumbnails/21.jpg)
For More Information
• Cross-language and speech retrieval
  – http://www.clis.umd.edu/~dlrg/clir/
  – http://www.clis.umd.edu/~dlrg/speech/
• The MALACH project
  – http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Working Group
  – http://www.dcs.shef.ac.uk/spandh/projects/swag/