Automatic Spoken Document Processing for Retrieval and Browsing

Zahra Ahmadi

Outline

• Motivation
• Typical speech retrieval system
• Prior work
• Dealing with OOV
• Improvements

Motivation

• Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, have resulted in an overwhelming amount of data of various types

• Information search and retrieval is a key application area
• Speech search has received less attention

• As data availability increases, the lack of adequate technology for processing spoken documents becomes the limiting factor for large-scale access to spoken content

• Automatic approaches for indexing and searching spoken document collections are very desirable

Typical Speech Retrieval System

• Two primary processing stages:
▫ Offline processing of audio content to generate an index
▫ Query search via an interface, with the system retrieving results based on the index
• ASR is the core component of a speech retrieval system
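The two stages above can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: it assumes the ASR output is already available as 1-best text, and the document ids, transcripts, and function names are hypothetical.

```python
from collections import defaultdict


def build_index(transcripts):
    """Offline stage: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Query stage: return the documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical 1-best ASR transcripts keyed by document id.
transcripts = {
    "doc1": "the president spoke about the economy",
    "doc2": "weather report for the weekend",
    "doc3": "economy news and market report",
}
index = build_index(transcripts)
print(search(index, "economy"))
print(search(index, "market report"))
```

A query word that the recognizer never produced simply returns no documents, which is the OOV problem discussed later in the deck.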

SDR Challenges

• Primary difficulties due to limitations of ASR technology:
▫ Highly spontaneous, unprepared speech
▫ Topic-specific or person-specific vocabulary and language usage
▫ Unknown content and topics, potentially lacking support in the general language model
▫ Wide variety of accents and speaking styles
▫ OOVs in queries
▫ Infrequent query terms, which are the most useful for retrieval

Prominent Approaches

• Many of the prominent research efforts: TREC-SDR in 1999-2000
• Significant recent contributions on a wide variety of speech sources:
▫ SpeechBot: audio from public web sites
▫ SCANMail: voice mail
▫ Oral history interviews
▫ SpeechFind: National Gallery of the Spoken Word (NGSW), consisting of speeches, news broadcasts, and recordings of significant historical content

TREC-SDR: A Success Story

• About 550 hours of broadcast news
• Segmented manually into 21,574 stories of 250 words on average
• Evaluation of ASR systems tuned to the broadcast news domain: 15-20% WER
• Preexisting approximate manual transcriptions had WER of 14.5% for video and 7.5% for radio broadcasts
• Accuracy evaluation: by human assessors, using search queries

• Retrieval performance was flat with respect to ASR WER (1-best) variations in the range of 15-30% (robust to recognition errors)

• No severe degradation in retrieval performance when evaluating with ASR outputs in comparison with approximate manual transcriptions
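The WER figures above are word-level edit distances normalized by the length of the reference transcript. A minimal sketch of that computation, assuming whitespace tokenization and equal cost for substitutions, insertions, and deletions (the function name and example sentences are illustrative only):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))
```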

TREC-SDR Robustness Results

Shortcomings of TREC-SDR

• Speech recognizers tuned heavily for the domain:
▫ Leads to very good ASR performance
▫ Unrealistic to expect 10–15% WER, especially when the speech being decoded is mismatched to the training data
▫ Common to observe WER of 30-50%
• Very low OOV rates:
▫ Typically below 1%
▫ Query-side OOV (Q-OOV) was very low as well

At a Q-OOV rate close to 15%, MAP performance degrades severely (50% relative, from 44 to 22)
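MAP (mean average precision) is the retrieval metric quoted above; the 50% relative drop is simply (44 - 22) / 44. The sketch below shows one common way to compute MAP from ranked result lists and relevance judgments. The rankings and judgments are made-up illustrations, and the helper names are not taken from the source.

```python
def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked result list, set of relevant documents), one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical results for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
relative_drop = (44 - 22) / 44  # the degradation quoted on the slide
print(round(map_score, 3), relative_drop)
```

A query whose OOV term retrieves nothing contributes an average precision of 0, which is why a high Q-OOV rate pulls MAP down so sharply.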

Dealing with OOV Query Words

• Most common: represent both query and spoken document using subword units:
▫ Linguistically motivated:
    - Phone: completely solves the OOV problem, but low performance
    - Syllable: stable acoustically, but poor language model
    - Morpheme: hard to distinguish acoustically
    - Stem-ending: acceptable OOV rate, distinguishable segments (agglutinative languages)
▫ Data-driven:
    - Multigram: non-overlapping, variable-length phone subsequences with some predefined maximum length
    - Particle: found greedily to maximize the leave-one-out likelihood of a bigram LM
    - Morph: based on minimum description length
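To illustrate why subword units sidestep the OOV problem, the sketch below matches a query against a document purely at the phone level, using overlapping phone trigrams. It is a simplified stand-in for the approaches listed above: the phone strings are hand-written, and a real system would obtain them from a phone recognizer and a grapheme-to-phoneme converter, both of which are assumed here.

```python
def phone_ngrams(phones, n=3):
    """Return the set of overlapping phone n-grams in a phone sequence."""
    return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}


def subword_score(query_phones, doc_phones, n=3):
    """Fraction of the query's phone n-grams that also occur in the document."""
    query_grams = phone_ngrams(query_phones, n)
    if not query_grams:
        return 0.0
    doc_grams = phone_ngrams(doc_phones, n)
    return len(query_grams & doc_grams) / len(query_grams)


# Hypothetical phone sequences; the query word need not be in any word vocabulary.
query = "ow b aa m ah".split()
doc_a = "p r eh z ih d ah n t ow b aa m ah s p ow k".split()
doc_b = "w eh dh er r ih p ao r t".split()

print(subword_score(query, doc_a))
print(subword_score(query, doc_b))
```

Short n-grams match more easily, which also explains the false-positive problem of subword retrieval noted elsewhere in the deck.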

Dealing with OOV Query Words (cont)

• Tighter integration of ASR and IR has been advocated:
▫ Index phone n-grams appearing in ASR N-best lists
▫ Focused on broadcast news, thus benefiting from good ASR performance
• Combination of word- and subword-level indexing:
▫ Word-level indexing and querying is still more accurate
▫ Abundance of word-spotting false positives in subword retrieval
▫ Somewhat masked by the MAP measure
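One simple way to combine the two index types, sketched below, is to consult the word index for in-vocabulary query terms and fall back to the subword index only for OOV terms. The slides do not specify a combination scheme, so this routing rule is an assumption, and the toy indexes and function names are hypothetical.

```python
def hybrid_search(query_words, vocabulary, word_search, subword_search):
    """Use word-level lookup for in-vocabulary terms, subword lookup for OOV terms."""
    results = None
    for word in query_words:
        if word in vocabulary:
            hits = word_search(word)      # more accurate, so preferred when possible
        else:
            hits = subword_search(word)   # fallback for OOV query terms
        results = set(hits) if results is None else results & set(hits)
    return results if results is not None else set()


# Toy indexes: word-level hits and (hypothetical) subword-level hits for an OOV name.
word_index = {"news": {"doc1", "doc2"}, "report": {"doc2"}}
subword_index = {"saraclar": {"doc2", "doc3"}}
vocabulary = set(word_index)

hits = hybrid_search(
    ["news", "saraclar"],
    vocabulary,
    lambda w: word_index.get(w, set()),
    lambda w: subword_index.get(w, set()),
)
print(hits)
```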

Effects of Using Different Methods

Dealing with OOV Query Words (cont)

• Building an inverted index from ASR lattices:
▫ Stores the full connectivity information of the lattice
▫ Retrieval is performed by looking up strings of units
▫ Allows exact calculation of n-gram expected counts, but more general proximity information is hard to calculate
• Query expansion:
▫ Expand to similar in-vocabulary phrases
▫ Phone confusion matrix: acoustic confusion between words
▫ Stemming
▫ Semantic similarity

Use of more than just one-best information (N-best lists or lattices) significantly improves retrieval accuracy
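The benefit of going beyond the 1-best output can be seen with expected counts. The sketch below works on an N-best list rather than a full lattice (a simplification): hypothesis scores are normalized into posteriors, and each word's expected count is the posterior-weighted sum of its occurrences. The hypotheses and scores are invented for illustration.

```python
import math


def expected_counts(nbest):
    """nbest: list of (hypothesis text, log score). Returns word -> expected count."""
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]  # stable normalization
    total = sum(weights)
    counts = {}
    for (hypothesis, _), weight in zip(nbest, weights):
        posterior = weight / total
        for word in hypothesis.split():
            counts[word] = counts.get(word, 0.0) + posterior
    return counts


# Hypothetical 3-best list with posteriors 0.6, 0.3, and 0.1.
nbest = [
    ("recognize speech", math.log(0.6)),
    ("wreck a nice beach", math.log(0.3)),
    ("recognize beach", math.log(0.1)),
]
counts = expected_counts(nbest)
one_best_words = set(nbest[0][0].split())
print(round(counts["beach"], 2), "beach" in one_best_words)
```

Here "beach" never appears in the 1-best hypothesis, yet it receives a non-zero expected count, so a query for it can still retrieve the document with a lower score.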

Long Spoken Communications

• Important to locate the relevant portion
• Achievable by segmenting documents into topics and locating topics
• Spoken Utterance Retrieval (SUR): used where segments are short or when documents consist of short utterances
• The goal of SUR is to find all utterances containing the query
• Applications: browsing broadcast news, telephone conversations, teleconferences, and lectures
• NIST STD 2006 Evaluation:

▫ Locating exact occurrence of query in large heterogeneous speech archives

▫ Notable technique with significant improvements: setting detection thresholds in a term-specific fashion to maximize ATWV metric
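The slides do not give the threshold formula, so the following is a sketch under stated assumptions. It assumes the usual term-weighted value definition, TWV = 1 - average over terms of (P_miss + beta * P_FA), with beta = 999.9 and one false-alarm trial per second of speech. Under that definition, accepting a candidate with posterior p pays off when p / N_true exceeds beta * (1 - p) / (T - N_true), which gives a per-term threshold of beta * N_true / (T + (beta - 1) * N_true). N_true is unknown in practice and is estimated here as the sum of the term's posteriors.

```python
BETA = 999.9  # assumed TWV cost/value weight


def term_threshold(posteriors, speech_seconds, beta=BETA):
    """Per-term decision threshold from the estimated number of true occurrences."""
    n_true = sum(posteriors)  # expected count of the term in the archive
    return beta * n_true / (speech_seconds + (beta - 1.0) * n_true)


def decide(posteriors, speech_seconds, beta=BETA):
    """Accept each candidate whose posterior reaches the term-specific threshold."""
    threshold = term_threshold(posteriors, speech_seconds, beta)
    return [p >= threshold for p in posteriors]


speech_seconds = 36000.0  # hypothetical 10-hour archive

rare_term = [0.3, 0.2]                    # few, uncertain candidates
frequent_term = [0.9] * 50 + [0.3] * 20   # many candidates

rare_threshold = term_threshold(rare_term, speech_seconds)
frequent_threshold = term_threshold(frequent_term, speech_seconds)
print(round(rare_threshold, 4), round(frequent_threshold, 4))
```

The same posterior of 0.3 is accepted for the rare term but rejected for the frequent one, because a miss on a rare term costs far more TWV than a false alarm.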

Spoken Document Understanding & Organization

• Keywords in the spoken document, used to understand its subject matter
• Automatically extracting key information about events in the segmented short paragraphs
• Automatically segmenting documents into short paragraphs, each with some central concept
• Automatically generating a summary for each segmented short paragraph
• Automatically generating a title for each short paragraph
• Automatically analyzing the subject topics of the segmented paragraphs, clustering them with topic labels, and organizing a hierarchical presentation

Finally…

• Use of audio content and text metadata jointly can improve retrieval performance

• Conjunction of subword and word-based methods improves performance

• Need for a universal ASR that controls the variance in WER across narrow domains, since SDR poses new challenges for the core ASR

• Cross-Language SDR: assumes queries and target spoken documents are not in the same language
▫ Bilingual performance was lower than the English monolingual run
▫ However, the degree of degradation was shown to depend on the translation resources used
▫ Extension of TREC collections by manually translating short topics into five European languages: Dutch, French, German, Italian, Spanish

References

• C. Chelba, T.J. Hazen, M. Saraclar. “Retrieval and Browsing of Spoken Content”. IEEE Signal Processing Magazine, May 2008.

• L. Lee, B. Chen. “Spoken Document Understanding and Organization”. IEEE Signal Processing Magazine, September 2005.

• J. Garofolo, G. Auzanne, and E. Voorhees. “The TREC Spoken Document Retrieval Track: A Success Story”. Proc. Recherche d’Informations Assiste par Ordinateur: Content Based Multimedia Information Access Conf., 2000.

• L. Begeja, D. Gibbon, et al. “A System for Searching and Browsing Spoken Communications”. 2004.

• S. Parlak, M. Saraclar. “Spoken Term Detection for Turkish Broadcast News”. ICASSP 2008.

• N. Bertoldi, M. Federico. “Cross-Language Spoken Document Retrieval on the TREC SDR Collection”. Springer, pp.476-481, 2003.

• C. Chelba, T.J. Hazen. “Automatic Spoken Document Processing for Retrieval and Browsing”. Tutorial slides, NAACL 2006.

• …
