Speech and Music Retrieval
LBSC 796/CMSC828o
Session 12, April 19, 2004
Douglas W. Oard
Agenda
• Questions
• Speech retrieval
• Music retrieval
Spoken Word Collections
• Broadcast programming– News, interview, talk radio, sports, entertainment
• Scripted stories– Books on tape, poetry reading, theater
• Spontaneous storytelling– Oral history, folklore
• Incidental recording– Speeches, oral arguments, meetings, phone calls
Some Statistics
• 2,000 U.S. radio stations webcasting
• 250,000 hours of oral history in British Library
• 35 million audio streams indexed by SingingFish– Over 1 million searches per day
• ~100 billion hours of phone calls each year
Economics of the Web

                                  Text in 1995    Speech in 2003
Storage (words per $)             300K            1.5M
Internet Backbone
  (simultaneous users)            250K            30M
Modem Capacity (% utilization)    100%            20%
Display Capability
  (% US population)               10%             38%
Search Systems                    Lycos, Yahoo    SpeechBot, SingingFish
Audio Retrieval
• Retrospective retrieval applications
  – Search music and nonprint media collections
  – Electronic finding aids for sound archives
  – Index audio files on the web
• Information filtering applications
  – Alerting service for a news bureau
  – Answering machine detection for telemarketing
  – Autotuner for a car radio
The Size of the Problem
• 30,000 hours in the Maryland Libraries– Unique collections with limited physical access
• Over 100,000 hours in the National Archives– With new material arriving at an increasing rate
• Millions of hours broadcast each year– Over 2,500 radio stations are now Webcasting!
Speech Retrieval Approaches
• Controlled vocabulary indexing
• Ranked retrieval based on associated text
• Automatic feature-based indexing
• Social filtering based on other users’ ratings
Supporting Information Access
SourceSelection
Search
Query
Selection
Ranked List
Examination
Recording
Delivery
Recording
QueryFormulation
Search System
Query Reformulation and
Relevance Feedback
SourceReselection
Description Strategies
• Transcription
  – Manual transcription (with optional post-editing)
• Annotation
  – Manually assign descriptors to points in a recording
  – Recommender systems (ratings, link analysis, …)
• Associated materials
  – Interviewer’s notes, speech scripts, producer’s logs
• Automatic
  – Create access points with automatic speech processing
HotBot Audio Search Results
Detectable Speech Features
• Content – Phonemes, one-best word recognition, n-best
• Identity – Speaker identification, speaker segmentation
• Language– Language, dialect, accent
• Other measurable parameters– Time, duration, channel, environment
How Speech Recognition Works
• Three stages
  – What sounds were made?
    • Convert from waveform to subword units (phonemes)
  – How could the sounds be grouped into words?
    • Identify the most probable word segmentation points
  – Which of the possible words were spoken?
    • Based on likelihood of possible multiword sequences
• All three stages are learned from training data
  – Using hill climbing (a “Hidden Markov Model”)
Using Speech Recognition
[Diagram: Phone Detection produces a phone lattice; Word Construction uses a transcription dictionary to build a word lattice; Word Selection applies a language model to yield a one-best transcript; phone n-grams and words serve as index terms.]
ETHZ Broadcast News Retrieval
• Segment broadcasts into 20 second chunks
• Index phoneme n-grams
  – Overlapping one-best phoneme sequences
  – Trained using native German speakers
• Form phoneme trigrams from typed queries
  – Rule-based system for “open” vocabulary
• Vector space trigram matching
  – Identify ranked segments by time
Phoneme Trigrams
• Manage -> m ae n ih jh
  – Dictionaries provide accurate transcriptions
    • But valid only for a single accent and dialect
  – Rule-based transcription handles unknown words
• Index every overlapping 3-phoneme sequence
  – m ae n
  – ae n ih
  – n ih jh
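The overlapping-trigram indexing step can be sketched in a few lines; the phoneme symbols follow the "manage" example above:

```python
# Slide a window over a one-best phoneme sequence and emit every
# overlapping n-gram (trigrams by default), as in the ETHZ approach.

def phoneme_ngrams(phonemes, n=3):
    """Return all overlapping n-grams of a phoneme sequence."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

trigrams = phoneme_ngrams(["m", "ae", "n", "ih", "jh"])
print(trigrams)
# [('m', 'ae', 'n'), ('ae', 'n', 'ih'), ('n', 'ih', 'jh')]
```

The same function applied to a phonemized query yields query trigrams for vector-space matching against indexed segments.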
ETHZ Broadcast News Retrieval
Cambridge Video Mail Retrieval
• Added personal audio (and video) to email
  – But subject lines still typed on a keyboard
• Indexed most probable phoneme sequences
Key Results from TREC/TDT
• Recognition and retrieval can be decomposed– Word recognition/retrieval works well in English
• Retrieval is robust with recognition errors– Up to 40% word error rate is tolerable
• Retrieval is robust with segmentation errors– Vocabulary shift/pauses provide strong cues
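Word error rate, the figure quoted above, is the edit distance (substitutions, insertions, and deletions) between the recognizer output and a reference transcript, divided by the reference length. A minimal dynamic-programming sketch; the example sentences are invented:

```python
# Compute word error rate via edit distance over word sequences.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(word_error_rate("the plane hit the building",
                      "a plane hit a building"))  # 2 substitutions / 5 words
```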
Cambridge Video Mail Retrieval
• Translate queries to phonemes with dictionary– Skip stopwords and words with fewer than 3 phonemes
• Find no-overlap matches in the lattice– Queries take about 4 seconds per hour of material
• Vector space exact word match– No morphological variations checked– Normalize using most probable phoneme sequence
• Select from a ranked list of subject lines
Visualizing Turn-Taking
MIT “Speech Skimmer”
BBN Radio News Retrieval
AT&T Radio News Retrieval
IBM Broadcast News Retrieval
• Large vocabulary continuous speech recognition
  – 64,000 word forms cover most utterances
  – About 40% word error rate in the TREC 6 evaluation
    • When suitable training data is available
  – Slow indexing (1 hour per hour) limits collection size
• Standard word-based vector space matching
  – Nearly instant queries
  – N-gram triage plus lattice match for unknown words
• Ranked list showing source and broadcast time
Comparison with Text Retrieval
• Detection is harder
  – Speech recognition errors
• Selection is harder
  – Date and time are not very informative
• Examination is harder
  – Linear medium is hard to browse
  – Arbitrary segments produce unnatural breaks
Speaker Identification
• Gender
  – Classify speakers as male or female
• Identity
  – Detect speech samples from same speaker
  – To assign a name, need a known training sample
• Speaker segmentation
  – Identify speaker changes
  – Count number of speakers
A Richer View of Speech
• Speaker identification
  – Known speaker and “more like this” searches
  – Gender detection for search and browsing
• Topic segmentation via vocabulary shift
  – More natural breakpoints for browsing
• Speaker segmentation
  – Visualize turn-taking behavior for browsing
  – Classify turn-taking patterns for searching
Other Possibly Useful Features
• Channel characteristics– Cell phone, landline, studio mike, ...
• Accent– Another way of grouping speakers
• Prosody– Detecting emphasis could help search or browsing
• Non-speech audio– Background sounds, audio cues
Competing Demands on the Interface
• Query must result in a manageable set– But users prefer simple query interfaces
• Selection interface must show several segments– Representations must be compact, but informative
• Rapid examination should be possible– But complete access to the recordings is desirable
Iterative Prototyping Strategy
• Select a user group and a collection
• Observe information seeking behaviors
  – To identify effective search strategies
• Refine the interface
  – To support effective search strategies
• Integrate needed speech technologies
• Evaluate the improvements with user studies
  – And observe changes to effective search strategies
Broadcast News Retrieval Study
• NPR Online
  – Manually prepared transcripts
  – Human cataloging
• SpeechBot
  – Automatic speech recognition
  – Automatic indexing
NPR Online
SpeechBot
Study Design
• Seminar on visual and sound materials
  – Recruited 5 students
• After training, we provided 2 topics
  – 3 searched NPR Online, 2 searched SpeechBot
• All then tried both systems with a 3rd topic
  – Each choosing their own topic
• Rich data collection
  – Observation, think aloud, semi-structured interview
• Model-guided inductive analysis
  – Coded to the model with QSR NVivo
Criterion-Attribute Framework
[Table: relevance criteria (Topicality, Story Type, Authority) mapped to the associated attributes that convey them in NPR Online and SpeechBot. Attributes include story title, program title, brief summary, detailed summary, short summary, speaker name, speaker’s affiliation, highlighted terms, and audio.]
Shoah Foundation’s Collection
• Enormous scale
  – 116,000 hours; 52,000 interviews; 180 TB
• Grand challenges
  – 32 languages, accents, elderly, emotional, …
• Accessible
  – $100 million collection and digitization investment
• Annotated
  – 10,000 hours (~200,000 segments) fully described
• Users
  – A department working full time on dissemination
English ASR Accuracy
[Chart: English word error rate (%), 0–100 scale. Training: 200 hours from 800 speakers.]
ASR Game Plan (as of May 2003)

Language   Hours Transcribed   Word Error Rate
English    200                 39.6%
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish
Slovak
Who Uses the Collection?
Based on analysis of 280 access requests

Disciplines: History; Linguistics; Journalism; Material culture; Education; Psychology; Political science; Law enforcement
Products: Book; Documentary film; Research paper; CDROM; Study guide; Obituary; Evidence; Personal use
Question Types
• Content– Person, organization– Place, type of place (e.g., camp, ghetto)– Time, time period– Event, subject
• Mode of expression– Language– Displayed artifacts (photographs, objects, …) – Affective reaction (e.g., vivid, moving, …)
• Age appropriateness
Observational Studies
• 8 independent searchers
  – Holocaust studies (2)
  – German Studies
  – History/Political Science
  – Ethnography
  – Sociology
  – Documentary producer
  – High school teacher
• 8 teamed searchers
  – All high school teachers
• Thesaurus-based search
• Rich data collection
  – Intermediary interaction
  – Semi-structured interviews
  – Observational notes
  – Think-aloud
  – Screen capture
• Qualitative analysis
  – Theory-guided coding
  – Abductive reasoning
“Old Indexing”

Location-Time   Subject                          Person
Berlin-1939     Employment                       Josef Stein
Berlin-1939     Family life                      Gretchen Stein, Anna Stein
Dresden-1939    Schooling                        Gunter Wendt, Maria
Dresden-1939    Relocation, Transportation-rail

(Indexed segments are laid out along interview time.)
Table 5. Mentions of relevance criteria by searchers

                                  Number of Mentions
                                              Think-Aloud
Relevance Criteria        All (N=703)   Relevance Judgment (N=300)   Query Formulation (N=248)
Topicality                535 (76%)     219                          234
Richness                  39 (5.5%)     14                           0
Emotion                   24 (3.4%)     7                            0
Audio/Visual Expression   16 (2.3%)     5                            0
Comprehensibility         14 (2%)       1                            10
Duration                  11 (1.6%)     9                            0
Novelty                   10 (1.4%)     4                            2
Workshops 1 and 2
[Chart: mentions of topicality facets (Person, Place, Event/Experience, Subject, Organization/Group, Time Frame, Object), 0–140 scale.]
Workshops 1 and 2
Total mentions by 8 searchers:
• Person, by name
• Person, by characteristics: date of birth; gender; occupation (interviewee); country of birth; religion; social status (interviewee); social status (parents); nationality/language; family status; address; occupation (parents); immigration history; level of education; political affiliation; marital status
• Personal event/experience: resistance; deportation; suicide; immigration; liberation; escaping; forced labor; hiding; abortion; wedding; murder/death; adaptation; abandonment; incarceration
• Historical event/experience: November Pogrom; Beautification; Polish Pogrom; Olympic games
Workshop 3
Relevance Criteria
1 Relationship to theme
2 Specific topic was less important (different from many researchers, for whom a specific topic, such as the name of a camp or a physician as aid giver, is important)
3 Match of demographic characteristics; specifically, the age of the interviewee at the time of the events recounted, as related to the age of the students for whom a lesson is intended
4 Language of clip; relates to student audience and teaching purpose (English first, Spanish second)
5 Age-appropriateness of the material
6 Acceptability to the other stakeholders (parents, administrators, etc.)
7 The extent to which the segment provides a useful basis for learning and internalizing vocabulary
8 The extent to which the clip can be used in several subjects, for example English, history, and art
9 Ease of comprehension, mainly clarity of enunciation
10 Expressive power, both body language and voice
11 Length of the segment, in relation to what it contributes
12 Does the segment communicate enough context?
MALACH Overview
[Diagram: Speech Recognition, Boundary Detection, and Content Tagging feed Automatic Search, which supports Query Formulation and Interactive Selection.]
MALACH Test Collection
[Diagram: interviews and a comparable collection pass through Speech Recognition, Boundary Detection, and Content Tagging; Query Formulation turns topic statements into queries for Automatic Search; the resulting ranked lists are evaluated against relevance judgments to yield mean average precision.]
Design Decision #1: Search for Segments
• “Old indexing” used manual segmentation
• 4,000 indexed interviews (10,000 hours)
• 1,514 have been digitized (3,780 hours)
• 800 of those were used to train ASR
• 714 remain for (fair) IR testing
• 246 were selected (625 hours)
  – 199 complete interviews, 47 partial interviews
• 9,947 segments resulted (17 MB of text)
  – Average length: 380 words (~3 minutes)
Design Decision #2: Topic Construction
• 600 written requests, in folders at VHF
  – From scholars, teachers, broadcasters, …
• 280 topical requests
  – Others just requested a single interview
• 70 recast in a TREC-like format
  – Some needed to be “broadened”
• 50 selected for use in the collection
• 30 assessed during Summer 2004
• 28 yielded at least 5 relevant segments
Design Decision #3: Defining “Relevance”
• Topical relevance
  – Direct: They saw the plane crash
  – Indirect: They saw the engine hit the building
• Confounding factors
  – Context: Statistics on causes of plane crashes
  – Comparison: Saw a different plane crash
  – Pointer: List of witnesses at the investigation
Design Decision #4: Selective Assessment
• Exhaustive assessment is impractical
  – 300,000 judgments (~20,000 hours)!
• Pooled assessment is not yet possible
  – Requires a diverse set of ASR and IR systems
• Search-guided assessment is viable
  – Iterate topic research/search/assessment
  – Augment with review, adjudication, reassessment
  – Requires an effective interactive search system
Quality Assurance
• 14 topics independently assessed
  – 0.63 topic-averaged kappa (over all judgments)
  – 44% topic-averaged overlap (relevant judgments)
  – Assessors later met to adjudicate
• 14 topics assessed and then reviewed
  – Decisions of the reviewer were final
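The agreement figure above is a kappa statistic: observed agreement between assessors, corrected for the agreement expected by chance. A minimal two-assessor sketch with invented binary judgments:

```python
# Cohen's kappa for two assessors judging the same items.

def cohens_kappa(a, b):
    """Kappa for two parallel lists of category labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    # Chance agreement: product of each assessor's marginal rates.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

judge1 = ["rel", "rel", "not", "not", "rel", "not"]
judge2 = ["rel", "not", "not", "not", "rel", "not"]
print(round(cohens_kappa(judge1, judge2), 2))
```

Values near 0 mean chance-level agreement; the 0.63 reported above indicates substantial, though imperfect, agreement.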
Comparing Index Terms
[Chart: mean average precision (0.0–0.5 scale) when indexing ASR transcripts, interviewer notes, thesaurus terms, segment summaries, metadata, and metadata+ASR, for Full and Title queries. Topical relevance, adjudicated judgments, Inquery.]
Leveraging Automatic Categorization
• Motivation: MAP on thesaurus terms, title queries: 28.3% (was 7.5% on text)
[Diagram: training segments (manual transcripts + keywords) train an automatic categorizer; ASR output for test segments is categorized, and the assigned keywords are added to the index.]
• Categorizer: k Nearest Neighbors, trained on 3,199 manually transcribed segments; micro-averaged F1 = 0.192
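The k-nearest-neighbor categorization step might look roughly like this: assign keywords to an ASR-transcribed segment by voting among its k most similar manually indexed training segments. The bag-of-words cosine similarity, segment texts, and keywords below are simplified stand-ins for the real system:

```python
# kNN text categorization sketch: cosine similarity over word counts,
# keyword voting among the k nearest training segments.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_keywords(test_text, training, k=3):
    """training: list of (text, keywords); returns keywords ranked by vote."""
    vec = Counter(test_text.split())
    scored = sorted(training,
                    key=lambda ex: cosine(vec, Counter(ex[0].split())),
                    reverse=True)
    votes = Counter()
    for text, keywords in scored[:k]:
        votes.update(keywords)
    return [kw for kw, _ in votes.most_common()]

training = [
    ("forced labor in the camp", ["forced labor"]),
    ("deported to the ghetto", ["deportation", "ghettos"]),
    ("labor camp transport", ["forced labor", "deportation"]),
]
print(knn_keywords("the labor camp", training, k=2))
```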
ASR-Based Search
[Chart: mean average precision; title queries, topical relevance, adjudicated judgments. The merged run is annotated +30%.]

System                     MAP
Inquery                    0.0694
Character 5-grams          0.0695
Okapi                      0.0681
Okapi Blind Expansion      0.0740
Okapi Category Expansion   0.0460
Okapi Merged               0.0941
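Mean average precision, the measure reported in these charts, averages each topic's average precision: the mean of the precision values at the ranks where relevant segments appear. A minimal sketch with invented ranked lists:

```python
# Mean average precision over a set of topics.

def average_precision(ranked, relevant):
    """ranked: list of doc ids; relevant: set of relevant doc ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["s1", "s2", "s3", "s4"], {"s1", "s3"}),  # AP = (1/1 + 2/3) / 2
    (["s9", "s7", "s8"], {"s7"}),              # AP = 1/2
]
print(round(mean_average_precision(runs), 3))
```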
Comparing ASR and Manual
[Chart: uninterpolated average precision (0.0–1.0) per topic, Manual vs. ASR transcripts, for topics 1188, 1630, 1187, 1628, 1446, 1424, 1330, 1508, 1269, 1345, 1181, 1148, 1259, 1225, 14312, 1192, 1663, 1551, 1414, 1623, 1605, 1279, 1159, 1173, 1286, 1311, 1332, and 1179. Topical relevance, adjudicated judgments, title queries, Inquery.]
Error Analysis
[Chart: per-topic ASR average precision as a percentage of metadata average precision (“ASR % of Metadata”) for topics 1179, 1605, 1623, 1414, 1551, 1192, 14312, 1225, 1181, 1345, 1330, 1446, 1628, 1187, 1188, and 1630; title queries, adjudicated judgments, Inquery. Topic title terms are marked as occurring somewhere in the ASR results, in the ASR lexicon, or only in the metadata (bold terms occur in <35 segments): wit eichmann; jew volkswagen; labor camp ig farben; slave labor telefunken aeg; minsk ghetto underground; wallenberg eichmann; bomb birkeneneau; sonderkommando auschwicz; liber buchenwald dachau; jewish kapo; kindertransport; ghetto life; fort ontario refugee camp; jewish partisan poland; jew shanghai; bulgaria save jew.]
Correcting Relevant Segments
[Chart: uninterpolated average precision for topics 1446, 14312, and 1414, comparing ASR, Corrected, and Metadata conditions; Title+Description+Narrative queries.]
Term         Relevant   Occurrences   Misses        Miss %   False alarms   False alarm %
             segments   per segment   per segment            per segment
fort         13         1.2           0.85          69       0              0
Ontario      13         1.2           0.77          63       0              0
refugee      2          1.0           0             0        0              0
refugees     6          2.0           0.67          33       0.17           8.3
camp         21         3.6           1.20          33       0              0
camps        2          1.0           0             0        0              0
Jewish       8          2.1           0.75          35       0.38           18
kapo         9          2.6           2.00          78       0              0
kapos        7          2.7           2.70          100      0              0
Minsk        9          1.7           1.70          100      0              0
ghetto       19         3.2           1.10          35       0.32           10
ghettos      3          1.0           1.00          100      0              0
underground  3          1.3           0.33          25       0              0
What Have We Learned?
• IR test collection yields interesting insights– Real topics, real ASR, ok assessor agreement
• Named entities are important to real users– Word error rate can mask key ASR weaknesses
• Knowledge structures seem to add value– Hand-built thesaurus + text classification
For More Information
• The MALACH project– http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Group– http://www.dcs.shef.ac.uk/spandh/projects/swag/
• Speech-based retrieval– http://www.glue.umd.edu/~dlrg/speech/
New Zealand Melody Index
• Index musical tunes as contour patterns
  – Rising, descending, and repeated pitch
  – Note duration as a measure of rhythm
• Users sing queries using words or la, da, …
  – Pitch tracking accommodates off-key queries
• Rank order using approximate string match
  – Insert, delete, substitute, consolidate, fragment
• Display title, sheet music, and audio
Contour Matching Example
• “Three Blind Mice” is indexed as:
  – *DDUDDUDRDUDRD
    • * represents the first note
    • D represents a descending pitch (U is ascending)
    • R represents a repetition (detectable split, same pitch)
• My singing produces:
  – *DDUDDUDRRUDRR
• Approximate string match finds 2 substitutions
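The approximate match can be sketched as standard edit distance over the contour symbols; the consolidate and fragment operations from the previous slide are omitted here for brevity:

```python
# Edit distance (insert, delete, substitute) between contour strings.

def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d[-1][-1]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice"
sung    = "*DDUDDUDRRUDRR"   # the query contour from the slide
print(edit_distance(indexed, sung))  # 2 substitutions
```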
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice
  – Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
  – Find average, variance, and smoothness of segments
• Store pointers to segments in 13 sorted lists
  – 4 features, 3 parameters for each, plus duration
  – Use a commercial database for proximity matching
  – Then rank order using statistical classification
• Display file name and audio
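The per-segment statistics step (average, variance, smoothness) might be sketched as follows; the smoothness measure used here (mean absolute frame-to-frame change) and the pitch values are illustrative assumptions, with feature extraction assumed to happen upstream:

```python
# Summarize one acoustic feature over a segment's frames with the
# three parameters the slide names: average, variance, smoothness.

def segment_stats(frames):
    n = len(frames)
    mean = sum(frames) / n
    variance = sum((x - mean) ** 2 for x in frames) / n
    # Smoothness proxy: mean absolute change between adjacent frames.
    smoothness = (sum(abs(b - a) for a, b in zip(frames, frames[1:]))
                  / (n - 1)) if n > 1 else 0.0
    return mean, variance, smoothness

pitch_frames = [220.0, 222.0, 221.0, 219.0]  # Hz, one value per time slice
mean, var, smooth = segment_stats(pitch_frames)
print(round(mean, 2), round(var, 2), round(smooth, 2))
```

Applying this to each of the four features, plus duration, yields the 13 values per segment that feed the sorted lists.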
Muscle Fish Audio Retrieval
Summary
• Limited audio indexing is practical now
  – Audio feature matching, answering machine detection
• Present interfaces focus on a single technology
  – Speech recognition, audio feature matching
  – Matching technology is outpacing interface design