Speech and Music Retrieval
LBSC 796/CMSC828o
Session 12, April 19, 2004
Douglas W. Oard
Agenda
• Questions
• Speech retrieval
• Music retrieval
Spoken Word Collections
• Broadcast programming– News, interview, talk radio, sports, entertainment
• Scripted stories– Books on tape, poetry reading, theater
• Spontaneous storytelling– Oral history, folklore
• Incidental recording– Speeches, oral arguments, meetings, phone calls
Some Statistics
• 2,000 U.S. radio stations webcasting
• 250,000 hours of oral history in British Library
• 35 million audio streams indexed by SingingFish– Over 1 million searches per day
• ~100 billion hours of phone calls each year
Economics of the Web

                                  Text in 1995    Speech in 2003
Storage (words per $)             300K            1.5M
Internet Backbone
  (simultaneous users)            250K            30M
Modem Capacity (% utilization)    100%            20%
Display Capability
  (% US population)               10%             38%
Search Systems                    Lycos, Yahoo    SpeechBot, SingingFish
Audio Retrieval
• Retrospective retrieval applications
  – Search music and nonprint media collections
  – Electronic finding aids for sound archives
  – Index audio files on the web
• Information filtering applications
  – Alerting service for a news bureau
  – Answering machine detection for telemarketing
  – Autotuner for a car radio
The Size of the Problem
• 30,000 hours in the Maryland Libraries– Unique collections with limited physical access
• Over 100,000 hours in the National Archives– With new material arriving at an increasing rate
• Millions of hours broadcast each year– Over 2,500 radio stations are now Webcasting!
Speech Retrieval Approaches
• Controlled vocabulary indexing
• Ranked retrieval based on associated text
• Automatic feature-based indexing
• Social filtering based on other users’ ratings
Supporting Information Access
SourceSelection
Search
Query
Selection
Ranked List
Examination
Recording
Delivery
Recording
QueryFormulation
Search System
Query Reformulation and
Relevance Feedback
SourceReselection
Description Strategies
• Transcription
  – Manual transcription (with optional post-editing)
• Annotation
  – Manually assign descriptors to points in a recording
  – Recommender systems (ratings, link analysis, …)
• Associated materials
  – Interviewer’s notes, speech scripts, producer’s logs
• Automatic
  – Create access points with automatic speech processing
HotBot Audio Search Results
Detectable Speech Features
• Content – Phonemes, one-best word recognition, n-best
• Identity – Speaker identification, speaker segmentation
• Language– Language, dialect, accent
• Other measurable parameters– Time, duration, channel, environment
How Speech Recognition Works
• Three stages
  – What sounds were made?
    • Convert from waveform to subword units (phonemes)
  – How could the sounds be grouped into words?
    • Identify the most probable word segmentation points
  – Which of the possible words were spoken?
    • Based on likelihood of possible multiword sequences
• All three stages are learned from training data
  – Using hill climbing (a “Hidden Markov Model”)
Using Speech Recognition
[Diagram: Phone Detection produces a phone lattice; Word Construction uses a transcription dictionary to build a word lattice; Word Selection applies a language model to yield a one-best transcript; phone n-grams and words serve as index terms.]
ETHZ Broadcast News Retrieval
• Segment broadcasts into 20 second chunks
• Index phoneme n-grams
  – Overlapping one-best phoneme sequences
  – Trained using native German speakers
• Form phoneme trigrams from typed queries
  – Rule-based system for “open” vocabulary
• Vector space trigram matching
  – Identify ranked segments by time
Phoneme Trigrams
• Manage -> m ae n ih jh
  – Dictionaries provide accurate transcriptions
    • But valid only for a single accent and dialect
  – Rule-based transcription handles unknown words
• Index every overlapping 3-phoneme sequence
  – m ae n
  – ae n ih
  – n ih jh
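The overlapping-trigram indexing step can be sketched in a few lines; the phoneme symbols follow the "manage" example above:

```python
# Slide a window over a one-best phoneme sequence and emit every
# overlapping n-gram (trigrams by default), as in the ETHZ approach.

def phoneme_ngrams(phonemes, n=3):
    """Return all overlapping n-grams of a phoneme sequence."""
    return [tuple(phonemes[i:i + n]) for i in range(len(phonemes) - n + 1)]

trigrams = phoneme_ngrams(["m", "ae", "n", "ih", "jh"])
print(trigrams)
# [('m', 'ae', 'n'), ('ae', 'n', 'ih'), ('n', 'ih', 'jh')]
```

The same function applied to a phonemized query yields query trigrams for vector-space matching against indexed segments.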
ETHZ Broadcast News Retrieval
Cambridge Video Mail Retrieval
• Added personal audio (and video) to email
  – But subject lines still typed on a keyboard
• Indexed most probable phoneme sequences
Key Results from TREC/TDT
• Recognition and retrieval can be decomposed– Word recognition/retrieval works well in English
• Retrieval is robust with recognition errors– Up to 40% word error rate is tolerable
• Retrieval is robust with segmentation errors– Vocabulary shift/pauses provide strong cues
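Word error rate, the figure quoted above, is the edit distance (substitutions, insertions, and deletions) between the recognizer output and a reference transcript, divided by the reference length. A minimal dynamic-programming sketch; the example sentences are invented:

```python
# Compute word error rate via edit distance over word sequences.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(word_error_rate("the plane hit the building",
                      "a plane hit a building"))  # 2 substitutions / 5 words
```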
Cambridge Video Mail Retrieval
• Translate queries to phonemes with dictionary– Skip stopwords and words with fewer than 3 phonemes
• Find no-overlap matches in the lattice– Queries take about 4 seconds per hour of material
• Vector space exact word match– No morphological variations checked– Normalize using most probable phoneme sequence
• Select from a ranked list of subject lines
Visualizing Turn-Taking
MIT “Speech Skimmer”
BBN Radio News Retrieval
AT&T Radio News Retrieval
IBM Broadcast News Retrieval
• Large vocabulary continuous speech recognition
  – 64,000 word forms cover most utterances
  – About 40% word error rate in the TREC 6 evaluation
    • When suitable training data is available
  – Slow indexing (1 hour per hour) limits collection size
• Standard word-based vector space matching
  – Nearly instant queries
  – N-gram triage plus lattice match for unknown words
• Ranked list showing source and broadcast time
Comparison with Text Retrieval
• Detection is harder
  – Speech recognition errors
• Selection is harder
  – Date and time are not very informative
• Examination is harder
  – Linear medium is hard to browse
  – Arbitrary segments produce unnatural breaks
Speaker Identification
• Gender
  – Classify speakers as male or female
• Identity
  – Detect speech samples from same speaker
  – To assign a name, need a known training sample
• Speaker segmentation
  – Identify speaker changes
  – Count number of speakers
A Richer View of Speech
• Speaker identification
  – Known speaker and “more like this” searches
  – Gender detection for search and browsing
• Topic segmentation via vocabulary shift
  – More natural breakpoints for browsing
• Speaker segmentation
  – Visualize turn-taking behavior for browsing
  – Classify turn-taking patterns for searching
Other Possibly Useful Features
• Channel characteristics– Cell phone, landline, studio mike, ...
• Accent– Another way of grouping speakers
• Prosody– Detecting emphasis could help search or browsing
• Non-speech audio– Background sounds, audio cues
Competing Demands on the Interface
• Query must result in a manageable set– But users prefer simple query interfaces
• Selection interface must show several segments– Representations must be compact, but informative
• Rapid examination should be possible– But complete access to the recordings is desirable
Iterative Prototyping Strategy
• Select a user group and a collection
• Observe information seeking behaviors
  – To identify effective search strategies
• Refine the interface
  – To support effective search strategies
• Integrate needed speech technologies
• Evaluate the improvements with user studies
  – And observe changes to effective search strategies
Broadcast News Retrieval Study
• NPR Online
  – Manually prepared transcripts
  – Human cataloging
• SpeechBot
  – Automatic speech recognition
  – Automatic indexing
NPR Online
SpeechBot
Study Design
• Seminar on visual and sound materials
  – Recruited 5 students
• After training, we provided 2 topics
  – 3 searched NPR Online, 2 searched SpeechBot
• All then tried both systems with a 3rd topic
  – Each choosing their own topic
• Rich data collection
  – Observation, think aloud, semi-structured interview
• Model-guided inductive analysis
  – Coded to the model with QSR NVivo
Criterion-Attribute Framework
[Table: relevance criteria (Topicality, Story Type, Authority) mapped to the associated attributes that convey them in NPR Online and SpeechBot. Attributes include story title, program title, brief summary, detailed summary, short summary, speaker name, speaker’s affiliation, highlighted terms, and audio.]
Shoah Foundation’s Collection
• Enormous scale
  – 116,000 hours; 52,000 interviews; 180 TB
• Grand challenges
  – 32 languages, accents, elderly, emotional, …
• Accessible
  – $100 million collection and digitization investment
• Annotated
  – 10,000 hours (~200,000 segments) fully described
• Users
  – A department working full time on dissemination
English ASR Accuracy
[Chart: English word error rate (%), 0–100 scale. Training: 200 hours from 800 speakers.]
ASR Game Plan (as of May 2003)

Language   Hours Transcribed   Word Error Rate
English    200                 39.6%
Czech      84                  39.4%
Russian    20 (of 100)         66.6%
Polish
Slovak
Who Uses the Collection?
Based on analysis of 280 access requests

Disciplines: History; Linguistics; Journalism; Material culture; Education; Psychology; Political science; Law enforcement
Products: Book; Documentary film; Research paper; CDROM; Study guide; Obituary; Evidence; Personal use
Question Types
• Content– Person, organization– Place, type of place (e.g., camp, ghetto)– Time, time period– Event, subject
• Mode of expression– Language– Displayed artifacts (photographs, objects, …) – Affective reaction (e.g., vivid, moving, …)
• Age appropriateness
Observational Studies
• 8 independent searchers
  – Holocaust studies (2)
  – German Studies
  – History/Political Science
  – Ethnography
  – Sociology
  – Documentary producer
  – High school teacher
• 8 teamed searchers
  – All high school teachers
• Thesaurus-based search
• Rich data collection
  – Intermediary interaction
  – Semi-structured interviews
  – Observational notes
  – Think-aloud
  – Screen capture
• Qualitative analysis
  – Theory-guided coding
  – Abductive reasoning
“Old Indexing”

Location-Time   Subject                          Person
Berlin-1939     Employment                       Josef Stein
Berlin-1939     Family life                      Gretchen Stein, Anna Stein
Dresden-1939    Schooling                        Gunter Wendt, Maria
Dresden-1939    Relocation, Transportation-rail

(Indexed segments are laid out along interview time.)
Table 5. Mentions of relevance criteria by searchers

                                  Number of Mentions
                                              Think-Aloud
Relevance Criteria        All (N=703)   Relevance Judgment (N=300)   Query Formulation (N=248)
Topicality                535 (76%)     219                          234
Richness                  39 (5.5%)     14                           0
Emotion                   24 (3.4%)     7                            0
Audio/Visual Expression   16 (2.3%)     5                            0
Comprehensibility         14 (2%)       1                            10
Duration                  11 (1.6%)     9                            0
Novelty                   10 (1.4%)     4                            2
Workshops 1 and 2
[Chart: mentions of topicality facets (Person, Place, Event/Experience, Subject, Organization/Group, Time Frame, Object), 0–140 scale.]
Workshops 1 and 2
Total mentions by 8 searchers:
• Person, by name
• Person, by characteristics: date of birth; gender; occupation (interviewee); country of birth; religion; social status (interviewee); social status (parents); nationality/language; family status; address; occupation (parents); immigration history; level of education; political affiliation; marital status
• Personal event/experience: resistance; deportation; suicide; immigration; liberation; escaping; forced labor; hiding; abortion; wedding; murder/death; adaptation; abandonment; incarceration
• Historical event/experience: November Pogrom; Beautification; Polish Pogrom; Olympic games
Workshop 3
Relevance Criteria
1 Relationship to theme
2 Specific topic was less important (different from many researchers, for whom a specific topic, such as the name of a camp or a physician as aid giver, is important)
3 Match of demographic characteristics; specifically, the age of the interviewee at the time of the events recounted, as related to the age of the students for whom a lesson is intended
4 Language of clip; relates to student audience and teaching purpose (English first, Spanish second)
5 Age-appropriateness of the material
6 Acceptability to the other stakeholders (parents, administrators, etc.)
7 The extent to which the segment provides a useful basis for learning and internalizing vocabulary
8 The extent to which the clip can be used in several subjects, for example English, history, and art
9 Ease of comprehension, mainly clarity of enunciation
10 Expressive power, both body language and voice
11 Length of the segment, in relation to what it contributes
12 Does the segment communicate enough context?
MALACH Overview
[Diagram: Speech Recognition, Boundary Detection, and Content Tagging feed Automatic Search, which supports Query Formulation and Interactive Selection.]
MALACH Test Collection
[Diagram: interviews and a comparable collection pass through Speech Recognition, Boundary Detection, and Content Tagging; Query Formulation turns topic statements into queries for Automatic Search; the resulting ranked lists are evaluated against relevance judgments to yield mean average precision.]
Design Decision #1: Search for Segments
• “Old indexing” used manual segmentation
• 4,000 indexed interviews (10,000 hours)
• 1,514 have been digitized (3,780 hours)
• 800 of those were used to train ASR
• 714 remain for (fair) IR testing
• 246 were selected (625 hours)
  – 199 complete interviews, 47 partial interviews
• 9,947 segments resulted (17 MB of text)
  – Average length: 380 words (~3 minutes)
Design Decision #2: Topic Construction
• 600 written requests, in folders at VHF
  – From scholars, teachers, broadcasters, …
• 280 topical requests
  – Others just requested a single interview
• 70 recast in a TREC-like format
  – Some needed to be “broadened”
• 50 selected for use in the collection
• 30 assessed during Summer 2004
• 28 yielded at least 5 relevant segments
Design Decision #3: Defining “Relevance”
• Topical relevance
  – Direct: They saw the plane crash
  – Indirect: They saw the engine hit the building
• Confounding factors
  – Context: Statistics on causes of plane crashes
  – Comparison: Saw a different plane crash
  – Pointer: List of witnesses at the investigation
Design Decision #4: Selective Assessment
• Exhaustive assessment is impractical
  – 300,000 judgments (~20,000 hours)!
• Pooled assessment is not yet possible
  – Requires a diverse set of ASR and IR systems
• Search-guided assessment is viable
  – Iterate topic research/search/assessment
  – Augment with review, adjudication, reassessment
  – Requires an effective interactive search system
Quality Assurance
• 14 topics independently assessed
  – 0.63 topic-averaged kappa (over all judgments)
  – 44% topic-averaged overlap (relevant judgments)
  – Assessors later met to adjudicate
• 14 topics assessed and then reviewed
  – Decisions of the reviewer were final
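The agreement figure above is a kappa statistic: observed agreement between assessors, corrected for the agreement expected by chance. A minimal two-assessor sketch with invented binary judgments:

```python
# Cohen's kappa for two assessors judging the same items.

def cohens_kappa(a, b):
    """Kappa for two parallel lists of category labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    # Chance agreement: product of each assessor's marginal rates.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

judge1 = ["rel", "rel", "not", "not", "rel", "not"]
judge2 = ["rel", "not", "not", "not", "rel", "not"]
print(round(cohens_kappa(judge1, judge2), 2))
```

Values near 0 mean chance-level agreement; the 0.63 reported above indicates substantial, though imperfect, agreement.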
Comparing Index Terms
[Chart: mean average precision (0.0–0.5 scale) when indexing ASR transcripts, interviewer notes, thesaurus terms, segment summaries, metadata, and metadata+ASR, for Full and Title queries. Topical relevance, adjudicated judgments, Inquery.]
Leveraging Automatic Categorization
• Motivation: MAP on thesaurus terms, title queries: 28.3% (was 7.5% on text)
[Diagram: training segments (manual transcripts + keywords) train an automatic categorizer; ASR output for test segments is categorized, and the assigned keywords are added to the index.]
• Categorizer: k Nearest Neighbors, trained on 3,199 manually transcribed segments; micro-averaged F1 = 0.192
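The k-nearest-neighbor categorization step might look roughly like this: assign keywords to an ASR-transcribed segment by voting among its k most similar manually indexed training segments. The bag-of-words cosine similarity, segment texts, and keywords below are simplified stand-ins for the real system:

```python
# kNN text categorization sketch: cosine similarity over word counts,
# keyword voting among the k nearest training segments.
from collections import Counter
from math import sqrt

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_keywords(test_text, training, k=3):
    """training: list of (text, keywords); returns keywords ranked by vote."""
    vec = Counter(test_text.split())
    scored = sorted(training,
                    key=lambda ex: cosine(vec, Counter(ex[0].split())),
                    reverse=True)
    votes = Counter()
    for text, keywords in scored[:k]:
        votes.update(keywords)
    return [kw for kw, _ in votes.most_common()]

training = [
    ("forced labor in the camp", ["forced labor"]),
    ("deported to the ghetto", ["deportation", "ghettos"]),
    ("labor camp transport", ["forced labor", "deportation"]),
]
print(knn_keywords("the labor camp", training, k=2))
```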
ASR-Based Search
[Chart: mean average precision; title queries, topical relevance, adjudicated judgments. The merged run is annotated +30%.]

System                     MAP
Inquery                    0.0694
Character 5-grams          0.0695
Okapi                      0.0681
Okapi Blind Expansion      0.0740
Okapi Category Expansion   0.0460
Okapi Merged               0.0941
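Mean average precision, the measure reported in these charts, averages each topic's average precision: the mean of the precision values at the ranks where relevant segments appear. A minimal sketch with invented ranked lists:

```python
# Mean average precision over a set of topics.

def average_precision(ranked, relevant):
    """ranked: list of doc ids; relevant: set of relevant doc ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["s1", "s2", "s3", "s4"], {"s1", "s3"}),  # AP = (1/1 + 2/3) / 2
    (["s9", "s7", "s8"], {"s7"}),              # AP = 1/2
]
print(round(mean_average_precision(runs), 3))
```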
Comparing ASR and Manual
[Chart: uninterpolated average precision (0.0–1.0) per topic, Manual vs. ASR transcripts, for topics 1188, 1630, 1187, 1628, 1446, 1424, 1330, 1508, 1269, 1345, 1181, 1148, 1259, 1225, 14312, 1192, 1663, 1551, 1414, 1623, 1605, 1279, 1159, 1173, 1286, 1311, 1332, and 1179. Topical relevance, adjudicated judgments, title queries, Inquery.]
Error Analysis
[Chart: per-topic ASR average precision as a percentage of metadata average precision (“ASR % of Metadata”) for topics 1179, 1605, 1623, 1414, 1551, 1192, 14312, 1225, 1181, 1345, 1330, 1446, 1628, 1187, 1188, and 1630; title queries, adjudicated judgments, Inquery. Topic title terms are marked as occurring somewhere in the ASR results, in the ASR lexicon, or only in the metadata (bold terms occur in <35 segments): wit eichmann; jew volkswagen; labor camp ig farben; slave labor telefunken aeg; minsk ghetto underground; wallenberg eichmann; bomb birkeneneau; sonderkommando auschwicz; liber buchenwald dachau; jewish kapo; kindertransport; ghetto life; fort ontario refugee camp; jewish partisan poland; jew shanghai; bulgaria save jew.]
Correcting Relevant Segments
[Chart: uninterpolated average precision for topics 1446, 14312, and 1414, comparing ASR, Corrected, and Metadata conditions; Title+Description+Narrative queries.]
Term         Relevant   Occurrences   Misses        Miss %   False alarms   False alarm %
             segments   per segment   per segment            per segment
fort         13         1.2           0.85          69       0              0
Ontario      13         1.2           0.77          63       0              0
refugee      2          1.0           0             0        0              0
refugees     6          2.0           0.67          33       0.17           8.3
camp         21         3.6           1.20          33       0              0
camps        2          1.0           0             0        0              0
Jewish       8          2.1           0.75          35       0.38           18
kapo         9          2.6           2.00          78       0              0
kapos        7          2.7           2.70          100      0              0
Minsk        9          1.7           1.70          100      0              0
ghetto       19         3.2           1.10          35       0.32           10
ghettos      3          1.0           1.00          100      0              0
underground  3          1.3           0.33          25       0              0
What Have We Learned?
• IR test collection yields interesting insights– Real topics, real ASR, ok assessor agreement
• Named entities are important to real users– Word error rate can mask key ASR weaknesses
• Knowledge structures seem to add value– Hand-built thesaurus + text classification
For More Information
• The MALACH project– http://www.clsp.jhu.edu/research/malach/
• NSF/EU Spoken Word Access Group– http://www.dcs.shef.ac.uk/spandh/projects/swag/
• Speech-based retrieval– http://www.glue.umd.edu/~dlrg/speech/
New Zealand Melody Index
• Index musical tunes as contour patterns
  – Rising, descending, and repeated pitch
  – Note duration as a measure of rhythm
• Users sing queries using words or la, da, …
  – Pitch tracking accommodates off-key queries
• Rank order using approximate string match
  – Insert, delete, substitute, consolidate, fragment
• Display title, sheet music, and audio
Contour Matching Example
• “Three Blind Mice” is indexed as:
  – *DDUDDUDRDUDRD
    • * represents the first note
    • D represents a descending pitch (U is ascending)
    • R represents a repetition (detectable split, same pitch)
• My singing produces:
  – *DDUDDUDRRUDRR
• Approximate string match finds 2 substitutions
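The approximate match can be sketched as standard edit distance over the contour symbols; the consolidate and fragment operations from the previous slide are omitted here for brevity:

```python
# Edit distance (insert, delete, substitute) between contour strings.

def edit_distance(a, b):
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d[-1][-1]

indexed = "*DDUDDUDRDUDRD"   # "Three Blind Mice"
sung    = "*DDUDDUDRRUDRR"   # the query contour from the slide
print(edit_distance(indexed, sung))  # 2 substitutions
```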
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice
  – Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
  – Find average, variance, and smoothness of segments
• Store pointers to segments in 13 sorted lists
  – 4 features, 3 parameters for each, plus duration
  – Use a commercial database for proximity matching
  – Then rank order using statistical classification
• Display file name and audio
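The per-segment statistics step (average, variance, smoothness) might be sketched as follows; the smoothness measure used here (mean absolute frame-to-frame change) and the pitch values are illustrative assumptions, with feature extraction assumed to happen upstream:

```python
# Summarize one acoustic feature over a segment's frames with the
# three parameters the slide names: average, variance, smoothness.

def segment_stats(frames):
    n = len(frames)
    mean = sum(frames) / n
    variance = sum((x - mean) ** 2 for x in frames) / n
    # Smoothness proxy: mean absolute change between adjacent frames.
    smoothness = (sum(abs(b - a) for a, b in zip(frames, frames[1:]))
                  / (n - 1)) if n > 1 else 0.0
    return mean, variance, smoothness

pitch_frames = [220.0, 222.0, 221.0, 219.0]  # Hz, one value per time slice
mean, var, smooth = segment_stats(pitch_frames)
print(round(mean, 2), round(var, 2), round(smooth, 2))
```

Applying this to each of the four features, plus duration, yields the 13 values per segment that feed the sorted lists.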
Muscle Fish Audio Retrieval
Summary
• Limited audio indexing is practical now
  – Audio feature matching, answering machine detection
• Present interfaces focus on a single technology
  – Speech recognition, audio feature matching
  – Matching technology is outpacing interface design