Tanja Schultz, Alan Black, Bob Frederking
Carnegie Mellon University
West Palm Beach, March 28, 2003
Towards Dolphin Recognition
Outline
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Part 1
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Speech Processing Terms
Speech Recognition
Converts speech input into written text output
Natural Language Understanding (NLU)
Derives the meaning of the spoken or written input
(Speech-to-speech) Translation
Transforms text / speech from language A to text / speech
of language B
Speech Synthesis (Text-To-Speech=TTS)
Converts written text input into audible output
Speech Recognition
[Diagram: recognition pipeline: speech input ("hello") → preprocessing → decoding/search → postprocessing → synthesis (TTS); competing hypotheses include "Hello", "Hale Bob", "Hallo"]
Fundamental Equation of SR
P(W|x) = [ P(x|W) * P(W) ] / P(x)
[Diagram: the knowledge sources behind this equation: acoustic model P(x|W) over sound units (A-b, A-m, A-e, ...), pronunciation dictionary (am → AE M, are → A R, I → AI, you → J U, we → V E), and language model P(W) ("I am", "you are", "we are")]
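Spelled out as the decoding rule (the standard Bayes decomposition; P(x) drops out of the maximization because it does not depend on the word sequence W):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid x)
        = \operatorname*{arg\,max}_{W} \frac{P(x \mid W)\,P(W)}{P(x)}
        = \operatorname*{arg\,max}_{W} \underbrace{P(x \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```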
SR: Data Requirements
Audio Data: sound set, units built from sounds (A-b, A-m, A-e, ...)
Text Data
[Diagram: acoustic model, pronunciation dictionary (am → AE M, are → A R, ...), and language model ("I am", "you are", "we are") built from these data]
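As a toy illustration of the text-side resources (the entries mirror the "I am / you are / we are" example above; this is purely illustrative and not part of the JRTk setup):

```python
# Toy illustration of the text-side resources: a pronunciation dictionary
# and a bigram language model estimated from a small text corpus.
from collections import Counter

# Pronunciation dictionary: word -> sequence of sound units
pronunciations = {
    "I":   ["AI"],
    "am":  ["AE", "M"],
    "you": ["J", "U"],
    "we":  ["V", "E"],
    "are": ["A", "R"],
}

# Text data used to estimate the language model P(W)
corpus = [["I", "am"], ["you", "are"], ["we", "are"]]

# Bigram counts -> conditional probabilities P(w2 | w1)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
unigrams = Counter(w for sent in corpus for w in sent[:-1])
bigram_prob = {(w1, w2): count / unigrams[w1] for (w1, w2), count in bigrams.items()}

print(bigram_prob[("we", "are")])   # 1.0 in this toy corpus
```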
Janus Speech Recognition Toolkit (JRTk)
Unlimited and open vocabulary
Spontaneous and conversational human-human speech
Speaker-independent
High bandwidth, telephone, car, broadcast
Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghai, Arabic, Turkish, Russian, Tamil, Czech
Best performance on public benchmarks: DoD (English), DARPA Hub-5 Test ‘96, ‘97 (SWB task), Verbmobil (German) Benchmark ’95-’00 (travel task)
LingWear: Wearable Language Assistance for the Information Warrior
Mobile Device for Translation & Navigation
Multi-lingual Meeting Support
The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea.
Multilingual Indexing of Video
View4You / Informedia: Automatically records broadcast news and allows the user to retrieve video segments of news items for different topics using spoken language input
Non-cooperative speaker on video
Cooperative user
Indexing requires only low-quality translation
Part 2
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Towards Dolphin Recognition
Identification: Whose voice is this? Whose voice is it?
Verification/Detection: Is this Bob's voice? Is it Nippy's voice?
Segmentation and Clustering: Which segments are from the same speaker, i.e. from the same dolphin? Where are the speaker (dolphin) changes?
[Diagram: an audio stream split into segments attributed to Speaker A and Speaker B]
Applications
‘off-line’ applications (off the water, off the boat, off season)
Data Management and Indexing: Automatic assignment/labeling of already recorded (archived) data
Automatic post-processing (indexing) for later retrieval
Towards Important/Meaningful Units = DOLPHONES: Segmentation and clustering of similar sounds/units
Find out about unit frequencies
Find out about correlation between sounds and other events
Whistles correlated to family relationship: Who belongs to whom?
Find out about the family tree?
Can we find out more about social structure?
Applications
‘on-line’ applications: Identification and Tracking
Who is currently speaking? Who is around?
Towards Important/Meaningful Units: Find out about correlation between sounds and other events
Whistles correlated to family relationship: Who belongs to whom?
Wide-range identification, tracking, and observation (since sound travels longer distances than image)
Common Approaches
Two distinct phases:
Training phase: training speech for each dolphin (e.g. Nippy, Havana) → feature extraction → model training → one model per dolphin
Detection phase: unknown audio → feature extraction → detection decision against the models → hypothesis (e.g. Havana)
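A rough sketch of these two phases, assuming MFCC features and one Gaussian mixture model per dolphin (the library calls are an assumption for illustration, not the JRTk implementation used in the experiments):

```python
# Sketch of the two phases: train one model per dolphin, then score new audio.
# Assumes MFCC features via librosa and Gaussian mixtures via scikit-learn.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(path, sr=16000, n_mfcc=13):
    """Feature extraction: MFCC frames, one row per frame."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

def train_models(training_files, n_components=20):
    """Training phase: fit one GMM per dolphin from its labeled recordings."""
    models = {}
    for name, paths in training_files.items():   # e.g. {"Nippy": [...], "Havana": [...]}
        frames = np.vstack([extract_features(p) for p in paths])
        models[name] = GaussianMixture(n_components=n_components).fit(frames)
    return models

def identify(path, models):
    """Detection phase: hypothesize the dolphin whose model scores highest."""
    frames = extract_features(path)
    return max(models, key=lambda name: models[name].score(frames))
```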
Current Approaches
A likelihood ratio test is used for the detection decision
[Diagram: extracted features are scored against a dolphin model and a background model; the likelihood ratio drives the accept/reject decision]
p(X|dolph) is the likelihood of the dolphin model when the features X = (x1, x2, ...) are given
p(X|bg) is the likelihood of an alternative, so-called background model trained on all data except that of the dolphin in question
The detection score is the ratio Λ(X) = p(X|dolph) / p(X|bg): accept if it exceeds a decision threshold, otherwise reject
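A minimal sketch of this decision rule in log space, assuming both models are GMMs fit as in the training sketch above (the default threshold is purely illustrative):

```python
# Sketch: likelihood ratio detection with a dolphin GMM and a background GMM.
# Assumes both models were fit on MFCC frames as in the training sketch above.
def detect(frames, dolphin_gmm, background_gmm, threshold=0.0):
    """Accept if the average log-likelihood ratio exceeds the threshold."""
    # GaussianMixture.score() returns the mean per-frame log-likelihood, so the
    # difference is log[p(X|dolph) / p(X|bg)] averaged over the frames X.
    llr = dolphin_gmm.score(frames) - background_gmm.score(frames)
    return ("accept" if llr > threshold else "reject"), llr
```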
First Experiments - Setup
Take the data we got from Denise
Alan did the labeling of about 160 files. Labels:
dolphin sounds: ~370 tokens
electric noise (machine, clicks, others): ~180 tokens
pauses: ~220 tokens
Derive dolphin ID from file name (educated guess): Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy
Train one model per dolphin, one ‘garbage’ model for the rest
Recognize the incoming audio file; hypotheses consist of a list of dolphin and garbage models
Count the number of models per audio file and return the name of the dolphin with the highest count as the one being identified
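The decode-and-count decision could be approximated roughly as follows (a crude frame-level vote over the per-dolphin and garbage GMMs; the actual experiments used a JRTk-based recognizer, so this is only an illustration):

```python
# Sketch: per-file identification by counting which model wins on each frame.
# Assumes `models` maps names ("Caroh", ..., "garbage") to GMMs fit as above
# and `frames` is a 2D array of feature vectors for one audio file.
from collections import Counter

def identify_file(frames, models):
    """Return the non-garbage model that wins the most frames in the file."""
    counts = Counter()
    for frame in frames:
        scores = {name: gmm.score(frame.reshape(1, -1)) for name, gmm in models.items()}
        counts[max(scores, key=scores.get)] += 1
    counts.pop("garbage", None)   # the garbage model does not name a dolphin
    return counts.most_common(1)[0][0] if counts else "garbage"
```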
First Experiments - Results
[Bar chart: dolphin identification rate in % per dolphin (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy), comparing models with 100, 20, and 10 Gaussians]
Next steps
Step 1: To build a ‘real’ system we need
MORE audio data, MORE audio data, MORE ...
Labels (the more accurate the better)
Idea 1: Automatic labeling, live with the errors
Idea 2: Manual labeling
Idea 3: Automatic labeling and post-editing
Step 2: Given more data
Automatic clustering
Try first steps towards unit detection
Step 3: Build a working system, make it small and fast enough for deployment