Tanja Schultz, Alan Black, Bob Frederking
Carnegie Mellon University
West Palm Beach, March 28, 2003
Towards Dolphin Recognition
Outline
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Part 1
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Speech Processing Terms
Speech Recognition
Converts speech input into written text output
Natural Language Understanding (NLU)
Derives the meaning of the spoken or written input
(Speech-to-speech) Translation
Transforms text / speech from language A to text / speech
of language B
Speech Synthesis (Text-To-Speech=TTS)
Converts written text input into audible output
Speech Recognition
[Diagram: recognition pipeline: speech input ("hello") → preprocessing → decoding/search → postprocessing → synthesis (TTS); competing hypotheses include "Hello", "Hale Bob", "Hallo"]
Fundamental Equation of SR
P(W|x) = [ P(x|W) * P(W) ] / P(x)
[Diagram: the knowledge sources behind this equation: acoustic model P(x|W) over sound units (A-b, A-m, A-e, ...), pronunciation dictionary (am → AE M, are → A R, I → AI, you → J U, we → V E), and language model P(W) ("I am", "you are", "we are")]
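Spelled out as the decoding rule (the standard Bayes decomposition; P(x) drops out of the maximization because it does not depend on the word sequence W):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid x)
        = \operatorname*{arg\,max}_{W} \frac{P(x \mid W)\,P(W)}{P(x)}
        = \operatorname*{arg\,max}_{W} \underbrace{P(x \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```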
SR: Data Requirements
Audio Data: sound set, units built from sounds (A-b, A-m, A-e, ...)
Text Data
[Diagram: acoustic model, pronunciation dictionary (am → AE M, are → A R, ...), and language model ("I am", "you are", "we are") built from these data]
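As a toy illustration of the text-side resources (the entries mirror the "I am / you are / we are" example above; this is purely illustrative and not part of the JRTk setup):

```python
# Toy illustration of the text-side resources: a pronunciation dictionary
# and a bigram language model estimated from a small text corpus.
from collections import Counter

# Pronunciation dictionary: word -> sequence of sound units
pronunciations = {
    "I":   ["AI"],
    "am":  ["AE", "M"],
    "you": ["J", "U"],
    "we":  ["V", "E"],
    "are": ["A", "R"],
}

# Text data used to estimate the language model P(W)
corpus = [["I", "am"], ["you", "are"], ["we", "are"]]

# Bigram counts -> conditional probabilities P(w2 | w1)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
unigrams = Counter(w for sent in corpus for w in sent[:-1])
bigram_prob = {(w1, w2): count / unigrams[w1] for (w1, w2), count in bigrams.items()}

print(bigram_prob[("we", "are")])   # 1.0 in this toy corpus
```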
Janus Speech Recognition Toolkit (JRTk)
Unlimited and open vocabulary
Spontaneous and conversational human-human speech
Speaker-independent
High bandwidth, telephone, car, broadcast
Languages: English, German, Spanish, French, Italian, Swedish, Portuguese, Korean, Japanese, Serbo-Croatian, Chinese, Shanghai, Arabic, Turkish, Russian, Tamil, Czech
Best performance on public benchmarks: DoD (English), DARPA Hub-5 Test ‘96, ‘97 (SWB task), Verbmobil (German) Benchmark ’95-’00 (travel task)
LingWear: Wearable Language Assistance for the Information Warrior
Mobile Device for Translation & Navigation
Multi-lingual Meeting Support
The Meeting Browser is a powerful tool that allows us to record a new meeting, review or summarize an existing meeting, or search a set of existing meetings for a particular speaker, topic, or idea.
Multilingual Indexing of Video
View4You / Informedia: Automatically records broadcast news and allows the user to retrieve video segments of news items for different topics using spoken language input
Non-cooperative speaker on video
Cooperative user
Indexing requires only low-quality translation
Part 2
Speech-to-Speech Recognition: Brief Introduction
Lab, Research
Data Requirements: Audio data, ‘Transcriptions’
Towards Dolphin Recognition
Applications
Current Approaches
Preliminary Results
Towards Dolphin Recognition
Identification: Whose voice is this? Whose voice is it?
Verification/Detection: Is this Bob's voice? Is it Nippy's voice?
Segmentation and Clustering: Which segments are from the same speaker, i.e. from the same dolphin? Where are the speaker (dolphin) changes?
[Diagram: an audio stream split into segments attributed to Speaker A and Speaker B]
Applications
‘off-line’ applications (off the water, off the boat, off season)
Data Management and Indexing: Automatic assignment/labeling of already recorded (archived) data
Automatic post-processing (indexing) for later retrieval
Towards Important/Meaningful Units = DOLPHONES: Segmentation and clustering of similar sounds/units
Find out about unit frequencies
Find out about correlation between sounds and other events
Whistles correlated to family relationship: Who belongs to whom?
Find out about the family tree?
Can we find out more about social structure?
Applications
‘on-line’ applications: Identification and Tracking
Who is currently speaking? Who is around?
Towards Important/Meaningful Units: Find out about correlation between sounds and other events
Whistles correlated to family relationship: Who belongs to whom?
Wide-range identification, tracking, and observation (since sound travels longer distances than image)
Common Approaches
Two distinct phases:
Training phase: training speech for each dolphin (e.g. Nippy, Havana) → feature extraction → model training → one model per dolphin
Detection phase: unknown audio → feature extraction → detection decision against the models → hypothesis (e.g. Havana)
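A rough sketch of these two phases, assuming MFCC features and one Gaussian mixture model per dolphin (the library calls are an assumption for illustration, not the JRTk implementation used in the experiments):

```python
# Sketch of the two phases: train one model per dolphin, then score new audio.
# Assumes MFCC features via librosa and Gaussian mixtures via scikit-learn.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_features(path, sr=16000, n_mfcc=13):
    """Feature extraction: MFCC frames, one row per frame."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc).T

def train_models(training_files, n_components=20):
    """Training phase: fit one GMM per dolphin from its labeled recordings."""
    models = {}
    for name, paths in training_files.items():   # e.g. {"Nippy": [...], "Havana": [...]}
        frames = np.vstack([extract_features(p) for p in paths])
        models[name] = GaussianMixture(n_components=n_components).fit(frames)
    return models

def identify(path, models):
    """Detection phase: hypothesize the dolphin whose model scores highest."""
    frames = extract_features(path)
    return max(models, key=lambda name: models[name].score(frames))
```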
Current Approaches
A likelihood ratio test is used for the detection decision
[Diagram: extracted features are scored against a dolphin model and a background model; the likelihood ratio drives the accept/reject decision]
p(X|dolph) is the likelihood of the dolphin model when the features X = (x1, x2, ...) are given
p(X|bg) is the likelihood of an alternative, so-called background model trained on all data except that of the dolphin in question
The detection score is the ratio Λ(X) = p(X|dolph) / p(X|bg): accept if it exceeds a decision threshold, otherwise reject
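A minimal sketch of this decision rule in log space, assuming both models are GMMs fit as in the training sketch above (the default threshold is purely illustrative):

```python
# Sketch: likelihood ratio detection with a dolphin GMM and a background GMM.
# Assumes both models were fit on MFCC frames as in the training sketch above.
def detect(frames, dolphin_gmm, background_gmm, threshold=0.0):
    """Accept if the average log-likelihood ratio exceeds the threshold."""
    # GaussianMixture.score() returns the mean per-frame log-likelihood, so the
    # difference is log[p(X|dolph) / p(X|bg)] averaged over the frames X.
    llr = dolphin_gmm.score(frames) - background_gmm.score(frames)
    return ("accept" if llr > threshold else "reject"), llr
```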
First Experiments - Setup
Take the data we got from Denise
Alan did the labeling of about 160 files. Labels:
dolphin sounds: ~370 tokens
electric noise (machine, clicks, others): ~180 tokens
pauses: ~220 tokens
Derive dolphin ID from file name (educated guess): Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy
Train one model per dolphin, one ‘garbage’ model for the rest
Recognize the incoming audio file; hypotheses consist of a list of dolphin and garbage models
Count the number of models per audio file and return the name of the dolphin with the highest count as the one being identified
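The decode-and-count decision could be approximated roughly as follows (a crude frame-level vote over the per-dolphin and garbage GMMs; the actual experiments used a JRTk-based recognizer, so this is only an illustration):

```python
# Sketch: per-file identification by counting which model wins on each frame.
# Assumes `models` maps names ("Caroh", ..., "garbage") to GMMs fit as above
# and `frames` is a 2D array of feature vectors for one audio file.
from collections import Counter

def identify_file(frames, models):
    """Return the non-garbage model that wins the most frames in the file."""
    counts = Counter()
    for frame in frames:
        scores = {name: gmm.score(frame.reshape(1, -1)) for name, gmm in models.items()}
        counts[max(scores, key=scores.get)] += 1
    counts.pop("garbage", None)   # the garbage model does not name a dolphin
    return counts.most_common(1)[0][0] if counts else "garbage"
```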
First Experiments - Results
[Bar chart: dolphin identification rate in % per dolphin (Caroh, Havana, Lag, Lat, LG, LH, Luna, Mel, Nassau, Nippy), comparing models with 100, 20, and 10 Gaussians]
Next steps
Step 1: To build a ‘real’ system we need
MORE audio data, MORE audio data, MORE ...
Labels (the more accurate the better)
Idea 1: Automatic labeling, live with the errors
Idea 2: Manual labeling
Idea 3: Automatic labeling and post-editing
Step 2: Given more data
Automatic clustering
Try first steps towards unit detection
Step 3: Build a working system, make it small and fast enough for deployment