Computational Audition at AFRL/HE: Past, Present, and Future
Computational Audition at AFRL/HE:
Past, Present, and Future
Dr. Timothy R. Anderson
Human Effectiveness Directorate
Air Force Research Laboratory
2
Biologically Based Signal Processing
• Research, development, and applications of:
  – Biologically based algorithms
  – Perceptually relevant features
  – Human-centered metrics and models
  – to improve robustness of speech processing systems
[Figure: Speech Technologies, with application areas labeled JAOC; Sensor-decision maker-shooter; Future JAOC Command & Control; Combat Plans; AWACS; Chem-bio Defense Environment]
3
Why Is This Area Important?
• Present signal processing systems (e.g., speech and speaker recognition, speech coding) are not robust in adverse military environments.
• Biological principles offer potential to provide improved performance in military environments.
4
Technical Challenges
• Identification and modeling of features and processes used by biological systems
• Incorporation of those key features and processes into computationally efficient algorithms and structures
Approach
• Develop psychoacoustic testing procedures
• Characterize key features and processes
• Develop human-centered models and metrics
• Implement computationally efficient algorithms
• Provide support to operational tests and warfighting exercises to evaluate system utility
Biologically Based Signal Processing
[Figure: matrix with rows Dominant, Strong, Favorable, Tenable, Weak and columns Embryonic, Growth, Mature, Aging]
5
Research Areas
• Cockpit Speech Recognition
• Robust Speech Recognition
  – Monaural Speech Recognition
  – Binaural Speech Recognition
  – Auditory Model Front-ends
• Speaker Recognition/Verification
  – Biologically Based Speaker ID
  – Channel Robustness
  – Speaker Recognizability Test
6
Phoneme Classification
• Kohonen Self-Organizing Feature Map
  – 16 × 16
• 10 Speaker Database (TIMIT)
• 10 sentences/speaker
• Leaving one out method (per speaker)
• Features calculated with
  – 16 ms window
  – 5 ms frame step
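The slide names a 16 × 16 Kohonen self-organizing feature map but gives no training details. The following is a minimal sketch of the standard SOM update rule, not the implementation used in the study. The function names, learning-rate schedule, neighbourhood schedule, and epoch count are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the map node whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((a - b) ** 2 for a, b in zip(x, w))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(frames, rows=16, cols=16, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map on feature vectors (one vector per frame).

    Assumed schedule: learning rate and Gaussian neighbourhood radius both
    decay linearly with the epoch index.
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    order = list(range(len(frames)))
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        sigma = max(sigma0 * (1.0 - frac), 0.5)
        rng.shuffle(order)  # present frames in a fresh random order
        for i in order:
            x = frames[i]
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Nodes near the winner on the grid move more.
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2.0 * sigma ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
    return weights
```

For classification, each map node would then be labeled with the phoneme that most often selects it as best-matching unit on training data, and a test frame takes the label of its best-matching unit. That labeling step is a common convention and is not stated on the slide.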
7
TRADITIONAL VS. AUDITORY, MONAURAL
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: AIM, MFCC]
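The recognition results in this talk are plotted against signal-to-noise ratio. The slides do not say how the noisy conditions were generated. As a sketch of one common approach, the noise can be scaled so that the ratio of average speech power to average noise power matches a target value in dB. The function names below are illustrative assumptions.

```python
import math


def average_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(average_power(speech) / average_power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Add noise to speech after scaling the noise to the target SNR."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Solve P_speech / (gain^2 * P_noise) = 10^(target / 10) for gain.
    gain = math.sqrt(average_power(speech)
                     / (average_power(noise) * 10.0 ** (target_snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling `mix_at_snr(speech, noise, 10.0)` would produce a 10 dB condition like those on the horizontal axis of the charts.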
9
Binaural Speech Recognition
• Past
• Present
• Future
10
Binaural Speech Recognition
• Stereausis
• Cocktail Party Processor
• BAIM
• BINAP
11
EXPERIMENT SETUP
[Figure: experiment layout showing a sound source and a noise source]
12
MONAURAL VS. BINAURAL COCKTAIL PARTY PROCESSOR
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: CPP, MONO]
13
MONAURAL VS. BINAURAL AUDITORY IMAGE MODEL
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: BAIM, AIM]
14
BINAURAL
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: CPP, BAIM]
15
MONAURAL
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: AIM, MONO]
16
BAIM VS. CPP-AIM
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: BAIM, AIM, CPP-AIM]
17
COINCIDENCE
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: BAIM, BINAP]
18
MONAURAL, BINAURAL AND TRADITIONAL
[Figure: phoneme recognition rate (%, axis 0 to 50) versus signal-to-noise ratio (dB: 1, 5, 10, 15, 20, 32, Clean); curves: CPP, BAIM, AIM, MONO, MFCC, BINAP, CPP-AIM]
19
Binaural Speech Recognition
RESULTS: BINAURAL AUDITORY MODEL PROVIDES BETTER REPRESENTATION THAN TRADITIONAL TECHNIQUES
TASK: PHONEME RECOGNITION
SPEECH: LOW TO HIGH SNR
RESULTS: 7-12 dB BINAURAL ADVANTAGE
20
Binaural Speech Recognition
• Past
• Present
  – No current work
• Future
21
Binaural Speech Recognition
• Past
• Present
• Future
  – Implement binaural ASR system
  – Investigate further binaural fusion mechanisms
  – Meeting room data
  – Implement binaural system using AIM chips
22
Auditory Model Front Ends
• Past
• Present
• Future
23
Auditory Model Front Ends
• Tanner Research “Analog Speech Recognition”
  – Implementation of AIM
  – 56-channel analog filter bank
  – Single SBUS board
  – 1.5 × real time
24
Auditory Model Front Ends
• AFIT
  – Designed digital implementation
    • Middle ear, BMM (basilar membrane motion), adaptive thresholding
  – 32 channels per chip
  – 300 Hz to 7 kHz
  – 44.1 kHz sampling rate
  – 2 chips provide 64 channels in real time
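The slide gives the channel count (64 across two chips) and the frequency range (300 Hz to 7 kHz) but not how the channels are spaced. As a sketch only, the code below assumes equal spacing on the ERB-rate scale of Glasberg and Moore, a common choice for auditory filterbanks. The spacing rule and the function names are assumptions, not details of the AFIT design.

```python
import math


def hz_to_erb_rate(f_hz):
    """Glasberg and Moore ERB-rate scale: 21.4 * log10(0.00437 * f + 1)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437


def centre_frequencies(n_channels=64, f_lo=300.0, f_hi=7000.0):
    """Centre frequencies equally spaced in ERB-rate between f_lo and f_hi."""
    e_lo, e_hi = hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi)
    step = (e_hi - e_lo) / (n_channels - 1)
    return [erb_rate_to_hz(e_lo + i * step) for i in range(n_channels)]


if __name__ == "__main__":
    for i, f in enumerate(centre_frequencies()):
        print(f"channel {i:2d}: {f:8.1f} Hz")
```

Under this assumption the channels are dense at low frequencies and sparse at high frequencies, mirroring the frequency resolution of the cochlea.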
27
Auditory Model Front Ends
• Past
• Present
  – Single board system designed and prototyped (USB)
  – Current chip design undergoing debug
  – Second fabrication run this fall
• Future
28
Auditory Model Front Ends
• Past
• Present
• Future
  – Debug and verify chip fabrication
  – Debug PC-based real-time auditory model front end
  – Implement complete end-to-end auditory ASR
  – Investigate feedback mechanisms in auditory model for ASR
29
Biologically Based SID
• Past
• Present
• Future
30
Biologically Based SID
• Auditory Models Investigated
  – Payton’s Auditory Model (PAM)
  – Auditory Image Model (AIM)
• VQ codebook used to model speaker
• 37 speakers from TIMIT (dr1 and dr2; 12 female, 25 male)
  – MFCC 94%
  – PAM 67%
  – AIM 91%
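The slide says each speaker is modeled by a VQ codebook but does not give the codebook size or training procedure. A minimal sketch of the general technique follows: train one k-means codebook per speaker, then assign a test utterance to the speaker whose codebook yields the lowest average quantization distortion. The codebook size, iteration count, and function names are assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size=4, iters=10, seed=0):
    """Build a VQ codebook for one speaker with plain k-means."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(size), key=lambda j: sq_dist(v, codebook[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the old codeword if its cluster is empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def distortion(vectors, codebook):
    """Average distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(vectors, codebooks):
    """Return the speaker whose codebook gives the lowest distortion."""
    return min(codebooks, key=lambda name: distortion(vectors, codebooks[name]))
```

Here `codebooks` is a dictionary mapping a speaker label to that speaker's trained codebook. The same scoring works with any of the front ends on the slide (MFCC, PAM, AIM), since only the feature vectors change.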
31
Biologically Based SID
• Past
• Present
• Future
32
Biologically Based SID
• Using perceptual features
  – Formants, formant bandwidths, and pitch
• Voiced frames
• Using GMM classifier
• Conducting experiments on larger databases
  – Switchboard
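The slide names a GMM classifier without further detail. As a sketch of how GMM-based speaker identification is typically scored, the code below evaluates the average per-frame log-likelihood of an utterance under each speaker's diagonal-covariance mixture and picks the highest. Model training (usually by EM) is omitted. The function names and the model layout, a list of `(weight, mean, variance)` tuples, are assumptions for illustration.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, gmm):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def utterance_score(frames, gmm):
    """Average per-frame log-likelihood of the (voiced) frames."""
    return sum(gmm_log_likelihood(x, gmm) for x in frames) / len(frames)


def identify(frames, models):
    """Return the speaker whose GMM scores the frames highest."""
    return max(models, key=lambda name: utterance_score(frames, models[name]))
```

Each frame would be a vector of the perceptual features listed above (formants, formant bandwidths, and pitch), extracted from voiced frames only.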
33
Biologically Based SID
[Figure: performance curves for three feature sets: MFCCs (no deltas, no CMS); F0 base; MFCCs (no CMS)]
34
Biologically Based SID
[Figure: performance curves for three feature sets: MFCCs (no deltas, no CMS); F0 base; MFCCs (no CMS)]
35
Biologically Based SID
[Figure: performance curves for three feature sets: F0 base; MFCCs (no deltas, no CMS); MFCCs (no CMS)]
36
Biologically Based SID
[Figure: performance curves for three feature sets: MFCCs (no deltas, no CMS); F0 base; MFCCs (no CMS)]
37
Biologically Based SID
• Performance isn’t the best, but this feature set…
  – Uses only 9 features versus 19–38 for MFCCs
  – Hasn’t been as heavily researched as MFCCs
38
Biologically Based SID
• Determine reasons for performance differences between various databases
• Channel & score normalizations
• Pitch-synchronous features
• Closed-phase analysis
• Glottal model features
39
Biologically Based SID
40
Biologically Based SID
• Past
• Present
• Future
41
Biologically Based SID
• Investigate other auditory-based features
  – Vocal agitation
  – Formants, formant bandwidths, and pitch calculated from the auditory model
  – Auditory model features
• Conduct experiments on other databases
  – Broadcast news
  – Military training exercises
42
Speaker Recognizability Test
• Past
• Present
• Future
43
Speaker Recognizability Test
• Dynastat “The Development of a Method for Evaluating and Predicting Speaker Recognizability in Voice Communication Systems”
  – Determined perceptually relevant features
    • Perceptual voice traits (PVT)
    • 21 traits currently identified
  – Developed methodology to measure these traits
    • Human listeners
  – Developed measure to determine loss due to channel
    • Diagnostic Speaker Recognizability Test (DSRT)
44
Speaker Recognizability Test
• Past
• Present
• Future
45
Speaker Recognizability Test
• Use perceptual voice traits to identify groups of similar and distinctive speakers
• Determine if current SID systems have difficulty with these similar speakers
• Implementing in-house
  – Web-based listening test for
    • PVT rating
    • DSRT
46
Speaker Recognizability Test
• Past
• Present
• Future
47
Speaker Recognizability Test
• Obtain PVT ratings for larger database
  – Switchboard
• Determine acoustic correlates of perceptually relevant features
• Use as features for speaker recognition
• Utilize DSRT for communication system testing
48
Summary
• Computational audition offers potential for improved performance in adverse military environments
• Much research remains to be done
  – Fidelity of model
  – Model feedback pathways
• Computation is no longer the limiting factor in performing meaningful experiments
49
Questions?