Atsip avsp17
Audio-Visual Speech Processing
Gérard Chollet with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot, Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations,…
■ A talking face is more intelligible, expressive, recognisable, attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.
■ SmartPhones, VisioPhones, WebPhones, SecurePhones, Visio Conferences, Virtual Reality worlds are gaining popularity.
Some topics under study,…
■ Audio-visual speech recognition – Automatic ‘lip-reading’
■ Audio-visual speaker verification – Detection of forgeries
■ Speech driven animation of the face – Could we look and sound like somebody else ?
■ Speaker indexing – ‘Who is talking in a video sequence ?’
■ OUISPER : a silent speech interface – Corpus based synthesis from tongue and lips
Audio Visual Speech Recognition
Dictionary Grammar
Acoustic models
Features extraction
Decoder
Video Mike (IBM, 2004)
Audio processing
■ Feature extraction
■ Digit detection
■ Digit recognition:
  • Acoustic parameters: MFCC
  • Context-independent HMMs
  • Decoding: time-synchronous algorithm
■ Sound effect
  – Noise: babble
■ Recognition experiments
Video processing
■ Video extraction
■ Lip localisation
■ Image interpolation (same frequency as speech)
■ Feature extraction
  • DCT and DCT2 (DCT+LDA)
  • Projections: PRO and PRO2 (PRO+LDA)
■ Recognition experiments
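The image interpolation step brings the visual stream to the same rate as the acoustic features. A minimal sketch, assuming linear interpolation of a single per-frame visual feature and an integer rate ratio; the rates (25 video frames per second, 100 acoustic frames per second), the feature and the function name are illustrative assumptions, not values taken from the slides.

```python
def upsample_linear(values, factor):
    """Linearly interpolate a 1-D feature track by an integer factor.

    Hypothetical helper: n input samples become (n - 1) * factor + 1 samples.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if len(values) < 2:
        return list(values)
    out = []
    for left, right in zip(values, values[1:]):
        for step in range(factor):
            t = step / factor
            out.append((1.0 - t) * left + t * right)
    out.append(values[-1])
    return out


# Assumed rates: 25 video frames/s upsampled by 4 to reach 100 frames/s.
lip_opening = [0.0, 4.0, 2.0]
print(upsample_linear(lip_opening, 4))
```

In a real system every component of the visual feature vector would be interpolated the same way before fusion with the acoustic stream.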
Fusion techniques
■ Parameter fusion:
  • Concatenation
  • Dimensionality reduction: Linear Discriminant Analysis (LDA)
  • Modelling: classical single-stream HMM
■ Score fusion: multi-stream HMM
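The difference between the two fusion strategies can be sketched as follows. Parameter fusion concatenates the audio and visual observation vectors before modelling (the LDA step is omitted here); score fusion combines per-stream log-likelihoods with stream weights, which is how a multi-stream HMM scores an observation. The function names, the toy log-likelihoods and the weights are assumptions made for illustration.

```python
def fuse_parameters(audio_features, visual_features):
    """Parameter (feature-level) fusion: concatenate the two observation vectors."""
    return list(audio_features) + list(visual_features)


def fuse_scores(audio_loglik, visual_loglik, audio_weight):
    """Score fusion: weighted sum of stream log-likelihoods (weights sum to 1)."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik


def recognise(stream_logliks, audio_weight):
    """Pick the hypothesis with the highest fused score.

    stream_logliks maps each hypothesis to (audio_loglik, visual_loglik).
    """
    return max(
        stream_logliks,
        key=lambda word: fuse_scores(*stream_logliks[word], audio_weight),
    )


# Toy values: in noise the audio stream slightly prefers the wrong digit.
logliks = {"one": (-12.0, -3.0), "nine": (-11.0, -9.0)}
print(fuse_parameters([1.2, -0.4], [0.7]))
print(recognise(logliks, audio_weight=1.0))   # audio only
print(recognise(logliks, audio_weight=0.5))   # equal stream weights
```

Lowering the audio weight as the signal-to-noise ratio drops lets the visual stream take over, which is the usual motivation for stream weights.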
Experimental results : parameters fusion
[Figure: accuracy (%) versus S/N, from -15 to 10 dB. Curves: speech only; video only (PRO2); video only (DCT2); AV fusion (PRO2); AV fusion (DCT2).]
Experimental results: score fusion at -5 dB
[Figure: vertical axis from 42 to 52. Systems compared: speech only; AV PRO; AV PRO2; AV DCT; AV DCT2.]
Audiovisual identity verification
■ Fusion of face and speech for identity verification
■ Detection of possible forgeries
■ Compulsory (?) for:
  – Homeland/corporate security: restricted access, …
  – Secure computer login
  – Secure on-line signing of contracts
Talking-face and 2D face sequence database
■ Data: video sequences (.avi) in which a short English phrase is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■ Audio-video data used for talking-face evaluations
■ Same sequences used for evaluations of 2D face from video sequences
■ 430 subjects pronounced 4 phrases:
  – from a set of 430 English phrases
  – 2 indoor video files acquired during the first session
  – 2 outdoor video files acquired during the second session
  – realistic forgeries created a posteriori
Audio-Visual Speech Features
■ Visual features: raw pixel values, DCT transform, shape-related features, many others…
■ Audio features: raw amplitude, “classical” MFCC coefficients, many others
Audio-Visual Subspaces
■ Reduced audiovisual subspace (audio and visual features combined): Principal Component Analysis and Linear Discriminant Analysis
■ Correlated audio and visual subspaces: Co-inertia Analysis and Canonical Correlation Analysis
Correspondence Measures
■ Audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMM
■ Correlated subspaces: Correlation, Mutual Information
Application to indexing
■ High-level requests
  – “Find videos where John Doe is speaking”
  – “Find dialogues between Mr X and Mrs Y”
  – “Locate the singer in this music video”
■ Raw energy (audio) and raw pixel values (video) → correlation
Who is speaking?
■ Face tracking
■ Correlation between
  – Pixels of each face
  – Raw audio energy
■ Find maximum synchrony
Green: current speaker
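The speaker-localisation idea above can be sketched with a plain Pearson correlation between the raw audio energy and one pixel-derived value per tracked face. This is only an illustration: the per-face value (a mean mouth-region intensity), the face names and the toy numbers are assumptions, and the slides do not specify which correlation measure was used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def current_speaker(audio_energy, face_tracks):
    """Return the tracked face whose pixel signal is most correlated with the audio."""
    return max(face_tracks, key=lambda name: pearson(audio_energy, face_tracks[name]))


# Toy data: one value per video frame for the audio and for each tracked face.
audio_energy = [0.1, 0.9, 0.2, 0.8, 0.1]
face_tracks = {
    "left": [10, 30, 12, 28, 11],    # varies with the audio energy
    "right": [20, 20, 21, 20, 20],   # almost static
}
print(current_speaker(audio_energy, face_tracks))
```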
How to Perform “Talking-Face” Authentication?
■ Face recognition and speaker verification, combined by score fusion
■ What if…? Deliberate imposture: face recognition OK, speaker verification OK, score fusion OK
Biometrics
■ Identity verification with talking faces
  – Speaker verification
  – Face recognition
■ What if?
Face: OK
Voice: OK
Result: NO
Identity Verification
Enrolment of client λ
Model for client λ
Person ε pretending to be client λ
accepted if
rejected otherwise
Co-Inertia Analysis
Equal Error Rate: 30 %
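The equal error rate quoted above is the operating point where the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be estimated from two lists of verification scores, assuming a claim is accepted when its score reaches the threshold; the scores below are made up and the function names are illustrative.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (FAR, FRR) when claims with score >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in client_scores) / len(client_scores)
    return far, frr


def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Returns (eer, threshold) for the threshold where FAR and FRR are closest.
    """
    best = None
    for threshold in sorted(set(client_scores) | set(impostor_scores)):
        far, frr = error_rates(client_scores, impostor_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, threshold)
    return best[1], best[2]


# Made-up scores: higher means "more likely to be the claimed client".
client_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.5, 0.3, 0.2]
eer, threshold = equal_error_rate(client_scores, impostor_scores)
print(f"EER = {eer:.0%} at threshold {threshold}")
```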
Test
Replay Attacks Detection
Training
Co-IA CCA
accepted if
rejected otherwise
Sync Model
Replay Attacks Detection
■ Genuine synchronized video
■ Audio replay attack: lips do not match audio perfectly
Equal Error Rate: 14 %
Example of Replay attacks
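Alignment by maximum correlation
[Figure: correlation as a function of the audio/video offset, from -5 (delayed video) to +5 (delayed audio); value marked on the figure: -1]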
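Alignment by maximum correlation can be sketched as a search over a small range of offsets, keeping the one at which the audio and visual signals correlate best. The ±5 range follows the axis on the slide; the sign convention (negative offset means delayed video), the signals and the function names are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_offset(audio, video, max_offset=5):
    """Return the offset in [-max_offset, max_offset] with the highest correlation.

    Convention assumed here: offset k compares audio[t + k] with video[t],
    so a negative k means the video lags behind the audio.
    """
    def score(k):
        pairs = [(audio[t + k], video[t])
                 for t in range(len(video)) if 0 <= t + k < len(audio)]
        if len(pairs) < 2:
            return 0.0
        return pearson([a for a, _ in pairs], [v for _, v in pairs])

    return max(range(-max_offset, max_offset + 1), key=score)


# Toy signals: the mouth motion repeats the audio energy two frames late.
audio_energy = [0, 1, 0, 0, 2, 0, 1, 0, 0, 3, 0, 0]
mouth_motion = [0, 0] + audio_energy[:-2]
print(best_offset(audio_energy, mouth_motion))
```

A replay attack would typically show either a low maximum correlation or a best offset far from zero, which is what a synchrony check can exploit.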
Audiovisual identity verification
■ Available features
  – Face: face features (lips, eyes) → Face Modality
  – Speech → Speech Modality
  – Speech synchrony → Synchrony Modality
Video
Audiovisual identity verification
■ Face modality
  – Detection:
    • Generative models (MPT toolbox)
    • Temporal median filtering
    • Eye detection within faces
  – Normalization: geometry + illumination
Audiovisual identity verification
■ Face modality:
  – Two verification strategies and one single comparison framework
    • Global = Eigenfaces:
      – Calculation of a set of directions (eigenfaces) defining a projection space
      – Two faces are compared through their projections onto the eigenface space
      – Training data: BIOMET (130 persons) + BANCA (30 persons)
Audiovisual identity verification
■ Face modality:
  • SIFT descriptors:
    – Keypoint extraction
    – Keypoint representation: 128-dimensional vector (gradient orientation histogram, …) + 4-dimensional position vector
SIFT descriptor (dim 128)
Position (x,y) + scale + orientation (dim 4)
Audiovisual identity verification
■ Face modality:
  • SVD-based matching method:
    – Compare two videos V1 and V2
    – Exclusivity principle: one-to-one correspondences between
      » Faces (global)
      » Descriptors (local)
    – Principle:
      » Proximity matrix computation between faces or descriptors
      » Extraction of good pairings (made easy by SVD computation)
    – Scores:
      » One matching score between global representations
      » One matching score between local representations
Variability !!!!
Audiovisual identity verification
■ Speech modality:
  – GMM-based approach:
    • One world model
    • Each speaker model is derived from the world model by MAP adaptation
    • Speech verification score: derived from the likelihood ratio
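The GMM-based scheme above can be sketched on scalar features: a world model, a client model obtained by adapting the world means towards the enrolment data, and a score computed as an average log-likelihood ratio. This is a simplified illustration, not the system described in the slides: real systems use multi-dimensional MFCC vectors, and the mean-only adaptation, the relevance factor of 16 and all numeric values are assumptions.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def component_logs(x, weights, means, variances):
    """Log of (weight * density) for every mixture component."""
    return [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one observation under a 1-D GMM (log-sum-exp)."""
    logs = component_logs(x, weights, means, variances)
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the world model towards enrolment data."""
    occupancy = [0.0] * len(means)
    first_order = [0.0] * len(means)
    for x in data:
        logs = component_logs(x, weights, means, variances)
        total = gmm_loglik(x, weights, means, variances)
        for k, value in enumerate(logs):
            posterior = math.exp(value - total)
            occupancy[k] += posterior
            first_order[k] += posterior * x
    adapted = []
    for k, mean in enumerate(means):
        alpha = occupancy[k] / (occupancy[k] + relevance)
        data_mean = first_order[k] / occupancy[k] if occupancy[k] > 0.0 else mean
        adapted.append(alpha * data_mean + (1.0 - alpha) * mean)
    return adapted


def llr_score(test_data, weights, world_means, client_means, variances):
    """Average log-likelihood ratio between the client and world models."""
    ratios = [gmm_loglik(x, weights, client_means, variances)
              - gmm_loglik(x, weights, world_means, variances)
              for x in test_data]
    return sum(ratios) / len(ratios)


# Toy world model and enrolment data (all values are illustrative).
weights, world_means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
enrolment = [1.8, 2.1, 2.0, 1.9, 2.2] * 4
client_means = map_adapt_means(enrolment, weights, world_means, variances)
print(client_means)
print(llr_score([2.0, 2.1], weights, world_means, client_means, variances))
print(llr_score([-1.5, 0.0], weights, world_means, client_means, variances))
```

A positive score favours the claimed identity and a negative one favours the world model; the decision threshold is tuned on development data.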
Audiovisual identity verification
■ Synchrony modality:
  – Principle: synchrony between lips and speech carries identity information
  – Process:
    • Computation of a synchrony model (co-inertia analysis, CoIA) for each person based on DCT (visual signal) and MFCC (speech signal)
    • Comparison of the test sample with the synchrony model
Audiovisual identity verification
■ Experiments:
  – BANCA database:
    • 52 persons divided into two groups (G1 and G2)
    • 3 recording conditions
    • 1 person → 8 recordings (4 client accesses, 4 impostor accesses)
    • Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
  – Scores:
    • 4 scores per access (PCA face, SIFT face, speech, synchrony)
    • Score fusion based on RBF-SVM (hyperplane learned on G1 and tested on G2, and conversely)
Audiovisual identity verification
■ Experiments:
SecurePhone
■ Technical solution that improves security
■ Biometric recognition
  – Makes use of VOICE, FACE and SIGNATURE
■ Electronic signature used to secure information exchange
Biometrics in SecurePhone
■ Operation
[Diagram: face, voice and written signature each go through pre-processing and modelling; a FUSION stage then outputs Access Granted or Access Denied]
The BioSecure Multimodal Evaluation Campaign
■ Launched in April 2007
■ Many modalities including ‘Video sequences’ and ‘Talking Faces’
■ Development data and reference systems available
■ Evaluations on the sequestered BioSecure database (1000 clients)
■ Debriefing workshop
■ More info at: http://www.int-evry.fr/biometrics/BMEC2007/index.php
Audio-visual forgery scenarios
■ Low-effort
  – “Paparazzi” scenario
    • The impostor owns a picture of the face and a recording of the voice of the target
  – “Big Brother” scenario
    • The impostor owns a video of the face and a recording of the voice of the target
■ High-effort
  – “Imitator” scenario
    • The impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
  – “Playback” scenario
    • The impostor owns a picture of the face of the target and animates it according to his own face motion
  – “Ventriloquist” scenario
    • Combines the two previous ones
Detection of imposture
Face modality: ACCEPTED!
Voice modality: ACCEPTED!
Synchronisation: DENIED!
Talking-Face forgeries @ BMEC: audio replay + “random” face
Audio replay attack
■ Assumptions
  – Forger has recorded speech data from the genuine user in outdoor (test) conditions
  – Forger replays the audio and presents his own face to the sensor
[Figure labels: stolen wave; audio replay + forger face]
Talking-Face forgeries @ BMEC: CRAZY TALK face animation + TTS
Replay attack
■ Assumptions
  – Forger has stolen a picture
  – Forger uses face animation software and TTS (male or female)
  – Forger plays back the animation to the sensor
[Figure labels: stolen picture; contour detection; generated avi]
Talking-Face forgeries @ BMEC: picture presentation + TTS
Replay attack
■ Assumptions
  – Forger has stolen a picture
  – Forger has printed the picture
  – Forger presents the picture to the sensor and uses TTS (same wave as for the face animation forgery)
[Figure labels: stolen picture; presented picture]
Systems with fusion of (face, speech)
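[Diagram: the video sequence is split into frames and a speech signal; the frames feed face verification (face score) and the speech signal feeds speaker verification (speech score); the two scores are combined into a fusion score]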
Voice Conversion methods
■ GMM conversion
  – Training of a joint Gaussian model
    • Parallel corpus of aligned sentences from both source and target voices
    • MFCC on HNM (Harmonic plus Noise Model) parameterization
  – Speech synthesis from the Gaussian model
    • Inversion of the MFCC
    • Pitch correction
■ ALISP conversion
  – Very low bit-rate speech compression method (500 bps)
    • Originally developed by TELECOM-ParisTech
  – Indexed segment dictionary system (of the target voice)
  – HNM parameterization
Voice conversion techniques
Definition: the process of making one person’s voice (the “source”) sound like another person’s voice (the “target”)
source target
Voice conversion
My name is John My name is John
Principle of ALISP
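[Diagram: CODER: input speech → spectral analysis and prosodic analysis → selection of segmental units from the dictionary of representative segments → segment index and prosodic parameters. On the synthesis side, the segment index and prosodic parameters drive concatenative synthesis (HNM) from the same dictionary of representative segments → output speech]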
Details of Encoding
[Diagram: speech → spectral analysis → HMM recognition against the dictionary of HMM models of ALISP classes → index of ALISP class. For the recognised class (e.g. HMM A), one of its representative units (synth units A1 … A8) is selected by DTW → index of synth. unit. In parallel: prosodic analysis → prosodic encoding → pitch, energy, duration]
Details of decoding
[Diagram: the ALISP index and the synth unit index within the class select which synth unit (A1 … A8) to load; the loaded units and the prosodic parameters drive concatenative synthesis → output speech]
Principle of ALISP conversion
Learning step: one hour of target voice
- Parametric analysis: MFCC
- Segmentation based on temporal decomposition and vector quantization
- Stochastic modelling based on HMM
- Creation of representative units
Conversion step
- Parametric analysis: MFCC
- HMM recognition
- Selection of representative segment → DTW
Synthesis step
- Concatenation of representative segments
- HNM synthesis
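The DTW-based selection in the conversion step can be sketched as follows: among the representative units of the recognised class, keep the one whose trajectory is closest to the incoming segment under dynamic time warping. Scalar features and an absolute-difference local cost are simplifying assumptions (the slides use MFCC vectors), and the function names are illustrative.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two non-empty scalar sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1]


def select_unit(segment, units):
    """Index of the representative unit closest to the segment under DTW."""
    return min(range(len(units)), key=lambda k: dtw_distance(segment, units[k]))


# Toy representative units of one class and a slower rendition of the second one.
units = [[0, 0, 0], [1, 2, 3, 2, 1], [5, 5, 5]]
segment = [1, 1, 2, 3, 3, 2, 1]
print(select_unit(segment, units))
```

DTW is used rather than a frame-by-frame distance because the incoming segment and the stored units generally differ in duration.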
Voice conversion using ALISP results
BREF database NIST database
Source
Result
Target Source Target
Result
female female female male
Demonstration of Voice Conversion
Impostor voice Converted voice with GMM Converted voice with ALISP
Target voice Converted voice with ALISP+GMM
3D reconstruction
• 3D face modelling from a front and a profile shot
• Animated face
• https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
Face Transformation
■ Control point selection (Figure 1: Control point selection)
■ Image segmentation (Figure 2: Division of an image)
■ Linear transformation between source and target image
■ Blending step
[Figure labels: source; target]
Face Transformation
Source → Target
→ Localisation of control points
→ Warping: X’ = f(X)
→ Blending: p = αp + (1 – α)p’
[Figure labels: X and p on the source; X’ and p’ on the target]
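The warping and blending formulas above can be sketched on toy data. The warp is written here as an affine map X’ = A·X + t applied to one control point, which is one possible form of f and an assumption of this example; the blend is the cross-dissolve αp + (1 – α)p’ applied pixel by pixel. The function names and values are illustrative.

```python
def warp_point(point, matrix, offset):
    """Apply an affine map X' = A X + t to a 2-D control point."""
    x, y = point
    (a, b), (c, d) = matrix
    tx, ty = offset
    return (a * x + b * y + tx, c * x + d * y + ty)


def blend(source_pixels, target_pixels, alpha):
    """Cross-dissolve: alpha * p + (1 - alpha) * p' for each pixel pair."""
    if len(source_pixels) != len(target_pixels):
        raise ValueError("images must have the same number of pixels")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * q
            for p, q in zip(source_pixels, target_pixels)]


# Toy example: stretch horizontally, shift, then mix 25 % source with 75 % target.
print(warp_point((1.0, 2.0), ((2.0, 0.0), (0.0, 1.0)), (1.0, -1.0)))
print(blend([100, 200], [0, 100], alpha=0.25))
```

With α = 1 the result is the (warped) source and with α = 0 it is the target, so sweeping α produces a gradual transformation from one face to the other.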
Face transformation (IBM)
Ouisper1 - Silent Speech Interface
■ Sensor-based system allowing speech communication via standard articulators, but without glottal activity
■ Two distinct types of application
  – alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
  – a "silent telephone" for use in situations where quiet must be maintained, or for communication in very noisy environments
■ Speech Synthesis from ultrasound and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Ouisper - System Overview
[Diagram. TRAINING: ultrasound video of the vocal tract and optical video of the speaker's lips → visual feature extraction; recorded audio and text → speech alignment; both feed the audio-visual speech corpus. TEST: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation]
Ouisper - Training Data
Ouisper - Video Stream Coding
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, “EigenTongue Feature Extraction For An Ultrasound-based Silent Speech Interface,” IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
■ Build a subset of typical frames
■ Perform PCA to obtain the eigenvectors
■ Code new frames with their projections onto the set of eigenvectors
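The coding scheme above (PCA on typical frames, then projection of new frames) can be sketched with a single principal direction found by power iteration. This is a deliberately small illustration: frames are flattened to short lists, only the leading eigenvector is kept whereas the method keeps a set of them, and the starting vector and iteration count are arbitrary assumptions (a start orthogonal to the leading direction would fail to converge to it).

```python
import math


def leading_eigenvector(frames, iterations=100):
    """Mean frame and leading principal direction of a set of flattened frames."""
    count, dim = len(frames), len(frames[0])
    mean = [sum(column) / count for column in zip(*frames)]
    centred = [[v - m for v, m in zip(frame, mean)] for frame in frames]
    vec = [1.0 + d for d in range(dim)]            # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix.
        proj = [sum(c * v for c, v in zip(row, vec)) for row in centred]
        new = [sum(p * row[d] for p, row in zip(proj, centred)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break
        vec = [x / norm for x in new]
    return mean, vec


def encode(frame, mean, vec):
    """Code a frame by its projection onto the principal direction."""
    return sum((f - m) * v for f, m, v in zip(frame, mean, vec))


# Toy "typical frames" of three pixels that vary along the direction (1, 2, 0).
typical_frames = [[0, 0, 5], [1, 2, 5], [2, 4, 5], [3, 6, 5]]
mean, direction = leading_eigenvector(typical_frames)
print(direction)
print(encode([4, 8, 5], mean, direction))
```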
Ouisper - Audio Stream Coding
ALISP segmentation
  – Detection of quasi-stationary parts in the parametric representation of speech
  – Assignment of segments to classes using unsupervised classification techniques
Phonetic segmentation
  – Forced alignment of speech with the text
  – Requires a relevant and correct phonetic transcription of the uttered signal
Corpus-based synthesis
  – Requires a preliminary segmental description of the signal
Audiovisual dictionary building
■ Visual and acoustic data are synchronously recorded
■ Audio segmentation is used to bootstrap visual speech recognizer
2) Train an HMM model for each phonetic class
[Figure: audiovisual dictionary, with example units /e - r/, /a - j/, /u - th/]
Visuo-acoustic decoding
■ Visual speech recognition
  – Train an HMM model for each visual class
    • Use multistream-based learning techniques
  – Perform a “visuo-phonetic” decoding step
    • Use N-best list
    • Introduce linguistic constraints
      – Language model
      – Dictionary
      – Multigrams
■ Corpus-based speech synthesis
  – Combine probabilistic and data-driven approaches in the audiovisual unit selection step
Speech recognition from video-only data
Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh (Open your book to the first page)
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh (A wear your book shoe the verse page)
Corpus-based synthesis driven by the predicted phonetic lattice is currently under study
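One way to quantify how far the recognised phone string is from the reference is the edit (Levenshtein) distance between the two sequences, normalised by the reference length. The slides do not report such a figure; the sketch below only shows how it could be computed for this example, with unit costs for substitutions, insertions and deletions as an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (unit costs)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, 1):
        current = [i]
        for j, hyp_token in enumerate(hyp, 1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_token != hyp_token)  # substitution or match
            ))
        previous = current
    return previous[-1]


ref = "ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh".split()
rec = "ax w ih y uh r b uh k sh uw dh ax v er s p ey jh".split()
errors = edit_distance(ref, rec)
print(f"{errors} phone errors over {len(ref)} reference phones "
      f"({errors / len(ref):.1%})")
```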
Ouisper - Conclusion
■ More information at
  – http://www.neurones.espci.fr/ouisper/
■ Contacts
  – [email protected]
  – [email protected]
  – [email protected]
Audio-Visual Speech Processing Conclusions and Perspectives
■ A talking face is more intelligible, expressive, recognisable, attractive than acoustic speech alone.
■ The combined use of facial and speech information improves identity verification and robustness to forgeries.
■ Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces.