Trainable Videorealistic Speech Animation - mit.edu9.520/spring09/Classes/tony_9520class08.pdfNo...

Trainable Videorealistic Speech Animation Tony Ezzat Gadi Geiger Tomaso Poggio CBCL/AI Lab MIT


Trainable Videorealistic Speech Animation

Tony Ezzat, Gadi Geiger, Tomaso Poggio

CBCL/AI Lab, MIT


Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation


Overview

Video Database

“Air” “Badge”

Visual Speech Processing

“Badge”

2 Themes: Videorealism, Machine Learning

Mary101


Audio Analysis

Video Database

“Air” “Badge”

Visual Speech Processing

“Badge”

Audio Database

Audio is recorded also to help label video


Audio Synthesis

Video Database

“Air” “Badge”

Visual Speech Processing

“Badge”

Audio Database

“Badge”

Audio Speech Processing (crossed out)

No Audio Synthesis!


What is the Input REALLY?

Visual Speech Processing

“Badge”


Input: Phone Stream

Visual Speech Processing takes a phone stream as input: /SIL B B B AE AE JH JH SIL SIL/

Sources of the phone stream:

• Real audio of “Badge”: speech recognition, forced Viterbi alignment, or manual labelling
• Text “Badge”: TTS
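The repeated labels in /SIL B B B AE AE JH JH SIL SIL/ suggest one phone label per video frame. The following is a minimal sketch of how such a stream could be derived from a time alignment. The `(phone, start, end)` segment format, the timings, and the frame rate are all hypothetical and do not come from the slides.

```python
def phone_stream(segments, fps):
    """Expand (phone, start_sec, end_sec) segments into one label per video frame."""
    n_frames = round(segments[-1][2] * fps)
    stream = []
    for frame in range(n_frames):
        t = (frame + 0.5) / fps  # sample at the centre of each frame
        for phone, start, end in segments:
            if start <= t < end:
                stream.append(phone)
                break
    return stream

# Hypothetical timings (in seconds) for the word "badge" at 10 frames per second.
alignment = [("SIL", 0.0, 0.1), ("B", 0.1, 0.4), ("AE", 0.4, 0.6),
             ("JH", 0.6, 0.8), ("SIL", 0.8, 1.0)]
stream = phone_stream(alignment, fps=10)
print("/" + " ".join(stream) + "/")
```

The sketch assumes contiguous segments; a frame whose centre falls in a gap between segments would simply be skipped.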


Pre- and Post-Processing

Pre-Processing

Post-Processing

• Remove head movement using planar perspective warping
• Mask out mouth
• Track & recomposite into background sequence

Visual Speech Processing

/SIL B B B AE AE JH JH SIL SIL/
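A planar perspective warp is a 3x3 homography. As a rough illustration of the head-movement removal step, the sketch below estimates a homography from four point correspondences with the direct linear transform and applies it to points. The landmark coordinates and the matrix `H_true` are made up for the example; the slides do not say how the warp was actually estimated.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform: find 3x3 H with dst ~ H @ src in homogeneous coordinates."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map an (n, 2) array of points through H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical head landmarks in one frame and their positions in a reference frame.
H_true = np.array([[1.0, 0.1, 5.0], [-0.05, 0.9, -3.0], [1e-3, 2e-3, 1.0]])
frame_pts = np.array([[10.0, 10.0], [90.0, 12.0], [88.0, 70.0], [12.0, 72.0]])
ref_pts = apply_homography(H_true, frame_pts)
H_est = estimate_homography(frame_pts, ref_pts)
print(np.round(H_est, 4))
```

Warping every frame by its estimated homography would bring the head into a common reference pose before the mouth region is analyzed.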


Tracking & Compositing


Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results


Video Rewrite (Bregler, Covell, Slaney 1997)

Hello: /H-E-L/ + /E-L-OW/

• Triphone basis units
• Reorder them to new utterance
• Pixel blending at join points

Coarticulation: /utu/ vs /iti/


Video Rewrite Issues (Bregler, Covell, Slaney 1997)

• Sampling coarticulation
  - 20000 triphones ~ 3 hrs!

• Model of speech is entire video corpus
  - No capacity to learn/model/distill
  - Not a parsimonious representation

• Poor capacity for novel image synthesis
  - Poor smoothing at join points
  - Cannot stretch/shrink to match audio
  - Discrete number of paths
  - Cannot fill in missing data


Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results


Extracting Prototypes

46 prototypes extracted using PCA and K-means clustering
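One way to read "PCA and K-means clustering" is: project the mouth images onto a few principal components, cluster them, and keep the real image nearest each cluster centre as a prototype. The sketch below follows that reading on random stand-in data. The function name, the nearest-image rule, the Euclidean distance, and all sizes are assumptions, not details taken from the slides.

```python
import numpy as np

def extract_prototypes(images, n_components, n_prototypes, n_iter=50, seed=0):
    """Return indices of the images closest to K-means centres in PCA space."""
    X = images.reshape(len(images), -1).astype(float)
    Xc = X - X.mean(axis=0)
    # PCA via SVD: project onto the leading principal directions.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ vt[:n_components].T

    rng = np.random.default_rng(seed)
    centres = Z[rng.choice(len(Z), n_prototypes, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_prototypes):
            if np.any(labels == k):  # leave an empty cluster's centre unchanged
                centres[k] = Z[labels == k].mean(axis=0)
    # Use the real image nearest each centre as the prototype.
    dists = np.linalg.norm(Z[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=0)

# Synthetic stand-in for mouth images: 60 frames of 8x8 pixels.
frames = np.random.default_rng(1).random((60, 8, 8))
proto_idx = extract_prototypes(frames, n_components=5, n_prototypes=4)
print(proto_idx)
```

For the deck's setting, `n_prototypes` would be 46; the number of principal components is not stated on the slide.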


Multidimensional Morphable Model

[Figure: prototype images $I_1$, $I_2$, $I_3$, $I_4$ with flows $C_2$, $C_3$, $C_4$, parameterized by $(\alpha, \beta)$]


MMM Background

Tommy Poggio/MIT

David Beymer

Mike Jones

Vinay Kumar

Volker Blanz/MPI Saarbrücken

Thomas Vetter/University of Basel

Tim Cootes/Manchester

Michael Black/Brown


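1D Morphing (Beier & Neely 1992)

$$I^{morph} = \beta_1\,\mathrm{WARP}(I_1, \alpha_1 F_1) + \beta_2\,\mathrm{WARP}(I_2, \alpha_2 F_2)$$

[Figure: the two warped images and plots of the parameters over the range 0 to 1]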


Optical Flow

$C = \{d_x(x,y),\ d_y(x,y)\}$

(Beymer, Shashua, Poggio 93) (Chen & Williams 93)


1D Morphing w/Optical Flow

Forward warping A to B

Forward warping B to A

Blending

Hole filling
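The four steps listed on this slide can be sketched for two small grayscale images as follows. This is a simplified illustration: it uses nearest-neighbour forward warping, and it fills holes by borrowing pixels from the other warped image before blending, which is an assumption made for brevity rather than the hole-filling scheme used in the original work. The flow arrays store $d_x$ in channel 0 and $d_y$ in channel 1.

```python
import numpy as np

def forward_warp(image, flow, amount):
    """Push each pixel along amount * flow; unfilled destinations stay NaN (holes)."""
    h, w = image.shape
    out = np.full((h, w), np.nan)
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + amount * flow[..., 0]).astype(int)
    yd = np.rint(ys + amount * flow[..., 1]).astype(int)
    ok = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)
    out[yd[ok], xd[ok]] = image[ys[ok], xs[ok]]
    return out

def morph(img_a, img_b, flow_ab, flow_ba, t):
    """Blend A warped toward B with B warped toward A, at position t in [0, 1]."""
    wa = forward_warp(img_a, flow_ab, t)
    wb = forward_warp(img_b, flow_ba, 1.0 - t)
    # Hole filling (simplified): where one warp left a hole, borrow the other warp's pixel.
    wa_f = np.where(np.isnan(wa), wb, wa)
    wb_f = np.where(np.isnan(wb), wa, wb)
    out = (1.0 - t) * wa_f + t * wb_f
    return np.nan_to_num(out, nan=0.0)

# Toy example: a single bright pixel that moves two columns to the right from A to B.
img_a = np.zeros((5, 5)); img_a[2, 1] = 1.0
img_b = np.zeros((5, 5)); img_b[2, 3] = 1.0
flow_ab = np.zeros((5, 5, 2)); flow_ab[..., 0] = 2.0
flow_ba = np.zeros((5, 5, 2)); flow_ba[..., 0] = -2.0
mid = morph(img_a, img_b, flow_ab, flow_ba, 0.5)
print(mid)
```

At the halfway point the bright pixel sits in the middle column at full intensity, rather than appearing as two half-intensity ghosts as it would under plain cross-dissolve.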


MMM Definition

• 46 image prototypes from corpus
• 46 optical flows between prototypes
• Parameterize using $(\alpha, \beta)$: $\alpha$ is 46-dimensional, $\beta$ is 46-dimensional

[Figure: prototype images $I_1$, $I_2$, $I_3$, $I_4$ with flows $C_2$, $C_3$, $C_4$]


MMM Synthesis

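[Figure: prototype images $I_1$, $I_2$, $I_3$, $I_4$ with flows $C_2$, $C_3$, $C_4$, and the synthesized flow $C_1^{synth}$]

$$C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i$$

$$C_i^{synth} = W(C_1^{synth} - C_i,\ C_i)$$

$$I_i^{warp} = W(I_i,\ C_i^{synth})$$

$$I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i I_i^{warp}$$

Fine, but what about speech?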


Mary101 Speech Model

[Figure: MMM space with phoneme clusters /SIL/, /F/, /AE/]

• Each phoneme represents a cluster in MMM space
• The speech trajectory passes close to the clusters but is also smooth


MMM Analysis

[Figure: prototype images $I_1$, $I_2$, $I_3$, $I_4$ with flows $C_2$, $C_3$, $C_4$, and the parameters $(\alpha, \beta)$]


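MMM Analysis (Cont'd)

[Figure: novel image $I_{novel}$ and its flow $C_{novel}$, alongside prototype images $I_1$, $I_2$, $I_3$, $I_4$ and flows $C_2$, $C_3$, $C_4$]

$$\min_{\alpha}\ \left\| C_{novel} - \sum_{i=1}^{N} \alpha_i C_i \right\|$$

Re-orient + Warp

$$\min_{\beta}\ \left\| I_{novel} - \sum_{i=1}^{N} \beta_i I_i^{warped} \right\| \quad \text{subject to } \beta_i > 0\ \forall i,\ \ \sum_{i=1}^{N} \beta_i = 1$$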
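The flow half of the analysis is an ordinary linear least-squares problem in $\alpha$. The sketch below solves it with `numpy.linalg.lstsq` on random stand-in flows; the function name and array shapes are assumptions. The texture half is not shown, since the positivity and sum-to-one constraints on $\beta$ call for a constrained solver.

```python
import numpy as np

def analyze_flow(c_novel, prototype_flows):
    """Least-squares alpha minimising ||c_novel - sum_i alpha_i C_i||."""
    # Each prototype flow becomes one column of the design matrix.
    A = np.stack([c.ravel() for c in prototype_flows], axis=1)
    alpha, *_ = np.linalg.lstsq(A, c_novel.ravel(), rcond=None)
    return alpha

rng = np.random.default_rng(0)
prototype_flows = [rng.normal(size=(6, 6, 2)) for _ in range(4)]  # stand-ins for C_1..C_4
alpha_true = np.array([0.1, 0.5, 0.3, 0.1])
c_novel = sum(a * c for a, c in zip(alpha_true, prototype_flows))
alpha = analyze_flow(c_novel, prototype_flows)
print(np.round(alpha, 3))
```

If one prototype flow is identically zero (for example a reference prototype's flow to itself), its coefficient is not determined by the data and `lstsq` returns zero for it.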


MMM Analysis Parameters

[Figure: flow and texture parameters for the words “badge” and “lavish”]


Comparison of Real and Synthesized Images

[Figure: real and synthetic image pairs]

• Tongue is not perfect
• Slight blurring


Analysis of Entire Recorded Corpus

$$z_1 = (\alpha_1, \beta_1),\quad z_2 = (\alpha_2, \beta_2),\quad \ldots,\quad z_{30000} = (\alpha_{30000}, \beta_{30000})$$

[Figure: video corpus frames, each analyzed against the prototypes, with phone labels /b/, /jh/, /ae/]


Phonetic Clusters

Represent each phone $p$ with $(\mu_p, \Sigma_p)$

One set for flows, another set for textures

[Figure: clusters for /t/, /w/, /m/, /aa/, /b/]
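As a starting point, each phone's $(\mu_p, \Sigma_p)$ can be taken as the sample mean and covariance of the analyzed frames labelled with that phone; a later slide notes that such sample estimates lead to underarticulation. The sketch below uses random two-dimensional vectors as a stand-in for the real parameters, and the function name is hypothetical.

```python
import numpy as np

def phone_clusters(params, labels):
    """Sample mean and covariance of the parameter vectors assigned to each phone."""
    labels = np.asarray(labels)
    clusters = {}
    for phone in np.unique(labels):
        z = params[labels == phone]
        clusters[str(phone)] = (z.mean(axis=0), np.cov(z, rowvar=False))
    return clusters

rng = np.random.default_rng(0)
labels = "SIL B B B AE AE JH JH SIL SIL".split()
params = rng.normal(size=(len(labels), 2))  # stand-in for analyzed parameter vectors
clusters = phone_clusters(params, labels)
mu_b, sigma_b = clusters["B"]
print(mu_b, sigma_b, sep="\n")
```

Following the slide, this would be run twice: once on the flow parameters and once on the texture parameters.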


Trajectory Synthesis

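$$\min_{y}\ (y - \mu)^T \Sigma^{-1} (y - \mu) + \lambda \|\Delta y\|^2$$

(first term: phonetic targets; second term: smoothness)

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix},\quad \mu = \begin{bmatrix} \mu_{P_1} \\ \mu_{P_2} \\ \vdots \\ \mu_{P_T} \end{bmatrix},\quad \Sigma = \begin{bmatrix} \Sigma_{P_1} & & & \\ & \Sigma_{P_2} & & \\ & & \ddots & \\ & & & \Sigma_{P_T} \end{bmatrix}$$

Input: /SIL B B B AE AE JH JH SIL SIL/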


Smoothness

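First-order difference operator:

$$\Delta = \begin{bmatrix} -I & I & & \\ & -I & I & \\ & & \ddots & \ddots \end{bmatrix}$$

Higher orders of smoothness: $\Delta\Delta$, $\Delta\Delta\Delta$, … (order 2, 3, …)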
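Setting the gradient of $(y - \mu)^T \Sigma^{-1} (y - \mu) + \lambda \|\Delta y\|^2$ to zero gives the linear system $(\Sigma^{-1} + \lambda \Delta^T \Delta)\, y = \Sigma^{-1} \mu$. The sketch below solves it for the simplified case of one scalar parameter per frame, with higher-order operators built by composing first differences. The phone means, variances, and $\lambda$ are hypothetical values chosen for illustration.

```python
import numpy as np

def difference_operator(n_frames, order=1):
    """Order-k finite-difference matrix, built by composing first differences."""
    D = np.eye(n_frames)
    for _ in range(order):
        m = D.shape[0]
        first = np.eye(m - 1, m, k=1) - np.eye(m - 1, m)  # rows of [-1, 1]
        D = first @ D
    return D

def synthesize_trajectory(mu, var, lam, order=1):
    """Minimise (y-mu)^T Sigma^-1 (y-mu) + lam * ||Delta y||^2 for scalar frames."""
    mu = np.asarray(mu, dtype=float)
    sigma_inv = np.diag(1.0 / np.asarray(var, dtype=float))
    D = difference_operator(len(mu), order)
    return np.linalg.solve(sigma_inv + lam * D.T @ D, sigma_inv @ mu)

# Per-frame targets for /SIL B B B AE AE JH JH SIL SIL/ with hypothetical phone means.
means = {"SIL": 0.0, "B": 1.0, "AE": 3.0, "JH": 2.0}
stream = "SIL B B B AE AE JH JH SIL SIL".split()
mu = np.array([means[p] for p in stream])
var = np.ones(len(stream))
y = synthesize_trajectory(mu, var, lam=2.0, order=1)
print(np.round(y, 2))
```

With `lam=0` the trajectory reproduces the targets exactly; larger values trade target accuracy for smoothness, and a larger variance for a phone lets the trajectory stray further from that phone's mean.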


Setting $\Delta$, $\lambda$

Cross-validation:

• flow: $\Delta$ order 4, $\lambda = 250$: septic splines
• texture: $\Delta$ order 5, $\lambda = 100$: nintic splines


Setting Phonetic Clusters

Use sample estimates?

/t/

/b/

Problem: Underarticulation!


Adjusting Phonetic Clusters

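Use gradient descent to tweak the phonetic clusters

Compare synthesized trajectory $y = \{\alpha_t, \beta_t\}$ with original trajectory $z = \{\alpha_t, \beta_t\}$:

$$E = (z - y)^T (z - y)$$

$$\frac{\partial E}{\partial \mu_i} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial \mu_i}$$

$$\mu^{new} = \mu^{old} - \eta\,\frac{\partial E}{\partial \mu}$$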
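Because the synthesized trajectory is a linear function of the phone means, the chain rule above has a closed form. The sketch below works through the scalar-per-frame case with a first-order smoothness operator: writing $y = B\mu$, the gradient is $-2 B^T (z - y)$. Only the means are updated here, and the trajectory, variances, $\lambda$, and step size $\eta$ are hypothetical.

```python
import numpy as np

def train_means(z, frame_phone, mu, var, lam, eta=0.05, n_steps=200):
    """Gradient descent on per-phone means so the synthesized y tracks the original z."""
    frame_phone = np.asarray(frame_phone)
    T, P = len(z), len(mu)
    M = np.zeros((T, P))
    M[np.arange(T), frame_phone] = 1.0            # frame t uses the mean of its phone
    sigma_inv = np.diag(1.0 / np.asarray(var, dtype=float)[frame_phone])
    D = np.eye(T - 1, T, k=1) - np.eye(T - 1, T)  # first-order difference operator
    # y = B @ mu is the closed-form minimiser of the synthesis objective.
    B = np.linalg.solve(sigma_inv + lam * D.T @ D, sigma_inv @ M)
    mu = np.asarray(mu, dtype=float).copy()
    for _ in range(n_steps):
        y = B @ mu
        grad = -2.0 * B.T @ (z - y)               # dE/dmu via the chain rule
        mu -= eta * grad
    return mu, B @ mu

frame_phone = np.array([0, 1, 1, 1, 2, 2, 3, 3, 0, 0])  # SIL B B B AE AE JH JH SIL SIL
z = np.array([0.0, 1.0, 1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 0.0, 0.0])  # hypothetical original
mu0 = np.array([z[frame_phone == p].mean() for p in range(4)])     # sample estimates
var = np.ones(4)
_, y_before = train_means(z, frame_phone, mu0, var, lam=1.0, n_steps=0)
mu_trained, y_after = train_means(z, frame_phone, mu0, var, lam=1.0)
err_before = np.sum((z - y_before) ** 2)
err_after = np.sum((z - y_after) ** 2)
print(err_before, err_after)
```

Starting from the sample means, the smoothed trajectory undershoots the original; the training loop moves the means so that the synthesized trajectory lands closer to it.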


Phones Before/After Training

[Figure: /t/ and /b/ clusters before and after training]


Trajectories Before/After Training

[Figure: trajectories of $\alpha_{12}$ and $\beta_{28}$ before and after training]


Coarticulation Model

[Figure: MMM space with clusters /B/, /U/, /T/, /I/]

Coarticulation is controlled by the width of the cluster regions


Coarticulation

/utu/

/iti/ /ata/

/ubu/

/ibi/ /aba/


Big Picture

Training:

• Pre-process
• Construct MMM $\{I_i, C_i\}$
• Analyze corpus: $\{\alpha_t, \beta_t\}$
• Train phonetic models $\{\mu_p, \Sigma_p\}$ (e.g. /SIL/, /F/, /AE/)

Synthesize!

• Input: /SIL B B B AE AE JH JH SIL SIL/
• Trajectory synthesis: $\{\alpha_t, \beta_t\}$
• MMM synthesis
• Post-process


Results

Mary101:

8 minutes of training data

1-syllable words: 132 training/20 test

2-syllable words: 136 training/20 test

46-prototype MMM

Sentences not even included in training.


Comments So Far

“She looks like she’s been Botox’ed” -- Nobel Laureate

“Has she had a frontal lobotomy?” -- ATT executive

Send me your comments to

[email protected]


Visual Turing Tests

We win!

Experiment             % correct   P<
Single presentation    52.1%       0.3
Double presentation    46.6%       0.5


Visual Intelligibility

Still some work to do…

Correct Phoneme ID

Experiment     % correct on N   % correct on S   P<
Words+Sents    30.01%           21.19%           0.001
Words          38.55%           28.07%           0.001
Sents          24.38%           16.52%           0.01


Stay Tuned!

Acknowledgments: Association Christian Benoit, NSF, NTT, ITRI

Mary101

Dynasty Models
Craig Milanesi
Dave Konstine
Joanne Flood
Jay Benoit
Marypat Fitzgerald
Casey Johnson
Vinay Kumar
Sayan Mukherjee
Chao Wang
Adlar Kim
Danielle Suh
Osamu Yoshimi
Volker Blanz
Thomas Vetter
Demetri Terzopoulos
Jenny Shapiro/BMG
Rehema Ellis/NBC
Kevin Chang