
Designing Facial Animation For Speaking Persian Language

Hadi [email protected]

June 2005

System Description

Input: speech signal (speech stream)
Output: facial animation of a generic 3D face in the MPEG-4 standard

Agenda

MPEG-4 Standard
Speech Processing
Different Approaches
Learning Phase
Face Feature Extraction
Training Neural Networks
Experimental Results
Conclusion

MPEG-4 Standard

Multimedia communication standard, 1999, by the Moving Picture Experts Group
High quality at low bit rates
Interaction of users with media
Object oriented: objects with properties
Scalable quality
SNHC (Synthetic/Natural Hybrid Coding): synthetic faces and bodies

Facial Animation in MPEG-4

FDP (Face Definition Parameters): shape
84 feature points
Texture

FAP (Face Animation Parameters): for animating the feature points
68 parameters
High level / low level
Global and local parameters
FAP units

Face Definition Parameters

Face Animation Parameter Units

Speech Processing

Phases:
Noise reduction (simple noise)
Framing
Feature extraction

Speech features: LPC, MFCC, Delta MFCC, Delta Delta MFCC

[Diagram: each speech frame (Frame 1, Frame 2, ...) is mapped to a feature vector (X1, X2, ...)]
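A minimal sketch of this feature extraction step, assuming the librosa library as a stand-in for the original tooling (the thesis's exact window sizes and coefficient counts are not stated here); it produces one stacked MFCC / Delta MFCC / Delta Delta MFCC vector per speech frame, at the 50 fps speech frame rate mentioned later:

```python
import numpy as np
import librosa  # assumed library; not necessarily what the thesis used

def speech_features(wav_path, n_mfcc=13, frames_per_sec=50):
    """Return one feature vector (MFCC + Delta + Delta Delta) per speech frame."""
    signal, sr = librosa.load(wav_path, sr=16000)             # mono, 16 kHz (assumed)
    hop = sr // frames_per_sec                                # 50 feature frames per second
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop, n_fft=2 * hop)
    d1 = librosa.feature.delta(mfcc, order=1)                 # Delta MFCC
    d2 = librosa.feature.delta(mfcc, order=2)                 # Delta Delta MFCC
    return np.vstack([mfcc, d1, d2]).T                        # shape: (frames, 3 * n_mfcc)
```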

Two Approaches

Phoneme-viseme mapping approaches:
Transitions among visemes
Discrete phonetic units
Extremely stylized
Language dependent

Acoustic-visual mapping approaches:
Relation between speech features and facial expressions
Functional approximation
Language independent
Neural networks and HMMs as learning machines for the mapping

Learning Phase

[Diagram: the speaker video and the speech stream each pass through feature extraction; FAPs are extracted from the video features and can be visualized with a FAP player, while the speech features and the FAPs together are used for training the NN]

Face Feature Extraction

Deformable-template-based approach
Semi-automatic
Candide model:

A wireframe model for model-based coding
Parameterized
113 vertices, 168 faces

Candide Model

Parameters of the WFM (wireframe model)

Global: 3D rotation, 2D translation, scale

Shape Units: lip width, eye distance, …

Action Units: lip shape, eyebrows, …

Each parameter value is a real number

Texture
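A minimal sketch of how such a parameterized wireframe is usually evaluated (the standard Candide formulation, vertices = scale · R · (mean + S·σ + A·α) + t); the unit counts below are placeholders, not the thesis's values:

```python
import numpy as np

def candide_vertices(mean_shape, S, A, sigma, alpha, rot, scale, trans):
    """Evaluate a Candide-style wireframe model.

    mean_shape : (113, 3) mean vertex positions
    S          : (113, 3, n_shape)  shape-unit deformation vectors
    A          : (113, 3, n_action) action-unit deformation vectors
    sigma      : (n_shape,)  shape-unit parameters  (e.g. lip width, eye distance)
    alpha      : (n_action,) action-unit parameters (e.g. lip shape, eyebrows)
    rot        : (3, 3) rotation matrix, scale : scalar, trans : (3,) translation
    """
    g = mean_shape + S @ sigma + A @ alpha      # local deformation by shape/action units
    return scale * (g @ rot.T) + trans          # global pose

# Example with placeholder data, just to show the shapes involved
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mean = rng.normal(size=(113, 3))
    S = rng.normal(size=(113, 3, 12))           # 12 shape units: placeholder count
    A = rng.normal(size=(113, 3, 4))            # 4 action units: placeholder count
    verts = candide_vertices(mean, S, A,
                             sigma=np.zeros(12), alpha=np.zeros(4),
                             rot=np.eye(3), scale=1.0, trans=np.zeros(3))
    print(verts.shape)                          # (113, 3)
```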

New Face Generation

Transformation

[Diagram: an affine transformation maps the source points (a1, b1), (a2, b2), (a3, b3) onto the target points (x1, y1), (x2, y2), (x3, y3), with correspondences (a1, b1) -> (x1, y1), (a2, b2) -> (x2, y2), (a3, b3) -> (x3, y3); a point P in the source plane is mapped to P* in the target plane]

Transformation (cont.)

$$
\begin{bmatrix} x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \\ 1 & 1 & 1 \end{bmatrix}
=
\begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ 1 & 1 & 1 \end{bmatrix},
\qquad
T = X\,A^{T}\bigl(A\,A^{T}\bigr)^{-1}
$$

where X holds the target points, A the source points, and T is the affine transformation matrix.
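A minimal numpy sketch of this step, assuming exactly three non-collinear correspondences as in the diagram:

```python
import numpy as np

def affine_from_correspondences(src, dst):
    """Affine transform T (3x3, last row [0, 0, 1]) mapping src points to dst points.

    src, dst : (3, 2) arrays of corresponding 2D points
               [(a1, b1), (a2, b2), (a3, b3)] and [(x1, y1), (x2, y2), (x3, y3)].
    """
    A = np.vstack([np.asarray(src, float).T, np.ones(3)])   # source points as columns
    X = np.vstack([np.asarray(dst, float).T, np.ones(3)])   # target points as columns
    return X @ A.T @ np.linalg.inv(A @ A.T)                 # T = X A^T (A A^T)^-1

# Usage: warp any source point into the target face
src = [(0, 0), (1, 0), (0, 1)]
dst = [(2, 1), (3, 1), (2, 3)]
T = affine_from_correspondences(src, dst)
p = T @ np.array([0.5, 0.5, 1.0])        # homogeneous source point
print(p[:2])                             # its position in the target image
```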

New Face Generation

Model Adaptation

Selecting optimal parameters:

Global parameters: 3D rotation, 2D translation, scale
Lip parameters: upper lip, jaw open, lip width, vertical movement of the lip corners

Full search (expensive)
Using previous frame information, as sketched below
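A minimal sketch of the previous-frame idea: a greedy local search over the model parameters, started from the previous frame's values instead of searching the full parameter space. The error_fn (how well the projected wireframe matches the current video frame) and the step schedule are assumptions, not the thesis's actual fitting criterion:

```python
import numpy as np

def adapt_model(error_fn, prev_params, step=0.05, iters=20):
    """Local search for model parameters, initialized from the previous frame.

    error_fn(params) -> scalar fitting error of the wireframe model on the
    current frame; prev_params -> parameter vector found for the previous frame.
    """
    params = np.asarray(prev_params, dtype=float).copy()
    best = error_fn(params)
    for _ in range(iters):
        improved = False
        for i in range(len(params)):             # try one parameter at a time
            for delta in (+step, -step):
                trial = params.copy()
                trial[i] += delta
                e = error_fn(trial)
                if e < best:                      # keep the change if it helps
                    params, best, improved = trial, e, True
        if not improved:
            step *= 0.5                           # refine the step size
    return params
```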

Lip Reading

Color data is used to estimate the lip area
The extracted lip area is used to estimate the lip model parameters: upper lip, jaw open, mouth width, lip corners
The related vertices of the Candide model are used
Two regions are marked in the first frame:
Lip region
Non-lip region

Lip Area Classification

Fisher Linear Discriminant (FLD): simple and fast

Two point sets X, Y in n dimensions
m1 is the mean of the projection of X onto the unit vector α
m2 is the mean of the projection of Y onto the unit vector α

Find the α that maximizes

$$J(\alpha) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$

where s1² and s2² are the scatters of the projected sets X and Y.
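A minimal sketch of the two-class FLD on pixel colors, using the standard closed-form direction α ∝ S_W⁻¹(μ₁ − μ₂); the threshold choice is left open, as the thesis's exact training details are not shown here:

```python
import numpy as np

def fisher_direction(lip_pixels, nonlip_pixels):
    """Return the unit vector that best separates the two pixel classes.

    lip_pixels, nonlip_pixels : (n, 3) arrays of color values (e.g. HSV)
    taken from the lip and non-lip regions marked in the first frame.
    """
    X, Y = np.asarray(lip_pixels, float), np.asarray(nonlip_pixels, float)
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    # Within-class scatter matrix S_W = S_X + S_Y
    Sw = (X - mu_x).T @ (X - mu_x) + (Y - mu_y).T @ (Y - mu_y)
    alpha = np.linalg.solve(Sw, mu_x - mu_y)     # direction maximizing J(alpha)
    return alpha / np.linalg.norm(alpha)

def classify_lip(pixels, alpha, threshold):
    """Label pixels as lip (True) / non-lip (False) by thresholding the projection."""
    return np.asarray(pixels, float) @ alpha > threshold
```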

Estimating Lip Parameters

The FLD is trained on the pixels of the first frame, using the color data of each pixel
HSV works better than RGB: it is more robust to different brightness conditions

Lip Area Classification

A simple approach for estimating the lip parameters from the classified lip area: column scanning and row scanning, as sketched below
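A minimal sketch of that row/column scanning over the binary lip mask produced by the classifier; the specific measurements returned here are illustrative assumptions:

```python
import numpy as np

def scan_lip_mask(mask):
    """Estimate simple lip measurements from a binary lip mask (rows x cols).

    Column scanning gives the mouth width; row scanning gives the mouth
    opening and the row of the upper lip.
    """
    rows = np.where(mask.any(axis=1))[0]       # rows containing lip pixels
    cols = np.where(mask.any(axis=0))[0]       # columns containing lip pixels
    if rows.size == 0 or cols.size == 0:
        return None                            # no lip area detected
    return {
        "mouth_width":  cols[-1] - cols[0] + 1,
        "mouth_height": rows[-1] - rows[0] + 1,
        "upper_lip_row": rows[0],
        "corner_rows": (int(np.where(mask[:, cols[0]])[0].mean()),
                        int(np.where(mask[:, cols[-1]])[0].mean())),
    }
```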

Generating FAPs from model

Generating the FAP file from the model:
The FAP file format was worked out with a trial-and-error approach
Open source FAP players take the FAP file and the wave file as input

Training Neural Networks

Data set of 60 videos: 45 sentences for training, 15 sentences for testing

Multilayer perceptrons: one input layer, one hidden layer, one output layer; trained with the back-propagation algorithm

Nine neurons in the output layer: five global parameters and four lip parameters

Training Neural Networks

Four speech features: LPC, MFCC, Delta MFCC, Delta Delta MFCC

Six networks for each speech feature:
One feature vector as input: 30, 60, or 90 neurons in the hidden layer
Three feature vectors as input: 90, 120, or 150 neurons in the hidden layer

Frame rates: video 25 fps, speech 50 fps
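A minimal sketch of one such network, using scikit-learn's MLPRegressor as a stand-in for the original back-propagation implementation; the context stacking (three consecutive feature vectors) and the resampling of the 25 fps video parameters to the 50 fps speech rate are assumed to have been done beforehand:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor   # assumed stand-in implementation

def train_mapping(speech_feats, face_params, hidden=120, context=3):
    """Train an MLP mapping speech feature vectors to face model parameters.

    speech_feats : (n_frames, d) one feature vector per speech frame (50 fps)
    face_params  : (n_frames, 9) five global + four lip parameters per frame,
                   already resampled to the speech frame rate
    """
    # Stack `context` consecutive feature vectors as one input sample
    x = np.hstack([speech_feats[i:len(speech_feats) - context + i + 1]
                   for i in range(context)])
    y = face_params[context - 1:]                   # parameters of the last stacked frame
    net = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=2000)
    net.fit(x, y)
    return net
```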

Generating Results From NNs

Generating four lip parameters for each frame

Assessment Criterion

A performance metric to measure the prediction accuracy of the audio-visual mapping

Correlation coefficient G; G is one if the two vectors are equal

k : frame number

N : number of frames in the test set

$$G = \frac{1}{N\,\sigma_p\,\sigma_b}\sum_{k=1}^{N}\bigl(p(k)-\mu_p\bigr)\bigl(b(k)-\mu_b\bigr)$$

where p(k) is the predicted and b(k) the measured parameter value in frame k, and μ, σ are their means and standard deviations over the test set.
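A minimal numpy sketch of this criterion; it is the ordinary Pearson correlation coefficient between the predicted and measured parameter tracks:

```python
import numpy as np

def correlation_G(predicted, measured):
    """Correlation coefficient G between a predicted and a measured parameter track.

    predicted, measured : (N,) arrays, one value per test frame.
    Returns 1.0 when the two tracks agree (up to scale and offset).
    """
    p = np.asarray(predicted, float)
    b = np.asarray(measured, float)
    p = (p - p.mean()) / p.std()
    b = (b - b.mean()) / b.std()
    return float(np.mean(p * b))
```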

Results For LPC Networks

[Chart: correlation coefficients (scale 0 to 0.7) of the six LPC networks, shown for Upper Lip, Jaw Open, Mouth Width, Mouth Corners, and the Mean]

Results For MFCC Networks

[Chart: correlation coefficients (scale 0 to 0.8) of the six MFCC networks, shown for Upper Lip, Jaw Open, Mouth Width, Mouth Corners, and the Mean]

Results For Delta MFCC Networks

[Chart: correlation coefficients (scale 0 to 0.8) of the six Delta MFCC networks, shown for Upper Lip, Jaw Open, Mouth Width, Mouth Corners, and the Mean]

Results For Delta Delta MFCC Networks

[Chart: correlation coefficients (scale 0 to 0.8) of the six Delta Delta MFCC networks, shown for Upper Lip, Jaw Open, Mouth Width, Mouth Corners, and the Mean]

Conclusion

Speech-driven facial animation is possible!
Delta Delta MFCC gives the best performance
Using the previous and next speech frames improves the performance
Using a combination of different speech features

Future Work

More training data
Speaker-independent training data
Multiple languages
Other speech features
Combinations of speech features
Facial emotions
HMMs for storing the mappings

Thanks…