Visual Speech Animation

Lei Xie, Lijuan Wang, and Shan Yang

Contents

Introduction
State of the Art
A Typical VSA System
   Data Collection
   Face/Mouth Model
   Input and Feature Extraction
   Mapping Methods
A Deep BLSTM-RNN-Based Approach
   RNN
   LSTM-RNN
   The Talking Head System
   Performances
Selected Applications
   Motivation
   Karaoke Function
   Technology Outlook
Summary
References

L. Xie (*)
School of Computer Science, Northwestern Polytechnical University (NWPU), Xi'an, P. R. China
e-mail: [email protected]; [email protected]

L. Wang
Microsoft Research, Redmond, WA, USA
e-mail: [email protected]

S. Yang
School of Computer Science, Northwestern Polytechnical University, Xi'an, China
e-mail: [email protected]

© Springer International Publishing AG 2016
B. Müller, S.I. Wolf (eds.), Handbook of Human Motion, DOI 10.1007/978-3-319-30808-1_1-1


Abstract
Visual speech animation (VSA) has many potential applications in human-computer interaction, assisted language learning, entertainment, and other areas. But it is one of the most challenging tasks in human motion animation because of the complex mechanisms of speech production and facial motion. This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in this area. Specifically, after introducing the basic concepts and the building blocks of a typical VSA system, we showcase a state-of-the-art approach based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNN) for audio-to-visual mapping, which aims to create a video-realistic talking head. Finally, the Engkoo project from Microsoft is highlighted as a practical application of visual speech animation in language learning.

Keywords
Visual speech animation • Visual speech synthesis • Talking head • Talking face • Talking avatar • Facial animation • Audio visual speech • Audio-to-visual mapping • Deep learning • Deep neural network

Introduction

Speech production and perception are both bimodal in nature. Visual speech, i.e., speech-evoked facial motion, plays an indispensable role in speech communication. Plenty of evidence shows that voice and face reinforce and complement each other in human-human communication (McGurk and MacDonald 1976). Viewing the speaker's face (and mouth) provides valuable information for speech perception. Visible speech is particularly effective when auditory speech is degraded or contaminated due to acoustic noise, bandwidth limitation, or hearing impairment. In an early study, Breeuwer and Plomp (1985) already showed that the recognition of short, band-pass-filtered sentences improves significantly when subjects are allowed to watch the speaker. The same level of improvement can be observed in hearing-impaired listeners and cochlear implant patients (Massaro and Simpson 2014). In these experiments, lipreading provides essential speech perceptual information. The influence of visual speech is not limited to situations with degraded auditory input: Sumby and Pollack found that seeing the speaker's face is equivalent to about a 15 dB improvement in the signal-to-noise ratio (SNR) of the acoustic signal (Sumby and Pollack 1954).

Due to the influence of visual speech in human-human speech communication, researchers have also examined its impact on human-machine interaction. Ostermann and Weissenfeld (2004) showed that trust and attention of humans toward machines increase by 30% when communicating with a talking face instead of text only. In other words, visual speech can attract the attention of a user, making the human-machine interface more engaging.


Hence, visual speech animation (VSA)1 aims to animate the lips/mouth/articulators/face in synchrony with speech for different purposes. In a broad sense, VSA may include facial expressions (Jia et al. 2011; Cao et al. 2005) and visual prosody (Cosatto et al. 2003) such as head (Ben Youssef et al. 2013; Le et al. 2012; Busso et al. 2007; Ding et al. 2015; Jia et al. 2014) and eye (Le et al. 2012; Dziemianko et al. 2009; Raidt et al. 2007; Deng et al. 2005) motions, which naturally accompany human speech. Readers can refer to Chapter Eye Motion and Chapter Head Motion Generation for more details.

Applications of VSA can be found across many domains, such as technical support and customer service, communication aids, speech therapy, virtual reality, gaming, film special effects, education, and training (Hura et al. 2010). Specific applications include a virtual storyteller for children, a virtual guide or presenter for a personal or commercial Web site, a representative of the user in computer games, and funny puppetry for computer-mediated human communication. VSA clearly promises to become an essential multimodal interface in many applications.

Speech-evoked face animation is one of the most challenging tasks in human motion animation. The human face has an extremely complex geometric form (Pighin et al. 2006), and speech-originated facial movements are the result of a complicated interaction among a number of anatomical layers that include bone, muscle, fat, and skin. As a result, humans are extremely sensitive to the slightest artifacts in an animated face, and even small, subtle changes can lead to an unrealistic appearance. To achieve realistic visual speech animation, tremendous efforts from the speech, image, computer graphics, pattern recognition, and machine learning communities have been made over the past decades (Parke 1972). These efforts have been summarized in the proceedings of the visual speech synthesis challenge (LIPS) (Theobald et al. 2008), surveys (Cosatto et al. 2003; Ostermann and Weissenfeld 2004), featured books (Pandzic and Forchheimer 2002; Deng and Neumann 2008), and several journal special issues (Xie et al. 2015; Fagel et al. 2010). This chapter aims to introduce the basic principles, survey the state-of-the-art technologies, and discuss featured applications.

State of the Art

After decades of research, a state-of-the-art visual speech animation system can now achieve lifelike or video-realistic performance through 2D, 2.5D, or 3D face modeling and a statistical/parametric text/speech-to-visual mapping strategy. For instance, in Fan et al. (2016), an image-based 2D video-realistic talking head is introduced. The lower face region of a speaker is modeled by a compact model learned from a set of facial images, called the active appearance model (AAM). Given pairs of audio and visual parameter sequences, a deep neural network model is trained to learn

1 Also called visual speech synthesis, talking face, talking head, talking avatar, speech animation, and mouth animation.


the sequence mapping from the audio to the visual space. To further improve the realism of the talking head, the trajectory tiling method is adopted, which uses the predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. Based on similar techniques, Microsoft has released an online visual speech animation system that helps users learn English (Wang et al. 2012c).

A Typical VSA System

As shown in Fig. 1, a typical visual speech animation system is usually composed of several modules: data collection, face/mouth modeling, feature extraction, and learning a mapping model.

Data Collection

According to the source of data used for face/mouth/articulator modeling, a VSA system can be built from images, video recordings, and various motion capture equipment such as mocap, electromagnetic articulography (EMA), magnetic resonance imaging (MRI), and X-ray. What type of data is collected essentially depends on the cost, the desired appearance of the face/head, and the application needs. Many approaches choose the straightforward way for data collection: videos of a speaker are recorded by a camera, and the image sequences are used as the source for 2D or 3D face/head modeling (Theobald et al. 2008; Bregler et al. 1997; Cosatto et al. 2003; Cosatto and Graf 1998; Xie and Liu 2007a; Wang et al. 2010a; Cosatto and Graf 2000; Anderson et al. 2013; Ezzat et al. 2002; Ezzat and Poggio 2000; Xie and Liu 2007b), as shown in Fig. 2a. A recent trend for producing quality facial animation is to use 3D motion-captured data (Deng and Neumann 2008), which have been successfully used in movie special effects to drive virtual characters. As shown in Fig. 2b, to record facial movements, an array of high-performance cameras is utilized

Fig. 1 The building blocks of a typical VSA system


to reconstruct the 3D marker locations on a subject's face. Although a mocap system is quite expensive and difficult to set up, the reconstructed data provide accurate timing and motion information. Once the data are collected, facial animation can be created by controlling an underlying muscle structure or blend shapes (see Chapter Blendshape Facial Animation for details).

Another data collection system, EMA, as shown in Fig. 2c, is often used to record the complex movements of the lips, jaw, tongue, and even intraoral articulators (Richmond et al. 2011). The sensors, called coils, are attached to different positions on a speaker's face or inside the mouth, and the 3D movements of the sensors are collected at a high frame rate (e.g., 200 frames per second) while the speaker is talking. Visual speech animation generated from EMA data is usually used for articulation visualization (Huang et al. 2013; Fagel and Clemens 2004; Wik and Hjalmarsson 2009). In Wang et al. (2012a), an animated talking head is created based on EMA articulatory data for pronunciation training purposes.

Face/Mouth Model

The appearance of a visual speech animation system is determined by the underlying face/mouth model, and generating animated talking heads that look like real people is challenging. Existing approaches to talking heads use either image-based 2D models (Seidlhofer 2009; Zhang et al. 2009) or geometry-based 3D ones (Musti et al. 2014). Cartoon avatars are relatively easy to build. The more human-like, realistic avatars that can be seen in some games and movies are much harder to build. Traditionally, expensive motion capture systems are required to track the real person's motion or, in an even more expensive way, artists have to manually hand-touch every frame. Some desirable features of the next-generation avatar are as follows: it should be a 3D avatar that can be integrated easily into a versatile 3D virtual world; it should be photo-realistic; it should be customizable to any user; and, last but not least, it should be created automatically with a small amount of recorded data. That is to say, the next-generation avatar should be 3D, photo-realistic, personalized or customized, and easy to create with little bootstrapping data.

Fig. 2 Various data collection methods for building a visual speech animation system. (a) Camera video (Theobald et al. 2008), (b) motion capture (Busso et al. 2007), and (c) EMA from http://www.gipsa-lab.grenoble-inp.fr/


In the facial animation world, a great variety of animation techniques based on 3D models exist (Seidlhofer 2009). In general, these techniques first generate a 3D face model consisting of a 3D mesh, which defines the geometric shape of a face. For this purpose, many different hardware systems are available, ranging from 3D laser scanners to multi-camera systems. In a second step, either a human-like or cartoon-like texture may be mapped onto the 3D mesh. Besides generating a 3D model, animation parameters have to be determined for the later animation. A traditional 3D avatar requires a highly accurate geometric model to render soft tissues like the lips, tongue, and wrinkles. It is both computationally intensive and mathematically challenging to build or run such a model. Moreover, any unnatural deformation will make the resultant output fall into the uncanny valley of human rejection; that is, it will be rejected as unnatural.

Image-based facial animation techniques achieve great realism in synthesized videos by combining different facial parts of recorded 2D images (Massaro 1998; Zhang et al. 2009; Eskenazi 2009; Scott et al. 2011; Badin et al. 2010). In general, image-based facial animation consists of two main steps: audiovisual analysis of a recorded human subject and synthesis of the facial animation. In the analysis step, a database with images of deformable facial parts of the human subject is collected, while the time-aligned audio is segmented into phonemes. In the synthesis step, a face is synthesized by first generating the audio from text using a text-to-speech (TTS) synthesizer. The TTS synthesizer sends phonemes and their timing to the face animation engine, which overlays facial parts corresponding to the generated speech over a background video sequence. Massaro (1998), Zhang et al. (2009), Eskenazi (2009), and Scott et al. (2011) show image-based speech animations that cannot be distinguished from recorded video. However, it is challenging to change the head pose freely or to render different facial expressions, and it is hard to blend such animations seamlessly into 3D scenes. Image-based approaches have the advantage that a photo-realistic appearance is guaranteed. However, a talking head needs not only to be photo-realistic in static appearance but also to exhibit convincing plastic deformations of the lips synchronized with the corresponding speech, realistic head movements, and natural facial expressions.

An ideal 3D talking head can mimic the realistic motion of a real human face in 3D space. One challenge for rendering realistic 3D facial animation lies in the mouth area. The lips, teeth, and tongue are nonrigid tissues, sometimes with occlusions. This means accurate geometric modeling is difficult, and it is also hard to deform them properly. Moreover, they need to move in sync with the spoken audio; otherwise, people will observe the asynchrony and find it unnatural. In the real world, when people talk, driven by the vocal organs and facial muscles, both the 3D geometry and the texture appearance of the face are constantly changing. Ideally, both geometry change and texture change would be captured simultaneously. There is a lot of ongoing research on this problem. For example, with the help of motion-sensing devices such as the Microsoft Kinect, researchers use captured 3D depth information to better acquire the 3D geometry model. Alternatively, researchers try to recover the 3D face shape from single or multiple camera views (Wang et al. 2011; Sako et al. 2000; Yan et al. 2010).

In the 2.5D talking head approach, as no captured 3D geometry information is available, the work in Sako et al. (2000) is adopted, which reconstructs a 3D face model from a single frontal face image. The only required input to the 2D-to-3D system is a


frontal face image of the subject with normal illumination and a neutral expression. A semi-supervised ranking prior likelihood model for accurate local search and a robust parameter estimation approach are used for face alignment. Based on this 2D alignment algorithm, 87 key feature points are automatically located, as shown in Fig. 3. The feature points are accurate enough for face reconstruction in most cases. A general 3D face model is applied for personalized 3D face reconstruction. The 3D shapes are compressed by PCA. After the 2D face alignment, the key feature points are used to compute the 3D shape coefficients of the eigenvectors. The coefficients are then used to reconstruct the 3D face shape. Finally, the face texture is extracted from the input image. By mapping the texture onto the 3D face geometry, the 3D face model for the input 2D face image is reconstructed. A 3D face model is reconstructed for each 2D image sample in the recordings, as shown in Fig. 3. A 3D sample library is thus formed, where each 3D sample has a 3D geometry mesh, a texture, and the corresponding UV mapping, which defines how the texture is projected onto the 3D model. After the 2D-to-3D transformation, the original 2D sample recordings turn into 3D sample sequences, which consist of three synchronous streams: geometry mesh sequences depicting the dynamic shape, texture image sequences for the changing appearance, and the corresponding speech audio. This method combines the best of both 2D image sample-based and 3D model-based facial animation technologies. It renders realistic articulator animation by wrapping 2D video images around a simple and smooth 3D face model. The 2D video sequence captures the natural movement of soft tissues, which helps the new talking head bypass the difficulties in rendering occluded articulators (e.g., tongue and teeth). Moreover, with the versatile 3D geometry model, different head poses and facial expressions can be freely controlled. The 2.5D talking head can be customized to any user by using a 2D video of that user.

Techniques based on 3D models impress with their high degree of automation and flexibility but lack realism. Image-based facial animation achieves photo-realism but offers little flexibility and a lower degree of automation. Because they achieve photo-realism, image-based techniques seem to be the best candidates for leading facial animation to new applications. An image-based technique combined with a 3D model generates photo-realistic facial animations while providing some flexibility to the user.

Input and Feature Extraction

According to the input signal, a visual speech animation system can be driven by text, speech, or performance. The simplest VSA aims to visualize speech pronunciations by an avatar from tracked markers of a human performance. Currently, performance-based facial animation can be quite realistic (Thies et al. 2016; Wang and Soong 2012; Weise et al. 2011). The aim of such a system is usually not only speech visualization; for example, in Thies et al. (2016), an interesting application for real-time facial reenactment is introduced. Readers can go through Chapter Video-based Performance Driven Facial Animation for more details.

During the facial data collection process, speech and text are always collected as well. Hence, visual speech can be driven by new voice or text input, achieved by a


learned text/audio-to-visual mapping model that will be introduced in the next section. To learn such a mapping, a feature extraction module is first used to obtain representative text or audio features. The textual features are often similar to those used in a TTS system (Taylor 2009), which may include information about phonemes, syllables, stress, prosodic boundaries, and part-of-speech (POS) labels. Audio features can be typical spectral features (e.g., MFCC (Fan et al. 2016)), pitch, and other acoustic features.
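
As an illustration of the audio feature extraction step, the following minimal sketch (an assumption for illustration, not the pipeline used in the cited works) computes frame-level MFCC, delta, and pitch features with the librosa library; the file name "speaker.wav" and the 25 ms/10 ms frame settings are hypothetical.

```python
import librosa
import numpy as np

# Hypothetical mono recording of the subject, resampled to 16 kHz
audio, sr = librosa.load("speaker.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per 10 ms frame (25 ms window)
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Delta and delta-delta features capture short-term dynamics
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Fundamental frequency (pitch) track, one value per frame; unvoiced frames -> 0
f0, voiced_flag, _ = librosa.pyin(audio, sr=sr,
                                  fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"),
                                  frame_length=400, hop_length=160)
f0 = np.nan_to_num(f0)

# Final acoustic feature matrix: one row per frame
n = min(mfcc.shape[1], f0.shape[0])
features = np.vstack([mfcc[:, :n], delta[:, :n], delta2[:, :n], f0[np.newaxis, :n]]).T
print(features.shape)  # (num_frames, 40)
```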

Fig. 3 Auto-reconstructed 3D face model in different mouth shapes and in different view angles(w/o and w/ texture)


Mapping Methods

Both text- and speech-driven visual speech animation systems require an input-to-visual feature conversion or mapping algorithm. That is to say, the lip/mouth/facial movements must be naturally synchronized with the audio speech.2 The conversion is not trivial because of the coarticulation phenomenon of the human speech production mechanism, which causes a given phoneme to be pronounced differently depending on the surrounding phonemes. Due to this phenomenon, learning an audio/text-to-visual mapping becomes an essential task in visual speech animation. Researchers have devoted much effort to this task, and the developed approaches can be roughly categorized into rule-based, concatenation, parametric, and hybrid.

Rule Based

Due to the limitations of data collection and learning methods, early approaches are mainly based on hand-crafted mapping rules. In these approaches, the counterpart of the audio phoneme, the viseme, is defined as the basic visual unit. Typically, visemes are manually designed as key images of mouth shapes, as shown in Fig. 4, and empirical smoothing functions or coarticulation rules are used to synthesize novel speech animations. Ezzat and Poggio (2000) propose a simple approach that morphs key viseme images. Due to the coarticulation phenomenon, morphing between a set of mouth images is apparently not natural. Cohen and Massaro (1993) propose a coarticulation model in which a viseme shape is specified via dominance functions defined in terms of each facial measurement, such as the lips, tongue tip, etc., and the weighted sum of dominance values determines the final mouth shapes (a sketch of this idea is given below). In a more recent approach, Taylor et al. (2012) argue that static mouth shapes are not enough, so they redefine visemes as clustered temporal units that describe distinctive speech movements of the visual speech articulators, called dynamic visemes.
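
As a rough illustration of the dominance-function idea (not the exact Cohen-Massaro formulation; the exponential shape, parameter values, and viseme targets below are illustrative assumptions), each viseme contributes a target value for a facial measurement, weighted by a dominance that decays with distance from the viseme's center time:

```python
import numpy as np

def dominance(t, center, alpha=1.0, theta=0.05, c=1.0):
    """Illustrative exponential dominance: strongest at the viseme center,
    decaying with temporal distance (times in milliseconds)."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def blend_track(times, visemes):
    """Weighted sum of viseme targets for one facial measurement
    (e.g., lip opening), normalized by the total dominance."""
    values = np.zeros_like(times, dtype=float)
    weights = np.zeros_like(times, dtype=float)
    for center, target, params in visemes:
        d = dominance(times, center, **params)
        values += d * target
        weights += d
    return values / np.maximum(weights, 1e-8)

# Hypothetical visemes: (center time in ms, lip-opening target, dominance params)
visemes = [
    (100, 0.8, dict(alpha=1.0, theta=0.04)),   # open vowel
    (220, 0.1, dict(alpha=1.2, theta=0.06)),   # bilabial closure
    (340, 0.6, dict(alpha=0.9, theta=0.05)),   # mid vowel
]
times = np.arange(0.0, 450.0, 10.0)            # one value every 10 ms
lip_opening = blend_track(times, visemes)
```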

Concatenation/Unit Selection

To achieve a photo- or video-realistic animation effect, concatenation of real video clips from a recorded database has been considered (Bregler et al. 1997; Cosatto et al. 2003; Cosatto and Graf 1998, 2000). The idea is quite similar to that of concatenative TTS (Hunt and Black 1996). In the off-line stage, a database of recorded videos is cut into sizable clips, e.g., triphone units. In the online stage, given a novel text or speech target, a unit selection process selects appropriate units and assembles them in an optimal way to produce the desired target, as shown in Fig. 5.

To achieve speech synchronization and a smooth video, the concatenation algorithm must be carefully designed. In Cosatto et al. (2003), a phonetically labeled target is first produced by a TTS system or by a labeler or an aligner from the recorded audio. From the phonetic target, a graph is created with states corresponding to the frames of the final animation. Each state of the final animation

2 Sometimes this task is called lip synchronization, or lip sync for short.


(a video frame) is populated with a list of candidate nodes (recorded video samples from the database). Each state is fully connected to the next, concatenation costs are assigned to each arc, and target costs are assigned to each node. A Viterbi search on the graph finds the optimal path, i.e., the path that generates the lowest total cost (a minimal search sketch is given below). The balance between the two costs is critical to the final performance, and its weighting is empirically tuned in real applications.
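
The following is a minimal dynamic-programming sketch of such a search; the target_cost and concat_cost functions are hypothetical stand-ins for the phoneme- and image-based costs described above, and lambda_c is the empirically tuned weighting between them.

```python
def viterbi_unit_selection(candidates, target_cost, concat_cost, lambda_c=1.0):
    """candidates[t] is the list of database samples that may fill frame t.
    Returns the sample sequence minimizing
    sum_t target_cost(t, s_t) + lambda_c * sum_t concat_cost(s_{t-1}, s_t)."""
    T = len(candidates)
    # best[t][i]: lowest accumulated cost ending in candidate i at frame t
    best = [[target_cost(0, s) for s in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        row_cost, row_back = [], []
        for s in candidates[t]:
            tc = target_cost(t, s)
            # pick the best predecessor for this candidate
            costs = [best[t - 1][j] + lambda_c * concat_cost(p, s)
                     for j, p in enumerate(candidates[t - 1])]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row_cost.append(tc + costs[j_best])
            row_back.append(j_best)
        best.append(row_cost)
        back.append(row_back)
    # trace back the optimal path
    i = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [candidates[-1][i]]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        path.append(candidates[t - 1][i])
    return list(reversed(path))
```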

The video clips for unit selection are usually limited to the lower part of the face, which exhibits most of the speech-evoked facial motion. After selection, the concatenated lower-face clips are stitched onto a background whole-face video, resulting in the synthesized whole-face video, as shown in Fig. 6. To achieve seamless stitches, much effort has been spent on image processing. With a relatively large video database, the concatenation approach is able to achieve video-realistic performance, but it is difficult to add different expressions, and the flexibility of the generated visual speech animation is also limited.

Fig. 4 Several defined visemes from Ezzat and Poggio (2000)


Parametric/Statistical

Recently, parametric methods have gained much attention due to their elegantly automatic mappings learned from data. Numerous attempts have been made to model the relationship between audio and visual signals, and many are based on generative probabilistic models, where the underlying probability distributions of audiovisual data are estimated. Typical models include the Gaussian mixture model (GMM), hidden Markov model (HMM) (Xie and Liu 2007a; Fu et al. 2005), dynamic Bayesian network (DBN) (Xie and Liu 2007b), and switching linear dynamical system (SLDS) (Englebienne et al. 2007).

Fig. 5 Unit selection approach for visual speech animation (Fan et al. 2016)

Fig. 6 Illustration of the image stitching process in a video-realistic talking head (Fan et al. 2016)


Hidden Markov model-based statistical parametric speech synthesis (SPSS) has made significant progress (Tokuda et al. 2007). Hence, the HMM approach has also been intensively investigated for visual speech synthesis (Sako et al. 2000; Masuko et al. 1998). In HMM-based visual speech synthesis, auditory speech and visual speech are jointly modeled by HMMs, and the visual parameters are generated from the HMMs by using the dynamic ("delta") constraints of the features (Breeuwer and Plomp 1985). Convincing mouth video can be rendered from the predicted visual parameter trajectories. This approach is called the trajectory HMM. Usually, maximum likelihood (ML) is used as the criterion for HMM training. However, ML training does not directly optimize toward the visual generation error. To compensate for this deficiency, a minimum generated trajectory error (MGE) method is proposed in Wang et al. (2011) to further refine the audiovisual joint model by minimizing the error between the generated trajectories and the real target trajectories in the training set.
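
The parameter generation step under delta constraints can be viewed as a weighted least-squares problem: given per-frame means and variances of the static and delta features predicted by the model, the smooth static trajectory c satisfies (W^T Σ^{-1} W) c = W^T Σ^{-1} μ, where W stacks an identity matrix and a delta operator. The sketch below is a simplified, single-dimension illustration of this idea (not the implementation of the cited works; the central-difference delta window is an assumption).

```python
import numpy as np

def generate_trajectory(mu_static, mu_delta, var_static, var_delta):
    """Maximum-likelihood trajectory generation for one visual dimension.
    mu_*/var_*: per-frame means/variances of the static and delta features
    predicted by the statistical model (length-T arrays)."""
    T = len(mu_static)
    I = np.eye(T)
    # Central-difference delta operator: delta_t ~= 0.5*(c_{t+1} - c_{t-1})
    D = np.zeros((T, T))
    for t in range(T):
        D[t, max(t - 1, 0)] -= 0.5
        D[t, min(t + 1, T - 1)] += 0.5
    W = np.vstack([I, D])                          # (2T, T)
    mu = np.concatenate([mu_static, mu_delta])     # (2T,)
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    WtP = W.T * prec                               # W^T Sigma^{-1}
    return np.linalg.solve(WtP @ W, WtP @ mu)      # smooth static trajectory c
```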

Although HMMs can model sequential data efficiently, they still have limitations, such as the model assumptions made out of necessity (e.g., GMMs with diagonal covariance) and the greedy, hence suboptimal, search-based decision-tree contextual state clustering. Motivated by the superior performance of deep neural networks (DNN) in automatic speech recognition (Hinton et al. 2012) and speech synthesis (Zen et al. 2013), a neural network-based photo-realistic talking head is proposed in Fan et al. (2015). Specifically, a deep bidirectional long short-term memory recurrent neural network (BLSTM-RNN) is adopted to learn a direct regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. Experiments have confirmed that the BLSTM approach significantly outperforms the HMM approach (Fan et al. 2015). The BLSTM approach is introduced in detail later in this chapter.

Hybrid

Although parametric approaches have many merits, such as a small footprint, flexibility, and controllability, one obvious drawback is blurred animation due to feature dimension reduction and imperfect learning. Hence, hybrid visual speech animation approaches use the predicted trajectory to guide the sample selection process (Wang et al. 2010b), combining the advantages of both video-based concatenation and parametric statistical modeling. In a recent approach (Fan et al. 2016), the visual parameter trajectory predicted by a BLSTM-RNN is used as a guide to select a smooth sequence of real sample images from the recorded database.

A Deep BLSTM-RNN-Based Approach

In the past several years, deep neural networks (DNN) and deep learning methods (Deng and Yu 2014) have been successfully used in many tasks, such as speech recognition (Hinton et al. 2012), natural language processing, and computer vision. For example, the DNN-HMM approach has boosted speech recognition accuracy


significantly (Deng and Yu 2014). Deep neural networks have also been investigated for regression/mapping tasks, e.g., text-to-speech (Zen et al. 2013), learning clean speech from noisy speech for speech enhancement (Du et al. 2014), and articulatory movement prediction from text and speech (Zhu et al. 2015). The DNN approach has several advantages: it can model long-span, high-dimensional input features and their correlations; it can learn a nonlinear mapping between input and output with a deep-layered, hierarchical, feed-forward, and recurrent structure; and it has discriminative and predictive capability in the generation sense, given appropriate cost functions, e.g., generation error.

Recently, recurrent neural networks (RNNs) (Williams and Zipser 1989) and their bidirectional variant, bidirectional RNNs (BRNNs) (Schuster and Paliwal 1997), have become popular because they can incorporate contextual information that is essential for sequential data modeling. Conventional RNNs cannot model long-span relations in sequential data well because of the vanishing gradient problem (Hochreiter 1998). Hochreiter and Schmidhuber (1997) found that the LSTM architecture, which uses purpose-built memory cells to store information, is better at exploiting long-range context. Combining BRNNs with LSTM gives the BLSTM, which can access long-range context in both directions. Speech, in both its auditory and visual forms, is typical sequential data. In a recent study, the BLSTM has shown state-of-the-art performance in audio-to-visual sequential mapping (Fan et al. 2015).

RNN

Allowing cyclical connections in a feed-forward neural network yields a recurrent neural network (RNN) (Williams and Zipser 1989). RNNs can incorporate contextual information from previous input vectors, which allows past inputs to persist in the network's internal state. This property makes them an attractive model for sequence-to-sequence learning. For a given input vector sequence $x = (x_1, x_2, \ldots, x_T)$, the forward pass of an RNN is as follows:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad (1)$$

$$y_t = W_{hy} h_t + b_y, \qquad (2)$$

where $t = 1, \ldots, T$ and $T$ is the length of the sequence; $h = (h_1, \ldots, h_T)$ is the hidden state vector sequence computed from $x$; $y = (y_1, \ldots, y_T)$ is the output vector sequence; $W$ denotes the weight matrices, where $W_{xh}$, $W_{hh}$, and $W_{hy}$ are the input-hidden, hidden-hidden, and hidden-output weight matrices, respectively; $b_h$ and $b_y$ are the hidden and output bias vectors, respectively; and $\mathcal{H}$ denotes the nonlinear activation function of the hidden layer.

For the visual speech animation task, because of the speech coarticulation phenomenon, the model should have access to both past and future contexts. Bidirectional recurrent neural networks (BRNNs), as shown in Fig. 7, fit this task


well. A BRNN computes both a forward state sequence $\overrightarrow{h}$ and a backward state sequence $\overleftarrow{h}$, as formulated below:

$$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}), \qquad (3)$$

$$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}), \qquad (4)$$

$$y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y. \qquad (5)$$

LSTM-RNN

Conventional RNNs can access only a limited range of context because of the vanishing gradient problem. Long short-term memory (LSTM) uses purpose-built memory cells, as shown in Fig. 8, to store information and is designed to overcome this limitation. In sequence-to-sequence mapping tasks, LSTM has been shown to be capable of bridging very long time lags between input and output sequences by enforcing constant error flow. For LSTM, the recurrent hidden layer function $\mathcal{H}$ is implemented as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \qquad (6)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f), \qquad (7)$$

$$a_t = \tau(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \qquad (8)$$

$$c_t = f_t c_{t-1} + i_t a_t, \qquad (9)$$

Fig. 7 Bidirectional recurrent neural networks (BRNNs), with forward and backward hidden layers between the inputs and outputs


$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \qquad (10)$$

$$h_t = o_t \, \theta(c_t), \qquad (11)$$

where $\sigma$ is the sigmoid function; $i$, $f$, $o$, $a$, and $c$ are the input gate, forget gate, output gate, cell input activation, and cell memory, respectively; and $\tau$ and $\theta$ are the cell input and output nonlinear activation functions, for which tanh is generally chosen. The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby avoiding the vanishing gradient problem.

Combining BRNNs with LSTM gives rise to the BLSTM, which can access long-range context in both directions. Motivated by the success of deep neural network architectures, deep BLSTM-RNNs (DBLSTM-RNNs) are used to establish the audio-to-visual mapping for visual speech animation. A deep BLSTM-RNN is created by stacking multiple BLSTM hidden layers.
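
As a concrete illustration, the following minimal PyTorch sketch (an assumption for illustration, not the authors' implementation; layer sizes and feature dimensions are hypothetical) stacks bidirectional LSTM layers and regresses frame-level visual features from frame-level label vectors by minimizing the sum of squared errors:

```python
import torch
import torch.nn as nn

class DBLSTMRegressor(nn.Module):
    """Stacked bidirectional LSTM mapping a label sequence (T, label_dim)
    to a visual feature sequence (T, visual_dim)."""
    def __init__(self, label_dim, visual_dim, hidden=128, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(input_size=label_dim, hidden_size=hidden,
                             num_layers=num_layers, bidirectional=True,
                             batch_first=True)
        self.out = nn.Linear(2 * hidden, visual_dim)  # forward + backward states

    def forward(self, labels):                 # labels: (batch, T, label_dim)
        h, _ = self.blstm(labels)              # (batch, T, 2*hidden)
        return self.out(h)                     # (batch, T, visual_dim)

# Hypothetical dimensions: triphone+state one-hot labels and AAM features
model = DBLSTMRegressor(label_dim=3 * 41 + 3, visual_dim=60)
criterion = nn.MSELoss(reduction="sum")        # sum of squared errors (SSE)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

labels = torch.randn(1, 200, 3 * 41 + 3)       # stand-in for one training sequence
targets = torch.randn(1, 200, 60)
loss = 0.5 * criterion(model(labels), targets) # 1/2 * SSE over the sequence
loss.backward()
optimizer.step()
```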

The Talking Head System

Figure 9 shows the diagram of an image-based talking head system using a DBLSTM as the mapping function (Fan et al. 2015). The diagram follows the basic structure of a typical visual speech animation system in Fig. 1. The aim of the system is to achieve speech animation with video-realistic effects. First, an audio/visual database of a subject talking to a camera with a frontal view of his/her face is recorded as the training data. In the training stage, the audio is converted into a sequence of

Fig. 8 Long short-term memory (LSTM): a memory block with a cell and input, output, and forget gates


contextual phoneme labels L using forced alignment, and the corresponding lower-face image sequence is transformed into active appearance model (AAM) feature vectors V. Then a deep BLSTM neural network is used to learn a regression model between the two parallel audio and visual sequences by minimizing the SSE of the prediction, in which the input layer takes the label sequence L and the output prediction layer produces the visual feature sequence V. In the synthesis stage, for any input text with natural speech or speech synthesized by TTS, the label sequence L is extracted from them

and then the visual AAM parameters V̂ are predicted using the well-trained deep

BLSTM network. Finally, the predicted AAM visual parameter sequence V̂ can be reconstructed into high-quality photo-realistic face images, rendering the full-face talking head with lip-synced animation.

Label Extraction

The input sequence L and the output feature sequence V are two time-varying parallel sequences. The input of the desired talking head system can be arbitrary text along with natural audio recordings or speech synthesized by TTS. For natural recordings, the phoneme/state time alignment can be obtained by performing forced alignment with a trained speech recognition model. For TTS-synthesized speech, the phoneme/state sequence and time offsets are a by-product of the synthesis process. Therefore, for each speech utterance, the phoneme/state sequence and the time offsets are converted into a label sequence, denoted as $L = (i_1, \ldots, i_t, \ldots, i_T)$, where $T$ is the number of frames in the sequence. The frame-level label $i_t$ uses a one-hot representation, i.e., one vector for each frame, shown as follows:

$$\Big[\underbrace{0, \ldots, 0, \ldots, 1}_{K};\; \underbrace{1, \ldots, 0, \ldots, 0}_{K};\; \underbrace{0, 0, 1, \ldots, 0}_{K};\; \underbrace{0, 1, 0}_{3}\Big],$$

where $K$ denotes the number of phonemes. In Fan et al. (2015), the triphone and the information of the three states identifying it are used. The first three K-element sub-vectors denote the identities of the left, current, and right phonemes of the triphone, respectively, and the last three elements represent the phoneme state. Note that the contextual label can easily be extended to contain richer information, such as position in syllable, position in word, stress, and part of speech. But if the training data are limited, one may consider only the phoneme- and state-level labels.
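
A minimal sketch of building such a frame-level label vector (the phoneme inventory and the state indexing below are illustrative assumptions):

```python
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def frame_label(left, current, right, state, phone_to_id, num_states=3):
    """Concatenate three K-dimensional phoneme one-hot vectors (left, current,
    right phoneme of the triphone) with a one-hot vector for the state."""
    K = len(phone_to_id)
    return np.concatenate([
        one_hot(phone_to_id[left], K),
        one_hot(phone_to_id[current], K),
        one_hot(phone_to_id[right], K),
        one_hot(state, num_states),
    ])

# Hypothetical phoneme inventory and one aligned frame of the triphone "s-ih+l"
phone_to_id = {p: i for i, p in enumerate(["sil", "s", "ih", "l", "ae", "t"])}
label = frame_label("s", "ih", "l", state=1, phone_to_id=phone_to_id)
print(label.shape)  # (3*K + 3,) = (21,)
```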

Face Model and Visual Feature Extraction

In the system of Fan et al. (2015), the visual stream is a sequence of lower-face images, which are strongly correlated with the underlying speech. As raw face images are hard to model directly due to their high dimensionality, the active appearance model (AAM) (Cootes et al. 2001) is used as the face model for visual feature extraction. The AAM is a joint statistical model that compactly represents both the shape and texture variations and the correlation between them.


Since the speaker moves his/her head naturally during recording, head pose normalization is performed on all face images before AAM modeling. With the help of an effective 3D model-based head pose tracking algorithm, the head pose in each image frame is normalized to a fully frontal view and further aligned. The facial feature points and the lower-face texture used in Fan et al. (2015) are shown in Fig. 10.

The shape of the jth lower face, $s_j$, can be represented by the concatenation of the x and y coordinates of $N$ facial feature points:

$$s_j = (x_{j1}, x_{j2}, \ldots, x_{jN}, y_{j1}, y_{j2}, \ldots, y_{jN}), \qquad (12)$$

where $j = 1, 2, \ldots, J$ and $J$ is the total number of face images. In this work, a set of 51 facial feature points is used, as shown in Fig. 10a. The mean shape is simply defined by

$$s_0 = \frac{1}{J} \sum_{j=1}^{J} s_j. \qquad (13)$$

Applying principal component analysis (PCA) to all $J$ shapes, $s_j$ can be approximated by

$$s_j = s_0 + \sum_{i=1}^{N_{shape}} a_{ji} \tilde{s}_i = s_0 + a_j P_s, \qquad (14)$$

Fig. 9 Diagram of an image-based talking head system using DBLSTM-RNN as the mapping (Fan et al. 2015)


where $P_s = (\tilde{s}_1, \tilde{s}_2, \ldots, \tilde{s}_{N_{shape}})^T$ denotes the eigenvectors corresponding to the $N_{shape}$ largest eigenvalues and $a_j = (a_{j1}, a_{j2}, \ldots, a_{jN_{shape}})$ is the jth shape parameter vector.

Accordingly, the texture of the jth face image, $t_j$, is defined by a vector concatenating the R/G/B values of every pixel that lies inside the mean shape:

$$t_j = (r_{j1}, \ldots, r_{jU}, g_{j1}, \ldots, g_{jU}, b_{j1}, \ldots, b_{jU}), \qquad (15)$$

where $j = 1, 2, \ldots, J$ and $U$ is the total number of pixels. As the dimensionality of the texture vector is too high to apply PCA directly, EM-PCA (Roweis 1998) is applied to all $J$ textures instead. As a result, the jth texture $t_j$ can be approximated by

$$t_j = t_0 + \sum_{i=1}^{N_{texture}} b_{ji} \tilde{t}_i = t_0 + b_j P_t, \qquad (16)$$

where $t_0$ is the mean texture, $P_t$ contains the eigenvectors corresponding to the $N_{texture}$ largest eigenvalues, and $b_j$ is the jth texture parameter vector.

The above shape and texture models can only control the shape and texture separately. In order to recover the correlation between the shape and the texture, $a_j$ and $b_j$ are combined in another round of PCA:

Fig. 10 Facial feature points and the texture of a lower face used in Fan et al. (2015). (a) 51 facial feature points. (b) The texture of a lower face


$$(a_j, b_j) = \sum_{i=1}^{N_{appearance}} v_{ji} \tilde{v}_i = v_j P_v. \qquad (17)$$

Assuming that $P_{vs}$ and $P_{vt}$ are formed by extracting the first $N_{shape}$ and the last $N_{texture}$ values from each component of $P_v$, simply combining the above equations gives

$$s_j = s_0 + v_j P_{vs} P_s = s_0 + v_j Q_s, \qquad (18)$$

$$t_j = t_0 + v_j P_{vt} P_t = t_0 + v_j Q_t. \qquad (19)$$

Now, the shape and texture of the jth lower-face image can be constructed from a single vector $v_j$, the jth appearance parameter vector, which is used as the AAM visual feature. Subsequently, a lower-face sequence with $T$ frames can be represented by the visual feature sequence $V = (v_1, \ldots, v_t, \ldots, v_T)$.

DBLSTM-RNN Model Training

In the training stage, multiple sequence pairs of L and V are available. As both sequences are represented as continuous numerical vectors, the network is treated as a regression model minimizing the SSE of predicting V from L. In the synthesis stage, given arbitrary text along with natural or synthesized speech, the input is first converted into a sequence of input features and then fed into the trained network. The output of the network is the predicted visual AAM feature sequence. After reconstructing the AAM feature vectors into RGB images, a photo-realistic image sequence of the lower face is generated. Finally, the lower face is stitched onto a background face, and the facial animation of the talking head is rendered.

Learning a deep BLSTM network can be regarded as optimizing a differentiable error function:

$$E(w) = \sum_{k=1}^{M_{train}} E_k(w), \qquad (20)$$

where $M_{train}$ represents the number of sequences in the training data and $w$ denotes the network internode weights. In this task, the training criterion is to minimize the SSE between the predicted visual features $\hat{V} = (\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_T)$ and the ground truth $V = (v_1, v_2, \ldots, v_T)$. For a particular input sequence $k$, the error function takes the form

$$E_k(w) = \sum_{t=1}^{T_k} E_{kt} = \frac{1}{2} \sum_{t=1}^{T_k} \left\| \hat{v}_{kt} - v_{kt} \right\|^2, \qquad (21)$$

where $T_k$ is the total number of frames in the kth sequence. In every iteration, the weight update is computed with the following equation:

$$\Delta w(r) = m \, \Delta w(r-1) - \alpha \frac{\partial E(w(r))}{\partial w(r)}, \qquad (22)$$


where $0 \le \alpha \le 1$ is the learning rate, $0 \le m \le 1$ is the momentum parameter, and $w(r)$ represents the vector of weights after the rth iteration of updates. The convergence criterion is that the validation error shows no obvious change after $R$ iterations.
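
A minimal numerical sketch of this momentum update rule (Eqs. (20)-(22)), with a stand-in gradient function in place of the real BPTT gradients described next:

```python
import numpy as np

def sgd_momentum(w, grad_fn, alpha=0.01, m=0.9, max_iters=1000, tol=1e-6):
    """Gradient descent with momentum, following Eq. (22):
    dw(r) = m*dw(r-1) - alpha*dE/dw;  w <- w + dw."""
    dw = np.zeros_like(w)
    prev_err = np.inf
    for _ in range(max_iters):
        err, grad = grad_fn(w)           # E(w) and its gradient (via BPTT in practice)
        dw = m * dw - alpha * grad
        w = w + dw
        if abs(prev_err - err) < tol:    # stand-in for the validation-error check
            break
        prev_err = err
    return w

# Toy quadratic error surface as a stand-in for the network's SSE objective
grad_fn = lambda w: (0.5 * np.sum(w ** 2), w)
w_final = sgd_momentum(np.array([3.0, -2.0]), grad_fn)
```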

The backpropagation through time (BPTT) algorithm is usually adopted to train the network. In the BLSTM hidden layer, BPTT is applied to both the forward and backward hidden nodes and back-propagates layer by layer. Taking the error function derivatives with respect to the output of the network as an example, for

$\hat{v}_{kt} = (\hat{v}_{kt1}, \ldots, \hat{v}_{ktj}, \ldots, \hat{v}_{ktN_{appearance}})$ in the kth $\hat{V}$, because the activation function used in the output layer is an identity function, we have

$$\hat{v}_{ktj} = \sum_{h} w_{oh} z_{kht}, \qquad (23)$$

where $o$ is the index of an output node, $z_{kht}$ is the activation of a node in the hidden layer connected to node $o$, and $w_{oh}$ is the weight associated with this connection. By applying the chain rule for partial derivatives, we obtain

$$\frac{\partial E_{kt}}{\partial w_{oh}} = \sum_{j=1}^{N_{appearance}} \frac{\partial E_{kt}}{\partial \hat{v}_{ktj}} \frac{\partial \hat{v}_{ktj}}{\partial w_{oh}}, \qquad (24)$$

and according to Eqs. (21) and (23), we can derive

$$\frac{\partial E_{kt}}{\partial w_{oh}} = \sum_{j=1}^{N_{appearance}} \left( \hat{v}_{ktj} - v_{ktj} \right) z_{kht}, \qquad (25)$$

$$\frac{\partial E_k}{\partial w_{oh}} = \sum_{t=1}^{T_k} \frac{\partial E_{kt}}{\partial w_{oh}}. \qquad (26)$$

Performances

The performance of the DBLSTM-based talking head is evaluated on an A/V database with 593 English utterances spoken by a female in a neutral style (Fan et al. 2015). The DBLSTM approach is compared with the previous HMM-based approach (Wang and Soong 2015). The results for the FBB128 DBLSTM3 and the HMM are shown in Table 1. The deep BLSTM approach clearly outperforms the HMM approach by a large margin in terms of the four objective metrics.

A subjective evaluation is also carried out in Fan et al. (2015). Ten label sequences are randomly selected from the test set as the input. The deep BLSTM-based

3 FBB128 denotes two BLSTM layers sitting on top of one feed-forward layer, with 128 nodes in each layer.


and the HMM-based talking head videos are rendered, respectively. For each test sequence, the two talking head videos are played side by side in random order with the original speech. A group of 20 subjects is asked to perform an A/B preference test according to naturalness. The percentage preference is shown in Fig. 11. The DBLSTM-based talking head is clearly preferred over the HMM-based one. Most subjects prefer the BLSTM-based talking head because its lip movement is smoother than that of the HMM-based one. Some video clips of the synthesized talking head can be found at Microsoft Research (2015).

Selected Applications

Avatars, with lively visual speech animation, are increasingly being used to communicate with users on a variety of electronic devices, such as computers, mobile phones, PDAs, kiosks, and game consoles. Avatars can be found across many domains, such as customer service and technical support, as well as in entertainment. Some of the many uses of avatars include the following:

• Reading news and other information to users
• Guiding users through Web sites by providing instructions and advice
• Presenting personalized messages on social Web sites
• Catching users' attention in advertisements and announcements
• Acting as digital assistants and automated agents for self-service, contact centers, and help desks
• Representing character roles in games
• Training users to perform complex tasks
• Providing new branding opportunities for organizations

Here, we focus on one application that uses a talking head avatar for audio/visual computer-assisted language learning (CALL).

Imagine a child learning from his favorite TV star, who appears to be personally teaching him English on his handheld device. Another youngster might show off her own avatar that tells mystery stories in a foreign language to her classmates.

Table 1 Performance comparison between deep BLSTM and HMM

Comparison   RMSE (shape)   RMSE (texture)   RMSE (appearance)   CORR
HMM          1.223          6.602            167.540             0.582
DBLSTM       1.122          6.286            156.502             0.647

Fig. 11 The percentage preference of the DBLSTM-based and HMM-based photo-realistic talking heads: DBLSTM-RNNs 45.7%, HMM 25.7%, neutral 28.6%


The "talking head" speech processing technologies are notable for their potential to enable such scenarios. These features have been successfully tested in a large-scale DDR project called Engkoo (Wang et al. 2012c) from Microsoft Research Asia. It is used by ten million English learners in China per month and was the winner of the Wall Street Journal 2010 Asian Innovation Readers' Choice Award (Scott et al. 2011).

The talking head generates karaoke-style short synthetic videos demonstrating oral English. The videos consist of a photo-realistic person speaking English sentences crawled from the Internet. The technology leverages a computer-generated voice with native-speaker-like quality and synchronized subtitles at the bottom of the video; it emulates popular karaoke-style videos specifically designed for a Chinese audience in order to increase user engagement. Compared to using prerecorded human voice and video in English education tools, these videos not only create a realistic look and feel but also greatly reduce the cost of content creation by generating arbitrary content synthetically and automatically. The potential for personalization is there as well. For example, a voice can be chosen based on preferred gender, age, speaking rate, or pitch range and dynamics, and the selected type of voice can be used to adapt a pre-trained TTS so that the synthesized voice is customized.

Motivation

Language teachers have been avid users of technology for quite a while now. The arrival of the multimedia computer in the early 1990s was a major breakthrough because it combined text, images, sound, and video in one device and permitted the integration of the four basic skills of listening, speaking, reading, and writing. Nowadays, as personal computers become more pervasive, smaller, and more portable, and with devices such as smartphones and tablet computers dominating the market, multimedia and multimodal language learning can be ubiquitous and more self-paced.

For foreign language learners, acquiring correct pronunciation is considered by many to be one of the most arduous tasks if one does not have access to a personal tutor. The reason is that the most common method for learning pronunciation, using audio tapes, severely lacks completeness and engagement. Audio data alone may not offer learners complete instruction on how to move their mouth/lips to sound out phonemes that may be nonexistent in their mother tongue, and audio as a tool of instruction is less motivating and personalized for learners. As supported by studies in cognitive informatics, information is processed by humans more efficiently when audio and visual channels are utilized in unison.

Computer-assisted audiovisual language learning increases user engagement compared to audio alone. Many existing bodies of work use visualized information and talking heads to help language learning. For example, Massaro (1998) used visual articulation to show the internal structure of the mouth, enabling learners to visualize the position and movement of the tongue. Badin et al. (2010) inferred learners' tongue position and shape to provide visual articulatory


corrective feedback in second language learning. Additionally, a number of studies reviewed in Eskenazi (2009) focused on overall pronunciation assessment and segmental/prosodic error detection to help learners improve their pronunciation with computer feedback.

In the project described in Wang et al. (2012c), the focus is on generating a photo-realistic, lip-synced talking head as a language assistant for multimodal, web-based, and low-cost language learning. The authors feel that a lifelike assistant offers a more authoritative metaphor for engaging language learners, particularly younger demographics. The long-term goal is to create a technology that can ubiquitously help users anywhere, anytime, from detailed pronunciation training to conversational practice. Such a service is especially important as a tool for augmenting human teachers in areas of the world where native, high-quality instructors are scarce.

Karaoke Function

Karaoke, also known as KTV, is a major pastime among Chinese people, with numerous KTV clubs found in major cities in China. A karaoke-like feature has been added to Engkoo, which enables English learners to practice their pronunciation online by lip-synchronously mimicking a photo-realistic talking head within a search and discovery ecosystem. This "KTV function" is exposed as videos generated from a vast set of sample sentences mined from the web. Users can easily launch the videos with a single click on a sentence of their choosing. Similar to the karaoke format, the videos display the sentence on the screen while a model speaker says it aloud, teaching the users how to enunciate the words, as shown in Fig. 12. Figure 13 shows the building blocks of the KTV system.

While the karaoke subtitles are useful, it should be emphasized that the pacing they provide is especially valuable when learning a language. Concretely, the rhythm and prosody embedded in the KTV function offer users the timing cues needed to utter a given sentence properly. Although pacing can be learned by listening to a native speaker, what this system uniquely offers is the ability to get such content at scale and on demand.

The KTV function offers a low-cost method for creating highly engaging, personalizable learning material utilizing state-of-the-art talking head rendering technology. One of the key benefits is the generation of lifelike video as opposed to cartoon-based animations. This is important from a pedagogical perspective because the content appears closer in nature to a human teacher, which reduces the perceptive gap that students, particularly younger pupils, need to bridge between the physical classroom and the virtual learning experience.

The technology can drastically reduce language learning video production costs in scenarios where the material requires a human native speaker. Rather than repeatedly taping an actor speaking, the technique can synthesize the audio and video content automatically. This has the potential to further bridge classroom and e-learning scenarios, where a teacher can generate his or her own talking head for students to take home and learn from.


Fig. 12 Screenshots of Karaoke-like talking heads on Engkoo. The service is accessible at http://dict.bing.com.cn


Technology Outlook

The current karaoke function, despite its popularity with web users, can be further enhanced toward the long-term goal: an indiscernibly lifelike, low-cost, web-based computer assistant that is helpful in many language learning scenarios, such as interactive pronunciation drills and conversational training.

To make the talking head more lifelike and natural, a new 3D photo-realistic, real-time talking head is proposed with a personalized appearance (Wang and Soong 2012). It extends the prior 2D photo-realistic talking head to 3D. First, approximately 20 minutes of audiovisual 2D video is recorded, with prompted sentences spoken by a human speaker. A 2D-to-3D reconstruction algorithm is adopted to automatically wrap the 3D geometric mesh with 2D video frames to construct a training database, as shown in Fig. 14. In training, super feature vectors consisting of 3D geometry, texture, and speech are formed to train a statistical, multi-stream HMM. The model is then used to synthesize both the trajectories of geometry animation and dynamic texture.
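
To make the idea of a per-frame super feature vector concrete, the short Python sketch below simply stacks geometry, texture, and acoustic features for each video frame into one observation vector. It is a minimal illustration only: the feature dimensions, the random stand-in data, and the appended delta features are assumptions made here, and the actual multi-stream HMM training in Wang and Soong (2012) is carried out with dedicated statistical parametric synthesis tooling rather than code of this kind.

import numpy as np

# Hypothetical per-frame streams (dimensions chosen only for illustration):
#   geometry: PCA coefficients of the 3D mesh vertices
#   texture : PCA coefficients of the mouth-region image
#   speech  : acoustic features (e.g., MFCCs) aligned to the video frame rate
N_FRAMES, GEO_DIM, TEX_DIM, AUD_DIM = 1200, 30, 40, 39
geometry = np.random.randn(N_FRAMES, GEO_DIM)  # stand-ins for real projected data
texture = np.random.randn(N_FRAMES, TEX_DIM)
speech = np.random.randn(N_FRAMES, AUD_DIM)

def add_deltas(stream):
    """Append first-order deltas so the model can capture feature dynamics."""
    return np.hstack([stream, np.gradient(stream, axis=0)])

# One super vector per frame: static plus delta features of all three streams.
super_vectors = np.hstack([add_deltas(geometry),
                           add_deltas(texture),
                           add_deltas(speech)])
print(super_vectors.shape)  # (1200, 2 * (30 + 40 + 39)) = (1200, 218)

In the system described above, sequences of such stacked observations are modeled by the statistical multi-stream HMM so that the trajectories of geometry animation and dynamic texture can be generated jointly at synthesis time.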

As far as the synthesized audio (speech) output is concerned, the research direction is to make it more personalized, adaptive, and flexible. For example, a new algorithm that teaches the talking head to speak authentic English sentences sounding like a Chinese ESL learner has been proposed and successfully tested. Also, synthesizing more natural and dynamic prosody patterns for ESL learners to mimic is highly desirable as an enhanced feature of the talking head.

The 3D talking head animation can be controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be separately controlled by manipulating the corresponding parameters. A talking head for a movie star or celebrity can be created by using their video recordings. With the new 3D, photo-realistic talking head, the era of having lifelike, web-based, and interactive learning assistants is on the horizon.
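
The separation between speech articulation and the independently controllable head motion parameters can be pictured with a small sketch. The Python fragment below, with hypothetical function and array names, deforms a neutral 3D mesh by per-frame articulation offsets from the synthesized geometry trajectory and then applies a separately specified rigid head pose; it only illustrates the idea of independent control streams and is not the actual rendering pipeline.

import numpy as np

def euler_rotation(rx, ry, rz):
    """Rotation matrix from Euler angles (radians) about the x, y, and z axes."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_frame(neutral_mesh, articulation_offset, head_pose):
    """Compose one frame: apply speech articulation first, then rigid head motion.

    neutral_mesh        : (V, 3) resting vertex positions of the 3D face model
    articulation_offset : (V, 3) per-vertex displacement from the synthesized
                          geometry trajectory (lip/jaw articulation)
    head_pose           : (rx, ry, rz) head rotation controlled independently
    """
    articulated = neutral_mesh + articulation_offset
    return articulated @ euler_rotation(*head_pose).T

# Example: nod the head slightly while keeping the same articulation frame.
# frame_vertices = pose_frame(neutral, offsets[t], head_pose=(0.1, 0.0, 0.0))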

The phonetic search can be further improved by collecting more data, both in text and speech, to generate phonetic candidates that cover the generic and localized spelling/pronunciation errors committed by language learners at different levels.

Fig. 13 Using talking head synthesis technology for KTV function on Engkoo


When such a database is available, a more powerful LTS model can be trained discriminatively so that the errors observed in the database can be predicted and recovered from gracefully.
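
For intuition about what phonetic candidate generation does, the toy sketch below maps words to a crude sound key and proposes dictionary entries that share the key with a possibly misspelled query. The key rules, the tiny lexicon, and the function names are all invented for illustration; a real system of the kind described here would instead build on a trained LTS model (e.g., Wang and King 2011) and a data-driven phonetic candidate generator (Peng et al. 2011).

from collections import defaultdict

def sound_key(word):
    """Map a word to a crude phonetic key (illustrative rules only)."""
    word = word.lower()
    # Collapse a few grapheme confusions that learners commonly make.
    for src, dst in [("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")]:
        word = word.replace(src, dst)
    # Keep the first letter, then drop vowels and repeated consonants.
    key = [word[0]]
    for ch in word[1:]:
        if ch not in "aeiouy" and ch != key[-1]:
            key.append(ch)
    return "".join(key)

# Tiny illustrative lexicon; a real engine would index a full dictionary.
lexicon = ["physics", "fishing", "pronunciation", "pronounce", "vision"]
index = defaultdict(list)
for entry in lexicon:
    index[sound_key(entry)].append(entry)

def phonetic_candidates(query):
    """Return dictionary words whose sound key matches the query's key."""
    return index.get(sound_key(query), [])

print(phonetic_candidates("fisiks"))          # ['physics']
print(phonetic_candidates("pronounciation"))  # ['pronunciation']

The point of the sketch is only the indexing pattern: queries and dictionary entries are normalized into the same phonetic space, so spelling errors that preserve pronunciation still retrieve the intended word.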

In future work, with regard to the interactivity of the computer assistant, it could hear (via speech recognition) and speak (via TTS synthesis), read and compose, correct and suggest, or even guess or read the learner's intention.

Summary

This chapter surveys the basic principles, state-of-the-art technologies, and featured applications in the visual speech animation area. Data collection, the face/mouth model, feature extraction, and learning a mapping model are the central building blocks of a VSA system. The technologies used in the different blocks depend on the application needs and affect the desired appearance of the system. During the past decades, much effort in this area has been devoted to the audio/text-to-visual mapping problem, and approaches can be roughly categorized into rule based, concatenation, parametric, and hybrid. We showcase a state-of-the-art approach, based on deep bidirectional long short-term memory (DBLSTM) recurrent neural networks (RNN), for audio-to-visual mapping in a video-realistic talking head. We also use the Engkoo project from Microsoft as a practical application of visual speech animation in language learning. We believe that with the fast development of computer graphics, speech technology, machine learning, and human behavior studies, future visual speech animation systems will become more flexible, expressive, and conversational. Subsequently, applications can be found across many domains.

Fig. 14 A 3D photo-realistic talking head by combining 2D image samples with a 3D face model



References

Anderson R, Stenger B, Wan V, Cipolla R (2013) Expressive visual text-to-speech using active appearance models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, p 3382
Badin P, Ben Youssef A, Bailly G et al (2010) Visual articulatory feedback for phonetic correction in second language learning. In: Proceedings of Second Language Learning Studies: Acquisition, Learning, Education and Technology
Ben Youssef A, Shimodaira H, Braude DA (2013) Articulatory features for speech-driven head motion synthesis. In: Proceedings of the International Speech Communication Association, IEEE
Breeuwer M, Plomp R (1985) Speechreading supplemented with formant frequency information from voiced speech. J Acoust Soc Am 77(1):314–317
Bregler C, Covell M, Slaney M (1997) Video rewrite: driving visual speech with audio. In: Proceedings of the 24th annual conference on Computer graphics and interactive techniques, ACM Press, p 353
Busso C, Deng Z, Grimm M, Neumann U et al (2007) Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans Audio, Speech, Language Process 15(3):1075–1086
Cao Y, Tien WC, Faloutsos P et al (2005) Expressive speech-driven facial animation. In: ACM Transactions on Graphics, ACM, p 1283
Cohen MM, Massaro DW (1993) Modeling coarticulation in synthetic visual speech. In: Models and techniques in computer animation. Springer, Japan, p 139
Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685
Cosatto E, Graf HP (1998) Sample-based synthesis of photo-realistic talking heads. In: Proceedings of Computer Animation, IEEE, p 103
Cosatto E, Graf HP (2000) Photo-realistic talking-heads from image samples. IEEE Trans Multimed 2(3):152–163
Cosatto E, Ostermann J, Graf HP et al (2003) Lifelike talking faces for interactive services. Proc IEEE 91(9):1406–1429
Deng Z, Neumann U (2008) Data-driven 3D facial animation. Springer
Deng L, Yu D (2014) Deep learning methods and applications. Foundations and Trends in Signal Processing
Deng Z, Lewis JP, Neumann U (2005) Automated eye motion using texture synthesis. IEEE Comput Graph Appl 25(2):24–30
Ding C, Xie L, Zhu P (2015) Head motion synthesis from speech using deep neural networks. Multimed Tools Appl 74(22):9871–9888
Du J, Wang Q, Gao T et al (2014) Robust speech recognition with speech enhanced deep neural networks. In: Proceedings of the International Speech Communication Association, IEEE, p 616
Dziemianko M, Hofer G, Shimodaira H (2009) HMM-based automatic eye-blink synthesis from speech. In: Proceedings of the International Speech Communication Association, IEEE, p 1799
Englebienne G, Cootes T, Rattray M (2007) A probabilistic model for generating realistic lip movements from speech. In: Advances in neural information processing systems, p 401
Eskenazi M (2009) An overview of spoken language technology for education. Speech Commun 51(10):832–844
Ezzat T, Poggio T (2000) Visual speech synthesis by morphing visemes. Int J Comput Vision 38(1):45–57
Ezzat T, Geiger G, Poggio T (2002) Trainable videorealistic speech animation. In: ACM SIGGRAPH 2006 Courses, ACM, p 388
Fagel S, Clemens C (2004) An articulation model for audiovisual speech synthesis: determination, adjustment, evaluation. Speech Commun 44(1):141–154
Fagel S, Bailly G, Theobald BJ (2010) Animating virtual speakers or singers from audio: lip-synching facial animation. EURASIP J Audio, Speech, Music Process 2009(1):1–2
Fan B, Wang L, Soong FK et al (2015) Photo-real talking head with deep bidirectional LSTM. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 4884
Fan B, Xie L, Yang S, Wang L et al (2016) A deep bidirectional LSTM approach for video-realistic talking head. Multimed Tools Appl 75:5287–5309
Fu S, Gutierrez-Osuna R, Esposito A et al (2005) Audio/visual mapping with cross-modal hidden Markov models. IEEE Trans Multimed 7(2):243–252
Hinton G, Deng L, Yu D, Dahl GE et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
Hochreiter S (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzz 6(2):107–116
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang D, Wu X, Wei J et al (2013) Visualization of Mandarin articulation by using a physiological articulatory model. In: Signal and Information Processing Association Annual Summit and Conference, IEEE, p 1
Hunt AJ, Black AW (1996) Unit selection in a concatenative speech synthesis system using a large speech database. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 373
Hura S, Leathem C, Shaked N (2010) Avatars meet the challenge. Speech Technol, 303217
Jia J, Zhang S, Meng F et al (2011) Emotional audio-visual speech synthesis based on PAD. IEEE Trans Audio, Speech, Language Process 19(3):570–582
Jia J, Wu Z, Zhang S et al (2014) Head and facial gestures synthesis using PAD model for an expressive talking avatar. Multimed Tools Appl 73(1):439–461
Kukich K (1992) Techniques for automatically correcting words in text. ACM Comput Surv 24(4):377–439
Le BH, Ma X, Deng Z (2012) Live speech driven head-and-eye motion generators. IEEE Trans Vis Comput Graph 18(11):1902–1914
Liu P, Soong FK (2005) Kullback-Leibler divergence between two hidden Markov models. Microsoft Research Asia, Technical Report
Massaro DW (1998) Perceiving talking faces: from speech perception to a behavioral principle. MIT Press, Cambridge
Massaro DW, Simpson JA (2014) Speech perception by ear and eye: a paradigm for psychological inquiry. Psychology Press
Masuko T, Kobayashi T, Tamura M et al (1998) Text-to-visual speech synthesis based on parameter generation from HMM. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 3745
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264:746–748
Microsoft Research (2015) http://research.microsoft.com/en-us/projects/voice_driven_talking_head/
Mikolov T, Chen K, Corrado G et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Musti U, Zhou Z, Pietikäinen M (2014) Facial 3D shape estimation from images for visual speech animation. In: Proceedings of Pattern Recognition, IEEE, p 40
Ostermann J, Weissenfeld A (2004) Talking faces: technologies and applications. In: Proceedings of the 17th International Conference on Pattern Recognition, IEEE, p 826
Pandzic IS, Forchheimer R (2002) MPEG-4 facial animation. The standard, implementation and applications. John Wiley and Sons, Chichester
Parke FI (1972) Computer generated animation of faces. In: Proceedings of the ACM annual conference, ACM, p 451
Peng B, Qian Y, Soong FK et al (2011) A new phonetic candidate generator for improving search query efficiency. In: Twelfth Annual Conference of the International Speech Communication Association
Pighin F, Hecker J, Lischinski D et al (2006) Synthesizing realistic facial expressions from photographs. In: ACM SIGGRAPH 2006 Courses, ACM, p 19
Qian Y, Yan ZJ, Wu YJ et al (2010) An HMM trajectory tiling (HTT) approach to high quality TTS. In: Proceedings of the International Speech Communication Association, IEEE, p 422
Raidt S, Bailly G, Elisei F (2007) Analyzing gaze during face-to-face interaction. In: International Workshop on Intelligent Virtual Agents. Springer, Berlin/Heidelberg, p 403
Microsoft Research (2015) http://research.microsoft.com/en-us/projects/blstmtalkinghead/
Richmond K, Hoole P, King S (2011) Announcing the electromagnetic articulography (day 1) subset of the mngu0 articulatory corpus. In: Proceedings of the International Speech Communication Association, IEEE, p 1505
Roweis S (1998) EM algorithms for PCA and SPCA. Adv Neural Inf Process Syst:626–632
Sako S, Tokuda K, Masuko T et al (2000) HMM-based text-to-audio-visual speech synthesis. In: Proceedings of the International Speech Communication Association, IEEE, p 25
Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681
Scott MR, Liu X, Zhou M (2011) Towards a specialized search engine for language learners [Point of View]. Proc IEEE 99(9):1462–1465
Seidlhofer B (2009) Common ground and different realities: World Englishes and English as a lingua franca. World Englishes 28(2):236–245
Sumby WH, Pollack I (1954) Erratum: visual contribution to speech intelligibility in noise [J. Acoust. Soc. Am. 26, 212 (1954)]. J Acoust Soc Am 26(4):583–583
Taylor P (2009) Text-to-speech synthesis. Cambridge University Press, Cambridge
Taylor SL, Mahler M, Theobald BJ et al (2012) Dynamic units of visual speech. In: Proceedings of the 11th ACM SIGGRAPH/Eurographics conference on Computer Animation, ACM, p 275
Theobald BJ, Fagel S, Bailly G et al (2008) LIPS2008: visual speech synthesis challenge. In: Proceedings of the International Speech Communication Association, IEEE, p 2310
Thies J, Zollhöfer M, Stamminger M et al (2016) Face2face: real-time face capture and reenactment of RGB videos. In: Proceedings of Computer Vision and Pattern Recognition, IEEE, p 1
Tokuda K, Yoshimura T, Masuko T et al (2000) Speech parameter generation algorithms for HMM-based speech synthesis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 1615
Tokuda K, Oura K, Hashimoto K et al (2007) The HMM-based speech synthesis system. Online: http://hts.ics.nitech.ac.jp
Wang D, King S (2011) Letter-to-sound pronunciation prediction using conditional random fields. IEEE Signal Process Lett 18(2):122–125
Wang L, Soong FK (2012) High quality lips animation with speech and captured facial action unit as A/V input. In: Signal and Information Processing Association Annual Summit and Conference, IEEE, p 1
Wang L, Soong FK (2015) HMM trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74(22):9849–9869
Wang L, Han W, Qian X, Soong FK (2010a) Rendering a personalized photo-real talking head from short video footage. In: 7th International Symposium on Chinese Spoken Language Processing, IEEE, p 129
Wang L, Qian X, Han W, Soong FK (2010b) Synthesizing photo-real talking head via trajectory-guided sample selection. In: Proceedings of the International Speech Communication Association, IEEE, p 446
Wang L, Wu YJ, Zhuang X et al (2011) Synthesizing visual speech trajectory with minimum generation error. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 4580
Wang L, Chen H, Li S et al (2012a) Phoneme-level articulatory animation in pronunciation training. Speech Commun 54(7):845–856
Wang L, Han W, Soong FK (2012b) High quality lip-sync animation for 3D photo-realistic talking head. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, p 4529
Wang LJ, Qian Y, Scott M, Chen G, Soong FK (2012c) Computer-assisted audiovisual language learning. IEEE Computer, p 38
Weise T, Bouaziz S, Li H et al (2011) Realtime performance-based facial animation. In: ACM Transactions on Graphics, ACM, p 77
Wik P, Hjalmarsson A (2009) Embodied conversational agents in computer assisted language learning. Speech Commun 51(10):1024–1037
Williams RJ, Zipser D (1989) A learning algorithm for continually running fully recurrent neural networks. Neural Comput 1(2):270–280
Xie L, Liu ZQ (2007a) A coupled HMM approach to video-realistic speech animation. Pattern Recogn 40(8):2325–2340
Xie L, Liu ZQ (2007b) Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Trans Multimed 9(3):500–510
Xie L, Jia J, Meng H et al (2015) Expressive talking avatar synthesis and animation. Multimed Tools Appl 74(22):9845–9848
Yan ZJ, Qian Y, Soong FK (2010) Rich-context unit selection (RUS) approach to high quality TTS. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 4798
Zen H, Senior A, Schuster M (2013) Statistical parametric speech synthesis using deep neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, p 7962
Zhang LJ, Rubdy R, Alsagoff L (2009) Englishes and literatures-in-English in a globalised world. In: Proceedings of the 13th International Conference on English in Southeast Asia, p 42
Zhu P, Xie L, Chen Y (2015) Articulatory movement prediction using deep bidirectional long short-term memory based recurrent neural networks and word/phone embeddings. In: Sixteenth Annual Conference of the International Speech Communication Association
