From User-friendly to User’s Friend Dr. Eric Petajan Founder and Chief Scientist face2face...

From User-friendly to User’s Friend
Dr. Eric Petajan
Founder and Chief Scientist
face2face animation, inc.
www.f2fanimation.com

[email protected]@f2f-inc.com

Why vision is required for the ideal HCI design

Problem Statement
The electronic extension of human capabilities is primarily limited by Human-Computer Interaction (HCI) systems that fail to meet our needs for fast, reliable, and secure input of information using the most comfortable human communication modes

Your computer should emulate your best friend
• It should know who you are and if you are present
• It should see and hear you in adverse conditions
• It should respond to you quickly
• It should tell you the truth
• It should keep your secrets
• It should be pleasant or entertaining
• It should follow you around

A humanoid agent is a necessary component for the ultimate HCI


Humanoids can provide:
• Clear focus for audio and visual attention
  – Easier to capture user behavior
  – Less taxing for user
• Perception of credibility
• Engagement and entertainment
• Increased comprehension
• Guidance with traditional information display

The quality of the virtual human is critically dependent on the amount of real human behavior that informs the humanoid model

Autonomous humanoid agents can’t pass the Turing test today

The non-invasive capture and machine understanding of human behavior are grand challenges that have yet to be fully accomplished

We are still tethered to the keyboard and mouse

Significant Human Behaviors Available without Contact
• Audio/Visual Speech
• Gestures
• Facial expressions
• Gaze direction
• Posture

Ideal HCI Process Graph
[Figure: process graph with the nodes "Capture Complete Human Behavior", "Build Humanoid Model", "Present Humanoid To Human", "Capture Human Behavior", and an “AI” Engine (Knowledge, Motive, Power)]

What has been achieved to date?

The Good News
• Processing hardware is fast and cheap
• HD cameras now 10 times cheaper
• Displays are good and cheap enough
• Mobile data bandwidth is reliable enough for audio plus animation streams
• Individual recognition technologies are approaching maturity (if not utility)

The Bad News
• Computers can’t reliably “hear” humans with a single fixed microphone
• Computers can’t reliably “see” humans with a single cheap video camera
• HCI constraints exhaust and encumber users
• Large segments of the population are unwilling or unable to engage in HCI

Steps in the Right Direction
• Use one or more HD video cameras
• Use steered microphone array with face tracking
• Track and control user’s attention with humanoid
• Continuously identify the user
• Train the user with entertainment
• Use dedicated hardware to minimize the impact of the HCI system on general computing and communication tasks

Multi-modal Speech Recognition
• Audio-visual speech and speaker recognition provides robustness in noise
• Use of visual speech removes need for close-talking microphone and provides robust steering of microphone array
• MPEG-4 Face Animation Parameters (FAPs) accurately encode visual speech

People want information and communication wherever they happen to be
• Mobile devices need to be small (thin client)
• Device and service costs must be low
• Must be fast and reliable
• Bandwidth must be used efficiently for low latency and cost

People want to be entertained
• Entertaining information is retained better
• Personality attracts attention and is main component of entertainment
• Personality is manifested mostly in face and voice
• Face and voice must be synced and delivered with quality (high frame rate)

People like animated characters
• Entertaining/relationship forming
• Can be efficiently delivered anywhere
• Graphical faces scale well to small screens
• Character design limited only by imagination
• Any person can drive any character (with FAPs)
• Emotional response to animated faces is hardwired

Mobile devices today
• Can deliver animated characters
• Are cheap
• Can deliver low bit-rate content reliably
• Are communicators and entertainers
• Are very popular

User Input to Mobile Devices
• Keyboards are impractical for mobile devices
• Best user interface is speech and face
• Little room for text/menus on small screens
• Acoustic speech recognition is unreliable in mobile environments
• Visual speech and face recognition are needed for robust mobile user interface

Low bit-rate is the key to mobile happiness
• Reliable delivery of wireless video will not happen for a very long time
• Only 20-30 kilobits/sec can be sustained everywhere
• MPEG-4 animation streams fit in available bandwidth with audio
• 2 kilobits/sec for face animation data
• 6-10 kilobits/sec for body animation data
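As a rough check of the figures above, the sketch below subtracts the quoted animation bit-rates from the sustainable channel rate to see how much is left for audio. The function name `audio_headroom` is illustrative only, and the calculation ignores packet and protocol overhead, which the slides do not quantify.

```python
# Figures quoted on the slide, in kilobits per second.
SUSTAINED_KBPS = (20, 30)   # bandwidth that can be sustained everywhere
FACE_KBPS = 2               # face animation data
BODY_KBPS = (6, 10)         # body animation data


def audio_headroom(channel_kbps, face_kbps, body_kbps):
    """Return the bit-rate left for compressed audio after animation data."""
    return channel_kbps - face_kbps - body_kbps


# Worst case: narrowest channel, most expensive body stream.
worst = audio_headroom(SUSTAINED_KBPS[0], FACE_KBPS, BODY_KBPS[1])
# Best case: widest channel, cheapest body stream.
best = audio_headroom(SUSTAINED_KBPS[1], FACE_KBPS, BODY_KBPS[0])

print(f"audio headroom: {worst} to {best} kbit/s")
```

Even in the worst case some room remains for a low bit-rate speech codec, which is the sense in which the animation streams "fit" alongside audio.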

Mobile Character Player Demo
• Facial expressions, lip movements and head motion extracted from ordinary video automatically as FAPs
• FAPs streamed to player with compressed audio at 10 kbps total
• 300 triangle 3D mesh face model renders in real time on phone
• FAPs and audio decoded in parallel with graphics rendering in software

Standards
• Facilitate collaboration
• Minimize reinvention of wheels
• Decrease costs with economies of scale
• Allow database sharing
• Provide free or cheap source code
• Enable low latency communication

The MPEG-4 Standard
• Provides comprehensive framework for 2D and 3D multimedia communication
• Provides Face and Body Animation (FBA) representation and coding
• Low bit-rate coding eliminates network bottlenecks
• Optimized implementations increase speed and reduce costs to consumers

MPEG-4 Face Animation
• Face model is independent of Face Animation Parameters (FAPs)
• FAPs contain high quality animation data for driving all types of face models from broadcast to wireless
• FAPs displace feature points from neutral position

Body Animation
• Harmonized with VRML Hanim spec
• Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
• Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
• BAPs can be highly compressed for streaming

Body Animation Parameters (BAPs)
• 186 humanoid skeleton Euler angles
• 110 free parameters for use with downloaded body surface mesh
• Coded using same codecs as FAPs
• Typical bitrate for coded BAPs is 5-10 kbps

Neutral Face Definition
• Head axes parallel to the world axes
• Gaze is in direction of Z axis
• Eyelids tangent to the iris
• Pupil diameter is one third of iris diameter
• Mouth is closed and the upper and lower teeth are touching
• Tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth

Face Feature Points
[Figure: MPEG-4 face feature points, numbered by group (2.x through 11.x) and drawn against x, y, z axes, with detail panels for the right eye, left eye, nose, teeth, tongue, and mouth. Legend: feature points affected by FAPs; other feature points.]

Face Model Independence
• FAPs are always normalized for model independence
• FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
• Private face models can be accurately animated with FAPs
• Face models can be simple or complex depending on terminal resources

Face Animation Parameter Normalization
• Face Animation Parameters (FAPs) are normalized to facial dimensions
• Each FAP is measured as a fraction of neutral face mouth width, mouth-nose distance, eye separation, or iris diameter
• 3 Head and 2 eyeball rotation FAPs are Euler angles

Neutral Face Dimensions for FAP Normalization
[Figure: neutral face annotated with the dimensions MW0, MNS0, ENS0, ES0, and IRISD0]
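The normalization can be illustrated with a minimal sketch. It assumes that each FAP unit is 1/1024 of the corresponding neutral-face distance, which is how the MPEG-4 FAP units are commonly described but is not stated on the slides. The class, function names, and face dimensions are hypothetical, and the reading of ENS0 as eye-nose separation is likewise an assumption.

```python
from dataclasses import dataclass

# Assumption: each FAP unit (FAPU) is 1/1024 of a neutral-face distance.
FAPU_DIVISOR = 1024


@dataclass
class NeutralFace:
    """Neutral-face distances in model units (labels from the slide)."""
    mw0: float     # mouth width
    mns0: float    # mouth-nose separation
    ens0: float    # eye-nose separation (assumed expansion of ENS0)
    es0: float     # eye separation
    irisd0: float  # iris diameter

    def fapu(self, name: str) -> float:
        """Return the size of one FAP unit for the named dimension."""
        return getattr(self, name) / FAPU_DIVISOR


def encode_fap(displacement: float, face: NeutralFace, unit: str) -> int:
    """Express a feature-point displacement as an integer count of FAPUs."""
    return round(displacement / face.fapu(unit))


def decode_fap(fap: int, face: NeutralFace, unit: str) -> float:
    """Convert a FAP value back to a displacement on a given face model."""
    return fap * face.fapu(unit)


# Hypothetical faces: a captured performer and a character twice as large.
performer = NeutralFace(mw0=50.0, mns0=20.0, ens0=35.0, es0=64.0, irisd0=12.0)
character = NeutralFace(mw0=100.0, mns0=40.0, ens0=70.0, es0=128.0, irisd0=24.0)

fap = encode_fap(5.0, performer, "mns0")    # 5 units of motion on the performer
moved = decode_fap(fap, character, "mns0")  # the same FAP drives the character
print(fap, moved)
```

Because the FAP value is relative to the face it was measured on, the larger character moves proportionally further for the same value, which is what lets any person drive any character.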

Lip FAPs
Mouth closed if sum of upper and lower lip FAPs = 0

FAP Compression
• FAPs are adaptively quantized to desired quality level
• Quantized FAPs are differentially coded
• Adaptive arithmetic coding further reduces bitrate
• Typical compressed FAP bitrate is less than 2 kilobits/second

FAP Predictive Coding
[Figure: predictive coding block diagram with the blocks FAP(t), adder (+/-), quantizer Q, inverse quantizer Q^-1, Frame Delay, Arithmetic Coder, and the output Bitstream]
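The prediction loop named in the diagram can be sketched as follows. This is a simplified illustration and not the standard's codec: it uses a fixed uniform quantizer step where the slides describe adaptive quantization, it leaves out the arithmetic coder entirely, and the function names and sample values are hypothetical.

```python
def encode(faps, step):
    """Return quantized prediction errors for one FAP track."""
    symbols = []
    prediction = 0                   # the decoder starts from the same value
    for value in faps:
        error = value - prediction   # subtract the delayed reconstruction
        q = round(error / step)      # Q: uniform quantizer (fixed step here)
        symbols.append(q)
        prediction += q * step       # Q^-1, then held for the next frame
    return symbols


def decode(symbols, step):
    """Rebuild the FAP track by accumulating dequantized errors."""
    out = []
    prediction = 0
    for q in symbols:
        prediction += q * step
        out.append(prediction)
    return out


# Hypothetical FAP values for six consecutive frames.
faps = [0, 40, 90, 120, 118, 61]
symbols = encode(faps, step=8)
decoded = decode(symbols, step=8)
print(symbols)
print(decoded)
```

Predicting from the reconstructed value rather than the original input keeps the encoder and decoder in step, so quantization error does not accumulate from frame to frame. The small symbols produced for slowly changing parameters are what an entropy coder would then compress.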

General Bandwidth Issues
• Broadband deployment is happening slowly
• 3G will not be ubiquitous for many years
• DSL availability is limited and cable is shared
• Talking heads need high frame-rate
• Consumer graphics hardware is cheap and powerful
• MPEG-4 FBA tools are matched to available bandwidth and terminals

Markerless Facial Motion Capture for Animation Production
• Track/analyze face features in each video frame
• Captured face feature motion easily converted to FAPs
• Face model is “puppeteered” by FAPs
• MPEG-4 FAPs only specify motion of feature points (not surrounding surface)

Bones rig for mouth area

Automatic Face Animation Demonstration
• FAPs extracted from camcorder video
• Inner lip, eye region and head rotation
• FAPs compressed to less than 2 kbits/sec
• 30 frames/sec animation generated automatically
• Face models developed with face2face plugin Maya

Conclusions
• Humanoid agents are required for best HCI
• Vision-based facial capture is required for humanoid design and human behavior capture
• MPEG-4 Face and Body Animation coding enables high quality mobile communication
• Ultimate HCI systems must continuously see, hear and identify the user for best reliability and security
