From User-friendly to User’s Friend Dr. Eric Petajan Founder and Chief Scientist face2face...
Date posted: 19-Dec-2015
From User-friendly to User’s Friend
Dr. Eric Petajan
Founder and Chief Scientist
face2face animation, inc.
www.f2fanimation.com
[email protected]@f2f-inc.com
Why vision is required for the ideal HCI design
Problem Statement
The electronic extension of human capabilities is primarily limited by Human-Computer Interaction (HCI) systems that fail to meet our needs for fast, reliable, and secure input of information using the most comfortable human communication modes.
Your computer should emulate your best friend
• It should know who you are and if you are present
• It should see and hear you in adverse conditions
• It should respond to you quickly
• It should tell you the truth
• It should keep your secrets
• It should be pleasant or entertaining
• It should follow you around
A humanoid agent is a necessary component for the ultimate HCI
Humanoids can provide:
• Clear focus for audio and visual attention
  – Easier to capture user behavior
  – Less taxing for user
• Perception of credibility
• Engagement and entertainment
• Increased comprehension
• Guidance with traditional information display
The quality of the virtual human is critically dependent on the amount of real human behavior that informs the humanoid model
Autonomous humanoid agents can’t pass the Turing test today
The non-invasive capture and machine understanding of human behavior are grand challenges that have yet to be fully accomplished
We are still tethered to the keyboard and mouse
Significant Human Behaviors Available without Contact
• Audio/Visual Speech
• Gestures
• Facial expressions
• Gaze direction
• Posture
Ideal HCI Process Graph
[Diagram with boxes: Capture Complete Human Behavior; Build Humanoid Model; Present Humanoid To Human; “AI” Engine (Knowledge, Motive, Power); Capture Human Behavior]
What has been achieved to date?
The Good News
• Processing hardware is fast and cheap
• HD cameras now 10 times cheaper
• Displays are good and cheap enough
• Mobile data bandwidth is reliable enough for audio plus animation streams
• Individual recognition technologies are approaching maturity (if not utility)
The Bad News
• Computers can’t reliably “hear” humans with a single fixed microphone
• Computers can’t reliably “see” humans with a single cheap video camera
• HCI constraints exhaust and encumber users
• Large segments of the population are unwilling or unable to engage in HCI
Steps in the Right Direction
• Use one or more HD video cameras
• Use steered microphone array with face tracking
• Track and control user’s attention with humanoid
• Continuously identify the user
• Train the user with entertainment
• Use dedicated hardware to minimize the impact of the HCI system on general computing and communication tasks
Multi-modal Speech Recognition
• Audio-visual speech and speaker recognition provides robustness in noise
• Use of visual speech removes need for close-talking microphone and provides robust steering of microphone array
• MPEG-4 Face Animation Parameters (FAPs) accurately encode visual speech
People want information and communication wherever they happen to be
• Mobile devices need to be small (thin client)
• Device and service costs must be low
• Must be fast and reliable
• Bandwidth must be used efficiently for low latency and cost
People want to be entertained
• Entertaining information is retained better
• Personality attracts attention and is main component of entertainment
• Personality is manifested mostly in face and voice
• Face and voice must be synced and delivered with quality (high frame rate)
People like animated characters
• Entertaining/relationship forming
• Can be efficiently delivered anywhere
• Graphical faces scale well to small screens
• Character design limited only by imagination
• Any person can drive any character (with FAPs)
• Emotional response to animated faces is hardwired
Mobile devices today
• Can deliver animated characters
• Are cheap
• Can deliver low bit-rate content reliably
• Are communicators and entertainers
• Are very popular
User Input to Mobile Devices
• Keyboards are impractical for mobile devices
• Best user interface is speech and face
• Little room for text/menus on small screens
• Acoustic speech recognition is unreliable in mobile environments
• Visual speech and face recognition are needed for robust mobile user interface
Low bit-rate is the key to mobile happiness
• Reliable delivery of wireless video will not happen for a very long time
• Only 20-30 kilobits/sec can be sustained everywhere
• MPEG-4 animation streams fit in available bandwidth with audio
• 2 kilobits/sec for face animation data
• 6-10 kilobits/sec for body animation data
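The bandwidth claim above can be checked with simple arithmetic. The sketch below only subtracts the animation bitrates quoted on the slide from the sustainable link rate to show what remains for audio; the function and variable names are hypothetical, and it assumes no other protocol overhead.

```python
# Figures quoted on the slide, in kilobits per second.
LINK_KBPS = (20, 30)   # sustainable everywhere (low, high)
FACE_KBPS = 2          # face animation data
BODY_KBPS = (6, 10)    # body animation data (low, high)


def audio_headroom_kbps(link_kbps, face_kbps, body_kbps):
    """Return the bandwidth left for audio once both animation streams are sent."""
    return link_kbps - face_kbps - body_kbps


# Worst case: slowest link, most expensive body stream.
worst_case = audio_headroom_kbps(LINK_KBPS[0], FACE_KBPS, BODY_KBPS[1])
# Best case: fastest link, cheapest body stream.
best_case = audio_headroom_kbps(LINK_KBPS[1], FACE_KBPS, BODY_KBPS[0])

print(f"Audio headroom: {worst_case} to {best_case} kbps")
```

Under these figures, between 8 and 22 kbps remains for compressed audio, which is why animation plus audio fits where video does not.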
Mobile Character Player Demo
• Facial expressions, lip movements and head motion extracted from ordinary video automatically as FAPs
• FAPs streamed to player with compressed audio at 10 kbps total
• 300 triangle 3D mesh face model renders in real time on phone
• FAPs and audio decoded in parallel with graphics rendering in software
Standards
• Facilitate collaboration
• Minimize reinvention of wheels
• Decrease costs with economies of scale
• Allow database sharing
• Provide free or cheap source code
• Enable low latency communication
The MPEG-4 Standard
• Provides comprehensive framework for 2D and 3D multimedia communication
• Provides Face and Body Animation (FBA) representation and coding
• Low bit-rate coding eliminates network bottlenecks
• Optimized implementations increase speed and reduce costs to consumers
MPEG-4 Face Animation
• Face model is independent of Face Animation Parameters (FAPs)
• FAPs contain high quality animation data for driving all types of face models from broadcast to wireless
• FAPs displace feature points from neutral position
Body Animation
• Harmonized with VRML H-Anim spec
• Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
• Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
• BAPs can be highly compressed for streaming
Body Animation Parameters (BAPs)
• 186 humanoid skeleton Euler angles
• 110 free parameters for use with downloaded body surface mesh
• Coded using same codecs as FAPs
• Typical bitrate for coded BAPs is 5-10 kbps
Neutral Face Definition
• Head axes parallel to the world axes
• Gaze is in direction of Z axis
• Eyelids tangent to the iris
• Pupil diameter is one third of iris diameter
• Mouth is closed and the upper and lower teeth are touching
• Tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth
Face Feature Points
[Figure: numbered face feature points (groups 2 through 11) marked on the head with x, y, z axes, plus detail views of the tongue, mouth, nose, teeth, right eye and left eye. Legend: feature points affected by FAPs; other feature points.]
Face Model Independence
• FAPs are always normalized for model independence
• FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
• Private face models can be accurately animated with FAPs
• Face models can be simple or complex depending on terminal resources
Face Animation Parameter Normalization
• Face Animation Parameters (FAPs) are normalized to facial dimensions
• Each FAP is measured as a fraction of neutral face mouth width, mouth-nose distance, eye separation, or iris diameter
• 3 Head and 2 eyeball rotation FAPs are Euler angles
Neutral Face Dimensions for FAP Normalization
[Figure: neutral face annotated with the dimensions MW0, MNS0, ENS0, ES0 and IRISD0]
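A minimal sketch of how normalization makes FAPs independent of the face model, using the dimension labels from the figure. The divisor of 1024 is this sketch's assumption about the size of one FAP unit and should be checked against the MPEG-4 specification; the class name, function names and sample measurements are hypothetical, and the rotation FAPs (Euler angles) are not covered.

```python
from dataclasses import dataclass

# Assumption: one FAP unit is 1/1024 of the chosen neutral-face distance.
FAPU_DIVISOR = 1024


@dataclass(frozen=True)
class NeutralFace:
    """Neutral-face distances in the model's own units."""
    MW0: float     # mouth width
    MNS0: float    # mouth-nose distance
    ENS0: float    # eye-nose distance (expansion of the label is assumed)
    ES0: float     # eye separation
    IRISD0: float  # iris diameter

    def unit(self, dimension: str) -> float:
        """Size of one FAP unit for the named dimension."""
        return getattr(self, dimension) / FAPU_DIVISOR


def to_fap(displacement: float, face: NeutralFace, dimension: str) -> int:
    """Express a measured feature-point displacement as a normalized FAP value."""
    return round(displacement / face.unit(dimension))


def from_fap(value: int, face: NeutralFace, dimension: str) -> float:
    """Convert a FAP value back to a displacement on a (possibly different) face."""
    return value * face.unit(dimension)


# Hypothetical measurements: a performer and a larger cartoon character.
performer = NeutralFace(MW0=50.0, MNS0=32.0, ENS0=40.0, ES0=64.0, IRISD0=12.0)
character = NeutralFace(MW0=120.0, MNS0=64.0, ENS0=90.0, ES0=150.0, IRISD0=40.0)

fap = to_fap(8.0, performer, "MNS0")           # performer's lip moves 8 units
moved = from_fap(fap, character, "MNS0")       # same motion on the character
print(fap, moved)
```

Because the value is a fraction of each face's own mouth-nose distance, the same FAP produces a proportionally equivalent motion on the character, which is what lets any person drive any face model.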
Lip FAPs
Mouth closed if sum of upper and lower lip FAPs = 0
FAP Compression
• FAPs are adaptively quantized to desired quality level
• Quantized FAPs are differentially coded
• Adaptive arithmetic coding further reduces bitrate
• Typical compressed FAP bitrate is less than 2 kilobits/second
FAP Predictive Coding
[Block diagram; labels: FAP(t), +/-, Q (quantizer), Q^-1 (inverse quantizer), Frame Delay, Arithmetic Coder, Bitstream]
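One plausible reading of the predictive coding diagram is a differential (DPCM) loop: each frame's FAP is predicted from the previous reconstructed value, the difference is quantized, and the encoder updates its prediction with the dequantized difference so that it stays in step with the decoder. The sketch below follows that reading for a single FAP track. It uses a fixed quantizer step rather than the adaptive quantization named on the previous slide, it stops before the arithmetic coder, and all names and sample values are hypothetical.

```python
def encode_track(fap_values, step):
    """Differentially code one FAP track; returns the quantized symbols."""
    symbols = []
    prediction = 0  # reconstructed value of the previous frame (frame delay)
    for value in fap_values:
        residual = value - prediction        # the subtractor in the diagram
        symbol = round(residual / step)      # Q: quantize the residual
        symbols.append(symbol)
        prediction += symbol * step          # Q^-1: track what the decoder sees
    return symbols


def decode_track(symbols, step):
    """Rebuild the FAP track from the quantized symbols."""
    values = []
    prediction = 0
    for symbol in symbols:
        prediction += symbol * step
        values.append(prediction)
    return values


# A hypothetical jaw-opening track over six frames.
original = [0, 12, 25, 31, 29, 17]
symbols = encode_track(original, step=4)
decoded = decode_track(symbols, step=4)
print(symbols)
print(decoded)
```

Because the encoder predicts from reconstructed values rather than from the original input, quantization error does not accumulate: every decoded value stays within half a step of the original. The symbols are small and cluster near zero, which is the property an arithmetic coder can exploit.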
General Bandwidth Issues
• Broadband deployment is happening slowly
• 3G will not be ubiquitous for many years
• DSL availability is limited and cable is shared
• Talking heads need high frame-rate
• Consumer graphics hardware is cheap and powerful
• MPEG-4 FBA tools are matched to available bandwidth and terminals
Markerless Facial Motion Capture for Animation Production
• Track/analyze face features in each video frame
• Captured face feature motion easily converted to FAPs
• Face model is “puppeteered” by FAPs
• MPEG-4 FAPs only specify motion of feature points (not surrounding surface)
Bones rig for mouth area
Automatic Face Animation Demonstration
• FAPs extracted from camcorder video
• Inner lip, eye region and head rotation FAPs compressed to less than 2 kbits/sec
• 30 frames/sec animation generated automatically
• Face models developed with face2face plugin for Maya
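To put the demonstration figures in perspective, the short calculation below divides the quoted FAP bitrate by the quoted frame rate. It assumes one kilobit is 1000 bits and treats 2 kbits/sec as the upper bound stated on the slide; the function name is hypothetical.

```python
def bits_per_frame(bitrate_kbps, frames_per_second):
    """Average number of bits available for each animation frame."""
    return bitrate_kbps * 1000 / frames_per_second


budget = bits_per_frame(2, 30)
print(f"{budget:.1f} bits per frame")
```

At these rates each frame of lip, eye and head motion has to fit in under 70 bits on average.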
Conclusions
• Humanoid agents are required for best HCI
• Vision-based facial capture is required for humanoid design and human behavior capture
• MPEG-4 Face and Body Animation coding enables high quality mobile communication
• Ultimate HCI systems must continuously see, hear and identify the user for best reliability and security