Where you are the controller
Krishna Kumar, Sr. Developer Evangelist - [email protected]
Started as a $30,000 prototype
Vision: Shift the world from thinking “We need to understand technology” to “Technology needs to understand us”
Option A:
Why Kinect?
Option You:
What is Kinect?
An extraordinary new way to play, where you are the controller
Voice Recognition
Face Recognition
You Recognition
Gesture Recognition
“Xbox?!”
Kinect knows what to do!
“Let’s Play!”
“What are those things?”
3D Depth Sensors (① and ③)
Projected Invisible IR pattern
Depth Computation
Depth Map
“What are those things?”
RGB Camera (②)
“What are those things?”
Multi-array Microphone
“What are those things?”
Motorized Tilt
Combination of RGB camera, depth sensor and multi-array microphone
• RGB camera delivers the three basic color components
• Depth sensor “sees” the room in 3-D
• Microphone array locates voices by sound and filters out ambient noise
Software makes all the magic possible
• Skeletal Tracking
• Face, Gesture Recognition
• Audio Echo Cancellation
• Audio Beam Forming
• Speech Recognition
© 2010 Microsoft Corporation. All rights reserved.
Scope of Microsoft Research
• Significant Investment
– Investing > $9B in R&D (MSR & product dev)
• Staff of over 850 in 55 research areas
• International research lab locations:
– Redmond, Washington (Sept, 1991)
– San Francisco, California (1995)
– Cambridge, United Kingdom (July, 1997)
– Beijing, People’s Republic of China (Nov, 1998)
– Mountain View, California (July, 2001)
– Bangalore, India (January, 2005)
– Cambridge, Massachusetts (February, 2008)
Turning ideas into reality.
research.microsoft.com
Scope of Microsoft Research: Research Areas
“Xbox?!” “Let’s Play!”
How does Kinect know what I do?
J. Shotton, J. Winn, C. Rother, A. Criminisi, TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-Class Object Recognition and Segmentation. European Conference on Computer Vision, 2006
Microsoft Research: Object Recognition
Microsoft Research: Human Body Tracking
• Wide range of motion
• But limited agility
• And not real-time
• Infinite number of movements
R. Navaratnam, A. Fitzgibbon, R. Cipolla, The Joint Manifold Model for Semi-supervised Multi-valued Regression. IEEE Intl Conf on Computer Vision, 2007
Xbox calls MSR: September 2008
“We need a body tracker with
• All body motions…
• All agilities…
• 10x Real-time…
• For multiple players…
• … and it has to be 3D”
MSR’s response?
Teach the Computer/Machine Learning
Step 1: Collect A LOT of Data
Teams visit households across the globe, filming real users
Hollywood motion capture studio generates billions of CG images
Training Data
Training
• Millions of training images -> millions of classifier parameters
• Very far from “embarrassingly parallel”
• New algorithm for distributed decision-tree training
• Major use of DryadLINQ (available for download)
M. Isard, Y. Yu, Distributed Data-Parallel Computing Using a High-Level Programming Language. International Conference on Management of Data (SIGMOD), July 2009
[Figure: sequence of frames at t=1, t=2, t=3]
Recognize Joint Angles
• Classify each pixel’s probability of being each of 32 body parts
• Determine probabilistic cluster of body configurations consistent with those parts
• Present the most probable to the user
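The steps above can be illustrated with a toy sketch. It assumes the per-pixel body-part probabilities are already available (the deck does not show the classifier itself) and swaps in a simple probability-weighted centroid for the clustering step, so it is not the production algorithm. The function name `propose_joints` and the nested-list data layout are hypothetical.

```python
def propose_joints(part_probs, depth_m):
    """Toy joint proposals from per-pixel body-part probabilities.

    part_probs[r][c] is a list of probabilities, one per body part.
    depth_m[r][c] is the depth in meters (0 means unknown, so it is skipped).
    Returns {part_index: (row, col, depth)} as probability-weighted centroids.
    """
    sums = {}  # part -> [total weight, weighted row, weighted col, weighted depth]
    for r, row in enumerate(part_probs):
        for c, probs in enumerate(row):
            z = depth_m[r][c]
            if z <= 0:
                continue  # unknown depth contributes nothing
            for part, p in enumerate(probs):
                if p <= 0:
                    continue
                acc = sums.setdefault(part, [0.0, 0.0, 0.0, 0.0])
                acc[0] += p
                acc[1] += p * r
                acc[2] += p * c
                acc[3] += p * z
    return {
        part: (acc[1] / acc[0], acc[2] / acc[0], acc[3] / acc[0])
        for part, acc in sums.items()
    }


# A 1 x 2 "image" with two body parts (hypothetical values).
probs = [[[1.0, 0.0], [0.5, 0.5]]]
depth = [[2.0, 3.0]]
proposals = propose_joints(probs, depth)
```

A weighted centroid is the simplest stand-in for "the most probable" configuration; a real tracker would also need to handle several candidate clusters per part.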
Programmer’s View
A Platform is Born
Consumer Technologies Push The Envelope
Price: $6000
Price: $150
Play Space
Field of View and Operational Area
• Play Space: Ideally 12 ft x 12 ft of play space, though you can make do with 10 ft x 10 ft
• Player Position: Ideally 6-10 feet away from the camera
Lighting and Environment
• Fluorescent or LED lighting is recommended
• No direct light on player
• No direct light into sensor lens
• In a stage environment, all lights need to be infrared-filtered
• To avoid lighting noise, do not intersect sensor lens fields of view
• Avoid playing in or next to reflective surfaces
Clothing Considerations
• Avoid anything that conceals your arms or legs
• Avoid wearing flowing clothing such as scarves or long dresses and skirts
– Long skirts hide the legs and scarves are often mistaken for arms
• Avoid baggy jackets or overly baggy clothing
• Generally, anything that hides the human form should be removed for optimal game play
• If players with long hair are having difficulty playing, encourage them to pull their hair back and try playing again
Kinect with more than just games
Use your voice or a wave of your hand to:
• Video Kinect with others*
• Manage your media gallery
• Music with Last.fm*
• HD movies with Zune
• Get in the game with ESPN*
* with Xbox LIVE Gold membership
XBOX LIVE: More Ways to Connect with Family and Friends
VIDEO KINECT
• Connect with family and far away friends, all from the comfort of your living room with Xbox LIVE Video Chat
• Experience the ease and convenience of chat on the big screen with Kinect-enabled auto camera zoom and pan
FAMILY CENTER
• Family Center makes it easy to manage multiple user accounts and edit privacy settings from a single location
• Ensure safe, secure fun for the whole family
SOCIAL NETWORKS
• Connect with friends, share photos and updates through Facebook and Twitter
ESPN: Home-field advantage in your living room
• Access over 3,500 live global events from ESPN3.com, including out-of-market programming plus fresh video clips from ESPN.com
• Enjoy features like HD programming and on-demand viewing, and participate in polls, predictions and trivia
• See what the Xbox LIVE community is watching and declare what team you’re rooting for
• With Kinect™, control the action right from your couch with just your voice or the wave of your hand
• Featured Content: NCAA Football, NCAA Basketball, College Bowl Games, NBA, MLB, Soccer, Golf and Tennis majors
Where can Kinect go?
• Air Guitar Hero?
• Shopping in 3D?
• Remote Replacement?
• Dance Instructor?
• Education?
• Personal Trainer?
• Physical Therapy?
“Xbox?”
The Kinect SDK
• Provides both unmanaged and managed APIs
– Unmanaged API: concepts work in C++
– Managed API: concepts work in both VB and C#
• Samples & documentation to get you started
• Assumes some programming experience
• http://research.microsoft.com/kinectsdk/
The Kinect Sensor
A hybrid device containing the following input devices:
• A color (RGB) camera
• A depth sensor
• A microphone array
• A tilt sensor
Play space control is done through a tilt motor
• Pitch: +/- 27 degrees
RGB CAMERA
MULTI-ARRAY MIC MOTORIZED TILT
3D DEPTH SENSORS
Kinect USB cable
The Innards
The Vision System
IR laser projector
IR camera
RGB camera
Kinect video output
• 30 Hz frame rate; 57-degree field of view
• 8-bit VGA RGB: 640 x 480
• 12-bit monochrome: 320 x 240
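To relate the 57-degree field of view to the play-space guidance, here is a rough sketch of how wide an area the sensor covers at a given distance. It assumes the 57 degrees is the horizontal field of view and treats the camera as an ideal pinhole; `view_width` is a hypothetical helper, not an SDK call.

```python
import math


def view_width(distance, fov_degrees=57.0):
    """Width of the visible area at `distance`, in the same unit as `distance`.

    Simple pinhole geometry: width = 2 * distance * tan(fov / 2).
    """
    return 2.0 * distance * math.tan(math.radians(fov_degrees) / 2.0)


# Coverage at the near and far ends of a 6-10 ft player position.
near_width_ft = view_width(6.0)   # roughly 6.5 ft
far_width_ft = view_width(10.0)   # roughly 10.9 ft
```

Under these assumptions the visible width grows linearly with distance, which is one reason a player standing too close is easily clipped at the edges of the frame.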
The Audio System
Input Stream (What the mic array hears)
Post-MEC (What APIs present)
MEC
Demo: Multichannel Echo Cancellation
The Kinect SDK
Provides access to:
• RGB feed
• Depth feed
• Skeletal Tracking capabilities
• Audio Beam data
• Speech Recognition
Data Streams
• Color stream at 640 x 480 resolution; 32 BPP
• Depth stream at 320 x 240 resolution; 16 BPP
• Skeletal joint positions
• Frame #s, timestamps, tilt sensor data
• Echo-canceled audio
• Higher level systems
– Speech recognition
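As a quick arithmetic check on the stream formats listed above, the sketch below computes the raw size of one frame and the raw data rate. It assumes both streams run at the 30 Hz frame rate quoted for the video output and ignores any row padding; the helper names are hypothetical.

```python
def frame_bytes(width, height, bits_per_pixel):
    """Raw size of one frame in bytes, assuming no row padding."""
    return width * height * bits_per_pixel // 8


def bytes_per_second(frame_size, frames_per_second=30):
    """Raw data rate for a stream delivering `frames_per_second` frames."""
    return frame_size * frames_per_second


color_frame = frame_bytes(640, 480, 32)  # 1,228,800 bytes
depth_frame = frame_bytes(320, 240, 16)  # 153,600 bytes
color_rate = bytes_per_second(color_frame)  # 36,864,000 bytes/s
depth_rate = bytes_per_second(depth_frame)  # 4,608,000 bytes/s
```

These numbers are useful when sizing buffers: a color frame is eight times larger than a depth frame, so copying color data per frame is the more expensive operation.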
RGB Camera Fundamentals
Camera Data
RGB stream format
• Up to 640 x 480 resolution
• Up to 32 bits per pixel
• Data contained in ImageFrame.Image.Bits
• Array of bytes: public byte[] Bits;
• Array layout
– Starts at top left of image
– Moves left to right, then top to bottom
Stride
Stride - # of bytes from one row of pixels in memory to the next
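The sketch below shows how stride is typically used to find a pixel in a flat byte array laid out as described (top left first, left to right, then top to bottom). The tiny 2 x 2 buffer with 4 padding bytes per row is made up to show that stride can exceed width times bytes per pixel; `pixel_offset` and `read_pixel` are hypothetical helpers, and the channel order is deliberately left uninterpreted.

```python
def pixel_offset(x, y, stride, bytes_per_pixel=4):
    """Byte index of the first byte of pixel (x, y)."""
    return y * stride + x * bytes_per_pixel


def read_pixel(bits, x, y, stride, bytes_per_pixel=4):
    """Return the raw bytes of pixel (x, y) as a tuple."""
    start = pixel_offset(x, y, stride, bytes_per_pixel)
    return tuple(bits[start:start + bytes_per_pixel])


# 2 x 2 image at 4 bytes per pixel, plus 4 hypothetical padding bytes per row,
# so stride = 2 * 4 + 4 = 12.
stride = 12
bits = bytes(range(24))
bottom_right = read_pixel(bits, 1, 1, stride)

# For an unpadded 640-pixel-wide, 32-bit image the stride would be 640 * 4.
unpadded_stride = 640 * 4
```

Indexing with the stride rather than with width times bytes per pixel keeps the code correct if rows are ever padded.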
Demos::RGB Camera
Depth Camera Fundamentals
Camera Data
Depth Map Format
• 320 x 240 resolution
• 16 bits per pixel
– Upper 13 bits: depth in mm (800 mm to 4000 mm range)
– Lower 3 bits: segmentation mask
• Depth value 0 means unknown
– Shadows, low reflectivity, and high reflectivity are among the reasons
• Segmentation index
– 0: no player
– 1: skeleton 0
– 2: skeleton 1
– …
Depth Byte Buffer
ImageFrame.Image.Bits
• Array of bytes: public byte[] Bits;
• Array layout
– Starts at top left of image
– Moves left to right, then top to bottom
– Represents distance for pixel
Calculating Distance
• 2 bytes per pixel (16 bits)
• Depth: distance per pixel
– Bitshift second byte left by 8
– Distance (0,0) = (int)(Bits[0] | Bits[1] << 8);
• DepthAndPlayerIndex: includes player index
– Bitshift first byte right by 3 (dropping the player index) and second byte left by 5
– Distance (0,0) = (int)(Bits[0] >> 3 | Bits[1] << 5);
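The two distance formulas can be checked with a small sketch that works on a plain byte buffer. It assumes the layout described in the deck (two bytes per pixel, low byte first, with the lower 3 bits holding the player index in the combined format); the function names are hypothetical and nothing here calls the Kinect SDK.

```python
def depth_mm(bits, pixel_index=0):
    """Depth-only format: the 16-bit value is the distance in millimeters."""
    low, high = bits[2 * pixel_index], bits[2 * pixel_index + 1]
    return low | (high << 8)


def depth_and_player(bits, pixel_index=0):
    """Combined format: upper 13 bits are depth in mm, lower 3 bits the player index."""
    low, high = bits[2 * pixel_index], bits[2 * pixel_index + 1]
    value = low | (high << 8)
    return value >> 3, value & 0x07


def pack_depth_and_player(distance_mm, player_index):
    """Build the two bytes for one pixel (used here only to create test data)."""
    value = (distance_mm << 3) | (player_index & 0x07)
    return bytes([value & 0xFF, value >> 8])


# One pixel at 1000 mm belonging to player index 2.
pixel = pack_depth_and_player(1000, 2)
distance, player = depth_and_player(pixel)

# The same result using the deck's expression: Bits[0] >> 3 | Bits[1] << 5
deck_distance = (pixel[0] >> 3) | (pixel[1] << 5)
```

Remember that a decoded depth of 0 means the distance is unknown, so callers should treat it as missing data rather than as a very near object.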
Demos::Depth Camera
Skeletal Tracking Fundamentals
Human Depth Sensing: Object pattern similarity determines disparity
Kinect Depth Sensing: IR pattern similarity determines disparity
IR Projector
IR Camera
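For readers unfamiliar with disparity, the standard triangulation relation can be sketched as follows: depth is inversely proportional to the shift (disparity) of a matched pattern between two viewpoints. This is the textbook relation, not a description of the sensor's internal processing, and the focal length and baseline values below are purely illustrative assumptions.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Textbook triangulation: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to triangulate a depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical numbers: 580 px focal length, 7.5 cm projector-to-camera baseline.
near = depth_from_disparity(580.0, 0.075, 43.5)   # larger shift, closer surface
far = depth_from_disparity(580.0, 0.075, 21.75)   # smaller shift, farther surface
```

Because depth varies with the inverse of disparity, a fixed disparity error costs more accuracy at long range than at short range.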
Provided Data
Pipeline Architecture
Title Space
Skeleton API
Joints
• Maximum two players tracked at once
– Six player proposals
• Each player with set of <x, y, z> joints in meters
• Each joint has associated state
– Tracked, Not Tracked, or Inferred
• Inferred: occluded, clipped, or low confidence joints
• Not Tracked: rare, but your code must check for this state
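A minimal sketch of the state check the slide asks for is shown below. The `JointState` enum, the joint names, and the dictionary layout are hypothetical stand-ins for whatever the skeleton API returns; the point is only that code should filter on the state before using a position.

```python
from enum import Enum


class JointState(Enum):
    NOT_TRACKED = 0
    INFERRED = 1
    TRACKED = 2


def usable_joints(joints, allow_inferred=True):
    """Keep only joints whose positions are safe to use.

    `joints` maps a joint name to (state, (x, y, z)) with positions in meters.
    Not-tracked joints are always dropped; inferred joints are optional.
    """
    accepted = {JointState.TRACKED}
    if allow_inferred:
        accepted.add(JointState.INFERRED)
    return {
        name: position
        for name, (state, position) in joints.items()
        if state in accepted
    }


# Hypothetical skeleton fragment.
skeleton = {
    "head": (JointState.TRACKED, (0.0, 0.6, 2.1)),
    "left_hand": (JointState.INFERRED, (-0.4, 0.1, 2.0)),
    "right_foot": (JointState.NOT_TRACKED, (0.0, 0.0, 0.0)),
}
strict = usable_joints(skeleton, allow_inferred=False)
relaxed = usable_joints(skeleton)
```

Whether to accept inferred joints is a design choice: gestures that need precision may want tracked joints only, while a coarse avatar can usually tolerate inferred ones.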
Provided Data: Depth and segmentation map
Demos::Skeletal Tracking
Audio Fundamentals
Going Inside the Kinect
• Four-microphone array with hardware-based audio processing
– Multichannel echo cancellation (MEC)
– Sound position tracking
– Other digital signal processing (noise suppression and reduction)
Audio Data
Speech Recognition
• Grammar: what we are listening for
– Code: GrammarBuilder, Choices
– Speech Recognition Grammar Specification (SRGS)
– C:\Program Files (x86)\Microsoft Speech Platform SDK\Samples\Sample Grammars\
Note: Set AutomaticGainControl = false
Grammar
<!-- Confirmation_YesNo._value: string ["Yes", "No"] -->
<rule id="Confirmation_YesNo" scope="public">
  <example> yes </example>
  <example> no </example>
  <one-of>
    <item> <ruleref uri="#Confirmation_Yes" /> </item>
    <item> <ruleref uri="#Confirmation_No" /> </item>
  </one-of>
  <tag> out = rules.latest() </tag>
</rule>
<!-- Confirmation_Yes._value: string ["Yes"] -->
<rule id="Confirmation_Yes" scope="public">
  <example> yes </example>
  <example> yes please </example>
  <one-of>
    <item> yes </item>
    <item> yeah </item>
    <item> yep </item>
    <item> ok </item>
  </one-of>
  <item repeat="0-1"> please </item>
  <tag> out._value = "Yes";</tag>
</rule>
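To make it concrete which utterances the Confirmation_Yes rule is meant to match, the sketch below expands that rule into its accepted phrases. It is only an illustration of the grammar's structure using Python's standard XML parser; it handles just the two constructs this rule uses (a one-of list and an optional item) and has nothing to do with how the speech engine itself processes SRGS.

```python
import itertools
import xml.etree.ElementTree as ET

RULE_XML = """
<rule id="Confirmation_Yes" scope="public">
  <example> yes </example>
  <example> yes please </example>
  <one-of>
    <item> yes </item>
    <item> yeah </item>
    <item> yep </item>
    <item> ok </item>
  </one-of>
  <item repeat="0-1"> please </item>
  <tag> out._value = "Yes";</tag>
</rule>
"""


def accepted_phrases(rule_xml):
    """Expand a simple rule into every phrase it accepts."""
    rule = ET.fromstring(rule_xml)
    slots = []  # one list of alternatives per position in the phrase
    for child in rule:
        if child.tag == "one-of":
            slots.append([item.text.strip() for item in child.findall("item")])
        elif child.tag == "item":
            word = child.text.strip()
            optional = child.get("repeat") == "0-1"
            slots.append(["", word] if optional else [word])
        # <example> and <tag> elements do not affect what is matched
    return [
        " ".join(word for word in combination if word)
        for combination in itertools.product(*slots)
    ]


phrases = accepted_phrases(RULE_XML)
```

Expanding a grammar this way is a handy sanity check before testing with a microphone, since it shows at a glance whether a phrase such as "yep please" is covered.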
Demos::Audio