Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake

CVPR 2011

PRESENTER: AHSAN ABDULLAH

PROBLEM

[Figure: depth image annotated with joint labels: right elbow, right hand, left shoulder, neck]

APPROACH

• Partitioning into body parts helps localizing the joints

Shotton et al., CVPR 2011

1. capture depth image & remove bg

2. infer body parts per pixel

3. cluster pixels to hypothesize body joint positions

4. fit model & track skeleton

PIPELINE


Design Goals

• Efficiency

• Robustness

Compute P(ci|wi)

pixels i = (x, y)

body part ci

image window wi

Discriminative approach

learn classifier P(ci|wi) from training data

image windows move with the classifier (one window wi per pixel i)

BODY PART CLASSIFICATION


LEARNING DATA

synthetic (train & test)

real (test)

Shotton et al., CVPR 2011

LEARNING – DATA SYNTHESIS

Record MoCap: 500k frames distilled to 100k poses

Retarget to several models

Render (depth, body parts) pairs


• Depth comparisons

- very fast to compute

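[Figure: input depth image with probe pixels x and offsets Δ]

$f(I, \mathbf{x}) = d_I(\mathbf{x}) - d_I(\mathbf{x} + \Delta)$

f: feature response; d_I(x): image depth at image coordinate x; d_I(x + Δ): offset depth

Background pixels: d = large constant

Offset scales inversely with depth: $\Delta = \mathbf{v} / d_I(\mathbf{x})$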
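The depth comparison feature can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `depth_at`, `depth_feature` and `LARGE_DEPTH`, the use of `None` to mark background pixels, the rounding of the offset to whole pixels, and treating out-of-image probes like background are all assumptions made here.

```python
LARGE_DEPTH = 1e6  # assumed value for the "large constant" given to background


def depth_at(depth, x, y):
    """Depth at pixel (x, y). Background pixels (None) and, by assumption,
    probes that fall outside the image return LARGE_DEPTH."""
    height, width = len(depth), len(depth[0])
    if not (0 <= x < width and 0 <= y < height):
        return LARGE_DEPTH
    d = depth[y][x]
    return LARGE_DEPTH if d is None else d


def depth_feature(depth, x, y, v):
    """f(I, x) = d_I(x) - d_I(x + Delta), with Delta = v / d_I(x).

    Dividing v by the depth at x makes the pixel offset shrink for
    far-away bodies, so the probe covers a similar world-space distance."""
    d = depth_at(depth, x, y)
    dx = int(round(v[0] / d))  # offset rounded to whole pixels (assumption)
    dy = int(round(v[1] / d))
    return d - depth_at(depth, x + dx, y + dy)


# Tiny hypothetical depth image: a body at depth 2.0, background as None.
depth = [
    [None, None, None],
    [2.0, 2.0, None],
    [2.0, 2.0, None],
]
on_body = depth_feature(depth, 0, 1, (2.0, 0.0))   # probe lands on the body
off_body = depth_feature(depth, 1, 1, (2.0, 0.0))  # probe lands on background
```

A probe that stays on the body gives a response near zero, while a probe that lands on background gives a large negative response, which is what lets a single threshold separate the two cases.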

FEATURE SET


Aggregation of decision trees

DECISION FORESTS

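Qn = (I, x): training examples at node n, for all pixels

Split test at node n: f(I, x; Δn) > θn (no → child l, yes → child r)

Distribution over body part c at each node: Pn(c) at node n, Pl(c) and Pr(c) at its children l and r

Take (Δ, θ) that maximises information gain (reduce entropy)

[Breiman et al. 84]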
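Choosing the split that maximises information gain can be illustrated as follows. This is a simplified sketch under stated assumptions: it searches only over thresholds θ for feature responses that have already been computed with one fixed offset Δ, whereas the slide's search is over (Δ, θ) pairs. The function names and the toy samples are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of body part labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(samples, threshold):
    """Entropy reduction from splitting on f > threshold.

    samples: non-empty list of (feature_response, body_part_label) pairs."""
    left = [c for f, c in samples if f <= threshold]   # "no" branch
    right = [c for f, c in samples if f > threshold]   # "yes" branch
    n = len(samples)
    parent = entropy([c for _, c in samples])
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return parent - children


def best_threshold(samples, candidates):
    """Candidate threshold with the highest information gain."""
    return max(candidates, key=lambda t: information_gain(samples, t))


# Hypothetical responses for four pixels labelled left (L) or right (R).
samples = [(-1.0, "L"), (-0.5, "L"), (0.5, "R"), (1.0, "R")]
theta = best_threshold(samples, [-0.75, 0.0, 0.75])
```

Extending the sketch to the full search would mean recomputing the feature responses for each candidate Δ and taking the maximum over both parameters.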


TRAINING DECISION TREES

image window centred at x

Toy example: Distinguish left (L) and right (R) sides of the body

[Figure: two-level decision tree with split tests f(I, x; Δ1) > θ1 and f(I, x; Δ2) > θ2, no/yes branches, and a histogram P(c) over {L, R} at each leaf]
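Classifying one pixel with a trained tree amounts to walking from the root to a leaf, as in the toy example. The sketch below assumes a nested-dictionary tree layout, and the offsets, thresholds and leaf probabilities are made-up illustration values, not numbers from the paper. The `feature` argument stands in for evaluating f(I, x; Δ) at the pixel being classified.

```python
def classify(node, feature):
    """Walk from the root to a leaf and return the leaf's distribution P(c).

    node: split nodes hold "offset", "threshold", "no", "yes";
          leaf nodes hold "dist" (hypothetical layout).
    feature: callable mapping an offset to the feature response at this pixel."""
    while "dist" not in node:
        branch = "yes" if feature(node["offset"]) > node["threshold"] else "no"
        node = node[branch]
    return node["dist"]


# Toy tree separating left (L) from right (R); all numbers are illustrative.
toy_tree = {
    "offset": (10.0, 0.0), "threshold": 0.0,
    "no": {"dist": {"L": 0.9, "R": 0.1}},
    "yes": {
        "offset": (-10.0, 0.0), "threshold": 0.0,
        "no": {"dist": {"L": 0.2, "R": 0.8}},
        "yes": {"dist": {"L": 0.5, "R": 0.5}},
    },
}

# Stand-in feature responses for one pixel, keyed by offset.
responses = {(10.0, 0.0): 1.5, (-10.0, 0.0): -0.3}
posterior = classify(toy_tree, lambda offset: responses[offset])
```

Each pixel needs only one feature evaluation per tree level, which is consistent with the efficiency goal stated earlier in the slides.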


DECISION TREE CLASSIFICATION

Each tree is trained on a different random subset of images

“bagging” helps avoid over-fitting

Average tree posteriors

[Amit & Geman 97]

[Breiman 01]

[Geurts et al. 06]

[Figure: trees 1 … T, each mapping (I, x) to a leaf distribution P1(c), …, PT(c)]

$P(c \mid I, \mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, \mathbf{x})$
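Averaging the tree posteriors is a one-line computation once each tree has returned its leaf distribution. The sketch below assumes each distribution is a dictionary from body part label to probability; `forest_posterior` and the example values are hypothetical.

```python
def forest_posterior(tree_posteriors):
    """P(c | I, x) = (1 / T) * sum over t of P_t(c | I, x).

    tree_posteriors: list of T dicts mapping body part label to probability.
    A label missing from a tree's dict is treated as probability 0."""
    num_trees = len(tree_posteriors)
    classes = set().union(*tree_posteriors)
    return {
        c: sum(p.get(c, 0.0) for p in tree_posteriors) / num_trees
        for c in classes
    }


# Hypothetical leaf distributions reached in three trees for the same pixel.
per_tree = [
    {"L": 0.9, "R": 0.1},
    {"L": 0.5, "R": 0.5},
    {"L": 0.7, "R": 0.3},
]
averaged = forest_posterior(per_tree)
most_likely = max(averaged, key=averaged.get)
```

Taking the arg max of the averaged distribution gives the "most likely" body part label for the pixel.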


DECISION FOREST CLASSIFIER

[Figure: ground truth vs. inferred body parts (most likely) for 1 tree, 3 trees and 6 trees]

[Plot: average per-class … (40% to 55%) vs. number of trees (1 to 6)]


NUMBER OF TREES

[Plots: average per-class accuracy (30% to 65%) vs. depth of trees, for synthetic test data and real test data]


TREE DEPTH

• Define 3D world space density

• Mean shift for mode detection

Body parts to joint hypotheses

3. hypothesize body joints

[Equation figure: per-part density over 3D world space, annotated with: pixel index i; bandwidth; 3D coord; 3D coord of i-th pixel; pixel weight; inferred probability; depth at i-th pixel]
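Mode detection with mean shift over a weighted 3D density can be sketched as below. Several parts are assumptions rather than content from the slides: the Gaussian kernel exp(-||(x - x_i) / b||^2), the pixel weight taken as inferred probability times squared depth, and the names `mean_shift_mode` and `pixel_weight`. The example points are invented.

```python
import math


def pixel_weight(prob, depth):
    """Assumed weight: inferred probability times squared depth, so that
    pixels on distant surfaces (which cover more area) count for more."""
    return prob * depth ** 2


def mean_shift_mode(points, weights, start, bandwidth, iters=50, tol=1e-6):
    """Climb to a local mode of sum_i w_i * exp(-||(x - x_i) / b||^2).

    points: 3D coordinates of the pixels; weights: one weight per pixel;
    start: initial 3D guess; bandwidth: kernel width b."""
    x = list(start)
    for _ in range(iters):
        num = [0.0, 0.0, 0.0]
        den = 0.0
        for p, w in zip(points, weights):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            k = w * math.exp(-sq_dist / bandwidth ** 2)
            den += k
            for j in range(3):
                num[j] += k * p[j]
        if den == 0.0:  # no pixel has any influence at this location
            break
        new_x = [n / den for n in num]  # kernel-weighted mean of the points
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x


# Three pixels clustered near (0, 0, 2) plus one distant, low-weight outlier.
points = [(0.0, 0.0, 2.0), (0.1, 0.0, 2.0), (-0.1, 0.0, 2.0), (5.0, 5.0, 2.0)]
weights = [pixel_weight(0.8, 2.0)] * 3 + [pixel_weight(0.1, 2.0)]
mode = mean_shift_mode(points, weights, start=(0.05, 0.0, 2.0), bandwidth=0.5)
```

Starting the climb from several locations and keeping the distinct modes found would give multiple joint hypotheses per body part; how the starting points are chosen is not covered by this sketch.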


[Figure: input depth, inferred body parts, and inferred joint positions (front view, top view, side view). Shotton et al., CVPR 2011]

No tracking or smoothing

[Bar chart: average precision (0.0 to 1.0) per joint: Center Head, Center Neck, Left Shoulder, Right …, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot, and Mean AP]

JOINT PREDICTION ACCURACY

[Bar chart: average precision (0.0 to 1.0) per joint: Center Head, Center Neck, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hand, Right Hand, Left Knee, Right Knee, Left Ankle, Right Ankle, Left Foot, Right Foot, and Mean AP. Series: joint prediction from ground truth body parts vs. joint prediction from inferred body parts]


JOINT PREDICTION ACCURACY

• No temporal information

- frame-by-frame

• Very fast

- simple depth image feature

- parallel decision forest classifier


ANALYSIS

Uses…

• 3D joint hypotheses

• kinematic constraints

• temporal coherence

… to give

• full skeleton

• higher accuracy

• invisible joints

• multi-player

4. track skeleton

KINECT SYSTEM

• Frame-by-frame gives robustness

• Body parts representation for efficiency

• Fast, simple machine learning

• Significant engineering to scale to a massive, varied training data set


SUMMARY

QUESTIONS
