2011-6-3 Monocular 3D Pose Estimation and Tracking by Detection


Transcript of "Monocular 3D Pose Estimation and Tracking by Detection"

  • Slide 1/15

    Monocular 3D Pose Estimation and Tracking by Detection

    Mykhaylo Andriluka, Stefan Roth, Bernt Schiele

  • Slide 2/15

    3D Pose Estimation

    Estimate the positions and angles of individual body parts in 3D space

    Monocular refers to a single-camera system

    Very reliable in controlled settings, as used in motion tracking

    Currently poor performance in realistic scenes

    Frequently relies on edge detection / background subtraction

    Potential problems: loose clothing, occlusions, ego motion, background clutter

  • Slide 3/15

    Why is it interesting for us?

    Accurate body pose estimation makes action recognition practically trivial

  • Slide 4/15

    This paper

    Performs 3D pose estimation of multiple people simultaneously with a single camera in a realistic street scene

  • Slide 5/15

    Pictorial Structures Model

    2D part-based model: each part i at frame m is represented by l_i^m = (x_i^m, y_i^m, θ_i^m, s_i^m), i.e. image position, orientation, and scale

    L^m - overall part configuration at frame m

    D^m - visual evidence (image observations) at frame m

  • Slide 6/15

    Pictorial Structures Model

    Body represented as left/right lower and upper legs, torso, head, and left/right upper and lower arms

    Each body part is detected individually by a part detector

    The posterior probability of the configuration L^m is maximized to detect the body
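
    As a hedged sketch of this step (the standard pictorial structures factorization; the exact form is not spelled out in the transcript), the posterior combines per-part appearance terms with pairwise kinematic terms over adjacent parts (i, j) in the body tree E:

      p(L^m | D^m) \propto p(D^m | L^m) p(L^m)
                   \approx \prod_i p(d_i^m | l_i^m) \prod_{(i,j) \in E} p(l_i^m | l_j^m)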

  • Slide 7/15

    Viewpoint Estimation

    This method only detects people, and only from a single viewpoint

    This paper trains 10 of these detectors on a multi-view dataset; each detector assumes a different viewpoint

    This gives us viewpoint estimation: find the detector with the strongest response to the scene
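
    A minimal sketch of this selection step, assuming each viewpoint-specific detector returns a scalar score for the same person hypothesis (the interface and the 10-detector layout below are assumptions for illustration, not taken from the paper):

      import numpy as np

      def estimate_viewpoint(detector_scores):
          """Pick the viewpoint whose detector responds most strongly.

          detector_scores: dict mapping a viewpoint label (e.g. an angle
          in degrees) to that viewpoint-specific detector's score.
          """
          viewpoints = list(detector_scores.keys())
          scores = np.array([detector_scores[v] for v in viewpoints])
          # Softmax-normalised scores can double as viewpoint "probabilities"
          # (emissions) for the tracking HMM on the following slides.
          probs = np.exp(scores - scores.max())
          probs /= probs.sum()
          best = viewpoints[int(np.argmax(scores))]
          return best, dict(zip(viewpoints, probs))

      # Example: scores from 10 detectors spaced 36 degrees apart (assumed layout).
      scores = {angle: np.random.randn() for angle in range(0, 360, 36)}
      best_view, view_probs = estimate_viewpoint(scores)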

  • Slide 8/15

    Tracklet Extraction

    Want to extract tracks of each person: relating temporal states gives us more information for body pose estimation and even more robustness against occlusion

    Use the pictorial structures model as a detector to get bounding boxes and a likely viewpoint at each frame, for each person

    Treat bounding boxes and viewpoint probabilities as emissions, and hypotheses as states, in a Hidden Markov Model

    Use Viterbi decoding to extract the most likely sequence of states/viewpoints
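
    A compact sketch of the Viterbi step for one tracklet (generic HMM decoding, not the paper's exact implementation; the log-probability inputs are assumed to come from the detections above and the transition model on the next slide):

      import numpy as np

      def viterbi(log_emit, log_trans, log_prior):
          """Most likely state sequence of a discrete HMM.

          log_emit:  (T, S) log-likelihood of each hypothesis (state) per frame
          log_trans: (S, S) log transition probabilities between hypotheses
          log_prior: (S,)   log prior over states in the first frame
          """
          T, S = log_emit.shape
          delta = log_prior + log_emit[0]          # best score ending in each state
          back = np.zeros((T, S), dtype=int)       # back-pointers for decoding
          for t in range(1, T):
              cand = delta[:, None] + log_trans    # cand[i, j]: come from i, go to j
              back[t] = np.argmax(cand, axis=0)
              delta = cand[back[t], np.arange(S)] + log_emit[t]
          path = [int(np.argmax(delta))]
          for t in range(T - 1, 0, -1):            # follow back-pointers
              path.append(int(back[t][path[-1]]))
          return path[::-1]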

  • Slide 9/15

    Tracklet Extraction

    Transition probabilities between states:

    For viewpoints: high transition probabilities between similar viewpoints, reflecting that people turn slowly

    For bounding boxes: the transition probability is based on how similar the RGB colour histograms within the two bounding boxes are (similar appearance suggests the same person)
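
    A hedged sketch of such an appearance term, using the Bhattacharyya coefficient between normalised RGB histograms (the exact distance and scaling used in the paper are not given on the slide; this is one standard choice):

      import numpy as np

      def rgb_histogram(patch, bins=8):
          """Normalised joint RGB histogram of an image patch of shape (H, W, 3), values 0-255."""
          hist, _ = np.histogramdd(patch.reshape(-1, 3),
                                   bins=(bins, bins, bins),
                                   range=((0, 256),) * 3)
          return hist / hist.sum()

      def appearance_transition(patch_a, patch_b):
          """Higher when the two boxes look alike, i.e. likely the same person."""
          ha, hb = rgb_histogram(patch_a), rgb_histogram(patch_b)
          return float(np.sum(np.sqrt(ha * hb)))   # 1.0 for identical histograms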

  • Slide 10/15

    3D Pose Estimation

    Use 2D->3D exemplars to pick the most likely 3D pose in the tracklet for each frame. This gives us M body pose hypotheses, one per frame, where M is the length of the tracklet

    3D body pose at frame m: Q^m = (q^m, φ^m, h^m)

    q - joint angle configuration

    φ - rotation of the body in the 3D world

    h - position and scale of the body
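
    A rough sketch of the exemplar lookup described above, assuming a database of paired 2D projections and 3D poses and a simple nearest-neighbour match on joint positions (the paper's actual matching criterion may differ):

      import numpy as np

      def lift_to_3d(pose_2d, exemplars_2d, exemplars_3d):
          """Return the 3D exemplar whose stored 2D projection best matches the estimate.

          pose_2d:      (P, 2)    estimated 2D joint positions in one frame
          exemplars_2d: (N, P, 2) 2D projections stored with the exemplar database
          exemplars_3d: list of N corresponding 3D poses (joint angles etc.)
          """
          diffs = exemplars_2d - pose_2d[None]             # (N, P, 2)
          dists = np.linalg.norm(diffs, axis=-1).sum(-1)   # total joint distance per exemplar
          return exemplars_3d[int(np.argmin(dists))]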

  • Slide 11/15

    Representation of Pose

  • Slide 12/15

    3D Pose Estimation

    Single Frame likelihood:

    Breakdown:

  • Slide 13/15

    Position of Body Parts

    To reduce the computational complexity of 3D body pose estimation, we find the J most likely locations for each body part n in frame m, and then fit a Gaussian distribution to that body part's locations.

    This allows the posterior probability to be modelled as:
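
    A small sketch of the reduction step above: keep the J highest-scoring locations of one part and summarise them with a single Gaussian (the score-weighted mean/covariance here is an assumption; the slide only states that a Gaussian is fitted):

      import numpy as np

      def part_gaussian(locations, scores, J=20):
          """Fit a Gaussian to the top-J detections of one body part.

          locations: (N, 2) candidate (x, y) positions from the part detector
          scores:    (N,)   non-negative detector scores for those positions
          Returns the (mean, covariance) of the score-weighted top-J locations.
          """
          top = np.argsort(scores)[-J:]
          pts, w = locations[top], scores[top]
          w = w / w.sum()
          mean = (w[:, None] * pts).sum(axis=0)
          diff = pts - mean
          cov = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
          return mean, cov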

  • Slide 14/15

    hGPLVM

    Given the above information, with a prior of p(Q^{1:M}) = p(q^{1:M}) p(h^{1:M}), we can estimate the posterior probability over the frames using a hierarchical Gaussian Process Latent Variable Model (hGPLVM)

    This models the sequence of poses as a Gaussian process and solves for the pose sequence using MAP estimation
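
    As a hedged summary in the notation of the earlier slides (the slide itself shows no formula), the final pose sequence for a tracklet of length M is the MAP estimate under this factorized prior:

      p(Q^{1:M} | D^{1:M}) \propto p(D^{1:M} | Q^{1:M}) p(q^{1:M}) p(h^{1:M})

      \hat{Q}^{1:M} = \arg\max_{Q^{1:M}} p(Q^{1:M} | D^{1:M})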

  • Slide 15/15

    Pose Estimation Examples