Pose Machine

05/01/2023 1

Pose Machines: Estimating Articulated Pose from Images

Robotics Institute, Carnegie Mellon University

Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell, Yaser Sheikh. In ECCV 2014 (Oral presentation).

Goal: Articulated Pose Estimation

https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg

https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1

Which part corresponds to a body part?

• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful

Using Local Image Evidence: Multi-Class Classification of Patches

[Figure: an input image; the patch at image location z yields image features x_z, which the predictor g1 classifies into a part label such as "hand".]

Requires a high-capacity supervised predictor capable of handling multi-modal data
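The patch classification idea above can be sketched as follows. This is only an illustration: the label set, the two-dimensional features, the weights, and the linear softmax scorer are all assumptions standing in for the high-capacity predictor g1 that the slide calls for.

```python
import math

PARTS = ["head", "hand", "background"]  # hypothetical label set

# Hypothetical weights: one row of feature weights per part label.
WEIGHTS = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

def softmax(scores):
    """Turn raw scores into confidences that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def g1(features, weights=WEIGHTS):
    """Linear stand-in for the stage-1 predictor (an assumption)."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(scores)

def confidence_maps(feature_grid):
    """Classify every location z = (x, y); returns maps[part][y][x]."""
    height, width = len(feature_grid), len(feature_grid[0])
    maps = [[[0.0] * width for _ in range(height)] for _ in PARTS]
    for y in range(height):
        for x in range(width):
            for p, conf in enumerate(g1(feature_grid[y][x])):
                maps[p][y][x] = conf
    return maps

def best_label(maps, x, y):
    """Index of the most confident part at location (x, y)."""
    return max(range(len(maps)), key=lambda p: maps[p][y][x])

# Toy 2x2 grid of made-up feature vectors.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 0.0], [0.0, 3.0]]]
maps = confidence_maps(grid)
```

Each location ends up with one confidence per part, which is the per-part confidence map that later slides build on.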

Using Local Image Evidence: A Classical Sliding Window Detection Pipeline

Image → Feature Extraction → Classification
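A minimal sketch of the image, feature extraction, classification pipeline named above. The window size, the two hand-made features, and the brightness-based scorer are hypothetical placeholders rather than anything taken from the slides.

```python
def sliding_windows(image, size, stride):
    """Yield ((top, left), patch) for every window that fits in the image."""
    height, width = len(image), len(image[0])
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            yield (top, left), patch

def extract_features(patch):
    """Hypothetical features: mean intensity and intensity range."""
    values = [v for row in patch for v in row]
    return [sum(values) / len(values), max(values) - min(values)]

def classify(features):
    """Hypothetical scorer: brighter windows score higher."""
    return features[0]

def detect(image, size=2, stride=1):
    """Score every window and return (best score, best position)."""
    scored = [(classify(extract_features(patch)), pos)
              for pos, patch in sliding_windows(image, size, stride)]
    return max(scored)

image = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 9, 0],
         [0, 0, 0, 0]]
best_score, best_pos = detect(image)
```

Each window is scored from its own pixels alone, so nothing in this sketch lets one part inform another.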

Local Image Evidence is Weak
• Certain parts are easier to detect than others

[Figure: detection responses for head, neck, l.shoulder, l.elbow, l.wrist]

Part Context is a Strong Cue
Part detection confidences provide spatial context cues

[Figure: image with confidence maps for Neck, L-Shoulder, L-Elbow]

Tree Structures vs Loopy Graphs

Tree Structures
• Fast and exact inference
• Double counting

Loopy Graphs
• Rich context
• Approximate inference

2015/9/11

Designing Context Representations

Context features encode responses of a previous prediction stage

[Figure: image with two kinds of context features: patch features and offset features]
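The two context feature types named above (patch features and offset features) might be computed roughly as below. This is a simplified sketch under stated assumptions: patch features are taken as the raw confidence values in a square window around z, zero-padded at the border, and offset features as the Cartesian offset from z to the single strongest peak of each map. The function names and these exact definitions are illustrative choices, not the presenters' implementation.

```python
def context_patch_features(conf_maps, z, radius):
    """Concatenate a (2*radius+1)^2 window around z from every part's map."""
    x0, y0 = z
    feats = []
    for cmap in conf_maps:
        height, width = len(cmap), len(cmap[0])
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = y0 + dy, x0 + dx
                inside = 0 <= y < height and 0 <= x < width
                feats.append(cmap[y][x] if inside else 0.0)  # zero padding
    return feats

def context_offset_features(conf_maps, z):
    """Offset (dx, dy) from z to the strongest peak of every part's map."""
    x0, y0 = z
    feats = []
    for cmap in conf_maps:
        _, py, px = max((v, y, x)
                        for y, row in enumerate(cmap)
                        for x, v in enumerate(row))
        feats.extend([px - x0, py - y0])
    return feats

# Two made-up 3x3 confidence maps from a previous stage.
head_map = [[0.1, 0.9, 0.1],
            [0.0, 0.2, 0.0],
            [0.0, 0.0, 0.0]]
neck_map = [[0.0, 0.1, 0.0],
            [0.1, 0.8, 0.1],
            [0.0, 0.1, 0.0]]

patch = context_patch_features([head_map, neck_map], z=(0, 0), radius=1)
offsets = context_offset_features([head_map, neck_map], z=(0, 0))
```

Both feature lists would be appended to the image features at z before the next stage's predictor is applied.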

[Figure, built up over three slides: predictor g1 takes image features and outputs Stage I confidence maps (Head, Neck, L-Shoulder, L-Elbow, L-Wrist); context features computed from those maps, together with image features, feed g2, which outputs Stage II confidence maps; the same step feeds g3, which outputs Stage III confidence maps.]
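The staged prediction shown in the figure above can be sketched as a short loop. Everything concrete here is a toy assumption: a one-dimensional "image" with three locations, a single part, and hand-written predictors and context function in place of learned ones.

```python
def run_pose_machine(image_features, predictors, context_fn):
    """Run the stages in order and return the confidences after every stage.

    image_features: dict mapping location z -> list of floats
    predictors:     [g1, g2, ...], each maps a feature list to part confidences
    context_fn:     maps (previous confidences, z) -> list of context features
    """
    history = [{z: predictors[0](x) for z, x in image_features.items()}]
    for g_t in predictors[1:]:
        previous = history[-1]
        history.append({z: g_t(x + context_fn(previous, z))
                        for z, x in image_features.items()})
    return history

# Toy 1-D example with one part and three locations (all values hypothetical).
features = {0: [0.2], 1: [0.9], 2: [0.4]}

def g1(x):
    return [x[0]]                      # stage 1 sees image features only

def g_later(x):
    return [0.5 * x[0] + 0.5 * x[1]]   # later stages mix in one context value

def neighbour_context(previous, z):
    """Strongest previous-stage confidence among the neighbours of z."""
    return [max(previous[n][0] for n in (z - 1, z + 1) if n in previous)]

history = run_pose_machine(features, [g1, g_later, g_later], neighbour_context)
```

Every stage sees the same image features again; only the context input changes from stage to stage.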

Top-Down Cues are Helpful: Larger Composite Parts can be Easier to Detect

Level 1: parts
Level 2: poselet
Level 3: full body

[Bourdev et al., CVPR 2009] [Sun et al., CVPR 2012] [Duan et al., BMVC 2012] [Singh et al., ECCV 2012] [Pishchulin et al., CVPR 2013] etc.

Incorporating Hierarchical Cues

• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features

[Figure: hierarchical pose machine with Levels 1, 2, ..., L and Stages t = 1, t = 2, ..., t = T (T = 3); each level at each stage has its own predictor g, which takes image features and, after the first stage, context features.]

[Figure: Level 1 confidence maps at Stages I, II and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd., shown next to the hierarchical pose machine diagram.]

[Figure: Level 2 confidence maps at Stages I, II and III for Head+Sho, L.Arm, R.Arm, Torso, L.Leg, R.Leg and Bkgd., shown next to the hierarchical pose machine diagram.]

[Figure: Level 3 confidence maps at Stages I, II and III for Torso and Bkgd., shown next to the hierarchical pose machine diagram.]

Fully Connected Model

[Figure: hierarchical pose machine diagram with Levels 1, 2, ..., L and Stages t = 1, t = 2, ..., t = T (T = 3), with image features and context features feeding each predictor.]

Pose Machines: Sequential Prediction with Spatial Context

Training reduces to training multiple supervised classifiers

No structured loss function

No specialized solvers

No handcrafted spatial model

Spatial model is learned implicitly by the classifiers in a data-driven fashion

[Figure: g1, g2 and g3 in sequence, each taking image features; g2 and g3 also take context features; output: Stage III confidence maps.]
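To illustrate the claim that training reduces to training multiple supervised classifiers, here is a toy sketch. The nearest-centroid classifier is a hypothetical stand-in for whatever supervised learner is actually used, and the "context" here is just the previous stage's scores for the same sample, a deliberate simplification of spatial context features.

```python
def train_centroids(samples, labels):
    """Fit one mean vector per label (a stand-in supervised classifier)."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_scores(model, x):
    """Negative squared distance to each centroid, ordered by label."""
    return [-sum((a - b) ** 2 for a, b in zip(x, model[y]))
            for y in sorted(model)]

def train_pose_machine(samples, labels, num_stages):
    """Train the stages one after another; no structured loss or solver."""
    models, inputs = [], samples
    for _ in range(num_stages):
        model = train_centroids(inputs, labels)
        models.append(model)
        # Next stage input: original features plus this stage's scores.
        inputs = [x + predict_scores(model, inp)
                  for x, inp in zip(samples, inputs)]
    return models

samples = [[0.0], [0.1], [1.0], [0.9]]   # made-up 1-D image features
labels = [0, 0, 1, 1]                    # made-up part labels
models = train_pose_machine(samples, labels, num_stages=2)
```

Each stage is fitted with an ordinary supervised routine; the only coupling between stages is that one stage's outputs become extra inputs to the next.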

Learning Feature Representations
• Convolutional Architectures for Feature Embedding

Learning Context Representations
• Large Receptive Fields as a Design Criterion

Learning Context Representations
• Large Receptive Fields Improve Pose Estimation

Convolutional Pose Machines
• Designing a Convolutional Architecture
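When receptive field size is used as a design criterion, it helps to be able to compute it for a candidate stack of layers. The sketch below uses the standard recurrence for convolution and pooling layers; the particular layer list is an illustrative assumption, not read off the slides.

```python
def receptive_field(layers):
    """Input-pixel extent seen by one output unit.

    layers: list of (kernel_size, stride) pairs, first layer first.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump   # each layer widens the view
        jump *= stride                 # strides spread output units apart
    return field

# Hypothetical building blocks: (kernel size, stride).
CONV9, CONV5, CONV1, POOL2 = (9, 1), (5, 1), (1, 1), (2, 2)

# Illustrative stack: three 9x9 conv + 2x pooling pairs, then 5x5, 9x9, 1x1, 1x1.
stack = [CONV9, POOL2, CONV9, POOL2, CONV9, POOL2,
         CONV5, CONV9, CONV1, CONV1]

print(receptive_field(stack))  # 160
```

Large kernels placed after pooling grow the receptive field fastest, because every extra kernel tap is multiplied by the accumulated stride.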

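Learning
• Joint Training with Intermediate Supervision

Loss at stage t: squared Euclidean distance between the groundtruth and the prediction

$f_t = \lVert b_{*} - b_t \rVert_2^2$ (b_*: groundtruth, b_t: prediction at stage t)

A network without intermediate supervision suffers from vanishing gradients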
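A minimal sketch of the stage loss above and of intermediate supervision, assuming the overall training objective is simply the sum of the per-stage losses. Confidence maps are flattened to plain lists here and all numbers are made up.

```python
def stage_loss(prediction, groundtruth):
    """Squared Euclidean distance between predicted and groundtruth maps.

    Both arguments are lists of flattened per-part confidence maps.
    """
    return sum((p - g) ** 2
               for p_map, g_map in zip(prediction, groundtruth)
               for p, g in zip(p_map, g_map))

def total_loss(stage_predictions, groundtruth):
    """Intermediate supervision: every stage contributes its own loss term."""
    return sum(stage_loss(pred, groundtruth) for pred in stage_predictions)

groundtruth = [[0.0, 1.0], [1.0, 0.0]]             # two parts, two locations
stage_predictions = [
    [[0.5, 0.5], [0.5, 0.5]],                      # stage 1: uncertain
    [[0.0, 1.0], [1.0, 0.0]],                      # stage 2: exact
]
loss = total_loss(stage_predictions, groundtruth)  # 1.0 + 0.0
```

Because each stage has its own loss term, early layers receive a gradient from a nearby loss rather than only from the final output.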

Learning: Intermediate Supervision Addresses Vanishing Gradients

[Figure: histograms of gradient magnitude during training, with and without intermediate supervision, for Epochs 1 to 3 at Layer 1 (input), Layers 3, 6, 7, 9, 12, 13, 15 and Layer 18 (output), spanning Stages 1 to 3. Horizontal axis: gradient (× 10^-3), from -0.5 to 0.5. Above the histograms, a network diagram of convolution (C) and pooling (P) layers for Stage 1, Stage 2 and Stage 3 (level 1), with supervision applied through a loss at each stage.]

Learning: Comparison of Learning Methods

[Figure: PCK total, LSP OC. Detection rate % versus normalized distance (0 to 0.2) for (i) With Intermediate Supervision (IS), (ii) Stagewise, (iii) IS + Stagewise Pretrain, (iv) Without Intermediate Supervision.]
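The PCK curves referred to above plot detection rate against a normalized distance threshold. A minimal sketch of that metric follows, assuming a keypoint counts as detected when its error is within the threshold times a per-person reference length (for example a torso size; the exact normalizer is an assumption here). All coordinates are made up.

```python
import math

def pck(predictions, groundtruth, reference_lengths, threshold):
    """Percentage of keypoints within threshold * reference length.

    predictions, groundtruth: one list of (x, y) joints per person
    reference_lengths:        one normalizing length per person
    """
    correct = total = 0
    for pred, truth, ref in zip(predictions, groundtruth, reference_lengths):
        for (px, py), (gx, gy) in zip(pred, truth):
            total += 1
            if math.hypot(px - gx, py - gy) <= threshold * ref:
                correct += 1
    return 100.0 * correct / total

# One person, three joints, errors of 2, 18 and 12 pixels.
predictions = [[(10, 10), (50, 50), (90, 30)]]
groundtruth = [[(12, 10), (68, 50), (90, 42)]]
reference_lengths = [100.0]

curve = {t: pck(predictions, groundtruth, reference_lengths, t)
         for t in (0.1, 0.2)}
```

Sweeping the threshold from 0 to 0.2 and plotting the result gives a curve of the kind shown in the evaluation slides.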

Qualitative Results

Evaluation: Qualitative Examples on LEEDS (Person-centric)

Evaluation: Qualitative Examples on MPI (Person-centric)

Resolving Symmetric Confusions

[Figure: left and right wrist confidence maps at t = 1, t = 2 and t = 3]


Ablative Spatial Analysis

[Figure: predicted pose and Level 1 part confidences (L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank.) at Stages I, II and III]


Predicted confidences are resilient to missing context (of one part)

Context from the confidence map of head is removed

[Figure: predicted pose and confidences at Stages I, II and III]


[Figures, slides 36 to 48: predicted pose with Stage I, II and III outputs for further examples]

Evaluation: PCK Performance Comparison on FLIC dataset

[Figure: two panels, PCK wrist, FLIC and PCK elbow, FLIC. Detection rate % versus normalized distance (0 to 0.2). Methods: Ours 3-Stage 2-Level; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen & Yuille, NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13.]

Evaluation: PCK Performance Comparison on LEEDS dataset (Person-centric)

[Figure: five panels, PCK total, PCK wrist & elbow, PCK knee, PCK ankle and PCK hip, all LSP PC. Detection rate % versus normalized distance (0 to 0.2). Methods: Ours 3-Stage 2-Level; Tompson et al., NIPS'14; Pishchulin et al., ICCV'13; Chen & Yuille, NIPS'14; Wang et al., CVPR'13.]
