Pose Machine

Pose Machines Estimating Articulated Pose from Images Robotics Institute Carnegie Mellon University Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell, Yaser Sheikh. In ECCV 2014 (Oral presentation). 06/21/2022 1

Transcript of Pose Machine

Page 1: Pose Machine

05/01/2023 1

Pose Machines: Estimating Articulated Pose from Images

Robotics Institute, Carnegie Mellon University

Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell, Yaser Sheikh. In ECCV 2014 (Oral presentation).

Page 2: Pose Machine


Goal: Articulated Pose Estimation

Page 3: Pose Machine


Goal: Articulated Pose Estimation

https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg

Page 4: Pose Machine


Goal: Articulated Pose Estimation

https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1

Page 5: Pose Machine

Which part corresponds to a body part?

• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful

Page 6: Pose Machine

Using Local Image Evidence
Multi-Class Classification of Patches

[Figure: a predictor g1 takes the image features x_z extracted at image location z of the input image and assigns a part class such as hands or feet]

Requires a high-capacity supervised predictor capable of handling multi-modal data

Page 7: Pose Machine

Using Local Image Evidence
A Classical Sliding Window Detection Pipeline

Image → Feature Extraction → Classification
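The three pipeline steps (image, feature extraction, classification) can be sketched in a few lines. This is only an illustration under stated assumptions, not the presenters' implementation: the features (mean and range of patch intensity) and `toy_classifier` are hypothetical stand-ins for a learned feature extractor and a learned multi-class predictor.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]


def extract_patch(image: Image, row: int, col: int, half: int) -> List[List[float]]:
    """Return the (2*half+1)-wide patch centred at (row, col), clamping at borders."""
    h, w = len(image), len(image[0])
    return [
        [image[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]
         for c in range(col - half, col + half + 1)]
        for r in range(row - half, row + half + 1)
    ]


def patch_features(patch: List[List[float]]) -> List[float]:
    """Hypothetical hand-made features: mean intensity and intensity range."""
    values = [v for line in patch for v in line]
    return [sum(values) / len(values), max(values) - min(values)]


def sliding_window_detect(
    image: Image, classifier: Callable[[Sequence[float]], str], half: int = 1
) -> List[List[str]]:
    """Classify the patch around every image location z; return a label map."""
    return [
        [classifier(patch_features(extract_patch(image, r, c, half)))
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


def toy_classifier(features: Sequence[float]) -> str:
    """Stand-in for a learned predictor (assumption): threshold on mean intensity."""
    mean, _ = features
    return "part" if mean > 0.4 else "background"


image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
labels = sliding_window_detect(image, toy_classifier, half=1)
```

Every location is classified independently from its own patch, which is why the evidence it uses is purely local.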

Page 8: Pose Machine

Local Image Evidence is Weak

• Certain parts are easier to detect than others

[Figure: detection results for head, neck, l.shoulder, l.elbow, l.wrist]

Page 9: Pose Machine

Part Context is a Strong Cue

Part detection confidences provide spatial context cues

[Figure: image with confidence maps for Neck, L-Shoulder, L-Elbow]

Page 10: Pose Machine

Tree Structures vs Loopy Graphs

Tree Structures
• Fast and exact inference
• Double counting

Loopy Graphs
• Rich context
• Approximate inference

2015/9/11

Page 11: Pose Machine

Designing Context Representations

Context features encode responses of a previous prediction stage

[Figure: Image, Patch Features, Offset Features]
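One way to realise the two kinds of context features named on this slide (patch features and offset features) is sketched below. The details are assumptions made for illustration: patch features are taken to be the vectorized neighbourhood of each part's confidence map around a location, and offset features the displacement from that location to the strongest peak of each map. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

ConfidenceMap = List[List[float]]


def context_patch_features(conf_map: ConfidenceMap, row: int, col: int, half: int) -> List[float]:
    """Vectorize the confidences in a window around (row, col); outside cells count as 0."""
    h, w = len(conf_map), len(conf_map[0])
    return [
        conf_map[r][c] if 0 <= r < h and 0 <= c < w else 0.0
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
    ]


def peak_location(conf_map: ConfidenceMap) -> Tuple[int, int]:
    """Return (row, col) of the highest confidence in the map."""
    h, w = len(conf_map), len(conf_map[0])
    return max(((r, c) for r in range(h) for c in range(w)),
               key=lambda rc: conf_map[rc[0]][rc[1]])


def context_offset_features(conf_map: ConfidenceMap, row: int, col: int) -> List[float]:
    """Offset (d_row, d_col) from the query location to the map's peak."""
    peak_row, peak_col = peak_location(conf_map)
    return [float(peak_row - row), float(peak_col - col)]


def context_features(conf_maps: Dict[str, ConfidenceMap], row: int, col: int, half: int = 1) -> List[float]:
    """Concatenate patch and offset features over all parts (sorted by name)."""
    feats: List[float] = []
    for part in sorted(conf_maps):
        feats += context_patch_features(conf_maps[part], row, col, half)
        feats += context_offset_features(conf_maps[part], row, col)
    return feats


previous_stage = {
    "head": [[0.1, 0.8, 0.1], [0.0, 0.2, 0.0], [0.0, 0.0, 0.0]],
    "neck": [[0.0, 0.0, 0.0], [0.0, 0.3, 0.0], [0.1, 0.9, 0.1]],
}
feats = context_features(previous_stage, row=1, col=1, half=1)
```

The resulting vector would be concatenated with the image features at the same location and passed to the next-stage predictor.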

Page 12: Pose Machine

[Figure: pose machine in which predictors g1, g2, g3 use image features and context features to produce Stage I, Stage II, and Stage III confidence maps. Shown: Stage I confidence for Head, Neck, L-Shoulder, L-Elbow, L-Wrist]

Page 13: Pose Machine

[Figure: pose machine in which predictors g1, g2, g3 use image features and context features to produce Stage I, Stage II, and Stage III confidence maps. Shown: Stage II confidence for Head, Neck, L-Shoulder, L-Elbow, L-Wrist]

Page 14: Pose Machine

[Figure: pose machine in which predictors g1, g2, g3 use image features and context features to produce Stage I, Stage II, and Stage III confidence maps. Shown: Stage III confidence for Head, Neck, L-Shoulder, L-Elbow, L-Wrist]

Page 15: Pose Machine

Top Down Cues are Helpful
Larger Composite Parts can be Easier to Detect

Level 1: parts
Level 2: poselet
Level 3: full body

[Bourdev et al., CVPR 2009] [Sun et al., CVPR 2012] [Duan et al., BMVC 2012] [Singh et al., ECCV 2012] [Pishchulin et al., CVPR 2013] etc.

Page 16: Pose Machine

Incorporating Hierarchical Cues

• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features

[Figure: hierarchical pose machine with Levels 1, 2, ..., L and Stages t = 1, t = 2, ..., t = T (T = 3); the predictor of each level at each stage takes image features and context features as input]

Page 17: Pose Machine

[Figure: Level 1 confidence maps at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd., shown with the hierarchical architecture (Levels 1 to L, Stages t = 1 to T = 3)]

Page 18: Pose Machine

[Figure: Level 2 confidence maps at Stages I, II, and III for Head+Sho, L.Arm, R.Arm, Torso, L.Leg, R.Leg, and Bkgd., shown with the hierarchical architecture (Levels 1 to L, Stages t = 1 to T = 3)]

Page 19: Pose Machine

[Figure: Level 3 confidence maps at Stages I, II, and III for Torso and Bkgd., shown with the hierarchical architecture (Levels 1 to L, Stages t = 1 to T = 3)]

Page 20: Pose Machine

Fully Connected Model

[Figure: hierarchical architecture with Levels 1 to L and Stages t = 1 to T = 3, with image features and context features feeding the predictor of every level at every stage]

Page 21: Pose Machine

Pose Machines
Sequential Prediction with Spatial Context

[Figure: predictors g1, g2, g3 take image features and context features and produce Stage I, Stage II, and Stage III confidence maps]

Training reduces to training multiple supervised classifiers

No structured loss function. No specialized solvers.

No handcrafted spatial model. Spatial model is learned implicitly by the classifiers in a data-driven fashion.
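Sequential prediction with spatial context can be sketched as a simple loop over stage predictors. This is a toy illustration, not the presenters' code: locations are laid out in one dimension, and `g1` and `g2` are handcrafted, hypothetical stand-ins for classifiers that would be learned from data in an actual pose machine.

```python
from typing import Callable, Dict, List, Optional

Maps = Dict[str, List[float]]          # part name -> confidence per location
Predictor = Callable[[Maps, Optional[Maps]], Maps]


def run_pose_machine(image_evidence: Maps, predictors: List[Predictor]) -> List[Maps]:
    """Run the stage predictors in order; each sees the previous stage's output."""
    outputs: List[Maps] = []
    previous: Optional[Maps] = None    # no context is available at stage 1
    for g in predictors:
        current = g(image_evidence, previous)
        outputs.append(current)
        previous = current             # becomes context for the next stage
    return outputs


def g1(evidence: Maps, previous: Optional[Maps]) -> Maps:
    """Stage 1 (toy): confidences are just the local image evidence."""
    return {part: list(values) for part, values in evidence.items()}


def g2(evidence: Maps, previous: Optional[Maps]) -> Maps:
    """Later stage (toy): boost elbow locations one step right of a confident shoulder."""
    assert previous is not None
    shoulder = previous["shoulder"]
    elbow = [
        e * (1.0 + (shoulder[z - 1] if z > 0 else 0.0))
        for z, e in enumerate(evidence["elbow"])
    ]
    return {"shoulder": list(shoulder), "elbow": elbow}


evidence = {
    "shoulder": [0.0, 0.9, 0.0, 0.0, 0.0],
    "elbow": [0.0, 0.0, 0.5, 0.0, 0.5],   # two equally plausible candidates
}
stages = run_pose_machine(evidence, [g1, g2])
best_elbow = max(range(5), key=lambda z: stages[-1]["elbow"][z])
```

In this toy example the elbow is ambiguous after stage 1, and the shoulder confidence from the previous stage resolves it in stage 2. Because every stage is an ordinary predictor with fixed inputs and targets, each one can be trained with standard supervised learning.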

Page 22: Pose Machine

Learning Feature Representations

• Convolutional Architectures for Feature Embedding

Page 23: Pose Machine

Learning Context Representations

• Large Receptive Fields as a Design Criterion

Page 24: Pose Machine

Learning Context Representations

• Large Receptive Fields Improve Pose Estimation
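How kernel sizes and pooling determine the receptive field can be checked with the standard recurrence: each layer grows the field by (kernel - 1) times the current stride product. The sketch below is illustrative only; the layer list is a made-up example (9×9 convolutions with 2×2, stride-2 pooling), not the exact architecture from the slides.

```python
from typing import List, Tuple

Layer = Tuple[int, int]  # (kernel size, stride)


def receptive_field(layers: List[Layer]) -> int:
    """Receptive field (in input pixels) of one output unit of a layer stack."""
    field, jump = 1, 1            # jump = product of strides so far
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical stack: conv 9x9, pool 2x2/2, conv 9x9, pool 2x2/2, conv 9x9.
example_layers = [(9, 1), (2, 2), (9, 1), (2, 2), (9, 1)]
example_field = receptive_field(example_layers)   # 60 input pixels
```

Pooling multiplies the growth contributed by every later layer, so large kernels placed after pooling enlarge the receptive field quickly, which lets a stage see the confidence maps of distant parts.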

Page 25: Pose Machine

Convolutional Pose Machines

• Designing a Convolutional Architecture

Page 26: Pose Machine

Learning

• Joint Training with Intermediate Supervision

Loss: Euclidean distance

$f_t = \lVert \text{groundtruth} - \text{prediction} \rVert_2^2$

A network without intermediate supervision leads to vanishing gradients
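The per-stage loss can be written out directly. In the sketch below, the stage loss is the squared Euclidean distance between a predicted and a ground-truth confidence map, and the overall objective is assumed to be the sum of the stage losses, so that every stage receives its own supervision signal. Function names and the toy maps are illustrative assumptions.

```python
from typing import List

ConfidenceMap = List[List[float]]


def stage_loss(prediction: ConfidenceMap, groundtruth: ConfidenceMap) -> float:
    """Squared Euclidean distance f_t between one stage's prediction and the ground truth."""
    return sum(
        (p - g) ** 2
        for p_row, g_row in zip(prediction, groundtruth)
        for p, g in zip(p_row, g_row)
    )


def total_loss(stage_predictions: List[ConfidenceMap], groundtruth: ConfidenceMap) -> float:
    """Sum of the per-stage losses (assumed form of intermediate supervision)."""
    return sum(stage_loss(pred, groundtruth) for pred in stage_predictions)


groundtruth = [[0.0, 1.0], [0.0, 0.0]]
stage_1 = [[0.5, 0.5], [0.0, 0.0]]     # vague early prediction
stage_2 = [[0.0, 0.9], [0.0, 0.1]]     # sharper later prediction
loss = total_loss([stage_1, stage_2], groundtruth)
```

Because a loss term is attached to every stage, gradients reach the early layers through short paths as well as through the full depth of the network.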

Page 27: Pose Machine

Learning
Intermediate Supervision Addresses Vanishing Gradients

[Figure: histograms of gradient magnitude during training at Layers 1, 3, 6, 7, 9, 12, 13, 15, and 18 (input to output, Stages 1 to 3), by epoch, with and without intermediate supervision; x-axis: gradient (× 10^-3) from -0.5 to 0.5; y-axis: log scale from 10^0 to 10^4. Shown with the Stage 1, Stage 2, and Stage 3 (level 1) networks of convolution (C) and 2× pooling (P) layers on the h × w × 3 input image, with supervision (loss f_1, f_2, f_3) applied at each stage]

Page 28: Pose Machine

Learning
Intermediate Supervision Addresses Vanishing Gradients

[Figure: histograms of gradient magnitude during training at Layers 1, 3, 6, 7, 9, 12, 13, 15, and 18 (input to output, Stages 1 to 3) for Epochs 1, 2, and 3, with and without intermediate supervision; x-axis: gradient (× 10^-3) from -0.5 to 0.5; y-axis: log scale from 10^0 to 10^4. Shown with the Stage 1, Stage 2, and Stage 3 (level 1) networks of convolution (C) and 2× pooling (P) layers on the h × w × 3 input image, with supervision (loss f_1, f_2, f_3) applied at each stage]

Page 29: Pose Machine

Learning
Comparison of Learning Methods

[Figure: PCK total, LSP OC; x-axis: normalized distance (0 to 0.2); y-axis: detection rate % (0 to 100). Curves: (i) With Intermediate Supervision (IS), (ii) Stagewise, (iii) IS + Stagewise Pretrain, (iv) Without Intermediate Supervision]
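The PCK curves plot detection rate against a normalized distance threshold. A minimal sketch of the metric follows, assuming the usual definition: a keypoint counts as detected when its predicted location lies within threshold × reference length of the ground truth. The choice of reference length (for example a torso or bounding-box size) depends on the benchmark and is left as a parameter here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pck(predicted: List[Point], groundtruth: List[Point],
        reference_length: float, threshold: float) -> float:
    """Detection rate in percent at one normalized distance threshold."""
    radius = threshold * reference_length
    correct = sum(
        math.hypot(p[0] - g[0], p[1] - g[1]) <= radius
        for p, g in zip(predicted, groundtruth)
    )
    return 100.0 * correct / len(groundtruth)


predicted = [(10.0, 10.0), (20.0, 22.0), (30.0, 40.0)]
groundtruth = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]

# Sweep the threshold to obtain points on a PCK curve.
curve = [(t, pck(predicted, groundtruth, reference_length=100.0, threshold=t))
         for t in (0.05, 0.2)]
```

Here the errors are 0, 2, and 10 pixels, so two of three keypoints are detected at threshold 0.05 (radius 5) and all three at 0.2 (radius 20).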

Page 30: Pose Machine

Qualitative Results

Page 31: Pose Machine

Evaluation
Qualitative Examples on LEEDS (Person-centric)

Page 32: Pose Machine

Evaluation
Qualitative Examples on MPI (Person-centric)

Page 33: Pose Machine

Resolving Symmetric Confusions

[Figure: left and right wrist confidence maps at stages t = 1, t = 2, and t = 3]

Page 34: Pose Machine

Ablative Spatial Analysis

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 35: Pose Machine

Ablative Spatial Analysis

Context from the confidence map of head is removed

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]
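The ablation can be mimicked by blanking one part's confidence map before context features are computed for the next stage. The sketch below assumes that "removing" context means zeroing the map, which is one plausible reading of the slide; the helper name `ablate_part` is hypothetical.

```python
from typing import Dict, List

ConfidenceMap = List[List[float]]


def ablate_part(conf_maps: Dict[str, ConfidenceMap], part: str) -> Dict[str, ConfidenceMap]:
    """Return a copy of the confidence maps with one part's map set to zero."""
    return {
        name: [[0.0] * len(row) for row in cmap] if name == part
        else [list(row) for row in cmap]
        for name, cmap in conf_maps.items()
    }


stage_1_maps = {
    "head": [[0.1, 0.9], [0.0, 0.0]],
    "neck": [[0.0, 0.2], [0.0, 0.8]],
}
without_head = ablate_part(stage_1_maps, "head")
```

Feeding `without_head` instead of `stage_1_maps` into the next stage and comparing the resulting confidences gives the kind of comparison shown on these slides.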

Page 36: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 37: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 38: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 39: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 40: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 41: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 42: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 43: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 44: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 45: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 46: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 47: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 48: Pose Machine

Ablative Spatial Analysis

Predicted confidences are resilient to missing context (of one part)

[Figure: predicted pose and Level 1 part confidences at Stages I, II, and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank., and Bkgd.]

Page 49: Pose Machine

Evaluation
PCK Performance Comparison on FLIC dataset

[Figure: PCK wrist, FLIC and PCK elbow, FLIC; x-axis: normalized distance (0 to 0.2); y-axis: detection rate % (0 to 100). Methods: Ours 3-Stage 2-Level; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen & Yuille, NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13]

Page 50: Pose Machine

Evaluation
PCK Performance Comparison on LEEDS dataset (Person-centric)

[Figure: PCK total, PCK wrist & elbow, PCK knee, PCK ankle, and PCK hip on LSP PC; x-axis: normalized distance (0 to 0.2); y-axis: detection rate % (0 to 100). Methods: Ours 3-Stage 2-Level; Tompson et al., NIPS'14; Pishchulin et al., ICCV'13; Chen & Yuille, NIPS'14; Wang et al., CVPR'13]