ICCV2011: Human Action Recognition by Learning bases of action attributes and parts

Human Action Recognition by Learning Bases of Action Attributes and Parts

Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei

Stanford University

Action Classification in Still Images

Low level feature

Yao & Fei-Fei, 2010; Koniusz et al., 2010; Delaitre et al., 2010; Yao et al., 2011

Riding bike


Low level feature → High-level representation → Riding bike

High-level representation:

- Semantic concepts – Attributes (riding a bike, sitting on a bike seat, wearing a helmet, pedaling the pedals, …)
- Objects (parts)
- Human poses (parts)
- Contexts of attributes & parts

Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011

Gupta et al., 2009; Yao & Fei-Fei, 2010; Torresani et al., 2010; Li et al., 2010

Yang et al., 2010; Maji et al., 2011; Liu et al., 2011

Incorporate human knowledge; more understanding of image content; more discriminative classifier.

Outline

• Intuition: Action Attributes and Parts

• Algorithm: Learning Bases of Attributes and Parts

• Experiments: PASCAL VOC & Stanford 40 Actions

• Conclusion

Action Attributes and Parts

Attributes: semantic descriptions of human actions

Each attribute is learned with a discriminative classifier, e.g. SVM (riding bike vs. not riding bike) [Lampert et al., 2009; Berg et al., 2010]

Parts-Objects: a pre-trained detector (Object Bank, Li et al., 2010)

Parts-Poselets: a pre-trained detector (Poselet, Bourdev & Malik, 2009)

Attribute classification, object detection, and poselet detection → a: image feature vector
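As a rough illustration of how the vector a might be assembled, the sketch below concatenates per-image attribute, object, and poselet scores. The function name `build_feature_vector`, the ordering of the three groups, and the example scores are assumptions made for illustration; the slides only state that the three outputs feed into a.

```python
import numpy as np

def build_feature_vector(attr_scores, obj_scores, poselet_scores):
    """Concatenate per-image scores into one vector a.

    Assumed ordering: attributes, then objects, then poselets.
    """
    groups = (attr_scores, obj_scores, poselet_scores)
    parts = [np.asarray(g, dtype=float).ravel() for g in groups]
    return np.concatenate(parts)

# Hypothetical scores for one image: 2 attributes, 3 objects, 2 poselets.
a = build_feature_vector([0.9, -0.2], [1.3, 0.0, -0.5], [0.4, 0.7])
print(a.shape)
```

Each entry of a then corresponds to one attribute classifier or one part detector.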

Action bases Φ

Action bases Φ, bases coefficients w:

$$a \approx \Phi w$$

Properties of the representation $a \approx \Phi w$:

• Sparse

• Encodes context

• Robust to initially weak detections
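A toy numerical example may make these properties concrete. The matrix `Phi`, the coefficient vector `w`, and all values below are hypothetical: one basis groups two score dimensions that tend to co-occur, so a single nonzero coefficient reconstructs both.

```python
import numpy as np

# Hypothetical toy setup: 4 score dimensions, 3 bases (columns of Phi).
# Basis 0 groups dimensions 0 and 1, e.g. two cues that usually appear together.
Phi = np.array([[1.0, 0.0, 0.5],
                [1.0, 0.0, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 1.0, 0.0]])

# Sparse coefficients: only the first basis is active.
w = np.array([0.8, 0.0, 0.0])

# Reconstruction of the image vector from the active basis.
a_hat = Phi @ w
print(a_hat)
print("nonzero coefficients:", np.count_nonzero(w))
```

Because the active basis spans both dimensions, the reconstruction assigns them the same score even if one of the original detections was weak, which is one way to read "encodes context" and "robust to initially weak detections".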



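Bases of Attr. & Parts: Training

• Input: $a_1, \dots, a_N$

• Output: $\Phi = [\Phi_1, \dots, \Phi_M]$ and sparse $W = [w_1, \dots, w_N]$, so that $a_i \approx \Phi w_i$

• Jointly estimate $\Phi$ and $W$:

$$\min_{\Phi, W} \sum_{i=1}^{N} \left( \frac{1}{2} \|a_i - \Phi w_i\|_2^2 + \lambda \|w_i\|_1 \right) \quad \text{s.t.} \quad \forall j,\ \|\Phi_j\|_1 + \frac{\gamma}{2} \|\Phi_j\|_2^2 \le 1$$

  - $\frac{1}{2} \|a_i - \Phi w_i\|_2^2$: accurate approximation
  - $\lambda \|w_i\|_1$: L1 regularization, sparsity of $W$
  - constraint on $\Phi_j$: elastic net, sparsity of $\Phi$ [Zou & Hastie, 2005]

• Optimization: stochastic gradient descent.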
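The joint estimation can be sketched in a few lines of NumPy. This is a simplified stand-in, not the authors' procedure: it alternates batch ISTA sparse coding with a gradient step on the bases instead of using stochastic gradient descent, and it replaces the elastic-net constraint on each basis with a plain unit L2-ball projection. The function names, the value of `lam`, and the random data are all assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(A, Phi, lam, n_iter=200):
    """ISTA for min_W 0.5 * ||A - Phi W||_F^2 + lam * ||W||_1 with Phi fixed."""
    L = np.linalg.norm(Phi, 2) ** 2 + 1e-12  # Lipschitz constant of the smooth term
    W = np.zeros((Phi.shape[1], A.shape[1]))
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ W - A)
        W = soft_threshold(W - grad / L, lam / L)
    return W

def learn_bases(A, n_bases, lam=0.1, n_outer=20, seed=0):
    """Alternate sparse coding of W and a projected gradient step on Phi."""
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((A.shape[0], n_bases))
    Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)
    for _ in range(n_outer):
        W = sparse_code(A, Phi, lam)
        # Gradient of 0.5 * ||A - Phi W||_F^2 with respect to Phi is (Phi W - A) W^T.
        step = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-12)
        Phi -= step * (Phi @ W - A) @ W.T
        # Simplification: project each column onto the unit L2 ball
        # (the slides use an elastic-net constraint instead).
        Phi /= np.maximum(np.linalg.norm(Phi, axis=0, keepdims=True), 1.0)
    return Phi, sparse_code(A, Phi, lam)

# Hypothetical data: 8 score dimensions, 50 training images (one per column).
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 50))
lam = 0.1
Phi, W = learn_bases(A, n_bases=5, lam=lam)
objective = 0.5 * np.sum((A - Phi @ W) ** 2) + lam * np.sum(np.abs(W))
print(Phi.shape, W.shape, objective)
```

Columns of `A` play the role of the training vectors $a_i$, columns of `Phi` are the bases $\Phi_j$, and columns of `W` are the coefficients $w_i$.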

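Bases of Attr. & Parts: Testing

• Input: $a$ and the learned bases $\Phi = [\Phi_1, \dots, \Phi_M]$

• Output: sparse $w$, so that $a \approx \Phi w$

• Estimate $w$:

$$\min_{w} \frac{1}{2} \|a - \Phi w\|_2^2 + \lambda \|w\|_1$$

  - $\frac{1}{2} \|a - \Phi w\|_2^2$: accurate approximation
  - $\lambda \|w\|_1$: L1 regularization, sparsity of $w$

• Optimization: stochastic gradient descent.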
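For a single test image, estimating w with the bases fixed is a standard L1-regularized least-squares problem. The sketch below solves it with ISTA (proximal gradient), which is a swapped-in solver rather than the stochastic gradient descent named on the slide. The function name `infer_coefficients`, the identity bases, and the value of `lam` are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def infer_coefficients(a, Phi, lam, n_iter=500):
    """ISTA for min_w 0.5 * ||a - Phi w||_2^2 + lam * ||w||_1 with Phi fixed."""
    L = np.linalg.norm(Phi, 2) ** 2  # Lipschitz constant of the smooth term
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = w - Phi.T @ (Phi @ w - a) / L                      # gradient step
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return w

# Hypothetical example with identity bases, where the minimizer is simply
# the soft-thresholded input: large scores shrink by lam, small ones become 0.
Phi = np.eye(3)
a = np.array([1.0, 0.05, -0.4])
w = infer_coefficients(a, Phi, lam=0.1)
print(w)  # approximately [0.9, 0.0, -0.3]
```

The weak middle score is set exactly to zero, which illustrates how the L1 term yields a sparse w.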




PASCAL VOC 2010 Action Dataset

Figure credit: Ivan Laptev

• 9 classes, 50-100 trainval / testing images per class

• 14 attributes – trained from the trainval images; 27 objects – taken from Li et al., NIPS 2010; 150 poselets – taken from Bourdev & Malik, ICCV 2009.

VOC 2010: Classification Result

[Bar chart: average precision (y-axis, 0.3 to 0.9) for each of the 9 classes: Phoning, Playing instrument, Reading, Riding bike, Riding horse, Running, Taking photo, Using computer, Walking. Methods compared: Our method, use “a”; Poselet, Maji et al., 2011; SURREY_MK; UCLEAR_DOSP]
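Since the results are reported as average precision, a small sketch of the metric may help. The version below is the plain non-interpolated form (mean of the precision at the rank of each positive); the PASCAL VOC evaluation code uses its own interpolation, so numbers from this sketch would not match the official ones exactly. The function name and the example scores are assumptions.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical classifier scores for 4 test images; 1 marks a true positive image.
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(ap)
```

Here the positives sit at ranks 1 and 3, giving precisions 1 and 2/3, so the AP is 5/6. Mean average precision is the mean of this value over all classes.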

VOC 2010: Classification Result

[Bar chart: average precision (y-axis, 0.3 to 0.9) for each of the 9 classes: Phoning, Playing instrument, Reading, Riding bike, Riding horse, Running, Taking photo, Using computer, Walking. Methods compared: Our method, use “a”; Our method, use “w”]

VOC 2010: Analysis of Bases

[Figure: 400 action bases over attributes, objects, and poselets, shown next to the per-class average precision chart]


VOC 2010: Control Experiment

[Bar chart: mean average precision (y-axis, 0.45 to 0.7) for the combinations A+O+P, A+O, A+P, O+P, each with Use “a” and Use “w”]

A: attribute; O: object; P: poselet

PASCAL VOC 2011 Result

• Our method ranks first in nine out of ten classes in comp10.

Average precision (%) per class:

Class                Others’ best in comp9   Others’ best in comp10   Our method
Jumping              71.6                    59.5                     66.7
Phoning              50.7                    31.3                     41.1
Playing instrument   77.5                    45.6                     60.8
Reading              37.8                    27.8                     42.2
Riding bike          88.8                    84.4                     90.5
Riding horse         90.2                    88.3                     92.2
Running              87.9                    77.6                     86.2
Taking photo         25.7                    31.0                     28.8
Using computer       58.9                    47.4                     63.5
Walking              59.5                    57.6                     64.2


• Our method achieves the best performance in five out of ten classes if we consider both comp9 and comp10.
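The two ranking claims can be checked directly from the numbers on the slide. In the sketch below, the first and third values per row are the "Others’ best in comp9" and "Our method" columns; reading the unlabeled middle column as "others’ best in comp10" is an assumption inferred from the nine-out-of-ten claim.

```python
# (class, others' best in comp9, middle column assumed to be others' best
#  in comp10, our method); values copied from the slide.
results = [
    ("Jumping", 71.6, 59.5, 66.7),
    ("Phoning", 50.7, 31.3, 41.1),
    ("Playing instrument", 77.5, 45.6, 60.8),
    ("Reading", 37.8, 27.8, 42.2),
    ("Riding bike", 88.8, 84.4, 90.5),
    ("Riding horse", 90.2, 88.3, 92.2),
    ("Running", 87.9, 77.6, 86.2),
    ("Taking photo", 25.7, 31.0, 28.8),
    ("Using computer", 58.9, 47.4, 63.5),
    ("Walking", 59.5, 57.6, 64.2),
]

# Classes where our method beats the best other comp10 entry.
wins_comp10 = [name for name, _, c10, ours in results if ours > c10]

# Classes where our method beats the best other entry in either competition.
wins_both = [name for name, c9, c10, ours in results if ours > max(c9, c10)]

print(len(wins_comp10), len(wins_both))  # 9 5
print(wins_both)
```

Under that reading of the columns, the only comp10 class not won is Taking photo, and the five classes won across both competitions are Reading, Riding bike, Riding horse, Using computer, and Walking.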

Stanford 40 Actions

http://vision.stanford.edu/Datasets/40actions.html

• 40 action classes, 9532 real-world images from Google, Flickr, etc.

Classes: Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees, Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping, Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse, Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer, Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper

Stanford 40 Actions: classes highlighted on the class list

• Riding bike / Fixing bike

• Writing on board / Writing on paper

• Drinking / Gardening / Smoking cigarette

Stanford 40 Actions: Result

• We use 45 attributes, 81 objects, and 150 poselets.

• We compare our method with the Locality-constrained Linear Coding baseline (LLC, Wang et al., CVPR 2010).

[Bar chart: average precision (y-axis, 0 to 0.9) per class for LLC and Our Method. Classes along the x-axis: Riding a horse, Rowing a boat, Riding a bike, Climbing mountain, Jumping, Cleaning the floor, Walking a dog, Shooting an arrow, Playing guitar, Fishing, Holding up an umbrella, Running, Throwing a frisbee, Writing on a board, Watching TV, Cutting trees, Feeding a horse, Gardening, Writing on a book, Repairing a car, Looking thru a microscope, Cutting vegetables, Blowing bubbles, Playing violin, Brushing teeth, Repairing a bike, Pushing a cart, Using a computer, Applauding, Cooking, Smoking cigarette, Looking thru a telescope, Washing dishes, Drinking, Calling, Waving hands, Pouring liquid, Reading a book, Taking photos, Texting message]


Conclusion

Attributes, Parts-Objects, Parts-Poselets → Action bases Φ: $a \approx \Phi w$

Acknowledgement
