Human action recognition with kinect using a joint motion descriptor

Human Action Recognition with Kinect using a Joint Motion Descriptor Somar Boubou, Einoshin Suzuki 09.02.2015 Journal of Intelligent Information Systems 44 (1), 49-65


Page 1: Human action recognition with kinect using a joint motion descriptor

Human Action Recognition with Kinect using a Joint Motion Descriptor

Somar Boubou, Einoshin Suzuki

09.02.2015

Journal of Intelligent Information Systems 44 (1), 49-65

Page 2: Human action recognition with kinect using a joint motion descriptor

• Introduction /7 slides/.

• Related work /6 slides/.

• Proposed approach /9 slides/.

• Experimental results /9 slides/.

• Detailed analysis /5 slides/.

• Experiments and detailed analysis with a public challenging MSR-Action3D dataset. /7 slides/

• Summary.

Structure of this presentation

Page 3: Human action recognition with kinect using a joint motion descriptor

Vision-based human action recognition

Objective: using machine learning techniques to detect and understand the meaningful motion of the human body.

Page 4: Human action recognition with kinect using a joint motion descriptor

Kinect V1.8

Kinect is a motion sensing input device with an RGB camera, a depth sensor, and a multi-array microphone.

[0] HipCenter
[1] Spine
[2] ShoulderCenter
[3] Head
[4] ShoulderLeft
[5] ElbowLeft
[6] WristLeft
[7] HandLeft
[8] ShoulderRight
[9] ElbowRight
[10] WristRight
[11] HandRight
[12] HipLeft
[13] KneeLeft
[14] AnkleLeft
[15] FootLeft
[16] HipRight
[17] KneeRight
[18] AnkleRight
[19] FootRight
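The joint indices listed on this slide can be captured in a small lookup table. The following is a minimal Python sketch; the enum name `KinectJoint` and the upper-case member names are illustrative choices, not identifiers from the Kinect SDK or from the slides.

```python
from enum import IntEnum


class KinectJoint(IntEnum):
    """The 20 skeleton joints, indexed in the order listed on the slide."""
    HIP_CENTER = 0
    SPINE = 1
    SHOULDER_CENTER = 2
    HEAD = 3
    SHOULDER_LEFT = 4
    ELBOW_LEFT = 5
    WRIST_LEFT = 6
    HAND_LEFT = 7
    SHOULDER_RIGHT = 8
    ELBOW_RIGHT = 9
    WRIST_RIGHT = 10
    HAND_RIGHT = 11
    HIP_LEFT = 12
    KNEE_LEFT = 13
    ANKLE_LEFT = 14
    FOOT_LEFT = 15
    HIP_RIGHT = 16
    KNEE_RIGHT = 17
    ANKLE_RIGHT = 18
    FOOT_RIGHT = 19
```

With such a table, a skeleton frame stored as a list of 20 positions can be indexed by name, for example `frame[KinectJoint.HEAD]`.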

[Figure: Kinect sensor with the depth sensor, RGB camera, and multi-array microphone labeled.]

Page 5: Human action recognition with kinect using a joint motion descriptor

According to the Aggarwal review * :

Action primitives /Gestures/

- Elementary movements of a person's body part

- Can be described at the limb level (e.g., “Raising an arm”)

Actions

- Consist of several action primitives

- Could be temporally organized (e.g., “Walking”)

Activities

- Contain a number of subsequent actions (e.g., “Playing soccer”)

* [Aggarwal and Ryoo. Human activity analysis: A review, 2011.]

Taxonomies of human motion

Page 6: Human action recognition with kinect using a joint motion descriptor

Significance

• Depth frames are a rich and informative source of data:

- Complex actions and activities.

- Time consuming (not suitable for real-time applications).

• Recognizing gestures and actions in real time is considered very useful for human-machine interaction applications.

• Researchers are seeking an effective and fast descriptor for human action recognition that exploits a simple data source such as skeleton data.

Page 7: Human action recognition with kinect using a joint motion descriptor

Challenges of action recognition

Intra- and inter-class variations

- Speed of movement

- Style of movement

• A single action can be performed in a wide variety of ways.

• Different actions can be highly similar to one another.

Environment and recording settings

- View point

- Recording rate

- Partial occlusion

Computational time

- Complex descriptors lead to extensive computational time.

Page 8: Human action recognition with kinect using a joint motion descriptor

Contribution

A new feature descriptor for human actions represented by 3D skeleton series acquired by Kinect.

• Fast and effective

• Scale-invariant

• Speed-invariant

• Length-invariant

• Outperforms the state-of-the-art descriptors.

Publications:

S. Boubou & E. Suzuki, “Classifying Actions Based on Histogram of Oriented Velocity Vectors”, J. Intell. Inf. Syst. 44(1): 49-65 (2015)

Page 9: Human action recognition with kinect using a joint motion descriptor

Related work


Page 10: Human action recognition with kinect using a joint motion descriptor

Methodologies of human action recognition

Single-layered approach

- Recognition of gestures

- Actions with sequential characteristics

- Low computational cost

Hierarchical approach

- Complex activities

- High-level human actions

- High computational cost

Page 11: Human action recognition with kinect using a joint motion descriptor

Methodologies

Single-layered approach

Hierarchical approach

- To demonstrate a low computational cost.

- Suitable to be implemented for action recognition by robots.

- Fast and accurate enough for human-robot interaction.

Page 12: Human action recognition with kinect using a joint motion descriptor

Methodologies

• Single-layered approach

- Space-time approach

- Sequential approach (e.g., Hidden Conditional Random Fields) [Wang and Mori, CVPR 2009.]

• Hierarchical approach

Page 13: Human action recognition with kinect using a joint motion descriptor

Space-time approaches

• Space-time volumes

• Trajectories

• Feature descriptors

[Sheikh et al., ICCV 2005]

[Liu and Shah, CVPR 2008]

[Gorelick et al., PAMI 2007]

Page 14: Human action recognition with kinect using a joint motion descriptor

Histograms feature descriptors

[Ikizler and Duygulu. Histogram of oriented rectangles: A new pose descriptor for human action recognition. Image and Vision Computing, 2009.]

[Oreifej and Liu. HON4D. CVPR 2013.]

[Li, Zhang, and Liu. Action recognition based on a bag of 3D points, CVPR 2010.]

Page 15: Human action recognition with kinect using a joint motion descriptor

Proposed approach:

Histogram of Oriented Velocity Vectors (HOVV)

Page 16: Human action recognition with kinect using a joint motion descriptor

Proposed approach

We base our proposed motion understanding approach on the fact that the orientations of joint velocity vectors change over time with respect to the actions carried out.

Page 17: Human action recognition with kinect using a joint motion descriptor

Proposed action classification approach

[Diagram: Skeleton sequence → Feature extraction (HOVV, descriptors of joint groups) → either Feature vectors → SVM or ELM → Class labels, or Distance function → KNN → Class labels.]

Page 18: Human action recognition with kinect using a joint motion descriptor

Feature extraction

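The velocity vector of joint 𝑗 at time 𝑡 is the difference between its positions in consecutive frames:

$$V_t^{i,j} = \left(v_{t,x}^{i,j},\; v_{t,y}^{i,j},\; v_{t,z}^{i,j}\right) = \left(x_{t+1}^{i,j} - x_t^{i,j},\; y_{t+1}^{i,j} - y_t^{i,j},\; z_{t+1}^{i,j} - z_t^{i,j}\right)$$

The orientation of the velocity vector is defined in a spherical coordinate system by two angles:

$$\alpha_t^{i,j} = \arctan\left(\frac{v_{t,y}^{i,j}}{v_{t,x}^{i,j}}\right)$$

$$\beta_t^{i,j} = \arctan\left(\frac{v_{t,z}^{i,j}}{\sqrt{\left(v_{t,x}^{i,j}\right)^2 + \left(v_{t,y}^{i,j}\right)^2}}\right) + 90$$

𝑖: Action identifier.
𝑗: Joint identifier.
𝑡: Time.

$$v_{tot} = \sum_{t=1}^{T-1} \sum_{j=1}^{J} \left(v_{t,x}^{j} + v_{t,y}^{j} + v_{t,z}^{j}\right)$$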
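The feature extraction step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, `atan2` is used in place of a plain arctangent so that 𝛼 covers the full circle, and 𝛽 is assumed to be the elevation angle computed from the z component, so that it spans 0° to 180° as the histogram slide states.

```python
import math


def velocity_vectors(frames):
    """Per-joint velocity vectors as differences between consecutive frames.

    frames: list of T frames, each a list of J (x, y, z) tuples.
    Returns a list of T-1 frames of (vx, vy, vz) tuples.
    """
    return [
        [(x1 - x0, y1 - y0, z1 - z0)
         for (x0, y0, z0), (x1, y1, z1) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def orientation_angles(v):
    """Return (alpha, beta) in degrees for one velocity vector (vx, vy, vz)."""
    vx, vy, vz = v
    # Assumption: atan2 instead of arctan, wrapped so alpha lies in [0, 360].
    alpha = math.degrees(math.atan2(vy, vx)) % 360.0
    # Assumption: elevation above the x-y plane, shifted by 90 degrees,
    # so beta lies in [0, 180].
    beta = math.degrees(math.atan2(vz, math.hypot(vx, vy))) + 90.0
    return alpha, beta
```

For example, a joint moving purely along the positive y axis gets 𝛼 = 90° and 𝛽 = 90°, while a joint moving straight along positive z gets 𝛽 = 180°.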

Page 19: Human action recognition with kinect using a joint motion descriptor

Feature arrays

𝑖: Action identifier.
𝑗 → 𝐽: Joint identifier.
𝑡 → 𝑇: Time.

Page 20: Human action recognition with kinect using a joint motion descriptor

Our proposed histogram descriptor

[Figure: 2-D histogram for the torso joint group with axes 𝛼 (Torso) and 𝛽; the corner bins are labeled b_{1,1}, b_{B,1}, b_{1,B/2}, and b_{B,B/2}.]

• B = 24.

• 𝛼 ∊ [0°, 360°], 𝛽 ∊ [0°, 180°].

• Each bin represents a degree interval equal to 360/B = 360/24 = 15°.

• b_{1,1} represents the occurrence of the joints where 𝛼 ∊ [0°, 15°], 𝛽 ∊ [0°, 15°].

• The occurrence counts of all bins are scaled into the interval [0, 255], where 0 is shown as black and 255 as white.
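The binning described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hovv_histogram` is hypothetical, and the slide does not say how counts are mapped to [0, 255], so the sketch simply scales the fullest bin to 255.

```python
def hovv_histogram(angles, bins=24):
    """Build a bins x (bins // 2) histogram of (alpha, beta) pairs in degrees.

    Each bin covers 360 / bins degrees on both axes (15 degrees for bins=24).
    Returns a grid of integers in [0, 255].
    """
    width = 360.0 / bins
    n_alpha, n_beta = bins, bins // 2
    counts = [[0] * n_beta for _ in range(n_alpha)]
    for alpha, beta in angles:
        # Clamp the upper edges (alpha = 360, beta = 180) into the last bin.
        a = min(int(alpha // width), n_alpha - 1)
        b = min(int(beta // width), n_beta - 1)
        counts[a][b] += 1
    peak = max(max(row) for row in counts)
    if peak == 0:
        return counts
    # Assumption: linear scaling so that the fullest bin maps to 255.
    return [[round(255 * c / peak) for c in row] for row in counts]
```

With B = 24 this yields a 24 × 12 grid, which can be rendered directly as a grayscale image.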

Page 21: Human action recognition with kinect using a joint motion descriptor

Our proposed histogram descriptor

[Figure: one histogram per joint group, with axes 𝛼 (Torso), 𝛼 (Upper limbs), and 𝛼 (Lower limbs), each against 𝛽.]

Page 22: Human action recognition with kinect using a joint motion descriptor

Feature vector

$$f_{kl} = \left(b_{11}, b_{21}, \ldots, b_{k1}, b_{12}, b_{22}, \ldots, b_{k2}, \ldots, b_{1l}, \ldots, b_{kl}\right)$$

The attributes of the feature vector $f$ are the normalized densities of the histogram bins, in [0, 255], and $v_{tot}$ is added to the feature vector.

The new feature vector will be: $f_{kl} = \left(f_{kl},\; v_{tot}\right)$,

where $v_{tot}$ is defined as:

$$v_{tot} = \sum_{t=1}^{T-1} \sum_{j=1}^{J} \left(v_{t,x}^{j} + v_{t,y}^{j} + v_{t,z}^{j}\right)$$
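As a rough illustration of how the feature vector could be assembled, the sketch below flattens a k × l histogram with the first index varying fastest, matching the bin order b11, b21, …, bk1, b12, …, and appends the total velocity. The function names are hypothetical. The sum follows the slide formula as it was extracted, that is, a plain sum of the signed components; whether magnitudes were intended cannot be recovered from the transcript.

```python
def total_velocity(velocities):
    """Sum of the x, y and z velocity components over all frames and joints.

    velocities: list of T-1 frames, each a list of J (vx, vy, vz) tuples.
    """
    return sum(vx + vy + vz for frame in velocities for (vx, vy, vz) in frame)


def feature_vector(hist, v_tot):
    """Flatten hist (k x l, indexed hist[a][b]) and append v_tot.

    Output order: b11, b21, ..., bk1, b12, ..., bkl, v_tot.
    """
    k, l = len(hist), len(hist[0])
    # The first index varies fastest, as in the slide's ordering of bins.
    flat = [hist[a][b] for b in range(l) for a in range(k)]
    return flat + [v_tot]
```

For B = 24 and a single joint group, the resulting vector has 24 × 12 + 1 = 289 entries.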

Page 23: Human action recognition with kinect using a joint motion descriptor

Rotation transform:

• No need for a spatial transform.

• Each skeleton is rotated around a vertical axis passing through the skeleton’s hip center.

• The rotation angle is 𝜽𝒊.
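A rotation around a vertical axis through the hip center can be sketched as below. The slides do not state how the angle 𝜽𝒊 is chosen, so the sketch takes it as an input; it also assumes that y is the vertical axis and that the hip center is joint 0. The function name is hypothetical.

```python
import math


def rotate_about_vertical(skeleton, theta_deg, hip_index=0):
    """Rotate all joints by theta_deg around the vertical axis through the hip.

    skeleton: list of (x, y, z) tuples. Assumption: y is the vertical axis.
    """
    hx, _, hz = skeleton[hip_index]
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    rotated = []
    for x, y, z in skeleton:
        dx, dz = x - hx, z - hz
        # Standard rotation about the y axis, applied relative to the hip.
        rotated.append((hx + c * dx + s * dz, y, hz - s * dx + c * dz))
    return rotated
```

The hip center itself and every joint height stay unchanged; only the horizontal coordinates relative to the hip are rotated.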

Page 24: Human action recognition with kinect using a joint motion descriptor

Experimental Results


Page 25: Human action recognition with kinect using a joint motion descriptor

Office dataset / parameter tuning /

One subject, 9 actions, 100 segmented samples. Recording rate is 15 frames/sec. Each sample is fixed length and consists of 30 frames.

“standing-up”, “sitting-down”, “stretching”, “scratching the head”, “crossing hands”, “sliding the chair forward/backward while sitting”, “re-positioning chair”, “taking a rest by bending on the desk” and “putting two hands on the table”

Page 26: Human action recognition with kinect using a joint motion descriptor

Office dataset

[Figure: bar chart of accuracy (0% to 100%) for distance functions 1 to 5, with one series each for 14 joints, 16 joints, and 20 joints.]

• 1-nearest neighbor classification method

• Distance functions are: 1. Bhattacharyya 2. Correlation 3. Chi-Square χ² 4. Intersection 5. Hellinger
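The slide names the five distance functions but does not give their formulas. The sketch below uses common textbook definitions, which may differ in detail from the ones used in the experiments; all function names are hypothetical. Similarity measures (intersection, correlation) are turned into distances by subtracting from 1.

```python
import math


def _normalize(h):
    """Scale a histogram so that its entries sum to 1."""
    total = float(sum(h))
    return [x / total for x in h] if total > 0 else list(h)


def _bc(p, q):
    """Bhattacharyya coefficient of two histograms."""
    return sum(math.sqrt(a * b) for a, b in zip(_normalize(p), _normalize(q)))


def bhattacharyya(p, q):
    return -math.log(min(max(_bc(p, q), 1e-12), 1.0))


def hellinger(p, q):
    return math.sqrt(max(0.0, 1.0 - _bc(p, q)))


def chi_square(p, q):
    p, q = _normalize(p), _normalize(q)
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def intersection(p, q):
    p, q = _normalize(p), _normalize(q)
    return 1.0 - sum(min(a, b) for a, b in zip(p, q))


def correlation(p, q):
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    num = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    den = math.sqrt(sum((a - mp) ** 2 for a in p) * sum((b - mq) ** 2 for b in q))
    return 1.0 - num / den if den > 0 else 1.0


def nearest_neighbor(query, train, distance):
    """1-NN: train is a list of (histogram, label) pairs."""
    return min(train, key=lambda item: distance(query, item[0]))[1]
```

Any of the five functions can be passed as `distance`, for example `nearest_neighbor(h, training_set, chi_square)`.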

Page 27: Human action recognition with kinect using a joint motion descriptor

3-D skeleton dataset


Page 28: Human action recognition with kinect using a joint motion descriptor

3-D skeleton dataset

9 actions, 9 subjects × 2 trials = 162 samples. Recording rate is 15 frames/sec. In total 10687 frames (jpg, depth).

“sitting”, “standing”, “waving”, “walking”, “picking-up”, “stretching”, “using a hammer”, “drawing a circle”, “forward-punching”.

Page 29: Human action recognition with kinect using a joint motion descriptor

3-D skeleton dataset

• 𝑘-nearest neighbor classification method (𝑔 = 3)

• 20 × 10-fold cross-validation

[Figure: accuracy (0% to 100%) versus number of histogram bins (4, 8, 16, 24, 36, 72), with one curve each for k=5, k=6, and k=7.]

Page 30: Human action recognition with kinect using a joint motion descriptor

Comparison with the state-of-the-art methods

Method                              Accuracy
Our approach
  HOVV                              88.90%
Other approaches
  * HCRF [ω=0]                      54.17%
  * HCRF [ω=1]                      56.94%
  ** HON4D                          81.94%
  ** HON4DA                         76.39%
  *** Chen and Koskela [4] (SVM)    78.32%
  *** Chen and Koskela [4] (ELM)    87.71%

We compare accuracy with:
* Sequential representation approach with HCRF classifier.
** 4-D space-time representation approach with SVM classifier.
*** Sequential data with SVM and ELM classifiers.

Page 31: Human action recognition with kinect using a joint motion descriptor

Comparison with the state-of-the-art methods

Method                  Computational latency of feature extraction (sec)
Our approach
  HOVV                  0.055
Other approaches
  * HCRF [ω=0]          0.017
  * HCRF [ω=1]          0.058
  ** HON4D              34.755
  ** HON4DA             6.706

We compare computational latency with:
* Sequential representation approach with HCRF classifier.
** 4-D space-time representation approach with SVM classifier.

Page 32: Human action recognition with kinect using a joint motion descriptor

Comparison with the state-of-the-art methods

Method                  Total classification latency,        Classification accuracy
                        train time + test time (sec)
Proposed descriptor
  Average               3.168                                88.75%
  Minimum               2.964                                84.72%
  Maximum               3.51                                 91.67%
Chen and Koskela
  Average               3.389                                87.71%
  Minimum               3.136                                82.27%
  Maximum               3.651                                89.88%

We compare with the Chen and Koskela 2013* descriptor. ELM is used as the classification method.

* Classification of RGB-D and Motion Capture Sequences Using Extreme Learning Machine. Scandinavian Conference on Image Analysis, 2013.

Page 33: Human action recognition with kinect using a joint motion descriptor

Detailed analysis of proposed descriptor performance

• Joint grouping proposal.

• Number of histogram bins.

• Performance per action.

• What kind of actions is HOVV suitable for?

Page 34: Human action recognition with kinect using a joint motion descriptor

Joint grouping with 3-D skeleton dataset

• Three scenarios of joint grouping are considered: g = 3, g = 8, and g = 20.

• To achieve a fast and efficient motion descriptor, these three scenarios of joint grouping are analyzed.

• A motion descriptor with optimal efficiency must show the highest recognition accuracy with the lowest computational time.
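Joint grouping can be sketched as a mapping from group names to joint indices. The slides name torso, upper limbs, and lower limbs as the groups for g = 3, but they do not list which joints belong to which group, so the assignment below is purely hypothetical; the composition of the g = 8 scenario is not given and is left out. With g = 20, every joint forms its own group.

```python
# Hypothetical assignment of the 20 Kinect joint indices to three groups.
GROUPS_G3 = {
    "torso": [0, 1, 2, 3],
    "upper_limbs": [4, 5, 6, 7, 8, 9, 10, 11],
    "lower_limbs": [12, 13, 14, 15, 16, 17, 18, 19],
}

# With g = 20, each joint is a group of its own.
GROUPS_G20 = {f"joint_{j}": [j] for j in range(20)}


def split_by_group(per_joint_items, groups):
    """Collect per-joint items (e.g. angle lists) by group.

    per_joint_items: list indexed by joint. Returns {group name: [items]}.
    """
    return {
        name: [per_joint_items[j] for j in joints]
        for name, joints in groups.items()
    }
```

One histogram would then be computed per group, so a smaller g means fewer histograms and a shorter feature vector.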

Page 35: Human action recognition with kinect using a joint motion descriptor

Accuracy & computational time

[Figure: classification accuracy versus time (ms) for HOVV_SVM (g=3), HOVV_SVM (g=8), HOVV_SVM (g=20), HCRF [ω=1], POS (SVM), and Reg. Highlighted point: (8.33, 81.94%) at g=3, B=16.]

Method      Accuracy    Time
HON4D       81.94%      34.76 sec
HON4DA      76.39%      6.71 sec

Grouping joints into three groups allows us to outperform the state-of-the-art methods with very low computational time.

Page 36: Human action recognition with kinect using a joint motion descriptor

Accuracy & computational time (Logarithmic scale)

[Figure: classification accuracy versus time (ms, logarithmic scale) for HOVV_SVM (g=3), HOVV_SVM (g=8), HOVV_SVM (g=20), HCRF [ω=1], POS (SVM), HON4D, and HON4DA. Highlighted point: (8.33, 81.94%) at g=3, B=16.]

Grouping joints into three groups allows us to outperform the state-of-the-art methods with very low computational time.

Page 37: Human action recognition with kinect using a joint motion descriptor

Accuracy for each action with the 3-D skeleton dataset

Action              My method (SVM)   HON4D     HON4DA    Chen [SVM]   Chen [ELM]
Parameters          B=24, g=20        -         -         -            -
Sitting             75.00%            75%       75%       79.8%        77.50%
Standing            100%              62.5%     62.5%     93.9%        98.21%
Waving              100%              37.5%     25%       59%          65.54%
Walking             100%              100%      100%      100%         94.71%
Picking-up          100%              87.5%     75%       51%          64.18%
Stretching          70.00%            100%      75%       65%          92.30%
Using a hammer      100%              100%      87.5%     53%          82.91%
Drawing a circle    87.00%            87.5%     87.5%     100%         96.40%
Forward-punching    75.00%            87.5%     100%      98.7%        98.41%
Overall             88.89%            81.94%    76.39%    78.32%       85.57%

Page 38: Human action recognition with kinect using a joint motion descriptor

For a more detailed analysis of the joint grouping and of the method's performance per action, we apply the proposed method to a public dataset provided by Microsoft:

MSR-Action3D dataset

• What kind of actions is HOVV suitable for?

Page 39: Human action recognition with kinect using a joint motion descriptor

Descriptor performance by action

- To show which kinds of actions the proposed descriptor is effective for, I conducted experiments on the public Microsoft MSR-Action3D dataset*:

* http://research.microsoft.com/en-us/um/people/zliu/ActionRecoRsrc/

10 subjects, 20 actions

Each subject performs each action 2 or 3 times.

There are 557 samples in total.

- The recognition accuracy of each action is shown under various conditions (i.e., number of bins and joint grouping).

- In the following three slides we show the confusion matrices of the descriptor's performance with

- Number of bins = 36

- g=20, g=8 and g=3

Page 40: Human action recognition with kinect using a joint motion descriptor

Comparison with the state-of-the-art methods

MSR-Action3D dataset

Method                               Overall accuracy    Accuracy by action
My method                            85.64%              Available
Actionlet                            88.2%               Available
HON4D (HON4DA)                       88.89% (85.85%)     -
Action Graph on Bag of 3D Points     74.7%               -
HMM                                  63%                 -
Dynamic Temporal Warping             54%                 -
Recurrent Neural Network             42.5%               -

Page 41: Human action recognition with kinect using a joint motion descriptor

Accuracy for each action

Action    Recognition accuracy (My method, B=36, g=8)    Recognition accuracy (Actionlet)

High arm waving 88.9% 91.67%

Horizontal arm waving 96.2% 100%

Hammering 74.1% 83.33%

Hand catching 76% 25.00%

Forward punching 65.4% 72.73%

High throwing 88.5% 72.73%

Drawing X 63% 53.85%

Drawing tick 76.7% 100%

Drawing circle 76.7% 100%

Hand clapping 93.3% 100%

Two hand waving 93.3% 100%

Side boxing 100% 86.67%

Bending 81.5% 93.33%

Forward kicking 86.2% 100%

Side kicking 95% 100%

Jogging 100% 100%

Tennis swinging 90% 100%

Tennis serving 90% 100%

Golf swinging 96.7% 100%

Pick-Up throwing 77.8% 64.29%

Overall 85.64% 88.2%


Page 42: Human action recognition with kinect using a joint motion descriptor

Action (MSR3D)    Recognition accuracy: B=36, g=20    B=36, g=8    B=36, g=3

High arm waving 7.4% 88.9% 88.9%

Horizontal arm waving 26.9% 96.2% 84.6%

Hammering 7.4% 74.1% 77.8%

Hand catching 40% 76% 72%

Forward punching 15.4% 65.4% 61.5%

High throwing 19.2% 88.5% 88.5%

Drawing X 18.5% 63% 66.7%

Drawing tick 36.7% 76.7% 73.3%

Drawing circle 23.3% 76.7% 76.7%

Hand clapping 43.3% 93.3% 90%

Two hand waving 76.7% 93.3% 90%

Side boxing 26.7% 100% 100%

Bending 63% 81.5% 74.1%

Forward kicking 51.7% 86.2% 82.8%

Side kicking 40% 95% 95%

Jogging 83.3% 100% 96.7%

Tennis swinging 43.3% 90% 93.3%

Tennis serving 53.3% 90% 90%

Golf swinging 60% 96.7% 96.7%

Pick-Up throwing 74.1% 77.8% 81.5%

[Legend: cell colors mark accuracy increased, no change, accuracy decreased < 5%, or accuracy decreased > 5%.]

Action (3DOffice)    Recognition accuracy: B=16, g=20    B=16, g=8    B=16, g=3

Sitting 62.5% 62.5% 50%

Standing 100% 100% 100%

Waving 100% 100% 100%

Walking 100% 100% 100%

Picking-up 100% 75% 75%

Stretching 87.5% 87.5% 62.5%

Using a hammer 75% 87.5% 75%

Drawing a circle 87.5% 87.5% 87.5%

Forward-punching 75% 87.5% 87.5%

Page 43: Human action recognition with kinect using a joint motion descriptor

[Figure: the 20 MSR-Action3D actions arranged by how effective HOVV is for them (Very Effective, Effective, Not Effective), each annotated with its accuracy and the number of bins (36, 24, 16, or 8). Colors mark accuracy increased, no change, accuracy decreased < 5%, or accuracy decreased > 5%.]

Page 44: Human action recognition with kinect using a joint motion descriptor

[Figure: same chart as on the previous slide, with MSR-Action3D actions arranged as Very Effective, Effective, or Not Effective and annotated with accuracy and number of bins.]

Effective with:

- Periodic actions.

- Actions with unique joint trajectories.

Grouping is less effective for actions that involve the movement of only a few joints with small displacements.

Grouping joints into eight groups is always effective for the actions of the MSR3D dataset.

Page 45: Human action recognition with kinect using a joint motion descriptor

Summary

We proposed a novel descriptor for the motion of skeleton joints.

On the 3-D skeleton dataset, the proposed descriptor outperformed state-of-the-art descriptors such as HON4D and the one proposed by Chen and Koskela (2013).

Our proposed approach proved to be effective for periodic actions (e.g., waving, walking, jogging, side boxing).

Grouping was effective for actions with unique joint trajectories (e.g., tennis serving, side kicking).

Grouping joints into eight groups is always effective for the actions of the MSR3D dataset.

Page 46: Human action recognition with kinect using a joint motion descriptor

The end
