Human action recognition with kinect using a joint motion descriptor

Human Action Recognition with Kinect using a

Joint Motion Descriptor

Somar Boubou, Einoshin Suzuki09.02.2015

Journal of Intelligent Information Systems 44 (1), 49-65

http://link.springer.com/article/10.1007/s10844-014-0329-0

• Introduction /7 slides/.

• Related work /6 slides/.

• Proposed approach /9 slides/.

• Experimental results /9 slides/.

• Detailed analysis /5 slides/.

• Experiments and detailed analysis with a public challenging MSR-Action3D dataset. /7 slides/

• Summary.

Structure of this presentation1

Vision-based human action recognition

Using machine learning techniques to detect and understand the meaningful motion of the human body.

Objective:

2

Kinect V1.8Kinect is a motion sensing input device with RGB camera, depth sensor and multi-array microphone.

[0] HipCenter[1] Spine[2] ShoulderCenter[3] Head[4] ShoulderLeft[5] ElbowLeft[6] WristLeft[7] HandLeft[8] ShoulderRight[9] ElbowRight[10] WristRight[11] HandRight[12] HipLeft[13] KneeLeft[14] AnkleLeft[15] FootLeft[16] HipRight[17] KneeRight[18] AnkleRight[19] FootRight

Depth sensor

RGB camera

Multi-array microphone

3

According to the Aggarwal review * :

Action primitives /Gestures/

- Elementary movements of a person's body part

- Can be described at the limb level (e.g., “Raising an arm”)

Actions

- Consist of several action primitives

- Could be temporally organized (e.g., “Walking”)

Activities

- Contain a number of subsequent actions (e.g., “Playing soccer”)

* [Aggarwal and Ryoo. Human activity analysis: A review, 2011.]

Taxonomies of human motion4

• Depth frames are a rich and informative sources of data:

- Complex actions and activities.

- Time consuming. (Not suitable for real-time applications).

• Recognizing gestures and actions in real-time is considered very useful for the application of human-machine interaction.

• Researchers are seeking an effective and fast descriptor for human action recognition which exploit simple data source such as skeleton data.

Significance5

Intra- and inter-class variations

- Speed of movement - Style of movement

• Wide variety of one action performance.

• Different actions have a high degree of similarity in between.

Environment and recording settings

- View point - Recording rate

- Partial occlusion

Computational time

- Complex descriptors lead to an extensive computational time.

Challenges of action recognition6

A new feature descriptor for human actions represented by 3D skeleton series acquired by Kinect.

• Fast and effective

• Scale-invariant

• Speed-invariant

• Length-invariant

• Outperforms the state-of-the-art descriptors.

Publications:

Contribution7

S. Boubou & E. Suzuki, “Classifying Actions Based onHistogram of Oriented Velocity Vectors”, J. Intell. Inf. Syst.44(1): 49-65 (2015)

Related work

8

Methodologies of

human action recognition

Single-layered approachHierarchical

approach

- Recognition of gestures- Actions with sequential

characteristics

Low computational cost

- Complex activities- High-level human actions

High computational cost

9

Methodologies

Single-layered approachHierarchical

approach

- To demonstrate a low computational cost.

- Suitable to be implemented for action recognition by robots.

- Fast and accurate enough for human-robot interaction.

10

Methodologies

Single-layered approach

Space-time approach

Sequential approach

Hierarchical approach

11

(e.g., Hidden Conditional Random Fields)

[Wang and Mori, CVPR 2009.]

Space-time approaches

Space-time volumes

TrajectoriesFeature

descriptors

12

[SHEIKH et at ,ICCV2005]

[LIU AND SHAH, CVPR2008]

[Gorelick et at, PAMI2007]

Histograms feature descriptors

[Ikizler and Duygulu. Histogram of orientedrectangles: A new pose descriptor for human action recognition. Image and Vision Computing, 2009.]

[Oreifej and Liu. HON4D. CVPR 2013.]

[Li, Zhang, and Liu. Action recognition based on a bag of 3D points,

CVPR 2010.]

13

Histogram of Oriented Velocity Vectors(HOVV)

14

Proposed approach:

We base our proposed motion understanding approach on the fact that the orientations of joint velocity vectors change over time with respect to the actions carried out.

15

Proposed approach

Proposed action classification approach

Feature vectorsDistance function

16

Feature extraction

Class labels

Descriptors of joint groups

Class labels

HOVV

Skeleton sequence

SVM or ELMKNN

Feature extraction

The orientation of the velocity vector is defined in a spherical coordinate system by two angles:

17

𝑉𝑡𝑖,𝑗= 𝑣𝑡,𝑥

𝑖,𝑗, 𝑣𝑡,𝑦𝑖,𝑗, 𝑣𝑡,𝑧𝑖,𝑗= 𝑥𝑡+1

𝑖,𝑗− 𝑥𝑡

𝑖,𝑗, 𝑦𝑡+1

𝑖,𝑗− 𝑦𝑡

𝑖,𝑗, 𝑧𝑡+1𝑖,𝑗− 𝑧𝑡

𝑖,𝑗

𝛼𝑡𝑖,𝑗= 𝑎𝑟𝑐𝑡𝑎𝑛

𝑣𝑡,𝑦𝑖,𝑗

𝑣𝑡,𝑥𝑖,𝑗

𝛽𝑡𝑖,𝑗= 𝑎𝑟𝑐𝑡𝑎𝑛

𝑣𝑡,𝑥𝑖,𝑗

𝑣𝑡,𝑥𝑖,𝑗 2

+ 𝑣𝑡,𝑦𝑖,𝑗

2+ 90

𝑖: Action identifier.𝑗: Joint identifier.𝑡: Time.

𝑣𝑡𝑜𝑡 =

𝑡=1

𝑇−1

𝑗=1

𝐽

𝑣𝑡,𝑥𝑗+ 𝑣𝑡,𝑦

𝑗+ 𝑣𝑡,𝑧

𝑗

Feature arrays 18

𝑖: Action identifier.𝑗 → 𝐽 : Joint identifier.𝑡 → 𝑇: Time.

𝛽

𝜶𝑻𝒐𝒓𝒔𝒐

Our proposed histogram descriptor19

𝒃1,1 𝒃𝐵,1

𝒃1,𝐵/2

𝒃1,1

𝒃𝐵,𝐵/2

• B= 24. • 𝛼 ∊ [0,360], 𝛽 ∊ [0,180]• Each bin representing a degree interval

equal to 360/B = 360/24 = 15°• 𝒃1,1 representing the occurrence of the

joints where 𝛼 ∊ [0,15°], 𝛽 ∊ [0,15°]

• The occurrence of all bins is regulated into an interval equal to [0,255] where 0 is represented by black and 255 is represented with white.

𝛽

𝜶𝑻𝒐𝒓𝒔𝒐 𝜶𝑼𝒑𝒑𝒆𝒓 𝒍𝒊𝒎𝒃𝒔𝜶𝑳𝒐𝒘𝒆𝒓 𝒍𝒊𝒎𝒃𝒔

Our proposed histogram descriptor20

Feature vector

w𝐡𝐞𝐫𝐞 𝒗𝒕𝒐𝒕 is defined as:

21

𝑣𝑡𝑜𝑡 =

𝑡=1

𝑇−1

𝑗=1

𝐽

𝑣𝑡,𝑥𝑗+ 𝑣𝑡,𝑦

𝑗+ 𝑣𝑡,𝑧

𝑗

The attributes of the feature vector 𝑓 are normalized density of the histogram bins [0; 255] and 𝑣𝑡𝑜𝑡 is added to the feature vector.

The new feature vector will be: 𝑓𝑘𝑙 = 𝑓𝑘𝑙 , 𝑣𝑡𝑜𝑡 .

f𝑘𝑙= 𝑏11, 𝑏21, … , 𝑏𝑘1, 𝑏12, 𝑏22, … , 𝑏𝑘2, … , 𝑏1𝑙 , … , 𝑏𝑘𝑙

• No need for a spatial transform.

• Each skeleton is rotated around a vertical axis passing through the skeleton’s Hip-center.

Rotation transform:

• The rotation angle is 𝜽𝒊.

22

Experimental Results

23

Office dataset / parameter tuning /

One subject, 9 actions, 100 segmented samples. Recording rate is 15 frames/sec. Each sample is fixed length and consists of 30 frames.

“standing-up” , “sitting-down” , “stretching” , “scratching the head” , “crossing hands”, “sliding the chair forward/backward while sitting” ,

“re-positioning chair” , “taking a rest by bending on the desk” and “putting two hands on the table”

24

Office dataset

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1 2 3 4 5

Acc

ura

cy

Distance functions

14 joints

16 joints

20 joints

• 1-nearest neighbor classification method• Distance functions are: 1.Bhattacharyya 2.Correlation

3.Chi-Square χ² 4.Intersection 5.Hellinger

25

3-D skeleton dataset

26

3-D skeleton dataset 9 actions, 9 subjects * 2 trials 162 samples. Recording rate is 15 frames/sec. In total 10687 frames (jpg, depth)

“sitting”, “standing”, “waving”, “walking”, “picking-up”, “stretching”, “using a hammer”, “drawing a circle”,

“forward-punching”.

27

3-D skeleton dataset

• 𝑘-nearest neighbor classification method /(𝑔=3)/• 20*10 fold cross validation

k=5

k=7

0%10%20%30%40%50%60%70%80%90%

100%

4 8 16 24 36 72

k=5

k=6

k=7

Acc

ura

cy

Number of histogram bins

28

Comparison with the-state-of-the-art methods

Method Accuracy%

Our approach HOVV 88.90 %

Other approaches

* HCRF [ω=0] 54.17%

* HCRF [ω=1] 56.94%

** HON4D 81.94%

** HON4DA 76.39%

*** Chen and Koskela [4](SVM) 78.32%

*** Chen and Koskela [4](ELM) 87.71%

We compare accuracy with:* Sequential representation approach with HCRF classifier.** 4-D Space-time representation approach with SVM classifier. *** Sequential data with SVM and ELM classifier.

29

http://www.computer.org/csdl/trans/tp/2007/10/i1848-abs.html


http://www.cs.ucf.edu/~oreifej/HON4D.html


http://link.springer.com/chapter/10.1007/978-3-642-38886-6_60


Method Computational latency (Sec)(Feature extraction)

Our approach HOVV 0.055

Other approaches

* HCRF [ω=0] 0.017

* HCRF [ω=1] 0.058

** HON4D 34.755

** HON4DA 6.706

We compare computational latency with:* Sequential representation approach with HCRF classifier.** 4-D Space-time representation approach with SVM classifier.

Comparison with the-state-of-the-art methods30





Method Total classification computational latency

(train time + test time) (Sec)

ClassificationAccuracy

Proposed descriptor

Average 3.168 88.75%

Minimum 2.964 84.72%

Maximum 3.51 91.67%

Chen and Koskela

Average 3.389 87.71%

Minimum 3.136 82.27%

Maximum 3.651 89.88%

We compare with Chen and Koskela 2013* descriptor.ELM is used as classification method.

* Classification of RGB-D and Motion Capture Sequences Using Extreme Learning Machine. Scandinavian Conference on Image Analysis, 2013.

Comparison with the-state-of-the-art methods31


Detailed analysis of proposed descriptor performance

32

• Joint grouping proposal.• Number of histogram bins.• Performance per action.• What kind of actions HOVV is suitable for?

Joint grouping with 3-D skeleton dataset

• Three scenarios of joint grouping:

33

• In order to achieve a fast and efficient motion descriptor above mentioned three scenarios of joint grouping are analyzed.

• A motion descriptor with optimal efficiency must show the highest recognition accuracy with the lowestcomputational time.

Cla

ssif

icat

ion

acc

ura

cy

Time (ms)

Accuracy & computational time

HOVV_SVM_(g=3) HOVV_SVM_(g=8) HOVV_SVM_(g=20)HCRF[w=1] POS(SVM) Reg

Method Accuracy Time

HON4D 81.94% 34.76 Sec

HON4DA 76.39% 6.71 Sec

(8.33, 81.94%)g=3, B=16

Grouping joints into three groups allow us to outperform the-state-of-the-art methods with very low computational time.

34



Cla

ssif

icat

ion

acc

ura

cy

Time (ms)

Accuracy & computational time (Logarithmic scale)

HOVV_SVM_(g=3) HOVV_SVM_(g=8) HOVV_SVM_(g=20)HCRF[w=1] POS(SVM) HON4DHON4DA

(8.33, 81.94%)g=3, B=16

Grouping joints into three groups allow us to outperform the-state-of-the-art methods with very low computational time.

35

Action Recognition

accuracy

(My method)

(SVM)

HON4D HON4DA Chen

[SVM]

Chen

[ELM]

B=24, g=20 - - - -

Sitting 75.00% 75% 75% 79.8% 77.50%

Standing 100% 62.5% 62.5% 93.9% 98.21%

Waving 100% 37.5% 25% 59 % 65.54%

Walking 100% 100% 100% 100% 94.71%

Picking-up 100% 87.5% 75% 51 % 64.18%

Stretching 70.00% 100% 75% 65% 92.30%

Using-hummer 100% 100% 87.5% 53 % 82.91%

Drawing a circle 87.00% 87.5% 87.5% 100% 96.40%

Forward-punching 75.00% 87.5% 100% 98.7% 98.41%

Overall 88.89% 81.94% 76.39% 78.32% 85.57%

Compare accuracy for each action with3-D skeleton dataset

36

For more detailed analysis of the joint grouping and the method performance per action,

we implement proposed method on a public dataset provided by Microsoft:

MSR-Action3D dataset

37

• What kind of actions HOVV is suitable for?

Descriptor performance by action- In order to demonstrate the kind of actions that proposed descriptor is effective with, I conducted experiments on a public MSR action 3D Microsoft dataset*:

* http://research.microsoft.com/en-us/um/people/zliu/ActionRecoRsrc/

10 subjects, 20 actions

Each subject performs each action 2 or 3 times.

There are 557 samples in total.

- Recognition accuracy of each action is demonstrated under various conditions (i.e., number of bins and joint grouping).

- In the following three slides we show the confusion matrices of descriptor performance with

- Bins number = 36

- g=20, g=8 and g=3

38

http://research.microsoft.com/en-us/um/people/zliu/ActionRecoRsrc/

Comparison with the state-of-the-art methods

Method My

method

Actionlet HON4D

(HON4DA)

Action

Graph on

Bag of 3D

Points

HMM Dynamic

Temporal

Warping

Recurrent

Neural

Network

Over all

accuracy 85.64% 88.2% 88.89

(85.85%)

74.7% 63% 54% 42.5%

Accuracy

by actionAvailable Available - - - - -

MSR-Action3D dataset

39

accuracy for each action

Action

Recognition accuracy

(My method)


(Actionlet)

B=36, g=8

High arm waving 88.9% 91.67%

Horizontal arm waving 96.2% 100%

Hammering 74.1% 83.33%

Hand catching 76% 25.00%

Forward punching 65.4% 72.73%

High throwing 88.5% 72.73%

Drawing X 63% 53.85%

Drawing tick 76.7% 100%

Drawing circle 76.7% 100%

Hand clapping 93.3% 100%

Two hand waving 93.3% 100%

Side boxing 100% 86.67%

Bending 81.5% 93.33%

Forward kicking 86.2% 100%

Side kicking 95% 100%

Jogging 100% 100%

Tennis swinging 90% 100%

Tennis serving 90% 100%

Golf swinging 96.7% 100%

Pick-Up throwing 77.8% 64.29%

Overall 85.64% 88.2%

40

Action(MSR3D)


B=36, g=20

B=36, g=8

B=36, g=3

High arm waving 7.4% 88.9% 88.9%

Horizontal arm waving 26.9% 96.2% 84.6%

Hammering 7.4% 74.1% 77.8%

Hand catching 40% 76% 72%

Forward punching 15.4% 65.4% 61.5%

High throwing 19.2% 88.5% 88.5%

Drawing X 18.5% 63% 66.7%

Drawing tick 36.7% 76.7% 73.3%

Drawing circle 23.3% 76.7% 76.7%

Hand clapping 43.3% 93.3% 90%

Two hand waving 76.7% 93.3% 90%

Side boxing 26.7% 100% 100%

Bending 63% 81.5% 74.1%

Forward kicking 51.7% 86.2% 82.8%

Side kicking 40% 95% 95%

Jogging 83.3% 100% 96.7%

Tennis swinging 43.3% 90% 93.3%

Tennis serving 53.3% 90% 90%

Golf swinging 60% 96.7% 96.7%

Pick-Up throwing 74.1% 77.8% 81.5%

Accuracy increased

No change

Accuracy decreased < 5%

Accuracy decreased > 5%

41

Action(3DOffice)


B=16, g=20

B=16, g=8

B=16, g=3

Sitting 62.5% 62.5% 50%

Standing 100% 100% 100%

Waving 100% 100% 100%

Walking 100% 100% 100%

Picking-up 100% 75% 75%

Stretching 87.5% 87.5% 62.5%

Using-hummer 75% 87.5% 75%

Drawing a circle 87.5% 87.5% 87.5%

Forward-punching

75% 87.5% 87.5%

High arm waving

Horizontal arm waving

Hammering

Hand catching

Forward punching

High throwing

Drawing X

Draw tick

Drawing circle

Hand clappingTwo hand waving

Side boxing

Bending

Forward kicking

Side kicking

Jogging

Tennis Swinging

Tennis serving

Golf Swinging

pickup throwing

74.1%95%

65%90%

88.9%96.7%100%

93.3%93.3%

96.2%88.5%100%86.2%

63%76.7%81.5%76.7%

90%76%

77.8%

Accuracy increasedNo change

Accuracy decreased < 5%Accuracy decreased > 5%

36 1624 8

Very Effective

Effective

Not Effective

42Accuracy

Number of bins:

36 1624 8

74.1%95%

65%90%

88.9%96.7%100%

93.3%93.3%

96.2%88.5%100%86.2%

63%76.7%81.5%76.7%

90%76%

77.8%

Very Effective

Effective

Not Effective

43

Accuracy increasedNo change

Accuracy decreased < 5%Accuracy decreased > 5%

AccuracyNumber of bins:

High arm waving

Horizontal arm waving

Hammering

Hand catching

Forward punching

High throwing

Drawing X

Draw tick

Drawing circle

Hand clappingTwo hand waving

Side boxing

Bending

Forward kicking

Side kicking

Jogging

Tennis Swinging

Tennis serving

Golf Swinging

pickup throwing

Effective with:

- Periodic actions.

- Actions with unique joints trajectories.

Grouping is less effective with actions involve movement of few number of joints with small displacements.

Grouping joints into eight groups is always effective with actions of MSR3D dataset.

Summary

We proposed a novel descriptor for motion of skeleton joints.

Proposed descriptor proved to outperform the state-of-the-art descriptors such as HON4D and the one proposed by Chen et al 2013.

Our proposed approached proved to be effective for periodic actions (e.g., Waving, Walking, Jogging, Side-Boxing, etc).

Grouping was effective for actions with unique joints trajectories (e.g., Tennis serving, Side kicking , etc).

Grouping joints into eight groups is always effective with actions of MSR3D dataset.

44

The end

45

Human action recognition with kinect using a joint motion descriptor

Engineering

Transcript of Human action recognition with kinect using a joint motion descriptor