Human Action Recognition in Videos Employing
2DPCA on 2DHOOF and Radon Transform
Presented in Partial Fulfillment of the Requirements of the Degree of Master of Science in the School of Communication and Information Technology
Fadwa Fawzy Fouad
Supervisor: Dr. Moataz M. Abdelwahab
Agenda
Introduction
Quick overview
2DHOOF/2DPCA Contour Based Optical Flow Algorithm
Human Gesture Recognition Employing Radon Transform/2DPCA
Introduction
• Importance & Applications
• Action vs. Activity
• Challenges & characteristics of the domain
Importance & Applications
Human action/activity recognition is one of the most promising applications of computer vision. Interest in this topic is motivated by many promising applications, including:
• character animation for games and movies
• advanced intelligent user interfaces
• biomechanical analysis of actions for sports and medicine
• automatic surveillance
Action vs. Activity
Action
Simple motion pattern
Single person
Short time duration
Activity
Complex sequence of actions
Single/multiple person(s)
Long time duration
Challenges and characteristics of the domain
The difficulty of the recognition process is associated with multiple sources of variation:
Inter- and intra-class variations
Environmental Variations and Capturing conditions
Temporal variations
• Intra-class variations (variations within a single class)
Variations in the performance of a certain action due to anthropometric differences between individuals. For example, running movements can differ in speed and stride length.
• Inter-class variations (variations between different classes)
Overlap between different action classes due to similarity in how the actions are performed.
• Environmental variations
Disturbances originating from the actor's surroundings, including dynamic or cluttered environments, illumination variation, and body occlusion.
• Capturing conditions
These depend on the method used to capture the scene, whether single or multiple static/dynamic camera system(s).
• Temporal variations
These include changes in the performance rate from one person to another, as well as changes in the recording rate (frames/sec).
Agenda
Introduction
Quick overview
2DHOOF/2DPCA Contour Based Optical Flow Algorithm
Human Gesture Recognition Employing Radon Transform/2DPCA
Overview
The main structure of the action recognition system
The structure of the action recognition system is typically hierarchical:
[Flowchart: Start → Capture the input video → Human detection & segmentation → Extraction of the action descriptors → Action classification → End]
Capture the input video
With a single camera, the scene is captured from only one viewpoint, so it cannot provide enough information about the performed action in the case of a poor viewpoint. Besides, it cannot handle the occlusion problem.
[Figure: sample single-view videos 1–4]
Multi-camera systems can capture the same scene from different poses, so they provide sufficient information to alleviate the occlusion problem.
[Figure: the same action viewed from Cameras 0–3]
The new Kinect depth camera technology can also be utilized to capture the performed actions. The device has an RGB camera, a depth sensor, and a multi-array microphone.
It provides full-body 3D motion capture, facial recognition, and voice recognition capabilities. Furthermore, the depth information can be used for segmentation.
[Figure: Kinect depth camera, showing RGB information and depth information]
Human detection & segmentation
This is the first step of the full process of human sequence evaluation.
Techniques can be divided into:
• Background Subtraction techniques (a minimal sketch follows this list)
• Motion Based techniques
• Appearance Based techniques
• Depth Based Segmentation
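The following Python sketch illustrates the first family only; it is not the segmentation used in this work, and the file name and parameters are placeholders. It extracts a rough foreground silhouette with OpenCV's MOG2 background subtractor:

```python
# Background-subtraction-based silhouette extraction (illustrative sketch).
import cv2

cap = cv2.VideoCapture("action.avi")  # placeholder input video
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)        # raw foreground mask (shadows = 127)
    mask = cv2.medianBlur(mask, 5)        # suppress speckle noise
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)  # drop shadows
    # 'mask' now approximates the actor's silhouette for this frame
cap.release()
```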
Extraction of the action descriptors
Input videos consist of massive amounts of information in the form of spatio-temporal pixel-intensity variations, but most of this information is not directly relevant to understanding and identifying the activity occurring in the video.
In this work we used non-parametric approaches, in which a set of features is extracted per video frame; these features are then accumulated and matched against stored templates.
Example: the Motion Energy Image (MEI) & Motion History Image (MHI), sketched below.
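A minimal numpy sketch of the MEI/MHI idea, assuming `silhouettes` is a T x H x W binary array of per-frame actor masks; it only loosely follows the classic Bobick-Davis formulation:

```python
import numpy as np

def mei_mhi(silhouettes, tau=None):
    """Motion Energy Image and Motion History Image from binary silhouettes."""
    T = len(silhouettes)
    tau = tau or T                       # history length (placeholder default)
    mhi = np.zeros(silhouettes[0].shape, dtype=float)
    for t in range(1, T):
        motion = silhouettes[t] ^ silhouettes[t - 1]          # changed pixels
        mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))   # decay old motion
    mei = (mhi > 0).astype(np.uint8)     # MEI: where any recent motion occurred
    return mei, mhi / tau                # MHI normalized to [0, 1]
```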
When the extracted features are available for an input video, human action recognition becomes a classification problem.
Dimensionality reduction is a common step before the actual classification and is discussed first.
Action classification
Dimensionality reduction
Image representations are often high-dimensional. This makes the matching task computationally more expensive, and the representation might contain noisy features. This problem triggered the idea of obtaining a more compact, robust feature representation by projecting the image representation into a lower-dimensional space.
Example: One/Two-Dimensional Principal Component Analysis (1DPCA/2DPCA)
Nearest neighbor classification
k-Nearest-neighbor (k-NN) classifiers use the distance between the features of an observed sequence and those in a training set. The most common label among the k closest training sequences is chosen as the classification.
NN classification can be performed either at the frame level or over whole video sequences. In the latter case, issues with different frame lengths need to be resolved.
In our work we used 1-NN with Euclidean distance to classify the tested actions, as in the sketch below.
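A minimal sketch of this final classification step; `train_feats` (N x d) and `labels` (length N) stand for the stored training features and their action labels:

```python
import numpy as np

def classify_1nn(test_feat, train_feats, labels):
    dists = np.linalg.norm(train_feats - test_feat, axis=1)  # Euclidean distances
    return labels[np.argmin(dists)]                          # nearest sample's label
```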
Agenda
Introduction
Quick overview
2DHOOF/2DPCA Contour Based Optical Flow Algorithm
Human Gesture Recognition Employing Radon Transform/2DPCA
2DHOOF/2DPCA Contour Based Optical Flow Algorithm
• Dense vs. Sparse OF
• Alignment issues with OF
• The Calculation of the 2D Histogram of Optical Flow (2DHOOF)
• Overall System Description
• Experimental Results
Dense vs. Sparse OF
In practice, dense OF is not the best choice for obtaining the OF. Besides its high computational complexity, it is not accurate for homogeneous moving objects (the aperture problem). A sketch of the sparse alternative follows.
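For illustration, a hedged sketch of sparse OF on the actor's contour points with OpenCV's pyramidal Lucas-Kanade; the window size and pyramid depth are placeholder values, not the settings used in the thesis:

```python
import cv2
import numpy as np

def contour_flow(prev_gray, next_gray, contour_pts):
    """Sparse OF at contour points; contour_pts is a K x 1 x 2 float32 array."""
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, contour_pts, None,
        winSize=(15, 15), maxLevel=2)
    good = status.ravel() == 1                 # keep reliably tracked points
    flow = (new_pts - contour_pts)[good]       # per-point motion vectors
    return contour_pts[good], flow
```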
Alignment issues with OF
We had two choices for the order of actor alignment:
• Align the actor, then calculate the OF
• Calculate the OF, then align it
[Figure: jumping and transition effects in the Running action under both orderings]
The Calculation of the 2D Histogram of Optical Flow (2DHOOF)
[Figure: obtaining the n-layer 2DHOOF for any two successive frames: the calculated OF is binned into W/m × H/m × n histogram layers, and the accumulated 2DHOOF represents the whole video; a minimal sketch follows]
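A minimal sketch of the n-layer 2DHOOF accumulation for one frame pair, assuming `points` (K x 2 pixel coordinates) and `flows` (K x 2 motion vectors) come from the sparse contour OF; the cell size m and layer count n are placeholders for the paper's parameters:

```python
import numpy as np

def hoof2d(points, flows, H, W, m=8, n=4):
    """Magnitude-weighted (H/m) x (W/m) x n histogram of flow orientations."""
    hist = np.zeros((H // m, W // m, n))
    angles = np.arctan2(flows[:, 1], flows[:, 0]) % (2 * np.pi)
    mags = np.linalg.norm(flows, axis=1)
    layers = (angles / (2 * np.pi) * n).astype(int) % n       # orientation bin
    rows = np.clip(points[:, 1].astype(int) // m, 0, H // m - 1)
    cols = np.clip(points[:, 0].astype(int) // m, 0, W // m - 1)
    for r, c, l, w in zip(rows, cols, layers, mags):
        hist[r, c, l] += w                                    # weighted vote
    return hist

# The video-level descriptor is the sum of hoof2d() over all frame pairs.
```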
1DHOOF vs. 2DHOOF
[Figure: confusion between the Wave and Bend actions when using 1DHOOF]
Overall System Description
Training mode: Segmentation & Contour Extraction → Sparse OF → 2DHOOF → 2DPCA → Extract the dominant vectors → Store extracted features
Testing mode: Segmentation & Contour Extraction → Sparse OF → 2DHOOF → Projection on the dominant vectors → Classification and Voting Scheme
Training Mode
Segmentation & Contour Extraction (Method 1)
• Geodesic segmentation
[Pipeline: Input Video Frame → Face Detection → Initial Stroke → Blob Extraction → Final Contour]
GD(x): the geodesic distance of pixel x from the stroke pixels, where x_i denotes the stroke pixels (black), x the other pixels (white), and I the image intensity.
Segmentation & Contour Extraction (Method 2)
• Contour extraction from the magnitude of the dense OF
An edge pixel satisfies specific criteria based on its 3 × 3 neighboring pixels.
[Figure: steps of contour extraction from the dense OF, applying the edge criteria to the magnitude of the dense OF; a stand-in sketch follows]
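A hedged stand-in for Method 2: dense Farneback OF followed by a simple 3 x 3 max-min contrast test on the flow magnitude. The thesis' exact edge criteria are not reproduced, and the contrast threshold is a placeholder:

```python
import cv2
import numpy as np

def contour_from_flow(prev_gray, next_gray, contrast=1.0):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)         # per-pixel flow magnitude
    kernel = np.ones((3, 3), np.uint8)
    local_max = cv2.dilate(mag, kernel)        # 3 x 3 neighborhood maximum
    local_min = cv2.erode(mag, kernel)         # 3 x 3 neighborhood minimum
    return (local_max - local_min) > contrast  # edge where the magnitude jumps
```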
Training Mode
2DHOOF-2DPCA Features Extraction
[Diagram: 2DHOOF of training videos → Mean/Layer → Covariance/Layer → Dominant Vectors/Layer → Projection → Final Features; a minimal sketch follows]
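A minimal sketch of per-layer 2DPCA as depicted above: the image covariance matrix is built from the training histograms of one layer, its dominant eigenvectors are kept, and the features are the projections onto them:

```python
import numpy as np

def train_2dpca(layer_mats, k):
    """layer_mats: list of N training matrices (H' x W') for one 2DHOOF layer."""
    A = np.stack(layer_mats)                  # N x H' x W'
    mean = A.mean(axis=0)                     # mean per layer
    G = sum((X - mean).T @ (X - mean) for X in A) / len(A)   # W' x W' covariance
    vals, vecs = np.linalg.eigh(G)
    V = vecs[:, np.argsort(vals)[::-1][:k]]   # k dominant vectors (W' x k)
    return mean, V

def project_2dpca(X, mean, V):
    return (X - mean) @ V                     # H' x k feature matrix
```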
Testing Mode
Projection on the dominant vectors & Classification
[Diagram: the test video's features are projected on the dominant vectors, distances D1, D2, D3, …, Dj to the stored classes are computed, and the final decision is based on the minimum D value]
Experimental Results
Two experiments were conducted to evaluate the performance of the proposed algorithm.
• For the first experiment, the Weizmann dataset was used to measure the performance of low-resolution single-camera operation.
• For the second experiment, the IXMAS multi-view dataset was used to evaluate the performance of the parallel camera structure.
Both experiments were conducted using the Leave-One-Actor-Out (LOAO) technique to be consistent with the most recent algorithms.
Both datasets provide RGB frames and the actors' silhouettes.
Weizmann dataset
The Weizmann dataset consists of 90 low-resolution video sequences showing 9 different actors, each performing 10 natural actions: walk, run, jump forward, gallop sideways, bend, wave with one hand (wave1), wave with two hands (wave2), jump in place (Pjump), jump-jack, and skip.
[Figure: sample frames: Bend, Run, Jump, Jump-jack, Gallop]
The confusion matrix for this experiment (2DHOOF/2DPCA) shows that the average recognition accuracy is 97.78%, with eight actions recognized with 100% accuracy.
On the other hand, using 1DHOOF with 1DPCA decreases the accuracy to 63.34% because of the large confusion between actions (as discussed before).
Comparison with the most recent algorithms:

• Recognition Accuracy
Method | Accuracy
Previous Contribution | 98.89%
Our Algorithm | 97.79%
Shah et al. | 95.57%
Yang et al. | 92.8%
Yuan et al. | 92.22%

• Average Testing Time
Method | Average Runtime
Our Algorithm | 66.11 msec
Previous Contribution | 113.00 msec
Shah et al. | 18.65 sec
Blank et al. | 30 sec
[Figure: samples from the calculated contour OF: Walk, Skip, P-jump]
IXMAS Dataset
The proposed parallel-structure algorithm was applied to the IXMAS multi-view dataset. Each camera is considered an independent system; a voting scheme is then carried out among the four cameras to obtain the final decision (a minimal sketch follows the diagram below).
[Diagram: Cameras 0–3 are each processed by our algorithm independently → Voting Scheme → Final Decision]
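A minimal sketch of such a voting scheme. The tie-breaking rule used here (fall back to the camera with the smallest NN distance) is an assumption; the slides do not specify it:

```python
from collections import Counter

def vote(camera_labels, camera_dists):
    """Majority vote over per-camera decisions, e.g. 4 labels and 4 distances."""
    counts = Counter(camera_labels)
    best, n = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == n) == 1:
        return best                                    # clear majority
    return min(zip(camera_dists, camera_labels))[1]    # tie: most confident camera
```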
This dataset consists of 5 cameras capturing the scene and 12 actors, each performing 13 natural actions 3 times, where the actors are free to change their orientation in each scenario.
The actions include: check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, and pick up and throw.
[Figure: example from the IXMAS multi-camera dataset, action: Pick up and Throw, seen from Cameras 0–3]
The confusion matrix for the IXMAS dataset shows that the average accuracy is 87.12%, where SH = Scratch head, CW = Check watch, CA = Cross arms, SD = Sit down, GU = Get up, TA = Turn around, and PU = Pick up.
Method | Actors # | Cam(0) % | Cam(1) % | Cam(2) % | Cam(3) % | Overall Vote %
Proposed Algorithm | 12 | 97.29 | 79.04 | 72.47 | 78.53 | 87.12
Previous Contribution | 12 | 78.9 | 78.61 | 80.93 | 77.38 | 84.59
Weinland et al. | 10 | 65.04 | 70.00 | 54.30 | 66.00 | 81.30
Srivastava et al. | 10 | N/A | N/A | N/A | N/A | 81.40
Shah et al. | 12 | 72.00 | 53.00 | 68.00 | 63.00 | 78.00
Comparison with the best reported accuracies shows that we achieved the highest overall accuracy, an enhancement of about 3% over the previous contribution.
N/A = not available in published reports.
[Figure: samples from the calculated contour OF: Walk, Sit down, Kick]
Published Paper
F. Fawzy, M. Abdelwahab, and W. Mikhael, "2DHOOF-2DPCA Contour Based Optical Flow Algorithm for Human Activity Recognition," IEEE International Midwest Symposium on Circuits and Systems (MWSCAS 2013), Ohio, USA.
Agenda
Introduction
Quick overview
2DHOOF/2DPCA Contour Based Optical Flow Algorithm
Human Gesture Recognition Employing Radon Transform/2DPCA
Human Gesture Recognition Employing Radon Transform/2DPCA
• Radon Transform (RT)
• Overall system description
Radon Transform
The RT computes projections of an image matrix along specified directions. A projection of a two-dimensional function f(x,y) is a set of line integrals along parallel paths, or beams.
Projections can be computed along any angle θ using the general equation of the Radon Transform:

R_θ(x′) = ∫∫ f(x, y) δ(x cos θ + y sin θ − x′) dx dy

where δ(·) is the Dirac delta function, which is nonzero only when its argument equals 0, x′ is the projection direction, and θ is the orientation of this direction.
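For illustration, the projections of a toy silhouette can be computed with scikit-image's `radon`; the thesis does not specify which implementation was used:

```python
import numpy as np
from skimage.transform import radon

image = np.zeros((64, 64))
image[20:44, 28:36] = 1.0              # toy binary silhouette
theta = np.arange(0.0, 180.0)          # projection angles in degrees
sinogram = radon(image, theta=theta)   # one column of line integrals per angle
```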
Overall system description
The proposed system is designed and tested for gesture recognition and can be extended to regular action recognition.
We have two modes for this algorithm:
• Training Mode
• Testing Mode
Both have a pre-processing step before feature extraction.
Training Mode
Pre-processing Step: 1) Input videos
The One Shot Learning ChaLearn Gesture Dataset was used for this experiment. In this dataset, a single user facing a fixed Kinect™ camera interacts with a computer by performing gestures.
Videos are represented by RGB and depth images.
Each actor has 8 to 15 different gestures (the vocabulary) for training, and 47 input videos, each containing 1 to 5 gestures, for testing.
We applied our algorithm to a subset of this dataset consisting of 37 different actors.
The dataset can be divided into two main groups: standing actors and sitting actors. In this experiment we used a subset of the standing-actor group, in which actors use their whole body to perform the gesture and make significant motion that can be captured by the MEI and MHI.
[Figure: standing actors vs. sitting actors]
Also, we used only the depth videos as input. Depth information makes the segmentation task easier than using RGB or grey videos, especially when the actor's clothes have the same color as the background or the background is textured.
Training Mode
Pre-processing Step: 2) Segmentation & Blob extraction
We used the Basic Global Thresholding algorithm to extract the actor's blob (a numpy sketch follows the steps below).
1. Select an initial estimate for T (typically the average grey level in the image).
2. Segment the image using T into two groups of pixels: G1, consisting of pixels with grey levels > T, and G2, consisting of pixels with grey levels ≤ T.
3. Compute the average grey levels μ1 and μ2 of the pixels in G1 and G2, respectively.
4. Compute a new threshold value: T = (μ1 + μ2) / 2. Repeat steps 2–4 until the change in T is less than 1 or the total number of iterations exceeds 10.
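A direct numpy sketch of these four steps:

```python
import numpy as np

def global_threshold(img, eps=1.0, max_iter=10):
    T = img.mean()                             # step 1: initial estimate
    for _ in range(max_iter):
        g1, g2 = img[img > T], img[img <= T]   # step 2: split at T
        mu1 = g1.mean() if g1.size else T      # step 3: group means
        mu2 = g2.mean() if g2.size else T
        T_new = 0.5 * (mu1 + mu2)              # step 4: new threshold
        if abs(T_new - T) < eps:
            break
        T = T_new
    return img > T                             # actor blob (foreground mask)
```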
In some cases the resultant blob contains extra objects. This noise comes from objects that were at the same depth as the actor.
[Figure: noisy blobs, Cases 1–3]
In this situation we perform a noise-elimination step.
[Figure: blobs after noise elimination, Cases 1–3]
Training Mode
Alignment using RT of the First Frame
• Vertical alignment using the projection on the y-axis (90° from the RT)
• Horizontal alignment using the projection on the x-axis (0° from the RT)
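A hedged numpy sketch of the idea: for a binary frame, the 90° and 0° Radon projections reduce to row and column sums, whose nonzero extents locate the actor for centering. The axis conventions and the centering rule here are assumptions:

```python
import numpy as np

def align_first_frame(blob):
    rows = np.flatnonzero(blob.sum(axis=1))    # vertical extent (90-degree proj.)
    cols = np.flatnonzero(blob.sum(axis=0))    # horizontal extent (0-degree proj.)
    r_shift = blob.shape[0] // 2 - (rows[0] + rows[-1]) // 2
    c_shift = blob.shape[1] // 2 - (cols[0] + cols[-1]) // 2
    return np.roll(np.roll(blob, r_shift, axis=0), c_shift, axis=1)
```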
Training Mode
Calculate the MEI and MHI
[Figure: MEI and MHI computed for the whole body and for body parts]
Training Mode
Get Radon Transform for MEI and MHI
Basically, the difference between the RT of the whole body and the RT of the body parts is the white portion in the center, which represents the projection of the actor's body.
Testing Mode
Video Chopping
As we have mentioned, the testing videos may contain 1 to 5 different gestures per video. In this case we need to separate these gestures into one gesture per video before testing. We do that by two main steps, detailed below (a sketch follows the criteria):
1. Calculate the plot that represents the moving area per frame.
2. Apply the local-minima criteria on this plot.
We are searching for a frame i that satisfies the following conditions:
a) The number of frames before frame i is greater than or equal to the frame threshold.
b) The decrease in the area at i is greater than 50% of the peak value.
c) The area at i−1 and at i+1 is greater than the area at i, to ensure that i is a local minimum between two peaks.
[Figure: good vs. bad chopping results]
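A hedged sketch of the three conditions; `frame_thresh` and the peak-reset behavior after each cut are assumptions about details the slides do not state:

```python
def chop_points(area, frame_thresh=10):
    """area: sequence of the moving area per frame; returns cut-frame indices."""
    cuts, peak, last = [], 0.0, 0
    for i in range(1, len(area) - 1):
        peak = max(peak, area[i])
        if (i - last >= frame_thresh                       # condition (a)
                and area[i] < 0.5 * peak                   # condition (b)
                and area[i - 1] > area[i] < area[i + 1]):  # condition (c)
            cuts.append(i)
            last, peak = i, 0.0                # start a new gesture segment
    return cuts
```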
Experimental Results
We conducted four One Shot Learning (OSL) experiments:
• Radon Transform features, classified with 2DPCA and with direct correlation (Experiments I and II)
• MEI/MHI features, classified with 2DPCA and with direct correlation (Experiments III and IV)
Recognition accuracy (%) of the four experiments:

Features | Experiment | Whole Body MEI | Whole Body MHI | Body Parts MEI | Body Parts MHI
RT | I | 71 | 69 | 82 | 81.5
RT | II | 70 | 70 | 81.7 | 81.6
MEI/MHI | III | 70 | 68 | 82 | 81.7
MEI/MHI | IV | 71.24 | 68.7 | 83.33 | 82.9
Comparison between using RT and using MEI/MHI directly without RT:

Features | % Maintained Energy | Storage Requirements
RT | 99% | 72 Mbytes
MEI/MHI | 88% | 102 Mbytes

Using RT cuts the storage requirement by roughly 30% while maintaining more of the energy, making it the better input for 2DPCA.
Thank You