【VISAPP2016】Activity Prediction Using a Space-Time CNN and Bayesian Framework


Activity Prediction Using a Space-Time CNN and Bayesian Framework

Hirokatsu KATAOKA, Yoshimitsu AOKI†, Kenji IWATA, Yutaka SATOH

National Institute of Advanced Industrial Science and Technology (AIST) † Keio University

http://www.hirokatsukataoka.net/

Background

•  Computer vision for human sensing

–  Detection, tracking, trajectory analysis

–  Posture estimation, action analysis

–  Action recognition can extend human sensing applications

[Figure: human sensing overview. Labels: Detection, Tracking, Trajectory extraction, Face Recognition, Posture Estimation, Gaze Estimation, Action Recognition, Action Analysis, Body Situation, Attention, Mental state; example outputs: "shaking hands", "Look at people"]

Related work 1: Action Recognition

•  Action is a low-level primitive with semantic meaning

–  e.g. walking, running, sitting

[Figure: action recognition example. This image contains a man walking; the task is classification (the location is given). Output label: Walking]

Is action recognition enough?

[Figure: two time-series diagrams. Event detection (action tag: A_i) is post-detection; event prediction (prediction tag: A_j) is pre-estimation]

Related work 2: Early Action Recognition

•  Prediction from the early part of an action

–  Integral bag-of-words

–  Accumulating likelihood over the time sequence

M. S. Ryoo, “Human Activity Prediction: Early Recognition of Ongoing Activities from Streaming Videos”, International Conference on Computer Vision (ICCV), pp.1036-1043, 2011.
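The accumulation idea can be illustrated with a minimal sketch. It assumes each frame has already been quantized into a list of visual-word indices; the vocabulary size and the toy stream are hypothetical, and this is not Ryoo's exact formulation. It only shows how a running histogram makes a bag-of-words representation available after any number of observed frames.

```python
def integral_bow(frame_words, vocab_size):
    """Return one cumulative bag-of-words histogram per frame.

    frame_words[t] is the list of visual-word indices observed in frame t.
    The histogram at index t counts every word seen in frames 0..t.
    """
    running = [0] * vocab_size
    histograms = []
    for words in frame_words:
        for w in words:
            running[w] += 1
        histograms.append(list(running))  # snapshot after this frame
    return histograms


# Hypothetical toy stream: 4 frames, vocabulary of 3 visual words.
frames = [[0, 2], [2], [], [1, 2]]
histograms = integral_bow(frames, vocab_size=3)
print(histograms[1])  # histogram available after observing the first two frames
```

Because the histograms are cumulative, the histogram of any sub-interval can be obtained by subtracting two snapshots.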

Proposal

•  Action prediction within an ST-CNN and Bayesian framework

–  Action recognition

–  Database analysis

[Figure: prediction over a time series. Given: Daytime (Time Zone), Walking (Previous Action), Sitting (Current Action), i.e. x_timezone, x_previous, x_current. Not given: ??? (Next Action), θ = “Using a PC”]

Problem settings

•  Three different works in action analysis –  Action recognition

•  Recognizing $A_{t}$ given frames $1 \ldots t$

–  Early action recognition

•  Recognizing $A_{t}$ given frames $1 \ldots t-L$

–  Action Prediction

•  Recognizing $A_{t+L}$ given frames $1 \ldots t$

Approach | Setting

Action Recognition | $f(F^{A}_{1 \ldots t}) \rightarrow A_{t}$

Early Action Recognition | $f(F^{A}_{1 \ldots t-L}) \rightarrow A_{t}$

Action Prediction | $f(F^{A}_{1 \ldots t}) \rightarrow A_{t+L}$
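The difference between the three settings comes down to which frames are observed and which label is the target. The sketch below makes that explicit on a toy label sequence. The function name `make_sample`, the 0-based indexing, and the labels are illustrative assumptions, not part of the original work.

```python
def make_sample(frame_labels, t, lag, setting):
    """Return (observed_frame_indices, target_label) for one problem setting.

    frame_labels[i] is the action label of frame i (0-based here;
    the slides count frames from 1). `lag` plays the role of L.
    """
    if setting == "recognition":        # f(F_{1..t})   -> A_t
        return list(range(t + 1)), frame_labels[t]
    if setting == "early_recognition":  # f(F_{1..t-L}) -> A_t
        return list(range(t - lag + 1)), frame_labels[t]
    if setting == "prediction":         # f(F_{1..t})   -> A_{t+L}
        return list(range(t + 1)), frame_labels[t + lag]
    raise ValueError(f"unknown setting: {setting}")


# Hypothetical 10-frame sequence: walk -> sit -> use a PC.
labels = ["walk"] * 4 + ["sit"] * 3 + ["use a PC"] * 3

for name in ("recognition", "early_recognition", "prediction"):
    observed, target = make_sample(labels, t=5, lag=3, setting=name)
    print(name, "observes", len(observed), "frames, target:", target)
```

Only the prediction setting asks for a label that lies beyond the last observed frame.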

Process flow

•  Consists of (i) action recognition and (ii) action prediction

1.  Action recognition

1.1 Improved dense trajectories (IDT)

1.2 Space-time convolutional neural networks (ST-CNN)

2.  Action prediction

2.1 Bayesian framework

2.2 Database

[Figure: process flow. Input and pedestrian detection feed two branches. IDT: trajectory (in t + L frames), feature extraction (HOG, HOF, MBH, Traj.), bag-of-words (BoW). ST-CNN: Oxford VGG architecture (VGGNet), drawn as five Conv-Conv-Pool blocks followed by FC]

Action Recognition (1/2)

•  Improved Dense Trajectories (IDT) [Wang+, ICCV2013]

–  Pyramidal image sequences and flow tracking

–  Feature descriptors on trajectories

–  Feature representation with bag-of-words (BoW)

[Figure labels: sitting, walking]
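The bag-of-words step can be sketched as follows. Each trajectory descriptor is assigned to its nearest codeword and the assignments are counted. The codebook here is a hypothetical fixed one (in practice it would typically be learned, for example by k-means), and the L1 normalization is an assumption, not something the slides specify.

```python
def bow_histogram(descriptors, codebook):
    """Encode a set of descriptors as a normalized codeword histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        counts[dists.index(min(dists))] += 1
    total = sum(counts)
    # L1 normalization (assumed) so videos with different numbers of
    # trajectories remain comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical 2-D descriptors and a 2-word codebook.
codebook = [(0.0, 0.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (0.2, 0.1)]
video_vector = bow_histogram(descriptors, codebook)
print(video_vector)
```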

Action Recognition (1/2)

•  IDT + Co-occurrence HOG [Kataoka+, ACCV2014]

–  CoHOG: edge-pair counting into the corresponding histogram position

–  Extended CoHOG (ECoHOG): edge-magnitude accumulation

–  PCA dimensionality reduction: $10^{3}$-$10^{4}$ dimensions down to $10^{1}$-$10^{2}$, which makes classes easier to separate in feature space
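The contrast between counting and magnitude accumulation can be sketched for a single pixel offset. The inputs are a grid of quantized gradient orientations and a grid of gradient magnitudes. Voting with the sum of the two magnitudes is an assumption about the exact ECoHOG accumulation rule, and the function name and toy grids are hypothetical.

```python
def cooccurrence_histogram(bins, mags, offset, n_bins, weighted):
    """Co-occurrence histogram of orientation pairs for one offset.

    weighted=False: CoHOG-style, each edge pair adds a count of 1.
    weighted=True:  ECoHOG-style, each pair adds the sum of its two
                    gradient magnitudes (assumed accumulation rule).
    """
    dy, dx = offset
    h, w = len(bins), len(bins[0])
    hist = [[0.0] * n_bins for _ in range(n_bins)]
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:  # skip pairs that leave the grid
                vote = mags[y][x] + mags[y2][x2] if weighted else 1.0
                hist[bins[y][x]][bins[y2][x2]] += vote
    return hist


# Hypothetical 2x2 patch with 2 orientation bins; offset = right-hand neighbor.
bins = [[0, 1], [1, 0]]
mags = [[1.0, 2.0], [3.0, 4.0]]
cohog = cooccurrence_histogram(bins, mags, (0, 1), n_bins=2, weighted=False)
ecohog = cooccurrence_histogram(bins, mags, (0, 1), n_bins=2, weighted=True)
```

With counting, both edge pairs contribute equally; with magnitude accumulation, the stronger edge pair dominates its histogram cell.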

Action Recognition (2/2)

•  Space-time Convolutional Neural Networks (ST-CNN)

–  Based on VGG 16-layer architecture (VGGNet) [Simonyan+, ICLR2015]

–  Spatio-temporal feature concatenation (around 10 frames)

[Figure: Space-time CNN (ST-CNN) feature. CNN architecture with VGGNet, drawn as Input, five Conv-Conv-Pool blocks, three FC layers, Softmax; the per-frame features are repeated over time (・・・)]
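The spatio-temporal concatenation can be sketched as stacking per-frame CNN feature vectors over a short window. The slides only say "around 10 frames"; which layer the features come from, how the window is aligned, and the edge clamping used here are all assumptions, and `st_feature` is a hypothetical helper.

```python
def st_feature(frame_features, center, window=10):
    """Concatenate per-frame feature vectors from a temporal window.

    frame_features[t] is the feature vector of frame t (for example an
    FC-layer activation). The window is clamped to stay inside the video.
    """
    start = max(0, min(center - window // 2, len(frame_features) - window))
    chunk = frame_features[start:start + window]
    return [value for feat in chunk for value in feat]


# Hypothetical video: 12 frames, 2-dimensional per-frame features.
features = [[float(i), i + 0.5] for i in range(12)]
vec = st_feature(features, center=6, window=4)
print(len(vec))  # window * feature dimension
```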

Action Prediction (1/2)

•  Prediction model

- Action sequence: predicting “Using a PC” from “Walk” => “Sit”

- Time zone (supplemental info.): daytime

[Figure: prediction over a time series. Given: Daytime (Time Zone), Walking (Previous Activity), Sitting (Current Activity), i.e. x_timezone, x_previous, x_current. Not given: ??? (Next Activity), θ = “Using a PC”]
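One way to read this model is as a posterior over the next activity θ given the three observations. The slides do not state the factorization; the form below assumes the observations are conditionally independent given θ (a naive Bayes assumption):

$$P(\theta \mid x_{timezone}, x_{previous}, x_{current}) \propto P(\theta)\, P(x_{timezone} \mid \theta)\, P(x_{previous} \mid \theta)\, P(x_{current} \mid \theta)$$

$$\hat{\theta} = \arg\max_{\theta} P(\theta \mid x_{timezone}, x_{previous}, x_{current})$$

Under this reading, the prior and the three likelihood terms would be estimated from counts in the action history database.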

Action Prediction (2/2)

•  Database: ST-action tags + attribute

–  Time zone

•  “morning”, “day time”, “night”

–  Previous & current action

•  “walk”, “bend”, “stand”, “sit”…

–  Next action (objective)

•  “use a PC”, “read”, “meal”…

[Figure: Action History DB. Example record: Daytime, Walking, Sitting, Using a PC]
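A minimal sketch of how such a database could drive the prediction is shown below. It assumes a naive Bayes model with Laplace smoothing, which the slides do not confirm; the records, function names, and smoothing constant are hypothetical.

```python
from collections import Counter, defaultdict


def train(records):
    """records: list of (timezone, previous, current, next_action) tuples."""
    prior = Counter(r[3] for r in records)
    likelihood = defaultdict(Counter)  # (attribute index, next_action) -> value counts
    values = [set(), set(), set()]     # distinct values seen per attribute
    for r in records:
        for i in range(3):
            likelihood[(i, r[3])][r[i]] += 1
            values[i].add(r[i])
    return prior, likelihood, values


def predict(model, x, alpha=1.0):
    """Return a normalized posterior over next actions for x = (tz, prev, cur)."""
    prior, likelihood, values = model
    total = sum(prior.values())
    scores = {}
    for theta, n_theta in prior.items():
        p = n_theta / total
        for i, xi in enumerate(x):
            # Laplace smoothing (alpha) is an assumption of this sketch.
            p *= (likelihood[(i, theta)][xi] + alpha) / (n_theta + alpha * len(values[i]))
        scores[theta] = p
    z = sum(scores.values())
    return {theta: p / z for theta, p in scores.items()}


# Hypothetical action history database.
records = [
    ("daytime", "walk", "sit", "use a PC"),
    ("daytime", "walk", "sit", "use a PC"),
    ("daytime", "stand", "sit", "use a PC"),
    ("night", "walk", "sit", "read"),
    ("night", "stand", "sit", "read"),
    ("morning", "walk", "sit", "meal"),
]
posterior = predict(train(records), ("daytime", "walk", "sit"))
best = max(posterior, key=posterior.get)
print(best)
```

The smoothing keeps attribute values that never co-occurred with an action from forcing its probability to zero.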

Experiments on the Daily Living Data

–  Total 20 h of video

–  3 different scenes

–  640x480, 30 fps

Results

•  Action recognition

–  IDT (HOG, HOF, MBH, CoHOG, ECoHOG, All)

–  Per-frame CNN

–  ST-CNN

–  Combined vector
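The combined vector can be sketched as a concatenation of the IDT and ST-CNN representations. Normalizing each part before concatenation is an assumption made here so that neither representation dominates by scale; the slides do not say whether or how the vectors are normalized.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def combine(idt_vec, stcnn_vec):
    """Concatenate the two representations after per-part L2 normalization (assumed)."""
    return l2_normalize(idt_vec) + l2_normalize(stcnn_vec)


# Hypothetical low-dimensional stand-ins for the two feature vectors.
combined = combine([3.0, 4.0], [0.0, 2.0])
print(combined)
```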

Results

•  Action prediction

[Figure: prediction examples. Panel labels: Time Attributes, Action, Estimated Intention, Predicted activity. Example outputs: PC (0.82), Read (0.11); Read (1.00), PC (0.00)]

Conclusion

•  Action prediction approach based on action recognition and database analysis

–  Concatenated vector of IDT and ST-CNN

–  Bayesian framework

–  Database