FaceTrack: Tracking and summarizing faces from compressed video (Hualu Wang, Harold S. Stone*, Shih-Fu Chang)
FaceTrack: Tracking and summarizing faces from compressed video
Hualu Wang, Harold S. Stone*, Shih-Fu Chang
Dept. of Electrical Engineering, Columbia University
*NEC Research Institute

Presented by Andy Rova
School of Computing Science, Simon Fraser University (CMPT 820)
March 15, 2005
Introduction
- FaceTrack: a system for both tracking and summarizing faces in compressed video data
- Tracking: detect faces and trace them through time in video shots
- Summarizing: cluster the faces across video shots and associate them with different people
- Compressed video: avoids the costly overhead of decoding prior to face detection
System Overview
- The FaceTrack system's goals are related to ideas discussed in previous presentations
- A face-based video summary can help users decide whether they want to download the whole video
- The summary provides good visual indexing information for a database search engine
Problem Definition
- The goal of the FaceTrack system is to take an input video sequence, generate a list of the prominent faces that appear in the video, and determine the time periods where each face appears
General Approach
- Track faces within shots
- Once tracking is done, group faces across video shots into the faces of different people
- Output a list of faces for each sequence; for each face, list the shots where it appears, and when
- Face recognition is not performed: it is very difficult in unconstrained videos due to the broad range of face sizes, numbers, orientations, and lighting conditions
General Approach
- Try to work in the compressed domain as much as possible
- MPEG-1 and MPEG-2 videos, used in applications such as digital TV and DVD
- Macroblocks and motion vectors can be used directly in tracking
- Greater computational speed compared to decoding
- Select frames can always be decoded down to the pixel level for further analysis, for example when grouping faces across shots
MPEG Review
- Three types of frame data:
  - Intra-frames (I-frames)
  - Forward predictive frames (P-frames)
  - Bidirectional predictive frames (B-frames)
- Macroblocks are coding units which combine pixel information via the DCT; luminance and chrominance are separated
- P-frames and B-frames are subject to motion compensation: motion vectors are found and their differences are encoded
System Diagram
[system diagram figure not reproduced in the transcript]
Face Tracking
- Challenges:
  - Locations of detected faces may not be accurate, since the face detection algorithm works on 16x16 macroblocks
  - False alarms and misses
  - Multiple faces cause ambiguities when they move close to each other
  - The motion approximated by the MPEG motion vectors may not be accurate
- A tracking framework that can handle these issues in the compressed domain is needed
The Kalman Filter
- A linear, discrete-time dynamic system is defined by difference equations of the standard form
    x(k+1) = A x(k) + w(k)    (state equation, with process noise w)
    z(k)   = H x(k) + v(k)    (observation equation, with measurement noise v)
- We only have access to the sequence of measurements z(k)
- Given this noisy observation data, the problem is to find the optimal estimate of the unknown system state variables x(k)
The Kalman Filter
- The "filter" is actually an iterative algorithm which keeps taking in new observations
- The new states x(k) are successively estimated
- The error of the prediction of the observation is called the innovation
- The innovation is amplified by a gain matrix and used as a correction for the state prediction
- The corrected prediction is the new state estimate
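The predict/correct cycle above can be sketched in a few lines. This is a minimal one-dimensional illustration; the noise variances q and r are made-up values, not settings from the FaceTrack paper.

```python
# Minimal 1-D Kalman filter illustrating the predict/correct cycle:
# predict, compute the innovation, scale it by the gain, correct.
# q and r (process and measurement noise variances) are illustrative.

def kalman_1d(measurements, q=1e-3, r=0.1):
    x, p = measurements[0], 1.0      # initial state estimate and variance
    estimates = []
    for z in measurements:
        p = p + q                    # predict: uncertainty grows
        innovation = z - x           # error of the prediction
        k = p / (p + r)              # gain amplifies the innovation
        x = x + k * innovation       # corrected prediction = new estimate
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates
```

With a noisy but roughly constant signal, the estimates settle near the underlying value even though individual measurements jitter.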
The Kalman Filter
- In the FaceTrack system, the state vector of the Kalman filter holds the kinematic information of the face: position, velocity, and sometimes acceleration
- The observation vector is the position of the detected face, which may not be accurate
- The Kalman filter lets the system predict and update the positions and parameters of the faces
The Kalman Filter
- The FaceTrack system uses a 0.1-second time interval for state updates
- This corresponds to every I-frame and P-frame for a typical MPEG GOP ("Group Of Pictures") structure, for example IBBPBBP...
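As a quick sanity check of that timing claim, a toy helper can list when the I- and P-frames of one GOP occur. The 30 fps rate and the nine-frame pattern here are illustrative assumptions.

```python
# Timestamps of the I- and P-frames in one GOP: these are the frames used
# for Kalman state updates. Assumes a 30 fps stream (illustrative).

def update_times(gop_pattern, fps=30.0):
    return [i / fps for i, f in enumerate(gop_pattern) if f in "IP"]

# For "IBBPBBPBB", I/P frames fall every 3 frames: 0.1 s apart at 30 fps
```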
The Kalman Filter
- For I-frames, the face detector results are used directly
- For P-frames, the face detector results are more prone to false alarms
- Instead, P-frame face locations are predicted (approximately) from the MPEG motion vectors
- These locations are then fed into the Kalman filter as observations (in contrast with previous trackers, which assumed the motion-vector-derived locations alone were correct)
The Face Tracking Framework
- How can new faces be discriminated from previous ones during tracking?
- The Mahalanobis distance is a quantitative indicator of how close the new observation is to the prediction
- This can help separate new faces from existing tracks: if the Mahalanobis distance is greater than a certain threshold, the newly detected face is unlikely to belong to that existing track
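The gating test just described can be sketched as follows. The diagonal innovation covariance and the threshold value are illustrative assumptions, not the paper's settings.

```python
import math

# Mahalanobis gating: how far is the new observation from the track's
# prediction, measured in units of the innovation's standard deviation?
# Assumes a diagonal innovation covariance (variances in x and y).

def mahalanobis_2d(obs, pred, var_x, var_y):
    dx, dy = obs[0] - pred[0], obs[1] - pred[1]
    return math.sqrt(dx * dx / var_x + dy * dy / var_y)

def belongs_to_track(obs, pred, var_x, var_y, threshold=3.0):
    # Beyond the gate, the detection likely starts a new face track
    return mahalanobis_2d(obs, pred, var_x, var_y) <= threshold
```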
The Face Tracking Framework
- When two faces move close together, the Mahalanobis distance alone cannot keep track of multiple faces
- Case where a face is missed or occluded: hypothesize the continuation of the face track
- Case of a false alarm, or faces close together: hypothesize the creation of a new track
- The idea is to wait for new observation data before making the final decision about a track
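A toy sketch of this wait-and-see bookkeeping: a missed face keeps its track hypothesis alive, an unmatched detection hypothesizes a new track, and a track is only dropped after several consecutive misses. The miss limit of three updates is an illustrative assumption, not a value from the paper.

```python
# Track management sketch: per-track count of consecutive misses.
# A track survives a few misses (hypothesized continuation) and every
# unmatched detection opens a hypothesized new track.

def update_tracks(tracks, matched_ids, new_ids, max_misses=3):
    survivors = {}
    for tid, misses in tracks.items():
        misses = 0 if tid in matched_ids else misses + 1
        if misses <= max_misses:     # hypothesis still alive
            survivors[tid] = misses
    for tid in new_ids:              # hypothesized new tracks
        survivors[tid] = 0
    return survivors
```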
Intra-shot Tracking Challenges
- Multiple hypothesis method: [illustration not reproduced in the transcript]
Kalman Motion Models
- The Kalman filter is a framework which can model different types of motion, depending on the system matrices used
- Several models were tested for the paper, with varying results
- Intuition: who pays to research object tracking? The military! Hence many tracking models are based on trajectories unlike those that faces in video are likely to exhibit
- For example, in most commercial video, a human face will not maneuver like a jet or a missile
Kalman Motion Models
- Four motion models were tested for FaceTrack:
  - Constant Velocity (CV)
  - Constant Acceleration (CA)
  - Correlated Acceleration (AA)
  - Variable Dimension Filter (VDF)
- The testing was done against ground truth consisting of manually identified face centers in each frame
Kalman Motion Models
- Rather than go through the whole process in exact detail, the next several slides illustrate the differences between the CV and CA models
- The matrices are also expanded to show how the states are updated
Constant Velocity (CV) Model
- The state vector holds position and velocity in each axis; velocity is modeled as constant, disturbed by random noise
- Expanded per-axis update equations (sampling interval T, process noise w):
    x(k+1)  = x(k) + T vx(k) + wx(k)
    vx(k+1) = vx(k) + wvx(k)
  (and likewise for the y axis)
- [the original slides expand and simplify the corresponding matrix equations step by step; the matrices are not reproduced in the transcript]
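A single constant-velocity propagation step (state: position and velocity per axis, sampling interval T) can be sketched as follows; the numbers are illustrative, not from the paper.

```python
# One CV prediction step: position advances by T * velocity, velocity is
# carried over unchanged (process noise enters only through the filter).

def cv_predict(state, T=0.1):
    """state = (x, vx, y, vy); returns the one-step-ahead prediction."""
    x, vx, y, vy = state
    return (x + T * vx, vx, y + T * vy, vy)
```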
Constant Acceleration (CA) Model
- Acceleration is now added to the state vector, and is explicitly modeled as constants disturbed by random noise
- Expanded per-axis update equations (sampling interval T):
    x(k+1)  = x(k) + T vx(k) + (T²/2) ax(k) + wx(k)
    vx(k+1) = vx(k) + T ax(k) + wvx(k)
    ax(k+1) = ax(k) + wax(k)
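The corresponding constant-acceleration propagation step for a single axis can be sketched as follows (T is the sampling interval; the numbers are illustrative):

```python
# One CA prediction step for a single axis: standard kinematics, with
# the acceleration carried over unchanged between updates.

def ca_predict(state, T=0.1):
    """state = (x, vx, ax) for one axis; returns the one-step prediction."""
    x, vx, ax = state
    return (x + T * vx + 0.5 * T * T * ax, vx + T * ax, ax)
```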
The Correlated Acceleration Model
- Replaces the constant accelerations with an AR(1) model
- AR(1): first-order autoregressive, a stochastic process where the immediately previous value has an effect on the current value (plus some random noise)
- Why? There is a strong negative autocorrelation between the accelerations of consecutive frames
- Positive accelerations tend to be followed by negative accelerations, implying that faces tend to "stabilize"
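An AR(1) acceleration can be simulated in a couple of lines. The coefficient of -0.6 and the noise scale are illustrative choices picked to show the sign-flipping, stabilizing behaviour, not parameters from the paper.

```python
import random

# AR(1) acceleration: a(k+1) = rho * a(k) + noise. A negative rho gives
# the negative autocorrelation observed between consecutive frames.

def ar1_series(n, rho=-0.6, noise_scale=0.05, seed=0, a0=1.0):
    rng = random.Random(seed)
    a, out = a0, []
    for _ in range(n):
        a = rho * a + rng.gauss(0.0, noise_scale)
        out.append(a)
    return out
```

Starting from a positive acceleration, the next sample is very likely negative, and the magnitudes decay toward a small noise floor, mirroring the "faces tend to stabilize" observation.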
The Variable Dimension Filter
- A system that switches between CV (constant velocity) and CA (constant acceleration) modes
- The dimension of the state vector changes when a maneuver is detected, hence "VDF"
- Developed for tracking highly maneuverable targets (probably military jets)
Comparison of Motion Models
[chart of average tracking error over the first 16 tracking runs not reproduced in the transcript]
Comparison of Motion Models
- Why does CV perform best?
  - The small sampling interval justifies viewing face motion as piecewise linear movements
  - A face cannot achieve very high accelerations (as opposed to a jet fighter)
- AA also performs well because it fits the nature of face motion: faces in commercial video exhibit few persistent accelerations (negative autocorrelation)
Summarization Across Shots
- Select representative frames for tracked faces; large, frontal-view faces are best
- Decode the representative frames into the pixel domain
- Use clustering algorithms to group the faces into different persons
- Make use of domain knowledge: for example, people do not usually change clothes within a news segment, but often do change outfits within a sitcom episode
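The "large, frontal-view faces are best" heuristic could be scored along these lines; the area-times-frontalness score is an illustrative stand-in, not the paper's actual selection criterion.

```python
# Pick the representative frame for a tracked face by favouring large,
# frontal views: score = face area * frontalness (illustrative weighting).

def best_representative(candidates):
    """candidates: list of (frame_id, face_area, frontalness in [0, 1])."""
    return max(candidates, key=lambda c: c[1] * c[2])[0]
```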
Simulation Results
[result figures not reproduced in the transcript]
Conclusions & Future Research
- FaceTrack is an effective face tracking (and summarization) architecture, within which different detection and tracking methods can be used
- It could be updated to use new face detection algorithms or improved motion models
- Based on the results, the CV and AA motion models are sufficient for commercial face motion
- Summarization techniques need the most development, followed by optimizing tracking for adverse situations