FACTS - A Computer Vision System
for 3D Recovery and Semantic
Mapping of Human Factors
Lucas Paletta, Katrin Santner, Gerald Fritz, Albert Hofmann,
Gerald Lodron, Georg Thallinger, Heinz Mayer
Human Attention
& Environment
Selectively attending to
one aspect of the environment
Study of joint attention for
communication on objects
Human factors in the context of
environments
Study of attention, workload, memory,
stress, emotion and decision making
Study of wayfinding systems, marketing
concepts, usability of user interfaces and
products
Wearable "Eye Tracking Glasses"
HD camera
Suite of (Wearable) Sensors
Eye Tracking Glasses (SMI ETG): wearable, 30 Hz binocular
Eye Tracker (SMI RED 500): static, 500 Hz binocular
Arousal (Affectiva Q)
Computational audition
Biosensor: pulse sensor, acceleration, galvanic skin response, temperature
Limb motion 6DOF
Human Factors Analysis, User Modeling, and Simulation
Diagram: Wearable Multimodal Sensing; User Interaction & Human Factors Analysis; User Model; Attention Model; Simulation in 3D Model; Statistical Analysis; 3D model
Motivation: 3D Gaze Estimation
Understanding behavior in task-specific environments
Localise real human gaze in the 3D environment: saliency map on attended infrastructure
© Vrvis, JR, AIT, 2006
Previous Work on 3D Attention Mapping
Munn et al. [ETRA, 2008]
Introduced monocular eye-tracking and triangulation of 2D gaze positions of subsequent key frames within the scene video of the eye-tracking system.
Reconstructed only single 3D points without reference to a complete 3D model, achieving an angular error of ≈3.8° (ours: ≈0.6°)
Voßkühler et al. [ECEM 2009], Pirri et al. [CVPR 2011]
Require a special stereo rig, not mass-marketed, in addition to a commercial eye-tracking device.
The achieved indoor accuracy is ≈3.6 cm at 2 m distance to the target (ours: ≈0.9 cm at the same distance with the proposed workflow).
No reference to a 3D model
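The comparison above mixes angular errors (degrees) and Euclidean errors (cm at a given distance). A minimal sketch of how the two relate, assuming a gaze ray that hits a flat target roughly perpendicular to the line of sight; the function names are illustrative and not part of the FACTS system:

```python
import math


def angular_to_euclidean(angle_deg: float, distance_m: float) -> float:
    """Lateral offset (in metres) on the target caused by an angular gaze error.

    Assumption: the target surface is perpendicular to the viewing direction.
    """
    return distance_m * math.tan(math.radians(angle_deg))


def euclidean_to_angular(offset_m: float, distance_m: float) -> float:
    """Angular error (in degrees) corresponding to a lateral offset on the target."""
    return math.degrees(math.atan2(offset_m, distance_m))


if __name__ == "__main__":
    # Figures quoted on the slide: 3.6 cm and 0.9 cm, both at 2 m distance.
    print(round(euclidean_to_angular(0.036, 2.0), 2))  # about 1.03 degrees
    print(round(euclidean_to_angular(0.009, 2.0), 2))  # about 0.26 degrees
```

Under this simplified geometry, 3.6 cm at 2 m corresponds to roughly 1.03° and 0.9 cm at 2 m to roughly 0.26°.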
Workflow: Recovery of 3D Gaze & Semantics
3D Model Generation: RGB-D based Map Building
Figure labels: pose trajectory on ground plane; point cloud; depth association by means of stereo calibration
3D Model Generation: Methodology
Fully automated 3D model generation
Grabbing RGB-D images of the environment with Kinect
Performing depth based visual SLAM using both image and depth information [*]
Reconstruction of a sparse point cloud consisting of 3D feature points
Each feature point is attached to a SIFT descriptor for robust data association during pose estimation
Pose estimation using sliding window bundle adjustment while minimizing reprojection error and depth discrepancy using 2D-3D correspondences
[*] K. Pirker, G. Schweighofer, M. Rüther, H. Bischof: GPSlam: Marrying Sparse Geometric and Dense Probabilistic Visual Mapping, Proc. 22nd British Machine Vision Conference (BMVC), 2011.
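The cost minimized during pose estimation combines a reprojection term and a depth term. The following is a rough sketch of such a cost for a pinhole camera, not the GPSlam implementation; the weighting factor `depth_weight`, the data layout and all function names are assumptions made for illustration, and the optimizer itself is not shown:

```python
def transform(R, t, X):
    """Map a 3D world point X into the camera frame: R * X + t."""
    return tuple(sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3))


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def residual(R, t, X_world, observed_px, observed_depth, intrinsics, depth_weight=1.0):
    """Reprojection error (2 components) plus weighted depth discrepancy (1 component)."""
    Xc = transform(R, t, X_world)
    u, v = project(Xc, *intrinsics)
    return (u - observed_px[0],
            v - observed_px[1],
            depth_weight * (Xc[2] - observed_depth))


def window_cost(poses, points, observations, intrinsics, depth_weight=1.0):
    """Sum of squared residuals over all observations in the sliding window.

    poses:        dict frame_id -> (R, t)
    points:       dict point_id -> (x, y, z) in world coordinates
    observations: list of (frame_id, point_id, (u, v), depth)
    """
    total = 0.0
    for frame_id, point_id, px, depth in observations:
        R, t = poses[frame_id]
        r = residual(R, t, points[point_id], px, depth, intrinsics, depth_weight)
        total += sum(c * c for c in r)
    return total
```

A bundle adjustment step would adjust the poses (and possibly the points) inside the window to reduce `window_cost`.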
3D Model Generation: Loop Closing
Loop closure detection through vocabulary tree search
Figure: query frame and potential loop closing candidates returned by the vocabulary tree
Returns a probability for each image in the map/tree
Geometric consistency check delivers the candidate frame
Low memory and fast computation time
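To illustrate how a query frame can be scored against every image in the map, here is a simplified sketch. It uses a flat bag of visual words with TF-IDF weighting instead of a hierarchical vocabulary tree, and it normalizes the similarities so they sum to one, which is only one possible reading of "probability for each image". The geometric consistency check is not included, and all names are hypothetical:

```python
import math
from collections import Counter


def build_index(images):
    """images: dict image_id -> list of visual word ids (quantized descriptors)."""
    n = len(images)
    df = Counter()
    for words in images.values():
        df.update(set(words))
    idf = {w: math.log(n / c) for w, c in df.items()}
    vectors = {image_id: _unit_tfidf(words, idf) for image_id, words in images.items()}
    return vectors, idf


def _unit_tfidf(words, idf):
    """L2-normalized TF-IDF vector; words unknown to the vocabulary are ignored."""
    tf = Counter(w for w in words if w in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    vec = {w: (c / total) * idf[w] for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else {}


def loop_candidates(query_words, vectors, idf):
    """Rank map images by cosine similarity to the query, normalized to sum to 1."""
    q = _unit_tfidf(query_words, idf)
    scores = {image_id: sum(v * vec.get(w, 0.0) for w, v in q.items())
              for image_id, vec in vectors.items()}
    s = sum(scores.values())
    if s > 0:
        scores = {k: v / s for k, v in scores.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The top-ranked images would then be passed to a geometric consistency check before a loop closure is accepted.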
3D Model Generation: Dense Model
For human attention analysis and realistic surface reconstruction, a dense environment model is constructed afterwards
Using probabilistic occupancy grid mapping
Every depth image is inserted into the voxel space
Using pyramidal approach presented in [*]
Real-time performance using GPU implementation
Surface reconstruction is handled by the standard marching cubes algorithm [**]
[*] K. Pirker, G.Schweighofer, M. Rüther, H. Bischof: Fast and Accurate Environment Modeling using Three-Dimensional
Occupancy Grids, Proc. 1st IEEE/ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011.
[**] W. E. Lorensen, H. E. Cline: Marching Cubes: A high resolution 3D Surface Construction Algorithm, in Computer
Graphics, vol. 21, 1987, pp. 163-169.
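As a rough illustration of probabilistic occupancy grid mapping, the sketch below inserts single depth measurements into a sparse voxel grid using the common log-odds update. It is a CPU toy version under assumed sensor model constants; the pyramidal GPU approach of [*] is not reproduced, and all names and values are illustrative:

```python
import math

# Assumed inverse sensor model (not taken from the slides).
L_OCC = math.log(0.7 / 0.3)    # log-odds increment for a voxel containing a depth hit
L_FREE = math.log(0.4 / 0.6)   # log-odds decrement for a voxel the ray passed through
L_MIN, L_MAX = -4.0, 4.0       # clamping keeps the map able to change later


def to_voxel(p, voxel_size):
    """Integer voxel index of a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in p)


def update_cell(grid, voxel, hit):
    """Add the measurement's log-odds to the voxel (unknown voxels start at 0)."""
    l = grid.get(voxel, 0.0) + (L_OCC if hit else L_FREE)
    grid[voxel] = max(L_MIN, min(L_MAX, l))


def probability(grid, voxel):
    """Occupancy probability of a voxel; 0.5 means unknown."""
    return 1.0 - 1.0 / (1.0 + math.exp(grid.get(voxel, 0.0)))


def insert_ray(grid, origin, endpoint, voxel_size):
    """Mark voxels sampled along the ray as free and the endpoint voxel as occupied."""
    length = math.dist(origin, endpoint)
    steps = int(length / (0.5 * voxel_size))  # sample at half-voxel spacing
    end_voxel = to_voxel(endpoint, voxel_size)
    visited = set()
    for i in range(steps):
        s = i / steps
        p = tuple(o + s * (e - o) for o, e in zip(origin, endpoint))
        v = to_voxel(p, voxel_size)
        if v != end_voxel and v not in visited:
            visited.add(v)
            update_cell(grid, v, hit=False)
    update_cell(grid, end_voxel, hit=True)
```

Inserting a full depth image amounts to calling `insert_ray` once per valid depth pixel, with the camera centre as origin. A surface could then be extracted from the voxel probabilities, for example with marching cubes.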
Result: 3D Model
Image based Pose Estimation: Matching Process
Figure: ETG frames matched against the point cloud; results in a pose for every ETG frame
Image based Pose Estimation [**]
Estimate the user's pose within the previously reconstructed area
The sparse three-dimensional point cloud and its SIFT keypoints build the matching model
ETG 2D image descriptors are matched against those in the 3D point cloud (global/local)
Pose estimation through a perspective-n-point (PnP) algorithm [*]
RANSAC is used to eliminate matching outliers
[*] Lepetit V., Moreno-Noguer F. and Fua P.: EPnP: An Accurate O(n) Solution to the PnP Problem, International Journal
of Computer Vision, pp. 155-166, 2009.
[**] Santner, K., Paletta, L., Fritz, G., Mayer, H., Visual Recovery of Saliency Maps from Human Attention in 3D
Environments, Proc. ICRA 2013.
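The matching step can be sketched as a brute-force nearest-neighbour search between the descriptors of an ETG frame and those attached to the 3D map points. The ratio test used here to reject ambiguous matches is an assumption; the slides only state that descriptors are matched and that RANSAC removes outliers. Function and parameter names are illustrative:

```python
import math


def match_descriptors(frame_descs, map_descs, ratio=0.8):
    """Return (frame_index, map_index) pairs that pass the ratio test.

    frame_descs: descriptors of 2D keypoints in the ETG frame
    map_descs:   descriptors attached to 3D points of the sparse point cloud
    ratio:       assumed threshold; a match is kept only if the best distance
                 is clearly smaller than the second-best distance
    """
    matches = []
    for i, d in enumerate(frame_descs):
        best, second, best_j = math.inf, math.inf, -1
        for j, m in enumerate(map_descs):
            dist = math.dist(d, m)
            if dist < best:
                second, best, best_j = best, dist, j
            elif dist < second:
                second = dist
        if best_j >= 0 and best < ratio * second:
            matches.append((i, best_j))
    return matches
```

Each surviving pair links a 2D keypoint to a 3D map point. These 2D-3D correspondences are what a PnP solver inside a RANSAC loop would consume, for example OpenCV's `cv2.solvePnPRansac` with the `cv2.SOLVEPNP_EPNP` flag.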
Image based Pose Estimation: Issues
200 out of 2200 poses could not be estimated (~90% coverage)
Too few image feature points (textureless areas)
Rapid head movements (motion blur)
6 DOF Reconstruction of Human Gaze
Given the estimated camera pose: intersection of the viewing ray with the dense environment model
Fast interference detection using an object oriented bounding box tree [*]
[*] S. Gottschalk, M. C. Lin, D. Manocha: OBB-Tree: A Hierarchical Structure for Rapid Interference Detection, Proc. 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996.
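The core geometric operation, intersecting the viewing ray with the triangles of the dense model, can be sketched with the Möller-Trumbore ray-triangle test. This sketch checks every triangle by brute force; it does not implement the OBB tree, which would only serve to skip most triangles. All names are illustrative:

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Distance t along the ray to the triangle, or None if there is no hit."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:            # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, p) * inv        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(tvec, e1)
    v = dot(direction, q) * inv   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None  # hits behind the eye are ignored


def gaze_point(origin, direction, triangles):
    """Closest intersection of the gaze ray with a list of (v0, v1, v2) triangles."""
    best = None
    for v0, v1, v2 in triangles:
        t = ray_triangle(origin, direction, v0, v1, v2)
        if t is not None and (best is None or t < best):
            best = t
    if best is None:
        return None
    return tuple(o + best * d for o, d in zip(origin, direction))
```

Here `origin` and `direction` would come from the estimated camera pose and the gaze direction reported by the eye tracker, both expressed in model coordinates.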
Reconstruction of Human Gaze
Precision of Gaze Mapping
Angular error: max. ≈0.6°
Euclidean error: max. ≈1.1 cm
Continuous Estimation of 3D Attention
Large 3D Model
Mapping of Gaze and Arousal in Large Environments
"3D attention shop"
Attention Guided Behaviors: “Exploration” and “Visual Search”
Region (=objects) of interest (ROI) detection
Annotation in 2D → Annotation in 3D
25 ROIs for Visual Search
Towards Cognition from Attention Mapping
Dwell time indicates that consecutive gaze points of regard (PORs) fall within the same ROI
Dwell times on an ROI indicate conscious processing of object information (e.g., ROI #1)
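A minimal sketch of how dwell times could be derived once every POR has been assigned to an ROI (or to none). The 30 Hz default follows the sampling rate quoted for the eye tracking glasses; the minimum run length `min_samples` and the function names are assumptions for illustration:

```python
from itertools import groupby


def dwell_times(roi_labels, sample_rate_hz=30.0, min_samples=3):
    """Turn a per-sample ROI sequence into a list of (roi, duration_in_seconds).

    roi_labels:  one entry per gaze sample, an ROI id or None (outside all ROIs)
    min_samples: assumed minimum number of consecutive samples to count as a dwell
    """
    dwells = []
    for roi, run in groupby(roi_labels):
        n = sum(1 for _ in run)
        if roi is not None and n >= min_samples:
            dwells.append((roi, n / sample_rate_hz))
    return dwells


def total_dwell(dwells):
    """Accumulated dwell time per ROI."""
    totals = {}
    for roi, duration in dwells:
        totals[roi] = totals.get(roi, 0.0) + duration
    return totals
```

For example, six consecutive samples on ROI #1 at 30 Hz yield one dwell of 0.2 s, while a run of only two samples on another ROI is discarded under the assumed threshold.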
Related work: Context of the FACTS System
Eye-tracking videos
Computer vision / multisensor analysis applied
Driver analysis:
Driver distraction analysis
Usability engineering:
Mobile user behavior analysis
User modeling:
Eye contact behavior analysis
Related work: Driver Distraction Analysis
Driver with Eye Tracking
Glasses
Gaze tracked with optical
flow analysis
Projection onto reference
images
Collective saliency map onto
environment
Time analysis
Related work: Mobile User Behavior Analysis
Localisation of smartphone in eye-tracking videos
Attention on display vs. environment
Marker free tracking of the smartphone
Saliency mapping on display image capture, rectified
Behavior analysis
Figure panels: smartphone eye-tracking; smartphone saliency mapping
Related work: Eye Contact - Behavior Analysis
subject   A      B      C      D      mean
UAR       70 %   67 %   65 %   68 %   67.4 % ± .02
AUC       77 %   71 %   68 %   78 %   73.2 % ± .05
UAR: unweighted average recall; AUC: area under the ROC curve
Source: Eyben, Schuller, Paletta, et al., submitted to IEEE Pervasive Computing, 2013
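For readers unfamiliar with the two measures in the table, the sketch below shows how they are commonly computed. It is a generic illustration with made-up toy labels, not the evaluation code behind the reported numbers:

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the positive
    sample receives the higher score; ties count as half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Unlike plain accuracy, UAR is not dominated by the majority class, which matters when eye contact occurs in only a small share of the frames.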
System Components
Summary & Conclusions
Summary
Recovery of 3D gaze:
Automated reconstruction of a 3D model
Automated mapping of gaze into a 3D model
Full recovery of semantic analysis (in the frame of ROIs)
System approach – various applications
Future work
Multisensor positioning (accelerometer, vision)
Computational attention model using 3D information
JOANNEUM RESEARCH
Forschungsgesellschaft mbH
Institute for Information and
Communication Technologies
www.joanneum.at/digital
Dr. Lucas Paletta
+43 664 602 876 1769
Thank you for your attention