Elements of behaviour recognition with ego-centric visual sensors. Application to AD patients assessment
TRECVID 2008
Elements of behaviour recognition with ego-centric visual sensors. Application to AD patients assessment
Jenny Benois-Pineau, LaBRI, Université de Bordeaux, CNRS UMR 5800 / University Bordeaux 1
H. Boujut, V. Buso, L. Letoupin, R. Mégret, S. Karaman, V. Dovgalecs, Ivan Gonsalez-Diaz (University Bordeaux 1)
Y. Gaestel, J.-F. Dartigues (INSERM)
Summary
1. Introduction and motivation
2. Egocentric sensors: wearable videos
3. Fusion of multiple cues: early fusion framework
4. Fusion of multiple cues: late fusion and intermediate fusion
5. Model of activities as Active Objects & Context
   5.1. 3D localization from wearable camera
   5.2. Object recognition with visual saliency
6. Conclusion and perspectives

1. Introduction and motivation
Recognition of Instrumental Activities of Daily Living (IADL) of patients suffering from Alzheimer's disease:
- Decline in IADL is correlated with future dementia
- IADL analysis:
  - Surveys of the patient and relatives give subjective answers
  - Observation of IADL with video cameras worn by the patient at home gives objective observations of the evolution of the disease
- Adjustment of the therapy for each patient: IMMED ANR, Dem@care IP FP7 EC projects
2. Egocentric sensors: wearable video
Video acquisition setup:
- Wide-angle camera on the shoulder
- Non-intrusive and easy-to-use device
- IADL capture: from 40 minutes up to 2.5 hours
- Natural integration into the home-visit protocol of paramedical assistants
Other devices: Loxie ear-worn camera, eye-tracker glasses, iPhone.
Wearable videos
Four examples of activities recorded with this camera: making the bed, washing dishes, sweeping, hoovering (IMMED dataset)
Wearable camera analysis
Monitoring of Instrumental Activities of Daily Living (IADLs):
- Meal-related activities (preparation, meal, cleaning)
- Instrumental activities (telephone, …)
- For directed activities in @Lab or undirected activities in @Home and @NursingHome
Advantages of the wearable camera:
- Provides a close-up view on IADLs: objects, location, actions
- Complements fixed sensors with more detailed information
- Follows the person in multiple places of interest (sink, table, telephone) without fully equipping the environment
Visual positioning from wearable camera
[Figure: frame annotated "in front of the stove", with pan, cooking, stool labels]
Wearable video datasets and scenarios
- IMMED Dataset @Home (UB1, CHU Bordeaux): 12 volunteers and 42 patients
- ADL Dataset @Home (UC Irvine): 20 healthy volunteers
H. Pirsiavash, D. Ramanan. "Detecting Activities of Daily Living in First-person Camera Views" CVPR 2012.
Unconstrained scenario
Dem@care Dataset (CHUN @Lab): 3 healthy, 5 MCI, 3 AD, 2 mixed dementia
First pilot ongoing. Constrained scenario
3. Fusion of multiple cues for activity recognition
Early fusion framework: the IMMED framework
Temporal partitioning (1)
Pre-processing: a preliminary step towards activity recognition.
Objectives:
- Reduce the gap between the amount of data (frames) and the target number of detections (activities)
- Associate one observation to one viewpoint
Principle:
- Use the global motion (i.e. ego-motion) to segment the video in terms of viewpoints
- One key-frame per segment: the temporal center
- Rough indexes for navigation throughout this long sequence shot
- Automatic video summary of each new video footage
Complete affine model of global motion (a1, a2, a3, a4, a5, a6)
[Krämer, Benois-Pineau, Domenger] Camera Motion Detection in the Rough Indexing Paradigm, TRECVID 2005.
Principle:
- Corner trajectories are computed from the global motion model
- End of segment when at least 3 corner trajectories have reached outbound positions
Temporal partitioning (2)
Threshold t is defined as a percentage p of the image width w, with p = 0.2-0.25
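The corner-trajectory segmentation above can be sketched as follows. This is a minimal reading of the slides, not the authors' implementation: the affine parameterization (x' = a1 + a2·x + a3·y, y' = a4 + a5·x + a6·y) and the reset-after-cut policy are assumptions.

```python
import numpy as np

def apply_affine(points, a):
    """Move points with the complete affine model (a1..a6):
    x' = a1 + a2*x + a3*y ;  y' = a4 + a5*x + a6*y."""
    x, y = points[:, 0], points[:, 1]
    return np.stack([a[0] + a[1] * x + a[2] * y,
                     a[3] + a[4] * x + a[5] * y], axis=1)

def segment_video(motions, w, h, p=0.2):
    """Cut a segment when >= 3 of the 4 image-corner trajectories
    have drifted farther than the threshold p*w from their start."""
    corners0 = np.array([[0, 0], [w, 0], [0, h], [w, h]], float)
    corners = corners0.copy()
    cuts = []
    for i, a in enumerate(motions):
        corners = apply_affine(corners, a)
        drift = np.linalg.norm(corners - corners0, axis=1)
        if np.sum(drift > p * w) >= 3:
            cuts.append(i)               # end of segment at frame i
            corners = corners0.copy()    # restart trajectories
    return cuts
```

With a constant 5-pixel horizontal ego-motion and w = 100 (threshold 20 px), a cut is produced every 5 frames.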
Temporal partitioning (3, 4): video summary
332 key-frames from 17,772 initial frames; video summary rendered at 6 fps
Color: MPEG-7 Color Layout Descriptor (CLD)
6 coefficients for luminance, 3 for each chrominance
For a segment: CLD of the key-frame, x(CLD) ∈ R^12
Localization: feature vector adaptable to the individual home environment, with Nhome locations: x(Loc) ∈ R^Nhome
Localization is estimated for each frame; for a segment, the mean vector over the frames within the segment is used
Audio: x(Audio): probabilistic features (SMN)
V. Dovgalecs, R. Mégret, H. Wannous, Y. Berthoumieu. "Semi-Supervised Learning for Location Recognition from Wearable Video". CBMI 2010, France.
Description space: fusion of features (1)
Description space (2): audio (J. Pinquier, IRIT)
Description space (3): motion
Htpe: log-scale histogram of the translation parameters' energy
Characterizes the global motion strength and aims to distinguish activities with strong vs. low motion
Ne = 5, sh = 0.2. Feature vectors x(Htpe,a1) and x(Htpe,a4) ∈ R^5
Histograms are averaged over all frames within the segment
Example histograms:
  Low motion segment:    x(Htpe,a1) = [0.87, 0.03, 0.02, 0, 0.08]   x(Htpe,a4) = [0.93, 0.01, 0.01, 0, 0.05]
  Strong motion segment: x(Htpe,a1) = [0.05, 0, 0.01, 0.11, 0.83]   x(Htpe,a4) = [0, 0, 0, 0.06, 0.94]
Hc: cut histogram. The i-th bin of the histogram contains the number of temporal segmentation cuts in the 2^i last frames.
Example: Hc[1]=0, Hc[2]=0, Hc[3]=1, Hc[4]=1, Hc[5]=2, Hc[6]=7
The histogram is averaged over all frames within the segment. It characterizes the motion history: the strength of motion even outside the current segment.
2^6 = 64 frames ≈ 2 s; 2^8 = 256 frames ≈ 8.5 s; x(Hc) ∈ R^6 or R^8
Residual motion on a 4×4 grid: x(RM) ∈ R^16
Description space (4): motion
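The cut histogram Hc can be sketched as follows, assuming bin i (1-indexed, as in the example above) counts the cuts that fall within the last 2^i frames before the current frame:

```python
import numpy as np

def cut_histogram(cut_frames, t, n_bins=6):
    """Hc[i] (0-indexed here) counts the temporal-segmentation cuts
    that occurred within the last 2^(i+1) frames before frame t."""
    cuts = np.asarray(cut_frames)
    return np.array([np.sum((cuts <= t) & (cuts > t - 2 ** (i + 1)))
                     for i in range(n_bins)])
```

For instance, cuts at frames [95, 80, 60, 55, 50, 45, 40] seen from frame t = 100 reproduce the slide's example vector [0, 0, 1, 1, 2, 7].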
Room recognition from wearable camera
Classification-based recognition:
- Multi-class classifiers, trained on a bootstrap video
- Temporal post-processing of the estimates (temporal accumulation)
- 4 global architectures considered (based on SVM classifiers)
Visual features considered:
- Bag-of-Visual-Words (BoVW)
- Spatial Pyramid Histograms (SPH)
- Composed Receptive Field Histograms (CRFH)
UB1 - Wearable camera video analysis
Classifiers (features → SVM → temporal accumulation): single feature, early fusion, late fusion, co-training with late fusion.
A semi-supervised multi-modal framework based on co-training and late fusion:
- Co-training exploits unannotated data by transferring reliable information from one feature pipeline to the other
- Time information is used to consolidate estimates
Vladislavs Dovgalecs, Rémi Mégret, Yannick Berthoumieu. Multiple Feature Fusion Based on Co-Training Approach and Time Regularization for Place Classification in Wearable Video. Advances in Multimedia, 2013, Article ID 175064.
Global architecture: multiple features, late fusion, co-training.
Co-training: a semi-supervised multi-modal framework. Combining co-training, temporal accumulation and late fusion produces the best results, by leveraging non-annotated data.
Illustration of the correct detection rate on the IDOL2 database: time-aware co-training vs. co-training vs. late-fusion baseline.
IMMED Dataset @Home (subset used for experiments): 14 sessions at patients' homes. Each session is divided into training/testing.
6 classes: bathroom, bedroom, kitchen, living room, outside, other
Training: on average 5 min per session; bootstrap: the patient is asked to present the various rooms in the apartment; used for location model training
Testing: on average 25 min per session; guided activities

RRWC performance on IMMED @Home — average accuracy for room recognition (without / with temporal accumulation):

  SVM                         BOVW        0.49 / 0.52
                              CRFH        0.48 / 0.53
                              SPH3        0.47 / 0.49
  Early fusion                BOVW+SPH3   0.48 / 0.50
                              BOVW+CRFH   0.50 / 0.54
                              CRFH+SPH3   0.48 / 0.51
  Late fusion                 BOVW+SPH3   0.51 / 0.56
                              BOVW+CRFH   0.51 / 0.56
                              CRFH+SPH3   0.50 / 0.54
  Co-training w/ late fusion  BOVW+SPH3   0.50 / 0.53
                              BOVW+CRFH   0.54 / 0.58
                              CRFH+SPH3   0.54 / 0.57

Discussion:
- The best performance is obtained with late-fusion approaches plus temporal accumulation
- Global accuracy is in the 50-60% range: a difficult dataset, due to IMMED's applicative constraints leading to low-quality training data for location

Dem@Care Dataset @Lab:
- Training: 22 videos, total time 6h42min
- Testing: 22 videos, total time 6h20min; average duration of one video: 18 min
- 5 classes: phone_place, tea_place, tv_place, table_place, medication_place
RRWC performance on Dem@Care @Lab
A larger, homogeneous training set yields 94% accuracy (late fusion, BoVW).
Details on one test video: 20120717a_gopro_split002. Parameters: CRFH+SPH2, C = 1000; normalized score for the best class.
Discussion on the results:
- Feature fusion (late fusion) works best for place classification
- Semi-supervised training slightly improves performance on the IMMED dataset
- But performance is highly dependent on good training: a sufficient amount of data and homogeneous content
Open issues: more selective criteria for recognition/rejection, to improve the reliability of positive decisions
Early fusion of all features
Pipeline: dynamic, static and audio features → combination of features → HMM classifier → activities
Feature vector fusion (early fusion):
- CLD: x(CLD) ∈ R^12
- Motion: x(Htpe) ∈ R^10, x(Hc) ∈ R^6 or R^8, x(RM) ∈ R^16
- Localization: Nhome between 5 and 10, x(Loc) ∈ R^Nhome
- Audio: x(Audio) ∈ R^7
Description space (5, 6)
Possible combinations of descriptors: 2^6 − 1 = 63

Dimensions per descriptor:
  Audio 7, Loc 7, RM 16, Htpe 10, Hc 8, CLD 12
  configmin: 7 dimensions; configmax: 60 dimensions

A two-level hierarchical HMM:
- Higher level: transitions between activities. Example activities: washing the dishes, hoovering, making coffee, making tea...
- Bottom level: activity description. Each activity is an HMM with 3/5/7/8 states; observation model: GMM; prior probability of the activity
Model of the content: activities recognition
Bottom-level HMM:
- Start/End: non-emitting states
- Observation x only for emitting states qi
- Transition probabilities and GMM parameters are learnt by the Baum-Welch algorithm
- A priori fixed number of states
- HMM initialization: strong loop probability a_ii, weak exit probability a_i,end
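The initialization described above (strong self-loop a_ii, weak exit to the non-emitting End state) can be sketched as follows. The concrete values a_loop = 0.9 and a_end = 0.01 are illustrative assumptions; Baum-Welch re-estimation is left to an HMM library.

```python
import numpy as np

def init_transitions(n_states, a_loop=0.9, a_end=0.01):
    """Initial transition matrix for one activity's bottom-level HMM.
    Columns: [q_1 .. q_n, End]. Each emitting state keeps a strong
    self-loop a_ii, a weak exit a_i,end, and spreads the remaining
    mass over the other emitting states."""
    A = np.zeros((n_states, n_states + 1))
    rest = (1.0 - a_loop - a_end) / max(n_states - 1, 1)
    for i in range(n_states):
        A[i, i] = a_loop
        A[i, n_states] = a_end
        for j in range(n_states):
            if j != i:
                A[i, j] = rest
    return A
```

Each row is a proper probability distribution, which is the only property the training algorithm requires from the initial guess.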
Activities recognition
Video corpus:

  Corpus   Subjects                              Number of videos   Duration
  IMMED    12 healthy volunteers                 15                 7h16
  IMMED    42 patients                           46                 17h04
  TOTAL    12 healthy volunteers + 42 patients   61                 24h20
Evaluation protocol:
  precision = TP/(TP+FP)
  recall = TP/(TP+FN)
  accuracy = (TP+TN)/(TP+FP+TN+FN)
  F-score = 2/(1/precision + 1/recall)
Leave-one-out cross-validation scheme (one video left out); results are averaged.
Training is performed over a sub-sampling of smoothed (10 frames) data. The label of a segment is derived by majority vote over the frame results.
Recognition of activities: 5 videos. Which descriptors?
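The four evaluation measures above translate directly into code (the F-score is the harmonic mean of precision and recall):

```python
def classification_scores(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-score from raw counts,
    exactly as defined in the evaluation protocol."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 / (1 / precision + 1 / recall)  # harmonic mean
    return precision, recall, accuracy, f_score
```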
3-state low-level HMMs, plus a 1-state "None" model.
Activities (16): PlantSpraying, RemoveDishes, WipeDishes, MedsManagement, HandCleaning, BrushingTeeth, WashingDishes, Sweeping, MakingCoffee, MakingSnack, PickingUpDust, PutSweepBack, HairBrushing, Serving, Phone and TV.
Result: 5 times better than chance.
Results using segments as observations
Comparison with a GMM baseline (23 activities on 26 videos)
The variance is strong: this reflects the difficulty of the task. The mean values are better than in the 5-sequence experiment, thanks to the larger training set.
23 activities: food manual preparation, displacement free, hoovering, sweeping, cleaning, making a bed, using dustpan, throwing into the dustbin, cleaning dishes by hand, body hygiene, hygiene beauty, getting dressed, gardening, reading, watching TV, working on computer, making coffee, cooking, using washing machine, using microwave, taking medicines, using phone, making home visit.
Hoovering (example)
Conclusion on early fusion:
- Early fusion requires testing numerous combinations of features
- The best results were achieved with the complete description space
- For specific activities, the optimal combinations of descriptors vary, and correspond to a common-sense choice
S. Karaman, J. Benois-Pineau, V. Dovgalecs, R. Mégret, J. Pinquier, R. André-Obrecht, Y. Gaestel, J.-F. Dartigues. Hierarchical Hidden Markov Model in detecting activities of daily living in wearable videos for studies of dementia. Multimedia Tools and Applications, Springer, 2012, pp. 1-29, DOI 10.1007/s11042-012-117-x (article in press).
4. Fusion of multiple cues: intermediate fusion and late fusion
Intermediate fusion
The different modalities (dynamic, static, audio) are treated separately. Each modality is represented by a stream, i.e. a set of measures along time. Each state of the HMM models the observations of each stream separately by a Gaussian mixture: K streams of observations.
Late fusion framework
Pipeline: dynamic, static and audio features → one HMM classifier per modality → fusion of classifier scores → activities.
The late fusion approach trains an HHMM classifier on each subspace of the complete description space (one per modality: audio, dynamic, static) and merges the results of these preliminary classifiers for decision making. Each modality k is assigned an expertise trust score specific to each activity l, defined as the normalized performance perf_l^k obtained by modality k for activity l:

  e_l^k = perf_l^k / Σ_k' perf_l^k'
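The expertise-weight formula can be sketched directly. The perf and score matrices below are toy values, not the paper's measurements:

```python
import numpy as np

def expertise_weights(perf):
    """perf[k, l]: performance of modality k on activity l.
    Returns e[k, l] = perf[k, l] / sum_k' perf[k', l]
    (normalised over modalities, per activity)."""
    perf = np.asarray(perf, float)
    return perf / perf.sum(axis=0, keepdims=True)

def fuse_scores(scores, e):
    """Weighted late fusion: for each activity l, sum over modalities
    of e[k, l] * scores[k, l]; the best fused activity is returned."""
    fused = (np.asarray(scores) * e).sum(axis=0)
    return fused, int(np.argmax(fused))
```

A modality that performs well on a given activity thus dominates the decision for that activity only, instead of globally.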
Experimental video corpus in the 3-fusion experiment: 37 videos recorded by 34 persons (healthy volunteers and patients), for a total of 14 hours of content.
Results of the 3-fusion experiment (1)
Results of the 3-fusion experiment (2)
Conclusion on the 3-fusion experiment: overall, the experiments have shown that intermediate fusion consistently outperforms the other fusion approaches on such complex data, supporting its use and expansion in future work.
5. Model of activities as Active Objects & Context
Pipeline: wearable camera video → localization + object recognition → instrumental activities recognition → higher-level reasoning
5.1. 3D localization from wearable camera
Problem statement:
- How to generate precise position events from wearable video?
- How to improve the precision of the 3D estimation?
Two-step positioning:
- Train: reconstruct 3D models of places from bootstrap frames; partial models of the environment focused on places of interest; Structure-from-Motion (SfM) techniques
- Test: match test images to the 3D models; extraction of visual features; robust 2D-3D matching
General architecture
Objectness [1][2]
Measure to quantify how likely it is for an image window to contain an object.
[1] Alexe, B., Deselaers, T. and Ferrari, V. What is an object? CVPR 2010.
[2] Alexe, B., Deselaers, T. and Ferrari, V. Measuring the objectness of image windows. PAMI 2012.
3D model finalization
- Densification for visualization
- Global geolocalization of the map: manual, using natural markers in the scene
- Definition of specific 3D places of interest (coffee machine, sink), directly on the 3D map
3D positioning
Using a Perspective-n-Point (PnP) approach: match 2D points in the frame with 3D points in the model and apply PnP techniques
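Once a PnP solver has returned a pose (rvec, tvec) mapping world points to camera coordinates (x_cam = R·x_world + t), the wearer's 3D position is the camera center C = −Rᵀt. A self-contained numpy sketch, with the Rodrigues formula included so no OpenCV dependency is assumed:

```python
import numpy as np

def rotation_from_axis_angle(rvec):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = np.asarray(rvec, float) / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def camera_center(rvec, tvec):
    """World position of the camera from a PnP pose:  C = -R^T t."""
    R = rotation_from_axis_angle(rvec)
    return -R.T @ np.asarray(tvec, float)
```

The resulting C is the point whose distance to the 3D places of interest drives the zone detection described next.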
Exploitation: zone detection
Object-camera distance (green: coffee machine, blue: sink)
- Distance to objects of interest
- Zones: Near / Instrumental / Far
- Events: zone entrance/exit → discrete events suitable for higher-level analysis
Experiments:
- 2 test sequences (68,000 and 63,000 frames)
- 6 classes of interest (desk, sink, printer, shelf, library, couch); repeated events
- One bootstrap video through each place; 6 reference 3D models; reconstruction requires fewer than 100 frames per model
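Turning the distance signal into discrete zone events can be sketched as follows; the two thresholds are hypothetical values for illustration, not the ones used in the experiments:

```python
def zone_events(distances, d_instr=0.8, d_near=2.5):
    """Quantise the object-camera distance (metres) into the
    Instrumental / Near / Far zones and emit an entrance event
    (frame index, zone) whenever the zone changes."""
    def zone(d):
        if d <= d_instr:
            return "Instrumental"
        if d <= d_near:
            return "Near"
        return "Far"
    events, current = [], None
    for t, d in enumerate(distances):
        z = zone(d)
        if z != current:
            events.append((t, z))   # entered zone z at frame t
            current = z
    return events
```

A real system would add hysteresis or majority filtering so that noise around a threshold does not generate spurious entrance/exit events.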
[Figure: sample frames and 3D point clouds for desk, sink, printer, shelf, library, couch]
Recognition performance: P = 0.9074, R = 0.5816
[Figure: confusion matrix; axes: cuisine, range_bureau, salle, photocop., biblio., pc]
Classes: 1: pc; 2: biblio.; 3: photocop.; 4: salle; 5: range_bureau; 6: cuisine; 0: transition
Localization rate = ? (the first column and first row correspond to the reject/transition class)
Confusion matrix:

         0      1      2      3      4      5      6
  0  15057  10472    252   2703   2982    635   3283
  1    464   3063      0      0      0      0      0
  2    423      0   1470    985      0      0      0
  3    415      0      6   5287      0      0      0
  4     75      0      0      0   2692      0      0
  5    121      0      0      0      0    590      0
  6    535      0      0      0      0      0  16537

P = 0.8813, R = 0.6771
Recognition performance with majority filtering over 100 frames
[Figure: confusion matrix; axes: cuisine, range_bureau, salle, photocop., biblio., pc]
Interest of the 3D model for indexing:
- 2D-3D matching: matching between test frames and 3D models
- 2D-2D matching: direct matching between test frames and reference frames
(-) Overhead in building the 3D model
(+) Much better precision/recall than purely 2D analysis
(+) Matching to a consolidated model instead of a large set of images
Hazem Wannous, Vladislavs Dovgalecs, Rémi Mégret, Mohamed Daoudi. Place Recognition via 3D Modeling for Personal Activity Lifelog Using Wearable Camera. MMM 2012: 244-254.
5.2. Object recognition with saliency
Many objects may be present in the camera's field of view.
How to focus on the object of interest?
Our proposal: by using visual saliency.
Our approach: modeling visual attention (IMMED DB)
Several approaches exist:
- Bottom-up or top-down
- Overt or covert attention
- Spatial or spatio-temporal
- Scanpath or pixel-based saliency
Features: intensity, color and orientation (Feature Integration Theory [1]); HSI or L*a*b* color space; relative motion [2]
Plenty of models exist in the literature: in their 2012 survey, A. Borji and L. Itti [3] inventoried 48 significant visual attention methods.
[1] Anne M. Treisman & Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, vol. 12, no. 1, pages 97-136, January 1980.
[2] Scott J. Daly. Engineering Observations from Spatiovelocity and Spatiotemporal Visual Models. In IS&T/SPIE Conference on Human Vision and Electronic Imaging III, volume 3299, pages 180-191, 1998.
[3] Ali Borji & Laurent Itti. State-of-the-art in Visual Attention Modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
Itti's model
The most widely used model
Designed for still images
Does not consider the temporal dimension of videos.
[1] Itti, L.; Koch, C.; Niebur, E. "A model of saliency-based visual attention for rapid scene analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov 1998.
Spatio-temporal saliency modeling
Most spatio-temporal bottom-up methods work in the same way [1], [2]:
1. Extraction of the spatial saliency map (static pathway)
2. Extraction of the temporal saliency map (dynamic pathway)
3. Fusion of the spatial and temporal saliency maps
[1] Olivier Le Meur, Patrick Le Callet & Dominique Barba. Predicting visual fixations on video based on low-level visual features. Vision Research, vol. 47, no. 19, pages 2483-2498, Sep 2007.
[2] Sophie Marat, Tien Ho Phuoc, Lionel Granjon, Nathalie Guyader, Denis Pellerin & Anne Guérin-Dugué. Modelling spatio-temporal saliency to predict gaze direction for short videos. International Journal of Computer Vision, vol. 82, no. 3, pages 231-243, 2009.
Spatial saliency model
Based on the sum of 7 color contrast descriptors in the HSI domain [1][2]:
- Saturation contrast
- Intensity contrast
- Hue contrast
- Opposite color contrast
- Warm and cold color contrast
- Dominance of warm colors
- Dominance of brightness and hue
The 7 descriptors are computed for each pixel of a frame I using the 8-connected neighborhood; the spatial saliency map is their sum.
Finally, the spatial saliency map is normalized between 0 and 1 by its maximum value.
[1] M.Z. Aziz & B. Mertsching. Fast and Robust Generation of Feature Maps for Region-Based Visual Attention. IEEE Transactions on Image Processing, vol. 17, no. 5, pages 633-644, May 2008.
[2] Olivier Brouard, Vincent Ricordel & Dominique Barba. Cartes de saillance spatio-temporelle basées contrastes de couleur et mouvement relatif. In CORESA 2009, Toulouse, France, March 2009.
Temporal saliency model
The temporal saliency map is extracted in 4 steps [Daly 98][Brouard et al. 09][Marat et al. 09]:
1. The optical flow is computed for each pixel of frame i
2. The motion is accumulated and the global motion is estimated
3. The residual motion is computed (measured optical flow minus global motion)
Finally, the temporal saliency map is computed by filtering the amount of residual motion in the frame.
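The residual motion step can be sketched as follows, assuming the same 6-parameter affine global model as in the temporal partitioning (x' = a1 + a2·x + a3·y, y' = a4 + a5·x + a6·y); the subsequent filtering into a saliency map is omitted:

```python
import numpy as np

def residual_motion(flow, a, shape):
    """Residual motion magnitude per pixel: measured optical flow
    minus the displacement predicted by the global affine model."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    pred_u = a[0] + a[1] * x + a[2] * y - x   # predicted dx
    pred_v = a[3] + a[4] * x + a[5] * y - y   # predicted dy
    res = flow - np.stack([pred_u, pred_v], axis=-1)
    return np.linalg.norm(res, axis=-1)       # magnitude used for saliency
```

When the flow is pure ego-motion and the affine model captures it exactly, the residual is zero everywhere; moving objects stand out as non-zero blobs.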
Saliency model improvement
Spatio-temporal saliency models were designed for edited videos and are not well suited to unedited egocentric video streams.
Our proposal: add a geometric saliency cue that accounts for camera motion anticipation.
[1] H. Boujut, J. Benois-Pineau, and R. Mégret. Fusion of multiple visual cues for visual saliency extraction from wearable camera settings with strong motion. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision - ECCV 2012 Workshops (IFCV WS).
Geometric saliency model
A 2D Gaussian (center bias, Buswell 1935 [2]) was already applied in the literature [1]; it is suitable for edited videos.
Our proposal:
- Train the center position as a function of camera motion
- Move the 2D Gaussian center according to the camera center motion, computed from the global motion estimation
Considers the anticipation phenomenon [Land et al.].
Geometric saliency map
[1] Tilke Judd, Krista A. Ehinger, Frédo Durand & Antonio Torralba. Learning to predict where humans look. In ICCV, pages 2106-2113. IEEE, 2009.
[2] Michael Dorr, et al. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision (2010), 10(10):28, 1-17.
Geometric saliency model
The saliency peak is never located on the visible part of the shoulder.
Most of the saliency peaks are located in the top two thirds of the frame.
So the 2D Gaussian center is set accordingly, in the upper part of the frame, and shifted by the estimated camera motion.
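A sketch of such a geometric saliency map follows. The resting centre (w/2, h/3) is our reading of the "top two thirds" observation, and sigma_frac is an illustrative spread; neither is the trained value from the paper.

```python
import numpy as np

def geometric_saliency(w, h, dx=0.0, dy=0.0, sigma_frac=0.25):
    """2D Gaussian geometric saliency map of size (h, w). The centre
    starts at (w/2, h/3) and is shifted by (dx, dy), the frame-centre
    displacement given by global motion estimation (anticipation)."""
    cx, cy = w / 2 + dx, h / 3 + dy
    sx, sy = sigma_frac * w, sigma_frac * h
    y, x = np.mgrid[0:h, 0:w].astype(float)
    g = np.exp(-((x - cx) ** 2 / (2 * sx ** 2)
                 + (y - cy) ** 2 / (2 * sy ** 2)))
    return g / g.max()    # normalised to [0, 1]
```

With no camera motion the peak sits at (h/3, w/2); a panning motion slides it toward where the camera is heading, mimicking the wearer's anticipation.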
Saliency peak positions on frames from all videos of the eye-tracker experiment.
Saliency Fusion
Frame → spatio-temporal-geometric saliency map vs. subjective saliency map
Subjective Saliency
D. S. Wooding's method, 2002 (tested over 5000 participants)
Eye fixations from the eye-tracker + 2D Gaussians (fovea area = 2° spread) → subjective saliency map
How do people watch videos from a wearable camera?
Psycho-visual experiment:
- Gaze measured with an eye-tracker (Cambridge Research Systems Ltd. HS VET 250Hz)
- 31 HD video sequences from the IMMED database, duration 13'30
- 25 subjects (5 discarded); 6,562,500 gaze positions recorded
- We noticed that subjects anticipate camera motion
Evaluation on the IMMED DB
Metric: Normalized Scanpath Saliency (NSS) correlation method
Comparison of:
- Baseline spatio-temporal saliency
- Spatio-temporal-geometric saliency without camera motion
- Spatio-temporal-geometric saliency with camera motion
Results:
- Up to 50% better than spatio-temporal saliency
- Up to 40% better than spatio-temporal-geometric saliency without camera motion
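The NSS metric itself is simple to state: z-score the predicted saliency map, then average it at the recorded gaze positions. A minimal sketch:

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: z-score the predicted map and
    average it over the gaze positions, given as (row, col) pairs.
    Positive values mean fixations land on salient regions."""
    s = (saliency - saliency.mean()) / saliency.std()
    return float(np.mean([s[r, c] for r, c in fixations]))
```

An NSS of 0 corresponds to chance; the comparison above reports relative NSS improvements between saliency variants.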
H. Boujut, J. Benois-Pineau, R. Mégret. Fusion of Multiple Visual Cues for Visual Saliency Extraction from Wearable Camera Settings with Strong Motion. ECCV Workshops (3) 2012: 436-445.
Evaluation on the GTEA DB (IADL dataset):
- 8 videos, duration 24'43
- Eye-tracking measures for actors and observers: actors (8 subjects), observers (31 subjects); 15 subjects have seen each video
- SD-resolution video at 15 fps recorded with eye-tracker glasses
A. Fathi, Y. Li, and J. Rehg. Learning to recognize daily actions using gaze. In Computer Vision - ECCV 2012.
Evaluation on the GTEA DB: center bias — the camera is mounted on glasses, so head movement compensates. The correlation is database dependent.
Objective vs. subjective (viewer) saliency
Proposed processing pipeline for object recognition
Pipeline: mask computation → local patch detection & description → BoW computation (visual vocabulary) → image matching (image retrieval) / supervised classifier (object recognition).
A spatially constrained approach using saliency methods.
GTEA Dataset (constrained scenario):
- GTEA [1] is an egocentric video dataset containing 7 types of daily activities performed by 14 different subjects; the camera is mounted on a cap worn by the subject
- Scenes show 15 object categories of interest
- Data are split into a training set (294 frames) and a test set (300 frames)
Categories in the GTEA dataset with the number of positives in train/test sets.
[1] Alireza Fathi, Yin Li, James M. Rehg. Learning to recognize daily actions using gaze. ECCV 2012.
Assessment of visual saliency in object recognition
We tested various parameters of the model:
- Various approaches for local region sampling: sparse detectors (SIFT, SURF), dense detectors (grids at different granularities)
- Different options for the spatial constraints: without masks (global BoW); ideal, manually annotated masks; saliency masks (geometric, spatial, fusion schemes)
In two tasks:
- Object retrieval (image matching): mAP ~ 0.45
- Object recognition (learning): mAP ~ 0.91
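One way to read "spatially constrained BoW with saliency masks" is to let each local patch vote into the histogram with the saliency value at its keypoint instead of a constant 1, so patches on the manipulated object dominate the signature. This is a sketch of that idea, not the authors' exact scheme; the codebook and sampling below are placeholders.

```python
import numpy as np

def saliency_weighted_bow(descriptors, keypoints, saliency, n_words, codebook):
    """BoW histogram with saliency-weighted votes.
    descriptors: (n, d) local descriptors; keypoints: (row, col) per
    descriptor; saliency: per-pixel map in [0, 1];
    codebook: (n_words, d) visual-word centres (assumed given)."""
    hist = np.zeros(n_words)
    for d, (r, c) in zip(descriptors, keypoints):
        word = int(np.argmin(np.linalg.norm(codebook - d, axis=1)))
        hist[word] += saliency[r, c]   # soft spatial constraint
    n = hist.sum()
    return hist / n if n > 0 else hist
```

With a binary mask this reduces to the classic "masked BoW"; with a graded saliency map it interpolates between global and masked signatures.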
Sparse vs. dense detection
Experimental setup: GTEA database, 15 categories, SURF descriptor, object recognition task, SVM with a χ² kernel, using ideal masks.

  Detector + Descriptor    Avg. number of points
  Sparse SURF + SURF       137.9
  Dense + SURF             1210
Dense detection greatly improves performance, at the expense of increased computation time (more points to process).
Object recognition with saliency maps
The best masks: ideal, geometric, and squared-with-geometric.
Object recognition on the ADL dataset (unconstrained scenario)
Approach: temporal pyramid matching.
Features:
- probability vector of objects O
- probability vector of places (2D approach) P
- early fusion by concatenation: F = [O, P]
Results of activity recognition (ADL)
6. Conclusion and perspectives (1)
- Spatio-temporal saliency correlation is database dependent for egocentric videos
- It requires a very good motion estimation (currently RANSAC-based)
- Saliency maps provide good ROIs for object recognition in the task of wearable video content interpretation
- The model of activities as a combination of objects and location is promising
Conclusion and perspectives (2)
- Integration of 3D localization
- Tests on Dem@care constrained and unconstrained datasets (@Lab and @Home)
- Learning of manipulated objects vs. active objects
Metrics (averaged):

             Early fusion   Interm. fusion   Late fusion
                                             F-score trust   Prec. trust   Recall trust
  Accuracy   0.207          0.442            0.215           0.188         0.210
  Precision  0.174          0.267            0.106           0.097         0.131
  Recall     0.284          0.288            0.171           0.155         0.221
  F-Score    0.180          0.253            0.109           0.092         0.139
Activity recognition accuracy for our approach, computed at frame and segment level respectively:

  Approach               Avg Frame Accuracy   Avg Segment Accuracy
  Our Objects            24.2                 40.5
  Our Places             19.7                 11.1
  Our Objects + Places   26.9%                41.3%
  Pirsiavash et al.      23                   36.9