DOI: 10.4018/IJCVIP.2019040102
International Journal of Computer Vision and Image ProcessingVolume 9 • Issue 2 • April-June 2019
Copyright©2019,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited.
Accelerating Deep Action Recognition Networks for Real-Time Applications

David Ivorra-Piqueres, University of Alicante, Alicante, Spain
John Alejandro Castro Vargas, University of Alicante, Alicante, Spain
Pablo Martinez-Gonzalez, University of Alicante, Alicante, Spain
ABSTRACT
In this work, the authors propose several techniques for accelerating a modern action recognition pipeline. This article reviewed several recent and popular action recognition works and selected two of them as part of the tools used for improving the aforementioned acceleration. Specifically, temporal segment networks (TSN), a convolutional neural network (CNN) framework that makes use of a small number of video frames for obtaining robust predictions, which allowed it to win first place in the 2016 ActivityNet challenge, and MotionNet, a convolutional-transposed CNN that is capable of inferring optical flow from RGB frames. Together with the last proposal, this article integrated new software for decoding videos that takes advantage of NVIDIA GPUs. This article shows a proof of concept for this approach by training the RGB stream of the TSN network on videos loaded with NVIDIA Video Loader (NVVL), using a subset of daily actions from the University of Central Florida 101 dataset.
KEYWORDS

Action Recognition, Action Understanding, Deep Learning, GPU Acceleration, Machine Learning, Optical Flow, Real-Time, Recurrent Networks, Video Decoding
1. INTRODUCTION
Although in recent years the task of activity recognition has witnessed numerous breakthroughs thanks to the development of new methodologies and the rebirth of deep learning techniques, the natural course of events has not always been like this. For many years, despite being tackled from multiple perspectives, the problem of constructing a system that is capable of identifying which activity is being performed in a given scene was barely solved. In the state of the art we can find different approaches based on handcrafted traditional methods and machine learning approaches:
• Handcrafted features dominance. The first approximations were motivated by fundamental algorithms such as optical flow (Horn and Schunck, 1981), the Canny edge detector (Canny, 1986), the Hidden Markov Model (HMM) (Rabiner and Juang, 1986) or Dynamic Time Warping (DTW) (Bellman and Kalaba, 1959). Several of these methods have been reviewed in (Gavrila, 1999), for hand and whole-body movements, which can be used to obtain relevant information for the recognition of activities.
• Machine learning approaches. More modern methods use optical flow (Efros et al., 2003) to obtain temporal features over the sequences, in addition to using automatic learning algorithms such as Support Vector Machine (SVM) (Schüldt, Laptev and Caputo, 2004) to classify spatiotemporal features.
• Deep learning. CNNs allow obtaining robust visual features on 2D images (Chéron and Laptev, 2015), but more specifically their version adapted to work with data defined in three dimensions offers the ability to obtain spatial and temporal features when working with sequences of images. In this way, in addition to two spatial dimensions (height and width), we have a third dimension defined by time (frames) (Ji et al., 2013; Simonyan and Zisserman, 2014).
2. APPROACH
In this section we review the most modern action recognition works carried out in the past three years.

Online Inverse Reinforcement Learning (Rhinehart and Kitani, 2017) is a novel method for predicting future behaviors by modeling the interactions between the subject, objects, and their environment, through a first-person mounted camera. The system makes use of online inverse reinforcement learning, thus making it possible to continually discover new long-term goals and relationships. Also, with an approach similar to that of hybrid Siamese networks, it has been shown in (Mahmud, Hasan and Roy-Chowdhury, 2017) that it is possible to simultaneously predict future activity labels and their starting times. It does so by taking advantage of features of previously seen activities and objects currently present in the scene.
Thanks to the use of Single Shot multi-box Detector (SSD) CNNs, the system proposed in (Singh et al., 2017) is capable of predicting both action labels and their corresponding bounding boxes in real time (28 FPS). Moreover, it can detect more than one action at the same time. All of this is accomplished by using RGB image features combined with optical flow ones (with a decrease in the optical flow quality and global accuracy) extracted in real time for the creation of multiple action tubes.
In (Kong, Tao and Fu, 2017), for predicting action class labels before the action finishes, the authors make use of features extracted from fully observed videos processed at train time, to fill out the missing information present in the incomplete videos to predict. Furthermore, thanks to this approach their model obtains a great speedup when compared to similar methods.
A model that is capable of performing visual forecasting at different abstraction levels is presented in (Zeng et al., 2017). For example, the same model can be trained for future frame generation as well as for action anticipation. This is accomplished by following an inverse reinforcement learning approach. Also, the model is forced to imitate natural visual sequences from the pixel level.
The model developed in (René et al., 2017) is capable of predicting future activity labels in real time on RGB-D videos. This is accomplished by making use of soft regression, for jointly learning both the predictor model and the soft labels. Moreover, real-time performance (around 40 FPS) is obtained by including a novel RGB-D feature named Local Accumulative Frame Feature (LAFF). In addition, a TCN Encoder-Decoder system is built for performing the mentioned tasks. After training, it is able to surpass current similar approaches. Furthermore, the system presents better performance than Bidirectional Long Short-Term Memory (Bi-LSTM) networks.
In (Buch et al., 2017), a system is presented that is capable of producing temporal action proposals on a video with only one forward pass. Thus, there is no need to create overlapping temporal sliding windows. Moreover, the system can work with long untrimmed videos of arbitrary length in a continued fashion. Finally, by combining the system with action classifiers, temporal action detection performance is increased.
A new convolutional model is presented in (Carreira and Zisserman, 2017), known as the Two-Stream Inflated 3D convolutional neural network (I3D), which is used as a spatio-temporal feature extractor. After this, the authors pre-train I3D-based models on the Kinetics dataset, showing that with this approach, action classification performance on well-known datasets is noticeably increased.
In (Feichtenhofer, Pinz and Wildes, 2017), a fully space-time convolutional two-stream network (named STResNet) is proposed for the task of action recognition in videos. The first stream is fed with RGB data while the second, with optical flow features. The main particularity of this model is the existing interconnections between both streams. Moreover, for learning long-term relationships, identity mapping kernels are injected through the network. All of this allows the network to predict in a single forward pass.
New recurrent neural network approaches are presented in (Dave, Russakovsky and Ramanan, 2017), which are used for solving the problem of action detection in videos, obtaining satisfactory results. At its basis the model: (1) focuses on changes between frames, (2) predicts the future, and (3) makes corrections upon it by observing what truly happens next.
The authors of (Sigurdsson et al., 2017) propose a model that is capable of reasoning in detail about aspects of an activity; i.e., for each frame the model is capable of predicting the current activity, its action and object, the scene, and the temporal progress. This is accomplished by making use of Conditional Random Fields (CRFs) that are fed by CNN feature extractors. Moreover, to be able to train this system in an end-to-end manner, an asynchronous stochastic inference algorithm is developed.
In (Wang et al., 2017) the authors propose a CNN framework for the recognition of actions in videos, both trimmed and untrimmed, and in (Aliakbarian et al., 2017) a multi-stage Long Short-Term Memory (LSTM) architecture combined with a novel loss function is proposed, which is capable of predicting action class labels in videos, even when only the first frames of the sequence have been shown. The model takes advantage of action-aware and context-aware features to succeed in this task.
2.1. TSN Framework

Temporal Segment Networks (TSN) (Wang et al., 2017) is a CNN framework for the recognition of actions in videos, both trimmed and untrimmed. Along with it, a series of guidelines for properly initializing and operating such deep models for this task is proposed. The framework aims to tackle four common limitations when using convolutional neural networks on videos. First, the difficulty of using long-range frame sequences, due to high computational and memory space costs, which can lead to missing important temporal information. Second, most of the systems focus on trimmed videos instead of untrimmed ones (several actions may happen in a video); adapting to these would mean properly localizing actions and avoiding background (irrelevant) video parts. Third, while deep models become complex, many datasets are still small in number of samples and diversity, lacking enough data for properly training them while avoiding overfitting. Fourth, the time consumed by optical-flow extraction can become a delay both for using large-scale datasets and for using the model in real-time applications. Figure 1 shows a schematic view of such a network.
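The segment-sampling and average-consensus scheme just described can be sketched in a few lines of NumPy. The function names and the random scores below are our own stand-ins for the real CNN stream outputs, kept only to illustrate the idea:

```python
import numpy as np

def sample_snippet_indices(num_frames: int, num_segments: int,
                           rng: np.random.Generator) -> list:
    """Pick one random snippet (frame index) from each of the equal
    temporal segments that divide the video."""
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

def tsn_consensus(snippet_scores: np.ndarray) -> np.ndarray:
    """Average per-snippet class scores (segmental consensus), then apply
    softmax to obtain the video-level class probabilities."""
    consensus = snippet_scores.mean(axis=0)
    exp = np.exp(consensus - consensus.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
idx = sample_snippet_indices(num_frames=194, num_segments=3, rng=rng)
scores = rng.normal(size=(3, 101))              # stand-in CNN outputs, 101 classes
probs = tsn_consensus(scores)
print(idx, int(probs.argmax()))
```

In the real framework each snippet is processed by the spatial or temporal stream before the consensus; here the random `scores` matrix takes that place.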
2.2. GPU Video Decoding

Since the beginning of the modern deep learning era, data storing and loading times have always been a bottleneck in the pipeline. Although recently we are witnessing great speedups thanks to new hardware technologies like SSDs for storing, or data transferring devices (between CPU and GPU, and vice versa) such as NVLINK, the issue persists.
Many of the research areas where this problem is most aggravated are the ones which work with videos as the main dataset source. These include predictive learning, video understanding, question answering, activity recognition, and super-resolution networks, among many others.
The main approach when tackling this problem in those areas is to first extract all the frames for each video of the dataset, for example by using FFmpeg, and save them in a high-quality image format, rather than one with possible lossy compression and artifact generation, in order to properly train the network. This comes with an increasing need for storage space, since the more information we are willing to keep, the larger in size our converted image dataset will be.
In Figure 2 we can see the effects of storing the University of Central Florida 101 (UCF101) dataset (Soomro, Zamir and Shah, 2012), composed of only 13,320 videos, in different formats. For the case of JPEG (images), it occupies 63 GiB, while in AVI format (video) it occupies 9.25 times less, 6.8 GiB. If it is transformed to the proper MP4 format needed by NVIDIA Video Loader (NVVL) (Casper, Barker and Catanzaro, 2018), with the corresponding number of frames, it occupies 14.2 GiB, still 4.44 times less. If we take this to a fine-grained level, such as frames, we can see that the storage differs by a large margin.
In order to alleviate this problem, a useful solution is to directly load video files into memory, decode the necessary frames, prepare them, and finally feed them to the network. Actually, APIs that can manage the first two steps exist: the FFmpeg libraries themselves, and higher-abstraction modules like PyAV or PIMS, which both load data into the CPU. On the other hand, the (beta) Hwang project also supports NVIDIA GPUs through the use of their specific hardware decoder units. Furthermore, libraries designed with machine learning tasks in mind, which can provide all the mentioned steps, have recently been developed. Two are currently known: Lintel (Duke, 2018) and NVVL (Casper, Barker and Catanzaro, 2018). The first focuses on CPU loading (it uses FFmpeg as backend), while the second targets GPU devices. Indeed, although written in C++, it offers off-the-shelf PyTorch modules (dataset and loader). Moreover, another wrapper for CuPy arrays has been created.
Figure 1. Representation of the TSN framework. First, a snippet is extracted from each of a fixed number of segments that equally divide the video. Then, features such as optical flow or RGB-diff (top and bottom images of the second process column) are extracted. After passing through the corresponding stream, an aggregation function joins the individual snippet class probabilities. Then, softmax is applied to obtain the final video action class.
Figure 2. Storage comparison between frames and video formats for the UCF101 dataset
Regarding performance, NVVL reduces the I/O processing times by a large margin, as can be appreciated in Figure 3. More benchmarks that take into account memory usage and CPU load can also be found in the blog post, while an even more detailed evaluation is located on GitHub¹. Regarding data, loading behaves like a sliding window of stride one, where frame sequences of a previously fixed length are subsequently loaded and returned as a single tensor. On the other hand, we can apply different transformations to these sequences: data type (float, half, or byte), width and height resizing and scaling, random cropping and flipping, normalizing, color space (RGB or YCbCr: Y, luminance; Cb, chrominance-blue; Cr, chrominance-red), and frame index mapping. For performance, flexibility, and completeness reasons, we decided to use NVVL as our main tool to accelerate the TSN framework.
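The stride-one sliding-window behavior can be illustrated with a small NumPy sketch; the function below is our own illustration, not part of the NVVL API:

```python
import numpy as np

def sliding_sequences(frames: np.ndarray, seq_len: int) -> np.ndarray:
    """Return all overlapping windows of `seq_len` frames (stride one),
    stacked into a single tensor of shape
    (T - seq_len + 1, seq_len, C, H, W)."""
    t = frames.shape[0]
    return np.stack([frames[i:i + seq_len] for i in range(t - seq_len + 1)])

frames = np.zeros((10, 3, 4, 4), dtype=np.float32)  # 10 dummy RGB frames
windows = sliding_sequences(frames, seq_len=4)
print(windows.shape)  # (7, 4, 3, 4, 4)
```

NVVL performs the equivalent operation during decoding, so the overlapping sequences never need to be materialized on disk.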
2.3. Hidden Two-Stream Convolutional Networks for Action Recognition

Another approach to the question of real-time action recognition can be found in (Zhu et al., 2017), where the use of a convolutional network for automatically computing optical flow is presented.
In more detail, in a first phase, a CNN denoted MotionNet is trained in an unsupervised manner for the task of optical-flow estimation. After obtaining acceptable results, similar to optimal traditional methods, the network is attached to a conventional CNN as the temporal stream part of the whole model, the spatial stream being similar in architecture to the other one. Then, the network is trained (including MotionNet) on the task of action recognition from frame sequences. The approach enables the optical flow generator to be adapted to the characteristics of the task, further finding a suitable motion representation.
3. EXPERIMENTATION
In this project we have experimented with the discussed approaches using the UCF101 dataset.
Figure 3. Average loading time (milliseconds) that 32-bit floating point PyTorch tensors take to be available in the GPU. The experiment was run on an NVIDIA V100 GPU over one epoch with batches of size 8. Figure extracted from (Casper, Barker and Catanzaro, 2018).
3.1. Dataset (Soomro, Zamir and Shah, 2012)

Given the limited number of RGB action datasets that included realistic scenes (without actors or prepared environments) and a wide range of classes until 2012, the authors of this paper proposed a new large-scale dataset of user-uploaded (YouTube) videos. These present much more diverse challenges than those of previous datasets, since recordings can contain different lighting configurations, image quality degradation, cluttering, camera movement, and occluded scenes.
In regard to the size of the dataset, 13,320 videos are divided into 101 classes that cover five action groups: Human-Human Interaction, Sports, Playing Musical Instruments, Human-Object Interaction, and Body-Motion Only. The actions contained in the first and fourth groups can be observed in Figure 4².
Furthermore, this dataset marked a milestone in what large-scale action recognition datasets are concerned. It made it possible to establish a well-known starting testbed to be improved upon, as well as for benchmarking. Moreover, deep learning competitions were established around it, such as the different modalities of the THUMOS Challenge, which was run for three years in a row. After that, other large-scale datasets appeared, expanding the characteristics of UCF101; the dataset is thus also noteworthy for marking the start of an ever-growing number of diverse large-scale action recognition datasets.
3.2. GPU Video Decoding Experiments

In order to test our GPU video decoding pipeline, we can compare the difference between the original frames and the ones loaded through NVVL. For this task, we are going to use the Structural Similarity (SSIM) index between two pictures, usually used in the video industry for measuring the visual difference we can perceive when comparing frames of an original and a downsampled video. It ranges from 0 to 1, where 1 is given for two identical pictures and 0 for two completely different ones. For example, given the two frames obtained from the UCF101 dataset (Diving class) that can be observed in Figure 5, we can notice a green band on the right extreme of the NVVL-loaded image.
Apart from this, we cannot perceive any other substantial degradation in quality. Indeed, the SSIM obtained is 0.992, indicating that this artifact is probably due to a bug rather than a low-quality video processor. To confirm this, we can compute the SSIM heatmap, in order to locate other possibly missed artifacts (Figure 6).
Thus, we can assume that there will be no harm in incorporating this tool into a neural network training pipeline.
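As an illustration, an SSIM comparison of this kind (including the full heatmap like the one in Figure 6) can be computed with scikit-image's `structural_similarity`. The synthetic frames and the simulated band artifact below are placeholders for the real decoded frames:

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

identical = original.copy()
corrupted = original.copy()
corrupted[:, -4:] = 0           # simulate a band artifact on the right edge

s_same = structural_similarity(original, identical)          # 1.0 for identical frames
score, heatmap = structural_similarity(original, corrupted, full=True)
print(round(float(s_same), 3), round(float(score), 3), heatmap.shape)
```

The `full=True` flag returns the per-pixel similarity map, which, plotted as an image, localizes artifacts just as in our heatmap figure.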
Figure 4. Classes for the Human-Object Interaction (blue) and Body-Motion Only (red) action groups from the UCF101 dataset. Figure extracted from (Soomro, Zamir and Shah, 2012).
Now, we should pay attention to determining the current time speedup we can obtain from replacing the image loading system of the TSN framework with an NVVL pipeline. For this, after adapting the frame-index generation functions and integrating the video loader into the framework, we can perform the following:
1. Obtain a list of videos, and get the total number of videos and the mean number of frames per video.
2. Extract all the frames from the videos, and also convert them into the required NVVL video format.
3. Select the number of frames per video that are going to be loaded. For NVVL, all the frames have to be loaded.
Figure 5. Original frame (left) and NVVL-obtained frame (right). The frames pertain to a sample of the Diving class in the UCF101 dataset.
Figure 6. Heat map of the above frames; the lighter the color, the closer each pixel is to the original frame
4. Measure how much time it takes to extract the selected number of frames (into the GPU) on each occasion. For NVVL this only needs to be done once.
5. Obtain mean times and the trend for the previous process and compare the results.
Step1Wewillusethefirst450videosoftheUCFsplit-1trainlistobtainedwiththedatatoolsprovided
bytheTSNframework.Thislistisformattedwitharowforeachvideo.Ineachrow,thepathtothevideo,thenumberofframesthevideohasanditsclassindex.Duetothis,thetotalnumberofframescanbeobtainedjustbysummingthesecondelementofeachrowoverthewholelist.Theresultingnumberis87501.So,themeannumberofframesinavideoisapproximately194.
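The counting in this step can be sketched as follows; the miniature list file and its rows below are hypothetical, but follow the same "path frames class" format:

```python
import os
import tempfile
from pathlib import Path

def frame_stats(list_file: str, num_videos: int = 450):
    """Sum the per-video frame counts from a TSN-style split list
    ('path n_frames class_index' per row) and return (total, mean)."""
    rows = Path(list_file).read_text().splitlines()[:num_videos]
    counts = [int(row.split()[1]) for row in rows if row.strip()]
    return sum(counts), sum(counts) / len(counts)

# Hypothetical miniature list in the same three-column format:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("v_Diving_g01_c01 180 25\nv_Typing_g02_c03 210 94\n")
total, mean = frame_stats(f.name, num_videos=2)
print(total, mean)  # 390 195.0
os.unlink(f.name)
```

Applied to the real split list, `total` would be 87,501 and `mean` approximately 194.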
Step2Forcompletingthisstep,wecansimplyfollowtheinstructionsandcommandsprovidedinthe
repositoriesofeachproject.Wehavetotakeintoaccountthattheextractionprocesscantakeaquitegreateramountoftimethanthevideoconvertingprocess.
Step3Inthiscase,wearegoingtoloadevennumberofframes,startingfrom3andfinishingin25,a
totalof12differentinstances.ThishasbeenselectedsincetheauthorsofTSNtestthemodelwith3,5,7,and9framespervideo.
Step4Forobtaininganaccuratemeasurement,wearegoingtorepeateachexecution29times.For
computingthetimewehaveusedPythontime.timefunction.Also,inordertofreealltheresourcesineachrun,wearegoingtoloopinsideabashscriptinsteadofinsidethePythonexecutingscriptitself,thushavingtheprocesskilledautomatically.
Step5Inthisstepwecomputedthemeanvaluesforeachnumberofframes.Thetimetakenforloading
allthevideoswithNVVLisapproximately24.18seconds.Ontheotherhand,wecanplottheresultsobtainedfromloadingsoleframes:
We can notice that the trend follows linear growth with respect to the number of frames loaded. Since we computed the equation defining the trend line (shown in the lower-right part of Figure 7), we can obtain a more precise approximation of the speedup achieved when using NVVL. For this, since the number of frames loaded with NVVL is the same as the mean number of frames obtained in Step 1, we just need to substitute it into the equation (X variable), obtaining a mean value of approximately 458.74 seconds, or 7.65 minutes. We have achieved an improvement in loading time performance, leading to an 18.97 times speedup when using NVVL.
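The trend-line substitution can be reproduced with `numpy.polyfit`. The timing samples below are illustrative stand-ins for the measured means (chosen only to be of the same order as the reported trend), not the raw experimental data:

```python
import numpy as np

# Illustrative (frames loaded, mean seconds) measurements -- stand-in data.
frames = np.array([3, 5, 7, 9, 11, 13])
seconds = np.array([7.2, 11.9, 16.6, 21.4, 26.0, 30.8])

slope, intercept = np.polyfit(frames, seconds, deg=1)  # linear trend line
mean_frames = 194                                      # mean frames/video (Step 1)
frame_loader_time = slope * mean_frames + intercept    # substitute X = 194
nvvl_time = 24.18                                      # measured NVVL total (s)
print(f"speedup ~ {frame_loader_time / nvvl_time:.2f}x")
```

Substituting the mean number of frames into the fitted line and dividing by the NVVL time yields the reported order of speedup.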
3.3. Training RGB TSN+HTS with NVVL

So far we have shown how useful incorporating NVVL into a video-consuming deep learning pipeline can be: it allows us to reduce both the storage and data transfer costs while not suffering any degradation in image quality. Now, what remains is to incorporate this tool into a common action recognition scenario, where we train and test a network to learn to categorize human actions.
Such a network is going to be TSN, since it has demonstrated superior performance in the task at hand. Moreover, we propose to make use of the converted HTS Caffe model and weights, in order to avoid pre-computing the optical flow and to be able to use NVVL in this stream as well, focusing the resulting pipeline on real-time applications. Because of the dataset we are going to use, the memory limitations detailed below, and time constraints, we are going to focus the following experiment only on the RGB stream.
Before starting, we need to prepare the data in a format that is compatible with NVVL. As stated in the GitHub repository¹, we need videos with either the H.264 or HEVC (H.265) codec and yuv420p pixel format; they can be in any container that the FFmpeg parser supports.
Moreover, we have to take into account the number of keyframes each video will have; i.e., a codec only stores a subset of all the frames that we see in a video: these are the keyframes. At decoding time, the rest of the frames are obtained by algorithmically inferring them from the keyframes. For this reason, when loading sequences that can start and end at any frame (similar to what we can do with NVVL), the system has to seek the nearest keyframe, which can be far before or after the starting frame. This can result in an underperforming execution, and for this reason, when converting the videos, we have to indicate the keyframe frequency we want to have (Figure 8).
The developers of the video loader suggest setting one keyframe at intervals that correspond to the length of the sequences we are going to load. For example, if we are going to load sequences of length 7, then every 7 frames there will be a keyframe. Furthermore, they also provide the required commands to carry out this conversion with FFmpeg.
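A conversion command of this kind can be sketched as follows (built here in Python for clarity; the exact flags the NVVL authors recommend may differ). With libx264, `-g` sets the GOP size, i.e. the keyframe interval, and `-sc_threshold 0` stops FFmpeg from inserting extra scene-cut keyframes:

```python
def nvvl_convert_cmd(src: str, dst: str, seq_len: int) -> list:
    """Build an FFmpeg command producing an NVVL-compatible video:
    H.264, yuv420p, one keyframe every `seq_len` frames."""
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-g", str(seq_len), "-sc_threshold", "0",
            dst]

# Hypothetical file names; run with subprocess.run(cmd, check=True).
cmd = nvvl_convert_cmd("v_Diving_g01_c01.avi", "v_Diving_g01_c01.mp4", seq_len=7)
print(" ".join(cmd))
```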
For our case, we are going to set every frame in the video to be a keyframe. This is due to the fact that currently the PyTorch wrapper (the C++ API seems more flexible) is intended for loading multiple frame sequences for each video with a sliding window approach of a fixed length. Although this length could be equal to the number of frames in the video, thus loading only one sequence per video, this would only work if all the videos had the same length, since this parameter, the sequence length, is global for the whole dataset.
For iterating over the dataset, we are going to use the data loader provided by the NVVL PyTorch wrapper, where in each iteration it will load a batch of frame sequences. Since now each sequence has length one, we need to set the batch size to one as well. In this way we can easily know when the loader has fully output a video, add it to a list, and, when we have enough videos, group them in a batch of the size we want to provide to the network. Furthermore, to accomplish this we also need to set the shuffle option of the loader to false.
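The re-batching logic described above can be sketched generically; `video_loader` below is a stand-in iterable yielding one fully assembled video at a time, not the actual NVVL wrapper object:

```python
def batch_videos(video_loader, batch_size: int):
    """Group whole videos (each yielded by `video_loader`) into batches
    of the size the network expects, flushing the final partial batch."""
    batch = []
    for video in video_loader:          # one fully assembled video at a time
        batch.append(video)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                           # final, possibly smaller batch
        yield batch

videos = [f"video_{i}" for i in range(7)]   # placeholder for decoded videos
batches = list(batch_videos(iter(videos), batch_size=4))
print([len(b) for b in batches])  # [4, 3]
```

In the real pipeline, each `video` would be the list of length-one frame sequences accumulated from the loader before being stacked into a tensor.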
Although we are ready for training our network, an impediment arises at the time of writing this work. Whether because the videos have not been properly converted, or because of a code issue, the data loader
Figure 7. Mean loading time in seconds for each number of frames executed (blue). Trend line from the obtained data (red). The Y axis represents the loading time in seconds, while the X axis shows the number of frames used.
seems to get silently stuck when loading some videos. To solve this, one workaround is to create a loader for each video instead of having one for the whole dataset.
So far this works, but what happens next is that GPU memory is not properly freed, thus limiting the size of our dataset to the space available on the graphics card at the moment. For Asimov, this results in having around 240 videos for training and 160 for validation (only the Titan card supports NVVL).
For this reason, following the same lines of motivation proposed at the beginning of the document, we are going to select daily actions for the reduced dataset we can work with. Specifically, it is composed of eight classes from the Human-Object Interaction group of the UCF101 dataset: Apply Eye Makeup, Apply Lipstick, Blow Dry Hair, Brushing Teeth, Cutting In Kitchen, Mopping Floor, Shaving Beard, and Typing. The training set contains 30 videos for each action, while the validation one has 20 of them.
Regarding the training hyper-parameters, we are going to use the ones set by default for TSN, with the only exceptions being the batch size and the number of epochs. For the former, we have set it to 4 due to the limited memory; for the latter, we will perform 40 epochs, which is enough for the model to converge on this dataset. For the metrics, we will keep track of the loss and the top-1 and top-5 accuracies for both the training and validation sets.
Figure 8. Representation of how keyframes can be evenly inserted into a video stream
Once all the epochs have been completed, we finish our training in a common situation: 100% top-1 (and top-5) accuracy and zero loss. Clearly, our network has overfitted. This is due to the scarce amount of data and its limited variability. Moreover, such deep networks (BN-Inception) are prone to overfit, since they have more flexibility (a bigger number of parameters) for adjusting to the data they are consuming while training. However, we obtain validation accuracies of 76.25% and 98.125% for the top-1 and top-5 versions respectively, with a loss of approximately 1.52. Taking into account the data we are working with, these results are quite promising in comparison to what has happened in the training phase.
Now, we can get more insights by looking at how the training and validation have evolved (Figure 9).
Here we can take note of two facts. First, the top-5 accuracy converges much faster (at approximately iteration 200) than the top-1 accuracy (at approximately iteration 800). Clearly, this is something that can be foreseen, since it takes more time to learn the exact label of a video than to place it among the five best guesses.
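For reference, the top-k accuracy tracked here simply checks whether the true label falls among the k highest-scoring classes; a minimal NumPy version (the toy scores and labels are ours):

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

scores = np.array([[0.1, 0.5, 0.2, 0.2],
                   [0.6, 0.1, 0.2, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
labels = np.array([1, 2, 0])
print(topk_accuracy(scores, labels, k=1), topk_accuracy(scores, labels, k=2))
```

By construction top-5 accuracy is always at least as high as top-1, which is why its curve saturates earlier.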
Secondly, we see that the unsmoothed curve (shaded red) bounces between higher and lower values (accuracy and loss) among the first iterations. This effect can be better seen in the validation curves (Figure 10).
This happens as a consequence of the small batch size we have previously set. The smaller the batch size, the more weight updates we will perform. If it is too small, we could encounter the following:
• Instability: The frequent updates will cause the metrics to wander, going continually up and down.
• Non-meaningful updates: The reduced number of samples means each update contains less information about the error (negative gradient) direction, thus needing a greater number of epochs to converge to the same accuracy as with a bigger batch size. This can be summarized as longer training times.
• Hitting a local minimum: Also known as a plateau, and commonly induced by the previous points, a small batch size can make the network get stuck in a poor local minimum of the loss function (neither the global optimum nor a good sub-optimum), obtaining insufficient performance results.
As intuition, we can take a look at Figure 11, where on the left, the evolution of three types of batch-size loss curves (arriving at the minimum) is plotted. The blue one represents a batch
Figure 9. Training curves for 40 epochs, 60 iterations per epoch
of the same size as the dataset, thus making only one update per epoch: a smoother curve with a much less noisy evolution. Although it seems the best approach, the catch is in the time and space it takes to update the weights: since we have a large number of samples, we have to compute a vast number of operations. Moreover, it is commonly impracticable for a complete dataset to fit into a modern GPU's memory.
The purple curve is for the case where we perform one-sample updates, something that reflects, in the extreme, what happened during the training of our network. Finally, the green curve shows the everyday situation of most deep learning trainings, where the batch size is found in balance with the number of updates per epoch. Although there are frequent updates, they are not frequent enough to trigger divergence, while at the same time a reasonable amount of time is taken for computing the error.
In order to better visualize where the network guesses right or wrong, we can make use of the training and validation confusion matrices, where in each cell we can see the percentage of true positives for the class in the cell's row. For example, in the validation matrix, 35% of the times we see the class Brushing Teeth, the network sees it as Shaving Beard. Moreover, we can note that by obtaining the trace of a confusion matrix (summation over the diagonal) and dividing by the number of classes, we retrieve the final accuracy (Figure 12).
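This trace-based shortcut can be checked with a toy row-normalized matrix (note it equals overall accuracy only when every class has the same number of samples, as in our balanced dataset):

```python
import numpy as np

def accuracy_from_confusion(cm_normalized: np.ndarray) -> float:
    """Trace of a row-normalized confusion matrix divided by the number
    of classes: the mean per-class recall, which equals overall accuracy
    for balanced classes."""
    return float(np.trace(cm_normalized) / cm_normalized.shape[0])

cm = np.array([[0.85, 0.15],    # each row (true class) sums to 1
               [0.35, 0.65]])
print(accuracy_from_confusion(cm))  # 0.75
```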
We easily notice what we determined before: the training set is overfitted, since for all diagonal cells the confusion matrix reports a 100% value (normalized between 0 and 1). On the other hand, when analyzing the validation matrix, we can see that the network mostly fails when the classes are very similar. For example:
• Apply Eye Makeup is confused 15% of the time with Apply Lipstick; since both use some kind of hand-held stick and cover zones of the face vertically close to each other, it is logical to think that they are more difficult to differentiate.
• Apply Eye Makeup and Shaving Beard follow a similar error pattern, since in both actions there is hand movement over the zone of the mouth and arm movement around the whole face.
In other cases, the contrary can happen, when the action is easily differentiable from others. This mostly happens with two actions: Mopping Floor, which usually happens in a room, and Cutting In Kitchen, where the camera focuses on the knife and the cutting table area (Figure 13).
Figure 10. Validation curves for 40 epochs, 60 iterations per epoch
4. CONCLUSION
In this work, we have focused our attention on different ways of accelerating the training and inference processes of a modern video-based action recognition pipeline. First, the use of the TSN framework, since it requires small amounts of data as input. Secondly, the use of MotionNet from the HTS work, in order to achieve real-time optical flow computation times and adapt its representation for action recognition. Third, the use of the recent NVVL for reducing the cost of I/O operations, saving storage space, and speeding up the whole pipeline by directly decoding videos on the GPU.
Figure 11. Effects of batch sizes when training
Figure 12. Confusion matrices for the proposed dataset
Figure 13. Class labels and network predictions: the first line is the correct label, the second line is the predicted one, green if correct or red if not
REFERENCES
Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017, October). Encouraging LSTMs to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV). 10.1109/TPAMI.2018.2868668

Bellman, R., & Kalaba, R. (1959). On adaptive control processes. I.R.E. Transactions on Automatic Control, 4(2), 1–9. doi:10.1109/TAC.1959.1104847

Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017, July). SST: Single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6373-6382). IEEE. 10.1109/CVPR.2017.675

Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679–698. doi:10.1109/TPAMI.1986.4767851

Carreira, J., & Zisserman, A. (2017, July). Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4724-4733). IEEE.

Casper, J., Barker, J., & Catanzaro, B. (2018). NVVL: NVIDIA Video Loader.

Chéron, G., Laptev, I., & Schmid, C. (2015). P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3218-3226). 10.1109/ICCV.2015.368

Dave, A., Russakovsky, O., & Ramanan, D. (2017, April). Predictive-corrective networks for action detection. In Proceedings of the Computer Vision and Pattern Recognition.

Duke, B. (2018). Lintel: Python video decoding.

Efros, A. A., Berg, A. C., Mori, G., & Malik, J. (2003, October). Recognizing action at a distance. In Proceedings of the IEEE International Conference on Computer Vision (p. 726). IEEE.

Feichtenhofer, C., Pinz, A., & Wildes, R. P. (2017, July). Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7445-7454). IEEE. 10.1109/CVPR.2017.787

Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1), 82–98. doi:10.1006/cviu.1998.0716

Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1-3), 185–203. doi:10.1016/0004-3702(81)90024-2

Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231. doi:10.1109/TPAMI.2012.59

Kong, Y., Tao, Z., & Fu, Y. (2017, July). Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1473-1481). 10.1109/CVPR.2017.390

Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017, July). Temporal convolutional networks for action segmentation and detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1003-1012). IEEE.

Mahmud, T., Hasan, M., & Roy-Chowdhury, A. K. (2017, October). Joint prediction of activity labels and starting times in untrimmed videos. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 5784-5793). IEEE. 10.1109/ICCV.2017.616

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.

Rhinehart, N., & Kitani, K. M. (2017, October). First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3696-3705). 10.1109/ICCV.2017.399

Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (Vol. 3, pp. 32-36). IEEE.
John A. Castro-Vargas is a PhD student at the University of Alicante. His areas of interest are: Robotics, DeepLearning, gesture recognition and action recognition. He has participated in the nationally funded projects “Multi-sensorial robotic system with dual manipulation for human-robot assistance tasks” and “COMBAHO: COMe BAck HOme system for enhancing autonomy of people with acquired brain injury and dependent on their integration into society”.
Sigurdsson, G. A., Divvala, S. K., Farhadi, A., & Gupta, A. (2017, July). Asynchronous temporal fields for action recognition (Vol. 6, p. 8). CVPR.

Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).

Singh, G., Saha, S., Sapienza, M., Torr, P., & Cuzzolin, F. (2017, October). Online real-time multiple spatiotemporal action localisation and prediction. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 3657-3666). IEEE. 10.1109/ICCV.2017.393

Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human action classes from videos in the wild. arXiv:1212.0402

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zeng, K. H., Shen, W. B., Huang, D. A., Sun, M., & Niebles, J. C. (2017, August). Visual forecasting by imitating dynamics in natural sequences. In IEEE International Conference on Computer Vision (ICCV) (Vol. 2). 10.1109/ICCV.2017.326

Zhu, Y., Lan, Z., Newsam, S., & Hauptmann, A. G. (2017). Hidden two-stream convolutional networks for action recognition. arXiv:1704.00389
ENDNOTES

1. https://github.com/NVIDIA/nvvl/tree/master/pytorch/test
2. http://crcv.ucf.edu/data/UCF101.php