DOI: 10.4018/IJCVIP.2019040102
International Journal of Computer Vision and Image ProcessingVolume 9 • Issue 2 • April-June 2019
Copyright©2019,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited.
Accelerating Deep Action Recognition Networks for Real-Time Applications

David Ivorra-Piqueres, University of Alicante, Alicante, Spain
John Alejandro Castro Vargas, University of Alicante, Alicante, Spain
Pablo Martinez-Gonzalez, University of Alicante, Alicante, Spain
ABSTRACT
In this work, the authors propose several techniques for accelerating a modern action recognition pipeline. This article reviewed several recent and popular action recognition works and selected two of them as part of the tools used for improving the aforementioned acceleration. Specifically, temporal segment networks (TSN), a convolutional neural network (CNN) framework that makes use of a small number of video frames for obtaining robust predictions, which allowed it to win first place in the 2016 ActivityNet challenge, and MotionNet, a convolutional-transposed CNN that is capable of inferring optical flow from RGB frames. Together with the last proposal, this article integrated new software for decoding videos that takes advantage of NVIDIA GPUs. This article shows a proof of concept for this approach by training the RGB stream of the TSN network on videos loaded with NVIDIA Video Loader (NVVL), using a subset of daily actions from the University of Central Florida 101 dataset.
KEYWORDS

Action Recognition, Action Understanding, Deep Learning, GPU Acceleration, Machine Learning, Optical Flow, Real-Time, Recurrent Networks, Video Decoding
1. INTRODUCTION
Although in recent years the task of activity recognition has witnessed numerous breakthroughs thanks to the development of new methodologies and the rebirth of deep learning techniques, the natural course of events has not always been like this. For many years, despite being tackled from multiple perspectives, the problem of constructing a system that is capable of identifying which activity is being performed in a given scene was barely solved. In the state of the art we can find different approaches based on handcrafted traditional methods and machine learning approaches:
• Handcrafted features dominance. The first approximations were motivated by fundamental algorithms such as optical flow (Horn and Schunck, 1981), the Canny edge detector (Canny, 1986), the Hidden Markov Model (HMM) (Rabiner and Juang, 1986) or Dynamic Time Warping (DTW) (Bellman and Kalaba, 1959). Several of these methods have been reviewed in (Gavrila, 1999), for hand and whole-body movements, which can be used to obtain relevant information for the recognition of activities.
• Machine learning approaches. More modern methods use optical flow (Efros et al., 2003) to obtain temporal features over the sequences, in addition to using automatic learning algorithms such as Support Vector Machine (SVM) (Schüldt, Laptev and Caputo, 2004) to classify spatiotemporal features.
• Deep learning. CNNs allow obtaining robust visual features on 2D images (Chéron and Laptev, 2015), but more specifically their version adapted to work with data defined in three dimensions offers the ability to obtain spatial and temporal features when working with sequences of images. In this way, in addition to two spatial dimensions (height and width), we have a third dimension defined by time (frames) (Ji et al., 2013; Simonyan and Zisserman, 2014).
2. APPROACH
In this section we review the most modern action recognition works carried out in the past three years.

Online Inverse Reinforcement Learning (Rhinehart and Kitani, 2017) is a novel method for predicting future behaviors by modeling the interactions between the subject, objects, and their environment, through a first-person mounted camera. The system makes use of online inverse reinforcement learning, thus making it possible to continually discover new long-term goals and relationships. Also, with an approach similar to that of hybrid Siamese networks, it has been shown in (Mahmud, Hasan and Roy-Chowdhury, 2017) that it is possible to simultaneously predict future activity labels and their starting times. It does so by taking advantage of features of previously seen activities and objects currently present in the scene.
Thanks to the use of Single Shot multi-box Detector (SSD) CNNs, the system proposed in (Singh et al., 2017) is capable of predicting both action labels and their corresponding bounding boxes in real time (28 FPS). Moreover, it can detect more than one action at the same time. All of this is accomplished by using RGB image features combined with optical flow ones (with a decrease in the optical flow quality and global accuracy) extracted in real time for the creation of multiple action tubes.
In (Kong, Tao and Fu, 2017), for predicting action class labels before the action finishes, the authors make use of features extracted from fully observed videos processed at train time, to fill out the missing information present in the incomplete videos to predict. Furthermore, thanks to this approach their model obtains a great speedup when compared to similar methods.
A model that is capable of performing visual forecasting at different abstraction levels is presented in (Zeng et al., 2017). For example, the same model can be trained for future frame generation as well as for action anticipation. This is accomplished by following an inverse reinforcement learning approach. Also, the model is forced to imitate natural visual sequences from the pixel level.
The model developed in (René et al., 2017) is capable of predicting future activity labels in real time on RGB-D videos. This is accomplished by making use of soft regression, for jointly learning both the predictor model and the soft labels. Moreover, real-time performance (around 40 FPS) is obtained by including a novel RGB-D feature named Local Accumulative Frame Feature (LAFF). In addition, a TCN Encoder-Decoder system is built for performing the mentioned tasks. After training, it is able to surpass current similar approaches. Furthermore, the system presents better performance than Bidirectional Long Short-Term Memory (Bi-LSTM) networks.
In (Buch et al., 2017), a system is presented that is capable of producing temporal action proposals on a video with only one forward pass. Thus, there is no need to create overlapping temporal sliding windows. Moreover, the system can work with long untrimmed videos of arbitrary length in a continued fashion. Finally, by combining the system with action classifiers, temporal action detection performance is increased.
A new convolutional model is presented in (Carreira and Zisserman, 2017), known as the Two-Stream Inflated 3D convolutional neural network (I3D), which is used as a spatio-temporal feature extractor. After this, the authors pre-train I3D-based models on the Kinetics dataset, showing that with this approach, action classification performance on well-known datasets is noticeably increased.
In (Feichtenhofer, Pinz and Wildes, 2017), a fully space-time convolutional two-stream network (named STResNet) is proposed for the task of action recognition in videos. The first stream is fed with RGB data while the second, with optical flow features. The main particularity of this model is the existing interconnections between both streams. Moreover, for learning long-term relationships, identity mapping kernels are injected through the network. All of this allows the network to predict in a single forward pass.
New recurrent neural network approaches are presented in (Dave, Russakovsky and Ramanan, 2017), which are used for solving the problem of action detection in videos, obtaining satisfactory results. At its basis the model: (1) focuses on changes between frames, (2) predicts the future, and (3) makes corrections upon it by observing what truly happens next.
The authors of (Sigurdsson et al., 2017) propose a model that is capable of reasoning in detail about aspects of an activity; i.e., for each frame the model is capable of predicting the current activity, its action and object, the scene, and the temporal progress. This is accomplished by making use of Conditional Random Fields (CRFs) that are fed by CNN feature extractors. Moreover, to be able to train this system in an end-to-end manner, an asynchronous stochastic inference algorithm is developed.
In (Wang et al., 2017) the authors propose a CNN framework for the recognition of actions in videos, both trimmed and untrimmed, and in (Aliakbarian et al., 2017) a multi-stage Long Short-Term Memory (LSTM) architecture combined with a novel loss function is proposed, which is capable of predicting action class labels in videos, even when only the first frames of the sequence have been shown. The model takes advantage of action-aware and context-aware features to succeed in this task.
2.1. TSN Framework

Temporal Segment Networks (TSN) (Wang et al., 2017) is a CNN framework for the recognition of actions in videos, both trimmed and untrimmed. Along with it, a series of guidelines for properly initializing and operating such deep models for this task is proposed. The framework aims to tackle four common limitations when using convolutional neural networks on videos. First, the difficulty of using long-range frame sequences, due to high computational and memory space costs, which can lead to missing important temporal information. Second, most of the systems focus on trimmed videos instead of untrimmed ones (several actions may happen in a video); adapting to these would mean properly localizing actions and avoiding background (irrelevant) video parts. Third, while deep models become complex, many datasets are still small in number of samples and diversity, lacking enough data for properly training them while avoiding overfitting. Fourth, the time consumed by optical-flow extraction can become a delay both for using large-scale datasets and for using the model in real-time applications. Figure 1 shows a schematic view of such a network.
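The segment-sampling and average-consensus scheme just described can be sketched in a few lines of NumPy. The function names and the random scores below are our own stand-ins for the real CNN stream outputs, kept only to illustrate the idea:

```python
import numpy as np

def sample_snippet_indices(num_frames: int, num_segments: int,
                           rng: np.random.Generator) -> list:
    """Pick one random snippet (frame index) from each of the equal
    temporal segments that divide the video."""
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

def tsn_consensus(snippet_scores: np.ndarray) -> np.ndarray:
    """Average per-snippet class scores (segmental consensus), then apply
    softmax to obtain the video-level class probabilities."""
    consensus = snippet_scores.mean(axis=0)
    exp = np.exp(consensus - consensus.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
idx = sample_snippet_indices(num_frames=194, num_segments=3, rng=rng)
scores = rng.normal(size=(3, 101))              # stand-in CNN outputs, 101 classes
probs = tsn_consensus(scores)
print(idx, int(probs.argmax()))
```

In the real framework each snippet is processed by the spatial or temporal stream before the consensus; here the random `scores` matrix takes that place.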
2.2. GPU Video Decoding

Since the beginning of the modern deep learning era, data storing and loading times have always been a bottleneck in the pipeline. Although recently we are witnessing great speedups thanks to new hardware technologies like SSDs for storing, or data transferring devices (between CPU and GPU, and vice versa) such as NVLINK, the issue persists.
Many of the research areas where this problem is most aggravated are the ones which work with videos as the main dataset source. These include predictive learning, video understanding, question answering, activity recognition, and super-resolution networks, among many others.
The main approach when tackling this problem in those areas is to first extract all the frames for each video of the dataset, for example by using FFmpeg, and save them in a high-quality image format, rather than one with possible lossy compression and artifact generation, in order to properly train the network. This comes with an increasing need for storage space, since the more information we are willing to keep, the larger in size our converted image dataset will be.
In Figure 2 we can see the effects of storing the University of Central Florida 101 (UCF101) dataset (Soomro, Zamir and Shah, 2012), composed of only 13,320 videos, in different formats. For the case of JPEG (images), it occupies 63 GiB, while in AVI format (video) it occupies 9.25 times less, 6.8 GiB. If it is transformed to the proper MP4 format needed by NVIDIA Video Loader (NVVL) (Casper, Barker and Catanzaro, 2018), with the corresponding number of frames, it occupies 14.2 GiB, still 4.44 times less. If we take this to a fine-grained level, such as frames, we can see that the storage differs by a large margin.
In order to alleviate this problem, a useful solution is to directly load video files into memory, decode the necessary frames, prepare them, and finally feed them to the network. Actually, APIs that can manage the first two steps exist: the FFmpeg libraries themselves, and higher-abstraction modules like PyAV or PIMS, which both load data into the CPU. On the other hand, the (beta) Hwang project also supports NVIDIA GPUs through the use of their specific hardware decoder units. Furthermore, libraries designed with machine learning tasks in mind, which can provide all the mentioned steps, have recently been developed. Two are currently known: Lintel (Duke, 2018) and NVVL (Casper, Barker and Catanzaro, 2018). The first focuses on CPU loading (it uses FFmpeg as backend), while the second targets GPU devices. Indeed, although written in C++, it offers off-the-shelf PyTorch modules (dataset and loader). Moreover, another wrapper for CuPy arrays has been created.
Figure 1. Representation of the TSN framework. First, a snippet is extracted from each of a fixed number of segments that equally divide the video. Then, features such as optical flow or RGB-diff (top and bottom images of the second process column) are extracted. After passing through the corresponding stream, an aggregation function joins the individual snippet class probabilities. Then, softmax is applied to obtain the final video action class.
Figure 2. Storage comparison between frames and video formats for the UCF101 dataset
Regarding performance, NVVL reduces the I/O processing times by a large margin, as can be appreciated in Figure 3. More benchmarks that take into account memory usage and CPU load can also be found in the blog post, while an even more detailed evaluation is located on GitHub¹. Regarding data, loading behaves like a sliding window of stride one, where frame sequences of a previously fixed length are subsequently loaded and returned as a single tensor. On the other hand, we can apply different transformations to these sequences: data type (float, half, or byte), width and height resizing and scaling, random cropping and flipping, normalizing, color space (RGB or YCbCr: Y, luminance; Cb, chrominance-blue; Cr, chrominance-red), and frame index mapping. For performance, flexibility, and completeness reasons, we decided to use NVVL as our main tool to accelerate the TSN framework.
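The stride-one sliding-window behavior can be illustrated with a small NumPy sketch; the function below is our own illustration, not part of the NVVL API:

```python
import numpy as np

def sliding_sequences(frames: np.ndarray, seq_len: int) -> np.ndarray:
    """Return all overlapping windows of `seq_len` frames (stride one),
    stacked into a single tensor of shape
    (T - seq_len + 1, seq_len, C, H, W)."""
    t = frames.shape[0]
    return np.stack([frames[i:i + seq_len] for i in range(t - seq_len + 1)])

frames = np.zeros((10, 3, 4, 4), dtype=np.float32)  # 10 dummy RGB frames
windows = sliding_sequences(frames, seq_len=4)
print(windows.shape)  # (7, 4, 3, 4, 4)
```

NVVL performs the equivalent operation during decoding, so the overlapping sequences never need to be materialized on disk.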
2.3. Hidden Two-Stream Convolutional Networks for Action Recognition

Another approach to the question of real-time action recognition can be found in (Zhu et al., 2017), where the use of a convolutional network for automatically computing optical flow is presented.
In more detail, in a first phase, a CNN denoted MotionNet is trained in an unsupervised manner for the task of optical-flow estimation. After obtaining acceptable results, similar to optimal traditional methods, the network is attached to a conventional CNN as the temporal stream part of the whole model, the spatial stream being similar in architecture to the other one. Then, the network is trained (including MotionNet) on the task of action recognition from frame sequences. The approach enables the optical flow generator to be adapted to the characteristics of the task, further finding a suitable motion representation.
3. EXPERIMENTATION
In this project we have experimented with the discussed approaches using the UCF101 dataset.
Figure 3. Average loading time (milliseconds) that 32-bit floating point PyTorch tensors take to be available in the GPU. The experiment was run on an NVIDIA V100 GPU over one epoch with batches of size 8. Figure extracted from (Casper, Barker and Catanzaro, 2018).
3.1. Dataset (Soomro, Zamir and Shah, 2012)

Given the limited number of RGB action datasets that included realistic scenes (without actors or prepared environments) and a wide range of classes until 2012, the authors of this paper proposed a new large-scale dataset of user-uploaded (YouTube) videos. These present much more diverse challenges than those of previous datasets, since recordings can contain different lighting configurations, image quality degradation, cluttering, camera movement, and occluded scenes.
In regard to the size of the dataset, 13,320 videos are divided into 101 classes that cover five action groups: Human-Human Interaction, Sports, Playing Musical Instruments, Human-Object Interaction, and Body-Motion Only. The actions contained in the first and fourth groups can be observed in Figure 4².
Furthermore, this dataset marked a milestone in what large-scale action recognition datasets are concerned. It made it possible to establish a well-known starting testbed to be improved upon, as well as for benchmarking. Moreover, deep learning competitions were established around it, such as the different modalities of the THUMOS Challenge, which was run for three years in a row. After that, other large-scale datasets appeared, expanding the characteristics of UCF101; the dataset is thus also noteworthy for marking the start of an ever-growing number of diverse large-scale action recognition datasets.
3.2. GPU Video Decoding Experiments

In order to test our GPU video decoding pipeline, we can compare the difference between the original frames and the ones loaded through NVVL. For this task, we are going to use the Structural Similarity (SSIM) index between two pictures, usually used in the video industry for measuring the visual difference we can perceive when comparing frames of an original and a downsampled video. It ranges from 0 to 1, where 1 is given for two identical pictures and 0 for two completely different ones. For example, given the two frames obtained from the UCF101 dataset (Diving class) that can be observed in Figure 5, we can notice a green band on the right extreme of the NVVL-loaded image.
Apart from this, we cannot perceive any other substantial degradation in quality. Indeed, the SSIM obtained is 0.992, indicating that this artifact is probably due to a bug rather than a low-quality video processor. To confirm this, we can compute the SSIM heatmap, in order to locate other possibly missed artifacts (Figure 6).
Thus, we can assume that there will be no harm in incorporating this tool into a neural network training pipeline.
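As an illustration, an SSIM comparison of this kind (including the full heatmap like the one in Figure 6) can be computed with scikit-image's `structural_similarity`. The synthetic frames and the simulated band artifact below are placeholders for the real decoded frames:

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

identical = original.copy()
corrupted = original.copy()
corrupted[:, -4:] = 0           # simulate a band artifact on the right edge

s_same = structural_similarity(original, identical)          # 1.0 for identical frames
score, heatmap = structural_similarity(original, corrupted, full=True)
print(round(float(s_same), 3), round(float(score), 3), heatmap.shape)
```

The `full=True` flag returns the per-pixel similarity map, which, plotted as an image, localizes artifacts just as in our heatmap figure.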
Figure 4. Classes for the Human-Object Interaction (blue) and Body-Motion Only (red) action groups from the UCF101 dataset. Figure extracted from (Soomro, Zamir and Shah, 2012).
Now, we should pay attention to determining the current time speedup we can obtain from replacing the image loading system of the TSN framework with an NVVL pipeline. For this, after adapting the frame-index generation functions and integrating the video loader into the framework, we can perform the following:
1. Obtain a list of videos, and get the total number of videos and the mean number of frames per video.
2. Extract all the frames from the videos, and also convert them into the required NVVL video format.
3. Select the number of frames per video that are going to be loaded. For NVVL, all the frames have to be loaded.
Figure 5. Original frame (left) and NVVL-obtained frame (right). The frames pertain to a sample of the Diving class in the UCF101 dataset.
Figure 6. Heat map of the above frames; the lighter the color, the closer each pixel is to the original frame
4. Measure how much time it takes to extract the selected number of frames (into the GPU) on each occasion. For NVVL this only needs to be done once.
5. Obtain mean times and the trend for the previous process and compare the results.
Step1Wewillusethefirst450videosoftheUCFsplit-1trainlistobtainedwiththedatatoolsprovided
bytheTSNframework.Thislistisformattedwitharowforeachvideo.Ineachrow,thepathtothevideo,thenumberofframesthevideohasanditsclassindex.Duetothis,thetotalnumberofframescanbeobtainedjustbysummingthesecondelementofeachrowoverthewholelist.Theresultingnumberis87501.So,themeannumberofframesinavideoisapproximately194.
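The counting in this step can be sketched as follows; the miniature list file and its rows below are hypothetical, but follow the same "path frames class" format:

```python
import os
import tempfile
from pathlib import Path

def frame_stats(list_file: str, num_videos: int = 450):
    """Sum the per-video frame counts from a TSN-style split list
    ('path n_frames class_index' per row) and return (total, mean)."""
    rows = Path(list_file).read_text().splitlines()[:num_videos]
    counts = [int(row.split()[1]) for row in rows if row.strip()]
    return sum(counts), sum(counts) / len(counts)

# Hypothetical miniature list in the same three-column format:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("v_Diving_g01_c01 180 25\nv_Typing_g02_c03 210 94\n")
total, mean = frame_stats(f.name, num_videos=2)
print(total, mean)  # 390 195.0
os.unlink(f.name)
```

Applied to the real split list, `total` would be 87,501 and `mean` approximately 194.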
Step2Forcompletingthisstep,wecansimplyfollowtheinstructionsandcommandsprovidedinthe
repositoriesofeachproject.Wehavetotakeintoaccountthattheextractionprocesscantakeaquitegreateramountoftimethanthevideoconvertingprocess.
Step3Inthiscase,wearegoingtoloadevennumberofframes,startingfrom3andfinishingin25,a
totalof12differentinstances.ThishasbeenselectedsincetheauthorsofTSNtestthemodelwith3,5,7,and9framespervideo.
Step4Forobtaininganaccuratemeasurement,wearegoingtorepeateachexecution29times.For
computingthetimewehaveusedPythontime.timefunction.Also,inordertofreealltheresourcesineachrun,wearegoingtoloopinsideabashscriptinsteadofinsidethePythonexecutingscriptitself,thushavingtheprocesskilledautomatically.
Step5Inthisstepwecomputedthemeanvaluesforeachnumberofframes.Thetimetakenforloading
allthevideoswithNVVLisapproximately24.18seconds.Ontheotherhand,wecanplottheresultsobtainedfromloadingsoleframes:
We can notice that the trend follows linear growth with respect to the number of frames loaded. Since we computed the equation defining the trend line (shown in the lower-right part of Figure 7), we can obtain a more precise approximation of the speedup achieved when using NVVL. For this, since the number of frames loaded with NVVL is the same as the mean number of frames obtained in Step 1, we just need to substitute it into the equation (X variable), obtaining a mean value of approximately 458.74 seconds, or 7.65 minutes. We have achieved an improvement in loading time performance, leading to an 18.97 times speedup when using NVVL.
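The trend-line substitution can be reproduced with `numpy.polyfit`. The timing samples below are illustrative stand-ins for the measured means (chosen only to be of the same order as the reported trend), not the raw experimental data:

```python
import numpy as np

# Illustrative (frames loaded, mean seconds) measurements -- stand-in data.
frames = np.array([3, 5, 7, 9, 11, 13])
seconds = np.array([7.2, 11.9, 16.6, 21.4, 26.0, 30.8])

slope, intercept = np.polyfit(frames, seconds, deg=1)  # linear trend line
mean_frames = 194                                      # mean frames/video (Step 1)
frame_loader_time = slope * mean_frames + intercept    # substitute X = 194
nvvl_time = 24.18                                      # measured NVVL total (s)
print(f"speedup ~ {frame_loader_time / nvvl_time:.2f}x")
```

Substituting the mean number of frames into the fitted line and dividing by the NVVL time yields the reported order of speedup.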
3.3. Training RGB TSN+HTS with NVVL

So far we have shown how useful incorporating NVVL into a video-consuming deep learning pipeline can be: it allows us to reduce both the storage and data transfer costs while not suffering any degradation in image quality. Now, what remains is to incorporate this tool into a common action recognition scenario, where we train and test a network to learn to categorize human actions.
Such a network is going to be TSN, since it has demonstrated superior performance in the task at hand. Moreover, we propose to make use of the converted HTS Caffe model and weights, in order to avoid pre-computing the optical flow and to be able to use NVVL in this stream as well, focusing the resulting pipeline on real-time applications. Because of the dataset we are going to use, the memory limitations detailed below, and time constraints, we are going to focus the following experiment only on the RGB stream.
Before starting, we need to prepare the data in a format that is compatible with NVVL. As stated in the GitHub repository¹, we need videos with either the H.264 or HEVC (H.265) codec and yuv420p pixel format; they can be in any container that the FFmpeg parser supports.
Moreover, we have to take into account the number of keyframes each video will have; i.e., a codec only stores a subset of all the frames that we see in a video: these are the keyframes. At decoding time, the rest of the frames are obtained by algorithmically inferring them from the keyframes. For this reason, when loading sequences that can start and end at any frame (similar to what we can do with NVVL), the system has to seek the nearest keyframe, which can be far before or after the starting frame. This can result in an underperforming execution, and for this reason, when converting the videos, we have to indicate the keyframe frequency we want to have (Figure 8).
The developers of the video loader suggest setting one keyframe at intervals that correspond to the length of the sequences we are going to load. For example, if we are going to load sequences of length 7, then every 7 frames there will be a keyframe. Furthermore, they also provide the required commands to carry out this conversion with FFmpeg.
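A conversion command of this kind can be sketched as follows (built here in Python for clarity; the exact flags the NVVL authors recommend may differ). With libx264, `-g` sets the GOP size, i.e. the keyframe interval, and `-sc_threshold 0` stops FFmpeg from inserting extra scene-cut keyframes:

```python
def nvvl_convert_cmd(src: str, dst: str, seq_len: int) -> list:
    """Build an FFmpeg command producing an NVVL-compatible video:
    H.264, yuv420p, one keyframe every `seq_len` frames."""
    return ["ffmpeg", "-i", src,
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-g", str(seq_len), "-sc_threshold", "0",
            dst]

# Hypothetical file names; run with subprocess.run(cmd, check=True).
cmd = nvvl_convert_cmd("v_Diving_g01_c01.avi", "v_Diving_g01_c01.mp4", seq_len=7)
print(" ".join(cmd))
```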
For our case, we are going to set every frame in the video to be a keyframe. This is due to the fact that currently the PyTorch wrapper (the C++ API seems more flexible) is intended for loading multiple frame sequences for each video with a sliding window approach of a fixed length. Although this length could be equal to the number of frames in the video, thus loading only one sequence per video, this would only work if all the videos had the same length, since this parameter, the sequence length, is global for the whole dataset.
For iterating over the dataset, we are going to use the data loader provided by the NVVL PyTorch wrapper, where in each iteration it will load a batch of frame sequences. Since now each sequence has length one, we need to set the batch size to one as well. In this way we can easily know when the loader has fully output a video, add it to a list, and, when we have enough videos, group them in a batch of the size we want to provide to the network. Furthermore, to accomplish this we also need to set the shuffle option of the loader to false.
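The re-batching logic described above can be sketched generically; `video_loader` below is a stand-in iterable yielding one fully assembled video at a time, not the actual NVVL wrapper object:

```python
def batch_videos(video_loader, batch_size: int):
    """Group whole videos (each yielded by `video_loader`) into batches
    of the size the network expects, flushing the final partial batch."""
    batch = []
    for video in video_loader:          # one fully assembled video at a time
        batch.append(video)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                           # final, possibly smaller batch
        yield batch

videos = [f"video_{i}" for i in range(7)]   # placeholder for decoded videos
batches = list(batch_videos(iter(videos), batch_size=4))
print([len(b) for b in batches])  # [4, 3]
```

In the real pipeline, each `video` would be the list of length-one frame sequences accumulated from the loader before being stacked into a tensor.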
Although we are ready for training our network, an impediment arises at the time of writing this work. Whether because the videos have not been properly converted, or because of a code issue, the data loader
Figure 7. Mean loading time in seconds for each number of frames executed (blue). Trend line from the obtained data (red). The Y axis represents the loading time in seconds, while the X axis shows the number of frames used.
seems to get silently stuck when loading some videos. To solve this, one workaround is to create a loader for each video instead of having one for the whole dataset.
So far this works, but what happens next is that GPU memory is not properly freed, thus limiting the size of our dataset to the space available on the graphics card at the moment. For Asimov, this results in having around 240 videos for training and 160 for validation (only the Titan card supports NVVL).
For this reason, following the same lines of motivation proposed at the beginning of the document, we are going to select daily actions for the reduced dataset we can work with. Specifically, it is composed of eight classes from the Human-Object Interaction group of the UCF101 dataset: Apply Eye Makeup, Apply Lipstick, Blow Dry Hair, Brushing Teeth, Cutting In Kitchen, Mopping Floor, Shaving Beard, and Typing. The training set contains 30 videos for each action, while the validation one has 20 of them.
Regarding the training hyper-parameters, we are going to use the ones set by default for TSN, with the only exceptions being the batch size and the number of epochs. For the former, we have set it to 4 due to the limited memory; for the latter, we will perform 40 epochs, which is enough for the model to converge on this dataset. For the metrics, we will keep track of the loss and the top-1 and top-5 accuracies for both the training and validation sets.
Figure 8. Representation of how keyframes can be evenly inserted into a video stream
Once all the epochs have been completed, we finish our training in a common situation: 100% top-1 (and top-5) accuracy and zero loss. Clearly, our network has overfitted. This is due to the scarce amount of data and its limited variability. Moreover, such deep networks (BN-Inception) are prone to overfit, since they have more flexibility (a bigger number of parameters) for adjusting to the data they are consuming while training. However, we obtain validation accuracies of 76.25% and 98.125% for the top-1 and top-5 versions respectively, with a loss of approximately 1.52. Taking into account the data we are working with, these results are quite promising in comparison to what has happened in the training phase.
Now, we can get more insights by looking at how the training and validation have evolved (Figure 9).
Here we can take note of two facts. First, the top-5 accuracy converges much faster (at approximately iteration 200) than the top-1 accuracy (at approximately iteration 800). Clearly, this is something that can be foreseen, since it takes more time to learn the exact label of a video than to place it among the five best guesses.
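For reference, the top-k accuracy tracked here simply checks whether the true label falls among the k highest-scoring classes; a minimal NumPy version (the toy scores and labels are ours):

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

scores = np.array([[0.1, 0.5, 0.2, 0.2],
                   [0.6, 0.1, 0.2, 0.1],
                   [0.2, 0.2, 0.3, 0.3]])
labels = np.array([1, 2, 0])
print(topk_accuracy(scores, labels, k=1), topk_accuracy(scores, labels, k=2))
```

By construction top-5 accuracy is always at least as high as top-1, which is why its curve saturates earlier.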
Secondly, we see that the unsmoothed curve (shaded red) bounces between higher and lower values (accuracy and loss) among the first iterations. This effect can be better seen in the validation curves (Figure 10).
This happens as a consequence of the small batch size we have previously set. The smaller the batch size, the more weight updates we will perform. If it is too small, we could encounter the following:
• Instability: The frequent updates will cause the metrics to wander, going continually up and down.
• Non-meaningful updates: The reduced number of samples means each update contains less information about the error (negative gradient) direction, thus needing a greater number of epochs to converge to the same accuracy as with a bigger batch size. This can be summarized as longer training times.
• Hitting a local minimum: Also known as a plateau, and commonly induced by the previous points, a small batch size can make the network get stuck in a poor local minimum of the loss function (neither the global optimum nor a good sub-optimum), obtaining insufficient performance results.
As intuition, we can take a look at Figure 11, where on the left, the evolution of three types of batch-size loss curves (arriving at the minimum) is plotted. The blue one represents a batch
Figure 9. Training curves for 40 epochs, 60 iterations per epoch
of the same size as the dataset, thus making only one update per epoch: a smoother curve with a much less noisy evolution. Although it seems the best approach, the catch is in the time and space it takes to update the weights: since we have a large number of samples, we have to compute a vast number of operations. Moreover, it is commonly impracticable for a complete dataset to fit into a modern GPU's memory.
The purple curve is for the case where we perform one-sample updates, something that reflects, in the extreme, what happened during the training of our network. Finally, the green curve shows the everyday situation of most deep learning trainings, where the batch size is found in balance with the number of updates per epoch. Although there are frequent updates, they are not frequent enough to trigger divergence, while at the same time a reasonable amount of time is taken for computing the error.
In order to better visualize where the network guesses right or wrong, we can make use of the training and validation confusion matrices, where in each cell we can see the percentage of true positives for the class in the cell's row. For example, in the validation matrix, 35% of the times we see the class Brushing Teeth, the network sees it as Shaving Beard. Moreover, we can note that by obtaining the trace of a confusion matrix (summation over the diagonal) and dividing by the number of classes, we retrieve the final accuracy (Figure 12).
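This trace-based shortcut can be checked with a toy row-normalized matrix (note it equals overall accuracy only when every class has the same number of samples, as in our balanced dataset):

```python
import numpy as np

def accuracy_from_confusion(cm_normalized: np.ndarray) -> float:
    """Trace of a row-normalized confusion matrix divided by the number
    of classes: the mean per-class recall, which equals overall accuracy
    for balanced classes."""
    return float(np.trace(cm_normalized) / cm_normalized.shape[0])

cm = np.array([[0.85, 0.15],    # each row (true class) sums to 1
               [0.35, 0.65]])
print(accuracy_from_confusion(cm))  # 0.75
```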
We easily notice what we determined before: the training set is overfitted, since for all diagonal cells the confusion matrix reports a 100% value (normalized between 0 and 1). On the other hand, when analyzing the validation matrix, we can see that the network mostly fails when the classes are very similar. For example:
• Apply Eye Makeup is confused 15% of the time with Apply Lipstick; since both use some kind of hand-held stick and cover zones of the face vertically close to each other, it is logical to think that they are more difficult to differentiate.
• Apply Eye Makeup and Shaving Beard follow a similar error pattern, since in both actions there is hand movement over the zone of the mouth and arm movement around the whole face.
In other cases, the contrary can happen, when the action is easily differentiable from others. This mostly happens with two actions: Mopping Floor, which usually happens in a room, and Cutting In Kitchen, where the camera focuses on the knife and the cutting table area (Figure 13).
Figure 10. Validation curves for 40 epochs, 60 iterations per epoch
4. CONCLUSION
In this work, we have focused our attention on different ways of accelerating the training and inference processes of a modern video-based action recognition pipeline. First, the use of the TSN framework, since it requires small amounts of data as input. Secondly, the use of MotionNet from the HTS work, in order to achieve real-time optical flow computation times and adapt its representation for action recognition. Third, the use of the recent NVVL for reducing the cost of I/O operations, saving storage space, and speeding up the whole pipeline by directly decoding videos on the GPU.
Figure 11. Effects of batch sizes when training
Figure 12. Confusion matrices for the proposed dataset
Figure 13. Class labels and network predictions: the first line is the correct label, the second line is the predicted one, green if correct or red if not
REFERENCES
Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017, October). Encouraging LSTMs to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV). 10.1109/TPAMI.2018.2868668

Bellman, R., & Kalaba, R. (1959). On adaptive control processes. I.R.E. Transactions on Automatic Control, 4(2), 1–9. doi:10.1109/TAC.1959.1104847

Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017, July). SST: Single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6373-6382). IEEE. 10.1109/CVPR.2017.675

Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679–698. doi:10.1109/TPAMI.1986.4767851

Carreira, J., & Zisserman, A. (2017, July). Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4724-4733). IEEE.

Casper, J., Barker, J., & Catanzaro, B. (2018). NVVL: NVIDIA Video Loader.

Chéron, G., Laptev, I., & Schmid, C. (2015). P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3218-3226). 10.1109/ICCV.2015.368

Dave, A., Russakovsky, O., & Ramanan, D. (2017, April). Predictive-corrective networks for action detection. In Proceedings of the Computer Vision and Pattern Recognition.

Duke, B. (2018). Lintel: Python video decoding.

Efros, A. A., Berg, A. C., Mori, G., & Malik, J. (2003, October). Recognizing action at a distance. In Proceedings of the IEEE International Conference on Computer Vision (p. 726). IEEE.

Feichtenhofer, C., Pinz, A., & Wildes, R. P. (2017, July). Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7445-7454). IEEE. 10.1109/CVPR.2017.787

Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1), 82–98. doi:10.1006/cviu.1998.0716

Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1-3), 185–203. doi:10.1016/0004-3702(81)90024-2

Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231. doi:10.1109/TPAMI.2012.59

Kong, Y., Tao, Z., & Fu, Y. (2017, July). Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1473-1481). 10.1109/CVPR.2017.390

Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017, July). Temporal convolutional networks for action segmentation and detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1003-1012). IEEE.

Mahmud, T., Hasan, M., & Roy-Chowdhury, A. K. (2017, October). Joint prediction of activity labels and starting times in untrimmed videos. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 5784-5793). IEEE. 10.1109/ICCV.2017.616

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.

Rhinehart, N., & Kitani, K. M. (2017, October). First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3696-3705). 10.1109/ICCV.2017.399

Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (Vol. 3, pp. 32-36). IEEE.
John A. Castro-Vargas is a PhD student at the University of Alicante. His areas of interest are: Robotics, DeepLearning, gesture recognition and action recognition. He has participated in the nationally funded projects “Multi-sensorial robotic system with dual manipulation for human-robot assistance tasks” and “COMBAHO: COMe BAck HOme system for enhancing autonomy of people with acquired brain injury and dependent on their integration into society”.
Sigurdsson, G. A., Divvala, S. K., Farhadi, A., & Gupta, A. (2017, July). Asynchronous temporal fields for action recognition (Vol. 6, p. 8). CVPR.

Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).

Singh, G., Saha, S., Sapienza, M., Torr, P., & Cuzzolin, F. (2017, October). Online real-time multiple spatiotemporal action localisation and prediction. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 3657-3666). IEEE. 10.1109/ICCV.2017.393

Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human action classes from videos in the wild. arXiv:1212.0402

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zeng, K. H., Shen, W. B., Huang, D. A., Sun, M., & Niebles, J. C. (2017, August). Visual forecasting by imitating dynamics in natural sequences. In IEEE International Conference on Computer Vision (ICCV) (Vol. 2). 10.1109/ICCV.2017.326

Zhu, Y., Lan, Z., Newsam, S., & Hauptmann, A. G. (2017). Hidden two-stream convolutional networks for action recognition. arXiv:1704.00389
ENDNOTES

1. https://github.com/NVIDIA/nvvl/tree/master/pytorch/test
2. http://crcv.ucf.edu/data/UCF101.php