Data Integration Tasks on Heterogeneous Systems Using OpenCL · Execution time of breadth first...
Transcript of Data Integration Tasks on Heterogeneous Systems Using OpenCL · Execution time of breadth first...
Threeseparatearchitectureplatforms(CPU/GPU/FPGA)areusedtoevaluateOpenCL’sperformancewhenperformingdataintegrationtasks.BoththeGPUandCPU Implementsomeformofmulti-workitemexecutionwhiletheFPGAimplementsadeeplypipelinedexecutionstyleusingasingleworkitem.
DataIntegrationTasksonHeterogeneousSystemsUsingOpenCL
ClaytonFaber* AnthonyCabrera* Orondé Booker* GabeMaayan† RogerChamberlain**WashingtonUniv.inSt.Louis†RensselaerPolytechnicInst.
International Workshop on OpenCL, 2019. Boston, MA
Anoftenoverlookedpainpointofbigdataapplicationsistheneedofdatatransformationasapre-processingstep.DIBSBenchmark[1]characterizestheseapplications.WeutilizeOpenCLtoimplementasubsetoftheappsinDIBSonthreeseparateplatformsandevaluatetheirperformance.Wealsoevaluatesomeimplementationtuningparametersthatdonotaffectthefunctionalitybutinsteadtheperformanceoftheseapplications.
MOTIVATION
This work was funded in part by NSF grants CNS-1205721, CNS-1527510, CCF-1527692, and CNS-176350Special thanks to Intel for the Hardware Accelerator Research Program
RESULTS
CHARACTERISTICS
PLATFORMS
DISCUSSION• AlthoughweseepromisingspeedupscomparedtothereportedDIBSpaperbandwidth,performancegainsarenot
universalacrosstheboard.TheCSVparser(GoTrack →CSV)istheworstoffender.• TheFPGAandCPUactuallybeatouttheGPUimplementationwhenperformingtheFASTA→2Bittransformation.• WeattemptedtomakeadjustmentstotheFix→Floatapplicationwhenimplementingandfoundthattryingtoturn
onalloptimizationsnettedusworseperformance.• Furtherworkwillincludeimplementingdifferentstylesofexecutiononallplatformstoseeifperformanceimproves,
particularlyontheCSVparser.
l ThedataintegrationappsinDIBShaveanoverallappearanceofastreamingapplication,however,thereispotentialforcontentioninindividualapplications
l Fix→FloatandFASTA→2Bitcouldbeconsideredembarrassinglyparallel.
l GoTrack →CSVandIDX3→TIFFaremorenuancedaseachdatarecordismadeofmultipleelementsthatcanbedependentonotherelements.
l Applicationsthatrequiremorefinesseintheiroperationcancreateproblemsforcertaintypesofparallelarchitectureswherebarriersareofgreatconcern.
l Aworkitemmaybeablockofdatarecordsorasingledatarecorddependingontheapplication.
Executiontimeofbreadthfirstsearchontwitterdata.[2]
Inmulti-workitemexecutionsometypeofSIMDunitisimplementedinhardwareandmultipleworkitemsareexecutedinsequence.
Throughhighlevelsynthesis,asingleworkitemexecutionstylecanberealized.Withspecializedhardwareineachstageeachclockcycleproducesanoutputwhenthepipelineisfull.
CPU:[email protected]– AWSdedicatedinstance,SVMmemoryGPU:Nvidia GTX980Ti – OpenCL1.2(Usingcl_mem)FPGA:HARPv2(Arria 10insocketw/Cachecoherentbus),SVMmemory
[1] Cabrera, Faber, et al. 2018. DIBS: A Data Integration Benchmark Suite. ICPE '18. [2] Malicevic, Lepers et al. 2017. Everything you always wanted to know about multicore
graph processing but were afraid to ask. USENIX ATC ‘17
AdvertisedTime
Preprocessing