Big Data Benchmarking: Applications and Systems
Digital Science Center
Big Data Benchmarking: Applications and Systems
Geoffrey Fox, December 10, 2018
2018 International Symposium on Benchmarking, Measuring and Optimizing (Bench’18) at IEEE BigData 2018
Dec 10 - Dec 11, 2018 @ Seattle, WA, USA
Digital Science Center, Indiana University
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Big Data and Extreme-scale Computing http://www.exascale.org/bdec/
• BDEC Pathways to Convergence Report
• New series BDEC2 “Common Digital Continuum Platform for Big Data and Extreme Scale Computing” with first meeting November 28-30, 2018 Bloomington Indiana USA (focus on applications).
• Working groups on platform (technology), applications, community building
• BigDataBench presented a white paper
• Next meetings: February 19-21 Kobe, Japan (focus on platform) followed by two in Europe, one in USA and one in China.
http://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/whitepapers/bdec2017pathways.pdf
Benchmarks should mimic Use Cases? Need to collect use cases?
Can classify use cases and benchmarks along several different dimensions
Software: MIDAS HPC-ABDS
NSF 1443054: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science
Ogres Application Analysis
HPC-ABDS and HPC-FaaS Software
Harp and Twister2 Building Blocks
SPIDAL Data Analytics Library
My view of System GAIMSC
Systems Challenges for GAIMSC
• Microsoft noted we are collectively building the Global AI Supercomputer.
• Generalize by adding modeling
• Architecture of the Global AI and Modeling Supercomputer GAIMSC must reflect
• Global captures the need to mashup services from many different sources;
• AI captures the incredible progress in machine learning (ML);
• Modeling captures both traditional large-scale simulations and the models and digital twins needed for data interpretation;
• Supercomputer captures that everything is huge and needs to be done quickly and often in real time for streaming applications.
• The GAIMSC includes an intelligent HPC cloud linked via an intelligent HPC Fog to an intelligent HPC edge. We consider this distributed environment as a set of computational and data-intensive nuggets swimming in an intelligent aether.
• We will use a dataflow graph to define a mesh in the aether
Global AI and Modeling Supercomputer GAIMSC
• There is only a cloud at the logical center but it’s physically distributed and owned by a few major players
• There is a very distributed set of devices surrounded by local Fog computing; this forms the logically and physically distributed edge
• The edge is structured and largely data
• These are two differences from the Grid of the past
• e.g. self driving car will have its own fog and will not share fog with truck that it is about to collide with
• The cloud and edge will both be very heterogeneous with varying accelerators, memory size and disk structure.
• GAIMSC requires parallel computing to achieve high performance on large ML and simulation nuggets and distributed system technology to build the aether and support the distributed but connected nuggets.
• In the latter respect, the intelligent aether mimics a grid but it is a data grid where there are computations but typically those associated with data (often from edge devices).
• So unlike the distributed simulation supercomputer that was often studied in previous grids, GAIMSC is a supercomputer aimed at very different data intensive AI-enriched problems.
GAIMSC Global AI & Modeling Supercomputer Questions
• What do we gain from the concept? e.g. Ability to work with Big Data community
• What do we lose from the concept? e.g. everything runs as slow as Spark
• Is GAIMSC useful for BDEC2 initiative? For NSF? For DoE? For Universities? For Industry? For users?
• Does adding modeling to concept add value?
• What are the research issues for GAIMSC? e.g. how to program?
• What can we do with GAIMSC that we couldn’t do with classic Big Data technologies?
• What can we do with GAIMSC that we couldn’t do with classic HPC technologies?
• Are there deep or important issues associated with the “Global” in GAIMSC?
• Is the concept of an auto-tuned Global AI and Modeling Supercomputer scary?
Integration of Data and Model functions with ML wrappers in GAIMSC
• There is a rapid increase in the integration of ML and simulations.
• ML can analyze results, guide the execution and set up initial configurations (auto-tuning). This is equally true for AI itself: the GAIMSC will use itself to optimize its execution for both analytics and simulations.
• In principle every transfer of control (job or function invocation, a link from device to the fog/cloud) should pass through an AI wrapper that learns from each call and can decide both if call needs to be executed (maybe we have learned the answer already and need not compute it) and how to optimize the call if it really needs to be executed.
• The digital continuum proposed by BDEC2 is an intelligent aether learning from and informing the interconnected computational actions that are embedded in the aether.
• Implementing the intelligent aether embracing and extending the edge, fog, and cloud is a major research challenge where bold new ideas are needed!
• We need to understand how to make it easy to automatically wrap every nugget with ML.
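The wrapper idea in the bullets above can be illustrated with a minimal sketch. It is not the GAIMSC design: the class name `LearningWrapper` and the function `expensive_simulation` are hypothetical, and an exact-match cache stands in for the learned model that a real AI wrapper would presumably use (for example a trained surrogate that also answers nearby inputs).

```python
from typing import Callable, Dict, Hashable, Tuple


class LearningWrapper:
    """Hypothetical wrapper: records each call and skips calls it has already seen."""

    def __init__(self, nugget: Callable[..., float]) -> None:
        self.nugget = nugget
        self.learned: Dict[Tuple[Hashable, ...], float] = {}
        self.executed = 0  # calls that actually ran the nugget
        self.skipped = 0   # calls answered from what was learned earlier

    def __call__(self, *args: Hashable) -> float:
        # Decide whether the call needs to be executed at all.
        if args in self.learned:
            self.skipped += 1
            return self.learned[args]
        # Otherwise execute the nugget and learn from the result.
        self.executed += 1
        result = self.nugget(*args)
        self.learned[args] = result
        return result


def expensive_simulation(x: float) -> float:
    """Stand-in for a costly simulation or analytics nugget."""
    return sum((x + i) ** 2 for i in range(1000))


wrapped = LearningWrapper(expensive_simulation)
first = wrapped(2.0)   # executed
second = wrapped(2.0)  # answered from the learned table, not recomputed
```

The same interface could also carry the second decision named in the slide, how to optimize a call that does need to run, by choosing among implementations inside `__call__`.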
Implementing the GAIMSC
• The new MIDAS middleware designed in SPIDAL has been engineered to support high-performance technologies and yet preserve the key features of the Apache Big Data Software.
• MIDAS seems well suited to build the prototype intelligent high-performance aether.
• Note this will mix many relatively small nuggets with AI wrappers generating parallelism from the number of nuggets and not internally to the nugget and its wrapper.
• However, there will be also large global jobs requiring internal parallelism for individual large-scale machine learning or simulation tasks.
• Thus parallel computing and distributed systems (grids) must be linked in a deep fashion although the key parallel computing ideas needed for ML are closely related to those already developed for simulations.
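As a rough illustration of parallelism that comes from the number of nuggets rather than from inside each nugget, the sketch below runs several small, internally sequential tasks concurrently. The function `small_nugget` and its inputs are hypothetical, and a thread pool is used only to keep the example short; in CPython, CPU-bound nuggets would more likely use `ProcessPoolExecutor` or a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def small_nugget(n: int) -> int:
    """Stand-in for one small task with no internal parallelism."""
    return sum(i * i for i in range(n))


nugget_inputs: List[int] = [10, 100, 1000, 10000]

# Parallelism is over the nuggets: each worker runs one whole nugget.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(small_nugget, nugget_inputs))
```

A large global job, by contrast, would need the parallelism inside the task itself, which is the case the last two bullets describe.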
Underlying HPC Big Data Convergence Issues
Data and Model in Big Data and Simulations I
• Need to discuss Data and Model as problems have both intermingled, but we can get insight by separating which allows better understanding of Big Data - Big Simulation “convergence” (or differences!)
• The Model is a user construction and it has a “concept”, parameters and gives results determined by the computation. We use term “model” in a general fashion to cover all of these.
• Big Data problems can be broken up into Data and Model
• For clustering, the model parameters are cluster centers while the data is set of points to be clustered
• For queries, the model is structure of database and results of this query while the data is whole database queried and SQL query
• For deep learning with ImageNet, the model is chosen network with model parameters as the network link weights. The data is set of images used for training or classification
Data and Model in Big Data and Simulations II
• Simulations can also be considered as Data plus Model
• Model can be formulation with particle dynamics or partial differential equations defined by parameters such as particle positions and discretized velocity, pressure, density values
• Data could be small when just boundary conditions
• Data large with data assimilation (weather forecasting) or when data visualizations are produced by simulation
• Big Data implies Data is large but Model varies in size
• e.g. LDA (Latent Dirichlet Allocation) with many topics or deep learning has a large model
• Clustering or Dimension reduction can be quite small in model size
• Data often static between iterations (unless streaming); Model parameters vary between iterations
• Data and Model Parameters are often confused in papers as term data used to describe the parameters of models.
• Models in Big Data and Simulations have many similarities and allow convergence
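The Data/Model split for clustering can be made concrete with a small sketch, assuming a plain k-means (Lloyd) iteration on 2D points; the slides do not prescribe a particular algorithm, and the names `nearest` and `kmeans` are illustrative. The point set is the Data and never changes, while the cluster centers are the Model parameters and are replaced on every iteration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: (point[0] - centers[k][0]) ** 2 + (point[1] - centers[k][1]) ** 2,
    )


def kmeans(data: Sequence[Point], centers: List[Point], iterations: int = 10) -> List[Point]:
    """Lloyd-style iteration: data is only read, the centers (model) are updated."""
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in data]
        updated: List[Point] = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(data, labels) if lab == k]
            if members:
                n = len(members)
                updated.append((sum(p[0] for p in members) / n,
                                sum(p[1] for p in members) / n))
            else:
                updated.append(old)  # keep the center of an empty cluster unchanged
        centers = updated
    return centers


data = ((0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0))  # static Data
model = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])             # evolving Model parameters
```

Here the model is two points regardless of how many data points there are, which matches the remark that clustering can be quite small in model size.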
Local and Global Machine Learning
• Many applications use LML or Local machine Learning where machine learning (often from R or Python or Matlab) is run separately on every data item such as on every image
• But others are GML Global Machine Learning where machine learning is a basic algorithm run over all data items (over all nodes in computer)
• maximum likelihood or χ² with a sum over the N data items: documents, sequences, items to be sold, images etc. and often links (point-pairs).
• GML includes Graph analytics, clustering/community detection, mixture models, topic determination, Multidimensional scaling, (Deep) Learning Networks
• Note Facebook may need lots of small graphs (one per person and ~LML) rather than one giant graph of connected people (GML)
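A minimal sketch of the LML/GML contrast, under assumed toy definitions: `local_ml` is a hypothetical per-item analysis that needs no communication, while `global_chi2` evaluates a χ² objective as a sum over all N data items, here for a one-parameter model (a constant `theta`). The data is split into partitions to mimic nodes, and the final sum plays the role of a global reduction.

```python
from typing import List, Sequence


def local_ml(item: Sequence[float]) -> str:
    """LML: analyze one item (e.g. one image) independently of all others."""
    return "bright" if sum(item) / len(item) > 0.5 else "dark"


def partial_chi2(partition: Sequence[float], theta: float) -> float:
    """Contribution of one node's data items to the global objective."""
    return sum((x - theta) ** 2 for x in partition)


def global_chi2(partitions: Sequence[Sequence[float]], theta: float) -> float:
    """GML: the objective is a sum over all N items, so partial sums are combined."""
    return sum(partial_chi2(part, theta) for part in partitions)


images: List[List[float]] = [[0.9, 0.8], [0.1, 0.2]]
labels = [local_ml(img) for img in images]  # pleasingly parallel over items

partitions: List[List[float]] = [[1.0, 2.0], [3.0]]  # items spread over two "nodes"
objective = global_chi2(partitions, theta=2.0)
```

In the local case each item can be processed anywhere with no communication; in the global case every evaluation of the objective involves all partitions.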
Convergence/Divergence Points for HPC-Cloud-Edge-BigData-Simulation
• Applications: Divide use cases into Data and Model and compare characteristics separately in these two components with 64 Convergence Diamonds (features).
• Identify importance of streaming data, pleasingly parallel, global/local machine-learning
• Software: Single model of High Performance Computing (HPC) Enhanced Big Data Stack HPC-ABDS. 21 Layers adding high performance runtime to Apache systems HPC-FaaS Programming Model
• Serverless Infrastructure as a Service IaaS
• Hardware system designed for functionality and performance of application type e.g. disks, interconnect, memory, CPU acceleration different for machine learning, pleasingly parallel, data management, streaming, simulations
• Use DevOps to automate deployment of event-driven software defined systems on hardware: HPCCloud 2.0
• Total System Solutions (wisdom) as a Service: HPCCloud 3.0
Application Structure
http://www.iterativemapreduce.org/
Structure of Applications
• Real-time (streaming) data is increasingly common in scientific and engineering research, and it is ubiquitous in commercial Big Data (e.g., social network analysis, recommender systems and consumer behavior classification)
• So far little use of commercial and Apache technology in analysis of scientific streaming data
• Pleasingly parallel applications important in science (long tail) and data communities
• Commercial-Science application differences: Search and recommender engines have different structure to deep learning, clustering, topic models, graph analyses such as subgraph mining
• Latter very sensitive to communication and can be hard to parallelize
• Search typically not as important in Science as in commercial use as search volume scales by number of users
• Should discuss data and model separately
• Term data often used rather sloppily and often refers to model
Distinctive Features of Applications
• Ratio of data to model sizes: vertical axis on next slide
• Importance of Synchronization (ratio of inter-node communication to node computing): horizontal axis on next slide
• Sparsity of Data or Model; impacts value of GPUs or vector computing
• Irregularity of Data or Model
• Geographic distribution of Data as in edge computing; use of streaming (dynamic data) versus batch paradigms
• Dynamic model structure as in some iterative algorithms
Big Data and Simulation Difficulty in Parallelism
[Figure: application classes arranged by size of synchronization constraints (loosely coupled to tightly coupled) and size of disk I/O. Labels: Pleasingly Parallel (often independent events); Parameter sweep simulations; MapReduce as in scalable databases; Current major Big Data category; Global Machine Learning e.g. parallel clustering; Deep Learning; LDA; Graph Analytics e.g. subgraph mining; Unstructured Adaptive Sparse; Structured Adaptive Sparse; Linear Algebra at core (often not sparse); Largest scale simulations; Commodity Clouds; HPC Clouds: Accelerators, High Performance Interconnect; HPC Clouds/Supercomputers (memory access also critical); Exascale Supercomputers.]
Just two problem characteristics. There is also data/compute distribution seen in grid/edge computing.
Application Nexus of HPC, Big Data, Simulation Convergence
Use-case Data and Model
NIST Collection
Big Data Ogres
Convergence Diamonds
https://bigdatawg.nist.gov/
Original Use Case Template
• 26 fields completed for 51 areas
• Government Operation: 4
• Commercial: 8
• Defense: 3
• Healthcare and Life Sciences: 10
• Deep Learning and Social Media: 6
• The Ecosystem for Research: 4
• Astronomy and Physics: 5
• Earth, Environmental and Polar Science: 10
• Energy: 1
• Security & Privacy Enhanced version 2
• BDEC HPC enhanced version
51 Detailed Use Cases: Contributed July-September 2013. Covers goals, data features such as 3V’s, software, hardware
26 Features for each use case. Biased to science
• Government Operation (4): National Archives and Records Administration, Census Bureau
• Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (Citations), Netflix, Web Search, Digital Materials, Cargo shipping (as in UPS)
• Defense (3): Sensors, Image surveillance, Situation Assessment
• Healthcare and Life Sciences (10): Medical records, Graph and Probabilistic analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity models, Biodiversity
• Deep Learning and Social Media (6): Driving Car, Geolocate images/cameras, Twitter, Crowd Sourcing, Network Science, NIST benchmark datasets
• The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light source experiments
• Astronomy and Physics (5): Sky Surveys including comparison to simulation, Large Hadron Collider at CERN, Belle Accelerator II in Japan
• Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice sheet Radar scattering, Earth radar mapping, Climate simulation datasets, Atmospheric turbulence identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET gas sensors
• Energy (1): Smart grid
• Published by NIST as version 2 https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-3r1.pdf with common set of 26 features recorded for each use-case
Part of Property Summary Table
51 Use Cases: What is Parallelism Over?
• People: either the users (but see below) or subjects of application and often both
• Decision makers like researchers or doctors (users of application)
• Items such as Images, EMR, Sequences below; observations or contents of online store
• Images or “Electronic Information nuggets”
• EMR: Electronic Medical Records (often similar to people parallelism)
• Protein or Gene Sequences;
• Material properties, Manufactured Object specifications, etc., in custom dataset
• Modelled entities like vehicles and people
• Sensors: Internet of Things
• Events such as detected anomalies in telescope or credit card data or atmosphere
• (Complex) Nodes in RDF Graph
• Simple nodes as in a learning network
• Tweets, Blogs, Documents, Web Pages, etc.
• And characters/words in them
• Files or data to be backed up, moved or assigned metadata
• Particles/cells/mesh points as in parallel simulations
Sample Features of 51 Use Cases I
• PP (26) “All” Pleasingly Parallel or Map Only
• MR (18) Classic MapReduce MR (add MRStat below for full count)
• MRStat (7) Simple version of MR where key computations are simple reduction as found in statistical averages such as histograms and averages
• MRIter (23) Iterative MapReduce or MPI (Flink, Spark, Twister)
• Graph (9) Complex graph data structure needed in analysis
• Fusion (11) Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
• Streaming (41) Some data comes in incrementally and is processed this way
• Classify (30) Classification: divide data into categories
• S/Q (12) Index, Search and Query
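The MRStat category, where the key computation is a simple reduction such as a histogram, can be sketched as follows. This is an illustration in plain Python rather than any of the frameworks named above; `map_histogram`, `reduce_histograms` and the sample partitions are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_histogram(partition: Iterable[float], bin_width: float) -> Counter:
    """Map step: each partition builds its own partial histogram."""
    return Counter(int(x // bin_width) for x in partition)


def reduce_histograms(a: Counter, b: Counter) -> Counter:
    """Reduce step: partial histograms combine by adding bin counts."""
    return a + b


partitions: List[List[float]] = [[0.1, 0.4, 1.2], [1.7, 2.5], [0.3]]
partials = [map_histogram(p, 1.0) for p in partitions]   # independent per partition
histogram = reduce(reduce_histograms, partials, Counter())
```

Dropping the reduce step leaves the PP (map only) pattern, and repeating map and reduce until a model converges gives the MRIter pattern.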
Sample Features of 51 Use Cases II
• CF (4) Collaborative Filtering for recommender engines
• LML (36) Local Machine Learning (Independent for each parallel entity); application could have GML as well
• GML (23) Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS,
• Large Scale Optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call EGO or Exascale Global Optimization with scalable parallel algorithm
• Workflow (51) Universal
• GIS (16) Geotagged data and often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer etc.
• HPC (5) Classic large-scale simulation of cosmos, materials, etc. generating (visualization) data
• Agent (2) Simulations of models of data-defined macroscopic entities represented as agents
BDEC2 and NIST Use Cases
53 NIST Use Cases for Research Space I
• GOVERNMENT OPERATION
• 1: Census 2010 and 2000 - Title 13 Big Data
• 2: NARA Accession, Search, Retrieve, Preservation
• 3: Statistical Survey Response Improvement
• 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)
1-4 are related to social science survey problems and are “Classic Data+ML” with interesting algorithms (recommender engines) plus databases and important privacy issues which are present in research cases
• COMMERCIAL
• 5: Cloud Eco-System for Financial Industries NO
• 6: Mendeley - An International Network of Research
• 7: Netflix Movie Service NO
• 8: Web Search NO
• 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System NO
• 10: Cargo Shipping Edge Computing NO
• 11: Materials Data for Manufacturing
• 12: Simulation-Driven Materials Genomics
6 is “Classic Data+ML” with Text Analysis (citation identification, topic models etc.)
10 is DHL/Fedex/UPS and has no direct scientific analog. However, it is a good example of Edge computing system of a similar nature to the scientific research case.
11 and 12 are material science covered in BDEC2 meeting
53 NIST Use Cases for Research Space II
• DEFENSE
• 13: Cloud Large-Scale Geospatial Analysis and Visualization
• 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video - Persistent Surveillance
• 15: Intelligence Data Processing and Analysis
13-15 are very similar to disaster response problems. They involve extensive “Classic Data+ML” for sensor collections and/or image processing. GIS and spatial analysis are important as in BDEC2 Pathology and Spatial Imagery talk. The geospatial aspect of applications means they are similar to earth science examples.
• Color coding of use cases
• NO means not similar to research application
• Red means not relevant to BDEC2
• Orange means related to BDEC2 Bloomington presentations
• Black are unique use cases of relevance to BDEC2 but not presented at Bloomington
• Purple are comments
Use Cases III: HEALTHCARE AND LIFE SCIENCES
• 16: Electronic Medical Record
• 17: Pathology Imaging/Digital Pathology
• 18: Computational Bioimaging
• 19: Genomic Measurements
• 20: Comparative Analysis for Metagenomes and Genome
• 21: Individualized Diabetes Management
• 22: Statistical Relational Artificial Intelligence for Health Care
• 23: World Population-Scale Epidemiological Study
• 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management
• 25: Biodiversity and LifeWatch
CommentsonUseCasesIII• 16 and 22 are “classic data + ML + database” based use cases using an
important technique, understood by the community but not presented at BDEC2
• 17 came originally from Saltz’s group and was updated in his BDEC2 talk• 18 describes biology image processing from many instruments microscopes,
MRI and light sources. The latter was directly discussed at BDEC2 and the other instruments were implicit.
• 19 and 20 are well recognized as a distributed Big Data problems with significant computing. They were represented by Chandrasekaran’s presentation at BDEC2 which inevitably only covered part (gene assembly) of problem.
• 21 relies on “classic data + graph analytics” which was not discussed in BDEC2 meeting but is certainly actively pursued.
• 23 and 24 originally came from Marathe and were updated in his BDEC2 presentation on massive bio-social systems
• 25 generalizes BDEC2 talks by Taufer and Rahnemoonfar on ocean and land monitoring and sensor array analysis.
Use Cases IV: DEEP LEARNING AND SOCIAL MEDIA
• 26: Large-Scale Deep Learning
• 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos NO
• 28: Truthy - Information Diffusion Research from Twitter Data
• 29: Crowd Sourcing in the Humanities as Source for Big and Dynamic Data
• 30: CINET - Cyberinfrastructure for Network (Graph) Science and Analytics
• 31: NIST Information Access Division - Analytic Technology Performance Measurements, Evaluations, and Standards
• 26 on deep learning was covered in great depth at the BDEC2 meeting
• 27 describes an interesting image processing challenge of geolocating multiple photographs which is not so far directly related to scientific data analysis although related image processing algorithms are certainly important
• 28-30 are “classic data + ML” use cases with a focus on graph and text mining algorithms not covered in BDEC2 but certainly relevant to the process
• 31 on benchmarking and standard datasets is related to BigDataBench talk at end of BDEC2 meeting and Foster’s talk on a model database
53 NIST Use Cases for Research Space VI
• THE ECOSYSTEM FOR RESEARCH
• 32: DataNet Federation Consortium
• 33: The Discinnet Process NO
• 34: Semantic Graph Search on Scientific Chemical and Text-Based Data
• 35: Light Source Beamlines
32 covers data management with iRODS which is well regarded by the community but not discussed in BDEC2.
33 is a Teamwork approach that doesn’t seem relevant to BDEC2
34 is a “classic data+ML” use case with similar comments to 28-30
35 was covered with more advanced deep learning algorithms in Yager and Foster’s BDEC2 talks
• ASTRONOMY AND PHYSICS
• 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey
• 37: DOE Extreme Data from Cosmological Sky Survey and Simulations
• 38: Large Survey Data for Cosmology
• 39: Particle Physics - Analysis of Large Hadron Collider Data: Discovery of Higgs Particle
• 40: Belle II High Energy Physics Experiment
36 to 38 are “Classic Data+ML” astronomy use cases related to BDEC2 SKA presentation and covering both archival and event detection cases. Use case 37 covers the integration of simulation data and observational data for astronomy, a topic covered in other cases at BDEC2.
39 and 40 are “Classic Data+ML” use cases for accelerator data analysis. This was not covered in BDEC2 but is currently the largest volume scientific data analysis problem whose importance and relevance is well understood.
Use Cases VII: EARTH, ENVIRONMENTAL, AND POLAR SCIENCE
• 41: European Incoherent Scatter Scientific Association 3D Incoherent Scatter Radar System Big Radar instrument monitoring atmosphere.
• 42: Common Operations of Environmental Research Infrastructure
• 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets
• 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services
• 45: NASA Langley Research Center/Goddard Space Flight Center iRODS Federation Test Bed
• 46: MERRA Analytic Services (MERRA/AS) Instrument
• 47: Atmospheric Turbulence - Event Discovery and Predictive Analytics Imaging
• 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center
• 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area
• 50: DOE BER AmeriFlux and FLUXNET Networks Sensor Networks
• 2-1: NASA Earth Observing System Data and Information System (EOSDIS) Instrument
• 2-2: Web-Enabled Landsat Data (WELD) Processing Instrument
12/29/18 34
![Page 35: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/35.jpg)
Comments on Use Cases VII
• 41, 43, and 44 are “Classic Data+ML” use cases involving radar data from different instruments (specialized ground, vehicle/plane, satellite); they were not directly covered in BDEC2.
• 2-1 and 2-2 are use cases similar to 41, 43, and 44 but applied to EOSDIS and Landsat earth observations from satellites in multiple modalities.
• 42, 49, and 50 are “Classic Data+ML” environmental sensor arrays that extend the scope of the talks of Taufer and Rahnemoonfar at BDEC2. See also use case 25 above.
• 45 to 47 describe datasets from instruments and computations relevant to climate and weather. They relate to the BDEC2 talks by Denvil and Miyoshi. 47 discusses the correlation of aircraft turbulence reports with simulation datasets.
• 48 is data analytics and management associated with climate studies, as covered in the BDEC2 talk by Denvil.
![Page 36: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/36.jpg)
Use Cases VIII: ENERGY
• 51: Consumption Forecasting in Smart Grids
51 is a different subproblem but in the same area as Pothen and Azad’s talk on the electric power grid at BDEC2. This is a challenging edge computing problem, as it involves a large number of distributed but correlated sensors.
• SC-18 BOF Application/Industry Perspective by David Keyes, King Abdullah University of Science and Technology (KAUST)
• https://www.exascale.org/bdec/sites/www.exascale.org.bdec/files/SC18_BDEC2_BoF-Keyes.pdf
This is a presentation by David Keyes on seismic imaging for oil discovery and exploitation. It is “Classic Data+ML” for an array of sonic sensors.
![Page 37: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/37.jpg)
BDEC2 Use Cases I: Classic Observational Data plus ML
• BDEC2-1: M. Deegan, Big Data and Extreme Scale Computing, 2nd Series (BDEC2) - Statement of Interest from the Square Kilometre Array Organisation (SKAO)
• Environmental Science
• BDEC2-2: M. Rahnemoonfar, Semantic Segmentation of Underwater Sonar Imagery based on Deep Learning
• BDEC2-3: M. Taufer, Cyberinfrastructure Tools for Precision Agriculture in the 21st Century
• Healthcare and Life sciences
• BDEC2-4: J. Saltz, Multiscale Spatial Data and Deep Learning
• BDEC2-5: R. Stevens, Exascale Deep Learning for Cancer
• BDEC2-6: S. Chandrasekaran, Development of a parallel algorithm for whole genome alignment for rapid delivery of personalized genomics
• BDEC2-7: M. Marathe, Pervasive, Personalized and Precision (P3) analytics for massive bio-social systems
Instruments include Satellites, UAVs, Sensors (see edge examples), Light sources (X-ray, MRI, Microscope, etc.), Telescopes, Accelerators, Tokamaks (Fusion), and Computers (as in Control, Simulation, Data, ML Integration).
![Page 38: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/38.jpg)
BDEC2 Use Cases II: Control, Simulation, Data, ML Integration
• BDEC2-8: W. Tang, New Models for Integrated Inquiry: Fusion Energy Exemplar
• BDEC2-9: O. Beckstein, Convergence of data generation and analysis in the biomolecular simulation community
• BDEC2-10: S. Denvil, From the production to the analysis phase: new approaches needed in climate modeling
• BDEC2-11: T. Miyoshi, Prediction Science: The 5th Paradigm Fusing the Computational Science and Data Science (weather forecasting)
See also the Marathe and Stevens talks.
See also instruments under Classic Observational Data plus ML.
• Material Science
• BDEC2-12: K. Yager, Autonomous Experimentation as a Paradigm for Materials Discovery
• BDEC2-13: L. Ward, Deep Learning, HPC, and Data for Materials Design
• BDEC2-14: J. Ahrens, A vision for a validated distributed knowledge base of material behavior at extreme conditions using the Advanced Cyberinfrastructure Platform
• BDEC2-15: T. Deutsch, Digital transition of Material Nano-Characterization.
![Page 39: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/39.jpg)
Comments on Control, Simulation, Data, ML Integration
• Simulations often involve outside Data but always involve inside Data (from the simulation itself). Fields covered include Materials (nano), Climate, Weather, Biomolecular, and Virtual tissues (no use case written up).
• We can see ML wrapping simulations to achieve many goals. ML replaces functions and/or ML guides functions:
• Initial Conditions
• Boundary Conditions
• Data assimilation
• Configuration: blocking, use of cache, etc.
• Steering and Control
• Support multi-scale
• ML learns from previous simulations and so can predict function calls
• Digital Twins are a commercial link between simulation and systems.
• There are fundamental simulations governed by the laws of physics and, increasingly, Complex System simulations with Bio (tissue) or social entities.
![Page 40: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/40.jpg)
BDEC2 Use Cases III: Edge Computing
• Smart City and Related Edge Applications
• BDEC2-16: P. Beckman, Edge to HPC Cloud
• BDEC2-17: G. Ricart, Smart Community CyberInfrastructure at the Speed of Life
• BDEC2-18: T. El-Ghazawi, Convergence of AI, Big Data, Computing and IOT (ABCI) - Smart City as an Application Driver and Virtual Intelligence Management (VIM)
• BDEC2-19: M. Kondo, The Challenges and opportunities of BDEC systems for Smart Cities
• Other Edge Applications
• BDEC2-20: A. Pothen, High-End Data Science and HPC for the Electrical Power Grid
• BDEC2-21: J. Qiu, Real-Time Anomaly Detection from Edge to HPC-Cloud
There are correlated edge devices, such as those in the power grid and nearby vehicles (racing, road). There are also largely independent edge devices that interact via databases, such as surveillance cameras.
![Page 41: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/41.jpg)
BDEC Use Cases IV
• BDEC Ecosystem
• BDEC2-22: I. Foster, Learning Systems for Deep Science
• BDEC2-23: W. Gao, BigDataBench: A Scalable and Unified Big Data and AI Benchmark Suite
• Image-based Applications
• One cross-cutting theme is understanding Generalized Images (light, sound, and other sensors such as temperature, chemistry, moisture) with 2D or 3D spatial and time dependence.
• Modalities include Radar, MRI, Microscopes, Surveillance and other cameras, X-ray scattering, UAV-hosted sensors, and related non-optical sensor networks as in agriculture, wildfires, disaster monitoring, and oil exploration. GIS and geospatial properties are often relevant.
![Page 42: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/42.jpg)
NIST Generic Data Processing Use Cases
![Page 43: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/43.jpg)
10 Generic Data Processing Use Cases
1) Multiple users performing interactive queries and updates on a database with basic availability and eventual consistency (BASE = Basically Available, Soft state, Eventual consistency, as opposed to ACID = Atomicity, Consistency, Isolation, Durability)
2) Perform real-time analytics on data source streams and notify users when specified events occur
3) Move data from external data sources into a highly horizontally scalable data store, transform it using highly horizontally scalable processing (e.g. MapReduce), and return it to the horizontally scalable data store (ELT: Extract, Load, Transform)
4) Perform batch analytics on the data in a highly horizontally scalable data store using highly horizontally scalable processing (e.g. MapReduce) with a user-friendly interface (e.g. SQL-like)
5) Perform interactive analytics on data in an analytics-optimized database, with 5A) Science
6) Visualize data extracted from a horizontally scalable Big Data store
7) Move data from a highly horizontally scalable data store into a traditional Enterprise Data Warehouse (EDW)
8) Extract, process, and move data from data stores to archives
9) Combine data from Cloud databases and on-premise data stores for analytics, data mining, and/or machine learning
10) Orchestrate multiple sequential and parallel data transformations and/or analytic processing using a workflow manager
![Page 44: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/44.jpg)
Access Pattern 5A. Perform interactive analytics on observational scientific data
[Figure labels: Record Scientific Data in “field”; Local Accumulate and initial computing; Direct Transfer; Transport batch of data to primary analysis data system; Streaming Twitter data for Social Networking; Data Storage: HDFS, HBase, File Collection; Grid or Many Task Software, Hadoop, Spark, Giraph, Pig…; Science Analysis Code, Mahout, R]
Examples include LHC, Remote Sensing, Astronomy and Bioinformatics
![Page 45: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/45.jpg)
PolarGrid
Lightweight Cyberinfrastructure to support mobile Data gathering expeditions plus classic central resources (as a cloud)
BATCH MODE
![Page 46: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/46.jpg)
“The Universe is now being explored systematically, in a panchromatic way, over a range of spatial and temporal scales that lead to a more complete, and less biased understanding of its constituents, their evolution, their origins, and the physical processes governing them.”
Towards a National Virtual Observatory
Hubble Telescope, Palomar Telescope, Sloan Telescope
Tracking the Heavens
DISTRIBUTED LARGE INSTRUMENT MODE
![Page 47: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/47.jpg)
Other Use-case Collections
![Page 48: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/48.jpg)
7 Computational Giants of NRC Massive Data Analysis Report
1) G1: Basic Statistics e.g. MRStat
2) G2: Generalized N-Body Problems
3) G3: Graph-Theoretic Computations
4) G4: Linear Algebraic Computations
5) G5: Optimizations e.g. Linear Programming
6) G6: Integration e.g. LDA and other GML
7) G7: Alignment Problems e.g. BLAST
http://www.nap.edu/catalog.php?record_id=18374
Big Data Models?
![Page 49: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/49.jpg)
HPC (Simulation) Benchmark Classics
• Linpack or HPL: Parallel LU factorization for solution of linear equations; HPCG
• NPB version 1: Mainly classic HPC solver kernels
• MG: Multigrid
• CG: Conjugate Gradient
• FT: Fast Fourier Transform
• IS: Integer sort
• EP: Embarrassingly Parallel
• BT: Block Tridiagonal
• SP: Scalar Pentadiagonal
• LU: Lower-Upper symmetric Gauss Seidel
Simulation Models
![Page 50: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/50.jpg)
13 Berkeley Dwarfs
1) Dense Linear Algebra
2) Sparse Linear Algebra
3) Spectral Methods
4) N-Body Methods
5) Structured Grids
6) Unstructured Grids
7) MapReduce
8) Combinational Logic
9) Graph Traversal
10) Dynamic Programming
11) Backtrack and Branch-and-Bound
12) Graphical Models
13) Finite State Machines
The first 6 of these correspond to Colella’s original (classic simulations). Monte Carlo was dropped. N-body methods are a subset of Particle in Colella.
Note that this list is a little inconsistent, in that MapReduce is a programming model and spectral method is a numerical method. Multiple facets are needed to classify use cases!
Largely Models for Data or Simulation
![Page 51: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/51.jpg)
Classifying Use Cases
![Page 52: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/52.jpg)
Classifying Use Cases
• The Big Data Ogres were built on a collection of 51 big data use cases gathered by the NIST Public Working Group, where 26 properties were gathered for each application.
• This information was combined with other studies including the Berkeley dwarfs, the NAS parallel benchmarks, and the Computational Giants of the NRC Massive Data Analysis Report.
• The Ogre analysis led to a set of 50 features divided into four views that could be used to categorize and distinguish between applications.
• The four views are Problem Architecture (Macro pattern); Execution Features (Micro patterns); Data Source and Style; and finally the Processing View or runtime features.
• We generalized this approach to integrate Big Data and Simulation applications into a single classification looking separately at Data and Model, with the total facets growing to 64 in number, called convergence diamonds, and split between the same 4 views.
• A mapping of facets into the work of the SPIDAL project has been given.
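One way to make the facet idea concrete is to record, for each application, which facets it exhibits in each of the four views, and then compare applications or measure how much of the facet space a benchmark set covers. The following is a minimal sketch under stated assumptions: the `FacetProfile` class, the `coverage` and `similarity` helpers, and the facet assignments for the two example applications are all hypothetical illustrations, not the classification published for the Ogres or convergence diamonds.

```python
from dataclasses import dataclass, field

# View names follow the slide; everything else here is illustrative.
VIEWS = ("problem_architecture", "execution", "data_source_and_style", "processing")


@dataclass
class FacetProfile:
    """Facets one application exhibits, grouped by view (hypothetical structure)."""
    name: str
    facets: dict = field(default_factory=dict)  # view name -> set of facet labels

    def add(self, view, *labels):
        if view not in VIEWS:
            raise ValueError(f"unknown view: {view}")
        self.facets.setdefault(view, set()).update(labels)

    def all_facets(self):
        return {(view, label) for view, labels in self.facets.items() for label in labels}


def coverage(profiles):
    """Union of (view, facet) pairs exercised by a set of applications."""
    covered = set()
    for profile in profiles:
        covered |= profile.all_facets()
    return covered


def similarity(a, b):
    """Jaccard similarity of two applications' facet sets."""
    fa, fb = a.all_facets(), b.all_facets()
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


# Assumed facet assignments, chosen only to exercise the helpers.
kmeans = FacetProfile("k-means clustering")
kmeans.add("problem_architecture", "Map-Collective")
kmeans.add("execution", "Iterative", "Metric")
kmeans.add("processing", "Global Analytics", "Learning")

pde = FacetProfile("iterative PDE solve")
pde.add("problem_architecture", "Map Point-to-Point")
pde.add("execution", "Iterative", "Regular")
pde.add("processing", "Linear Algebra Kernels")

suite_coverage = coverage([kmeans, pde])
```

A benchmark suite could then be judged by how many of the 64 facets its members jointly cover, which is the sense in which a set of instances "covers enough of the facets".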
![Page 53: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/53.jpg)
The original 50 Ogres in 4 views
[Figure: 4 Ogre Views and 50 Facets]
• Problem Architecture View (meta patterns): Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
• Execution View (micro patterns): Performance Metrics; Flops per Byte, Memory I/O; Execution Environment, Core libraries; Volume; Velocity; Variety; Veracity; Communication Structure; Dynamic = D / Static = S; Regular = R / Irregular = I; Iterative / Simple; Data Abstraction; Metric = M / Non-Metric = N; O(N^2) = NN / O(N) = N
• Data Source and Style View: Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared / Dedicated / Transient / Permanent; Archived/Batched/Streaming; HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL
• Processing View: Micro-benchmarks; Local Analytics; Global Analytics; Base Statistics; Recommendations; Search / Query / Index; Classification; Learning; Optimization Methodology; Streaming; Alignment; Linear Algebra Kernels; Graph Algorithms; Visualization
![Page 54: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/54.jpg)
64 Features in 4 views for Unified Classification of Big Data and Simulation Applications
[Figure: Convergence Diamonds Views and Facets. Facet codes ending in D are Data facets and codes ending in M are Model facets. Legend labels: Simulations; Analytics (Model for Big Data); Both; (All Model); (Nearly all Data+Model); (Nearly all Data); (Mix of Data and Model)]
• Problem Architecture View: Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow
• Execution View: Performance Metrics (1); Flops per Byte / Memory IO / Flops per watt (2); Execution Environment; Core libraries (3); Data Volume (D4); Model Size (M4); Data Velocity (D5); Data Variety (D6); Model Variety (M6); Veracity (7); Communication Structure (M8); Dynamic = D / Static = S (D9, M9); Regular = R / Irregular = I Data (D10); Regular = R / Irregular = I Model (M10); Iterative / Simple (M11); Data Abstraction (D12); Model Abstraction (M12); Data Metric = M / Non-Metric = N (D13, M13); O(N^2) = NN / O(N) = N (M14)
• Data Source and Style View (facets 1D to 10D): Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared / Dedicated / Transient / Permanent; Archived/Batched/Streaming - S1, S2, S3, S4, S5; HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL
• Processing View (facets 1M to 22M), grouped in the figure under Big Data Processing Diamonds and Simulation (Exascale) Processing Diamonds: Micro-benchmarks; Local (Analytics/Informatics/Simulations); Global (Analytics/Informatics/Simulations); Base Data Statistics; Recommender Engine; Data Search/Query/Index; Data Classification; Learning; Optimization Methodology; Streaming Data Algorithms; Data Alignment; Linear Algebra Kernels / Many subclasses; Graph Algorithms; Visualization; Core Libraries; Iterative PDE Solvers; Multiscale Method; Spectral Methods; N-body Methods; Particles and Fields; Evolution of Discrete Systems; Nature of mesh if used
![Page 55: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/55.jpg)
Convergence Diamonds and their 4 Views I
• One view is the overall problem architecture or macropatterns, which is naturally related to the machine architecture needed to support the application.
• Unchanged from Ogres; it describes properties of the problem such as “Pleasingly Parallel” or “Uses Collective Communication”
• The execution (computational) features or micropatterns view describes issues such as I/O versus compute rates, iterative nature and regularity of computation, and the classic V’s of Big Data: defining problem size, rate of change, etc.
• Significant changes from Ogres to separate Data and Model and add characteristics of Simulation models, e.g. both model and data have “V’s”: Data Volume, Model Size
• e.g. O(N^2) Algorithm relevant to big data or big simulation model
![Page 56: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/56.jpg)
Convergence Diamonds and their 4 Views II
• The data source & style view includes facets specifying how the data is collected, stored, and accessed. It has classic database characteristics.
• Simulations can have facets here to describe input or output data
• Examples: Streaming, files versus objects, HDFS v. Lustre
• The Processing view has model (not data) facets which describe types of processing steps, including the nature of algorithms and kernels by model, e.g. Linear Programming, Learning, Maximum Likelihood, Spectral methods, Mesh type
• It is a mix of the Big Data Processing View and the Big Simulation Processing View and includes some facets like “uses linear algebra” needed in both; it has specifics of key simulation kernels and in particular includes facets seen in the NAS Parallel Benchmarks and Berkeley Dwarfs
• Instances of Diamonds are particular problems, and a set of Diamond instances that cover enough of the facets could form a comprehensive benchmark/mini-app set
• Diamonds and their instances can be atomic or composite
![Page 57: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/57.jpg)
Programming Environment for Global AI and Modeling Supercomputer GAIMSC
http://www.iterativemapreduce.org/
![Page 58: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/58.jpg)
Ways of adding High Performance to Global AI (and Modeling) Supercomputer
• Fix performance issues in Spark, Heron, Hadoop, Flink, etc.
• Messy, as some features of these big data systems are intrinsically slow in some (not all) cases
• All these systems are “monolithic”, and it is difficult to deal with individual components
• Execute HPBDC from a classic big data system with a custom communication environment: the approach of Harp for the relatively simple Hadoop environment
• Provide a native Mesos/Yarn/Kubernetes/HDFS high performance execution environment with all capabilities of Spark, Hadoop and Heron: the goal of Twister2
• Execute with MPI in classic (Slurm, Lustre) HPC environment
• Add modules to existing frameworks like Scikit-Learn or Tensorflow, either as a new capability or as a higher performance version of an existing module.
![Page 59: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/59.jpg)
GAIMSC Programming Environment Components I

| Area | Component | Implementation | Comments: User API |
| --- | --- | --- | --- |
| Architecture Specification | Coordination Points | State and Configuration Management; Program, Data and Message Level | Change execution mode; save and reset state |
| | Execution Semantics | Mapping of Resources to Bolts/Maps in Containers, Processes, Threads | Different systems make different choices - why? |
| | Parallel Computing | Spark, Flink, Hadoop, Pregel, MPI modes | Owner Computes Rule |
| Job Submission | (Dynamic/Static) Resource Allocation | Plugins for Slurm, Yarn, Mesos, Marathon, Aurora | Client API (e.g. Python) for Job Management |
| Task System | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | Task-based programming with Dynamic or Static Graph API; FaaS API; Support accelerators (CUDA, FPGA, KNL) |
| | Elasticity | OpenWhisk | |
| | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ | |
| | Task Execution | Process, Threads, Queues | |
| | Task Scheduling | Dynamic Scheduling, Static Scheduling, Pluggable Scheduling Algorithms | |
| | Task Graph | Static Graph, Dynamic Graph Generation | |
![Page 60: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/60.jpg)
GAIMSC Programming Environment Components II

| Area | Component | Implementation | Comments |
| --- | --- | --- | --- |
| Communication API | Messages | Heron | This is user level and could map to multiple communication systems |
| | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming, ETL data pipelines; Define new Dataflow communication API and library |
| | BSP Communication, Map-Collective | Conventional MPI, Harp | MPI Point to Point and Collective API |
| Data Access | Static (Batch) Data | File Systems, NoSQL, SQL | Data API |
| | Streaming Data | Message Brokers, Spouts | |
| Data Management | Distributed Data Set | Relaxed Distributed Shared Memory (immutable data), Mutable Distributed Data | Data Transformation API; Spark RDD, Heron Streamlet |
| Fault Tolerance | Check Pointing | Upstream (streaming) backup; Lightweight; Coordination Points; Spark/Flink, MPI and Heron models | Streaming and batch cases distinct; Crosses all components |
| Security | Storage, Messaging, execution | Research needed | Crosses all Components |
![Page 61: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/61.jpg)
Integrating HPC and Apache Programming Environments
• Harp-DAAL with a kernel Machine Learning library exploiting the Intel node library DAAL and HPC communication collectives within the Hadoop ecosystem.
• Harp-DAAL supports all 5 classes of data-intensive AI first computation, from pleasingly parallel to machine learning and simulations.
• Twister2 is a toolkit of components that can be packaged in different ways
• Integrated batch or streaming data capabilities familiar from Apache Hadoop, Spark, Heron and Flink but with high performance
• Separate bulk synchronous and dataflow communication
• Task management as in Mesos, Yarn and Kubernetes
• Dataflow graph execution models
• Launching of the Harp-DAAL library with native Mesos/Kubernetes/HDFS environment
• Streaming and repository data access interfaces
• In-memory databases and fault tolerance at dataflow nodes (use RDD to do classic checkpoint-restart)
![Page 62: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/62.jpg)
Runtime software for Harp
Map Collective Run time merges MapReduce and HPC
[Figure: collective operations shown: allreduce, reduce, rotate, push & pull, allgather, regroup, broadcast]
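To illustrate what several of the collectives named on this slide mean, the sketch below simulates a handful of workers inside one Python process, each holding a small local vector. It only shows the data movement semantics generically associated with these operation names; it is not the Harp API, the function names and signatures are assumptions made for illustration, and no real communication takes place.

```python
from functools import reduce as fold
from operator import add


def broadcast(partitions, root=0):
    """Every worker receives a copy of the root worker's data."""
    return [list(partitions[root]) for _ in partitions]


def reduce_to_root(partitions, op=add, root=0):
    """Combine all workers' vectors element-wise; only the root holds the result."""
    combined = [fold(op, column) for column in zip(*partitions)]
    return [combined if rank == root else None for rank in range(len(partitions))]


def allreduce(partitions, op=add):
    """Combine element-wise and give every worker the combined vector."""
    combined = [fold(op, column) for column in zip(*partitions)]
    return [list(combined) for _ in partitions]


def allgather(partitions):
    """Every worker receives the concatenation of all workers' data."""
    gathered = [x for part in partitions for x in part]
    return [list(gathered) for _ in partitions]


def rotate(partitions):
    """Each worker passes its data to the next worker in a ring."""
    n = len(partitions)
    return [partitions[(rank - 1) % n] for rank in range(n)]


# Three simulated workers, each holding a local partial-sum vector.
workers = [[1, 2], [10, 20], [100, 200]]
summed_everywhere = allreduce(workers)
```

In a Map-Collective program, the map tasks would compute the local vectors and a collective such as allreduce would replace the shuffle-and-reduce stage of classic MapReduce.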
![Page 63: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/63.jpg)
Harp v. Spark
• Datasets: 5 million points, 10 thousand centroids, 10 feature dimensions
• 10 to 20 nodes of Intel KNL 7250 processors
• Harp-DAAL has 15x speedups over Spark MLlib
Harp v. Torch
• Datasets: 500K or 1 million data points of feature dimension 300
• Running on single KNL 7250 (Harp-DAAL) vs. single K80 GPU (PyTorch)
• Harp-DAAL achieves 3x to 6x speedups
Harp v. MPI
• Datasets: Twitter with 44 million vertices, 2 billion edges, subgraph templates of 10 to 12 vertices
• 25 nodes of Intel Xeon E5 2670
• Harp-DAAL has 2x to 5x speedups over state-of-the-art MPI-Fascia solution
![Page 64: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/64.jpg)
Twister2 Dataflow Communications
• Twister:Net offers two communication models
• BSP (Bulk Synchronous Processing) message-level communication using TCP or MPI, separated from its task management, plus extra Harp collectives
• DFW, a new Dataflow library built using MPI software but at data movement, not message, level
• Non-blocking
• Dynamic data sizes
• Streaming model
• Batch case is modeled as a finite stream
• The communications are between a set of tasks in an arbitrary task graph
• Key based communications
• Data-level Communications spilling to disks
• Target tasks can be different from source tasks
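Two of the ideas above, key-based communication and treating a batch as a finite stream, can be sketched in a few lines. The example below is a single-process illustration under stated assumptions: `target_for`, `keyed_partition`, `source_task`, and `merged` are hypothetical helper names, CRC32 is an arbitrary choice of deterministic hash, and nothing here reflects the actual Twister:Net API.

```python
import zlib
from collections import defaultdict


def target_for(key, num_targets):
    """Deterministic key -> target task mapping (CRC32 is an arbitrary choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_targets


def keyed_partition(stream, num_targets):
    """Route each (key, value) record to the inbox of the target task for its key.

    `stream` can be any iterable; a batch is simply a stream that ends.
    """
    inboxes = defaultdict(list)
    for key, value in stream:
        inboxes[target_for(key, num_targets)].append((key, value))
    return dict(inboxes)


def source_task(task_id, n):
    """A finite stream of (key, value) records produced by one source task."""
    for i in range(n):
        yield (f"k{i % 5}", task_id * 100 + i)


def merged(*streams):
    """Concatenate several source streams into one."""
    for stream in streams:
        yield from stream


# Two source tasks feed three target tasks, so sources and targets differ.
inboxes = keyed_partition(merged(source_task(0, 10), source_task(1, 10)), num_targets=3)
```

All records sharing a key arrive at the same target task, which is the property a keyed reduce or join relies on; a production library would also handle buffering and spilling to disk, which this sketch omits.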
![Page 65: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/65.jpg)
Latency of Apache Heron and Twister:Net DFW (Dataflow) for Reduce, Broadcast and Partition operations in 16 nodes with 256-way parallelism
Twister:Net and Apache Heron and Spark
Left: K-means job execution time on 16 nodes with varying centers, 2 million points with 320-way parallelism. Right: K-means with 4, 8 and 16 nodes where each node has 20 tasks; 2 million points with 16000 centers used.
![Page 66: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/66.jpg)
Intelligent Dataflow Graph
• The dataflow graph specifies the distribution and interconnection of job components
• Hierarchical and Iterative
• Allow ML wrapping of component at each dataflow node
• Checkpoint after each node of the dataflow graph
• Natural synchronization point
• Let the user choose when to checkpoint (not every stage)
• Save state as user specifies; Spark just saves Model state, which is insufficient for complex algorithms
• Intelligent nodes support customization of checkpointing, ML, communication
• Nodes can be coarse (large jobs) or fine grain, requiring different actions
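The checkpointing bullets can be illustrated with a linear pipeline in which the user flags the nodes after which state should be saved, and a later run resumes just after a saved node. This is a minimal sketch, assuming a simple in-memory `store` dictionary and hypothetical `DataflowNode` and `run_pipeline` names; it is not Twister2 code, and a real system would persist state durably and handle general graphs.

```python
import copy


class DataflowNode:
    """One node of a linear dataflow pipeline (hypothetical structure)."""

    def __init__(self, name, fn, checkpoint=False):
        self.name = name
        self.fn = fn
        self.checkpoint = checkpoint  # the user decides per node


def run_pipeline(nodes, state, store, resume_from=None):
    """Run nodes in order, saving a deep copy of the state after flagged nodes.

    `store` maps node name -> saved state. If `resume_from` names a
    checkpointed node, execution restarts just after that node.
    """
    start = 0
    if resume_from is not None:
        names = [node.name for node in nodes]
        start = names.index(resume_from) + 1
        state = copy.deepcopy(store[resume_from])
    executed = []
    for node in nodes[start:]:
        state = node.fn(state)
        executed.append(node.name)
        if node.checkpoint:
            store[node.name] = copy.deepcopy(state)
    return state, executed


pipeline = [
    DataflowNode("prepare", lambda s: {**s, "data": [x * 2 for x in s["raw"]]}),
    DataflowNode("cluster", lambda s: {**s, "total": sum(s["data"])}, checkpoint=True),
    DataflowNode("report", lambda s: {**s, "summary": f"total={s['total']}"}),
]

store = {}
final, ran = run_pipeline(pipeline, {"raw": [1, 2, 3]}, store)
resumed, ran_again = run_pipeline(pipeline, None, store, resume_from="cluster")
```

Because the whole state dictionary is saved rather than only a model, the resumed run has everything the remaining nodes need, which is the point of letting the user specify what state to save.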
![Page 67: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/67.jpg)
DigitalScienceCenter
Dataflow at Different Grain sizes
[Figure: Internal Execution Dataflow Nodes (Maps, Reduce, Iterate) with HPC Communication]
Coarse-grain dataflow links jobs in a pipeline such as: Data preparation → Clustering → Dimension Reduction → Visualization
But internally to each job you can also elegantly express algorithm as dataflow but with more stringent performance constraints
• P = loadPoints()
• C = loadInitCenters()
• for (int i = 0; i < 10; i++) {
• T = P.map().withBroadcast(C)
• C = T.reduce() }
[Figure: Iterate loop, corresponding to classic Spark K-means Dataflow]
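The K-means loop above can be made concrete with a small, single-process Python sketch. It is not Spark or Twister2 code: partitions are plain lists, "broadcast" is simply passing the current centers to every map call, and the helper names `map_partition` and `reduce_partials` are assumptions introduced for illustration.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def map_partition(points, centers):
    """Map step: per-center partial sums and counts for one partition,
    computed against the broadcast copy of the centers."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for pt in points:
        j = closest(pt, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += pt[d]
    return sums, counts

def reduce_partials(partials, old_centers):
    """Reduce step: combine partial sums from all partitions into new centers."""
    new_centers = []
    for j, old in enumerate(old_centers):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))  # keep the center of an empty cluster
            continue
        new_centers.append([sum(sums[j][d] for sums, _ in partials) / total
                            for d in range(len(old))])
    return new_centers

def kmeans(partitions, centers, iterations=10):
    for _ in range(iterations):
        # T = P.map().withBroadcast(C)
        partials = [map_partition(part, centers) for part in partitions]
        # C = T.reduce()
        centers = reduce_partials(partials, centers)
    return centers

# Two partitions of 2-D points and two initial centers (hypothetical data).
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centers = kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]])
```

Only the small per-center sums and counts cross partition boundaries in each iteration, which is why the reduce and broadcast steps dominate the communication cost of this dataflow.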
![Page 68: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/68.jpg)
NiFi Coarse-grain Workflow
![Page 69: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/69.jpg)
Futures: Implementing Twister2 for Global AI and Modeling Supercomputer
http://www.iterativemapreduce.org/
![Page 70: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/70.jpg)
Twister2 Timeline: Current Release (End of September 2018)
• Twister:Net Dataflow Communication API
• Dataflow communications with MPI or TCP
• Data access
• Local File Systems
• HDFS Integration
• Task Graph
• Streaming and Batch analytics – Iterative jobs
• Data pipelines
• Deployments on Docker, Kubernetes, Mesos (Aurora), Slurm
• Harp for Machine Learning (Custom BSP Communications)
• Rich collectives
• Around 30 ML algorithms
![Page 71: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/71.jpg)
Twister2 Timeline: January 2018
• DataSet API similar to Spark batch and Heron streaming with Tset realization
• Can use Tsets for writing RDD/Streamlet style datasets
• Fault tolerance as in Heron and Spark
• Storm API for Streaming
• Hierarchical Dynamic Heterogeneous Task Graph
• Coarse grain and fine grain dataflow
• Cyclic task graph execution
• Dynamic scaling of resources and heterogeneous resources (at the resource layer) for streaming and heterogeneous workflow
• Link to Pilot Jobs
![Page 72: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/72.jpg)
Twister2 Timeline: July 1, 2018
• Naiad model based Task system for Machine Learning
• Native MPI integration to Mesos, Yarn
• Dynamic task migrations
• RDMA and other communication enhancements
• Integrate parts of Twister2 components as big data systems enhancements (i.e. run current Big Data software invoking Twister2 components)
• Heron (easiest), Spark, Flink, Hadoop (like Harp today)
• Tsets become compatible with RDD (Spark) and Streamlet (Heron)
• Support different APIs (i.e. run Twister2 looking like current Big Data Software), Hadoop, Spark (Flink), Storm
• Refinements like Marathon with Mesos etc.
• Function as a Service and Serverless
• Support higher level abstractions
• Twister:SQL (major Spark use case)
• Graph API
![Page 73: Big Data Benchmarking: Applications and Systems · Digital Science Center Big Data Benchmarking: Applications and Systems Geoffrey Fox, December 10, 2018 2018 International Symposium](https://reader034.fdocuments.us/reader034/viewer/2022042923/5f71e76913ea5c787a4c5d1d/html5/thumbnails/73.jpg)
Conclusions
• Can make use case collections to motivate benchmarks
• NIST and BDEC have templates
• Could helpfully fill in templates for benchmarks
• Research applications have some similarities but many differences from commercial use cases
• Increasing importance of integration of simulation and Machine Learning
• Increasing importance of distributed Edge applications
• Should benchmark dataflow and BSP style communication
• Twister2 will combine Heron and Spark with built-in HPC performance