Performance of PGAS Models on Emerging Multi-/Many-core Architectures using MVAPICH2-X

J. Hashmi, H. Subramoni and D. K. Panda
The Ohio State University
E-mail: {hashmi.29, Subramoni.1, panda.2}@osu.edu

Supercomputing '17 OSU Booth Talk
Parallel Programming Models Overview

[Figure: three abstract machine models.
- Shared Memory Model (e.g., SHMEM, DSM): processes P1..P3 share a single memory
- Distributed Memory Model (e.g., MPI, the Message Passing Interface): each process has its own memory
- Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, ...): per-process memories form a logical shared memory]
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Partitioned Global Address Space (PGAS) Models
• Key features
  - Simple shared-memory abstractions
  - Lightweight one-sided communication (see the sketch below)
  - Easier to express irregular communication
• Different approaches to PGAS
  - Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  - Libraries
    • OpenSHMEM
    • UPC++
    • Global Arrays
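As a minimal illustration of the one-sided style these models share, here is an OpenSHMEM sketch (PE numbers and values are illustrative; run with at least 2 PEs):

```c
#include <stdio.h>
#include <shmem.h>

static long counter;   /* global => symmetric: same address on every PE */

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == 0)
        /* one-sided: PE 0 writes into PE 1's copy of 'counter'
           without PE 1 posting a matching receive */
        shmem_long_p(&counter, 42L, 1);

    shmem_barrier_all();   /* completes the put and synchronizes */

    if (me == 1)
        printf("PE %d of %d sees counter = %ld\n", me, npes, counter);

    shmem_finalize();
    return 0;
}
```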
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on their communication characteristics (a minimal sketch follows below)
• Benefits:
  - Best of the distributed-memory computing model
  - Best of the shared-memory computing model
• Cons:
  - Two different runtimes
  - Needs great care while programming
  - Prone to deadlock if not careful
[Figure: an HPC application whose kernels 1..N run over MPI, with selected kernels (e.g., Kernel 2 and Kernel N) re-written in PGAS]
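As a concrete, minimal sketch of this structure, the snippet below initializes both runtimes and alternates between an MPI kernel and an OpenSHMEM kernel. The kernel stubs (mpi_kernel, shmem_kernel) are hypothetical placeholders; initialization order and MPI/OpenSHMEM interoperability are implementation-specific, which is exactly what MVAPICH2-X's unified runtime (next slides) addresses:

```c
#include <mpi.h>
#include <shmem.h>

/* hypothetical kernel stubs, for illustration only */
static void mpi_kernel(MPI_Comm comm) { (void)comm; /* two-sided / collectives */ }
static void shmem_kernel(void)        { /* irregular one-sided puts/gets */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);   /* runtime #1: MPI */
    shmem_init();             /* runtime #2: OpenSHMEM */

    mpi_kernel(MPI_COMM_WORLD);   /* kernel with regular communication */
    shmem_kernel();               /* kernel with irregular communication */

    /* both runtimes must be progressed and synchronized; forgetting one
       side is the classic deadlock scenario noted above */
    MPI_Barrier(MPI_COMM_WORLD);
    shmem_barrier_all();

    shmem_finalize();
    MPI_Finalize();
    return 0;
}
```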
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  - MVAPICH2-X (MPI + PGAS), available since 2011
  - Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  - Support for virtualization (MVAPICH2-Virt), available since 2015
  - Support for energy-awareness (MVAPICH2-EA), available since 2015
  - Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  - Used by more than 2,825 organizations in 85 countries
  - More than 432,000 (> 0.4 million) downloads from the OSU site directly
  - Empowering many TOP500 clusters (June '17 ranking)
    • 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China
    • 15th, 241,108-core (Pleiades) at NASA
    • 20th, 462,462-core (Stampede) at TACC
    • 44th, 74,520-core (Tsubame 2.5) at Tokyo Institute of Technology
  - Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  - System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  - Sunway TaihuLight (1st in Jun '17, 10M cores, 100 PFlops)
MVAPICH2 Software Family

High-Performance Parallel Programming Libraries
- MVAPICH2: Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
- MVAPICH2-X: Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
- MVAPICH2-GDR: Optimized MPI for clusters with NVIDIA GPUs
- MVAPICH2-Virt: High-performance and scalable MPI for hypervisor- and container-based HPC cloud
- MVAPICH2-EA: Energy-aware and high-performance MPI
- MVAPICH2-MIC: Optimized MPI for clusters with Intel KNC

Microbenchmarks
- OMB: Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools
- OSU INAM: Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
- OEMT: Utility to measure the energy consumption of MPI applications
Performance of PGAS Models using MVAPICH2-X
• PGAS and hybrid MPI+PGAS models support in MVAPICH2-X
• Optimizations of PGAS models for different architectures in MVAPICH2-X
  - Performance of Put and Get with OpenSHMEM, UPC, and UPC++
  - Comparison of KNL and Broadwell for OpenSHMEM point-to-point, collectives, and atomic operations
  - Impact of AVX-512 vectorization and MCDRAM on OpenSHMEM application kernels
  - Performance of UPC++ application kernels on the MVAPICH2-X communication runtime
MVAPICH2-X for Hybrid MPI+PGAS Applications
• Current model: separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  - Possible deadlock if both runtimes are not progressed
  - Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF
  - Available since 2012 (starting with MVAPICH2-X 1.9)
  - http://mvapich.cse.ohio-state.edu
Application-Level Performance with Graph500 and Sort

Graph500 Execution Time
[Figure: Graph500 execution time (s) vs. no. of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM); annotations: 13X, 7.6X]
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
• 8,192 processes
  - 2.4X improvement over MPI-CSR
  - 7.6X improvement over MPI-Simple
• 16,384 processes
  - 1.5X improvement over MPI-CSR
  - 13X improvement over MPI-Simple

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013

Sort Execution Time
[Figure: sort execution time (s) vs. input data / no. of processes (500GB-512, 1TB-1K, 2TB-2K, 4TB-4K) for MPI and Hybrid; annotation: 51%]
• Performance of the hybrid (MPI+OpenSHMEM) sort application
• 4,096 processes, 4 TB input size
  - MPI: 2408 sec; 0.16 TB/min
  - Hybrid: 1172 sec; 0.36 TB/min
  - 51% improvement over the MPI-based design

J. Jose, S. Potluri, H. Subramoni, X. Lu, K. Hamidouche, K. Schulz, H. Sundar and D. Panda, Designing Scalable Out-of-core Sorting with Hybrid MPI+PGAS Programming Models, PGAS '14
Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

[Figure: execution-time reduction. Left: KDD (2.5 GB) on 512 cores, 9.0%; right: truncated KDD (30 MB) on 256 cores, 27.6%]
• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
  - Overlapped data flow; one-sided data transfer; circular-buffer structure
• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k = 5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda, Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM, OpenSHMEM 2015
Intel Knights Landing (KNL) Processor Architecture
• Hardware multi-threaded cores
  - Up to 72 cores (model 7290)
• All cores divided into 36 tiles
• Each tile contains two cores
  - 2 VPUs per core
  - 1 MB shared L2 cache
• 512-bit wide vector registers
  - AVX-512 extension
[Figure: a single tile of KNL — two cores, each with 2 VPUs, sharing a 1 MB L2 cache and a CHA (caching/home agent)]
Intel Knights Landing (KNL) Processor: MCDRAM
• On-package Multi-Channel DRAM (MCDRAM)
  - 450 GB/s of theoretical bandwidth (4x of DDR)
  - Configurable in Flat, Cache, and Hybrid modes

[Figure: Flat mode (MCDRAM and DDR as separate address spaces), Cache mode (MCDRAM as a cache in front of DDR), and Hybrid mode (part cache, part flat)]
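In flat mode, MCDRAM is exposed as a separate NUMA node, so applications can place bandwidth-critical buffers there explicitly, e.g., with numactl or the memkind library. The talk does not say which mechanism its kernels used; the sketch below assumes memkind's hbwmalloc interface (link with -lmemkind):

```c
#include <stdio.h>
#include <stdlib.h>
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */

int main(void) {
    size_t n = 1 << 20;

    /* hbw_check_available() returns 0 when high-bandwidth memory
       (MCDRAM) is present on this node */
    if (hbw_check_available() != 0)
        fprintf(stderr, "no MCDRAM; the default policy falls back to DDR\n");

    double *buf = hbw_malloc(n * sizeof *buf);   /* allocate from MCDRAM */
    if (!buf) { perror("hbw_malloc"); return 1; }

    for (size_t i = 0; i < n; i++)   /* touch pages so they are placed */
        buf[i] = (double)i;

    printf("buf[0] = %f, buf[n-1] = %f\n", buf[0], buf[n - 1]);
    hbw_free(buf);
    return 0;
}
```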
Motivation
• Optimizing HPC programming models and runtimes on emerging multi-/many-cores is of great research interest
• Exploring the benefits of the architectural features of modern architectures for PGAS models and applications
• Characterizing and understanding:
  - the impact of vectorization on application kernels
  - MCDRAM vs. DDR performance
  - exploiting hardware multi-threading
Performance of PGAS Models on KNL using MVAPICH2-X

Intra-node PUT / Intra-node GET
[Figure: intra-node one-sided latency (us, log scale, 0.01-1000) vs. message size (1 B to 1 MB) for shmem_put/shmem_get, upc_putmem/upc_getmem, and upcxx_async_put/upcxx_async_get]
• Intra-node performance of one-sided Put/Get operations of PGAS libraries/languages using the MVAPICH2-X communication conduit; a simplified timing-loop sketch of how such latencies are measured follows below
• Near-native communication performance is observed on KNL
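Latency curves like these are typically produced by a ping-style timing loop, as in the OSU microbenchmarks (OMB). A simplified OpenSHMEM sketch (iteration count and the gettimeofday-based timer are chosen here for illustration; run with at least 2 PEs):

```c
#include <stdio.h>
#include <sys/time.h>
#include <shmem.h>

#define ITERS   1000
#define MAX_MSG (1 << 20)

static char src[MAX_MSG], dst[MAX_MSG];   /* symmetric buffers */

static double now_us(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    for (size_t size = 1; size <= MAX_MSG; size *= 2) {
        shmem_barrier_all();
        if (me == 0) {
            double t0 = now_us();
            for (int i = 0; i < ITERS; i++) {
                shmem_putmem(dst, src, size, 1);  /* one-sided put to PE 1 */
                shmem_quiet();                    /* wait for remote completion */
            }
            printf("%zu bytes: %.2f us\n", size, (now_us() - t0) / ITERS);
        }
    }
    shmem_barrier_all();
    shmem_finalize();
    return 0;
}
```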
Performance of PGAS Models on KNL using MVAPICH2-X

Inter-node PUT / Inter-node GET
[Figure: inter-node one-sided latency (us, log scale, 0.1-10000) vs. message size (1 B to 1 MB) for shmem_put/shmem_get, upc_putmem/upc_getmem, and upcxx_async_put/upcxx_async_get]
• Inter-node performance of one-sided Put/Get operations using the MVAPICH2-X communication conduit with an InfiniBand HCA (MT4115)
• Native IB performance is observed for all three PGAS models.
Microbenchmark Evaluations (Intra-node Put/Get)

Shmem_putmem / Shmem_getmem
[Figure: intra-node shmem_putmem and shmem_getmem latency (us, log scale) vs. message size (bytes) on Broadwell and KNL; annotated large-message latencies: 312.6 and 331.7 us (KNL) vs. 81.4 and 96.5 us (Broadwell)]
• Broadwell shows about 3X better performance than KNL on large messages
• Multi-threaded memcpy routines on KNL could offset the degradation caused by the slower cores on basic Put/Get operations

J. Hashmi, M. Li, H. Subramoni, D. Panda, Exploiting and Evaluating OpenSHMEM on KNL Architecture, OpenSHMEM 2017.
Microbenchmark Evaluations (Inter-node Put/Get)

Shmem_putmem / Shmem_getmem
[Figure: inter-node shmem_putmem and shmem_getmem latency (us, log scale) vs. message size (bytes) on Broadwell and KNL; annotated small-message latencies: 2.18 and 2.78 us (KNL) vs. 1.76 and 2.21 us (Broadwell)]
• Inter-node small-message latency is only about 2X worse on KNL, while large-message performance is almost similar on KNL and Broadwell.
Microbenchmark Evaluations (Collectives)

Shmem_reduce on 128 processes / Shmem_broadcast on 128 processes
[Figure: latency (us, log scale, 1-10000) vs. message size (bytes) on Broadwell and KNL; gaps of ~4X (reduce) and ~2X (broadcast)]
• 2 KNL nodes (64 ppn) and 8 Broadwell nodes (16 ppn)
• Up to 4X degradation is observed on KNL in the collective benchmarks
• The basic point-to-point performance difference is reflected in collectives as well; the collective interfaces exercised here are sketched below
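For reference, the OpenSHMEM 1.3-era collective interfaces used by such benchmarks look roughly like this minimal sketch (array sizes and values are illustrative; run with at least 2 PEs):

```c
#include <stdio.h>
#include <shmem.h>

/* symmetric work/synchronization arrays required by
   OpenSHMEM 1.3-style collectives */
static long pSyncB[SHMEM_BCAST_SYNC_SIZE];
static long pSyncR[SHMEM_REDUCE_SYNC_SIZE];
static int  pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static int  src, sum;
static int  bsrc[4], bdst[4];

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)  pSyncB[i] = SHMEM_SYNC_VALUE;
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++) pSyncR[i] = SHMEM_SYNC_VALUE;
    src = me;
    if (me == 0) for (int i = 0; i < 4; i++) bsrc[i] = i;
    shmem_barrier_all();   /* pSync must be initialized on all PEs first */

    /* reduction: sum of 'src' over all PEs, result in 'sum' on every PE */
    shmem_int_sum_to_all(&sum, &src, 1, 0, 0, npes, pWrk, pSyncR);

    /* broadcast 4 32-bit elements from PE 0; note the root's target
       buffer is not updated by shmem_broadcast32 */
    shmem_broadcast32(bdst, bsrc, 4, 0, 0, 0, npes, pSyncB);

    if (me == 1)
        printf("sum over %d PEs = %d, bdst[3] = %d\n", npes, sum, bdst[3]);

    shmem_finalize();
    return 0;
}
```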
Microbenchmark Evaluations (Atomics)

OpenSHMEM atomics on 128 processes
[Figure: latency (us, 0-12) of fadd32, fadd64, cswap32, cswap64, inc32, and inc64 on KNL vs. Broadwell]
• Using multiple KNL nodes, atomic operations showed about 2.5X degradation on the compare-swap and increment atomics
• Fetch-and-add (32-bit) showed up to 4X degradation on KNL; the corresponding calls are sketched below
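The corresponding OpenSHMEM 1.3-era atomic calls, as a minimal sketch (target PE and values are illustrative; run with at least 2 PEs):

```c
#include <stdio.h>
#include <shmem.h>

static int  counter32;   /* symmetric variables targeted by remote atomics */
static long counter64;

int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    shmem_barrier_all();
    if (me != 0) {
        /* fadd32: atomically add 1 to PE 0's counter32 and fetch
           the previous value */
        int old = shmem_int_fadd(&counter32, 1, 0);

        /* inc64: atomic increment with no fetched result */
        shmem_long_inc(&counter64, 0);

        /* cswap32: set PE 0's counter32 to 'me' only if it still equals
           'old' (may fail if another PE updated it in between) */
        shmem_int_cswap(&counter32, old, me, 0);
    }
    shmem_barrier_all();

    if (me == 0)
        printf("counter32 = %d, counter64 = %ld\n", counter32, counter64);

    shmem_finalize();
    return 0;
}
```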
NAS Parallel Benchmark Evaluation

NAS-EP (RNG), CLASS=B / NAS-BT (PDE solver), CLASS=B
[Figure: execution time (s) vs. no. of processes, comparing KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]
• AVX-512 vectorized execution of the BT kernel on KNL showed 30% improvement over default execution, while the EP kernel did not show any improvement
• Broadwell showed 20% improvement over optimized KNL on BT, and 4X improvement over all KNL executions on the EP kernel (random number generation).
NAS Parallel Benchmark Evaluation (contd.)

NAS-MG (multigrid solver), CLASS=B / NAS-SP (non-linear PDE), CLASS=B
[Figure: execution time (s) vs. no. of processes, comparing KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]
• Similar performance trends are observed on the BT and MG kernels as well
• On the SP kernel, MCDRAM-based execution showed up to 20% improvement over default at 16 processes.
Application Kernels Evaluation

Heat-Image kernel / Heat-2D kernel using the Jacobi method
[Figure: execution time (s, log scale) vs. no. of processes (16-128), comparing KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]
• On heat-diffusion-based kernels, AVX-512 vectorization showed better performance
• MCDRAM showed significant benefits on the Heat-Image kernel for all process counts; combined with AVX-512 vectorization, it showed up to 4X improved performance
Application Kernels Evaluation (contd.)

DAXPY kernel / Matrix multiplication kernel
[Figure: execution time (s, log scale) vs. no. of processes (16-128), comparing KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]
• Vectorization helps in matrix multiplication and vector operations; a DAXPY sketch follows below
• Due to the heavily compute-bound nature of these kernels, MCDRAM did not show any significant performance improvement.
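To make "vector operations" concrete: DAXPY is the canonical loop AVX-512 accelerates, since one 512-bit register holds eight doubles. The talk's actual kernel sources are not shown; this is a plain-C sketch, compiled with, e.g., icc -xMIC-AVX512 or gcc -mavx512f to enable 512-bit vectorization:

```c
#include <stdio.h>
#include <stdlib.h>

/* DAXPY: y = a*x + y. With AVX-512 the compiler can process
   8 double-precision iterations per vector operation. */
static void daxpy(size_t n, double a, const double *restrict x,
                  double *restrict y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    size_t n = 1 << 24;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;

    for (size_t i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(n, 3.0, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 5.0 */

    free(x); free(y);
    return 0;
}
```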
Application Kernels Evaluation (contd.)

Scalable Integer Sort kernel (ISx)
[Figure: execution time (s, log scale) vs. no. of processes (2-128), comparing KNL (Default), KNL (AVX-512), KNL (AVX-512 + MCDRAM), and Broadwell]
• Up to 3X improvement over un-optimized execution is observed on KNL
• Broadwell showed up to 2X better performance in a core-by-core comparison
Node-by-node Evaluation using Application Kernels

Application kernels on a single KNL node vs. a single Broadwell node
[Figure: execution time (s, log scale) of 2D-Heat, Heat-Image, MatMul, DAXPY, and ISx (strong scaling) on KNL (68 cores) vs. Broadwell (28 cores); annotations: +30%, -5%]
• A single node of KNL is evaluated against a single node of Broadwell using all available physical cores
• On the Heat-Image and ISx kernels, KNL showed better performance than the Intel Xeon (Broadwell) node
UPC++ Application Kernels Performance on KNL
• We used two application kernels to evaluate the UPC++ model with MVAPICH2-X as the communication runtime:
  - Sparse matrix-vector multiplication (SpMV)
  - Adaptive mesh refinement (AMR) kernel: 2D heat conduction using Jacobi iteration
• We designed the 2D-Heat kernel using pure UPC++ asynchronous primitives and provide MVAPICH2-X-based communication support to achieve near-native performance; the underlying Jacobi update is sketched below
• We observed near-optimal speed-up for these kernels on two KNL nodes
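For reference, the Jacobi update at the core of the 2D-Heat kernel looks like the plain-C sketch below; the talk's UPC++ version additionally distributes rows across ranks and exchanges halo rows with asynchronous one-sided copies. The grid size (matching the 512x512 run) and iteration count are illustrative:

```c
#include <stdio.h>
#include <string.h>

#define N     512    /* grid dimension, matching the 512x512 run */
#define ITERS 100    /* illustrative iteration count */

static double grid[N][N], next[N][N];

int main(void) {
    /* fixed boundary: hot top edge, cold elsewhere */
    for (int j = 0; j < N; j++) grid[0][j] = 100.0;

    for (int it = 0; it < ITERS; it++) {
        /* Jacobi update: each interior point becomes the average of
           its four neighbors, read from the previous iteration */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                next[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                     grid[i][j - 1] + grid[i][j + 1]);
        /* copy the interior back; boundaries stay fixed */
        for (int i = 1; i < N - 1; i++)
            memcpy(&grid[i][1], &next[i][1], (N - 2) * sizeof(double));
    }

    printf("center value after %d iterations: %f\n", ITERS, grid[N/2][N/2]);
    return 0;
}
```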
Supercomputing‘17 32NetworkBasedComputingLaboratory
Application Kernels Performance of UPC++ on MVAPICH2-X

Strong-scaling performance of the SpMV kernel (2K x 2K) / Strong-scaling performance of the 2D-Heat kernel (512 x 512)
[Figure: time (s) vs. no. of processes (2-128) for UPC++ SpMV and UPC++ 2D-Heat]
• Implemented the 2D-Heat application kernel in UPC++
• The SpMV and 2D-Heat kernels using MVAPICH2-X showed good scalability with increasing numbers of KNL processes

J. Hashmi, M. Li, H. Subramoni, D. Panda, Performance of PGAS Models on KNL: A Comprehensive Study with MVAPICH2-X, IXPUG 2017.
Performance Results Summary

[Figure: radar chart comparing KNL (Default), KNL (AVX512), KNL (AVX512 + MCDRAM), and Broadwell on four axes: Put/Get and atomics performance, collectives performance, core-by-core application performance, and node-by-node application performance; closer to center is better]
Conclusion
• Comprehensive performance evaluation of MVAPICH2-X-based OpenSHMEM, UPC, and UPC++ models on the KNL architecture
• Observed significant performance gains on application kernels when using AVX-512 vectorization
  - Up to 2.5x performance benefit in terms of execution time
• MCDRAM benefits are not prominent on most of the application kernels
  - Lack of memory-bound operations
• KNL showed up to 3X worse performance than Broadwell in the core-by-core evaluation
• KNL showed better than or on-par performance with Broadwell on the Heat-Image and ISx kernels in the node-by-node evaluation
• Runtime implementations need to take advantage of the concurrency of KNL cores
Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/