Data-Intensive Computing: From Clouds to GPUs


  • Data-Intensive Computing: From Clouds to GPUs. Gagan Agrawal

  • Growing need for analysis of large-scale data: scientific and commercial. Data-Intensive Supercomputing (DISC). Map-Reduce has received a lot of attention in the database, data mining, and high-performance computing communities.

    Motivation

  • Motivation (2): Processor architecture trends: clock speeds are not increasing, the trend is towards multi-core architectures, accelerators like GPUs are very popular, and clusters of multi-cores/GPUs are becoming common. Trend towards the cloud: use storage/computing/services from a provider (how many of you prefer Gmail over a cse/osu account for e-mail?); a utility model of computation. Need high-level APIs and adaptation.

  • My Research Group

    Data-intensive theme at multiple levels: parallel programming models, multi-cores and accelerators, adaptive middleware, scientific data management / workflows, and deep web integration and analysis.


  • Personnel: currently 10 PhD students and 4 MS thesis students; 7 PhDs graduated between 2005 and 2008.


  • This Talk: a parallel programming API for data-intensive computing, namely an alternate API and system to Google's Map-Reduce, with an actual comparison; data-intensive computing on accelerators (compilation for GPUs); and an overview of other topics: scientific data management, adaptive and streaming middleware, and the deep web.


  • Map-Reduce: a simple API for (data-intensive) parallel programming. The computation is: apply map to each input data element to produce (key, value) pair(s); sort the pairs by key; apply reduce to the set of values for each distinct key.
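
    To make the model concrete, here is a minimal sequential sketch of word count in the Map-Reduce style (my own illustration in C++, not code from the talk); the grouping step is modeled with a std::map, and the names map_fn/reduce_fn are mine, not any framework's API.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: each input element (a line of text) emits (word, 1) pairs.
    void map_fn(const std::string& line,
                std::vector<std::pair<std::string, int>>& out) {
        std::istringstream words(line);
        std::string w;
        while (words >> w) out.emplace_back(w, 1);
    }

    // reduce: combine all values that share a key.
    int reduce_fn(const std::vector<int>& values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    int main() {
        std::vector<std::string> input = {"the quick fox", "the lazy dog"};

        // Map phase: emit intermediate (key, value) pairs.
        std::vector<std::pair<std::string, int>> pairs;
        for (const auto& line : input) map_fn(line, pairs);

        // Sort/group phase: std::map keeps keys sorted, as the framework would.
        std::map<std::string, std::vector<int>> groups;
        for (const auto& p : pairs) groups[p.first].push_back(p.second);

        // Reduce phase: one reduce call per distinct key.
        for (const auto& g : groups)
            std::cout << g.first << ": " << reduce_fn(g.second) << "\n";
    }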


  • Map-Reduce Execution

  • Positives: a simple API, based on functional languages, and very easy to learn; support for fault tolerance, which is important for very large-scale clusters. Questions: performance? comparison with other approaches? suitability for different classes of applications?

    Map-Reduce: Positives and Questions

  • Class of Data-Intensive Applications: many different types of applications. Data-center kind of applications: data scans, sorting, indexing. More "compute-intensive" data-intensive applications: machine learning, data mining, NLP; Map-Reduce / Hadoop is being widely used for this class. Standard database operations: a SIGMOD 2009 paper compares Hadoop with databases and OLAP systems. What is Map-Reduce suitable for? What are the alternatives? Are MPI/OpenMP/Pthreads too low-level?

  • Hadoop Implementation: HDFS is almost GFS, but with no file update; it cannot be directly mounted by an existing operating system, and it provides fault tolerance. Components: Name Node, Job Tracker, Task Tracker.

  • FREERIDE Goals: a Framework for Rapid Implementation of Data Mining Engines. The ability to rapidly prototype a high-performance mining implementation, with distributed-memory parallelization, shared-memory parallelization, and the ability to process disk-resident datasets, requiring only modest modifications to a sequential implementation for all three. Developed 2001-2003 at Ohio State.

  • FREERIDE Technical Basis: popular data mining algorithms share a common canonical loop, a generalized reduction, which can be used as the basis for supporting a common API. Demonstrated for popular data mining and scientific data processing applications. A concrete instantiation is sketched after the loop below.

    While ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op f(d)
        }
        ...
    }
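
    As a concrete illustration (my own sketch, not code from the talk), here is one k-means iteration written as this generalized reduction: process(d) finds the nearest centroid index, and the reduction object R accumulates per-cluster coordinate sums and counts, with op being addition.

    #include <cstddef>
    #include <vector>

    struct Point { double x, y, z; };

    // Reduction object R: per-cluster running sums and counts.
    struct ReductionObject {
        std::vector<Point> sum;
        std::vector<std::size_t> count;
        explicit ReductionObject(std::size_t k)
            : sum(k, Point{0, 0, 0}), count(k, 0) {}
    };

    // process(d): map a data instance to a reduction index (nearest centroid).
    std::size_t process(const Point& d, const std::vector<Point>& centroids) {
        std::size_t best = 0;
        double bestDist = 1e300;
        for (std::size_t i = 0; i < centroids.size(); ++i) {
            double dx = d.x - centroids[i].x;
            double dy = d.y - centroids[i].y;
            double dz = d.z - centroids[i].z;
            double dist = dx * dx + dy * dy + dz * dz;
            if (dist < bestDist) { bestDist = dist; best = i; }
        }
        return best;
    }

    // One reduction loop: R(I) = R(I) op f(d), with op = element-wise addition.
    void kmeansIteration(const std::vector<Point>& data,
                         const std::vector<Point>& centroids,
                         ReductionObject& R) {
        for (const Point& d : data) {
            std::size_t I = process(d, centroids);
            R.sum[I].x += d.x; R.sum[I].y += d.y; R.sum[I].z += d.z;
            R.count[I] += 1;
        }
    }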

  • Comparing Processing Structure: similar, but with subtle differences.

  • Observations on Processing Structure: Map-Reduce is based on a functional idea and does not maintain state, which can lead to sorting overheads. The FREERIDE API is based on a programmer-managed reduction object: not as clean, but it avoids sorting, can also help shared-memory parallelization, and helps better fault recovery. A sketch of such a reduction object follows.
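
    To make the contrast concrete, here is a hedged sketch of what a programmer-managed reduction object can look like (the names are mine, not FREERIDE's actual API): the programmer updates the object in place and supplies a combination operation used to merge per-thread or per-node copies, so no intermediate (key, value) pairs ever need to be sorted.

    #include <cstddef>
    #include <vector>

    // Illustrative programmer-managed reduction object: a histogram.
    struct HistogramReduction {
        std::vector<long> bins;

        explicit HistogramReduction(std::size_t k) : bins(k, 0) {}

        // Called once per data instance: Reduc(i) = Reduc(i) op val.
        void accumulate(std::size_t i, long val) { bins[i] += val; }

        // Called by the runtime to merge another thread's/node's copy
        // into this one (the global combination step).
        void combine(const HistogramReduction& other) {
            for (std::size_t i = 0; i < bins.size(); ++i)
                bins[i] += other.bins[i];
        }
    };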

  • Tuning parameters in Hadoop: input split size, max number of concurrent map tasks per node, and number of reduce tasks. For comparison, we used four applications. Data mining: k-means, kNN, Apriori; simple data scan application: word count. Experiments were run on a multi-core cluster with 8 cores per node (8 map tasks).

    Experiment Design

  • Results: Data Mining. [Chart: k-means, varying # of nodes; avg. time per iteration (sec) vs. # of nodes. Dataset: 6.4 GB, k = 1000, dim = 3.]

  • Results: Data Mining (II). [Chart: Apriori, varying # of nodes; avg. time per iteration (sec) vs. # of nodes. Dataset: 900 MB, support level: 3%, confidence level: 9%.]

  • Results: Data Mining (III). [Chart: kNN, varying # of nodes; avg. time per iteration (sec) vs. # of nodes. Dataset: 6.4 GB, k = 1000, dim = 3.]

  • Results: Datacenter-like Application. [Chart: word count, varying # of nodes; total time (sec) vs. # of nodes. Dataset: 6.4 GB.]

  • Scalability Comparison. [Chart: k-means, varying dataset size; avg. time per iteration (sec) vs. dataset size. k = 100, dim = 3, on 8 nodes.]

  • Scalability: Word Count. [Chart: word count, varying dataset size; total time (sec) vs. dataset size. On 8 nodes.]

  • Overhead Breakdown: four components affect Hadoop performance: initialization cost, I/O time, sorting/grouping/shuffling, and computation time. What is the relative impact of each? An experiment with k-means.

  • Analysis with K-means. [Chart: varying the number of clusters (k); avg. time per iteration (sec) vs. # of k-means clusters. Dataset: 6.4 GB, dim = 3, on 16 nodes.]

  • Analysis with K-means (II). [Chart: varying the number of dimensions; avg. time per iteration (sec) vs. # of dimensions. Dataset: 6.4 GB, k = 1000, on 16 nodes.]

  • Observations: initialization costs and the limited I/O bandwidth of HDFS are significant in Hadoop.

    Sorting is also an important limiting factor for Hadoop's performance.

  • This Talk: a parallel programming API for data-intensive computing, namely an alternate API and system to Google's Map-Reduce, with an actual comparison; data-intensive computing on accelerators (compilation for GPUs); and an overview of other topics: scientific data management, adaptive and streaming middleware, and the deep web.


  • Background - GPU Computing: many-core architectures/accelerators are becoming more popular; GPUs are inexpensive and fast. CUDA is a high-level language for GPU programming.

  • CUDA Programming: a significant improvement over the use of graphics libraries, but it requires detailed knowledge of the GPU architecture and a new language. The programmer must specify the grid configuration, deal with memory allocation and movement, and explicitly manage the memory hierarchy; a minimal host-side sketch follows.
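
    To illustrate that bookkeeping, here is a minimal hedged CUDA sketch (my own example, not from the talk) showing the three obligations: explicit memory allocation and movement, grid configuration, and kernel invocation.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel: scale each element; the body is just a placeholder.
    __global__ void scale(float* data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        float* host = new float[n];
        for (int i = 0; i < n; ++i) host[i] = 1.0f;

        // 1. Explicit memory allocation and movement.
        float* dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // 2. Grid configuration: the programmer chooses block/grid sizes.
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        // 3. Kernel invocation with the chosen configuration.
        scale<<<blocks, threadsPerBlock>>>(dev, n, 2.0f);

        // Explicit movement back to the host, then cleanup.
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("host[0] = %f\n", host[0]);
        cudaFree(dev);
        delete[] host;
    }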

  • Parallel Data Mining: the common structure of data mining applications (FREERIDE):

    /* outer sequential loop */
    while ( ) {
        /* reduction loop */
        foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

  • Porting on GPUs: the high-level parallelization is straightforward; the details lie in data movement, the impact of thread count on reduction time, and the use of shared memory. A hedged kernel sketch follows.
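
    The following CUDA sketch (mine, not the system's generated code) expresses the reduction loop above with one private copy of the reduction object per thread, the replication scheme described on the memory allocation slide below; the kernels pair with the host-side pattern shown earlier, and a second kernel performs the global combination.

    // Hedged sketch: generalized reduction with one reduction-object copy per
    // thread. 'reduc' holds totalThreads copies of a k-element object, laid
    // out so thread t updates reduc[i * totalThreads + t] without contention.
    __global__ void reductionKernel(const float* elements, int n, int k,
                                    float* reduc) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int totalThreads = gridDim.x * blockDim.x;

        // Each thread processes a strided subset of the elements.
        for (int e = t; e < n; e += totalThreads) {
            // Stand-in for process(e): derive an index and a value.
            int i = static_cast<int>(elements[e]) % k;
            float val = elements[e];
            // Reduc(i) = Reduc(i) op val, on this thread's private copy;
            // no synchronization is needed because copies are not shared.
            reduc[i * totalThreads + t] += val;
        }
    }

    // Global combination: merge all per-thread copies into thread 0's slots.
    __global__ void combineKernel(float* reduc, int k, int totalThreads) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < k) {
            float sum = 0.0f;
            for (int t = 0; t < totalThreads; ++t)
                sum += reduc[i * totalThreads + t];
            reduc[i * totalThreads] = sum;
        }
    }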

  • Architecture of the System: the user input (variable information, reduction functions, optional functions) feeds a code analyzer (in LLVM) and a variable analyzer, which derive the variable access patterns and combination operations; a code generator then emits the host program (grid configuration and kernel invocation) and the kernel functions, producing the executable.

  • User Input

  • Analysis of Sequential Code: get the access features of each variable; determine the data to be replicated; get the operator for the global combination; identify variables for shared memory.

  • Memory Allocation and Copy: a copy of the reduction object is needed for each thread (T0, T1, ..., T63), and the updates are copied back to host memory after the kernel reduction function returns. [Diagram: per-thread copy layouts A, B, and C.]

  • Generating CUDA Code and the C/C++ Code Invoking the Kernel Function: memory allocation and copy, thread grid configuration (block number and thread number), the global function, the kernel reduction function, and the global combination.

  • Optimizations: using shared memory; providing user-specified initialization functions and combination functions; specifying variables that are allocated once. A shared-memory sketch follows.
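
    As a hedged illustration of the shared-memory optimization (my sketch, not the generated code, and assuming a GPU that supports floating-point atomics on shared memory): a small reduction object can be replicated once per block in shared memory instead of once per thread in device memory, with atomic updates inside the block and one write-out per block.

    // Hedged sketch: per-block reduction object kept in shared memory.
    // Assumes the k-element reduction object fits in shared memory; launch
    // with sharedReductionKernel<<<blocks, threads, k * sizeof(float)>>>(...).
    __global__ void sharedReductionKernel(const float* elements, int n, int k,
                                          float* blockReduc) {
        extern __shared__ float localReduc[];  // k floats per block

        // Initialize this block's copy of the reduction object.
        for (int i = threadIdx.x; i < k; i += blockDim.x) localReduc[i] = 0.0f;
        __syncthreads();

        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int totalThreads = gridDim.x * blockDim.x;
        for (int e = t; e < n; e += totalThreads) {
            int i = static_cast<int>(elements[e]) % k;  // stand-in for process(e)
            atomicAdd(&localReduc[i], elements[e]);     // op within the block
        }
        __syncthreads();

        // Write the block's copy out; a later pass combines per-block copies.
        for (int i = threadIdx.x; i < k; i += blockDim.x)
            blockReduc[blockIdx.x * k + i] = localReduc[i];
    }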

  • Applications: k-means clustering, EM clustering, PCA.

  • Experimental Results. [Chart: speedup of k-means.]

  • [Chart: speedup of k-means on a GeForce 9800 GX2.]

  • [Chart: speedup of EM.]

  • [Chart: speedup of PCA.]

  • Deep Web Data Integration (wangfa@cse.ohio-state.edu): the emergence of the deep web; the deep web is huge and different from the surface web. Challenges for integration: it is not accessible through search engines, and there are inter-dependencies among deep web sources.

  • Our Contributions: structured query processing on the deep web; schema mining/matching; analysis of deep web data sources.

  • Adaptive Middleware: to enable time-critical event handling to achieve the maximum benefit while satisfying the time/budget constraint; to be compatible with grid and web services; to enable easy deployment and management with minimum human intervention; to be usable in a heterogeneous distributed environment. (ICAC 2008)

  • HASTE Middleware Design (ICAC 2008). [Diagram: an application (code, configuration file, benefit function) and its time-critical events enter the application layer; the service layer contains autonomic service components (application services, each with an agent/controller), an application deployment service, a resource monitoring service (CPU, memory, bandwidth), a resource allocation service (efficiency value estimation, scheduling), and an autonomic adaptation service (system model, estimator); all sit atop the OGSA infrastructure (Globus Toolkit 4.0).]

  • Workflow Composition System

  • Summary: growing data is creating new challenges in HPC, grid, and cloud environments; a number of topics are being addressed, and there are many opportunities for involvement. 888 meets Thursdays 5:00-6:00, DL 280.
