Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds


Page 1: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

PDAC-10

Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Gagan Agrawal
Ohio State University
(Joint Work with Tekin Bicer, David Chiu, Yu Su, ..)

Page 2: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Motivation

• Cloud Resources
• Pay-as-you-go
• Elasticity
• Black boxes from a performance viewpoint
• Scientific Data
  – Specialized formats, like NetCDF, HDF5, etc.
  – Very Large Scale

Page 3: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Ongoing Work at Ohio State

• MATE-EC2: Middleware for Data-Intensive Computing on EC2
  – Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
  – Target NetCDF and HDF5
• Accelerating Data Mining Computations Using Accelerators
• Resource Allocation Problems on Clouds

Page 4: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE-EC2: Motivation

• MATE – MapReduce with an Alternate API
• MATE-EC2: Implementation for AWS Environments
• Cloud resources are black boxes
• Need for services and tools that can…
  – get the most out of cloud resources
  – help their users with easy APIs

Page 5: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE vs. Map-Reduce Processing Structure

• The Reduction Object represents the intermediate state of the execution
• The Reduce function is commutative and associative
• Sorting, grouping and similar overheads are eliminated by the reduction function/object
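The reduction-object idea can be illustrated with a minimal sketch. This is not MATE's actual API: the names `ReductionObject`, `local_reduce` and `combine`, the 1-D k-means step, and the sample data are all assumptions made for illustration. Each chunk is folded directly into a reduction object, and the per-chunk objects are merged with a commutative, associative `combine`, so no sort or group-by of intermediate pairs is needed.

```python
from dataclasses import dataclass


@dataclass
class ReductionObject:
    """Running per-cluster sums and counts (the intermediate state)."""
    sums: list
    counts: list

    @classmethod
    def empty(cls, k):
        return cls(sums=[0.0] * k, counts=[0] * k)


def local_reduce(chunk, centers):
    """Fold every point of a chunk straight into a reduction object."""
    robj = ReductionObject.empty(len(centers))
    for x in chunk:
        nearest = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
        robj.sums[nearest] += x
        robj.counts[nearest] += 1
    return robj


def combine(a, b):
    """Merge two reduction objects; the merge order does not matter."""
    return ReductionObject(
        sums=[s + t for s, t in zip(a.sums, b.sums)],
        counts=[m + n for m, n in zip(a.counts, b.counts)],
    )


# Hypothetical 1-D data split into two chunks, with two initial centers.
chunks = [[1.0, 2.0, 9.0], [10.0, 11.0, 3.0]]
centers = [0.0, 10.0]

partials = [local_reduce(c, centers) for c in chunks]
total = combine(partials[0], partials[1])
# Assumes every cluster received at least one point (true for this data).
new_centers = [s / n for s, n in zip(total.sums, total.counts)]
```

Because `combine` is commutative, the partial objects can be merged in whatever order workers finish.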

Page 6: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE-EC2 Design

• Data organization
  – Three levels: Buckets, Chunks and Units
  – Metadata information
• Chunk Retrieval
  – Threaded Data Retrieval
  – Selective Job Assignment
• Load Balancing and handling heterogeneity
  – Pooling mechanism
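Threaded data retrieval can be sketched roughly as follows, assuming a chunk is split into equal byte ranges that several threads fetch concurrently into one buffer. The in-memory `DATA_OBJECT` and the `fetch_range` helper are hypothetical stand-ins for a remote data object and a ranged read; they are not part of MATE-EC2 or of any AWS library.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a remote data object; a real system would issue ranged reads.
DATA_OBJECT = bytes(range(256)) * 64  # 16 KiB of hypothetical data


def fetch_range(offset, length):
    """Hypothetical ranged read: returns bytes [offset, offset + length)."""
    return DATA_OBJECT[offset:offset + length]


def retrieve_chunk(chunk_offset, chunk_size, num_threads):
    """Fetch one chunk as num_threads pieces written into a shared buffer."""
    buffer = bytearray(chunk_size)
    piece = -(-chunk_size // num_threads)  # ceiling division

    def worker(i):
        start = i * piece
        length = min(piece, chunk_size - start)
        if length > 0:
            # Each thread writes a disjoint slice of the buffer.
            buffer[start:start + length] = fetch_range(chunk_offset + start, length)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, range(num_threads)))  # wait and surface errors
    return bytes(buffer)


chunk = retrieve_chunk(chunk_offset=1000, chunk_size=4096, num_threads=16)
```

With real network reads, the benefit would come from overlapping the latency of many requests; the in-memory stand-in only demonstrates the splitting and reassembly.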

Page 7: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE-EC2 Processing Flow

[Figure: processing flow between an EC2 Master Node (Job Scheduler, Metadata File), an EC2 Slave Node (Computing Layer, retrieval threads T0 to T3) and an S3 Data Object (chunks C0, C5, Cn)]

Steps shown in the figure:

• Request Job from Master Node
• C0 is assigned as job
• Retrieve chunk pieces and write them into the buffer
• Pass retrieved chunk to Computing Layer and process
• Request another job
• C5 is assigned as a job
• Retrieve the new job
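The request/assign loop above resembles a pooling scheme in which workers pull jobs on demand, so faster nodes naturally process more chunks. The sketch below is an assumption-laden illustration: threads stand in for slave nodes, a `queue.Queue` stands in for the master's job pool, and `run_pool` is a hypothetical name. Selective job assignment is not modeled.

```python
import queue
import threading


def run_pool(chunk_ids, worker_names):
    """Workers repeatedly request the next chunk until the pool is empty."""
    pool = queue.Queue()
    for cid in chunk_ids:
        pool.put(cid)

    # Each worker appends only to its own list, so no extra locking is needed.
    assigned = {name: [] for name in worker_names}

    def worker(name):
        while True:
            try:
                cid = pool.get_nowait()  # "request job from master"
            except queue.Empty:
                return                   # no jobs left
            assigned[name].append(cid)   # stand-in for retrieve + process

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assigned


assigned = run_pool(chunk_ids=range(16),
                    worker_names=["slave0", "slave1", "slave2", "slave3"])
done = sorted(cid for jobs in assigned.values() for cid in jobs)
```

How many chunks each worker ends up with varies from run to run, but every chunk is processed exactly once.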

Page 8: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Experiments

• Goals:
  – Finding the most suitable setting for AWS
  – Performance of MATE-EC2 on heterogeneous and homogeneous environments
  – Performance comparison of MATE-EC2 and Map-Reduce
• Applications: KMeans and PCA
• Used Resources:
  – 4 Large EC2 instances for processing, 1 Large instance for Master
  – 16 Data objects on S3 (8.2GB total data set for both applications)

Page 9: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Different Data Chunk Sizes

• KMeans
• 16 Retrieval threads
• Performance increase
  – 8M vs. others
    • 1.13 to 1.30
  – 1 Thread vs. 16 Threads versions
    • 1.24 to 1.81

Page 10: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Different Number of Threads

• 128MB chunk size
• Performance increase in Fig. (KMeans)
  – 1.37 to 1.90
• Performance increase for PCA
  – 1.38 to 1.71

Page 11: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Selective Job Assignment

• Performance increase in Fig. (KMeans)
  – 1.01 to 1.14
• For PCA
  – 1.19 to 1.68

Page 12: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Heterogeneous Env.

• L: Large instances, S: Small instances
• 128MB chunk size
• Overheads in Fig. (KMeans)
  – Under 1%
• Overheads for PCA
  – 1.1 to 11.7

Page 13: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE-EC2 vs. Map-Reduce

• Scalability (MATE)
  – Efficiency: 90%
• Scalability (MR)
  – Efficiency: 74%
• Speedups:
  – MATE vs. MR
    • 3.54 to 4.58
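The slide does not say how efficiency was computed. A common definition, assumed here, is speedup divided by the factor by which resources were increased. The timings below are hypothetical and are not measurements from the talk.

```python
def speedup(t_base, t_scaled):
    """How many times faster the scaled run is than the baseline run."""
    return t_base / t_scaled


def efficiency(t_base, t_scaled, resource_factor):
    """Speedup divided by the factor by which resources were increased."""
    return speedup(t_base, t_scaled) / resource_factor


# Hypothetical timings in seconds (NOT results from the talk):
# a baseline run, and a run with 4x the resources.
t_1x, t_4x = 120.0, 40.0
s = speedup(t_1x, t_4x)
e = efficiency(t_1x, t_4x, resource_factor=4)
```

Under this definition, an efficiency of 90% on 4x resources would correspond to a speedup of 3.6.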

Page 14: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

MATE-EC2: Continuing Directions

• Cloud Bursting
  – Cloud as a Complement or On-Demand Alternative to Local Resources
• Autotuning for a New Cloud Environment
  – Data Storage can be black-box
• Data-Intensive Applications on Cluster of GPUs
  – Programming Model, System Design

Page 15: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Outline

• MATE-EC2: Middleware for Data-Intensive Computing on EC2
  – Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
  – Target NetCDF and HDF5
• Accelerating Data Mining Computations Using Accelerators
• Resource Allocation Problems on Clouds

Page 16: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Data Management: Motivation

• Datasets are becoming extremely large
• Scientific datasets are in formats like NetCDF and HDF5
• Existing database solutions are not scalable
  – Can’t help with native data formats

Page 17: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Data Management: Use Scenarios

• Data Dissemination Efforts
  – Support User-Defined Subsetting and Data Aggregation
• Implementing Data Processing Applications
  – Higher-level API than NetCDF/HDF5 libraries
• Visualization Tools (ParaView etc.)
  – Data Format Conversion on Large Datasets

Page 18: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Initial Prototype: Data Subsetting With Relational View on NetCDF

[Figure: query-processing flowchart; stage labels as extracted]

• Parse the SQL expression
• Metadata for NetCDF dataset
• Generate data access code
• Filter variable value
• Filter dimensions
• Partition tasks and assign to slave processes
• Execute query

Page 19: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Metadata descriptor

• Dataset Storage Description
  – Lists the nodes and the directories where the data is resident.
• Dataset Layout Description
  – Header part of each NetCDF file
    • Naturally included in the NetCDF dataset
    • Saves the effort of generating the metadata
  – Describes the layout of each NetCDF file

Page 20: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Pre-filter and Post-filter

• Pre-filter:
  – Takes SQL grammar and metadata as input
  – Filters based on the dimensions of the variable
  – Supports both direct dimensions and coordinate variables
• Post-filter:
  – Filters based on the variable value
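The split between the two filters can be sketched as follows. This is an illustrative reading, not the prototype's code. The pre-filter turns dimension bounds into index ranges using only metadata, so data outside those ranges is never read. The post-filter then tests the values that were read. The function names, the tiny `pressure` array and the value predicate are all hypothetical. Coordinate variables are not modeled.

```python
def prefilter(dim_sizes, dim_bounds):
    """Turn per-dimension inclusive index bounds into ranges, clipped to metadata.

    dim_sizes:  e.g. {"cells": 4, "layers": 3}, as read from a file header
    dim_bounds: e.g. {"cells": (0, 1)}, as derived from the WHERE clause
    """
    ranges = {}
    for name, size in dim_sizes.items():
        lo, hi = dim_bounds.get(name, (0, size - 1))
        ranges[name] = range(max(lo, 0), min(hi, size - 1) + 1)
    return ranges


def read_subset(data, ranges):
    """Read only the selected cells x layers block (stand-in for a subset read)."""
    return [(c, k, data[c][k]) for c in ranges["cells"] for k in ranges["layers"]]


def postfilter(rows, predicate):
    """Keep rows whose variable value satisfies the predicate."""
    return [row for row in rows if predicate(row[2])]


# Hypothetical pressure variable with 4 cells x 3 layers.
pressure = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
sizes = {"cells": 4, "layers": 3}

# Roughly: SELECT pressure FROM dataset
#          WHERE cells<=1 AND layers<2 AND pressure>10
ranges = prefilter(sizes, {"cells": (0, 1), "layers": (0, 1)})
rows = postfilter(read_subset(pressure, ranges), lambda v: v > 10)
```

Only 4 of the 12 values are touched by `read_subset`, which is the point of filtering on dimensions first.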

Page 21: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Query Partition

• Partition the current query into several sub-queries and assign each sub-query to a slave process.
• Two partition criteria
  – Consider the contiguity of the memory
  – Consider data aggregation (future)
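One plausible reading of the contiguity criterion, offered as an assumption rather than the prototype's actual strategy, is to split the index range of one dimension into contiguous, near-equal sub-ranges so that each slave process reads one contiguous block. `partition_range` is a hypothetical helper.

```python
def partition_range(start, stop, num_procs):
    """Split [start, stop) into up to num_procs contiguous, near-equal sub-ranges."""
    total = stop - start
    base, extra = divmod(total, num_procs)
    parts = []
    lo = start
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # first `extra` parts get one more
        if size == 0:
            continue                           # more processes than elements
        parts.append((lo, lo + size))
        lo += size
    return parts


# Hypothetical: 10 index values along one dimension, 4 slave processes.
subqueries = partition_range(0, 10, 4)
```

Each `(lo, hi)` pair would become the bound of one sub-query along the partitioned dimension.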

Page 22: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Experiment Setup

• Application:
  – Global Cloud Resolving Model and Data (GCRM)
• Environment:
  – Glenn System in Ohio Supercomputer Center

Page 23: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

SQL queries

No.    Description                                                                       Percent
SQL 1  SELECT pressure FROM dataset;                                                     100%
SQL 2  SELECT pressure FROM dataset WHERE cells<=20481;                                  50%
SQL 3  SELECT pressure FROM dataset WHERE cells>20481 AND layers>330;                    25%
SQL 4  SELECT pressure FROM dataset WHERE cells<=20481 AND layers<250;                   10%
SQL 5  SELECT pressure FROM dataset WHERE cells<=20481 AND time<=781710 AND layers<250;  1%

Page 24: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds


Scalability with different data size

• 8 processes

• Execution time scaled almost linearly within each query

Page 25: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Time improvement for using pre-filter

• 4 processes
• SQL5 (queries only 1% of the data)
• The pre-filter efficiently decreases the query size, improving performance.

Page 26: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Scalability with Increasing No. of Sources

• 4G dataset
• SQL1 (full scan of the data table)
• Execution time scaled almost linearly

Page 27: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Data Management: Continuing Work

• Similar Prototype with HDF5 under Implementation
• Consider processing, not just subsetting/aggregation
  – Map-Reduce like Processing for NetCDF/HDF5 datasets?
• Consider Format Conversion for Existing Tools

Page 28: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Outline

• MATE-EC2: Middleware for Data-Intensive Computing on EC2
  – Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
  – Target NetCDF and HDF5
• Accelerating Data Mining Computations
• Resource Allocation Problems on Clouds

Page 29: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

System for Mapping to Heterogeneous Configurations

[Figure: system overview; labels as extracted]

• User Input (Application Developer): Simple C code with annotations
• Compilation Phase: Code Generator
  – Multi-core Middleware API
  – GPU Code for CUDA
• Run-time System
  – Worker Thread Creation and Management
  – Map Computation to CPU and GPU
  – Dynamic Work Distribution

Page 30: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

K-Means on GPU + Multi-Core CPUs

Page 31: Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds

Summary

• Dataset Sizes are Increasing
• Clouds add many challenges
• Many challenges in data processing on clouds