Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

CCGrid, 2012 Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2012, Ottawa, Canada



Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Yu Su and Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University

CCGrid 2012, Ottawa, Canada


Outline

• Motivation and Introduction
• Background
• System Overview
• Experiment
• Conclusion


Motivation

• Science is becoming increasingly data-driven
• Strong desire for efficient data analysis
• Challenges
  – Data sizes grow rapidly
  – Slow I/O and network bandwidth
• An example
  – Different kinds of subsetting requests
  – Different scientific data formats


An Example

• GCRM (Global Cloud Resolving Model)
  – A global atmospheric circulation model

Parameter                      Value
Current Grid Cell Size         4 km
Number of Cells                3 billion
Number of Layers               > 100
Time Step                      10 seconds
Data Generation Speed          100 TB per day
Future Grid Cell Size          1 km
Future Data Generation Speed   6.4 PB per day
Network Speed                  10 GB per sec

At 10 GB per sec, transferring one day of future output (6.4 PB) would take about 7.4 days!
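The 7.4-day figure can be reproduced from the last two rows of the table. The following is a minimal sketch of that arithmetic, assuming decimal units (1 PB = 1,000,000 GB) and reading "10 GB per sec" as gigabytes rather than gigabits; the function name `transfer_days` is illustrative, not from the slides.

```python
# Assumptions (not stated on the slide): decimal units and a link speed
# given in gigabytes per second.
GB_PER_PB = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(data_pb: float, network_gb_per_s: float) -> float:
    """Days needed to move `data_pb` petabytes over the given link."""
    seconds = data_pb * GB_PER_PB / network_gb_per_s
    return seconds / SECONDS_PER_DAY


# One day of future GCRM output (6.4 PB) over a 10 GB/s network.
future_days = transfer_days(6.4, 10)
print(f"{future_days:.1f} days")  # prints "7.4 days"
```

Under these assumptions, the data is produced roughly seven times faster than it can be moved, which is the motivation for subsetting and aggregating on the server side.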


Client-side vs. Server-side Subsetting and Aggregation

[Figure labels: Simple Request; Advanced Request]


Data Virtualization

• Support SQL queries over scientific datasets
  – Standard
  – Flexible
• Keep data in its native format (e.g., NetCDF, HDF5)
• Comparison with other scientific data management tools
  – SciDB: support for data arrays in parallel
  – OPeNDAP: no flexible subsetting and aggregation


Our Approach

• User-defined subsetting and aggregation
  – Subsetting: Dimensions, Coordinates, Variables
  – Aggregation: SUM, AVG, COUNT, MAX, MIN
• Support NetCDF data format
  – Developed by UCAR
  – Widely used in climate simulation
• Parallel Data Access
  – Data partition strategy
  – Different levels of parallelism
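To make the three subsetting kinds concrete, here is a small sketch over a toy in-memory grid. It is not the authors' implementation: the data, the `subset` function, and the predicate parameters are all hypothetical, chosen only to show how a condition on a dimension index, on a coordinate value, and on the variable's own value differ.

```python
# Toy stand-in for a 2-D NetCDF variable; all names and values are made up.
X = [10, 20, 30, 40]        # coordinate values along dimension X
Y = [110, 120, 130, 140]    # coordinate values along dimension Y
# TEMPERATURE[j][i] is the value at Y index j, X index i.
TEMPERATURE = [[100 * (i + 1) + 10 * (j + 1) for i in range(len(X))]
               for j in range(len(Y))]


def subset(x_index=None, x_coord=None, value=None):
    """Return (y, x, value) cells that pass every predicate supplied.

    x_index: condition on the dimension index (e.g. "first two columns")
    x_coord: condition on the coordinate value stored in X
    value:   condition on the variable's own value
    """
    cells = []
    for j, y in enumerate(Y):
        for i, x in enumerate(X):
            v = TEMPERATURE[j][i]
            if x_index is not None and not x_index(i):
                continue
            if x_coord is not None and not x_coord(x):
                continue
            if value is not None and not value(v):
                continue
            cells.append((y, x, v))
    return cells


by_dimension = subset(x_index=lambda i: i < 2)    # index-based
by_coordinate = subset(x_coord=lambda x: x >= 30)  # coordinate-based
by_variable = subset(value=lambda v: v > 400)      # value-based

# An aggregation applied to a subset, here MAX over the value-based subset.
max_value = max(v for _, _, v in by_variable)
```

Index and coordinate conditions can be resolved from metadata before any data is read, whereas a condition on the variable's value requires reading the data and filtering afterwards.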


Background - NetCDF

netcdf mynetcdf {
dimensions:
    X = 4;
    Y = 4;
    Time = UNLIMITED;
variables:
    float X(X);
    float Y(Y);
    int Time(Time);
    float Temperature(Time, Y, X);
        Temperature:unit = "Celsius";
data:
    X = 10, 20, 30, 40;
    Y = 110, 120, 130, 140;
    Time = 31, 59, 90;
    Temperature =
        111, 211, 311, 411, 121, 221, 321, 421, 131, 231, 331, 431, 141, 241, 341, 441,
        112, 212, 312, 412, 122, 222, 322, 422, 132, 232, 332, 432, 142, 242, 342, 442,
        113, 213, 313, 413, 123, 223, 323, 423, 133, 233, 333, 433, 143, 243, 343, 443;
}

Figure annotations:
• Metadata: the dimensions and variables sections
• Actual values: the data section, stored in a multi-dimensional array with axes Time (1 to 3), Y (1 to 4), and X (1 to 4)
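The data section lists the 48 Temperature values (3 x 4 x 4) in row-major order, with the last declared dimension (X) varying fastest. The sketch below shows how a (Time, Y, X) index maps to a position in that listing and how a rectangular subset can be picked out. It models only the logical ordering of values, not byte offsets in a NetCDF file, and the helper names `flat_index` and `hyperslab` are illustrative.

```python
NT, NY, NX = 3, 4, 4  # sizes of Time, Y, X as implied by the data section

# Temperature values exactly as listed on the slide, one line per time step.
VALUES = [
    111, 211, 311, 411, 121, 221, 321, 421, 131, 231, 331, 431, 141, 241, 341, 441,
    112, 212, 312, 412, 122, 222, 322, 422, 132, 232, 332, 432, 142, 242, 342, 442,
    113, 213, 313, 413, 123, 223, 323, 423, 133, 233, 333, 433, 143, 243, 343, 443,
]


def flat_index(t: int, y: int, x: int) -> int:
    """Row-major position of the 0-based index (t, y, x)."""
    return (t * NY + y) * NX + x


def hyperslab(t_range, y_range, x_range):
    """Values in the rectangular region given by three index ranges."""
    return [VALUES[flat_index(t, y, x)]
            for t in t_range for y in y_range for x in x_range]


# Third time step, fourth Y, second X (0-based indices 2, 3, 1).
cell = VALUES[flat_index(2, 3, 1)]
# First time step, first two Y rows, middle two X columns.
region = hyperslab(range(0, 1), range(0, 2), range(1, 3))
```

Because each value on the slide encodes its own position (hundreds digit for X, tens for Y, units for Time), the result of any lookup is easy to check by eye.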


System Architecture

Figure labels:
• Parse the SQL expression
• Parse the metadata file
  – Physical Metadata
  – Logical Metadata
• Generate Query Request
• Partition Criteria
  – Subsetting: Disk Access
  – Aggregation: Data Transfer
• Read Data
• Post-filter Data
• Local Data Aggregation


Data Aggregation

SQL: SELECT SUM(pressure) FROM GCRM

Slave Processes

Master Process
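The slide gives only the query and the two process roles. One common way to split such an aggregation, sketched here as an assumption rather than as the authors' implementation, is for each slave to reduce its own block to a small partial result and for the master to merge the partials. The `Partial` type, the function names, and the pressure values are hypothetical, and a plain loop stands in for the separate slave processes.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Per-slave summary, small enough to send to the master cheaply."""
    total: float
    count: int
    minimum: float
    maximum: float


def local_aggregate(block):
    """Work done by one slave on its (non-empty) block of values."""
    return Partial(sum(block), len(block), min(block), max(block))


def merge(partials):
    """Work done by the master: combine partials into final aggregates."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return {
        "SUM": total,
        "COUNT": count,
        "AVG": total / count,  # AVG needs both SUM and COUNT from each slave
        "MIN": min(p.minimum for p in partials),
        "MAX": max(p.maximum for p in partials),
    }


# Made-up pressure values split across three hypothetical slaves.
pressure = [float(v) for v in range(1, 13)]
blocks = [pressure[i::3] for i in range(3)]
partials = [local_aggregate(b) for b in blocks]  # would run in parallel
result = merge(partials)
```

In this pattern each slave sends a handful of numbers instead of its whole block, which is where the reduction in data transfer comes from.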


Data Parallelism

Level 1: data file (2 < 12?)
Level 2: variable (5 < 12?)
Level 3: data block (12)
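One reading of the numbers on this slide is that, with 12 processes, neither 2 data files nor 5 variables provide enough units of work, so the request is split into 12 data blocks. The sketch below encodes that reading; it is an interpretation, and `choose_level`, `split_range`, and their parameters are hypothetical names rather than anything confirmed by the slides.

```python
def choose_level(num_procs: int, num_files: int, num_variables: int) -> str:
    """Pick the coarsest level that still gives every process some work."""
    if num_files >= num_procs:
        return "file"
    if num_variables >= num_procs:
        return "variable"
    return "block"


def split_range(length: int, parts: int):
    """Split [0, length) into `parts` contiguous, near-equal (start, end) ranges."""
    base, extra = divmod(length, parts)
    ranges, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges


# The slide's example: 2 files, 5 variables, 12 processes.
level = choose_level(num_procs=12, num_files=2, num_variables=5)
# Block-level partitioning of a hypothetical 48-element variable.
blocks = split_range(48, 12)
```

Coarser levels need less coordination, so a scheme like this falls back to block-level partitioning only when files and variables are too few to keep all processes busy.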


Experiment Goals

• To compare the functionality and performance of our system with OPeNDAP
  – OPeNDAP makes local data accessible to remote locations regardless of local storage format
  – Data translation mechanism
  – No flexible subsetting and aggregation support
• To evaluate the parallel scalability of our system
• To show how aggregation queries reduce the data transfer cost


Compare with OPeNDAP for Type 1 Queries

• Data size: 4 GB
• Input: 50 SQL queries
• Query type: queries that include only dimensions
• Systems compared:
  – Baseline: NetCDF query time
  – Our system without parallelism
  – OPeNDAP
• Relative speedup: 2.34 – 3.10


Compare with OPeNDAP for Type 2, Type 3 Queries

• Data size: 4 GB
• Input: 50 SQL queries
• Query type: queries that include coordinates and variables
• Systems compared:
  – Baseline
  – Our system without parallelism
  – OPeNDAP + Filter
• Relative speedup: 1.58 – 3.47


Parallel Optimization – Different Data Size

• Data size: 4 GB – 32 GB
• Number of processes: 1 to 16
• Input: select the whole variable
• Relative speedup:
  – 4 procs: 2.17 – 2.87
  – 8 procs: 4.06 – 5.54
  – 16 procs: 7.23 – 9.33
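The reported speedups can be turned into parallel efficiency (speedup divided by process count) to see how well the added processes are used. This is a small sketch using only the ranges quoted above; the names `REPORTED_SPEEDUP` and `efficiency` are illustrative.

```python
# Speedup ranges (low, high) per process count, as reported on the slide.
REPORTED_SPEEDUP = {4: (2.17, 2.87), 8: (4.06, 5.54), 16: (7.23, 9.33)}


def efficiency(speedup: float, procs: int) -> float:
    """Parallel efficiency: 1.0 would mean perfectly linear scaling."""
    return speedup / procs


for procs, (low, high) in REPORTED_SPEEDUP.items():
    print(f"{procs:>2} procs: "
          f"{efficiency(low, procs):.0%} to {efficiency(high, procs):.0%}")
```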


Parallel Optimization – Different Queries

• Data size: 32 GB
• Number of processes: 1 to 16
• Input: 100 SQL queries
• Query type: queries that include dimensions, coordinates and variables
• Relative speedup:
  – 4 procs: 2.20 – 2.92
  – 8 procs: 3.95 – 4.21
  – 16 procs: 7.25 – 7.74


Data Aggregation - Time

• Data size: 16 GB
• Number of processes: 1 to 16
• Input: 60 aggregation queries
• Query type:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by
• Relative speedup:
  – 4 procs: 2.61 – 3.08
  – 8 procs: 4.31 – 5.52
  – 16 procs: 6.65 – 9.54


Data Aggregation – Data Transfer Amount

• Data size: 16 GB
• Number of processes: 1 to 16
• Input: 60 aggregation queries
• Query type:
  – Only Agg
  – Agg + Group by + Having
  – Agg + Group by


Conclusion

• Data sizes are growing rapidly
• Goal: find the exact data subset the user specifies
• Data virtualization on top of NetCDF datasets
• Query request partitioning and parallel processing
• A good speedup compared with OPeNDAP


Thanks


Pre-filter Module

Figure labels:
• Dataset Storage Metadata
• Dataset Logical Metadata
• Request Partition Strategy
• Phase 1, Phase 2, Phase 3