Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

CCGrid, 2012

Yu Su and Gagan Agrawal

Department of Computer Science and Engineering

The Ohio State University

CCGrid 2012, Ottawa, Canada

Outline

• Motivation and Introduction
• Background
• System Overview
• Experiment
• Conclusion

Motivation

• Science is becoming increasingly data driven
• Strong desire for efficient data analysis
• Challenges
– Data sizes grow rapidly
– Slow I/O and network bandwidth
• An example
– Different kinds of subsetting requests
– Different scientific data formats

An Example

• GCRM (Global Cloud Resolving Model)
– A global atmospheric circulation model

Parameter                       Value
Current Grid Cell Size          4 km
Number of Cells                 3 billion
Number of Layers                > 100
Time Step                       10 seconds
Data Generation Speed           100 TB per day
Future Grid Cell Size           1 km
Future Data Generation Speed    6.4 PB per day
Network Speed                   10 GB per sec

Moving one day of future output (6.4 PB) over a 10 GB per sec network takes about 7.4 days!
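The 7.4 days figure follows from dividing the future daily output by the network speed. A quick check, assuming decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes); the function name is only illustrative:

```python
# Back-of-the-envelope check of the "7.4 days" figure.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 GB = 10**9 bytes).
PB = 10**15
GB = 10**9
SECONDS_PER_DAY = 24 * 60 * 60

def transfer_days(data_bytes, bandwidth_bytes_per_sec):
    """Days needed to move data_bytes at a sustained bandwidth."""
    return data_bytes / bandwidth_bytes_per_sec / SECONDS_PER_DAY

future_daily_output = 6.4 * PB   # one day of future GCRM output
network_speed = 10 * GB          # bytes per second
days = transfer_days(future_daily_output, network_speed)
print(f"{days:.1f} days")  # 7.4 days
```

So a single day of simulation output would take about a week to ship to a client, which is the case for subsetting and aggregating on the server side.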

Client-side vs. Server-side subsetting and aggregation

Simple Request

Advanced Request

Data Virtualization

• Support SQL queries over scientific datasets
– Standard
– Flexible
• Keep data in native format (e.g., NetCDF, HDF5)
• Compare with other scientific data management tools
– SciDB: support for data arrays in parallel
– OPeNDAP: no flexible subsetting and aggregation

Our Approach

• User-defined subsetting and aggregations
– Subsetting: Dimensions, Coordinates, Variables
– Aggregation: SUM, AVG, COUNT, MAX, MIN
• Support NetCDF data format
– Developed by UCAR
– Widely used in climate simulation
• Parallel Data Access
– Data Partition Strategy
– Different Parallel Levels

Background - NetCDF

netcdf mynetcdf {
dimensions:
    X = 4;
    Y = 4;
    Time = UNLIMITED;
variables:
    float X(X);
    float Y(Y);
    int Time(Time);
    float Temperature(Time, Y, X);
    Temperature:unit = "Celsius";
data:
    X = 10, 20, 30, 40;
    Y = 110, 120, 130, 140;
    Time = 31, 59, 90;
    Temperature =
        111, 211, 311, 411, 121, 221, 321, 421, 131, 231, 331, 431, 141, 241, 341, 441,
        112, 212, 312, 412, 122, 222, 322, 422, 132, 232, 332, 432, 142, 242, 342, 442,
        113, 213, 313, 413, 123, 223, 323, 423, 133, 233, 333, 433, 143, 243, 343, 443;
}

[Figure: Temperature drawn as a 3-D array with axes X (1 to 4), Y (1 to 4), and Time (1 to 3). Labels: "Metadata" and "Actual values stored in a multi-dimensional array".]
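The data values in the example encode their own position (111 is X=1, Y=1, Time=1; 211 is X=2, Y=1, Time=1), which makes the storage order easy to see: the last dimension, X, varies fastest. A small sketch of that row-major layout in plain Python; flat_offset is an illustrative helper, not part of any NetCDF library:

```python
# Shape of Temperature(Time, Y, X) in the example: 3 x 4 x 4.
N_TIME, N_Y, N_X = 3, 4, 4

# Rebuild the flat data list from the slide. Each value encodes its own
# 1-based position as 100*x + 10*y + t, with X varying fastest.
flat = [100 * (x + 1) + 10 * (y + 1) + (t + 1)
        for t in range(N_TIME)
        for y in range(N_Y)
        for x in range(N_X)]

def flat_offset(t, y, x):
    """Row-major offset of Temperature[t][y][x] (0-based indices)."""
    return (t * N_Y + y) * N_X + x

print(flat[:4])                     # [111, 211, 311, 411]
print(flat[flat_offset(2, 3, 0)])   # 143
```

A subsetting request on dimensions therefore maps to contiguous or strided ranges of this flat array, which is what makes dimension-based subsetting cheap in terms of disk access.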

System Architecture

• Parse the SQL expression
• Parse the metadata file
– Physical Metadata
– Logical Metadata
• Generate Query Request
• Partition Criteria
– Subsetting: Disk Access
– Aggregation: Data Transfer
• Read Data
• Post-filter Data
• Local Data Aggregation
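One way to read the pipeline above: conditions on dimensions and coordinates can be turned into index ranges before any data is read, while conditions on the variable's own values can only be applied after reading (the post-filter step). A minimal sketch under that assumption, reusing the coordinate values from the mynetcdf example; the function names and the SQL text are hypothetical, not the system's actual interface:

```python
from bisect import bisect_left, bisect_right

# Coordinate variables from the mynetcdf example.
COORDS = {
    "Time": [31, 59, 90],
    "Y": [110, 120, 130, 140],
    "X": [10, 20, 30, 40],
}

def coord_range_to_hyperslab(dim, low, high):
    """Map a coordinate range low <= value <= high to (start, count).

    Assumes the coordinate values are sorted in increasing order.
    """
    values = COORDS[dim]
    start = bisect_left(values, low)
    stop = bisect_right(values, high)
    return start, stop - start

def post_filter(values, predicate):
    """Keep only the values that satisfy a condition on the variable itself."""
    return [v for v in values if predicate(v)]

# Hypothetical query:
#   SELECT Temperature FROM mynetcdf
#   WHERE X >= 20 AND X <= 30 AND Temperature > 300
start, count = coord_range_to_hyperslab("X", 20, 30)
print(start, count)  # 1 2

row = [111, 211, 311, 411]  # Temperature at Time index 0, Y index 0
print(post_filter(row[start:start + count], lambda v: v > 300))  # [311]
```

The (start, count) pair is the usual shape of a NetCDF hyperslab read, so only the selected part of the array has to come off disk.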

Data Aggregation

SQL: SELECT SUM(pressure) FROM GCRM

Slave Processes

Master Process
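A rough sketch of the master/slave pattern on this slide: each slave reduces its own data block to a small partial result, and the master merges the partials, so only a few numbers per slave cross the network instead of the raw data. Threads stand in for the slave processes here, the pressure values are made up, and all function names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def local_aggregate(block):
    """Slave side: reduce one data block to a small partial result."""
    return {"sum": sum(block), "count": len(block),
            "min": min(block), "max": max(block)}

def merge(partials):
    """Master side: combine partial results into the final aggregates."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"SUM": total, "COUNT": count, "AVG": total / count,
            "MIN": min(p["min"] for p in partials),
            "MAX": max(p["max"] for p in partials)}

def aggregate(values, n_workers):
    # Split the values into contiguous blocks, one per worker.
    size = -(-len(values) // n_workers)  # ceiling division
    blocks = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads stand in for the slave processes in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_aggregate, blocks))
    return merge(partials)

pressure = [float(v) for v in range(1, 101)]  # made-up stand-in data
result = aggregate(pressure, n_workers=4)
print(result["SUM"], result["AVG"])  # 5050.0 50.5
```

AVG is derived from the merged SUM and COUNT rather than by averaging the local averages, which stays correct when blocks have different sizes.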

Data Parallelism

Level 1: data file (2 < 12?)

Level 2: variable (5 < 12?)

Level 3: data block (12)
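The numbers on this slide suggest a simple rule: with 12 processes, 2 files or 5 variables are too few units to keep every process busy, so the work is split at the data block level instead. The sketch below is one interpretation of that rule, not the system's documented policy; choose_parallel_level and its parameters are hypothetical:

```python
def choose_parallel_level(n_files, n_variables, n_processes):
    """Pick the coarsest level that offers enough independent units.

    Returns (level, number_of_units).
    """
    if n_files >= n_processes:
        return 1, n_files          # Level 1: one or more files per process
    if n_variables >= n_processes:
        return 2, n_variables      # Level 2: one or more variables per process
    # Level 3: data blocks can be cut to order, so create one per process.
    return 3, n_processes

level, units = choose_parallel_level(n_files=2, n_variables=5, n_processes=12)
print(level, units)  # 3 12
```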

Experiment Goals

• To compare the functionality and performance of our system with OPeNDAP
– OPeNDAP makes local data accessible to remote locations regardless of local storage format
– Data translation mechanism
– No flexible subsetting and aggregation support
• To evaluate the parallel scalability of our system
• To show how aggregation queries reduce the data transfer cost

Compare with OPeNDAP for Type 1 Queries

• Data size: 4 GB
• Input: 50 SQL queries
• Query type: queries only include dimensions
• Objects compared:
– Baseline: NetCDF query time
– Our system without parallelism
– OPeNDAP
• Relative speedup: 2.34 – 3.10

Compare with OPeNDAP for Type 2, Type 3 Queries

• Data size: 4 GB
• Input: 50 SQL queries
• Query type: queries include coordinates and variables
• Objects compared:
– Baseline
– Our system without parallelism
– OPeNDAP + Filter
• Relative speedup: 1.58 – 3.47

Parallel Optimization – Different Data Size

• Data size: 4 GB – 32 GB
• Process number: 1 to 16
• Input: select the whole variable
• Relative speedup:
– 4 procs: 2.17 – 2.87
– 8 procs: 4.06 – 5.54
– 16 procs: 7.23 – 9.33
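To put these speedups in context, they can be converted to parallel efficiency (speedup divided by process count). This assumes the reported speedup is measured against a single-process run of the same system, which the slide does not state explicitly:

```python
# Relative speedup ranges reported on the slide, keyed by process count.
speedups = {4: (2.17, 2.87), 8: (4.06, 5.54), 16: (7.23, 9.33)}

def efficiency(speedup, n_procs):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / n_procs

for procs, (low, high) in speedups.items():
    print(f"{procs:2d} procs: "
          f"{efficiency(low, procs):.0%} to {efficiency(high, procs):.0%}")
```

Under that assumption, efficiency stays roughly between 45% and 72% across the three process counts.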

Parallel Optimization – Different Queries

• Data size: 32 GB
• Process number: 1 to 16
• Input: 100 SQL queries
• Query type: queries include dimensions, coordinates and variables
• Relative speedup:
– 4 procs: 2.20 – 2.92
– 8 procs: 3.95 – 4.21
– 16 procs: 7.25 – 7.74

Data Aggregation - Time

• Data size: 16 GB
• Process number: 1 to 16
• Input: 60 aggregation queries
• Query type:
– Only Agg
– Agg + Group by + Having
– Agg + Group by
• Relative speedup:
– 4 procs: 2.61 – 3.08
– 8 procs: 4.31 – 5.52
– 16 procs: 6.65 – 9.54

Data Aggregation – Data Transfer Amount

• Data size: 16 GB
• Process number: 1 to 16
• Input: 60 aggregation queries
• Query type:
– Only Agg
– Agg + Group by + Having
– Agg + Group by

Conclusion

• Data sizes are growing rapidly
• Goal: find the exact data subset the user specifies
• Data virtualization on top of NetCDF datasets
• Query request partitioning and parallel processing
• A good speedup compared with OPeNDAP

Thanks

Pre-filter Module

[Figure: the module works in three phases (Phase 1, Phase 2, Phase 3). Labels: Dataset Storage Metadata, Dataset Logical Metadata, Request Partition Strategy.]
