Scalable Exploratory Data Mining of Distributed Geoscientific Data

Authors: E.C. Shek, R.R. Muntz, E. Mesrobian and K. Ng

by Sona Srinivasan

Outline

• Introduction

• Geoscientific Data Modeling

• Geoscientific Algebraic Operators

• Physical Data Model

• Parallel Query Execution

• Automatic Query Execution

• Heterogeneous Distributed Data Access

• Implementations and Experiences

• Conclusion

• References

Introduction

• Geoscience studies produce a tremendous amount of raw data

• Analysis involves extracting interesting geoscientific phenomena that are not directly observed in the raw datasets

• Example: cyclone tracks - trajectories traveled by low-pressure areas over time, which can be extracted from a sea-level pressure dataset

• Both data mining in business applications and geoscientific feature extraction involve sieving through large volumes of isolated events and data to locate salient patterns

• Feature extraction is treated as a database query processing problem in order to take advantage of automatic query optimization and parallelization techniques

• Conquest - an extensible parallel geoscientific query processing system

Geoscientific Data Model

Example Geographic Data Field

Geoscientific Data Model

• A field - associates parameter values with cells in a multidimensional coordinate space

• Cells can be of various geometric object types

• Coordinate Attributes - determine the type of the cells and the coordinate space they lie in

• Values for the cells lie in a multidimensional variable space

• Variable Attributes - the type of values associated with a cell in the coordinate space

• A cell record - a cell and the variable value associated with it

• Cell coverage - the set of distinct cells in the coordinate space for which variable values are recorded
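
The definitions above can be sketched in a few lines of Python. This is only an illustration, not Conquest's actual representation: the class names `CellRecord` and `Field`, the restriction to point cells, and the sample sea-level pressure values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CellRecord:
    """A cell plus the variable value recorded for it (hypothetical layout)."""
    cell: Tuple[float, ...]   # point in coordinate space, e.g. (lat, lon, time)
    value: Tuple[float, ...]  # point in variable space, e.g. (pressure_mb,)


class Field:
    """Associates variable values with cells in a coordinate space."""

    def __init__(self, coordinate_attrs, variable_attrs, records):
        self.coordinate_attrs = tuple(coordinate_attrs)  # names of the cell dimensions
        self.variable_attrs = tuple(variable_attrs)      # names of the value dimensions
        self.records = list(records)

    def cell_coverage(self):
        """The set of distinct cells for which a value is recorded."""
        return {r.cell for r in self.records}


# Made-up sea-level pressure samples: two grid points at time 0, one at time 1.
slp = Field(
    coordinate_attrs=("lat", "lon", "time"),
    variable_attrs=("pressure_mb",),
    records=[
        CellRecord((30.0, -120.0, 0), (1012.5,)),
        CellRecord((30.0, -117.5, 0), (1009.0,)),
        CellRecord((30.0, -120.0, 1), (1011.0,)),
    ],
)
coverage = slp.cell_coverage()
```

Here the coordinate space is three-dimensional, the variable space is one-dimensional, and the cell coverage contains three distinct cells.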

Geoscientific Algebraic Operators

• A base set of general-purpose logical operators for manipulating field data. Users may introduce additional operators based on application-specific algorithms

• Set-Oriented Relational Operators - Selection, Projection, Cartesian Product, Union, Intersection, Set Difference, Join

• Sequence-Oriented Operators

• Grouping Operators - Nest and Unnest

• Space Conversion Operators
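
To make a few of the operators listed above concrete, here is a minimal sketch of Selection, Nest and Unnest over cell records stored as plain dictionaries. The function names, the `nested` key, and the sample records are assumptions for illustration; they are not Conquest's operator interface.

```python
from collections import defaultdict


def select(records, predicate):
    """Selection: keep the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def nest(records, key_attrs):
    """Nest: group records that share the same values for key_attrs."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[a] for a in key_attrs)
        rest = {a: v for a, v in r.items() if a not in key_attrs}
        groups[key].append(rest)
    return [dict(zip(key_attrs, key), nested=members)
            for key, members in groups.items()]


def unnest(nested_records, key_attrs):
    """Unnest: flatten grouped records back into one record per member."""
    return [{**{a: n[a] for a in key_attrs}, **m}
            for n in nested_records
            for m in n["nested"]]


# Made-up sea-level pressure records.
records = [
    {"time": 0, "lat": 30.0, "pressure": 1012.5},
    {"time": 0, "lat": 32.5, "pressure": 1009.0},
    {"time": 1, "lat": 30.0, "pressure": 1011.0},
]
low = select(records, lambda r: r["pressure"] < 1012.0)
by_time = nest(records, ["time"])
flat = unnest(by_time, ["time"])
```

Nesting by `time` yields one group per time step, and unnesting the result recovers the original records, which is the round trip one would expect from a grouping operator pair.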

Physical Data Model

Nesting of a Data Field

Parallel Query Execution

• Parallelization techniques are used to remove bottlenecks in I/O and computation and to improve query performance:

    • Pipelined Processing or Dataflow Parallelism

    • Partitioning or Intra-Operator Parallelism

    • Multicasting
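
The three techniques named above can be imitated in a small Python sketch. It only mirrors the data flow: generators stand in for pipelined operators (a real system would run the stages concurrently), a thread pool stands in for partitioned execution, and `itertools.tee` stands in for multicasting one stream to two consumers. All function names and data values are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import tee


def scan(values):
    """Source operator: emits one object at a time."""
    for v in values:
        yield v


def smooth(stream):
    """Two-point running mean; consumes objects as they arrive."""
    prev = None
    for v in stream:
        if prev is not None:
            yield (prev + v) / 2
        prev = v


def threshold(stream, limit):
    """Keeps only the values below the limit."""
    for v in stream:
        if v < limit:
            yield v


def partitioned_count(values, limit, parts=2):
    """Intra-operator parallelism: count matches per partition, then merge."""
    chunks = [values[i::parts] for i in range(parts)]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        counts = pool.map(lambda c: sum(1 for v in c if v < limit), chunks)
        return sum(counts)


data = [1012.0, 1008.0, 1006.0, 1011.0]

# Pipelining: each object flows through scan -> smooth -> threshold.
pipelined = list(threshold(smooth(scan(data)), 1010.0))

# Partitioning: the same stateless test applied to two partitions.
low_count = partitioned_count(data, 1010.0)

# Multicasting: one scan feeds two independent consumers.
to_min, to_max = tee(scan(data))
lowest, highest = min(to_min), max(to_max)
```

Partitioning is straightforward here because the counting step keeps no state across objects; operators that do keep state need more care when they are split.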

Query Parallelization

• Window of Relevance - the maximum length of time between the arrival of an object and the time it ceases to have an effect on the execution state of the operator

    • Instantaneous

    • Known

    • Random but Bounded

    • Fixed Windows
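
As a rough illustration of the window of relevance, the sketch below contrasts a stateless selection, where an object affects nothing after it is processed, with a moving mean over the last `w` objects, where an object stops mattering after `w - 1` further arrivals. Measuring the window in object arrivals rather than clock time is an assumption made here, as are the function names. The last function shows one plausible use of a bounded window that the slide does not spell out: a stream can be cut into partitions that overlap by `w - 1` objects and processed independently.

```python
from collections import deque


def select_low(stream, limit):
    """Instantaneous window: no state survives between objects."""
    for v in stream:
        if v < limit:
            yield v


def moving_mean(stream, w):
    """Fixed window: state is exactly the last w objects."""
    window = deque(maxlen=w)
    for v in stream:
        window.append(v)
        if len(window) == w:
            yield sum(window) / w


def partitioned_moving_mean(values, w, chunk):
    """Process chunks independently, each extended back by w - 1 objects."""
    out = []
    for start in range(0, len(values), chunk):
        lo = max(0, start - (w - 1))  # overlap covers the window of relevance
        out.extend(moving_mean(values[lo:start + chunk], w))
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sequential = list(moving_mean(values, 3))
partitioned = partitioned_moving_mean(values, 3, chunk=3)
```

Because the overlap equals the window of relevance, the partitioned result matches the sequential one.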

Heterogeneous Distributed Data Access

• Only a small percentage of data is analyzed, due to limited storage and bandwidth and the difficulty of integrating distributed datasets

• Conquest supports access to datasets both through a distributed object interface and through repository-specific scanner operators, because accessing data only through distributed objects eliminates opportunities to use the query capabilities of data repositories to optimize query evaluation
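
The trade-off described above can be illustrated with a toy repository. Everything in this sketch is hypothetical: the `Repository` class, its `fetch_all` and `range_query` methods, and the transfer counter are stand-ins used to show why pushing a selection into a repository can move less data than filtering objects one at a time on the client side.

```python
class Repository:
    """Hypothetical repository of (time, value) records with a native range query."""

    def __init__(self, records):
        self._records = sorted(records)
        self.transferred = 0  # number of records shipped to the client

    def fetch_all(self):
        """Generic object-style access: every record is shipped."""
        for r in self._records:
            self.transferred += 1
            yield r

    def range_query(self, t_min, t_max):
        """Repository-side selection: only matching records are shipped."""
        for r in self._records:
            if t_min <= r[0] <= t_max:
                self.transferred += 1
                yield r


def scan_via_objects(repo, t_min, t_max):
    """Filter on the client after fetching everything."""
    return [r for r in repo.fetch_all() if t_min <= r[0] <= t_max]


def scan_via_scanner(repo, t_min, t_max):
    """Repository-specific scanner: push the selection down to the repository."""
    return list(repo.range_query(t_min, t_max))


records = [(t, 1000.0 + t) for t in range(10)]

generic = Repository(records)
result_generic = scan_via_objects(generic, 2, 4)

pushed = Repository(records)
result_pushed = scan_via_scanner(pushed, 2, 4)
```

Both paths return the same three records, but the generic path ships all ten records while the scanner path ships only three.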

Implementations and Experiences

• Ported to run on the IBM SP1 and SP2 and the Intel Paragon

• Has been used for the past five years for exploratory data analysis and data mining of spatio-temporal phenomena produced at UCLA, and also for extraction and analysis of cyclonic activity, blocking features, and oceanic warm pools

Number of upward wave propagation trajectories between 500mb and 50mb levels extracted per year

Implementations … (Contd.)

Number of upward wave propagation trajectories between 500mb and 50mb at different latitudes

Conclusion

• Conquest - a geoscientific query processing system that applies distributed and parallel database query processing techniques to handle computationally expensive data mining queries on massive datasets

• Helps analyze large volumes of data to extract the necessary information

• Query optimization emphasizes parallelization and optimal data access

• Future Work - the system is currently being integrated as part of a larger environment

References

• E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng, "Scalable Exploratory Data Mining of Distributed Geoscientific Data", KDD, 1996

• E.C. Shek, E. Mesrobian, and R.R. Muntz, "On Heterogeneous Distributed Geoscientific Query Processing", Feb. 1996

• F. Fabbrocino, E.C. Shek, and R.R. Muntz, "The Design and Implementation of the Conquest Query Execution Environment", July 1997

• E. Mesrobian et al., "Exploratory Data Mining and Analysis Using Conquest", May 1995