Integrated Data Analysis and Visualization Group 5 Report DOE Data Management Workshop May 24-26,...

Integrated Data Analysis and Visualization

Group 5 Report

DOE Data Management Workshop

May 24-26, 2004

Chicago, IL

Group 5 Participants

Wes Bethel, LBNL

John Blondin, NC State

Julian Bunn, Caltech

George Chin, PNNL

Chris Ding, LBNL

Irwin Gaines, FNAL

Chandrika Kamath, LLNL

James Kohl, ORNL

George Michaels, PNNL

Habib Najm, SNL

John van Rosendale, DOE/HQ

Nagiza Samatova, ORNL

David Schissel, General Atomics

Gary Strand, NCAR

Todd Smith, Geospiza, Inc.

Outline

What is integrated data analysis and visualization? Why do we care?

Data Complexity & Implications

Applications-driven capabilities

Technology gaps to address these capabilities

General Recommendations

Conclusions

The Curse of Ultrascale Computation and High-throughput Experimentation

Computational and experimental advances enable the capture of complex natural phenomena on a scale not possible just a few years ago. With this opportunity comes a new problem: the petabyte quantities of data produced. As a result, answers to fundamental questions about the nature of the universe largely remain hidden in these data.

How can we enable scientists to perform analyses and visualizations of these raw data to extract knowledge?

Tony’s Scenario

Data flow: Data Select, Data Access, Correlate, Render, Display

Select (density, pressure) From astro-data Where (step=101) (x-velocity>Y);

Sample (density, pressure)

Visualize scatter plot

[Figure: the scenario mapped onto a layered architecture. Layer labels: Scientific Process Automation Layer; Workflow Design & Execution; Data Mining & Analysis Layer; Storage Efficient Access Layer; Hardware, OS, and MSS (HPSS). Components and calls shown: Select Data, Take Sample, Run analysis, Run viz filter, Analysis Tool, VIZ Tool, Use Bitmap (condition), Get variables (var-names, ranges), Read Data (buffer-name), Write Data, Parallel HDF, PVFS, Bitmap Index, Selection.]
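As a rough illustration of the scenario above, the following minimal sketch mimics the Select and Sample steps on a small in-memory table. The record layout, the field names, the value 0.5 standing in for Y, and the helpers `make_astro_data`, `select_data`, and `take_sample` are all assumptions made for illustration. The storage stack named on the slide (Parallel HDF, PVFS, bitmap index) is not modeled.

```python
import random

def make_astro_data(n_cells, steps, seed=0):
    """Hypothetical stand-in for "astro-data": one dict per grid cell per step."""
    rng = random.Random(seed)
    return [
        {"step": step,
         "density": rng.uniform(0.1, 10.0),
         "pressure": rng.uniform(0.1, 5.0),
         "x_velocity": rng.uniform(-1.0, 1.0)}
        for step in steps
        for _ in range(n_cells)
    ]

def select_data(records, step, min_x_velocity):
    """Return (density, pressure) pairs where step matches and x_velocity exceeds the threshold."""
    return [(r["density"], r["pressure"])
            for r in records
            if r["step"] == step and r["x_velocity"] > min_x_velocity]

def take_sample(pairs, k, seed=0):
    """Draw at most k pairs uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))

records = make_astro_data(n_cells=1000, steps=[100, 101, 102])
selected = select_data(records, step=101, min_x_velocity=0.5)  # 0.5 stands in for Y
sample = take_sample(selected, k=50)
# `sample` holds the points that a density-vs-pressure scatter plot would draw.
```

The point of the sketch is only the ordering of the steps: the selection narrows the data before sampling, so the visualization step receives a small set of points rather than the raw data.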

Integration Must Happen at Multiple Levels

To enable end-to-end system performance, the 80-20 rule, and novel discoveries, integration must happen:

Between and within data flow levels (Workflows, Analysis & Viz, Access & Movement)

Across geographically distributed resources

Across multiple data scales and resolutions

Challenge of Data Massiveness: Drinking from the Firehose

Climate: now 20-40TB per simulated year; in 5 yrs 100TB/yr 5-10PB/yr

Fusion: now 100Mbytes/15min; in 5 yrs 1000Mbytes/2 min with realtime comparison with running experiment, 500Mbits/sec guaranteed (QoS)

High Energy Physics: now 1-10PB data stored, Gigabit net; in 5 yrs 100PB data, 100Gbits/sec net

Chemistry (Combustion and Nanostructures): now 10-30TB data; in 5 yrs 30-100TB data, 10Gbits/sec multicast

Astrophysics: now and in 5 yrs, can soak up anything you build!

(John Sharf’s stats, LBL)

Most of this Data will NEVER Be Touched with the current trends in technology

The amount of data stored online quadruples every 18 months, while processing power ‘only’ doubles every 18 months.

Unless the number of processors increases unrealistically rapidly, most of this data will never be touched.

Storage device capacity doubles every 9 months, while memory capacity doubles every 18 months (Moore’s law).

Even if these growth rates converge, memory latency is and will remain the rate-limiting step in data-intensive computations.

Operating systems struggle to handle files larger than a few GBs.

OS constraints and memory capacity determine data set file size and fragmentation
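The first claim on this slide can be checked with a few lines of arithmetic. The sketch below takes the slide's two rates at face value (stored data quadruples and processing power doubles every 18 months) and computes how much data there is per unit of processing power over time. The function names are illustrative assumptions.

```python
def growth(months, factor, period_months=18):
    """Multiplicative growth after `months` for a quantity that grows by `factor` every period."""
    return factor ** (months / period_months)

def data_per_unit_processing(months):
    """Stored data divided by processing power, normalized to 1 at month 0."""
    return growth(months, 4) / growth(months, 2)

for years in (0, 3, 6):
    print(years, "years:", data_per_unit_processing(12 * years))
```

Under these rates the ratio itself doubles every 18 months, so it is 4 after three years and 16 after six, which is the sense in which most of the data will never be touched.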

Challenge of Breaking the Algorithmic Complexity Bottleneck

Algorithm complexity vs. data size, n:

Data size, n    O(n)          O(n log(n))    O(n^2)
100B            10^-10 sec.   10^-10 sec.    10^-8 sec.
10KB            10^-8 sec.    10^-8 sec.     10^-4 sec.
1MB             10^-6 sec.    10^-5 sec.     1 sec.
100MB           10^-4 sec.    10^-3 sec.     3 hrs.
10GB            10^-2 sec.    0.1 sec.       3 yrs.

Algorithmic Complexity:

Calculate means O(n)

Calculate FFT O(n log(n))

Calculate SVD O(r • c)

Clustering algorithms O(n^2)

For illustration, the chart assumes 10^-12 sec. calculation time per data point

MS Data Rates: 100’s GB, 10’s TB/day (2004); 1.0’s PB/day (2008)
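The figures in the complexity chart follow from the stated cost of 10^-12 sec. per data point. The sketch below recomputes them. It assumes, as the chart appears to, that the data size in bytes is used directly as n, and that the logarithm is base 10 (an assumption chosen because it reproduces the chart's 0.1 sec. entry for 10GB). The function and constant names are illustrative.

```python
import math

SECONDS_PER_POINT = 1e-12  # per-data-point cost assumed by the chart

def estimated_seconds(n, complexity):
    """Estimated run time for n data points under the given complexity class."""
    if complexity == "n":
        operations = n
    elif complexity == "n log n":
        operations = n * math.log10(n)  # base 10 is an assumption, see the lead-in
    elif complexity == "n^2":
        operations = n * n
    else:
        raise ValueError(f"unknown complexity class: {complexity}")
    return operations * SECONDS_PER_POINT

SIZES = {"100B": 10**2, "10KB": 10**4, "1MB": 10**6, "100MB": 10**8, "10GB": 10**10}

for label, n in SIZES.items():
    row = [estimated_seconds(n, c) for c in ("n", "n log n", "n^2")]
    print(label, "  ".join(f"{t:.1e} s" for t in row))
```

For the quadratic case at 10GB this gives 10^8 seconds, a little over three years, while the linear case at the same size stays at a hundredth of a second.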

Massive Data Sets are Naturally Distributed BUT Effectively Immoveable (Skillicorn, 2001)

Bandwidth is increasing but not at the same rate as stored data

There are some parts of the world with high available bandwidth BUT there are enough bottlenecks that high effective bandwidth is unachievable across heterogeneous networks

Latency for transmission at global distances is significant

Most of this latency is time-of-flight and so will not be reduced by technology

Data has a property similar to inertia:

It is cheap to store and cheap to keep moving, but the transitions between these two states are expensive in time and hardware.

Legal and political restrictions

Social restrictions

Data owners may allow access to their data, but only while retaining control of it

• Should we move computations to the data, rather than data to the computations?

• Should we cache the data close to analysis and viz.?

• Should we be smarter about reducing the size of the data while having the same or richer information content?
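To make the trade-off behind these questions concrete, here is a back-of-the-envelope sketch comparing the time to ship raw data with the time to ship a reduced result. The data sizes, the 1 Gbit/s link, and the 20,000 km path are hypothetical numbers chosen for illustration, and the function names are assumptions. The time-of-flight bound uses the speed of light in vacuum, so real networks are slower.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458

def transfer_seconds(data_bytes, bandwidth_bits_per_sec):
    """Time to push data_bytes through a link, ignoring latency and protocol overhead."""
    return data_bytes * 8 / bandwidth_bits_per_sec

def time_of_flight_seconds(distance_m):
    """Lower bound on one-way latency: no signal travels faster than light."""
    return distance_m / SPEED_OF_LIGHT_M_PER_S

TERABYTE = 10**12
MEGABYTE = 10**6
LINK = 1e9  # hypothetical 1 Gbit/s effective bandwidth

raw = transfer_seconds(1 * TERABYTE, LINK)       # move the data to the computation
summary = transfer_seconds(10 * MEGABYTE, LINK)  # move only a reduced result back
latency = time_of_flight_seconds(20_000_000)     # roughly half the Earth's circumference

print(f"raw data: {raw:.0f} s, summary: {summary:.2f} s, one-way latency >= {latency:.3f} s")
```

With these numbers the raw transfer takes 8000 seconds and the summary 0.08 seconds, which is the argument for running the reduction where the data lives. The latency floor of about 67 ms is fixed by distance, not by technology.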

Challenge of High Dimensionality, Multi-Scale and Multi-Resolution

[Figure: dimensions of an experiment. Labels: Cells & Tissues, Time, Genetics, Environments, Genetic Manipulations, Phenotypes, Populations, Treatments. (From G. Michaels, PNNL)]

The experiment paradigm is changing to statistically capture the complexity. We will get maximum value when we explore three or more dimensions/scales in a single experiment.

But multi-scale and multi-resolution analysis and visualization are in their infancy!

Know Our Limits & Be Smart

Obligations are two-sided: CS and Apps

Ultrascale Simulations: Must be smart about which probe combinations to see!

Physical Experiments: Must be smart about probe placement!

Not humanly possible to browse a petabyte of data. Analysis must select views or reduce to quantities of interest rather than push more views past the user.

[Figure: visualization scalability through guidance by analysis of full context. Axes: more data (Megabytes, Gigabytes, Terabytes, Petabytes) against more analysis (No Analysis, Region Selection, Analysis-driven summarization). Annotation: Human Bandwidth Overload?]

Can we browse a petabyte of data?

To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!

Analysis of full context must select views or reduce to quantities of interest in addition to fast rendering of data.
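The 35-day figure above can be reproduced directly. The sketch assumes decimal units (1 petabyte = 10^15 bytes, 1 megabyte = 10^6 bytes); the function name is illustrative.

```python
PETABYTE = 10**15  # decimal units assumed
MEGABYTE = 10**6

def viewing_days(fraction, rate_bytes_per_sec, hours_per_day=8):
    """Working days needed to look at `fraction` of a petabyte at the given rate."""
    seconds = fraction * PETABYTE / rate_bytes_per_sec
    return seconds / (hours_per_day * 3600)

days = viewing_days(0.01, 10 * MEGABYTE)
print(f"{days:.1f} eight-hour days")
```

One percent of a petabyte is 10 terabytes, which at 10 megabytes per second is a million seconds, or just under 35 eight-hour days.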

Frame of Context Differences Suggest Needs in Hardware and Software for Analysis and Visualization

[Figure: frames of context, in space and time, for the major steps of the space-time simulation scientific discovery process: Simulation, Analysis, Storage, Visualization. (From G. Ostrouchov, ORNL)]

Need hardware and software for full-context analysis and visualization

Arguably, visualization can be the most critical step of a simulation experiment. But it has to be in a full context.

I hear and I forget. I see and I believe. I do Visual Analysis and I understand.

- Confucius (551-479 BC)

But Tony Still Has a Dream – Internet “Plug-ins” for Ultrascale Computing!

ASPECT

Paraview

From Dreams to Achievable Application-driven Capabilities

The first step in Group 5 discussion

Technology Gaps Ranking:

Research & Development -- 3

Hardening Technology -- 2

Deployment & Maintenance -- 1

Representation by Applications:

• Climate

• Biology

• Combustion

• Fusion

• HENP

• Supernova

Capability #1: IDL-like SCALABLE, open source environment (J. Blondin)

High performance-enabling technologies:

Parallel analysis and viz. algorithms (e.g. pVTK, pMatlab, parallel-R) (3-2)

Portable implementation on HPC platforms (2-1)

Hardware accelerated implementations (GPUs, FPGAs) (2-1)

Parallel I/O libraries coupled with analysis and viz (ROMIO+pVTK, pNetCDF+Parallel-R) (2-1)

Information visualization (3-2)

Interoperability-enabling technologies:

Component architectures (CCA) (3-2)

Core data models and data structures unification (3-2)

Structural, semantic and syntactic mediation (3-2)

Scripting environments:

IDL/Matlab-like high-level programming languages (3)

Optimized (parallel, accelerated) functions (core libraries) (3)

Simulation interfaces (3)

Visualization interfaces (information, statistical and scientific visualization) (3-2)

Capability #2: Domain-specific libraries and tools

Same technologies as for Capability #1, plus more

Novel algorithms for domain-specific analysis and visualization:

Feature extraction/selection/tracking (e.g., ICA for climate) (3—2)

New types of data (e.g. trees, networks) (3—2)

Interpolation & transformation (3—2)

Multi-scale/hierarchical features correlation (3)

Novel data models if necessary (3)

Capability #3: “Plug and Play” Analysis and Visualization Envs

Community-specific data model(s) (3)

Standardization that still provides efficiency and flexibility

Community-specific common APIs (3)

Unified yet extensible data structures

Common component architectures (3—2)

“i”-ntegration vs. “I”-ntegration

“i”-ntegration within the same application:

Same set of data structures => brute-force check (at worst)

Same language, execution and control model

Scripting languages (TCL, Python, R)

Run on the same cluster of machines

“I”-ntegration across multiple applications:

Different data structures (unknown for future apps)

Different execution & control models

Different programming languages

Run on different hosts

Component integration strategies should be assessed both within a single higher-level application and between multiple higher-level applications

Data format transformations:

• File to App

• App_X to App_X

• App_X to App_Y

• App to File

Capability #4: Feature (region) detection, extraction, tracking

Efficient and effective data indexing that (3-2):

Supports unstructured data in files (e.g., bitmap indexing extension to AMR)

Supports heterogeneous/non-scalar data (e.g., vector fields, protein sequence, protein function, pathway, network)

Supports on-demand derived data (e.g. F(X)/G(Y)<5: entropy as a function of indexed density and pressure)

Information visualization (3—2)
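To show what bitmap indexing buys for region selection, here is a minimal sketch of a binned bitmap index that answers a two-variable range query with bitwise operations only. It is a toy, not the indexing scheme of any particular library: Python integers are used as bitsets, the bin edges and sample values are made up, and the helper names are assumptions. The extensions listed above (AMR data, non-scalar data, on-demand derived quantities) are not covered.

```python
def build_bitmap_index(values, bin_edges):
    """One bitmap per bin; bit i of a bitmap is set when values[i] falls in that bin."""
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for b in range(len(bin_edges) - 1):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitmaps[b] |= 1 << i
                break
    return bitmaps

def rows_in_bins(bitmaps, first_bin, last_bin):
    """OR together the bitmaps of bins first_bin..last_bin (inclusive)."""
    result = 0
    for b in range(first_bin, last_bin + 1):
        result |= bitmaps[b]
    return result

def bitmap_to_rows(bitmap):
    """List the row numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Made-up sample data, one value per row.
density = [0.5, 2.5, 7.0, 3.1, 9.9, 1.2]
pressure = [1.0, 4.0, 0.2, 3.5, 4.9, 0.1]
edges = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

density_index = build_bitmap_index(density, edges)
pressure_index = build_bitmap_index(pressure, edges)

# Rows with density in [2, 6) AND pressure in [2, 6), found without rescanning the raw values.
hits = rows_in_bins(density_index, 1, 2) & rows_in_bins(pressure_index, 1, 2)
rows = bitmap_to_rows(hits)
print(rows)  # [1, 3]
```

Because the query boundaries here line up with bin edges, the index alone gives the exact answer. When they do not, the rows in the boundary bins would still need a check against the raw values.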

Other Capabilities

Remote, collaborative & interactive analysis and visualization (3-2):

Network-aware analysis and viz

Novel means of hiding latency (e.g. caching via LoCi, view-dependent isosurfaces)

Sensitivity & uncertainty quantification (3)

Streaming analysis & viz (3):

Approx. multi-res. algorithms

Data transformations on streaming data

Annotation & provenance of analysis and visualization results

System-, analysis & viz-, data-level metadata

Verification and validation

Comparative analysis and visualization

Cross-cutting capabilities:

Integration of analysis and visualization with workflows

Integration of analysis and visualization with databases (e.g., query-based)

General Recommendations

Encourage open source software

Move out mature technologies (??)

Encourage/force data model(s) & APIs standardization efforts?

Do not expect scientists to develop their own domain-specific components; rather, fund collaborative CS & Apps teams

This will assure more robust and reusable solutions and take the burden of CS tasks off scientists

Conclusions

Integration must occur at multiple levels

Integration is more easily achievable within a community than across communities

Community-based data model(s) and APIs are required for “Plug & Play”
