Integrated Data Analysis and Visualization Group 5 Report DOE Data Management Workshop May 24-26,...
-
date post
20-Dec-2015 -
Category
Documents
-
view
216 -
download
0
Transcript of Integrated Data Analysis and Visualization Group 5 Report DOE Data Management Workshop May 24-26,...
Integrated Data Analysis and Visualization
Group 5 Report
DOE Data Management Workshop
May 24-26, 2004
Chicago, IL
Group 5 Participants
Wes Bethel, LBNL George Michaels, PNNL
John Blondin, NC State Habib Najm, SNL
Julian Bunn, Caltech John van Rosendale, DOE/HQ
George Chin, PNNL Nagiza Samatova, ORNL
Chris Ding, LBNL David Schissel, General Atomics
Irwin Gaines, FNAL Gary Strand, NCAR
Chandrika Kamath, LLNL Todd Smith, Geospiza, Inc.
James Kohl, ORNL
Outline
What is integrated data analysis and visualization? Why do we care?
Data Complexity & Implications
Applications-driven capabilities
Technology gaps to address these capabilities
General Recommendations
Conclusions
The Curse of Ultrascale Computation and High-throughput Experimentation
Computational and experimental advances enable capturing of complex natural phenomena, on a scale not possible just a few years ago. With this opportunity, comes a new problem – the petabyte quantities of produced data. As a result, answers to fundamental questions about the nature of the universe largely remain hidden in these data.
How to enable scientists perform analyses and visualizations of these raw data to extract knowledge?
Tony’s ScenarioData Select Data Access Correlate Render Display(density, pressure)From astro-data Where (step=101)(x-velocity>Y);
Sample (density, pressure) Visualize scatter plot
Run viz filterRun analysis
VIZTool
SelectData
AnalysisTool
TakeSample
Use Bitmap(condition)
Get variables(var-names, ranges)
Read Data(buffer-name)Write Data
Read Data(buffer-name)Write Data
Read Data(buffer-name)
Parallel HDF
PVFS Bitmap Index
Selection
Hardware, OS, and MSS (HPSS)
Workflow Design & Execution
Data Mining & Analysis Layer
Scientific ProcessAutomation Layer
Storage EfficientAccess Layer
Integration Must Happen at Multiple Levels
To enable end-to-end system performance, 80-20 rule and novel discoveries integration must happen:
Between and within data flow levels: Workflows Analysis & Viz Access &
Movement)
Across geographically distributed resources
Across multiple data scales and resolutions
Challenge of Data MassivenessDrinking from the firehose
Climate Now: 20-40TB per simulated year 5 yrs: 100TB/yr 5-10PB/yr
Fusion Now: 100Mbytes/15min 5 yrs: 1000Mbytes/2 min with realtime
comparison with running experiment, 500Mbits/sec guaranteed (QoS)
High Energy Physics Now: 1-10PB data stored, Gigabit net. 5 yrs: 100PB data, 100Gbits/sec net
Chemistry (Combustion and Nanostructures) Now: 10-30TB data 5 yrs: 30-100TB data, 10Gbits/sec multicast
Astrophysics Now and 5 yrs: Can soak up anything you
build!(John Sharf’s stats, LBL)
Most of this Data will NEVER Be Touched with the current trends in technology
The amount of data stored online quadruples every 18 months, while processing power ‘only’ doubles every 18 months.
Unless the number of processors increases unrealistically rapidly, most of this data will never be touched.
Storage device capacity doubles every 9 months, while memory capacity doubles every 18 months (Moore’s law).
Even if the divergence between these rates of growth will converge, the memory latency is and will remain the rate-limiting step in data-intensive computations
Operating systems struggle to handle files larger than a few GBs.
OS constraints and memory capacity determine data set file size and fragmentation
Challenge of Breaking the Algorithmic Complexity Bottleneck
3 yrs.0.1 sec.10-2 sec.10GB
3 hrs10-3 sec.10-4 sec.100MB
1 sec.10-5 sec.10-6 sec.1MB
10-4sec.10-8 sec.10-8 sec.10KB
10-8 sec.10-10 sec.10-10sec.100B
n2nlog(n)n
Algorithm Complexity
Data size, n
Algorithmic Complexity:
Calculate means O(n)
Calculate FFT O(n log(n))
Calculate SVD O(r • c)
Clustering algorithms O(n2)
For illustration chart assumes 10-12 sec. calculation time per data point
MS Data Rates: 100’sGB10’sTB/day(2004)1.0’sPB/day(2008)
Massive Data Sets are Naturally Distributed BUT Effectively Immoveable(Skillicorn, 2001)
Bandwidth is increasing but not at the same rate as stored data There are some parts of the world with high available bandwidth BUT there are
enough bottlenecks that high effective bandwidth is unachievable across heterogeneous networks
Latency for transmission at global distances is significant Most of this latency is time-of-flight and so will not be reduced by technology
Data has a property similar to inertia: It is cheap to store and cheap to keep moving, but the transitions between
these two states are expensive in time and hardware.
Legal and political restrictions
Social restrictions Data owners may let access data but only by retaining control of it
• Should we move computations to the data, rather than data to the computations?• Should we cache the data close to analysis and viz.?• Should we be smarter about reducing the size of the data while having the same or richer information content?
Cells & Tissues
Time
GeneticsEnvironments
Genetic Manipulations
The experiment paradigm is changing to statistically capture the complexity. We will get maximum value when we explore three or more dimensions/scales in a single experiment.
Phenotypes
Po
pula
tion
sT
reat
men
ts
Challenge of High Dimensionality, Multi-Scale and Multi-Resolution
(From G. Michaels, PNNL)
But multi-scale and multi-resolution analysis and visualization are in their infancy!
Know Our Limits & Be SmartObligations are two-sided: CS and Apps
Ultrascale Simulations:Must be smart about which probe combinations to see!
Physical Experiments:Must be smart about probe placement!
Not humanly possible to browse a petabyte of data.Analysis must select views or reduce to quantities of interest rather than push more views past the user.
Terabytes
Petabytes
Gigabytes
Megabytes
NoAnalysis
Region Selection
Analysis-driven summarization
Visualization Scalability
through guidance by
analysis of full context
More analysis
Mo
re d
ata
Human Bandwid
th
Overload?
Can we browse a petabyte of data?
To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!
Analysis of full context must select views or reduce to quantities of interest in addition to fast rendering of data.
Frame of Context Differences Suggest Needs in Hardware and Software for Analysis and Visualization
Space
Time
Simulation Analysis
Storage
Visualization
Frames of context for major steps of space-time simulation scientific discovery process.
Need hardware and software forFull Context analysis and visualization
(From G. Ostrouchov, ORNL)
Arguably, visualization can be the most critical step of a simulation experiment. But it has to be in a full context.
I hear and I forget.I see and I believe.I do Visual Analysis and I understand.—Confucius (551-479 BC)
But Tony Still Has a Dream – Internet “Plug-ins” for Ultrascale Computing!
ASPECT
Paraview
From Dreams to Achievable Application-driven CapabilitiesThe first step in Group 5 discussion
Technology Gaps Ranking:
Research & Development -- 3
Hardening Technology -- 2
Deployment & Maintenance -- 1
Representation by Applications:
• Climate• Biology• Combustion• Fusion• HENP• Supernova
Capability #1: IDL-like SCALABLE, open source environment (J.Blondin)
High performance-enabling technologies: Parallel analysis and viz. algorithms (e.g. pVTK, pMatlab, parallel-R) (3—2)
Portable implementation on HPC platforms (2—1)
Hardware accelerated implementations (GPUs, FPGAs) (2—1)
Parallel I/O libraries coupled with analysis and viz (ROMIO+pVTK, pNetCDF+Parallel-R) (2—1)
Information visualization (3—2)
Interoperability-enabling technologies: Component architectures (CCA) (3—2)
Core data models and data structures unification (3—2)
Structural, semantic and syntactic mediation (3—2)
Scripting environments: IDL/Matlab-like high-level programming languages (3)
Optimized (parallel, accelerated) functions (core libraries) (3)
Simulation interfaces (3)
Visualization interfaces (information, statistical and scientific visualization) (3—2)
Capability #2: Domain-specific libraries and tools
Same technologies as for Capability #1 Plus More
Novel algorithms for domain-specific analysis and visualization:
Feature extraction/selection/tracking (e.g., ICA for climate) (3—2)
New types of data (e.g. trees, networks) (3—2)
Interpolation & transformation (3—2)
Multi-scale/hierarchical features correlation (3)
Novel data models if necessary (3)
Capability #3: “Plug and Play” Analysis and Visualization Envs
Community-specific data model(s) (3)
Standardization that still provides efficiency and flexibility
Community-specific common APIs (3)
Unified still extensible data structures
Common component architectures (3—2)
“i”-ntegration vs. “I”-ntegration
“i”-integration within the same application:
Same set of data structures => brute-force check (at worst)
Same language, execution and control model
Scripting languages (TCL, Python, R)
Run on the same cluster of machines
“I”-integration across multiple applications:
Different data structures (unknown for future apps)
Different execution & control models
Different programming languages
Run on different hosts
Components integration strategies should be assessed within single and between multiple higher-level applications
• File App• App_X App_X• App_X App_Y• App File
Data Formats Transformations:
Capability #4: Feature (region) detection, extraction, tracking
Efficient and effective data indexing that (3-2):
Supports unstructured data in files (e.g., bitmap indexing extension to AMR)
Supports heterogeneous/non-scalar data (e.g., vector fields, protein sequence, protein function, pathway, network)
Supports on-demand derived data (e.g. F(X)/G(Y)<5: entropy as a function of indexed density and pressure)
Information visualization (3—2)
Other Capabilities Remote, collaborative & interactive analysis and visualization (3-2):
Network-aware analysis and viz
Novel means of hiding latency (e.g. caching via LoCi, view-dependent isosurfaces)
Sensitivity & uncertainty quantification (3)
Streaming analysis & viz (3): Approx. multi-res. algorithms
Data transformations on streaming data
Annotation & provenance of analysis and visualization results System-, analysis & viz-, data-level metadata
Verification and validation
Comparative analysis and visualization
Cross-cutting capabilities: Integration of analysis and visualization with workflows
Integration of analysis and visualization with data bases (e.g., query-based)
General Recommendations
Encourage open source software
Move out mature technologies (??)
Encourage/force data model(s) & APIs standardization efforts?
Do not expect scientists to develop their domain-specific components rather fund collaborative CS & Apps teams Will assure more robust and reusable solutions and take the burden of CS tasks from scientists
Conclusions
Integration must occur at multiple levels
Integration is more easily achievable within a community than across communities
Community-based data model(s) and APIs are required for “Plug & Play”