E-Science and Datacentric Frameworks. Hyunseung Choo, Sungkyunkwan University, http://monet.skku.ac.kr, choo@skku.ac.kr.

Page 1

e-Science and Datacentric Frameworks

Hyunseung Choo
Sungkyunkwan University
http://monet.skku.ac.kr
choo@skku.ac.kr

Page 2

e-Science and its examples

Page 3

e-Science

‘e-Science’ is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it. ‘e-Science’ will change the dynamic of the way science is undertaken.

John Taylor, Director General of Research Councils, Office of Science and Technology

Page 4

GRID vs. e-Science

Goals
  Both: Enhancing research productivity and acquiring national competitive power based on R&D infrastructure in the 21st century
  GRID: IT-based infrastructure for novel computing services
  e-Science: Renovation of R&D capability based on proper infrastructure
Organization
  GRID: Advanced networks, middleware
  e-Science: Applications, advanced users
Roles
  GRID: IT infrastructure
  e-Science: Virtual organizations
Resources
  Both: HPC, mass storage, DB, advanced instruments, human resources, etc.
Characteristics
  Both: Shared data, information and computation by geographically dispersed communities
Differences
  GRID: Provider-oriented (technology push); focus on networks and middleware
  e-Science: Consumer-oriented (science pull); focus on actual applications

Source: KIPS Review, May 2003

Page 5

From Networking to Grid Computing

■ Exponential Growth of Network Technology
  Network vs. computer performance
  - Computer speed doubles every 18 months
  - Network speed doubles every 9 months
  - Difference = order of magnitude per 5 years
  1986 to 2000
  - Computers: x 500
  - Networks: x 340,000
  2001 to 2010
  - Computers: x 60
  - Networks: x 4000
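The "order of magnitude per 5 years" claim follows directly from the two doubling periods on this slide. The sketch below is a back-of-the-envelope check, assuming pure exponential growth at exactly those rates; the function and constant names are illustrative and not from the slides.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months` for a quantity that doubles
    every `doubling_period_months` (idealized exponential growth)."""
    return 2.0 ** (months / doubling_period_months)


COMPUTER_DOUBLING_MONTHS = 18.0  # from the slide
NETWORK_DOUBLING_MONTHS = 9.0    # from the slide


def network_lead(months: float) -> float:
    """How much faster networks grow than computers over `months`."""
    return (growth_factor(months, NETWORK_DOUBLING_MONTHS)
            / growth_factor(months, COMPUTER_DOUBLING_MONTHS))


# Over 5 years the ratio is 2 ** (60 / 18), which is roughly 10.
five_year_lead = network_lead(5 * 12)

# 2001 to 2010 is 9 years = 108 months: 2 ** 6 = 64 and 2 ** 12 = 4096,
# close to the slide's "x 60" and "x 4000".
computers_2001_2010 = growth_factor(108, COMPUTER_DOUBLING_MONTHS)
networks_2001_2010 = growth_factor(108, NETWORK_DOUBLING_MONTHS)

if __name__ == "__main__":
    print(f"network lead after 5 years: x{five_year_lead:.1f}")
    print(f"2001 to 2010: computers x{computers_2001_2010:.0f}, "
          f"networks x{networks_2001_2010:.0f}")
```

Under the same idealized rates, the 14-year span from 1986 to 2000 gives factors in the same order of magnitude as the slide's x 500 and x 340,000, though not the exact figures, which presumably come from measured data.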

Page 6

The Driver for e-Science

■ More and more data
  - Instrument resolution doubling every 12 months
  - Instrument and telemetry speeds increasing
  - Mobile sensors & radio digital networks
  - Storage capacity doubling every 12 months
■ More and more computation
  - Computations available doubling every 18 months
■ Faster networks can change methods
  - Raw bandwidth doubling every 9 months
■ These integrate and enable
  - More interplay between computation and data
  - More collaboration: scientists, medics, engineers, etc.
  - More international collaboration

Page 7

The New Behavior

■ Shared Infrastructure
  - Intrinsically distributed
  - Intrinsically multi-organizational
  - Multiple uses interwoven
■ Shared Software
  - A new attempt at making distributed computing economic, dependable and accessible
  - Scientists from all disciplines share in its design and use
■ Shared & Automated System Administration
  - Replicated farms of replicated systems
  - Autonomic management
■ Immediate Benefits
  - Faster transfer of ideas and techniques between disciplines
  - Amortization of development, operation and education

Page 8

Examples of e-Science

■ Earth Observation Systems
  - Severe weather predictions, climate variations, flood monitoring, earthquakes, and tsunami (a tidal wave)
■ Virtual Observatories
■ Robotic Telescopes
■ Bioinformatics / Functional genomics
■ Collaborative Engineering
■ Medical / Healthcare informatics
■ TeleMicroscopy, and so on

Page 9

Example 1 – Earthquake Simulation

NEESgrid

National infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other.

Argonne, Michigan, NCSA, UIUC, USC

Page 10

Example 2 – Airspace Simulation

NASA Information Power Grid (IPG)

Aircraft, flight paths, airport operations and the environment are combined to get a virtual national airspace.

Page 11

e-Science (USA)

■ Cyberinfrastructure program, like an “e-Science community”, for federal offices, supercomputing centers, and research institutes
  - Budget in 2003: US$1.1 billion
■ e-Science Cases
  - Telescience Portal: X-ray related applications including microbioanalysis
  - NASA IPG (Information Power Grid): Aircraft simulation and analysis to reduce the design processing time
  - BIRN (Biomedical Informatics Research Network): Study on human and animal brains for the new era in medical science

Page 12

BIRN (Biomedical Informatics Research Network)

■ Processing Pipelines for Morphometric Analysis
■ Medical Applications for HPC
  - Non-linear registrations
  - Biomechanical simulations
  - Statistical analysis of large populations

Page 13

e-Science Centre (UK)

AccessGrid: always-on video walls

Page 14

e-Science Pilot Project (UK) (1/2)

■ Many to one project
■ Particle Physics and Astronomy Research Council (PPARC)
  - GridPP: A prototype Grid infrastructure for the CERN Large Hadron Collider
  - AstroGrid: A Grid based Virtual Observatory
■ Biotechnology and Biological Sciences Research Council (BBSRC)
■ Medical Research Council (MRC)
■ Natural Environment Research Council (NERC)
  - Grid for Environmental Systems Diagnostics and Visualization
  - Climateprediction.com: Distributed computing for global climate research
  - Environment from the Molecular Level: Modeling the atomistic processes involved in environmental issues

Page 15

e-Science Pilot Project (UK) (2/2)

■ Economic and Social Research Council (ESRC)
■ Engineering and Physical Sciences Research Council (EPSRC)
  - The Reality Grid: a tool for investigating condensed matter and materials
  - Comb-e-chem: Structure-Property Mapping: Combinatorial Chemistry and the Grid
  - DAME: Distributed Aircraft Maintenance Environment
  - GEODISE: Grid Enabled Optimization and Design Search for Engineering
  - Discovery Net: An e-Science Testbed for High Throughput Informatics
  - MyGrid: Directly Supporting the e-Scientist
■ Council for the Central Laboratory of the Research Councils (CLRC)

Page 16

e-Science (JP)

■ IT-Based Laboratory (ITBL), Grid-based fundamental informatics (A05), 100-teraflop high-performance computing (NAREGI)
  - All led by the Ministry of Education, Culture, Sports, Science, and Technology
■ e-Science Cases
  - ITBL: Project for virtual research environments
  - A05: Grid computing project
  - NAREGI: Integrating distributed computing resources by high-performance networks for 100-teraflop HPC

Page 17

ITBL (IT-Based Laboratory)

■ 6 Organizations at ITBL
  - Japan Atomic Energy Research Institute (JAERI)
  - RIKEN (The Institute of Physical and Chemical Research)
  - National Institute for Materials Science (NIMS)
  - National Aerospace Laboratory of Japan (NAL)
  - National Research Institute for Earth Science and Disaster Prevention (NIED)
  - Japan Science and Technology Corporation (JST)
■ Massive collaborative research environment for remote researchers by SuperSINET based on IT infrastructure

Page 18

e-Science (CN)

■ Grid Projects in China (2002-2005)
  - The Ministry of Science & Technology 863 Grid Project
    - Grid Enabling Cluster (>4 Tflop/s)
    - Grid Nodes (Total 6-10 Tflop/s)
    - Grid Software (Grid OS, Developer and User Environment)
    - Grid Applications in Science, Manufacturing, Service industry, and Environment/Resource sector
  - The “Next Internet” Project (led by Chinese NSF)
    - Upgrade network infrastructure
    - Basic research in computing, data and access grids
  - The Chinese Academy of Sciences e-Science Grid
  - The Beijing City Manufacturing Grid

Page 19

Datacentric Applications

Page 20

Three different kinds of grids

■ Computational grids
  - These represent the natural extension of large parallel and distributed systems, and exist to provide high-performance computing
■ Access grids
  - These require managing access to many specific, small resources that are actually located inside large, complex, organizational computer systems and networks
■ Data grids
  - These exist in order to allow large datasets to be stored in repositories and moved about with the same ease that small public files can be moved today

☼ Datacentric grids

Page 21

Facts about online data

■ They are big and growing fast
  - Data stored online quadruples every 18 months
  - Processing power ‘only’ doubles every 18 months
■ They are naturally distributed
  - Data is captured via multiple channels
  - Operating systems struggle to handle files larger than a few GB
■ They are hard to move
  - Pragmatics: Few sites have enough swap space to handle the arrival of a terabyte dataset for temporary use
  - Performance
  - Politics: Data about individuals cannot be moved out of jurisdictions with strong privacy rules

Page 22

Implications of datasets that are large, distributed, and immovable

■ It’s much more effective to divide programs into separate pieces and send them to the data
■ This requires a datacentric view of computation, rather than the conventional processor-centric view
  - A new programming model is needed
  - Applications must be decomposable
  - The results of (partial) computations must be small enough to move around
  - These condensed forms are worth keeping
  - Execution nodes must be able to provide both computing cycles and high-performance data access
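The datacentric view can be sketched in a few lines: a small piece of code travels to each data site, and only a condensed partial result travels back. The example below is a minimal single-process illustration under stated assumptions; `DataNode`, `partial_mean`, and `merge_means` are hypothetical names, and a real grid would ship the task over a network rather than call it locally.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Partial = Tuple[int, float]  # (record count, sum of values): small and mergeable


@dataclass
class DataNode:
    """Hypothetical execution node holding a partition that never leaves the site."""
    name: str
    records: List[float]

    def run(self, task: Callable[[Iterable[float]], Partial]) -> Partial:
        # The code moves to the data; only the condensed result is returned.
        return task(self.records)


def partial_mean(records: Iterable[float]) -> Partial:
    """Decomposable piece of the application, executed next to the data."""
    count, total = 0, 0.0
    for value in records:
        count += 1
        total += value
    return count, total


def merge_means(partials: List[Partial]) -> float:
    """Combine the small partial results into the global answer."""
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return total / count if count else float("nan")


nodes = [
    DataNode("site-a", [1.0, 2.0, 3.0]),
    DataNode("site-b", [10.0, 20.0]),
    DataNode("site-c", []),
]
partials = [node.run(partial_mean) for node in nodes]
global_mean = merge_means(partials)

if __name__ == "__main__":
    print(partials, global_mean)
```

Each partial is a fixed-size pair regardless of how large the partition is, so it is cheap to move and cheap to keep; re-running the application after one site's data changes only requires recomputing that site's pair.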

Page 23

Some properties

■ Users can be productive even from a thin client

■ Applications require only thin pipes within the internet

■ Code mobility is essential

■ The format and content of a data repository will often be unknown to an application until it actually starts accessing it

■ Applications will tend to be standardized

■ Applications will often be built from templates, perhaps even expressed using a query language

■ Re-execution of an application on a different or updated dataset will be common

■ There will be increased sensitivity about information leakage

Page 24

A typical datacentric application

Page 25

Conclusion

■ e-Science and datacentric grids are strongly coupled
■ Meteorology data will require datacentric grid computing in the future
  - Typical e-Science characteristics
  - Huge data size
  - Poor data site accessibility
  - Experts are spread over the country/world
■ Basically, all of this depends on reliable networks
  - Computing network probabilistic connectivity (one aspect of reliability measures) exactly is theoretically hard
  - Fast approaches and good-enough approximation algorithms have been developed (to be published)
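The slides do not describe the approximation algorithms themselves. As a stand-in illustration only, the sketch below contrasts exact probabilistic connectivity between two nodes (brute-force enumeration of all link states, exponential in the number of links) with a plain Monte Carlo estimate. It assumes independent link failures; the small example network and all names are hypothetical.

```python
import random
from itertools import product
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[str, str, float]  # (endpoint, endpoint, probability the link is up)


def connected(up_links: Sequence[Tuple[str, str]], source: str, target: str) -> bool:
    """Depth-first search over the links that are currently up."""
    adjacency: Dict[str, List[str]] = {}
    for u, v in up_links:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return False


def exact_connectivity(edges: Sequence[Edge], source: str, target: str) -> float:
    """Sum the probability of every link-state combination that connects
    source and target. Cost grows as 2 ** len(edges)."""
    total = 0.0
    for states in product([True, False], repeat=len(edges)):
        probability = 1.0
        up_links = []
        for (u, v, p), is_up in zip(edges, states):
            probability *= p if is_up else 1.0 - p
            if is_up:
                up_links.append((u, v))
        if connected(up_links, source, target):
            total += probability
    return total


def estimate_connectivity(edges: Sequence[Edge], source: str, target: str,
                          samples: int = 20000, seed: int = 0) -> float:
    """Plain Monte Carlo: sample link states and count connected outcomes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        up_links = [(u, v) for u, v, p in edges if rng.random() < p]
        if connected(up_links, source, target):
            hits += 1
    return hits / samples


# Two disjoint two-hop paths from "s" to "t", each link up with probability 0.9.
edges: List[Edge] = [("s", "a", 0.9), ("a", "t", 0.9),
                     ("s", "b", 0.9), ("b", "t", 0.9)]
exact = exact_connectivity(edges, "s", "t")
approx = estimate_connectivity(edges, "s", "t")

if __name__ == "__main__":
    print(f"exact={exact:.4f} estimate={approx:.4f}")
```

For this network the exact value can be checked by hand: each path works with probability 0.81, so the pair works with probability 1 - (1 - 0.81)^2 = 0.9639. The enumeration doubles in cost with every added link, while the sampling cost grows only linearly with the number of links per sample, which is the usual motivation for approximate methods.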