High Performance Distributed Computing and Data Science

Henri Bal Vrije Universiteit Amsterdam High Performance Distributed Computing

description

Henri Bal describes how high performance computing is driven by the demands of large-scale data problems. He also describes his links to other computer science disciplines within the DSRC.

Transcript of High Performance Distributed Computing and Data Science

Page 1: High Performance Distributed Computing and Data Science

Henri Bal, Vrije Universiteit Amsterdam

High Performance Distributed Computing

Page 2: High Performance Distributed Computing and Data Science

Outline

1. Development of the field
2. Highlights VU-HPDC group
3. Links to data science cycle
4. Conclusions

Page 3: High Performance Distributed Computing and Data Science

Developments

• Multiple types of data explosions:
– Big data: huge processing/transportation demands
– Complex heterogeneous data

10-100 x global internet traffic per year, exascale processing

Complex data

Page 4: High Performance Distributed Computing and Data Science

Developments

• Infrastructure explosion
– High complexity: heterogeneous systems with diversity of processors, systems, networks

Page 5: High Performance Distributed Computing and Data Science

VU HPDC GROUP

• Bridge the gap between demanding applications and complex infrastructure

• Distributed programming systems for
– Clusters, grids, clouds
– Heterogeneous systems (“Jungles”)
– Accelerators (GPUs)
– Clouds & mobile devices

• Applications: multimedia, semantic web, model checking, games, astronomy, astrophysics, climate modeling, ...

Page 6: High Performance Distributed Computing and Data Science

Highlights VU-HPDC group

• 1st Prize: SCALE 2008
• AAAI-VC 2007
• DACH 2008 - BS
• DACH 2008 - FT
• 3rd Prize: ISWC 2008
• 1st Prize: SCALE 2010
• EYR 2011 Sustainability award
• Solved Awari 2002 (889 billion game states)

Application areas shown: multimedia data, astronomy data, semantic web

Page 7: High Performance Distributed Computing and Data Science

Links to data science cycle

Cycle stages: Store and process; Analyze and model; Understand and decide

Disciplines: Reasoning; Knowledge representation; Multimedia Retrieval; Modeling and simulation; Machine Learning; Information Retrieval; Decision Theory; Perception, Cognition; Visual Analytics; Distributed Processing; Large Scale Databases; Software Eng.; System / Network Eng.

Links: Distributed reasoning; Distributed computing; MapReduce

Page 8: High Performance Distributed Computing and Data Science

Reasoning – Semantic Web

• Make the Web smarter by injecting meaning so that machines can “understand” it
– Initial idea by Tim Berners-Lee in 2001

• Has now attracted the interest of big IT companies

Page 9: High Performance Distributed Computing and Data Science

Google Example

Page 10: High Performance Distributed Computing and Data Science

Google Example

Page 11: High Performance Distributed Computing and Data Science

Distributed Reasoning

• WebPIE: web-scale distributed reasoner doing full materialization

• QueryPIE: distributed reasoning with backward-chaining + pre-materialization of schema-triples

• DynamiTE: maintains materialization after updates (additions & removals)
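Full materialization, as named in the WebPIE bullet, means applying inference rules until no new triple can be derived. The following is a minimal single-machine sketch of that idea, not WebPIE's implementation: it assumes only two RDFS rules (rdfs9 and rdfs11), and the helper names `apply_rules` and `materialize` as well as the example facts are hypothetical.

```python
# Illustrative sketch only: two RDFS rules applied until a fixpoint is reached.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def apply_rules(triples):
    """Return every triple derivable in one step from `triples`."""
    derived = set()
    subclass_pairs = [(s, o) for (s, p, o) in triples if p == SUBCLASS]
    for (a, b) in subclass_pairs:
        # rdfs11: subClassOf is transitive.
        for (b2, c) in subclass_pairs:
            if b == b2:
                derived.add((a, SUBCLASS, c))
        # rdfs9: an instance of a subclass is an instance of the superclass.
        for (s, p, o) in triples:
            if p == TYPE and o == a:
                derived.add((s, TYPE, b))
    return derived


def materialize(triples):
    """Forward-chain until no new triples appear (full materialization)."""
    closure = set(triples)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


# Hypothetical example data.
facts = {
    ("Cat", SUBCLASS, "Mammal"),
    ("Mammal", SUBCLASS, "Animal"),
    ("tom", TYPE, "Cat"),
}
closure = materialize(facts)
```

After materialization, a query such as "is tom an Animal?" becomes a plain set lookup. A backward-chaining reasoner would instead derive that answer at query time.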

Challenge: real-time incremental reasoning on web scale, combining new (streaming) data & existing historic data
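One way to picture combining new data with an existing materialization is to join only the newly arrived triples against the stored closure, instead of recomputing everything. The sketch below assumes the same two RDFS rules (rdfs9 and rdfs11) and handles additions only. Removals, which DynamiTE also supports, are not covered, and `derive` and `add_triples` are hypothetical names rather than any system's API.

```python
# Illustrative sketch only: incremental maintenance for additions.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def derive(delta, closure):
    """Rule applications in which at least one premise comes from `delta`."""
    out = set()
    for (s, p, o) in delta:
        for (s2, p2, o2) in closure:
            if p == SUBCLASS and p2 == SUBCLASS:
                # rdfs11 with the new triple as first or second premise.
                if o == s2:
                    out.add((s, SUBCLASS, o2))
                if o2 == s:
                    out.add((s2, SUBCLASS, o))
            elif p == TYPE and p2 == SUBCLASS and o == s2:
                # rdfs9: new type triple, existing subclass triple.
                out.add((s, TYPE, o2))
            elif p == SUBCLASS and p2 == TYPE and o2 == s:
                # rdfs9: new subclass triple, existing type triple.
                out.add((s2, TYPE, o))
    return out


def add_triples(closure, additions):
    """Extend an already materialized `closure` in place with `additions`."""
    delta = set(additions) - closure
    while delta:
        closure |= delta
        delta = derive(delta, closure) - closure
    return closure


# Hypothetical historic data, already materialized.
historic = {
    ("Cat", SUBCLASS, "Mammal"),
    ("tom", TYPE, "Cat"),
    ("tom", TYPE, "Mammal"),
}
# A newly streamed triple triggers only the derivations that depend on it.
updated = add_triples(historic, {("Mammal", SUBCLASS, "Animal")})
```

Each round only does work proportional to the new triples times the closure, which is the reason incremental maintenance is attractive when updates are small compared with the historic data.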

With: Jacopo Urbani, Alessandro Margara, Frank van Harmelen

COMMIT/

Page 12: High Performance Distributed Computing and Data Science

Distributed Computing

• Jungle computing with Ibis
– Distributed, heterogeneous, hierarchical systems

• Programming accelerators

With: NLeSC (Frank Seinstra, Rob van Nieuwpoort et al.)

Page 13: High Performance Distributed Computing and Data Science

Ibis

• Computational Astrophysics (Leiden)

• Climate Modeling (Utrecht)

• Multimedia Content Analysis (UvA)

AMUSE: radiative transport, gravitational dynamics, hydrodynamics, stellar evolution

Page 14: High Performance Distributed Computing and Data Science

Accelerators (GPUs)

• Use cases
– Multimedia content analysis
– Climate modeling
– LOFAR (pulsar pipelines)

• Methodology for efficient GPU programming
– Stepwise refinement, different levels of hardware abstraction
– Compiler feedback at each level

[Figure: GPU architecture block diagram showing the host interface, GigaThread engine, GPCs (each with a raster engine and SMs with polymorph engines), L2 cache, and memory controllers]

Challenge: getting a grip on performance

Page 15: High Performance Distributed Computing and Data Science

Glasswing: MapReduce on Accelerators

• Use accelerators (OpenCL) as mainstream feature

• Massive out-of-core data sets

• Scale vertically & horizontally

• Maintain MapReduce abstraction
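For readers unfamiliar with the MapReduce abstraction mentioned in the bullets above, here is a minimal single-process sketch: the user supplies only a map function and a reduce function, and the framework handles grouping by key. This is not Glasswing's API (the slide states it builds on OpenCL); `map_reduce`, `word_count_map` and `word_count_reduce` are hypothetical names used for illustration.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply `map_fn` to each record, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle" step: group by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_count_map(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    """Sum all counts emitted for one word."""
    return sum(counts)


counts = map_reduce(["big data", "complex data"], word_count_map, word_count_reduce)
```

Because the map and reduce functions see only their own inputs, a framework is free to run them on different nodes or devices without changing user code, which is what keeping the abstraction intact buys.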

With: Ismail El Helw, Rutger Hofman, UvA-SNE

Page 16: High Performance Distributed Computing and Data Science

Glasswing Pipeline

• Overlaps computation, communication & disk access

• Supports multiple buffering levels
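The overlap of stages can be sketched with one thread per stage and bounded queues between them, where the queue capacity plays the role of the buffering level. This is an assumption-laden illustration, not Glasswing's pipeline: `stage`, `run_pipeline`, the `depth` parameter and the three toy stages are all hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(work, inbox, outbox):
    """Run `work` on every item from `inbox` and forward results to `outbox`."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(work(item))


def run_pipeline(chunks, stages, depth=2):
    """Push `chunks` through `stages`; each link holds up to `depth` buffers."""
    queues = [queue.Queue(maxsize=depth) for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]

    def feed():
        for chunk in chunks:
            queues[0].put(chunk)  # blocks while the first buffer is full
        queues[0].put(DONE)

    workers.append(threading.Thread(target=feed))
    for worker in workers:
        worker.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


# Hypothetical stand-ins for disk read, computation, and output.
def read(n):
    return list(range(n))


def compute(values):
    return [v * v for v in values]


def write(values):
    return sum(values)


totals = run_pipeline([1, 2, 3, 4], [read, compute, write])
```

With `depth=2` each link behaves like double buffering: a stage can fill one buffer while the next stage drains the other. Raising `depth` trades memory for more tolerance to uneven stage times.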

Page 17: High Performance Distributed Computing and Data Science

Evaluation (DAS-4, EC2)

• Compute-bound applications benefit dramatically from GPUs (up to 107×)

• Better scalability than Hadoop

• Runs on a variety of accelerators & clouds

Challenge: real-world (compute-intensive) applications

Page 18: High Performance Distributed Computing and Data Science

Conclusions

• Strong links with Big data & Complex data

[Figure: data science cycle (Store and process; Analyze and model; Understand and decide) with its associated disciplines, repeated from the "Links to data science cycle" slide]