Infraestructuras data science_portugal_ipca_industry_4.0_v2

Post on 29-Jan-2018

Dr. Andrés Gómez

agomez@cesga.es

Feb. 2017

Data Science: Support Infrastructures

CESGA Mission

“Contribute to the advancement of Science and Technical Knowledge, by means of research and application of high performance computing and communications, as well as other information technology resources, in collaboration with other institutions, for the benefit of society”

CESGA activities

PT Academic Network to Geant

Our Customers

Universities (mainly from Galicia)
R&D&I centres (mainly from Galicia)
CSIC (around Spain)
Other institutions from Spain and Europe:
Hospitals (R&D only)
Companies (mainly SMEs)
Other non-profit R&D&I organizations

Non-fee access for Europeans through:
RES open calls
PRACE open calls

CESGA Computing Infrastructure

2,200 TB

FINIS TERRAE II: HPC, 7,712 cores
SVG: HTC and Cloud, ~3,300 cores
Cloud for Industry: 240 cores
BigData: 456 cores
Remote Visualisation: 80 cores
Online Disk: 1,200 TB

Infrastructures for Data Science

What is Big Data?

Why now:

Producing data is very cheap (sensors, people, ...)
Storage is also cheap
Unstructured and high-dimensional data

Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis

NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1

V’s Big Data Challenges

Volume
Velocity
Variety
Veracity
Value
Added-Value or Knowledge
Variability

Adapted from: Demchenko, Y., Grosso, P., & Membrey, P. (2013). Addressing Big Data Issues in Scientific Data Infrastructure. 2013 International Conference on Collaboration Technologies and Systems (CTS), pp. 48-55. IEEE. http://doi.org/10.1109/CTS.2013.6567203

What is Data Science?

Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.

NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1

Data Scientist: A Champion!

Collaboration is better

Architecture

NIST Big Data Public Working Group (NBD-PWG). (2015). NIST Big Data Interoperability Framework: Volume 6, Reference Architecture (Vol. 6). Gaithersburg, MD. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-6.pdf

Big Data Requirements

Very large storage (TB, PB, EB, ...)
Parallel, very fast I/O (GB/s)
Computing capacity (move the processing to the data)
Parallel processing
Interactive, streamed and batch
Visualisation (first step of data analysis)
Advanced data analytics and ML packages
Remote access
Etc.
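The difference between the batch and streamed processing modes listed above can be illustrated with a small sketch. It is not taken from the slides: the function names and the sensor readings are hypothetical, and the code only shows that the same statistic can be computed either once over a complete dataset or incrementally as records arrive.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch mode: the whole dataset is available before processing starts."""
    return sum(values) / len(values)


def streamed_mean(values: Iterable[float]) -> Iterator[float]:
    """Streamed mode: update the result per record, keeping constant-size state."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count  # incremental update, no history kept
        yield mean


# Hypothetical sensor readings, used only for illustration.
readings = [2.0, 4.0, 6.0, 8.0]

final_batch = batch_mean(readings)
running = list(streamed_mean(iter(readings)))
print(final_batch, running)
```

The streamed version never stores the full dataset, which is why streaming workloads stress I/O and latency rather than storage volume.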

HETEROGENEOUS NEEDS & USER PROFILES

HETEROGENEOUS INFRASTRUCTURE & ACCESS MODES

CESGA Solution: Static

Based on Hortonworks HDP

HARDWARE PLATFORM FOR BIG DATA
HDFS
YARN
MAPREDUCE, HBASE, SPARK, HIVE
Jupyter/Hue/Zeppelin/R
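As an illustration of the MapReduce layer in this stack, the following is a minimal single-process sketch of the map, shuffle and reduce phases applied to a word count. It is an assumption-laden teaching example, not the Hadoop API: on a real cluster the framework distributes these phases across nodes and reads the input from HDFS, and all function names here are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence inside one process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input lines, used only for illustration.
lines = ["big data needs storage", "big data needs computing"]
counts = word_count(lines)
print(counts)
```

Because each map call depends only on its own line and each reduce call only on its own key, both phases can run in parallel on the nodes that already hold the data.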

CESGA Solution: Dynamic

Create your own cluster for Data Science

HARDWARE PLATFORM FOR BIG DATA
DOCKER
MESOS
Your Config Cluster
Cassandra, SPARK, SciDB
PaaS API
WEB Interface

CESGA Solution: HPC

When data processing needs large computing capacity

HARDWARE PLATFORM FOR HPC + GPUs
HIGH PERFORMANCE STORAGE: LUSTRE
HIGH SPEED COMM: IB
Theano, TensorFlow, R, Caffe
SLURM
WEB Interface/Remote Desktop, SSH
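Work on an HPC platform scheduled with SLURM is normally submitted as a batch script. The sketch below renders such a script from Python, assuming a GPU training job; the job name, resource values and `train.py` command are hypothetical, and the actual queues and limits at CESGA are not described in the slides. Only standard `sbatch` directives (`--job-name`, `--nodes`, `--time`, `--gres`) are used.

```python
def render_sbatch_script(job_name: str, nodes: int, gpus_per_node: int,
                         time_limit: str, command: str) -> str:
    """Build the text of a SLURM batch script from a few job parameters."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # Request generic resources (GPUs) on each allocated node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


# Hypothetical job: one node, one GPU, one hour, running an assumed train.py.
script = render_sbatch_script(
    job_name="train-model",
    nodes=1,
    gpus_per_node=1,
    time_limit="01:00:00",
    command="python train.py",
)
print(script)
```

The printed text can be saved to a file and submitted with `sbatch`; the same job could instead be started interactively over SSH or a remote desktop session.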

CESGA Data Scientist

CESGA has no Data Scientist

CESGA offers this service in collaboration

Open to collaborations in Portugal