Infraestructuras data science_portugal_ipca_industry_4.0_v2

Dr. Andrés Gómez, Feb. 2017. Data Science: Support Infrastructures (Data Science - Infraestruturas de suporte)



Page 3

CESGA Mission

“Contribute to the advancement of Science and Technical Knowledge, by means of research and application of high performance computing and communications, as well as other information technology resources, in collaboration with other institutions, for the benefit of society”

Page 4

CESGA activities

Page 5

PT Academic Network to GÉANT

Page 6

Our Customers

Universities (mainly from Galicia)

R&D&I centres (mainly from Galicia)

CSIC (around Spain)

Other institutions from Spain and Europe:

Hospitals (R&D only)

Companies (mainly SMEs)

Other non-profit R&D&I organizations

Free access for Europeans through RES open calls and PRACE open calls

Page 7

CESGA Computing Infrastructure

FINIS TERRAE II: HPC, 7,712 cores

SVG: HTC and Cloud, ~3,300 cores

Cloud for Industry: 240 cores

BigData: 456 cores

Remote Visualisation: 80 cores

Online Disk: 1,200 TB

2,200 TB

Page 8

Infrastructures for Data Science

Page 9

What is Big Data?

Why now:

Producing data is very cheap (sensors, people, ...)

Storage is also cheap

Data are unstructured and high-dimensional

Big Data consists of extensive datasets - primarily in the characteristics of volume, variety, velocity, and/or variability - that require a scalable architecture for efficient storage, manipulation, and analysis

NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1

Page 10

V’s Big Data Challenges

Volume

Velocity

Variety

Veracity

Variability

Value (added value or knowledge)

Adapted from: Demchenko, Y., Grosso, P., & Membrey, P. (2013). Addressing Big Data Issues in Scientific Data Infrastructure. 2013 International Conference on Collaboration Technologies and Systems (CTS), pp. 48-55. IEEE. http://doi.org/10.1109/CTS.2013.6567203

Page 11

What is Data Science?

Data science is the extraction of actionable knowledge directly from data through a process of discovery, or hypothesis formulation and hypothesis testing.

NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1
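
The "hypothesis formulation and hypothesis testing" part of this definition can be illustrated with a minimal sketch. The example below is not from the slides: it assumes a two-sided permutation test on the difference of means, and the readings, variable names, and function name are made up for illustration only.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Hypothesis: both samples come from the same distribution.
    Returns the fraction of random relabelings whose mean difference
    is at least as large as the observed one (an estimated p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples


# Made-up readings from two hypothetical production lines.
line_a = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2]
line_b = [21.0, 20.9, 21.3, 20.8, 21.1, 21.2]

p_value = permutation_test(line_a, line_b)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the two lines really differ, which is the kind of actionable knowledge the definition refers to.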

Data Scientist: a champion!

Collaboration is better

Page 12

Architecture

NIST Big Data Public Working Group (NBD-PWG). (2015). NIST Big Data Interoperability Framework: Volume 6, Reference Architecture (Vol. 6). Gaithersburg, MD. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-6.pdf

Page 13

Big Data Requirements

Very large storage (TB, PB, EB, ...)

Parallel, very fast I/O (GB/s)

Computing capacity (move the processing to the data)

Parallel processing

Interactive, streamed, and batch processing

Visualisation (the first step of data analysis)

Advanced data analytics and ML packages

Remote access

Etc.
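
A rough back-of-the-envelope calculation shows why storage size and parallel I/O appear together in this list. The sketch below assumes decimal units (1 GB = 10^9 bytes) and ideal linear scaling across parallel streams, which real file systems only approximate. The function name and the figures (1 PB, 1 GB/s, 100 streams) are chosen for illustration and are not CESGA measurements.

```python
GB = 10**9   # bytes, decimal units (assumption)
PB = 10**15  # bytes


def scan_time_seconds(dataset_bytes, bandwidth_bytes_per_s, parallel_streams=1):
    """Time to read a dataset once, assuming ideal scaling across streams."""
    return dataset_bytes / (bandwidth_bytes_per_s * parallel_streams)


# One full pass over 1 PB through a single 1 GB/s stream.
single = scan_time_seconds(1 * PB, 1 * GB)

# The same pass spread over 100 streams of 1 GB/s each.
parallel = scan_time_seconds(1 * PB, 1 * GB, parallel_streams=100)

print(f"single stream: {single / 86400:.1f} days")   # 11.6 days
print(f"100 streams:   {parallel / 3600:.1f} hours")  # 2.8 hours
```

Under these assumptions a single stream needs more than eleven days per pass, which is why reads are parallelised and processing is moved to where the data already sits.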

Page 14

HETEROGENEOUS NEEDS & USER PROFILES

HETEROGENEOUS INFRASTRUCTURE & ACCESS MODES

Page 15

CESGA Solution: Static

Based on Hortonworks HDP

HARDWARE PLATFORM FOR BIG DATA

HDFS

YARN

MAPREDUCE, HBASE, SPARK, HIVE

Jupyter/Hue/Zeppelin/R
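
The MapReduce layer in this stack follows a simple programming model: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step combines each group. The sketch below shows that model in plain Python on two made-up lines of text. It does not use the Hadoop or Spark APIs, and all function names are illustrative.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


# Made-up input; on a real cluster each line could live on a different node.
lines = ["big data needs big storage", "data science needs data"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

On a cluster, the map calls run in parallel next to the HDFS blocks that hold the data, and only the grouped pairs travel over the network.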

Page 16

CESGA Solution: Dynamic

Create your own cluster for Data Science

HARDWARE PLATFORM FOR BIG DATA

DOCKER

MESOS

Your Config Cluster

Cassandra, SPARK, SciDB

PaaS API

WEB Interface
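
To give a feel for what "create your own cluster" could look like from the user's side, the sketch below builds a cluster description as JSON. The slides do not document the PaaS API, so every field name, the function name, and the validation rules here are hypothetical. Only the three service names come from the slide.

```python
import json

# Services named on the slide; treating them as a catalogue is an assumption.
SUPPORTED_SERVICES = {"cassandra", "spark", "scidb"}


def build_cluster_request(service, nodes, cores_per_node, memory_gb):
    """Return a JSON document describing a requested cluster.

    All field names are hypothetical; the real PaaS API may differ.
    """
    if service not in SUPPORTED_SERVICES:
        raise ValueError(f"unsupported service: {service}")
    if nodes < 1 or cores_per_node < 1 or memory_gb < 1:
        raise ValueError("nodes, cores_per_node and memory_gb must be positive")
    request = {
        "service": service,
        "nodes": nodes,
        "cores_per_node": cores_per_node,
        "memory_gb": memory_gb,
    }
    return json.dumps(request, sort_keys=True)


payload = build_cluster_request("spark", nodes=4, cores_per_node=8, memory_gb=32)
print(payload)
```

A platform of this kind would then turn such a description into Docker containers scheduled by Mesos, but that step is outside this sketch.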

Page 17

CESGA Solution: HPC

When data processing needs large computing capacity

HARDWARE PLATFORM FOR HPC + GPUs

HIGH PERFORMANCE STORAGE: LUSTRE

HIGH SPEED COMM: INFINIBAND (IB)

Theano, TensorFlow, R, Caffe

SLURM

WEB Interface / Remote Desktop / SSH
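
On an HPC platform of this kind, work is normally submitted to SLURM as a batch script. The sketch below generates such a script from Python using standard `#SBATCH` directives (`--job-name`, `--nodes`, `--ntasks-per-node`, `--time`, `--gres`). The job name, resource values, and the `train.py` command are hypothetical, and site-specific settings such as partitions or module loads are left out because the slides do not give them.

```python
def slurm_script(job_name, nodes, tasks_per_node, gpus, time_limit, command):
    """Build the text of a SLURM batch script.

    Only generic directives are used; a real site usually also needs a
    partition or QOS and environment setup, which are not shown here.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus > 0:
        # Request GPUs per node through SLURM's generic resource mechanism.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", f"srun {command}"]
    return "\n".join(lines) + "\n"


# Hypothetical job: one node, one GPU, two hours, running a training script.
script = slurm_script(
    job_name="train-model",
    nodes=1,
    tasks_per_node=1,
    gpus=1,
    time_limit="02:00:00",
    command="python train.py",
)
print(script)
```

The resulting text would be saved to a file and submitted with `sbatch`, typically from an SSH session on a login node.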

Page 18

CESGA Data Scientist

CESGA has no in-house data scientist

CESGA offers this service through collaborations

Open to collaborations in Portugal
