Post on 29-Jan-2018
Dr. Andrés Gómez
agomez@cesga.es
Feb. 2017
Data Science - Support Infrastructures
CESGA Mission
“Contribute to the advancement of Science and Technical Knowledge, by means of research and application of high performance computing and communications, as well as other information
technology resources, in collaboration with other institutions, for the benefit of society”
CESGA activities
PT Academic Network to Geant
Our Customers
Universities (mainly from Galicia)
R&D&I centres (mainly from Galicia)
CSIC (around Spain)
Other institutions from Spain and Europe: Hospitals (only R&D)
Companies (mainly SMEs)
Other non-profit R&D&I organizations
Free access for Europeans through RES open calls and PRACE open calls
CESGA Computing Infrastructure
2,200 TB
FINIS TERRAE II: HPC, 7,712 cores
SVG: HTC and Cloud, ~3,300 cores
Online Disk: 1,200 TB
Cloud for Industry: 240 cores
BigData: 456 cores
Remote Visualisation: 80 cores
Infrastructures for Data Science
What is Big Data?
Why now:
Producing data is very cheap (sensors, people, ...)
Storage is also cheap
Data is unstructured and high-dimensional
Big Data consists of extensive datasets - primarily in the
characteristics of volume, variety, velocity, and/or variability - that
require a scalable architecture for efficient storage, manipulation,
and analysis
NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST
Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1
V’s Big Data Challenges
Volume
Velocity
Variety
Veracity
Value (added value or knowledge)
Variability
Adapted from: Demchenko, Y., Grosso, P., & Membrey, P. (2013). Addressing Big Data Issues in Scientific Data Infrastructure.
In 2013 International Conference on Collaboration Technologies and Systems (CTS) (pp. 48-55). IEEE.
http://doi.org/10.1109/CTS.2013.6567203
What is Data Science?
Data science is the extraction of actionable knowledge directly
from data through a process of discovery, or hypothesis
formulation and hypothesis testing.
NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 1, Definitions. NIST
Special Publication (Vol. 1). Gaithersburg, MD. Retrieved from http://dx.doi.org/10.6028/NIST.SP.1500-1
Data Scientist: A Champion!
Collaboration is better
Architecture
NIST Big Data Public Working Group. (2015). NIST Big Data Interoperability Framework: Volume 6, Reference Architecture. NIST Special Publication (Vol. 6).
Gaithersburg, MD. Retrieved from http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-6.pdf
Big Data Requirements
Very large storage (TB, PB, EB, ...)
Parallel, very fast I/O (GB/s)
Computing capacity (move the processing to the data)
Parallel processing
Interactive, streamed and batch processing
Visualisation (first step of data analysis)
Advanced data analytics and ML packages
Remote access
Etc.
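To give a feel for why parallel, very fast I/O appears next to very large storage in this list, the rough sketch below estimates how long a single full scan of a dataset takes at a given throughput. The function name, the decimal units (1 TB = 1000 GB), the assumption of perfect scaling across streams, and the sample figures are all illustrative assumptions, not CESGA measurements.

```python
def scan_time_hours(dataset_tb: float, gb_per_s_per_stream: float, streams: int = 1) -> float:
    """Hours needed to read `dataset_tb` terabytes once.

    Assumes decimal units (1 TB = 1000 GB) and ideal, linear scaling
    across `streams` parallel readers, which real systems rarely reach.
    """
    if gb_per_s_per_stream <= 0 or streams < 1:
        raise ValueError("throughput must be positive and streams >= 1")
    total_gb = dataset_tb * 1000.0
    seconds = total_gb / (gb_per_s_per_stream * streams)
    return seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical example: a 1 PB (1000 TB) dataset read at 1 GB/s per stream.
    for streams in (1, 10, 100):
        hours = scan_time_hours(1000.0, 1.0, streams)
        print(f"{streams:>3} streams: {hours:8.1f} h")
```

Under these assumptions a single stream needs more than eleven days for one pass over a petabyte, which is why the requirements pair storage capacity with parallel I/O and with moving the processing to the data.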
HETEROGENEOUS NEEDS & USER PROFILES
HETEROGENEOUS INFRASTRUCTURE & ACCESS MODES
CESGA Solution: Static
Based on Hortonworks HDP
HARDWARE PLATFORM FOR BIG DATA
HDFS
YARN
MAP REDUCE, HBASE, SPARK, HIVE
Jupyter/Hue/Zeppelin/R
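The MapReduce layer of this stack follows a simple programming model: map each record to key/value pairs, group the pairs by key, then reduce each group. The sketch below imitates that model in a single Python process for a word count. It does not use the Hadoop or Spark APIs, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big storage"]))
```

On a real cluster the map and reduce steps run in parallel on the nodes that hold the data blocks, and the framework performs the shuffle over the network.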
CESGA Solution: Dynamic
Create your own cluster for Data Science
HARDWARE PLATFORM FOR BIG DATA
DOCKER
MESOS
Your
Config
Cluster
Cassandra, SPARK, SciDB
PaaS API
WEB Interface
CESGA Solution: HPC
When data processing needs large computing capacity
HARDWARE PLATFORM FOR HPC + GPUs
HIGH PERFORMANCE STORAGE: LUSTRE
HIGH SPEED COMM: IB
Theano, TensorFlow, R, Caffe
SLURM
WEB Interface / Remote Desktop / SSH
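Work on a SLURM-managed platform is normally submitted as a batch script. As a minimal sketch, the helper below builds such a script as text using common `#SBATCH` directives. The helper name, the default resource values and the `train.py` command are hypothetical, and site-specific settings (partitions, modules, how GPUs are requested at CESGA) are assumptions left out here.

```python
def slurm_script(job_name: str, command: str, nodes: int = 1,
                 cpus_per_task: int = 4, gpus: int = 0,
                 time_limit: str = "01:00:00") -> str:
    """Return the text of a SLURM batch script (illustrative defaults)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus > 0:
        # Generic-resource request; the GPU resource name is site-dependent.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical training job on one node with one GPU.
    print(slurm_script("train", "python train.py", gpus=1))
```

The resulting text would be saved to a file and submitted with `sbatch`; the cluster's own documentation determines the directives that are actually required.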
CESGA Data Scientist
CESGA has no Data Scientists
CESGA offers this service through collaborations
Open to collaborations in Portugal