Analytics in the cloud

Analytics in the Cloud Natalino Busa - Head of Data Science

Transcript of Analytics in the cloud

Page 1: Analytics in the cloud

Analytics in the Cloud

Natalino Busa - Head of Data Science

Page 2: Analytics in the cloud

2 Natalino Busa - @natbusa

Distributed computing, Machine Learning

Statistics, Big/Fast Data, Streaming Computing

Head of Applied Data Science at Teradata

On most networks:

@natbusa

Page 3: Analytics in the cloud

Let’s define Cloud Services

Page 4: Analytics in the cloud

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

Page 5: Analytics in the cloud

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

IaaS: Virtual Resources

Page 6: Analytics in the cloud

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

IaaS: Virtual Resources

CaaS: Containers,

Page 7: Analytics in the cloud

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

IaaS: Virtual Resources

CaaS: Containers,

dPaaS: Datastores, Data Engines

iPaaS: Tools Integration, Flows & Processes

Page 8: Analytics in the cloud

Analytics in the cloud: stacking layers

Bare Metal: Physical Machines

IaaS: Virtual Resources

CaaS: Containers,

dPaaS: Datastores, Data Engines

iPaaS: Tools Integration, Flows & Processes

DAaaS: Data Analytics as a Service

Watson Services, Azure ML, Google Cloud ML, BigML

Page 9: Analytics in the cloud

Analytics in the cloud: today’s talk

Bare Metal: Physical Machines

IaaS: Virtual Resources

CaaS: Containers,

dPaaS: Datastores, Data Engines

iPaaS: Tools Integration, Flows & Processes

DAaaS: Data Analytics as a Service

Page 10: Analytics in the cloud

“we live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”

Sam Ramji, CEO of Cloud Foundry

https://www.youtube.com/watch?v=7oCSFcUW-Qk

Page 11: Analytics in the cloud

Containers vs VMs

Page 12: Analytics in the cloud

Techs based on Containers

YARN

Page 13: Analytics in the cloud

Containers as a Service

https://aws.amazon.com/ecs/

For example: Amazon ECS

Page 14: Analytics in the cloud

CaaS: 6 offerings

https://www.linux.com/news/5-container-service-tools-you-should-know-about

Project Magnum

Amazon ECS

Docker DataCenter

Google Container Engine

Page 15: Analytics in the cloud

Most new PaaS solutions are containerized

Page 16: Analytics in the cloud

PaaS: Big Data SQL Queries

Batch Oriented, Large Aggregations

Interactive Queries, Data Exploration

Interactive Queries, Machine Learning

Streaming: Micro-batching

Interactive Queries, Machine Learning

Streaming: Event-driven

Page 17: Analytics in the cloud

Advanced Analytics: models and algorithms

Page 18: Analytics in the cloud

PaaS: Advanced Analytics

Graph analytics:

- Cluster items
- Extract similarities
- Detect patterns

Page 19: Analytics in the cloud

PaaS: Advanced Analytics

Text analytics:

- Sentiment Analysis
- Language Detection
- Summarization
- Entity extraction

Page 20: Analytics in the cloud

PaaS: Advanced Analytics

Machine Learning:

- Classification
- Regression
- Clustering
- Forecasting
- Anomaly detection

Page 21: Analytics in the cloud

PaaS: Advanced Analytics

AI and Deep Learning:

- Unstructured Data
- Object Detection
- Natural Language Processing
- Video Summarization
- Speech Recognition

Page 22: Analytics in the cloud

PaaS: Advanced Analytics

SQL + Graph + Text + Machine Learning + Voice/Image/Video

Page 23: Analytics in the cloud

dPaaS: Machine (deep) Learning

… these are just a few examples …

Page 24: Analytics in the cloud

Analytics Everywhere

Public Cloud, Managed Cloud, Private Cloud, Private Infra

Page 25: Analytics in the cloud

iPaaS: Components for Analytics in the Cloud

SQL: Big Data, Data Warehousing

NoSQL

Machine Learning

Object Stores

Streaming Computing

SQL: Relational, Transactional DB

Page 26: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

Page 27: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

NoSQL
Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
BigTable (GCP), DynamoDB

Page 28: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

SQL: Relational, Transactional DB
MySQL, PostgreSQL, MariaDB
Oracle (AWS MP)

NoSQL
Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
BigTable (GCP), DynamoDB

Page 29: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

SQL: Relational, Transactional DB
MySQL, PostgreSQL, MariaDB
Oracle (AWS MP)

SQL: Big Data, Data Warehousing
Hive, Presto, Spark SQL, Impala
Redshift (AWS), BigQuery (GCP), Big SQL (IBM)
Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP)

NoSQL
Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
BigTable (GCP), DynamoDB

Page 30: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

SQL: Relational, Transactional DB
MySQL, PostgreSQL, MariaDB
Oracle (AWS MP)

SQL: Big Data, Data Warehousing
Hive, Presto, Spark SQL, Impala
Redshift (AWS), BigQuery (GCP), Big SQL (IBM)
Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP)

NoSQL
Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
BigTable (GCP), DynamoDB

Machine Learning
Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost
Azure ML, AWS ML, Google ML, IBM Watson

Page 31: Analytics in the cloud

iPaaS, dPaaS:

Object Stores
HDFS, GlusterFS, CephFS, NFS, Swift, Nova, Cassandra, Redis
S3 (AWS), Storage (GCP), ...

SQL: Relational, Transactional DB
MySQL, PostgreSQL, MariaDB
Oracle (AWS MP)

SQL: Big Data, Data Warehousing
Hive, Presto, Spark SQL, Impala
Redshift (AWS), BigQuery (GCP), Big SQL (IBM)
Teradata (AWS MP), SAP Hana (AWS MP), Vertica (AWS MP)

NoSQL
Cassandra, Redis, HBase, Accumulo, Neo4J, ElasticSearch, MongoDB, Couchbase
BigTable (GCP), DynamoDB

Machine Learning
Spark ML, H2O, Flink, Aerosolve, Theano, TensorFlow, XGBoost
Azure ML, AWS ML, Google ML, IBM Watson

Streaming Computing
Heron (Storm), NiFi, Spark Streaming, Flink, Kafka Streams, Logstash, StreamSQL
Google DataFlow (GCP)

Page 32: Analytics in the cloud

iPaaS: Selecting your Analytical Stack

Flexible. Powerful.
- Combinations for this example: 8 * 3 * 4 * 8 * 7 * 7 = 37632

Right tool for the right job
- Fit for purpose
- Multi-Genre Analytics

Hard to maintain and upgrade:
- Extended skills and know-how
- Component upgrades must be compatible

Hard to configure:
- No matter if cloud, bare metal, or VMs
- Complex stacks with many tools and services
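The combination count on this slide can be checked with a short sketch. Only the six factors come from the slide; the category names attached to them are an assumption, read off the component lists on the preceding slide.

```python
from math import prod

# Number of candidate tools per component category.
# The six factors are from the slide; the category labels are assumed.
choices = {
    "object_stores": 8,
    "sql_relational": 3,
    "sql_big_data": 4,
    "nosql": 8,
    "machine_learning": 7,
    "streaming": 7,
}

# One tool is picked per category, independently, so the counts multiply.
total_stacks = prod(choices.values())
print(total_stacks)
```

Adding one more candidate to any category multiplies the total again, which is why the configuration space grows so quickly.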

Page 33: Analytics in the cloud

iPaaS: Deploy & Manage your own Analytics

How to simplify? Select a bundle!

Page 34: Analytics in the cloud

iPaaS: bundled recipes & stacks

Select a recipe:
- Hortonworks Data Platform
- Cloudera Data Platform
- Reactive Platform
- Smack Stack
- Pancake Stack
- ELK Stack
- Select your own

Page 35: Analytics in the cloud

iPaaS: my favorite analytical stacks

Stack | Object Stores | NoSQL | SQL: Big Data, Data Warehousing | Machine Learning | Streaming Computing
All Hadoop (5) | HDFS | HBase | Hive | Spark | Storm
Smack stack (2) | Cassandra | Cassandra | Spark | Spark | Spark
Elastic (5) | HDFS | ElasticSearch | Hive | H2O | Kafka
Data Science (8) | HDFS | ElasticSearch | Hive, Presto | Spark, H2O, TensorFlow | Flink
Real Time (2) | Cassandra | Cassandra | Flink | Flink | Flink
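The number in parentheses after each stack name appears to be the count of distinct components that stack needs to operate. The slide does not say so explicitly, so treat that reading as an assumption. A minimal sketch that recomputes the counts from the components listed above:

```python
# Components per stack, in the slide's category order:
# object store, NoSQL, big data SQL, machine learning, streaming.
stacks = {
    "All Hadoop": ["HDFS", "HBase", "Hive", "Spark", "Storm"],
    "Smack stack": ["Cassandra", "Cassandra", "Spark", "Spark", "Spark"],
    "Elastic": ["HDFS", "ElasticSearch", "Hive", "H2O", "Kafka"],
    "Data Science": ["HDFS", "ElasticSearch", "Hive", "Presto",
                     "Spark", "H2O", "TensorFlow", "Flink"],
    "Real Time": ["Cassandra", "Cassandra", "Flink", "Flink", "Flink"],
}

# A tool that covers several categories is counted only once.
distinct_counts = {name: len(set(tools)) for name, tools in stacks.items()}

for name, count in distinct_counts.items():
    print(f"{name}: {count} distinct components")
```

Under this reading, the Smack and Real Time stacks are the lightest to operate because one tool covers several categories.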

Page 36: Analytics in the cloud

dPaaS: Managed Analytics

This is hard! Can we access it as a service?

Page 37: Analytics in the cloud

dPaaS: Managed Hadoop & Spark

HDInsight: Hadoop, Spark, and R as services

Managed Spark Clusters, BigInsights (Hadoop)

DataFlow and DataProc: Flink, Spark and Hadoop Clusters as a Service

EMR: Hadoop components a la carte

Page 38: Analytics in the cloud

PaaS: Analytical clusters

Ephemeral
- Create then Dispose
- Clusters are Short-Lived
- Data Exploration
- Isolated, Personal
- Simple Access Management
- Interactive Analytics

vs

Permanent
- Clusters are Long-Lived
- Scheduled Operations
- Production ETL
- Co-Ordinated
- Complex Access Management
- Batch Analytics
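The "create then dispose" lifecycle of an ephemeral cluster maps naturally onto a scoped resource. The sketch below is illustrative only: `ephemeral_cluster` is a hypothetical helper, and the log entries stand in for whatever create and delete calls a given cloud provider exposes.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_cluster(name, log):
    """Hypothetical lifecycle: create a cluster, lend it out, always dispose of it."""
    log.append(f"create {name}")       # stand-in for a provider's create call
    try:
        yield name                     # the caller runs its analysis here
    finally:
        log.append(f"dispose {name}")  # runs even if the analysis fails


events = []
with ephemeral_cluster("exploration", events) as cluster:
    events.append(f"query on {cluster}")

print(events)
```

A permanent cluster would instead be created once and shared by scheduled jobs, so its lifetime is not tied to any single analysis.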

Page 39: Analytics in the cloud

DAaaS: Microsoft’s Cortana and ML Studio

Page 40: Analytics in the cloud

DAaaS: IBM Watson

Page 41: Analytics in the cloud

DAaaS: Google ML and AI as a service

Cloud Computing for Deep Neural Networks > Train, Score, Data

AI and ML models for:

● Speech (audio)
● Language (text)
● Vision (images/video)

Page 42: Analytics in the cloud

Summary

• Analytics in the Cloud:

The dawn of a new computing era

• iPaaS, dPaaS:

complexity vs flexibility, it’s a tradeoff

• Computing clusters:

Ephemeral and Persistent

Page 43: Analytics in the cloud

Head of Applied Data Science at Teradata

Distributed computing, Machine Learning

Statistics, Big/Fast Data, Streaming Computing

LinkedIn and Twitter:

natbusa