20141021 AWS Cloud Taekwon - Big Data on AWS

Post on 26-Jun-2015

Presentation by Markku Lepisto, AWS APAC Principal Technology Evangelist.

Transcript of 20141021 AWS Cloud Taekwon - Big Data on AWS

Big Data on AWS Markku Lepistö Principal Technology Evangelist @markkulepisto

Does this Data make me look big?

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Getting your Data into AWS

Amazon S3

Corporate Data Center

• Console Upload

• FTP

• AWS Import/Export

• S3 API

• Direct Connect

• Storage Gateway

• 3rd Party Commercial Apps

• Tsunami UDP
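For large files sent through the S3 API, uploads are typically split into parts. The sketch below only plans the byte ranges for such a multipart upload and makes no network calls. The function name `plan_parts` and the 8 MiB default are illustrative assumptions, not from the talk; the 5 MiB minimum part size and the 10,000-part cap are the published S3 multipart limits.

```python
MIN_PART = 5 * 1024 * 1024   # S3 minimum part size (every part except the last)
MAX_PARTS = 10_000           # S3 maximum number of parts in one multipart upload


def plan_parts(object_size, part_size=8 * 1024 * 1024):
    """Return a list of (part_number, start_byte, end_byte_exclusive) tuples."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(part_size, MIN_PART)
    # Grow the part size if the object would otherwise need more than MAX_PARTS parts.
    smallest_allowed = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    start = 0
    number = 1
    while start < object_size:
        end = min(start + part_size, object_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts
```

Each planned range would then be read from the local file and sent as one part; parts are independent, so they can be uploaded in parallel and retried individually.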

1

Write directly to a data source

Your application Amazon S3

DynamoDB

Any other data store

Amazon S3

Amazon EC2

2

Zero Admin NoSQL Service

Unlimited Storage

Provisioned Throughput

Consistent <10ms response

Durable on SSD

Services: Database: Amazon DynamoDB

Compute Storage

AWS Global Infrastructure

Database

Networking
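"Provisioned Throughput" means capacity is reserved up front in read and write units. A rough sizing sketch follows, assuming the documented unit definitions for provisioned mode (one write unit covers one write per second of an item up to 1 KB; one read unit covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads). The function names are hypothetical.

```python
import math


def write_capacity_units(writes_per_second, item_size_bytes):
    # Each write consumes one unit per started 1 KB of item size.
    return writes_per_second * math.ceil(item_size_bytes / 1024)


def read_capacity_units(reads_per_second, item_size_bytes, strongly_consistent=True):
    # Each strongly consistent read consumes one unit per started 4 KB of item size.
    total = reads_per_second * math.ceil(item_size_bytes / 4096)
    if not strongly_consistent:
        # Eventually consistent reads cost half as much.
        total = math.ceil(total / 2)
    return total
```

For example, 100 writes per second of 1,500-byte items would need about 200 write units under these assumptions.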

Queue, pre-process and then write to data source

Amazon Simple Queue Service

(SQS)

Amazon S3

DynamoDB

Any other data store
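The queue-then-write pattern can be sketched locally. Here Python's in-memory `queue.Queue` stands in for SQS and a plain list stands in for the data store; both are assumptions for illustration, as are the `preprocess` rules (drop events without a user, normalize the user name).

```python
import json
import queue


def preprocess(raw_message):
    """Parse one raw JSON message; return a cleaned event, or None to drop it."""
    event = json.loads(raw_message)
    if "user" not in event:
        return None
    event["user"] = event["user"].strip().lower()
    return event


def drain(message_queue, store):
    """Read every waiting message, pre-process it, and append it to the store."""
    written = 0
    while True:
        try:
            raw = message_queue.get_nowait()
        except queue.Empty:
            break
        event = preprocess(raw)
        if event is not None:
            store.append(event)
            written += 1
    return written
```

The queue decouples producers from the writer: producers keep enqueuing even if the store is slow, and the worker decides what is worth persisting.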

3

Aggregate and write to data source

Flume running on EC2

Amazon S3

Any other data store

HDFS
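The aggregate-and-write step can be illustrated with a small buffering class. This is not Flume's API; `LogAggregator` and its `sink` callback are hypothetical names for a sketch of the batching idea: collect many small log lines and write them out as fewer, larger objects.

```python
class LogAggregator:
    """Buffer log lines and hand them to a sink in batches."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink      # callable that receives one newline-joined batch
        self.buffer = []

    def append(self, line):
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write out whatever is buffered, even if the batch is not full.
        if self.buffer:
            self.sink("\n".join(self.buffer))
            self.buffer = []
```

In a real deployment the sink would write to S3, HDFS, or another store; calling `flush()` on shutdown avoids losing a partial batch.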

4

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Choose depending upon design

Courtesy http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

S3 as a “single source of truth”

S3

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Hadoop based Analysis

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

EMR is Hadoop in the Cloud

What is Amazon Elastic MapReduce (EMR)?

EMR Cluster

S3

Put the data into S3

Choose: Hadoop distribution, # of nodes, types of nodes, custom configs, Hive/Pig/etc.

Get the output from S3

Launch the cluster using the EMR console, CLI, SDK, or APIs

You can also store everything in HDFS
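Since EMR runs Hadoop, the jobs it executes follow the map, shuffle, reduce model. A minimal local simulation of that model, using word count as the job, is sketched below. Nothing here calls EMR or Hadoop; `mapper`, `reducer`, and `run_job` are illustrative names.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    # Combine all values that were emitted for one key.
    return word, sum(counts)


def run_job(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)   # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))
```

On a cluster, the input lines would come from S3 or HDFS, many mappers and reducers would run in parallel, and the result would be written back to S3.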

How does EMR work?

S3

What can you run on EMR…

EMR Cluster

SQL based processing

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Pre-processing framework

Petabyte-scale columnar data warehouse

Amazon Redshift

• Easily and rapidly analyze petabytes of data

• Fully managed data warehouse service

• Automated deployment and administration

• 1/10th the cost of traditional data warehouses

• Less than $1,000 per terabyte per year

• Compatible with popular BI tools
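Why a columnar layout suits analytics can be shown with a toy example: an aggregate over one column touches only that column's values. The sample rows and the helper names `to_columns` and `sum_column` are made up for illustration; this is a sketch of the idea, not of how Redshift stores data internally.

```python
rows = [
    {"order_id": 1, "region": "APAC", "amount": 120.0},
    {"order_id": 2, "region": "EMEA", "amount": 80.0},
    {"order_id": 3, "region": "APAC", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column(columns, name):
    # Only the requested column is read; the other columns are never touched.
    return sum(columns[name])
```

In a row store, the same sum would have to read every full row, including fields the query never uses.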

Services: Database: Amazon Redshift

Compute Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Your choice of BI Tools on the cloud

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Pre-processing framework

Generation

Collection & storage

Analytics & computation

Collaboration & sharing

Collaboration and Sharing insights

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Sharing results and visualizations at scale

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Web App Server Visualization tools

Rinse and Repeat every day or hour

Rinse and Repeat

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

The complete architecture

Amazon SQS

Amazon S3

DynamoDB

Any SQL or NoSQL Store

Log Aggregation tools

Amazon EMR

Amazon Redshift

Visualization tools

Business Intelligence Tools

Business Intelligence Tools

GIS tools on Hadoop

GIS tools

AWS Data Pipeline

No, it isn’t!

What about Real-Time?

Faster data is better data

HAPPENING NOW! real-time == stream analytics

Ingest data streams Store durably

Distribute Scale out

Process as packets flow in

Realtime Analytics in the Cloud

Amazon Kinesis Streaming Data Service

Kinesis architecture
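How a stream "distributes" and "scales out" can be sketched through shard routing. Kinesis documents that a record's partition key is hashed with MD5 into a 128-bit integer, and that each shard owns a range of that hash space. The sketch below assumes the shards split the space evenly; `shard_ranges` and `shard_for` are illustrative names, and no AWS calls are made.

```python
import hashlib

HASH_SPACE = 2 ** 128   # MD5 produces 128-bit values


def shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized (start, end) ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("ranges do not cover the hash space")
```

Records with the same partition key always land on the same shard, which preserves per-key ordering; adding shards narrows each range and spreads the load.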

Clash of Clans

In-game activity

Amazon Kinesis

Kinesis-enabled apps on EC2

Real-time clickstream processing app

EC2: In-game engagement trends dashboard

S3

Aggregate statistics

Glacier

Clickstream archive

EC2 Data Warehouse

Business-intelligence user

• Kinesis: Real-time data stream of in-game activity

• Multiple Kinesis applications: Dashboards, analytics and storage

• S3 and Glacier: Data storage and long-term archival

• Data Warehouse: BI reporting and interactive queries

Demo

Sliding Window Analytics Live Dashboard

S3 Storage Redshift Data Warehouse

Kinesis

Website Clickstream

logs
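The sliding-window idea behind such a dashboard can be sketched with a deque of event timestamps. This is a generic illustration, not the demo's code: `SlidingWindowCounter` is a hypothetical name, timestamps are plain seconds, and events are assumed to arrive in non-decreasing time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events that occurred within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()

    def add(self, timestamp):
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.events)

    def _evict(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
```

A dashboard would call `count()` on a timer to plot, for example, clicks in the last minute as new clickstream records flow in.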

AWS Cloud Taekwon

Bonus

Internet of Things

Smart Devices

Powered by the Cloud

Smart Devices

Powered by the Cloud

          Arduino Uno     Raspberry Pi
CPU       20 MHz 8-bit    700 MHz 32-bit
Memory    2 KB            512 MB
Storage   32 KB           SD card

Smart Devices

Powered by the Cloud

Camera Microphone

Thermometer

Distance

GPS

Gyroscope

Actuator

Relay

Motor

Manipulator

Switch Pressure

Accelerometer

Wheel Propeller

Rotor

Challenges

Thousands – Millions of Devices / Producers

Thousands – Millions of Users / Consumers

Distributed

At scale

Smart Devices

Powered by the Cloud

Unlimited Storage – Memory

Unlimited Compute – Logic

Demo

Arduino Yún

Raspberry Pi

Spark Core

Accelerometer

MQTT

Mosquitto MQTT Broker MQTT-Kinesis Bridge

AWS SDK

Amazon Kinesis Real-time Streaming Data Service

AWS APIs

AWS Elastic Beanstalk

Dashboard

Amazon SNS

Earthquake Alerts

Mobile Push
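The MQTT-to-Kinesis bridge in this demo can be approximated as a function that turns one MQTT message into the parameters of a Kinesis PutRecord call (`StreamName`, `Data`, `PartitionKey`). Everything else below is an assumption for illustration: the topic layout `sensors/<device>/accel`, the stream name, the JSON payload fields, and the 1.5 g alert threshold. No MQTT or AWS library is called.

```python
import json
import math


def to_kinesis_record(topic, payload, stream_name="sensor-stream"):
    """Build PutRecord parameters from one MQTT message (topic and JSON payload)."""
    reading = json.loads(payload)
    device_id = topic.split("/")[1]   # assumed topic layout: sensors/<device>/accel
    reading["device_id"] = device_id
    return {
        "StreamName": stream_name,
        "Data": json.dumps(reading, sort_keys=True).encode("utf-8"),
        "PartitionKey": device_id,    # keeps one device's readings on one shard
    }


def is_shake(reading, threshold_g=1.5):
    """Return True if the acceleration magnitude exceeds the (assumed) threshold."""
    magnitude = math.sqrt(reading["x"] ** 2 + reading["y"] ** 2 + reading["z"] ** 2)
    return magnitude > threshold_g
```

A consumer reading the stream could apply a check like `is_shake` to each record and publish an alert through a notification service when it fires.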

Demo

COLLECT | STORE | ANALYZE | SHARE

Import Export

Glacier

S3 EC2

Redshift DynamoDB

EMR

Data Pipeline

S3 Direct Connect

Kinesis

The AWS Big Data Portfolio

CloudFront
