Large Scale Data Analytics on AWS

Large Scale Data Analytics on AWS Ian Meyers, David Elliott, Denis Batalov Solution Architects, EMEA


Page 1: Large Scale Data Analytics on AWS

Large Scale Data Analytics on AWS

Ian Meyers, David Elliott, Denis Batalov

Solution Architects, EMEA

Page 2: Large Scale Data Analytics on AWS

Agenda

2:00pm – 3:00pm - AWS & Analytics Services Overview

3:00pm – 4:30pm - Machine Learning with AWS Demonstration

4:30pm – 5:00pm - Break

5:00pm – 6:00pm - Data Analytics Platform Demonstration

Page 3: Large Scale Data Analytics on AWS

WHY BUILD LARGE SCALE ANALYTICS

APPLICATIONS ON AWS?

Page 4: Large Scale Data Analytics on AWS

It’s never been easier and less expensive to

collect, store, analyse & share data

Page 5: Large Scale Data Analytics on AWS

We are constantly producing more data

Page 6: Large Scale Data Analytics on AWS

From all types of industries

Page 7: Large Scale Data Analytics on AWS

From a diverse range of sources

Page 8: Large Scale Data Analytics on AWS

Discovery Development Delivery

Risk Marketing Reporting Trade

Sales

Broad Analytics Use In The AWS Cloud

Page 9: Large Scale Data Analytics on AWS

CLOUD COMPUTING?

Page 10: Large Scale Data Analytics on AWS

A broad and deep platform that helps customers

build sophisticated, scalable applications

What is Cloud Computing?

Cloud Computing

Page 11: Large Scale Data Analytics on AWS

On demand Pay as you go

Uniform Available

Utility

Cloud Computing

Page 12: Large Scale Data Analytics on AWS

Infrastructure

Cloud Computing

Page 13: Large Scale Data Analytics on AWS

Compute

Database

Load Balancing

Networking

Storage

Analytics

Messaging

Email

Monitoring

Content Distribution

Security

DNS

Cloud Computing

Page 14: Large Scale Data Analytics on AWS

Availability Zones

Global Infrastructure

Page 15: Large Scale Data Analytics on AWS

US-WEST (Oregon)

EU-WEST (Ireland)

ASIA PAC (Tokyo)

US-WEST (N. California)

SOUTH AMERICA (Sao Paulo)

US-EAST (Virginia)

AWS GovCloud (US)

ASIA PAC (Sydney)

ASIA PAC (Singapore)

ASIA PAC (Beijing)

EU-CENTRAL (Frankfurt)

Availability Zones

Global Infrastructure

Page 16: Large Scale Data Analytics on AWS

Accessible via API endpoints

Global Infrastructure

Page 17: Large Scale Data Analytics on AWS

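# Launch three m3.medium instances in Availability Zone eu-west-1a
aws ec2 run-instances \
  --image-id ami-a813fadf \
  --count 3 \
  --placement AvailabilityZone=eu-west-1a \
  --instance-type m3.medium

# Launch five m3.large instances in Availability Zone eu-west-1c
aws ec2 run-instances \
  --image-id ami-a813fadf \
  --count 5 \
  --placement AvailabilityZone=eu-west-1c \
  --instance-type m3.large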

Global Infrastructure
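The Availability Zone names used in the commands on this slide embed the region, and each region is reached through its own API endpoint. A minimal Python sketch of that naming relationship is shown here. The helper names `region_of` and `ec2_endpoint` are hypothetical, and the hostname pattern is an assumption based on the standard regional EC2 endpoint format (China regions use a different domain suffix).

```python
def region_of(availability_zone: str) -> str:
    """Derive the region from an Availability Zone name, e.g. eu-west-1a -> eu-west-1."""
    if len(availability_zone) < 2 or not availability_zone[-1].isalpha():
        raise ValueError(f"not an Availability Zone name: {availability_zone!r}")
    # An AZ name is the region name plus a single trailing letter.
    return availability_zone[:-1]


def ec2_endpoint(region: str) -> str:
    """Build the regional EC2 API endpoint URL (assumed standard pattern)."""
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"https://ec2.{region}.{suffix}"


if __name__ == "__main__":
    for az in ("eu-west-1a", "eu-west-1c"):
        print(az, "->", ec2_endpoint(region_of(az)))
```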

Page 18: Large Scale Data Analytics on AWS

Traditional IT capacity

Your actual capacity needs

[Chart axes: Capacity against Time]

Elastic Capacity (or lack of in this case)

Elasticity

Page 19: Large Scale Data Analytics on AWS

On and Off Fast Growth

Variable peaks Predictable peaks

Elastic Capacity (or lack of in this case)

Elasticity

Page 20: Large Scale Data Analytics on AWS

On and Off Fast Growth

Predictable peaks Variable peaks

Waste

Customer Dissatisfaction

Elastic Capacity (or lack of in this case)

Elasticity

Page 21: Large Scale Data Analytics on AWS

On and Off Fast Growth

Predictable peaks Variable peaks

Elastic Capacity

Elasticity

Page 22: Large Scale Data Analytics on AWS

From One Instance

Elasticity

Page 23: Large Scale Data Analytics on AWS

To Thousands

Elasticity

Page 24: Large Scale Data Analytics on AWS

And Back Again

Elasticity

Page 25: Large Scale Data Analytics on AWS

Compute: EC2, Lambda, EC2 Container Service, Elastic Beanstalk

Storage & Content Delivery: S3, CloudFront, EFS, Glacier, Storage Gateway

Database: RDS, DynamoDB, ElastiCache, Redshift

Networking: VPC, Direct Connect, Route 53

Developer Tools: CodeCommit, CodeDeploy, CodePipeline

Management Tools: CloudWatch, CloudFormation, CloudTrail, Config, OpsWorks, Service Catalog, Trusted Advisor

Security & Identity: Identity & Access Management, Directory Service

Analytics: EMR, Data Pipeline, Kinesis, Machine Learning

Application Services: API Gateway, AppStream, CloudSearch, Elastic Transcoder, SES, SQS, SWF

Mobile Services: Cognito, Device Farm, Mobile Analytics, SNS

Enterprise Applications: WorkSpaces, WorkDocs, WorkMail

Broad Range Of Services

Page 26: Large Scale Data Analytics on AWS

https://aws.amazon.com/compliance/

Broadest Certification & Accreditations

Page 27: Large Scale Data Analytics on AWS

DATA INGESTION & STORAGE

Page 28: Large Scale Data Analytics on AWS

Makes it easy to establish a dedicated network connection from your premises to AWS

Establish private connectivity between AWS & your datacenter, office, or colocation environment

Reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience

The dedicated connection can be partitioned into multiple virtual interfaces using 802.1q VLANs

aws.amazon.com/directconnect

AWS Direct Connect

Data Ingestion & Storage

Page 29: Large Scale Data Analytics on AWS

Amazon S3

Secure, durable, highly-scalable object storage

Accessible via a simple web services interface

Store & retrieve any amount of data

Use alone or together with other AWS services

Different Tiers: Standard, Infrequent Access,

Reduced Redundancy, Glacier

Data Ingestion & Storage

Page 30: Large Scale Data Analytics on AWS

Elastic Block Store: High performance block storage device

1GB to 1TB in size

Mount as drives to instances with snapshot/cloning functionalities

Simple Storage Service: Highly scalable object storage for the internet

1 byte to 5TB in size

99.999999999% durability

Availability: 99.99%

Durability: 99.999999999%

Is a Web Store: Not a file system

No Single Points of Failure: Eventually consistent

Paradigm: Object store

Performance: Very Fast

Redundancy: Across Availability Zones

Security: Public Key / Private Key

Pricing: $0.03/GB/month

Typical use case: Write once, read many

Limits: 100 Buckets, Unlimited Storage, 5TB Objects

Page 31: Large Scale Data Analytics on AWS

Amazon S3 Multipart Upload

Large file (Size < 5TB)

Split file into parts

Send parts to S3

S3 rejoins the parts

Large object (Size < 5TB)

Data Ingestion & Storage
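The split, send, and rejoin flow on this slide can be sketched as a planning step that decides which byte range goes into which part. This is an illustrative Python sketch only: the function name `plan_parts` and the 64 MiB default are assumptions, and the constants reflect the documented S3 multipart limits (5 MiB minimum part size except for the last part, at most 10,000 parts). It does not call S3.

```python
MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB   # S3 minimum part size (the final part may be smaller)
MAX_PARTS = 10_000        # S3 maximum number of parts in one multipart upload


def plan_parts(total_size: int, part_size: int = 64 * MIB):
    """Return a list of (part_number, offset, length) tuples covering the file."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum part size")

    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1

    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts
```

Each tuple would correspond to one part upload, and the service reassembles the parts in part-number order once the upload is completed.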

Page 32: Large Scale Data Analytics on AWS

Simple Storage Service: Highly scalable object storage

Glacier: Long term object archive

Data Ingestion & Storage

Lifecycle Management

Page 33: Large Scale Data Analytics on AWS

Persistent block level storage volumes

For use with Amazon EC2 instances

Automatically replicated within Availability Zones

Offer consistent and low-latency performance

EC2 Instance

EBS Volume

EBS Snapshot (stored on S3)

aws.amazon.com/ebs

Data Ingestion & Storage

Amazon Elastic Block Store

Page 34: Large Scale Data Analytics on AWS

AWS Import/Export

Move large amounts of data into and out of the AWS cloud using portable storage devices

Transfer your data directly onto and off of storage devices using Amazon’s high-speed internal network

For significant data sets, AWS Import/Export is often faster than Internet transfer and more cost effective than upgrading your connectivity

Supports upload & download from S3 & upload to Amazon EBS snapshots & Amazon Glacier Vaults

aws.amazon.com/importexport/

Data Ingestion & Storage

Page 35: Large Scale Data Analytics on AWS

An on-premises software appliance connecting with cloud-based storage

Supports industry-standard storage protocols that work with your existing applications and workflows

Provides low-latency performance by maintaining frequently accessed data on-premises while securely storing all of your data encrypted in Amazon S3 or Amazon Glacier

aws.amazon.com/storagegateway/

AWS Storage Gateway

Data Ingestion & Storage

Page 36: Large Scale Data Analytics on AWS

A fully managed, cloud-based service for real-time data processing over large, distributed data streams

Continuously capture and store terabytes of data per hour from hundreds of thousands of sources

Emit data to other streams and other AWS services such as Amazon S3, Amazon Redshift, Amazon Elastic MapReduce (Amazon EMR), and Amazon DynamoDB

Elastically Add and Remove Shards for Performance

Use the Kinesis Client Library to Process Data

aws.amazon.com/kinesis

Amazon Kinesis

Data Ingestion & Storage
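Adding and removing shards is how a stream is sized for throughput. The following Python sketch estimates a shard count from expected traffic. It assumes the per-shard limits documented for Kinesis streams (1 MB/s and 1,000 records/s on ingest, 2 MB/s on read) and treats 1 MB as 10^6 bytes; the function name and parameters are hypothetical, and current limits should be checked before relying on the numbers.

```python
import math

# Assumed per-shard limits (verify against current Kinesis documentation).
INGEST_BYTES_PER_SHARD = 1_000_000    # 1 MB/s write
INGEST_RECORDS_PER_SHARD = 1_000      # 1,000 records/s write
EGRESS_BYTES_PER_SHARD = 2_000_000    # 2 MB/s read, shared by all consumers


def shards_needed(bytes_per_sec: float, records_per_sec: float, consumers: int = 1) -> int:
    """Estimate the number of shards required for the given traffic."""
    by_ingest_bytes = bytes_per_sec / INGEST_BYTES_PER_SHARD
    by_ingest_records = records_per_sec / INGEST_RECORDS_PER_SHARD
    # Every consumer application reads the full stream.
    by_egress = bytes_per_sec * consumers / EGRESS_BYTES_PER_SHARD
    return max(1, math.ceil(max(by_ingest_bytes, by_ingest_records, by_egress)))


if __name__ == "__main__":
    one_tb_per_hour = 1e12 / 3600
    print(shards_needed(one_tb_per_hour, records_per_sec=50_000))
```

Under these assumptions the binding constraint is whichever of the three ratios is largest, which is why small records tend to be limited by record count and large ones by bytes.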

Page 37: Large Scale Data Analytics on AWS

Millions of sources

producing 100s of TB per hour

Front End

Authentication

Authorization

Durable, consistent replicas across three AWS Availability Zones (AZ) within an Amazon Web Services Region

Inexpensive: $0.0165 per million PUT Payload Units (in EU Ireland)

Aggregate and archive to S3

Real-time dashboards and alarms

Machine learning algorithms

Aggregate analysis in Hadoop or a data warehouse

Ordered stream of events supporting multiple readers

Data Ingestion & Storage

AWS Kinesis Architecture
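The PUT Payload Unit price quoted on this slide can be turned into a rough ingest cost. The sketch below is a hedged estimate in Python: it assumes a PUT Payload Unit is one 25 KB chunk of a record (taken here as 25 × 1024 bytes), uses the slide's EU Ireland price, and ignores the separate per-shard hourly charge. The function names are hypothetical.

```python
import math

PAYLOAD_UNIT_BYTES = 25 * 1024     # assumed size of one PUT Payload Unit
PRICE_PER_MILLION_UNITS = 0.0165   # USD, EU Ireland, as quoted on the slide


def put_payload_units(record_bytes: int) -> int:
    """Number of PUT Payload Units consumed by a single record."""
    return max(1, math.ceil(record_bytes / PAYLOAD_UNIT_BYTES))


def put_cost_usd(record_bytes: int, records_per_sec: float, days: float) -> float:
    """Estimated PUT cost for a steady stream of equally sized records."""
    total_records = records_per_sec * 86_400 * days
    total_units = total_records * put_payload_units(record_bytes)
    return total_units / 1_000_000 * PRICE_PER_MILLION_UNITS


if __name__ == "__main__":
    # 1,000 records per second of 1 KB each, over 30 days
    print(round(put_cost_usd(1024, 1_000, 30), 2))
```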

Page 38: Large Scale Data Analytics on AWS

As a startup, using AWS has

allowed us to scale nicely and use resources without spending a lot

of capital.

Brian Langel

CTO

Dash

• Needed scalable IT resources to create an app that would offer real-time information to drivers

• Developed and deployed the Dash application on the AWS Cloud

• Streams more than 1 TB of real-time data per day using Amazon Kinesis and processes billions of entries using Amazon DynamoDB

• Scaled up to support large traffic spikes in app usage, at several thousand updates per second

• Reduced operating costs by $200,000 per year

Using AWS, Dash Streams More Than 1 TB of Real-Time Data Per Day

Find out more here: aws.amazon.com/solutions/case-studies/dash/

Page 39: Large Scale Data Analytics on AWS

Data Ingestion Ecosystem

Page 40: Large Scale Data Analytics on AWS

Log Analysis

Compute Storage

AWS Global Infrastructure

Database

App Services

Deployment & Administration

Networking

Analytics

CloudWatch Logging

Automated Log Ingestion from Amazon Linux Agents

Create Log Streams, Groups of Logs, and Log Event Types

Analyze Log Data using Search Patterns

Alarms on Application Log Events

Integration with rsyslog

Page 41: Large Scale Data Analytics on AWS

STRUCTURED DATA MANAGEMENT

Page 42: Large Scale Data Analytics on AWS

Database

Relational Database Service: Managed Oracle, MySQL, SQL Server & Aurora

DynamoDB: Managed NoSQL Database

ElastiCache: Managed In-Memory Caching

Amazon Redshift: Massively Parallel Petabyte-Scale Data Warehouse

RDS DynamoDB Redshift ElastiCache


Page 43: Large Scale Data Analytics on AWS

Database

Relational Database Service: Database-as-a-Service

No need to install or manage database instances

Scalable and fault tolerant configurations

Integration with Data Pipeline


Page 44: Large Scale Data Analytics on AWS

Database

DynamoDB: Provisioned throughput NoSQL database; single-digit millisecond latency at any scale

Fast, predictable, configurable performance

Fully distributed, fault tolerant HA architecture

Supports document, key-value and graph data models

Integration with EMR & Hive


Page 45: Large Scale Data Analytics on AWS

Dynamo Consistency

Writes

• Writes are acknowledged (committed) once they exist in at least two physical data centers

• Writes are persisted to SSD

Reads

• Tunable for Application Requirements

• No reduction in durability or consistency in order to achieve throughput

Eventually Consistent Read: stale values possible; highest throughput

Strongly Consistent Read: no stale values; lower potential throughput
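The throughput difference between the two read types shows up directly in provisioned capacity. A minimal Python sketch follows, assuming the provisioned-throughput rules documented for DynamoDB: one read capacity unit covers one strongly consistent read per second (or two eventually consistent reads) of an item up to 4 KB, and one write capacity unit covers one write per second of an item up to 1 KB. The function names are hypothetical.

```python
import math

READ_UNIT_BYTES = 4 * 1024    # item size covered by one read capacity unit
WRITE_UNIT_BYTES = 1 * 1024   # item size covered by one write capacity unit


def read_capacity_units(item_bytes: int, reads_per_sec: int,
                        strongly_consistent: bool = True) -> int:
    """Read capacity units needed for a steady read rate."""
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = units_per_read * reads_per_sec
    # Eventually consistent reads cost half as much as strongly consistent ones.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_bytes: int, writes_per_sec: int) -> int:
    """Write capacity units needed for a steady write rate."""
    return math.ceil(item_bytes / WRITE_UNIT_BYTES) * writes_per_sec
```

For the same table and item size, switching reads from strongly to eventually consistent halves the read capacity that has to be provisioned, which is the trade-off the comparison on this slide describes.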

Page 46: Large Scale Data Analytics on AWS

Database


ElastiCache: In-Memory Caching

Memcached or Redis

Automatic Node Failover / Replacement

Multi-AZ Standby


Page 47: Large Scale Data Analytics on AWS

Database

Redshift: Managed Massively Parallel Petabyte-Scale Data Warehouse

Streaming Backup/Restore to S3

Load data from S3, DynamoDB and EMR

Extensive Security Features

Scale from 160 GB -> 2 PB Online


Page 48: Large Scale Data Analytics on AWS

Amazon Redshift parallelizes and distributes

everything

Query

Load

Backup

Restore

Resize

Compute Node

Compute Node

Compute Node

Leader Node

Common BI Tools

JDBC/ ODBC

10GigE Mesh

Page 49: Large Scale Data Analytics on AWS

Redshift lets you start small and grow big

Small Nodes (dc1.large & ds2.xlarge)

3 spindles, 15-30GiB RAM, 2 or 4 virtual cores, 10GigE

Single Node (160GB SSD or 2TB Magnetic)

Cluster 2-32 Nodes (320GB SSD – 64TB Magnetic)

Large Nodes (dc1.8xlarge & ds2.8xlarge)

24 spindles, 120-244GiB RAM, 2.56TB SSD or 16TB Magnetic, 16 or 32 virtual cores, 10GigE

Cluster 2-100 Nodes (5TB SSD – 1.6PB Magnetic)
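The cluster capacities quoted on this slide follow from per-node storage multiplied by node count. The short Python sketch below reproduces that arithmetic. The per-node figures are taken from the slide, the full node type names are the expanded forms of the slide's abbreviations, and `cluster_storage_tb` is a hypothetical helper.

```python
# Per-node storage in TB, as stated on the slide.
NODE_STORAGE_TB = {
    "dc1.large": 0.16,     # 160GB SSD
    "ds2.xlarge": 2.0,     # 2TB magnetic
    "dc1.8xlarge": 2.56,   # 2.56TB SSD
    "ds2.8xlarge": 16.0,   # 16TB magnetic
}


def cluster_storage_tb(node_type: str, nodes: int) -> float:
    """Raw cluster storage in TB for a given node type and node count."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return NODE_STORAGE_TB[node_type] * nodes


if __name__ == "__main__":
    print(cluster_storage_tb("dc1.large", 2))      # smallest small-node cluster
    print(cluster_storage_tb("ds2.xlarge", 32))    # largest small-node cluster
    print(cluster_storage_tb("dc1.8xlarge", 2))    # smallest large-node cluster
    print(cluster_storage_tb("ds2.8xlarge", 100))  # largest large-node cluster
```

Two dc1.8xlarge nodes give 5.12TB, which the slide rounds to 5TB, and one hundred ds2.8xlarge nodes give 1,600TB, which is the 1.6PB figure.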


Page 50: Large Scale Data Analytics on AWS

COMPLEX ANALYTICS

Page 51: Large Scale Data Analytics on AWS

Elastic MapReduce: Managed, elastic Hadoop (1.x & 2.x) cluster

Integrates with S3, DynamoDB and Redshift

Install Storm, Spark & Shark, Hive, Pig, Impala & End User Tools Automatically

Support for Spot Instances

Integrated HBase NoSQL Database

Analytics

Elastic MapReduce


Page 52: Large Scale Data Analytics on AWS

Analytics

Page 53: Large Scale Data Analytics on AWS

Analytics languages/engines

Data management

Amazon Redshift

Amazon Kinesis

Amazon S3

Amazon DynamoDB

Amazon RDS

EMR

Data Sources

AWS Data Pipeline

Ecosystem

Page 54: Large Scale Data Analytics on AWS

S&P Capital IQ Uses AWS for Big Data Processing

Provides data to 4200+ top global investment firms

Launched Hadoop faster, Learned Hadoop faster

S3 Hadoop Cluster

http://aws.amazon.com/solutions/case-studies/sp-capital-iq

Page 55: Large Scale Data Analytics on AWS

Event Processing

AWS Lambda: Fully Managed Event Processor

Node.js, Integrated AWS SDK & ImageMagick

Natively Compile & Install any Node.js modules

Specify Runtime RAM & Timeout

Automatically Scaled to support Event Volume

Events from S3, Dynamo DB, Kinesis & Lambda

Integrated CloudWatch Logging


Page 56: Large Scale Data Analytics on AWS

Analytics of the Internet of Things

Page 57: Large Scale Data Analytics on AWS

Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.

Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.

Output Datanode: This supports all the same datasources as the input datanode, but they don’t have to be the same type.

Analytics Orchestration

Data Pipeline

Automatically Provision EC2 & EMR Resources

Manage Dependencies & Scheduling

Automatically Retry and Notify of Success & Failure


Page 58: Large Scale Data Analytics on AWS

Output: S3 file

Path: s3://trend-data/#{year-month-day}.csv

Activity: EMR Transform

Hive Query: user-metrics.hql

Frequency: Daily

Input: RDS Table

Table: User-Demographics

SQL Precondition: “Select last_update from table” > #{YY-MM-DD}

Input: DynamoDB Table

Table: User-Event-Data-#{year-month}

Success Notification: [email protected]

Notification: [email protected]

Notification: [email protected]

Sample Use Case
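The date placeholders in this use case are resolved once per scheduled run. The Python sketch below shows how a daily run could expand them into concrete names. The placeholder names are the slide's own illustrative notation rather than confirmed AWS Data Pipeline expression syntax, and `resolve` is a hypothetical helper.

```python
from datetime import date


def resolve(template: str, run_date: date) -> str:
    """Replace the slide's #{...} date placeholders for one scheduled run."""
    values = {
        "year-month-day": run_date.strftime("%Y-%m-%d"),
        "year-month": run_date.strftime("%Y-%m"),
        "YY-MM-DD": run_date.strftime("%y-%m-%d"),
    }
    for name, value in values.items():
        # Match the full placeholder including braces, so "year-month"
        # never matches inside "year-month-day".
        template = template.replace("#{" + name + "}", value)
    return template


if __name__ == "__main__":
    run = date(2015, 10, 28)
    print(resolve("s3://trend-data/#{year-month-day}.csv", run))
    print(resolve("User-Event-Data-#{year-month}", run))
```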

Page 59: Large Scale Data Analytics on AWS

Train and optimize models on GBs of data

Batch process predictions

Real-time prediction API in one-click

No servers to provision or manage

Amazon Machine Learning

Page 60: Large Scale Data Analytics on AWS

END USER REPORTING

Page 61: Large Scale Data Analytics on AWS

End User Reporting

Redshift

S3

EMR

Dynamo DB

Page 62: Large Scale Data Analytics on AWS

End User Reporting – Customer Issues

Realizing the “Virtual Desktop Dream”

BYOD is increasingly popular

Workforces are increasingly diverse

Tablet adoption significant

Keeping all these desktops secure

Page 63: Large Scale Data Analytics on AWS

End User Reporting - Workspaces

WorkSpaces

Fully Managed

Support Multiple Devices

Keep Data Secure and Available

Choose Software & Hardware

Pay as You Go

Corporate Directory Integration

No data stored on end-user device

Only Pixels delivered to users (PCoIP)

User volume backed by Amazon S3

Page 64: Large Scale Data Analytics on AWS

INTEGRATED ANALYTICS

Page 65: Large Scale Data Analytics on AWS

Integrated Analytics

Page 66: Large Scale Data Analytics on AWS

Integrated Analytics

TBs of logs sent daily

Logs stored in Amazon S3

Amazon EMR clusters

Hive Metastore on Amazon EMR

Interactive query

Page 67: Large Scale Data Analytics on AWS

Integrated Analytics

Batch Processing

GBs of logs pushed to Amazon S3 hourly

Daily Amazon EMR cluster using Hive to process data

Input and output stored in Amazon S3

Load subset into Amazon Redshift

Page 68: Large Scale Data Analytics on AWS

Integrated Analytics

Streaming Data Processing

Clickstream logs streamed to Kinesis

Logs stored in Amazon Kinesis

Amazon Kinesis Client Library

AWS Lambda

Amazon EMR

Amazon EC2

Page 69: Large Scale Data Analytics on AWS

Integrated Analytics

Real Time Predictions

Your application

Amazon DynamoDB + Trigger event with Lambda + Query for predictions with the Amazon Machine Learning real-time API

Page 70: Large Scale Data Analytics on AWS

Integrated Analytics

Batch Predictions

Structured data in Amazon Redshift

Query for predictions with Amazon ML batch API

Predictions in Amazon S3

Load predictions into Amazon Redshift

-or-

Read prediction results directly from S3

Your application

Page 71: Large Scale Data Analytics on AWS

aws.amazon.com/architecture/

Page 72: Large Scale Data Analytics on AWS

AWS Training & Certification

Certification: Validate your proven skills and expertise with the AWS platform

aws.amazon.com/certification

Self-Paced Labs: Try products, gain new skills, and get hands-on practice working with AWS technologies

aws.amazon.com/training/self-paced-labs

Training: Build technical expertise to design and operate scalable, efficient applications on AWS

aws.amazon.com/training

Page 73: Large Scale Data Analytics on AWS

Large Scale Data Analytics with Amazon Web Services

Ian Meyers, Principal Solution Architect

October 28th, 2015

Page 74: Large Scale Data Analytics on AWS

A customer has built a new oil pipeline, the North Sea Anglian System (the Flying Scotsman), which ships crude oil from the North Sea to London.

Built on next-generation sensor technology, this pipeline emits operational metrics from every sensor using Internet of Things technology.

With every measurement, each sensor can track the ambient temperature, corrosivity, pressure and flow rate, as well as the physical orientation of the segment of pipeline being monitored.

Provide an operational analytics pipeline which allows for real-time monitoring of the pipeline, as well as historical analysis of all data.

Page 75: Large Scale Data Analytics on AWS

Getting the Data In

Page 76: Large Scale Data Analytics on AWS

Amazon EC2

Amazon Kinesis

MQTT

HTTPS

Page 77: Large Scale Data Analytics on AWS

Application Services

Amazon Kinesis: Managed Service for Real Time Big Data Processing

Create Streams to Produce & Consume Data

Elastically Add and Remove Shards for Performance

Use Kinesis Client Library to Process Data

Integration with S3, Redshift and DynamoDB


Page 78: Large Scale Data Analytics on AWS

[Diagram: Amazon Kinesis stream architecture]

Data Sources

AWS Endpoint

Amazon Kinesis: Shard 1, Shard 2, ... Shard N, across three Availability Zones

App.1 [Aggregate & De-Duplicate]

App.2 [Metric Extraction]

App.3 [Sliding Window Analysis]

App.4 [Machine Learning]

S3, DynamoDB, Redshift
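Records reach a particular shard through their partition key: Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The Python sketch below illustrates the mapping. Splitting the hash space evenly across shards is an assumption made for the example, and all function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to its 128-bit integer hash key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_ranges(shard_count: int):
    """Split the hash space into contiguous, evenly sized ranges (assumed layout)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * step
        # The last shard absorbs any remainder so the ranges cover the whole space.
        end = HASH_SPACE - 1 if index == shard_count - 1 else (index + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= value <= end:
            return index
    raise ValueError("hash key outside all shard ranges")
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which is what keeps a single sensor's events in order.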

Page 79: Large Scale Data Analytics on AWS

Native Code Module to perform efficient writes to Multiple Kinesis Streams

C++/Boost

Asynchronous Execution

Configurable Aggregation of Events

Introducing the Kinesis Producer Library

My Application KPL Daemon

PutRecord(s)

Kinesis Stream

Kinesis Stream

Kinesis Stream

Kinesis Stream

Async

Page 80: Large Scale Data Analytics on AWS

KPL Aggregation

My Application KPL Daemon

PutRecord(s)

Kinesis Stream

Kinesis Stream

Kinesis Stream

Kinesis Stream

Async

1MB Max Event Size

Aggregate

User records: 100k, 20k, 500k, 200k, 40k, 20k, 40k

Aggregated record: Protobuf Header, 500k, 100k, 200k, 20k, 40k, 40k, 20k, Protobuf Footer
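The aggregation idea on this slide, packing several small user records into one stream record under the 1MB limit, can be sketched as a simple greedy packer. This Python example is illustrative only: it is not the Kinesis Producer Library's actual algorithm or wire format, it ignores the protobuf header and footer overhead, and it treats 1MB as 1,000,000 bytes. The function name is hypothetical.

```python
MAX_RECORD_BYTES = 1_000_000  # assumed value for the slide's "1MB Max Event Size"


def pack_records(sizes, limit: int = MAX_RECORD_BYTES):
    """Group record sizes, in arrival order, into batches that fit the limit."""
    batches = []
    current = []
    used = 0
    for size in sizes:
        if size > limit:
            raise ValueError(f"record of {size} bytes exceeds the limit")
        if used + size > limit:
            # The current batch is full; start a new aggregated record.
            batches.append(current)
            current = []
            used = 0
        current.append(size)
        used += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    K = 1_000
    slide_sizes = [100 * K, 20 * K, 500 * K, 200 * K, 40 * K, 20 * K, 40 * K]
    print(len(pack_records(slide_sizes)))
```

The seven record sizes shown on the slide total 920k, so under these assumptions they fit in a single aggregated record, replacing seven PutRecord payloads with one.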

Page 81: Large Scale Data Analytics on AWS

KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support

All State Management in DynamoDB

Kinesis Client Library

DynamoDB

Page 82: Large Scale Data Analytics on AWS

AWS Analytics Demo

Page 83: Large Scale Data Analytics on AWS

Long Term Durability

Page 84: Large Scale Data Analytics on AWS

Amazon EC2

Amazon Kinesis

MQTT

HTTPS

Page 85: Large Scale Data Analytics on AWS

Amazon EC2

Amazon S3

Amazon Kinesis

Amazon EC2

MQTT

HTTPS

Page 86: Large Scale Data Analytics on AWS

Kinesis Connectors

• S3

Batch Write Files for Archive into S3

Extensible file naming

• Redshift

Once Written to S3, load to Redshift

Manifest support

User defined transformers

• DynamoDB

BatchPut append to table

User defined transformers

• Spark

Spark Streaming RDDs

• Storm

Use Kinesis as a Spout

• ElasticSearch

Automatically index stream contents

Storm

S3

DynamoDB

Redshift

Kinesis

ElasticSearch

Page 87: Large Scale Data Analytics on AWS

Connectors Architecture

Page 88: Large Scale Data Analytics on AWS

Elastic Block Store: High performance block storage device

1GB to 1TB in size

Mount as drives to instances with snapshot/cloning functionalities

Availability 99.99%

Durability 99.999999999%

Is a Web Store Not a file system

No Single Points of Failure Eventually consistent

Paradigm Object store

Performance Very Fast

Redundancy Across Availability Zones

Security Public Key / Private Key

Pricing $0.095/GB/month

Typical use case Write once, read many

Limits 100 Buckets, Unlimited Storage, 5TB Objects

Simple Storage Service Highly scalable object storage for the internet

1 byte to 5TB in size

99.999999999% durability

Page 89: Large Scale Data Analytics on AWS

Amazon S3 provides near linear scalability

S3 Streaming Performance

100 VMs; 9.6GB/s; $26/hr

350 VMs; 28.7GB/s; $90/hr

34 secs per terabyte

[Chart: GB/Second against Reader Connections]

S3 Performance & Scalability
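The figures on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below derives time per terabyte, per-VM throughput, and cost per terabyte from the quoted numbers. It assumes a decimal terabyte (1,000 GB), and the function names are hypothetical.

```python
GB_PER_TB = 1000.0  # decimal terabyte, an assumption


def seconds_per_terabyte(gb_per_sec: float) -> float:
    """Time to stream one terabyte at the given aggregate throughput."""
    return GB_PER_TB / gb_per_sec


def per_vm_throughput(gb_per_sec: float, vms: int) -> float:
    """Average throughput contributed by each VM, in GB/s."""
    return gb_per_sec / vms


def cost_per_terabyte(gb_per_sec: float, dollars_per_hour: float) -> float:
    """Compute cost of streaming one terabyte at the given hourly rate."""
    return dollars_per_hour * seconds_per_terabyte(gb_per_sec) / 3600.0


if __name__ == "__main__":
    for vms, rate, price in ((100, 9.6, 26.0), (350, 28.7, 90.0)):
        print(vms,
              round(seconds_per_terabyte(rate), 1),
              round(per_vm_throughput(rate, vms), 3),
              round(cost_per_terabyte(rate, price), 2))
```

At 28.7GB/s one terabyte takes a little under 35 seconds, in line with the slide's figure, and per-VM throughput falls only modestly between 100 and 350 VMs, which is what "near linear" refers to.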

Page 90: Large Scale Data Analytics on AWS

AWS Analytics Demo

Page 91: Large Scale Data Analytics on AWS

Real Time Analytics

Page 92: Large Scale Data Analytics on AWS

Amazon EC2

Amazon S3

Amazon Kinesis

Amazon EC2

MQTT

HTTPS

Page 93: Large Scale Data Analytics on AWS

Amazon EC2

Elastic Beanstalk

DynamoDB

Amazon S3

Amazon Kinesis CloudWatch

Amazon EC2

MQTT

HTTPS json

Page 94: Large Scale Data Analytics on AWS

Deployment & Admin

Elastic Beanstalk

1 click deployment from Eclipse, Visual Studio and Git

Rapid deployment of applications

All AWS resources automatically created

Feature: Details

Platform support: Containers for Java, .NET, Ruby and PHP

Resource creation: Creates load balancer, instances, autoscaling and monitoring automatically

Monitoring & Logs: Integrated with CloudWatch and consolidates server logs

Versioning: Manage versions of applications and easily roll back deployments

Notifications: Receive alerts on key events

Full resource access: Access all underlying AWS resources as necessary


Page 95: Large Scale Data Analytics on AWS

KCL Libraries available for Java, Ruby, Node, Go, and a Multi-Lang Implementation with Native Python support

All State Management in DynamoDB

Kinesis Client Library

DynamoDB

Page 96: Large Scale Data Analytics on AWS

Kinesis Aggregators

Kinesis Aggregators provide a powerful and simple mechanism for creating Real Time Aggregates of data as it traverses Kinesis

Simple Configuration

Create a configuration file defining the Aggregations required

Run the application using Elastic Beanstalk

Data is persisted automatically to DynamoDB

Dynamo Provisioning is fully managed

Data can be graphed using CloudWatch

Utilities to integrate Real Time Aggregates with Elastic MapReduce Hive or Amazon Redshift
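The kind of real-time aggregate described here can be pictured as a running count and sum per label and time bucket. The Python sketch below shows that idea with one-minute buckets. It is not the Kinesis Aggregators API: the event tuple layout and the function name are assumptions, and an in-memory dictionary stands in for the DynamoDB table.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_by_minute(events):
    """Aggregate (iso_timestamp, label, value) events into per-minute buckets.

    Returns {(label, minute_iso): {"count": int, "sum": float}}.
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for timestamp, label, value in events:
        # Truncate the timestamp to the start of its minute.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        bucket = totals[(label, minute.isoformat())]
        bucket["count"] += 1
        bucket["sum"] += value
    return dict(totals)


if __name__ == "__main__":
    sample = [
        ("2015-10-28T14:00:05", "pressure", 10.0),
        ("2015-10-28T14:00:45", "pressure", 12.0),
        ("2015-10-28T14:01:10", "pressure", 11.0),
    ]
    for key, bucket in sorted(aggregate_by_minute(sample).items()):
        print(key, bucket)
```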

Σ

Page 97: Large Scale Data Analytics on AWS

Database

DynamoDB Provisioned throughput NoSQL database

Fast, predictable, configurable performance

Fully distributed, fault tolerant HA architecture

Integration with EMR & Hive


Page 98: Large Scale Data Analytics on AWS

CloudWatch Integration

Σ

Page 99: Large Scale Data Analytics on AWS

AWS Analytics Demo

Page 100: Large Scale Data Analytics on AWS

Massively Parallel Transformations

Page 101: Large Scale Data Analytics on AWS

Amazon EC2

Elastic Beanstalk

DynamoDB

Amazon S3

Amazon Kinesis CloudWatch

Amazon EC2

MQTT

HTTPS json

Page 102: Large Scale Data Analytics on AWS

Amazon EC2

Elastic Beanstalk

DynamoDB

Amazon S3

Amazon Kinesis

Amazon EMR

CloudWatch

Amazon EC2

MQTT

HTTPS json

Page 103: Large Scale Data Analytics on AWS

Elastic MapReduce Managed, elastic Hadoop (1.x & 2.x) cluster

Integrates with S3, DynamoDB and Redshift

Install Storm, Spark & Shark, Hive, Pig, Impala &

End User Tools Automatically

Support for Spot Instances

Integrated HBase NOSQL Database

Analytics

Elastic MapReduce


Page 104: Large Scale Data Analytics on AWS

AWS Analytics Demo

Page 105: Large Scale Data Analytics on AWS

Accessible for Analysts & Dashboards

Page 106: Large Scale Data Analytics on AWS

Amazon EC2

Elastic Beanstalk

DynamoDB

Amazon S3

Amazon Kinesis

Amazon EMR

CloudWatch

Amazon EC2

MQTT

HTTPS json

Page 107: Large Scale Data Analytics on AWS

Amazon EC2

Elastic Beanstalk

DynamoDB

Amazon S3

Amazon Kinesis

Amazon Redshift

Amazon EMR

CloudWatch

Amazon EC2

MQTT

HTTPS json

AWS Lambda

Page 108: Large Scale Data Analytics on AWS

S3 Events

AWS Lambda

SQS Queues

SNS Topics

Amazon S3 Bucket

RRS Object Lost

Object Deleted

Object Delete Marker Created

Object Created (Put)

Object Created (Post)

Object Created (Copy)

Object Created (Multi-Part)
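A function subscribed to these bucket events receives a JSON document with one entry per event. The sketch below shows how such a handler might pick out newly created objects. It is written in Python purely for illustration (the slides list Node.js as the Lambda runtime), the field names follow the S3 event notification record layout, and the bucket and key in the usage example are made up.

```python
from urllib.parse import unquote_plus


def created_objects(event):
    """Return (bucket, key) pairs for every ObjectCreated record in an S3 event."""
    found = []
    for record in event.get("Records", []):
        # Covers Put, Post, Copy and multipart completion; skips deletes and others.
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue
        s3 = record["s3"]
        # Object keys arrive URL-encoded in event notifications.
        found.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return found


def handler(event, context):
    """Lambda-style entry point: list the S3 URIs of newly created objects."""
    return [f"s3://{bucket}/{key}" for bucket, key in created_objects(event)]


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"eventName": "ObjectCreated:Put",
             "s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "2015/10/28/part+0001.csv"}}},
        ]
    }
    print(handler(sample_event, None))
```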

Page 109: Large Scale Data Analytics on AWS

Event Processing

AWS Lambda Fully Managed Event Processor

Node.js, Integrated AWS SDK & ImageMagick

Natively Compile & Install any Node.js modules

Specify Runtime RAM & Timeout

Automatically Scaled to support Event Volume

Events from S3, Dynamo DB, Kinesis & Lambda

Integrated CloudWatch Logging


Page 110: Large Scale Data Analytics on AWS

Database

Redshift Managed Massively Parallel Petabyte Scale Data

Warehouse

Streaming Backup/Restore to S3

Load data from S3, DynamoDB and EMR

Extensive Security Features

Scale from 160GB -> 2 PB Online


Page 111: Large Scale Data Analytics on AWS