Amazon Redshift & Amazon DynamoDB


© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Redshift & Amazon DynamoDB

Michael Hanisch, Amazon Web Services
Erez Hadas-Sonnenschein, clipkit GmbH
Witali Stohler, clipkit GmbH

2014-05-15

Amazon Redshift & Amazon DynamoDB

Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year


A fully managed data warehouse service
•  Massively parallel relational data warehouse
•  Takes care of cluster management and distribution of your data
•  Columnar data store with variable compression
•  Optimized for complex queries across many large tables
•  Use standard SQL & standard BI tools

Amazon DynamoDB

A fully managed fast key-value store
•  Fast, predictable performance
•  Simple and fast to deploy
•  Easy to scale as you go, up to millions of IOPS
•  Pay only for what you use: read/write IOPS + storage
•  Data is automatically replicated across data centers

Amazon DynamoDB

Amazon DynamoDB vs. Amazon Redshift

Amazon DynamoDB:
•  Fast insert & update
•  Limited query capability (single table only)
•  NoSQL database

Amazon Redshift:
•  Fast queries
•  Flexible queries (JOINs, aggregation functions, …)
•  SQL

Queries in Amazon DynamoDB

Queries in Amazon DynamoDB
•  Query or BatchQuery APIs retrieve items
•  Scan & filter to comb through a whole table
•  You have to join tables in your own code!

Amazon DynamoDB
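Since Query and Scan operate on a single table, the "join in your own code" bullet above can be sketched as a small client-side hash join. A minimal sketch, assuming hypothetical item shapes (`videos`, `publishers`); in practice the two lists would come from Query/Scan responses via an AWS SDK.

```python
# Minimal sketch of a client-side "join" between two DynamoDB tables.
# The item shapes below are hypothetical stand-ins for real query results.

def hash_join(left, right, key):
    """Join two lists of item dicts on a shared attribute (like a SQL inner join)."""
    index = {}
    for item in right:
        index.setdefault(item[key], []).append(item)
    joined = []
    for item in left:
        for match in index.get(item[key], []):
            joined.append({**match, **item})  # left side wins on name clashes
    return joined

videos = [
    {"video_id": "v1", "publisher_id": "p1", "title": "Intro"},
    {"video_id": "v2", "publisher_id": "p2", "title": "Review"},
]
publishers = [
    {"publisher_id": "p1", "name": "Acme News"},
    {"publisher_id": "p2", "name": "Sports Hub"},
]

rows = hash_join(videos, publishers, "publisher_id")
```

This is exactly the work a relational engine would do for you, which is why the later slides move the analytics into Redshift instead.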

Queries in Amazon DynamoDB (2)
•  Apache Hive on Amazon EMR can access data in DynamoDB
•  Run HiveQL queries for bulk processing
•  Can integrate data in HDFS, Amazon S3, …

Amazon DynamoDB HiveQL queries on Amazon EMR

Queries in Amazon DynamoDB (3)
•  Import data into Amazon Redshift
•  Use SQL queries, BI tools, etc.
•  Powerful analytics and aggregation functions

Amazon Redshift Amazon DynamoDB

Importing Data into Amazon Redshift

TMTOWTDI (There's More Than One Way To Do It)

Query & Insert

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchQuery

#2 Retrieve Items

#3 INSERT INTO … (…)

Query & Insert

The Good
•  Full control over queries
•  Decide which items you want to move to Redshift
•  Process data on the way

The Bad
•  Slow
•  Inefficient on the Redshift side of things
•  Does not scale well
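The Query & Insert pattern above can be sketched as follows: items retrieved from DynamoDB are turned into multi-row INSERT statements for Redshift. The table and column names are hypothetical, and actually executing the SQL (over a PostgreSQL driver) is omitted; the point is the batching, since single-row INSERTs are the main reason this approach is slow on the Redshift side.

```python
# Sketch: turn DynamoDB items into batched INSERT statements for Redshift.
# Table name "metrics" and its columns are hypothetical examples.

def _literal(value):
    """Render a Python value as a SQL literal (NULL, number, or quoted string)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"

def build_inserts(table, columns, items, batch_size=2):
    """Group items into multi-row INSERTs instead of one statement per item."""
    statements = []
    for i in range(0, len(items), batch_size):
        rows = []
        for item in items[i:i + batch_size]:
            vals = ", ".join(_literal(item.get(c)) for c in columns)
            rows.append(f"({vals})")
        statements.append(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES {', '.join(rows)};"
        )
    return statements

items = [
    {"video_id": "v1", "plays": 10},
    {"video_id": "v2", "plays": 25},
    {"video_id": "v3", "plays": 7},
]
stmts = build_inserts("metrics", ["video_id", "plays"], items)
```

Even batched, every row still travels through the leader node, which is why COPY (next slides) remains far more efficient.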

The COPY Command

Amazon Redshift

Amazon DynamoDB

#1 COPY FROM …

#2 Parallel Scans (politely ask for the whole table)

#3 Return Items

The COPY Command
•  COPY a single table at a time
•  From one Amazon DynamoDB table into one Amazon Redshift table
•  Fast – executed in parallel on all data nodes in the Amazon Redshift cluster
•  Can be limited to use a certain percentage of provisioned throughput on the DynamoDB table

The COPY Command

COPY <table_name> (col1, col2, …)
FROM 'dynamodb://<table_name2>'
CREDENTIALS 'aws_access_key_id=…;aws_secret_access_key=…'
READRATIO 10  -- use 10% of available read capacity
COMPROWS 0    -- how many rows to read to determine compression
[…other options…]

The COPY Command
•  Attributes are mapped to columns by name
•  Case of column names is ignored
•  Attributes that do not map are ignored
•  Missing attributes are stored as NULL or empty values
•  Only works for STRING and NUMBER attributes
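The mapping rules above can be simulated in a few lines: match names case-insensitively, drop attributes with no matching column, and fill missing attributes with NULL (None here). The column names are hypothetical; this is a sketch of the rules, not Redshift's implementation.

```python
# Sketch of COPY's attribute-to-column mapping rules, simulated in Python.

def map_item(item, columns):
    """Map a DynamoDB item onto Redshift columns by case-insensitive name."""
    lowered = {k.lower(): v for k, v in item.items()}
    # Unmapped attributes are dropped; missing ones become None (NULL).
    return {col: lowered.get(col.lower()) for col in columns}

item = {"VideoId": "v1", "Plays": 10, "extra_attr": "ignored"}
row = map_item(item, ["videoid", "plays", "country"])
```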

The COPY Command

The Good
•  Easy to use
•  Fast
•  Efficient use of resources
•  Scales linearly with cluster size
•  Only uses a certain percentage of read throughput

The Bad
•  Whole tables only
•  No processing in between
•  Can only copy from DynamoDB in the same region
•  Only works with STRING and NUMBER types

Query & Insert at Scale

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchQuery

#2 Retrieve Items

#3 INSERT INTO … (…) in parallel

Amazon EMR

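The parallel retrieval behind this slide can be sketched with a segment fan-out: each worker reads one segment of the table concurrently, mirroring DynamoDB's Scan with its Segment/TotalSegments parameters. `fetch_segment` is a hypothetical stand-in that partitions a local list; real code would issue the Scan API call from each worker (e.g. from EMR tasks).

```python
# Sketch of parallel segmented scans, as used by "Query & Insert at Scale".

from concurrent.futures import ThreadPoolExecutor

ITEMS = [{"id": i} for i in range(10)]  # stand-in for a DynamoDB table

def fetch_segment(segment, total_segments):
    """Hypothetical scan of one segment (real code: Scan with
    Segment=segment, TotalSegments=total_segments)."""
    return [item for item in ITEMS if item["id"] % total_segments == segment]

def parallel_scan(total_segments=4):
    """Fan out one worker per segment and merge the results."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [pool.submit(fetch_segment, s, total_segments)
                   for s in range(total_segments)]
        merged = []
        for f in futures:
            merged.extend(f.result())
    return merged

scanned = parallel_scan()
```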

Query & Import using Amazon EMR

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchQuery

#2 Retrieve Items

in parallel

Amazon S3

#3 Export to file(s) on S3

#4 COPY … FROM s3://

#5 Retrieve files

Amazon EMR

Query & Import using Amazon EMR

Amazon Redshift

Amazon DynamoDB

#1 Query / BatchQuery

#2 Retrieve Items

in parallel

#3 COPY … FROM emr://

#4 Retrieve files from HDFS

Query & Import using Amazon EMR

The Good
•  Decide which items you want to move to Redshift
•  Full control over queries
•  Process data on the way
•  Scales well
•  Integrates with other data sources easily

The Bad
•  Additional complexity
•  Additional cost (for EMR)
•  Slower than direct COPY from Amazon DynamoDB
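The export step of the S3 route can be sketched as serializing the retrieved items into a delimited file that Redshift's COPY … FROM 's3://…' can load. The upload itself and the COPY statement (with a matching DELIMITER option) are left out; the column order is a hypothetical choice that must match the target table.

```python
# Sketch of step #3 of the EMR/S3 route: items -> pipe-delimited staging file.

def to_copy_file(items, columns, delimiter="|"):
    """Serialize items into delimited lines; missing values become empty fields."""
    lines = []
    for item in items:
        fields = ["" if item.get(c) is None else str(item[c]) for c in columns]
        lines.append(delimiter.join(fields))
    return "\n".join(lines) + "\n"

items = [
    {"video_id": "v1", "plays": 10, "country": "DE"},
    {"video_id": "v2", "plays": 25, "country": None},
]
payload = to_copy_file(items, ["video_id", "plays", "country"])
```

Because the processing happens before the file is written, this route supports the filtering and transformation that the direct COPY path cannot.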

Please welcome: Erez Hadas-Sonnenschein, Sr. Product Manager; Witali Stohler, Data Warehouse & BI Specialist

clipkit GmbH

Video Syndication – The Possibilities

News Sports Cars/motor Business/finances Music Gaming Cinema Cooking/food Lifestyle/fashion Traveling Computer/mobile Fitness/wellness Knowledge/hobby Entertainment

Content – Partner Overview

clipkit Player – Analytics (Metrics)

Full Screen

Category

Playlist Pos.

Play / Pause

Progress Pos. Mute / Unmute Volume

clipkit Player – Analytics (Metrics) Location (Country, City) Language Browser Operating System Video Id Publisher URL Etc…

First Implementation (Expensive and Slow)

•  Designed in the early days
•  Not built for this amount of data
•  Slow copy process from S3 to the DB (old PHP application architecture)
•  Fixed EC2 pricing (expensive to support peak hours)
•  PostgreSQL scalability limitations
•  Sometimes the copy process was so slow that the delay was ~3 days

Analytics / Metrics (Requests Graph)

•  ~ 6,000,000 New Entries per day •  ~ 1,000 Requests per second (Peak Hours) •  ~ 25 Requests per second (Off-peak Hours)

4000% Requests Growth during the day.

Analytics / Metrics (Numbers)

Second Implementation (Expensive and Slow)

•  Inserting only into one (big) table
•  The COPY command only works for whole tables
•  The minimum delay was one day
•  Our solution was to increase the provisioned throughput, which was expensive

NO REAL-TIME DATA

Third Implementation (Cheap and Fast)

Third Implementation – DynamoDB

•  Java SDK: AmazonDynamoDBAsyncClient ("fire and go")
•  Easy to create and delete tables
•  Write latency ~5 ms
•  Throughput auto-scales with Dynamic DynamoDB
•  One table per day
•  Continuous iteration and copy to Redshift
•  We pay only for what we use
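The one-table-per-day rotation can be sketched as a naming scheme: derive today's table name for writes, and the name of an older table that can be dropped once its data has been copied to Redshift. The "metrics_" prefix and the two-day retention window are hypothetical; the deck only states that a day's table is deleted after being copied.

```python
# Sketch of a daily DynamoDB table rotation scheme (names are hypothetical).

from datetime import date, timedelta

def table_for(day):
    """Name of the table receiving writes on the given day."""
    return f"metrics_{day:%Y_%m_%d}"

def table_to_drop(today, retention_days=2):
    """A table is safe to delete once it is older than the copy window."""
    return table_for(today - timedelta(days=retention_days))

today = date(2014, 5, 15)
current = table_for(today)       # table receiving today's writes
obsolete = table_to_drop(today)  # already copied to Redshift, delete it
```

Dropping a whole table is a single cheap operation, which avoids both per-item deletes and paying for storage of data that already lives in Redshift.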

Third Implementation – Redshift

•  Standard PostgreSQL JDBC
•  Fully managed by Amazon
•  Automated backups and fast restores
•  ~7,000 inserted items per second
•  Queries over > 1 billion entries in less than 2 seconds
•  Real-time data availability (maximum 1 minute delay)

Third Implementation – Conclusions

•  Java web application
–  Auto-scales (off-peak: 1 small instance)

•  DynamoDB
–  One table per day (deleted after being copied)
–  Auto-scales
–  ~5 ms PutItem latency

•  Redshift
–  Inserts ~7,000 items per second
–  Fully managed

Thank You!