How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013


Description

Learn how Amazon's enterprise data warehouse, one of the world's largest with petabytes of data under management, is leveraging Amazon Redshift. The talk covers Amazon's enterprise data warehouse best practices and solutions, and how the team uses Amazon Redshift to handle design and scale challenges.

Transcript of How Amazon.com is Leveraging Amazon Redshift (DAT306) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

DAT306 - How Amazon.com, with One of the World’s Largest Data Warehouses, is Leveraging Amazon Redshift

Erik Selberg ([email protected]) and Abhishek Agrawal ([email protected])

November 14, 2013

Agenda

• Amazon Data Warehouse Overview

• Amazon Data Warehouse and Amazon Redshift Integration Project

• Amazon Redshift Best Practices

• Conclusion

Amazon Data Warehouse Overview

Erik Selberg <[email protected]>

Amazon Data Warehouse

• Authoritative repository of data for all of Amazon

• Petabytes of data

• Existing EDW is Oracle RAC; also using Amazon Elastic MapReduce and now Amazon Redshift

• Owns and manages the hardware and software infrastructure – apart from the Oracle DB, it's all Amazon IP

• Not part of AWS

Introducing the Elephant…

• Mission: Provide customers the best value
  – Leverage AWS only if it provides the best value
  – We aren’t moving 100% to Amazon Redshift

• Publish best practices
  – If AWS isn’t the best, we’ll say so

• There is a conflict of interest

Amazon Data Warehouse Architecture

[Architecture diagram: a Control Plane (ETL Manager) orchestrates three backends: the existing EDW, Amazon EMR, and Amazon Redshift]

Amazon Data Warehouse – Growth Story

• Petabytes of data

• Growth of data volume – YoY storage requirements have grown 67%

• Growth of processing volume – YoY processing demand has grown 47%

Long-Term Sustainable Scale

[Chart: demand vs. capacity over time. SAN-based capacity is provisioned ahead of demand, with the gap labeled "$$ Wasted"; Amazon Redshift capacity tracks demand closely]

Coping with Change

[Chart: when growth changes, SAN capacity cannot adjust and demand goes unmet; Amazon Redshift capacity can follow the change in demand]

Amazon Data Warehouse – Cost per Job

• Our main efficiency metric – Cost per Job (CPJ)

CPJ = ($CapEx + $DataCenter + $VendorSupport) / Peak Jobs Per Day

What Drives Cost per Job…

Up?
• Number of disks – data gets bigger!
• Number of servers
• Short-sighted negotiations – 4th year support…
• Data center costs (power, rent)

Down?
• Bidding – 2+ vendors
• Moore’s Law – vendors fight this!
• Data design
• Software (e.g., DBMS)

Current State and Problems

• Existing EDW
  – Multiple multi-petabyte clusters (redundancy and jobs)
  – Why not <x>? CPJ not lower

• Data stored in SANs (not Exadata)

• Performs poorly on scans of 10 TB+

• Long procurement cycles (3-month minimum)

Amazon Data Warehouse and Amazon Redshift Integration Project

• Spent 2013 evaluating Amazon Redshift for the Amazon data warehouse
  – Where does Amazon Redshift provide a better CPJ?
  – Can Amazon Redshift solve some pain (without introducing new pain)?

• Picked 10K jobs and 275 tables to copy

Current State of Affairs

• Biggest cluster size: 20+1 8XL

• Peak daily jobs: 7211 (using all 4 clusters)

• 4159 extracts

• 3052 loads

Some Results

• Benchmarked 4159 jobs
  – Outperforming: 2719
  – Underperforming: 1440

• Average runtime
  – 4:43 min in Amazon Redshift
  – 17:38 min in the existing EDW

• LOADs are slower

• EXTRACTs are faster

Job Type | RS Performance Category | Job Count by Category
---------|-------------------------|----------------------
EXTRACT  | 10X Faster              | 945
EXTRACT  | 5X Faster               | 487
EXTRACT  | 3X Faster               | 393
EXTRACT  | 2X Faster               | 301
EXTRACT  | 1X or same              | 480
EXTRACT  | 2X Slower               | 1150
LOAD     | 10X Faster              | 7
LOAD     | 5X Faster               | 15
LOAD     | 3X Faster               | 23
LOAD     | 2X Faster               | 23
LOAD     | 1X or same              | 45
LOAD     | 2X Slower               | 290

Amazon Redshift Best Practices

Abhishek Agrawal <[email protected]>

Amazon Redshift Integration Best Practices

• Integrating via Amazon S3 (Manifests)

• Primary key enforcement

• Idempotent loads
  – MERGE via INSERT/UPDATE
  – Mimic Trunc-Load [Backfills]

• Trunc-partition using sort keys

• Administration automation

• Ensuring data correctness

Integrating via Amazon S3

• S3 in the US Standard Region is eventually consistent!

• S3 LIST might not return the entire list of data right after you save it (this WILL eventually happen to you!)

• Amazon Redshift loads everything it sees in a bucket
  – You may see all data files, Amazon Redshift may not, which can cause missing data

Best Practices – Using Amazon S3

• Read/COPY
  – System table validation – STL_LOAD_ERRORS
  – Verify the files loaded are the ‘intended’ files

• Write/UNLOAD
  – System table validation – STL_UNLOAD_LOG
  – Verify all files that have the data are on S3

• Manifests (see the sketch below)
  – Metadata describing exactly what to read from S3
  – Provides an authoritative reference to the data
  – Powerful in terms of user metadata format, encryption, etc.
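
A minimal sketch of the manifest-driven pattern above. The bucket, table, and file names are hypothetical placeholders, not Amazon's actual setup; the credentials are deliberately elided:

-- A manifest is a small JSON file naming exactly which S3 objects to load:
-- {"entries": [
--   {"url": "s3://my-bucket/data/part-0000.gz", "mandatory": true},
--   {"url": "s3://my-bucket/data/part-0001.gz", "mandatory": true}
-- ]}

-- Load only what the manifest lists
COPY example_table
FROM 's3://my-bucket/manifests/load-2013-11-14.manifest'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
GZIP
MANIFEST;

-- Validate: were any rows rejected during the load?
SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;

-- Validate: which files were actually committed for this load?
SELECT filename, lines_scanned
FROM stl_load_commits
WHERE query = pg_last_copy_id();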

Primary Key Enforcement

• Amazon Redshift does not enforce primary keys
  – You will need to do this yourself to ensure data quality

• Best practice (see the sketch below)
  – Introduce a temp table to check for duplicates in incoming data
  – Validate against incoming data to catch offenders
  – Put the data in the target table and validate the target data in the same transaction before commit

• Yes, this IS a lot of overhead
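
A minimal sketch of that check, assuming a hypothetical orders table keyed on order_id. The orchestration layer would run the validation queries and roll back if any of them return rows:

BEGIN;

-- Land the incoming batch in a temp table
CREATE TEMP TABLE stage (LIKE orders);
COPY stage FROM 's3://my-bucket/manifests/orders.manifest'
CREDENTIALS '...' MANIFEST;

-- Offenders within the incoming batch itself
SELECT order_id, COUNT(*) FROM stage GROUP BY order_id HAVING COUNT(*) > 1;

-- Offenders that already exist in the target
SELECT s.order_id FROM stage s JOIN orders o ON s.order_id = o.order_id;

INSERT INTO orders SELECT * FROM stage;

-- Final validation of the target inside the same transaction
SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1;

COMMIT;  -- or ROLLBACK if any validation query returned rows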

Idempotent Loads

• Idempotent load – running a load 2+ times gives the same result as running it once
  – Needed to manage load failures

• MERGE – leverages the primary key, row at a time

• TRUNC / INSERT – load a partition at a time

MERGE

• No native Amazon Redshift MERGE support

• Merge is implemented as a multi-step process (sketched below)
  – Load the data into a temp table
  – Figure out the inserts and load them
  – Figure out the updates and modify the target table
  – Validate for duplicates
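
A minimal sketch of that multi-step merge, assuming a hypothetical target table keyed on id with example columns value and updated_at:

BEGIN;

-- Step 1: land the incoming data in a temp table
CREATE TEMP TABLE stage (LIKE target);
COPY stage FROM 's3://my-bucket/manifests/target.manifest'
CREDENTIALS '...' MANIFEST;

-- Step 2: apply updates for keys that already exist in the target
UPDATE target
SET value = s.value, updated_at = s.updated_at
FROM stage s
WHERE target.id = s.id;

-- Step 3: insert keys that are new
INSERT INTO target
SELECT s.* FROM stage s
LEFT JOIN target t ON s.id = t.id
WHERE t.id IS NULL;

-- Step 4: validate for duplicates before committing
SELECT id, COUNT(*) FROM target GROUP BY id HAVING COUNT(*) > 1;

COMMIT;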

TRUNC - INSERT

• Solution (see the sketch below)
  – Distribute randomly
  – Use sort keys to align data (mimics a partition)
  – Selectively delete and insert

• Issues
  – Inserts land in an “unsorted” region – performance degrades without periodic VACUUM
  – Very slow (effectively row at a time)
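
A sketch of how the selective delete-and-insert mimics truncating a partition, assuming a hypothetical fact table sorted on event_date:

-- Replace one "partition" (a sort-key range) idempotently
BEGIN;

-- The sort key on event_date keeps this delete from scanning the whole table
DELETE FROM fact_sales WHERE event_date = '2013-11-14';

INSERT INTO fact_sales
SELECT * FROM stage WHERE event_date = '2013-11-14';

COMMIT;

-- Inserted rows sit in the unsorted region until a periodic VACUUM re-sorts them
VACUUM fact_sales;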

Other Temp Table Uses

• Partial column data load

• Filtered data load

• Column transformations
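
For instance (column names are hypothetical), a temp table lets a load filter and reshape data before it touches the target:

-- Filtered load with column transformations via the temp table
INSERT INTO target (id, country_code, amount_usd)
SELECT id,
       UPPER(TRIM(country_code)),   -- column transformation
       amount_cents / 100.0         -- unit conversion
FROM stage
WHERE country_code IS NOT NULL;     -- filtered data load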

Automating Administration

• Stored procs / Oracle workflows were used to do admin tasks like retention, stats, etc.

• Solution (see the sketch below)
  – We introduced a software layer that prepares the administrative task statements based on defined inputs
  – Executes them over a JDBC connection
  – Can schedule work like stats collection, vacuum, etc.
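
The kinds of statements such a layer might generate and submit over JDBC. The table name and retention window are hypothetical examples:

-- Retention: drop rows older than the configured window
DELETE FROM clickstream WHERE event_date < DATEADD(day, -90, CURRENT_DATE);

-- Stats collection so the planner has fresh statistics
ANALYZE clickstream;

-- Reclaim space and re-sort after the deletes
VACUUM clickstream;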

2013 Results

• CPJ is 55% less on Amazon Redshift in general
  – We can’t share the math, sorry; YMMV
  – Between Redshift and the Amazon data warehouse, known improvements get us to ~66%
  – Big wins are in big queries
  – Loads are slow and expensive

• Moved ~10K jobs to ~60 8XLs (4 clusters)

• We could move at most 45% of our work to Amazon Redshift with minimal changes

2014 Plan

• Focus on big tables (100 TB+)
  – Need to solve data expiry and backfill challenges

• Solve problems with CPU-bound workloads

• Interactive analytics (third-party vendor apps with Amazon Redshift + Oracle)

Please give us your feedback on this presentation.

As a thank you, we will select prize winners daily for completed surveys!

DAT306