Amazon RedShift - Ianni Vamvadelis

Amazon Redshift Intro, Details

Ianni Vamvadelis, Solutions Architect

AWS Database Services

Scalable High Performance Application Storage in the Cloud

•  Amazon DynamoDB: Fast, Predictable, Highly-Scalable NoSQL Data Store
•  Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
•  Amazon ElastiCache: In-Memory Caching Service
•  Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service

[Diagram labels: AWS Global Infrastructure; Compute; Storage; Database; Networking; Application Services; Deployment & Administration]

Design Objectives

A petabyte-scale data warehouse service that was…

Amazon Redshift

A Whole Lot Simpler

A Lot Cheaper

A Lot Faster

Redshift Dramatically Reduces I/O

•  Direct-attached storage
•  Large data block sizes
•  Columnar storage
•  Data compression
•  Zone maps

Id    Age   State
123   20    CA
345   25    WA
678   40    FL

[Figure: row storage vs. column storage]
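A minimal Python sketch of why the column layout reduces I/O for the three-row table above. The flat-list "storage" and the helper names are illustrative assumptions, not how Redshift lays out blocks internally.

```python
# Illustrative only: models storage as Python lists, not real disk blocks.
rows = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]
columns = ("Id", "Age", "State")

def row_layout(rows):
    """Row storage: all fields of one row sit next to each other."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Column storage: all values of one column sit next to each other."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}

def values_read_for_avg_age(layout):
    """Count how many stored values a scan touches to average the Age column."""
    if isinstance(layout, dict):   # column storage: read only the Age column
        return len(layout["Age"])
    return len(layout)             # row storage: read every field of every row

row_store = row_layout(rows)
col_store = column_layout(rows)
print(values_read_for_avg_age(row_store))  # 9
print(values_read_for_avg_age(col_store))  # 3
```

A query that needs one column out of three touches a third of the values in this toy model; the gap widens as tables get wider.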


Redshift Runs on Optimized Hardware

•  Optimized for I/O intensive workloads
•  HS1.8XL available on Amazon EC2
•  Runs in HPC - fast network
•  High disk density

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate

HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage

[Diagram: cluster nodes, each with 16GB RAM, 2TB disk, 2 cores]

Click to grow …to 1.6PB

Redshift Parallelizes and Distributes Everything

•  Load
•  Query
•  Resize
•  Backup
•  Restore

[Diagram: cluster architecture. Labels: SQL Clients/BI Tools; JDBC/ODBC; Leader Node; Compute Nodes (128GB RAM, 16TB disk, 16 cores each); 10 GigE (HPC); Amazon S3 for ingestion, backup and restore]

Point and Click Resize


[Diagram: leader node (16 cores) and cluster nodes, each labeled 128GB RAM, 48TB disk]

Resize your cluster while remaining online

[Diagram: source and target clusters; nodes labeled 128GB RAM, 48TB disk]

•  New target provisioned in the background
•  Only charged for source cluster

Resize your cluster while remaining online

•  Fully automated
    – Data automatically redistributed
•  Read only mode during resize
•  Parallel node-to-node data copy
•  Automatic DNS-based endpoint cut-over
•  Only charged for one cluster




Amazon Redshift has security built-in

•  SSL to secure data in transit
•  Encryption to secure data at rest
    – AES-256
    – All blocks on disks and in Amazon S3 encrypted
•  No direct access to compute nodes
•  Amazon VPC support

[Diagram: cluster architecture with Customer VPC and Internal VPC. Labels: JDBC/ODBC; Leader Node; Compute Nodes (128GB RAM, 16TB disk, 16 cores each); 10 GigE (HPC); Amazon S3 for ingestion, backup and restore]

Continuous Backup, Automated Recovery

•  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

•  Backups to Amazon S3 are continuous, automatic, and incremental

•  Continuous monitoring and automated recovery from failures of drives and nodes

•  Able to restore snapshots to any Availability Zone within a region

[Chart: data volume. Curves: data generated; data available for analysis. Annotations: Gap; cost + effort]

Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Redshift is Priced to Analyze All Your Data

•  $0.85 per hour for on-demand (2TB)
•  $999 per TB per year (3-yr reservation)
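To compare the two prices on the same basis, the on-demand rate can be converted to a per-TB, per-year figure. The sketch below assumes the 2TB node runs 24 hours a day for a 365-day year; that utilization is an assumption for illustration, not something the slide states.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: node runs continuously, non-leap year

def on_demand_per_tb_year(hourly_rate, tb_per_node):
    """Convert an hourly node price into dollars per TB per year."""
    return hourly_rate * HOURS_PER_YEAR / tb_per_node

on_demand = on_demand_per_tb_year(0.85, 2)  # slide: $0.85/hour for a 2TB node
reserved = 999.0                            # slide: $999 per TB per year (3-yr)

print(f"on-demand: ${on_demand:,.0f} per TB per year")
print(f"reserved:  ${reserved:,.0f} per TB per year")
print(f"ratio:     {on_demand / reserved:.1f}x")
```

Under that assumption the on-demand figure works out to roughly $3,723 per TB per year, so the reserved price is the one that matches the "under $1,000 per TB per year" headline.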

Integrates With Existing BI Tools

Amazon Redshift

JDBC/ODBC

Scenarios


Reporting Warehouse

•  Accelerated operational reporting
•  Support for short-time use cases
•  Data compression, index redundancy

[Diagram: RDBMS (OLTP, ERP) feeding Redshift (Reporting and BI)]

On-Premises Integration

[Diagram: RDBMS (OLTP, ERP) feeding Redshift (Reporting and BI) through Data Integration Partners*]

Live Archive for (Structured) Big Data

•  Direct integration with COPY command
•  High velocity data
•  Data ages into Redshift
•  Low cost, high scale option for new apps

[Diagram: DynamoDB (OLTP, Web Apps) feeding Redshift (Reporting and BI)]

Cloud ETL for Big Data

•  Maintain online SQL access to historical logs
•  Transformation and enrichment with EMR
•  Longer history ensures better insight

[Diagram labels: S3; Elastic MapReduce; Redshift; Reporting and BI]

Ingestion – Best Practices

§  Goal: Leverage all the compute nodes and minimize overhead

§  Best Practices
    §  Preferred method - COPY from S3
        §  Loads data in sorted order through the compute nodes
        §  Single COPY command, split data into multiple files
        §  Strongly recommend that you gzip large datasets
    §  If you must ingest through SQL
        §  Multi-row inserts
        §  Avoid large number of singleton insert/update/delete operations
    §  To copy from another table
        §  CREATE TABLE AS or INSERT INTO SELECT

    insert into category_stage values
    (default, default, default, default),
    (20, default, 'Country', default),
    (21, 'Concerts', 'Rock', default);

    copy time from 's3://mybucket/data/timerows.gz'
    credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
    gzip delimiter '|';
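The "split data into multiple files" and "gzip large datasets" advice can be prepared on the client side before upload. A minimal Python sketch follows; the function name, the `timerows` file prefix, the part-numbering scheme and the sample rows are all hypothetical, and the upload to S3 is left out. It assumes the resulting files share a common key prefix that a single COPY command can point at.

```python
import gzip
import os
import tempfile

def split_into_gzip_parts(lines, num_parts, out_dir, prefix="timerows"):
    """Write lines round-robin into num_parts gzip files and return their paths."""
    paths = [os.path.join(out_dir, f"{prefix}.part{i:02d}.gz")
             for i in range(num_parts)]
    files = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        for i, line in enumerate(lines):
            files[i % num_parts].write(line + "\n")
    finally:
        for f in files:
            f.close()
    return paths

# Hypothetical pipe-delimited sample rows, matching the delimiter '|' above.
with tempfile.TemporaryDirectory() as tmp:
    sample = [f"{i}|2013-01-{i % 28 + 1:02d}" for i in range(10)]
    parts = split_into_gzip_parts(sample, 4, tmp)
    print([os.path.basename(p) for p in parts])
```

Choosing a part count that is a multiple of the number of compute nodes is one way to keep every node busy during the load; the value 4 here is arbitrary.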

Choose a Sort Key

§  Goal
    §  Skip over data blocks to minimize I/O

§  Best Practice
    §  Sort based on range or equality predicate (WHERE clause)
    §  If you access recent data frequently, sort based on TIMESTAMP
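How a sort key lets the engine skip blocks can be illustrated with a toy zone map, that is, a min/max pair kept per block. This is a sketch under stated assumptions: the block size of 100 values, the integer stand-in for timestamps, and the function names are all invented for illustration and say nothing about Redshift's real block format.

```python
import random

BLOCK_SIZE = 100  # assumption: values per block, chosen only for illustration

def build_zone_map(values, block_size=BLOCK_SIZE):
    """Return a (min, max) pair for each fixed-size block of values."""
    return [(min(values[i:i + block_size]), max(values[i:i + block_size]))
            for i in range(0, len(values), block_size)]

def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the predicate [low, high]."""
    return sum(1 for lo, hi in zone_map if hi >= low and lo <= high)

timestamps = list(range(10_000))      # stand-in for a sorted TIMESTAMP column
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)    # the same values stored in unsorted order

# Predicate on "recent" data: the newest 100 values.
sorted_scan = blocks_to_scan(build_zone_map(timestamps), 9_900, 9_999)
unsorted_scan = blocks_to_scan(build_zone_map(shuffled), 9_900, 9_999)
print(sorted_scan)    # 1
print(unsorted_scan)  # many more blocks, since recent values are scattered
```

With sorted data the recent rows sit in one block, so the other 99 blocks are skipped without being read; unsorted data spreads them across many blocks and the zone map rules out far fewer.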

Choose a Distribution Key

§  Goal
    §  Distribute data evenly across nodes
    §  Minimize data movement among nodes: co-located joins and co-located aggregates

§  Best Practice
    §  Consider using join key as distribution key (JOIN clause)
    §  If multiple joins, use the foreign key of the largest dimension as distribution key
    §  Consider using GROUP BY column as distribution key (GROUP BY clause)

§  Avoid
    §  Keys used as equality filter as your distribution key

§  If de-normalized tables and no aggregates, do not specify a distribution key - Redshift will use round robin

Example

    -- Total Produce sold in Washington in January 2013
    SELECT SUM(S.Price * S.Quantity)
    FROM SALES S
    JOIN CATEGORY C ON C.ProductId = S.ProductId
    JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
    WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
    AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
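Why giving SALES and CATEGORY the same distribution key makes their join co-located can be shown with a small simulation. This is a sketch only: the CRC32 hash, the slice count of 4 and the sample rows are assumptions made for illustration, and Redshift's actual hashing is not described in these slides.

```python
from collections import defaultdict
import zlib

NUM_SLICES = 4  # assumption: number of slices, for illustration only

def slice_for(key, num_slices=NUM_SLICES):
    """Map a distribution key value to a slice with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_slices

def distribute(rows, dist_key):
    """Group rows by the slice their distribution key hashes to."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[dist_key])].append(row)
    return slices

def colocated(left, right, key):
    """True if every left row finds its join partner on the same slice."""
    for s, rows in left.items():
        local_keys = {r[key] for r in right.get(s, [])}
        if any(row[key] not in local_keys for row in rows):
            return False
    return True

# Hypothetical sample rows for the SALES and CATEGORY tables.
sales = [{"ProductId": p, "Price": 2.0, "Quantity": q}
         for p, q in [(1, 3), (2, 1), (3, 5), (1, 2)]]
category = [{"ProductId": p, "CategoryId": c}
            for p, c in [(1, "Produce"), (2, "Dairy"), (3, "Produce")]]

sales_slices = distribute(sales, "ProductId")
category_slices = distribute(category, "ProductId")
print(colocated(sales_slices, category_slices, "ProductId"))  # True
```

Because both tables hash the same ProductId to the same slice, each slice can join its own rows without moving data between nodes. FRANCHISE is distributed on FranchiseID, so the second join in the query does not get the same guarantee.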

Workload Manager

§  Allows you to manage and adjust query concurrency

§  WLM allows you to
    §  Increase query concurrency up to 15
    §  Define user groups and query groups
    §  Segregate short and long running queries
    §  Help improve performance of individual queries

§  Be aware: query workload is distributed to every compute node
    §  Increasing concurrency may not always help due to resource contention
        §  CPU, Memory and I/O
    §  Total throughput may increase by letting one query complete first and allowing other queries to wait

Workload Manager

§  Default: 1 queue with a concurrency of 5
§  Define up to 8 queues with a total concurrency of 15
§  Redshift has a super user queue internally
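The queue limits above can be checked before a configuration is applied. The sketch below is a hypothetical helper: the list-of-dicts shape and the queue names are invented for illustration and are not the format of Redshift's WLM parameter; only the two limits come from the slide.

```python
MAX_QUEUES = 8              # limit stated on the slide
MAX_TOTAL_CONCURRENCY = 15  # limit stated on the slide

def validate_wlm(queues):
    """Return a list of problems for queues given as {'name', 'concurrency'} dicts."""
    problems = []
    if len(queues) > MAX_QUEUES:
        problems.append(f"{len(queues)} queues defined, at most {MAX_QUEUES} allowed")
    total = sum(q["concurrency"] for q in queues)
    if total > MAX_TOTAL_CONCURRENCY:
        problems.append(f"total concurrency {total} exceeds {MAX_TOTAL_CONCURRENCY}")
    for q in queues:
        if q["concurrency"] < 1:
            problems.append(f"queue {q['name']!r} needs concurrency of at least 1")
    return problems

default = [{"name": "default", "concurrency": 5}]
split = [{"name": "short", "concurrency": 10},
         {"name": "long", "concurrency": 6}]
print(validate_wlm(default))  # []
print(validate_wlm(split))    # ['total concurrency 16 exceeds 15']
```

Separating short and long queries into their own queues, as in the second configuration, only works if the combined concurrency stays within the total of 15.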

Query Performance – Best Practices

§  Encode date and time using "TIMESTAMP" data type instead of "CHAR"

§  Specify constraints
    §  Redshift does not enforce constraints (primary key, foreign key, unique values) but the optimizer uses them
    §  Loading and/or applications need to be aware

§  Specify redundant predicate on the sort column

    SELECT * FROM tab1, tab2
    WHERE tab1.key = tab2.key
    AND tab1.timestamp > '1/1/2013'
    AND tab2.timestamp > '1/1/2013';

§  WLM settings

Summary

§  Avoid large number of singleton DML statements if possible

§  Use COPY for uploading large datasets

§  Choose Sort and Distribution keys with care

§  Encode date and time with TIMESTAMP data type

§  Experiment with WLM settings

More Information

Best Practices for Designing Tables http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html

Best Practices for Data Loading http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/

Thanks.

aws.amazon.com/big-data
