Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

58
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Shree Kenghe, Solution Architect, AWS Data warehousing in the era of Big Data: Intro to Amazon Redshift

Transcript of Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Page 1: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Shree Kenghe, Solution Architect, AWS

Data warehousing in the era of Big Data: Intro to Amazon Redshift

Page 2: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Agenda

• Introduction• Benefits• Use cases• Getting started• Q&A

Page 3: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

AnalyzeStore

Amazon Glacier

Amazon S3

Amazon DynamoDB

Amazon RDS, Amazon Aurora

AWS big data portfolio

AWS Data Pipeline

Amazon CloudSearch

Amazon EMR Amazon EC2

Amazon Redshift

Amazon MachineLearning

Amazon Elasticsearch

Service

AWS Database Migration Service

Amazon QuickSight

Amazon Kinesis

FirehoseAWS Import/Export

AWS Direct Connect

Collect

Amazon Kinesis Streams

New

NewNew

New

Page 4: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Relational data warehouse

Massively parallel; petabyte scale

Fully managed

HDD and SSD platforms

$1,000/TB/year; starts at $0.25/hour

Amazon Redshift

a lot fastera lot simplera lot cheaper

Page 5: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

The Amazon Redshift view of data warehousing

10x cheaper

Easy to provision

Higher DBA productivity

10x faster

No programming

Easily leverage BI tools, Hadoop, machine learning, streaming

Analysis inline with process flows

Pay as you go, grow as you need

Managed availability and disaster recovery

Enterprise Big data SaaS

Page 6: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.

Forrester Wave™ Enterprise Data Warehouse Q4 ’15

Page 7: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Selected Amazon Redshift customers

Page 8: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Amazon Redshift architecture

Leader nodeSimple SQL endpointStores metadataOptimizes query planCoordinates query execution

Compute nodesLocal columnar storageParallel/distributed execution of all queries, loads, backups, restores, resizes

Start at just $0.25/hour, grow to 2 PB (compressed)DC1: SSD; scale from 160 GB to 326 TBDS2: HDD; scale from 2 TB to 2 PB

SQL Clients/BI Tools

128GB RAM

16TB disk

16 cores

Ingestion/BackupBackupRestoreAmazon S3 / EMR / DynamoDB / SSH

JDBC/ODBC

10 GigE(HPC)

128GB RAM

16TB disk

16 coresCompute Node

128GB RAM

16TB disk

16 coresCompute Node

128GB RAM

16TB disk

16 coresCompute Node

LeaderNode

Page 9: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #1: Amazon Redshift is fast

Dramatically less I/OColumn storage

Data compression

Zone maps

Direct-attached storage

Large data block sizes

analyze compression listing;

Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw

10 | 13 | 14 | 26 |…

… | 100 | 245 | 324

375 | 393 | 417…

… 512 | 549 | 623

637 | 712 | 809 …

… | 834 | 921 | 959

10

324

375

623

637

959

Page 10: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #1: Amazon Redshift is fast

Parallel and distributedQuery

Load

Export

Backup

Restore

Resize

Amazon S3 / EMR / DynamoDB / SSH

128GB RAM

16TB disk

16 coresCompute Node

128GB RAM

16TB disk

16 coresCompute Node

128GB RAM

16TB disk

16 coresCompute Node

SQL Clients/BI Tools

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 coresLeaderNode

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 cores

CN

128GB RAM

48TB disk

16 coresLeaderNode

Page 11: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #1: Amazon Redshift is fast

Hardware optimized for I/O intensive workloads, 4 GB/sec/node

Enhanced networking, over 1 million packets/sec/node

Choice of storage type, instance size

Regular cadence of auto-patched improvements

Page 12: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #1: Amazon Redshift is fast

New Dense Storage (HDD) instance type (Jun 15)Improved memory 2x, compute 2x, disk throughput 1.5xCost: Same as our prior generation!Performance improvement: 50%

Enhanced I/O and commit improvements (Jan 16)Reduce amount of time to commit dataThroughput performance improvement: 35%

Improved memory allocation for query processing (May 16)Increased overall throughput by up to 60%

Page 13: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #2: Amazon Redshift is inexpensive

DS2 (HDD) Price per hour for DS2.XL single node

Effective annual price per TB compressed

On-demand $ 0.850 $ 3,7251 year reservation $ 0.500 $ 2,1903 year reservation $ 0.228 $ 999

DC1 (SSD) Price per hour for DC1.L single node

Effective annual price per TB compressed

On-demand $ 0.250 $ 13,6901 year reservation $ 0.161 $ 8,7953 year reservation $ 0.100 $ 5,500

Pricing is simpleNumber of nodes x price/hourNo charge for leader node No upfront costsPay as you go

Page 14: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Continuous/incremental backups

Multiple copies within cluster

Continuous and incremental backups to Amazon S3

Continuous and incremental backups across regions

Streaming restore

Amazon S3

Amazon S3

Region 1

Region 2

Compute Node

Compute Node

Compute Node

Page 15: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #3: Amazon Redshift is fully managed

Amazon S3

Amazon S3

Region 1

Region 2

Compute Node

Compute Node

Compute Node

Fault tolerance

Disk failures

Node failures

Network failures

Availability Zone/region level disasters

Page 16: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #4: Security is built-in• Load encrypted from S3

• SSL to secure data in transit

• ECDHE perfect forward security

• Amazon VPC for network isolation

• Encryption to secure data at rest

• All blocks on disks and in S3 encrypted• Block key, cluster key, master key (AES-256)• On-premises HSM & AWS CloudHSM support

• Audit logging and AWS CloudTrail integration

• SOC 1/2/3, PCI-DSS, FedRAMP, BAA

10 GigE(HPC)

IngestionBackupRestore

SQL Clients/BI Tools

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

128GB RAM

16TB disk

16 cores

Amazon S3/EMR/DynamoDB/SSH

Customer VPC

InternalVPC

JDBC/ODBC

LeaderNode

Compute Node

Compute Node

Compute Node

Page 17: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #5: We innovate quicklyWell over 125 new features added since launchRelease every two weeksAutomatic patching

Page 18: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #6: Redshift is powerful• Approximate functions

• User defined functions

• Machine learning

• Data science

Page 19: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #7: Amazon Redshift has a large ecosystem

Data integration Systems integratorsBusiness intelligence

Page 20: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Benefit #8: Service oriented architecture

DynamoDB

EMR

S3

EC2/SSH

RDS/Aurora

Amazon Redshift

Amazon Kinesis

Amazon ML

Data Pipeline

CloudSearch

Amazon Mobile

Analytics

Page 21: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Use cases

Page 22: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

NTT Docomo: Japan’s largest mobile service provider

68 million customersTens of TBs per day of data across a mobile network6 PB of total data (uncompressed)Data science for marketing operations, logistics, and so on

Greenplum on-premises

Scaling challengesPerformance issues

Need same level of securityNeed for a hybrid environment

Page 23: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

125 node DS2.8XL cluster4,500 vCPUs, 30 TB RAM2 PB compressed

10x faster analytic queries50% reduction in time for new BI application deployment Significantly less operations overhead

Data Source

ET

AWS Direct

Connect

Client

ForwarderLoaderState

Management

SandboxAmazon Redshift

S3

NTT Docomo: Japan’s largest mobile service provider

Page 24: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Nasdaq: powering 100 marketplaces in 50 countries

Orders, quotes, trade executions, market “tick” data from 7 exchanges7 billion rows/dayAnalyze market share, client activity, surveillance, billing, and so on

Microsoft SQL Server on-premises

Expensive legacy DW ($1.16 M/yr.)Limited capacity (1 yr. of data online)

Needed lower TCOMust satisfy multiple security and regulatory requirementsSimilar performance

Page 25: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

23 node DS2.8XL cluster828 vCPUs, 5 TB RAM368 TB compressed2.7 T rows, 900 B derived8 tables with 100 B rows7 man-month migration¼ the cost, 2x storage, room to growFaster performance, very secure

Nasdaq: powering 100 marketplaces in 50 countries

Page 26: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Getting started

Page 27: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Provisioning

Page 28: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Enter cluster details

Page 29: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Select node configuration

Page 30: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Select security settings and provision

Page 31: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Point-and-click resize

Page 32: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Resize

• Resize while remaining online

• Provision a new cluster in the background

• Copy data in parallel from node to node

• Only charged for source cluster

SQL Clients/BI Tools

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresLeaderNode

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresCompute Node

128GB RAM

48TB disk

16 coresLeaderNode

Page 33: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Data modeling

Page 34: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’

MIN: 01-JUNE-2013MAX: 20-JUNE-2013

MIN: 08-JUNE-2013MAX: 30-JUNE-2013

MIN: 12-JUNE-2013MAX: 20-JUNE-2013

MIN: 02-JUNE-2013MAX: 25-JUNE-2013

Unsorted tableMIN: 01-JUNE-2013MAX: 06-JUNE-2013

MIN: 07-JUNE-2013MAX: 12-JUNE-2013

MIN: 13-JUNE-2013MAX: 18-JUNE-2013

MIN: 19-JUNE-2013MAX: 24-JUNE-2013

Sorted by date

Sort K

eys

Zone maps

Page 35: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Sort K

eys

• Single column

• Compound

• Interleaved

Page 36: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Single ColumnSo

rt Key

s • Table is sorted by 1 column

Date Region Country

2-JUN-2015 Oceania New Zealand

2-JUN-2015 Asia Singapore

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Hong Kong

3-JUN-2015 Europe Germany

3-JUN-2015 Asia Korea

[ SORTKEY ( date ) ]

• Best for: • Queries that use 1st column (i.e. date) as primary filter• Can speed up joins and group bys• Quickest to VACUUM

Page 37: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

CompoundSo

rt Key

s • Table is sorted by 1st column , then 2nd column etc.

Date Region Country

2-JUN-2015 Africa Zaire

2-JUN-2015 Asia Korea

2-JUN-2015 Asia Singapore

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

3-JUN-2015 Asia Korea

[ SORTKEY COMPOUND ( date, region, country) ]

• Best for: • Queries that use 1st column as primary filter, then other cols• Can speed up joins and group bys• Slower to VACUUM

Page 38: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

InterleavedSo

rt Key

s • Equal weight is given to each column.

Date Region Country

2-JUN-2015 Africa Zaire

3-JUN-2015 Asia Singapore

2-JUN-2015 Asia Korea

2-JUN-2015 Europe Germany

3-JUN-2015 Asia Hong Kong

2-JUN-2015 Asia Korea

[ SORTKEY INTERLEAVED ( date, region, country) ]

• Best for: • Queries that use different columns in filter• Queries get faster the more columns used in the filter• Slowest to VACUUM

Page 39: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

• KEY

• ALL

• EVEN

Distrib

ution

Page 40: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

1

2

3

4

ID Gender Name

101 M John Smith

306 F Lisa Green

ID Gender Name

292 F Jane Jones

209 M James White

ID Gender Name

139 M Peter Black

164 M Brian Snail

ID Gender Name

446 M Pat Partridge

658 F Sarah Cyan

EVEN

Distrib

ution

DISTSTYLE EVEN

Page 41: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

KEY

1

2

3

4

ID Gender Name

101 M John Smith

306 F Lisa Green

ID Gender Name

292 F Jane Jones

209 M James White

ID Gender Name

139 M Peter Black

164 M Brian Snail

ID Gender Name

446 M Pat Partridge

658 F Sarah Cyan

Distrib

ution

DISTSTYLE KEY

Page 42: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

KEY

1

2

3

4

ID Gender Name

101 M John Smith

139 M Peter Black

446 M Pat Partridge

164 M Brian Snail

209 M James White

ID Gender Name

292 F Jane Jones

658 F Sarah Cyan

306 F Lisa Green

Distrib

ution

DISTSTYLE KEY

Page 43: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ID Gender Name

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M James White

306 F Lisa Green

1

2

3

4

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

101 M John Smith

292 F Jane Jones

139 M Peter Black

446 M Pat Partridge

658 F Sarah Cyan

164 M Brian Snail

209 M Lisa Green

306 F James White

ALL

Distrib

ution

DISTSTYLE ALL

Page 44: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

• KEY• Large Fact tables• Large dimension tables

• ALL • Medium dimension tables (1K – 2M)• Small dimension tables

• EVEN• Tables with no joins or group by

Distrib

ution

Page 45: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Loading data

Page 46: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

AWS CloudCorporate Data center

Amazon S3Amazon Redshift

Flat files

Data loading options

Page 47: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

AWS CloudCorporate Data center

ETL

Source DBs

Amazon Redshift

Amazon Redshift

Data loading options

Page 48: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

AWS Cloud

Amazon Redshift

Amazon Kinesis

Data loading options

Page 49: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Use multiple input files to maximize throughput

Use the COPY command

Each slice can load one file at a time

A single input file means only one slice is ingesting data

Instead of 100MB/s, you’re only getting 6.25MB/s

DS2.8XL Compute Node

Single Input File

Page 50: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Use multiple input files to maximize throughput

Use the COPY command

You need at least as many input files as you have slices

With 16 input files, all slices are working so you maximize throughput

Get 100MB/s per node; scale linearly as you add nodes

16 Input Files

DS2.8XL Compute Node

Page 51: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Querying

Page 52: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

JDBC/ODBC

Amazon Redshift

Amazon Redshift works with your existing BI tools

Page 53: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ODBC/JDBC

BI ClientsRedshift

Page 54: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

ODBC/JDBC

BI Server Redshift

Clients

Page 55: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

View explain plans

Page 56: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Monitor query performance

Page 58: Data Warehousing in the Era of Big Data: Intro to Amazon Redshift

Thank you!