©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved

Amazon Redshift Deep Dive

Ran Tessler, AWS Solutions Architect

Guest Speaker: Arik Fraimovich, EverythingMe Architect


Amazon Redshift Architecture

• Leader Node

– SQL endpoint

– Stores metadata

– Coordinates query execution

• Compute Nodes

– Local, columnar storage

– Execute queries in parallel

– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms

– Optimized for data processing

– Dense Storage: HDD; scale from 2TB to 2PB

– Dense Compute: SSD; scale from 160GB to 326TB

[Figure: architecture diagram with labels JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]


Amazon Redshift Architecture

• Massively Parallel Processing (MPP)

– Nodes are split into independent slices

– Each slice has a single virtual core, dedicated RAM, and storage

[Figure: a compute node with two slices (Slice 1, Slice 2), each with a virtual core, 7.5 GiB RAM, and a local disk; diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]


Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
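To make the column storage point concrete, here is a minimal Python sketch using the four example rows above. It is an illustration only, not Redshift's storage engine: the layout, the function names, and the "values read" counter are all assumptions made for this example.

```python
# Rows from the example table: (ID, Age, State, Amount).
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
COLUMNS = ("id", "age", "state", "amount")

# Columnar layout: one contiguous list per column (a simplification).
columnar = {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def sum_amount_row_store(rows):
    """Row layout: every field of every row is read to sum one column."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the whole row comes off disk
        total += row[3]          # but only Amount is needed
    return total, values_read


def sum_amount_column_store(columnar):
    """Column layout: only the Amount column is read."""
    values = columnar["amount"]
    return sum(values), len(values)


print(sum_amount_row_store(rows))        # (1250, 16)
print(sum_amount_column_store(columnar))  # (1250, 4)
```

Both functions return the same total, but the columnar version touches a quarter of the values, which is the I/O saving the slide refers to.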



Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

analyze compression listing;

Table   | Column         | Encoding
--------+----------------+----------
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
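As a rough intuition for why a delta-style encoding suits a column such as listid, the sketch below stores each value as the difference from the previous one. This is a simplified illustration of the general idea, not Redshift's on-disk format (which the slides do not describe); the sample IDs are hypothetical.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store each value as a difference
    from its predecessor. Near-sequential IDs yield small deltas."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    return list(accumulate(deltas))


# Hypothetical, nearly sequential listing IDs.
listids = [1000, 1001, 1002, 1005, 1006]
encoded = delta_encode(listids)
print(encoded)  # [1000, 1, 1, 3, 1]
```

After the first entry, every stored number fits in a single byte, whereas the raw IDs each need a wider integer.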


Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

– Track the minimum and maximum value for each block

– Skip over blocks that don’t contain the data needed for a given query

– Minimize unnecessary I/O

• Direct-attached storage

• Large data block sizes
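The zone map idea can be sketched in a few lines of Python: keep a (min, max) pair per block and skip any block whose range cannot contain the value being searched for. The block size, function names, and data below are assumptions chosen for illustration, not Redshift internals.

```python
BLOCK_SIZE = 4  # hypothetical; real blocks hold far more values


def build_zone_map(values, block_size=BLOCK_SIZE):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_equal(blocks, zone_map, target):
    """Return matching values and how many blocks actually had to be read."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # the zone map proves the block cannot match
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read


sorted_column = list(range(1, 17))
unsorted_column = [1, 16, 2, 15, 3, 14, 4, 13, 5, 12, 6, 11, 7, 10, 8, 9]

print(scan_equal(*build_zone_map(sorted_column), 10))    # ([10], 1)
print(scan_equal(*build_zone_map(unsorted_column), 10))  # ([10], 4)
```

With sorted data only one block is read; with the shuffled column every block's range covers the target, so nothing can be skipped. This is why zone maps and sort order work together.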


Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Large data block sizes

• Use direct-attached storage to maximize throughput

• Hardware optimized for high-performance data processing

• Large block sizes to make the most of each read

• Amazon Redshift manages durability for you


Data Modeling


Data Distribution

• Data is allocated to slices based on distribution style

– DISTSTYLE EVEN – Round robin

– DISTSTYLE KEY – Based on the hash value of the distribution key

– DISTSTYLE ALL – Replicated to slice 0 on all nodes

• Query performance considerations

– Uneven distribution harms query performance

– Data redistribution is expensive

[Figure: two compute nodes with four slices (Slice 1 to Slice 4) holding uneven record counts: 5M, 2M, 1M, and 4M records]
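The effect of distribution style on skew can be simulated with a short Python sketch. Everything here is an assumption for illustration: CRC32 stands in for whatever hash Redshift actually uses, the row data is made up, and `skew_ratio` is a hypothetical helper, not a Redshift metric.

```python
import zlib
from collections import Counter

NUM_SLICES = 4  # two nodes with two slices each, as in the figure


def distribute_even(rows, num_slices=NUM_SLICES):
    """DISTSTYLE EVEN stand-in: assign rows to slices round robin."""
    counts = Counter()
    for i, _ in enumerate(rows):
        counts[i % num_slices] += 1
    return counts


def distribute_key(rows, key, num_slices=NUM_SLICES):
    """DISTSTYLE KEY stand-in: assign rows by a hash of the key column."""
    counts = Counter()
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        counts[digest % num_slices] += 1
    return counts


def skew_ratio(counts, num_slices=NUM_SLICES):
    """Largest slice divided by the average slice size (1.0 = balanced)."""
    sizes = [counts.get(s, 0) for s in range(num_slices)]
    return max(sizes) / (sum(sizes) / num_slices)


# Hypothetical data: 75% of rows share one state value.
states = ["CA"] * 900 + ["WA"] * 100 + ["FL"] * 100 + ["NY"] * 100
rows = [{"id": i, "state": state} for i, state in enumerate(states)]

print(skew_ratio(distribute_even(rows)))          # 1.0
print(skew_ratio(distribute_key(rows, "state")))  # at least 3.0
```

A low-cardinality, skewed column makes a poor distribution key: all 900 "CA" rows hash to the same slice, and the query is only as fast as that slice.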


Suboptimal Distribution

Default (No Distribution Key, Round Robin Order)

ORDERS: Order 1: Dave Smith, Total $195

ITEMS: Item 1.1: Order 1, Kindle Fire HD 7”, $159

ITEMS: Item 1.2: Order 1, Kindle Fire Case, $36

[Figure: three compute nodes with six slices (Slice 1 to Slice 6); Orders 1 to 3 and Items 1.1, 1.2, 2.1, 2.2, and 3.1 are placed round robin, so an order and its items can land on different slices]


Optimal Distribution

Customised (ORDERS.ORDER_ID DISTKEY, ITEMS.ORDER_ID DISTKEY)

ORDERS: Order 1: Dave Smith, Total $195

ITEMS: Item 1.1: Order 1, Kindle Fire HD 7”, $159

ITEMS: Item 1.2: Order 1, Kindle Fire Case, $36

[Figure: three compute nodes with six slices (Slice 1 to Slice 6); each order is stored on the same slice as its items]
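The difference between the two layouts can be sketched as follows: when both tables are distributed on order_id, an order and its items hash to the same slice, so a join on order_id needs no data movement. The hash function (CRC32), the round-robin placement, and all names below are assumptions for illustration only.

```python
import zlib

NUM_SLICES = 6  # three nodes with two slices each, as in the figure

orders = [1, 2, 3]
# (item_id, order_id) pairs matching the figure's item labels.
items = [("1.1", 1), ("1.2", 1), ("2.1", 2), ("2.2", 2), ("3.1", 3)]


def slice_for_key(value, num_slices=NUM_SLICES):
    """Stand-in for hashing a distribution key to a slice."""
    return zlib.crc32(str(value).encode("utf-8")) % num_slices


def items_to_move(items, order_slices, item_slices):
    """Items stored on a different slice than their order; a join on
    order_id would have to redistribute these rows."""
    return [item_id for item_id, order_id in items
            if item_slices[item_id] != order_slices[order_id]]


# Round robin: each table is dealt out independently.
rr_orders = {order_id: i % NUM_SLICES for i, order_id in enumerate(orders)}
rr_items = {item_id: i % NUM_SLICES for i, (item_id, _) in enumerate(items)}

# Both tables distributed on order_id.
key_orders = {order_id: slice_for_key(order_id) for order_id in orders}
key_items = {item_id: slice_for_key(order_id) for item_id, order_id in items}

print(items_to_move(items, rr_orders, rr_items))    # ['1.2', '2.1', '2.2', '3.1']
print(items_to_move(items, key_orders, key_items))  # []
```

In this toy placement four of the five items would need to move for the join under round robin, and none when both tables share the distribution key.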


Sorting Table Data

• Sort Keys ≠ Index

– Data is initially written in INSERT/COPY order

– VACUUM sorts the rows and reclaims stale storage


Compound Sort Keys Illustrated

Records in Redshift are stored in blocks.

For this illustration, let’s assume that four records fill a block.

Records with a given cust_id are all in one block.

However, records with a given prod_id are spread across four blocks.

          prod_id
cust_id |   1   |   2   |   3   |   4
--------+-------+-------+-------+-------
   1    | [1,1] | [1,2] | [1,3] | [1,4]
   2    | [2,1] | [2,2] | [2,3] | [2,4]
   3    | [3,1] | [3,2] | [3,3] | [3,4]
   4    | [4,1] | [4,2] | [4,3] | [4,4]

[Figure: table with columns cust_id, prod_id, other columns, and blocks, showing the records in compound sort order]


Interleaved Sort Keys Illustrated

Records with a given cust_id are spread across two blocks.

Records with a given prod_id are also spread across two blocks.

Data is sorted in equal measures for both keys.

[Figure: the same 4×4 grid of [cust_id, prod_id] records, and a table with columns cust_id, prod_id, other columns, and blocks, showing the records in interleaved sort order]
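The block counts quoted in the two illustrations can be reproduced with a small Python sketch over the same 16 records. The compound case is a plain lexicographic sort. For the interleaved case the sketch uses a Z-order (bit-interleaving) key as an assumed stand-in, since the slides do not state the actual algorithm; all function names are hypothetical.

```python
from itertools import product

RECORDS_PER_BLOCK = 4  # "four records fill a block"

# All 16 (cust_id, prod_id) combinations from the grid.
records = list(product(range(1, 5), range(1, 5)))


def to_blocks(sorted_records, block_size=RECORDS_PER_BLOCK):
    """Cut a sorted record list into fixed-size blocks."""
    return [sorted_records[i:i + block_size]
            for i in range(0, len(sorted_records), block_size)]


def z_order_key(record):
    """Interleave the two bits of each (zero-based) id: c1 p1 c0 p0."""
    cust, prod = record[0] - 1, record[1] - 1
    key = 0
    for bit in (1, 0):
        key = (key << 1) | ((cust >> bit) & 1)
        key = (key << 1) | ((prod >> bit) & 1)
    return key


def blocks_touched(blocks, column, value):
    """Number of blocks containing at least one record with this value."""
    return sum(1 for block in blocks
               if any(record[column] == value for record in block))


compound = to_blocks(sorted(records))                     # (cust_id, prod_id)
interleaved = to_blocks(sorted(records, key=z_order_key))

CUST, PROD = 0, 1
print(blocks_touched(compound, CUST, 1), blocks_touched(compound, PROD, 1))        # 1 4
print(blocks_touched(interleaved, CUST, 1), blocks_touched(interleaved, PROD, 1))  # 2 2
```

The compound order is ideal for filters on the leading column and poor for the second one; the interleaved order gives up some of the former to make both filters equally selective.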


How to Use Sort Keys

• New keyword ‘INTERLEAVED’ when defining sort keys

– Existing syntax will still work and behavior is unchanged

– You can choose up to 8 columns to include and can query with any or all of them

• No change needed to queries

• Benefits are significant

[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]


Query Optimization


Query Performance

• A good choice of distribution and sort keys speeds up query performance more than any other factor

• Redshift uses a cost-based query optimizer

– Good statistics are VITAL to ensure good performance

– Table constraints, while not enforced, are used to optimize queries

• Run the ANALYZE command to update statistics:

ANALYZE lineitem;


Query Analysis

• Use the EXPLAIN command followed by the query:

EXPLAIN select avg(datediff(day, listtime, saletime)) as avgwait
from sales, listing where sales.listid = listing.listid;

QUERY PLAN
XN Aggregate  (cost=6350.30..6350.31 rows=1 width=16)
  ->  XN Hash Join DS_DIST_NONE  (cost=47.08..6340.89 rows=3766 width=16)
        Hash Cond: ("outer".listid = "inner".listid)
        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=12)
        ->  XN Hash  (cost=37.66..37.66 rows=3766 width=12)
              ->  XN Seq Scan on sales  (cost=0.00..37.66 rows=3766 width=12)

• From the EXPLAIN plan you can tell:

– Query execution steps

– Which operation is performed in each step

– Which table is used in each step

– How much data needs to be processed in each step
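As a small aid for reading plans like the one above, the sketch below pulls the operation, cost, row estimate, and width out of each plan line with a regular expression. The pattern is an assumption fitted to this sample output, not an official plan grammar, and `parse_plan` is a hypothetical helper.

```python
import re

PLAN_TEXT = """\
XN Aggregate  (cost=6350.30..6350.31 rows=1 width=16)
  ->  XN Hash Join DS_DIST_NONE  (cost=47.08..6340.89 rows=3766 width=16)
        Hash Cond: ("outer".listid = "inner".listid)
        ->  XN Seq Scan on listing  (cost=0.00..1924.97 rows=192497 width=12)
        ->  XN Hash  (cost=37.66..37.66 rows=3766 width=12)
              ->  XN Seq Scan on sales  (cost=0.00..37.66 rows=3766 width=12)
"""

# Matches lines such as "XN Seq Scan on sales  (cost=0.00..37.66 rows=3766 width=12)".
STEP_RE = re.compile(
    r"XN (?P<operation>.+?)\s+"
    r"\(cost=(?P<startup>\d+\.\d+)\.\.(?P<total>\d+\.\d+) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)


def parse_plan(plan_text):
    """Return one dict per plan step, in the order the steps are printed."""
    steps = []
    for line in plan_text.splitlines():
        match = STEP_RE.search(line)
        if match is None:
            continue  # detail lines such as "Hash Cond: ..." carry no cost
        steps.append({
            "operation": match.group("operation"),
            "total_cost": float(match.group("total")),
            "rows": int(match.group("rows")),
            "width": int(match.group("width")),
        })
    return steps


steps = parse_plan(PLAN_TEXT)
largest = max(steps, key=lambda step: step["rows"])
print(largest["operation"], largest["rows"])  # Seq Scan on listing 192497
```

Sorting the parsed steps by estimated rows or cost is a quick way to spot which step processes the most data.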


Query Analysis

• Access the STL_EXPLAIN table for executed queries:

select query, nodeid, parentid, substring(plannode from 1 for 30),
substring(info from 1 for 20) from stl_explain
where query=10 order by 1,2;

query | nodeid | parentid | substring                      | substring
------+--------+----------+--------------------------------+---------------------
   10 |      1 |        0 | XN Aggregate (cost=6350.30...  |
   10 |      2 |        1 | -> XN Merge Join DS_DIST_NO    | Merge Cond: ("outer"
   10 |      3 |        2 | -> XN Seq Scan on lis          |
   10 |      4 |        2 | -> XN Seq Scan on sal          |

• SVL_QUERY_SUMMARY and SVL_QUERY_REPORT for finer details


Query Analysis

• Explain plans and performance metrics are also available via the console


Query Analysis

• Explain Plan Visualization is now also available


Expanding Amazon Redshift’s Functionality


New Dense Storage Instance

DS2, based on EC2’s D2, has twice the memory and CPU of DS1 (formerly DW1)

Migrate from DS1 to DS2 by restoring from snapshot. We will help you migrate your RIs

• Twice the memory and compute power of DS1

• Enhanced networking and 50% gain in disk throughput

• 40% to 60% performance gain over DS1

• Available in two node types: XL (2TB) and 8XL (16TB)


User Defined Functions

• We’re enabling User Defined Functions (UDFs) so you can add your own

– Scalar and Aggregate Functions supported

• You’ll be able to write UDFs using Python 2.7

– Syntax is largely identical to PostgreSQL UDF Syntax

– System and network calls within UDFs are prohibited

• Comes with Pandas, NumPy, and SciPy pre-installed

– You’ll also be able to import your own libraries for even more flexibility


Scalar UDF example – URL parsing

-- Argument name comes first, then its data type.
CREATE FUNCTION f_hostname (url VARCHAR)
RETURNS varchar
IMMUTABLE AS $$
    import urlparse
    return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
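Because the UDF body is ordinary Python, its logic can be checked locally before the function is created in the cluster. The sketch below is a Python 3 stand-in: it uses `urllib.parse`, the Python 3 home of Python 2's `urlparse` module, and adds a `None` guard that is not part of the slide's UDF.

```python
from urllib.parse import urlparse  # Python 2.7 UDFs use "import urlparse" instead


def f_hostname(url):
    """Local stand-in for the UDF body: return the host part of a URL."""
    if url is None:  # assumption: treat a SQL NULL input as None
        return None
    return urlparse(url).hostname


print(f_hostname("http://redash.io/"))                                 # redash.io
print(f_hostname("https://github.com/EverythingMe/redshift_console"))  # github.com
```

Strings without a scheme and host, such as plain words, yield `None`, which would surface as NULL in query results.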


Amazon Redshift

Spend time with your data, not your database….


Lessons Learned from 2 Years with Redshift

Arik Fraimovich @arikfr


AWS Summit Tel Aviv 2015


One main events table: fact_events


Common analytics question:

retention with different dimensions


Original Query Execution time:

~2-3 minutes


x3

Before


After

Correct use of distribution keys

Window Functions

Result: 5-7 seconds execution time


https://github.com/EverythingMe/redshift_console


EverythingMe Hailo SoundCloud GrubHub

Bringg Yallo FundBox MyPermissions

Collabspot Gini Voxel Complete Labs

MyFitnessPal Life360 CrowdTilt Ravello

NextPeer Interlude General Assembly Exelate

ironSource properati.com PacketZoom FullBottle Group

The Public Knowledge Workshop


Thank you.

Arik Fraimovich @arikfr

http://redash.io/

http://github.com/EverythingMe/redshift_console