©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Amazon Redshift Deep Dive
Ran Tessler, AWS Solutions Architect
Guest Speaker: Arik Fraimovich, EverythingMe Architect
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– Dense Storage: HDD; scale from 2TB to 2PB
– Dense Compute: SSD; scale from 160GB to 326TB
[Diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]
• Massive Parallel Processing (MPP)
– Nodes are split into independent slices
– Each slice has a single virtual core, dedicated RAM and storage
Amazon Redshift Architecture
[Diagram: a compute node is split into Slice 1 and Slice 2; each slice has a virtual core, 7.5 GiB RAM, and local disk. JDBC/ODBC access, the 10 GigE (HPC) interconnect, and ingestion/backup/restore via Amazon S3 are shown as before]
Amazon Redshift dramatically reduces I/O
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
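Recommendations like those above can be applied in the table DDL. A sketch only: the column types below follow the standard TICKIT sample schema, and you should verify encodings against your own data with ANALYZE COMPRESSION:

```sql
-- Illustrative DDL applying the recommended encodings above
CREATE TABLE listing (
    listid          INTEGER       ENCODE delta,
    sellerid        INTEGER       ENCODE delta32k,
    eventid         INTEGER       ENCODE delta32k,
    dateid          SMALLINT      ENCODE bytedict,
    numtickets      SMALLINT      ENCODE bytedict,
    priceperticket  DECIMAL(8,2)  ENCODE delta32k,
    totalprice      DECIMAL(8,2)  ENCODE mostly32,
    listtime        TIMESTAMP     ENCODE raw
);
```

If you omit ENCODE clauses, COPY can also choose encodings automatically on first load.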
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Direct-attached storage
• Large data block sizes
• Tracks the minimum and maximum value for each block
• Skips over blocks that don’t contain the data needed for a given query
• Minimizes unnecessary I/O
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Large data block sizes
• Use direct-attached storage to maximize throughput
• Hardware optimized for high-performance data processing
• Large block sizes to make the most of each read
• Amazon Redshift manages durability for you
Data Modeling
Data Distribution
• Data is allocated to slices based on distribution style
– DISTSTYLE EVEN – round robin
– DISTSTYLE KEY – based on the hash value of the distribution key
– DISTSTYLE ALL – replicated to slice 0 on all nodes
• Query performance considerations
– Uneven distribution harms query performance
– Data redistribution at query time is expensive
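The distribution styles above are chosen at table creation. A hypothetical sketch (table and column names are illustrative, not from the deck):

```sql
-- Large fact table: hash rows on the join column
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    total       DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (order_id);   -- each row hashes on order_id to a slice

-- Small, frequently joined dimension table: replicate everywhere
CREATE TABLE calendar (
    dateid  SMALLINT,
    caldate DATE
)
DISTSTYLE ALL;        -- full copy on slice 0 of every node
```

DISTSTYLE EVEN (round robin) is the default when no key is specified.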
[Diagram: two compute nodes, four slices holding 5M, 2M, 1M, and 4M records respectively: a skewed, uneven distribution]
Suboptimal Distribution
Default (No Distribution Key, Round Robin Order)
[Diagram: across three compute nodes (six slices), ORDERS and ITEMS rows for the same order land on different slices, so joins must redistribute data]
Order 1: Dave Smith, Total $195
Item 1.1: Order 1, Kindle Fire HD 7”, $159
Item 1.2: Order 1, Kindle Fire Case, $36
Optimal Distribution
Customized (ORDERS.ORDER_ID DISTKEY, ITEMS.ORDER_ID DISTKEY)
Order 1: Dave Smith, Total $195
Item 1.1: Order 1, Kindle Fire HD 7”, $159
Item 1.2: Order 1, Kindle Fire Case, $36
[Diagram: each order and its items hash to the same slice, so the join between ORDERS and ITEMS runs locally on every slice]
Sorting Table Data
• Sort Keys ≠ Index
– Data is initially written in INSERT/COPY order
– VACUUM sorts the rows and reclaims stale storage
Compound Sort Keys Illustrated
Records in Redshift are stored in blocks. For this illustration, let’s assume that four records fill a block.
With a compound sort key on (cust_id, prod_id), records with a given cust_id are all in one block; however, records with a given prod_id are spread across four blocks.
[Diagram: a 4×4 grid of (cust_id, prod_id) values showing rows sorted by cust_id first, then prod_id, and how they pack into blocks]
Interleaved Sort Keys Illustrated
With an interleaved sort key on the same columns, records with a given cust_id are spread across two blocks, and records with a given prod_id are also spread across two blocks: data is sorted in equal measures for both keys.
[Diagram: the same 4×4 grid of (cust_id, prod_id) values, with each block holding a balanced mix of both key values]
How to Use Sort Keys
• New keyword ‘INTERLEAVED’ when defining sort keys
– Existing syntax will still work and behavior is unchanged
– You can choose up to 8 columns to include and can query with any or all of them
• No change needed to queries
• Benefits are significant
[ SORTKEY [ COMPOUND | INTERLEAVED ] ( column_name [, ...] ) ]
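A sketch of the two variants side by side (table and column names are hypothetical, chosen to match the cust_id/prod_id illustration above):

```sql
-- Compound sort key: leading column dominates; favors filters on cust_id
CREATE TABLE clicks_compound (
    cust_id BIGINT,
    prod_id BIGINT,
    clicked TIMESTAMP
)
COMPOUND SORTKEY (cust_id, prod_id);

-- Interleaved sort key: equal weight to every sort column
CREATE TABLE clicks_interleaved (
    cust_id BIGINT,
    prod_id BIGINT,
    clicked TIMESTAMP
)
INTERLEAVED SORTKEY (cust_id, prod_id);
```

Interleaved keys suit tables queried through several different filter columns; compound keys remain the better choice when queries consistently filter on the leading column.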
Query Optimization
Query Performance
• A good choice of distribution and sort keys speeds query performance more than any other factor
• Amazon Redshift uses a cost-based query optimizer
– Good statistics are VITAL to ensure good performance
– Table constraints, while not enforced, are used to optimize queries
• Run the ANALYZE command to update statistics:
ANALYZE lineitem;
Query Analysis
• EXPLAIN command followed by the query:
EXPLAIN
select avg(datediff(day, listtime, saletime)) as avgwait
from sales, listing
where sales.listid = listing.listid;
QUERY PLAN
XN Aggregate (cost=6350.30..6350.31 rows=1 width=16)
-> XN Hash Join DS_DIST_NONE (cost=47.08..6340.89 rows=3766 width=16)
Hash Cond: ("outer".listid = "inner".listid)
-> XN Seq Scan on listing (cost=0.00..1924.97 rows=192497 width=12)
-> XN Hash (cost=37.66..37.66 rows=3766 width=12)
-> XN Seq Scan on sales (cost=0.00..37.66 rows=3766 width=12)
• From the EXPLAIN plan you can tell:
– The query execution steps
– Which operation is performed in each step
– Which table is used in each step
– How much data needs to be processed in each step
Query Analysis
• Access the STL_EXPLAIN table for executed queries:
select query, nodeid, parentid, substring(plannode from 1 for 30),
       substring(info from 1 for 20)
from stl_explain
where query = 10 order by 1, 2;
query | nodeid | parentid | substring | substring
------+--------+----------+---------------------+---------------------
10 | 1 | 0 | XN Aggregate (cost=6350.30... |
10 | 2 | 1 | -> XN Merge Join DS_DIST_NO | Merge Cond: ("outer"
10 | 3 | 2 | -> XN Seq Scan on lis |
10 | 4 | 2 | -> XN Seq Scan on sal |
• SVL_QUERY_SUMMARY and SVL_QUERY_REPORT for finer details
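For example, SVL_QUERY_SUMMARY exposes per-step metrics for an executed query. A sketch (the column set here reflects the AWS documentation; check your cluster's view definition):

```sql
-- Per-step metrics for one query: rows and bytes processed at each step
select stm, seg, step, label, rows, bytes
from svl_query_summary
where query = 10
order by stm, seg, step;
```

Large `rows`/`bytes` values on a scan step, or a step flagged as disk-based, point to the part of the plan worth tuning.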
Query Analysis
• Explain plans and performance metrics are also available via
the console:
Query Analysis
• Explain Plan Visualization is now also available
Expanding Amazon Redshift’s Functionality
New Dense Storage Instance
DS2, based on EC2’s D2 instances, has twice the memory and CPU of DS1 (formerly DW1)
Migrate from DS1 to DS2 by restoring from a snapshot. We will help you migrate your RIs
• Twice the memory and compute power of DS1
• Enhanced networking and a 50% gain in disk throughput
• 40% to 60% performance gain over DS1
• Available in two node types: XL (2TB) and 8XL (16TB)
User Defined Functions
• We’re enabling User Defined Functions (UDFs) so you can add your own
– Scalar and Aggregate Functions supported
• You’ll be able to write UDFs using Python 2.7
– Syntax is largely identical to PostgreSQL UDF syntax
– System and network calls within UDFs are prohibited
• Comes with Pandas, NumPy, and SciPy pre-installed
– You’ll also be able to import your own libraries for even more flexibility
Scalar UDF example – URL parsing
CREATE FUNCTION f_hostname (VARCHAR url)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
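Once created, the UDF is called like any built-in scalar function. A hypothetical usage (the `weblog` table and its `url` column are assumed, not from the deck):

```sql
-- Hypothetical: top hostnames by hit count from an assumed weblog table
SELECT f_hostname(url) AS hostname,
       COUNT(*)        AS hits
FROM weblog
GROUP BY 1
ORDER BY hits DESC
LIMIT 10;
```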
Amazon Redshift
Spend time with your data, not your database….
Lessons Learned from 2 Years with Redshift
Arik Fraimovich @arikfr
AWS Summit Tel Aviv 2015
One main events table: fact_events
Common analytics question:
retention with different dimensions
Original query execution time: ~2-3 minutes
[Chart: query times before vs. after the optimizations]
• Correct use of distribution keys
• Window functions
Result: 5-7 seconds execution time
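The deck does not show the rewritten query, but a rough sketch of the window-function approach over fact_events might look like the following (the user_id and event_date columns, and this retention shape, are assumptions for illustration):

```sql
-- Sketch only: compute each user's cohort date in one pass
-- instead of a self-join, then count active users per day offset
WITH first_seen AS (
    SELECT user_id,
           event_date,
           MIN(event_date) OVER (PARTITION BY user_id) AS cohort_date
    FROM fact_events
)
SELECT cohort_date,
       DATEDIFF(day, cohort_date, event_date) AS day_n,
       COUNT(DISTINCT user_id)                AS active_users
FROM first_seen
GROUP BY 1, 2
ORDER BY 1, 2;
```

With fact_events distributed on user_id, the window partition never crosses slices, which is consistent with the speedup described above.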
https://github.com/EverythingMe/redshift_console
EverythingMe, Hailo, SoundCloud, GrubHub, Bringg, Yallo, FundBox, MyPermissions, Collabspot, Gini, Voxel, Complete Labs, MyFitnessPal, Life360, CrowdTilt, Ravello, NextPeer, Interlude, General Assembly, Exelate, ironSource, properati.com, PacketZoom, FullBottle Group, The Public Knowledge Workshop
Thank you.
Arik Fraimovich @arikfr
http://redash.io/
http://github.com/EverythingMe/redshift_console