SRV405 Deep Dive on Amazon Redshift
-
Upload
amazon-web-services -
Category
Technology
-
view
71 -
download
1
Transcript of SRV405 Deep Dive on Amazon Redshift
![Page 1: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/1.jpg)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tony Gibbs Data Warehousing Solutions Architect
April 19, 2017
Deep Dive on Amazon Redshift Query Lifecycle and Storage Subsystem
![Page 2: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/2.jpg)
Deep Dive Overview
• Amazon Redshift history and development • Cluster architecture • Concepts and terminology • Storage deep dive • Query lifecycle • New & upcoming features • Open Q&A
![Page 3: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/3.jpg)
Amazon Redshift History & Development
![Page 4: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/4.jpg)
Columnar
MPP
OLAP
AWS Identity and Access
Management (IAM)
Amazon VPC
Amazon SWF
Amazon S3 AWS KMS Amazon Route 53
Amazon CloudWatch
Amazon EC2
PostgreSQL
Amazon Redshift
![Page 5: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/5.jpg)
February 2013
April 2017
> 100 Significant Patches
> 150 Significant Features
![Page 6: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/6.jpg)
Amazon Redshift Cluster Architecture
![Page 7: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/7.jpg)
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing Leader node
• SQL endpoint • Stores metadata • Coordinates parallel SQL processing
Compute nodes • Local, columnar storage • Executes queries in parallel • Load, backup, restore
10 GigE (HPC)
Ingestion Backup Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
S3 / EMR / DynamoDB / SSH
JDBC/ODBC
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
128GB RAM
16TB disk
16 cores Compute Node
Leader Node
![Page 8: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/8.jpg)
Compute & Leader Node Components
![Page 9: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/9.jpg)
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
Leader Node
![Page 10: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/10.jpg)
• Parser & rewriter • Planner & optimizer • Code generator
• Input: optimized plan • Output: >=1 C++
functions • Compiler
• Task scheduler • WLM
• Admission • Scheduling
• PostgreSQL catalog tables
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
Leader Node
![Page 11: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/11.jpg)
• Query execution processes • Backup & restore processes • Replication processes • Local Storage
• Disks • Slices • Tables
• Columns • Blocks
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Leader Node
Compute Node Compute Node Compute Node
![Page 12: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/12.jpg)
Concepts & Terminology
![Page 13: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/13.jpg)
Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt
CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date );
aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14
• Accessing dt with row storage: o Need to read everything o Unnecessary I/O
![Page 14: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/14.jpg)
Designed for I/O Reduction
aid loc dt
Designed for I/O Reduction Columnar storage Data compression Zone maps
CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date );
aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14
• Accessing dt with columnar storage o Only scan blocks for relevant column
![Page 15: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/15.jpg)
Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt
CREATE TABLE deep_dive ( aid INT ENCODE LZO ,loc CHAR(3) ENCODE BYTEDICT ,dt DATE ENCODE RUNLENGTH );
aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14
• Columns grow and shrink independently • Reduces storage requirements • Reduces I/O
![Page 16: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/16.jpg)
Designed for I/O Reduction Columnar storage Data compression Zone maps
aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14
aid loc dt
CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date );
• In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which cannot
contain data for a given query • Eliminates unnecessary I/O
![Page 17: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/17.jpg)
SELECT COUNT(*) FROM deep_dive WHERE dt = '09-JUNE-2013'
MIN: 01-JUNE-2013
MAX: 20-JUNE-2013
MIN: 08-JUNE-2013
MAX: 30-JUNE-2013
MIN: 12-JUNE-2013
MAX: 20-JUNE-2013
MIN: 02-JUNE-2013
MAX: 25-JUNE-2013
Unsorted Table MIN: 01-JUNE-2013
MAX: 06-JUNE-2013
MIN: 07-JUNE-2013
MAX: 12-JUNE-2013
MIN: 13-JUNE-2013
MAX: 18-JUNE-2013
MIN: 19-JUNE-2013
MAX: 24-JUNE-2013
Sorted By Date
Zone Maps
![Page 18: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/18.jpg)
Terminology and Concepts: Data Sorting • Goals:
• Physically order rows of table data based on certain column(s) • Optimize effectiveness of zone maps • Enable MERGE JOIN operations
• Impact:
• Enables rrscans to prune blocks by leveraging zone maps • Overall reduction in block I/O
• Achieved with the table property SORTKEY defined over one or more columns
• Optimal SORTKEY is dependent on:
• Query patterns • Data profile • Business requirements
![Page 19: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/19.jpg)
Terminology and Concepts: Slices
A slice can be thought of like a “virtual compute node” • Unit of data partitioning • Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices • Table rows are distributed to slices • A slice processes only its own data
![Page 20: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/20.jpg)
Data Distribution
• Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster: • KEY: Value is hashed, same value goes to same location (slice) • ALL: Full table data goes to first slice of every node • EVEN: Round robin
• Goals: • Distribute data evenly for parallel processing • Minimize data movement during query processing
KEY
ALL Node 1
Slice 1
Slice 2
Node 2
Slice 3
Slice 4
Node 1
Slice 1
Slice 2
Node 2
Slice 3
Slice 4
Node 1
Slice 1
Slice 2
Node 2
Slice 3
Slice 4
EVEN
![Page 21: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/21.jpg)
Data Distribution: Example CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) DISTSTYLE (EVEN|KEY|ALL);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
Table: deep_dive User
Columns System
Columns aid loc dt ins del row
![Page 22: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/22.jpg)
Data Distribution: EVEN Example CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) DISTSTYLE EVEN;
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO deep_dive VALUES (1, 'SFO', '2016-09-01'), (2, 'JFK', '2016-09-14'), (3, 'SFO', '2017-04-01'), (4, 'JFK', '2017-05-14');
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 0 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
Rows: 1 Rows: 1 Rows: 1 Rows: 1
![Page 23: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/23.jpg)
Data Distribution: KEY Example #1 CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) DISTSTYLE KEY DISTKEY (loc);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO deep_dive VALUES (1, 'SFO', '2016-09-01'), (2, 'JFK', '2016-09-14'), (3, 'SFO', '2017-04-01'), (4, 'JFK', '2017-05-14');
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 2 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
Rows: 0 Rows: 1
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 2 Rows: 0 Rows: 1
![Page 24: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/24.jpg)
Data Distribution: KEY Example #2 CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) DISTSTYLE KEY DISTKEY (aid);
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO deep_dive VALUES (1, 'SFO', '2016-09-01'), (2, 'JFK', '2016-09-14'), (3, 'SFO', '2017-04-01'), (4, 'JFK', '2017-05-14');
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 0 Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
Rows: 1 Rows: 1 Rows: 1 Rows: 1
![Page 25: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/25.jpg)
Data Distribution: ALL Example CREATE TABLE loft_deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) DISTSTYLE ALL;
CN1
Slice 0 Slice 1
CN2
Slice 2 Slice 3
INSERT INTO deep_dive VALUES (1, 'SFO', '2016-09-01'), (2, 'JFK', '2016-09-14'), (3, 'SFO', '2017-04-01'), (4, 'JFK', '2017-05-14');
Rows: 0 Rows: 0
(3 User Columns + 3 System Columns) x (2 slice) = 12 Blocks (12 MB)
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 1 Rows: 2 Rows: 4 Rows: 3
Table: loft_deep_dive User Columns System Columns
aid loc dt ins del row
Rows: 0 Rows: 1 Rows: 2 Rows: 4 Rows: 3
![Page 26: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/26.jpg)
Terminology and Concepts: Data Distribution
KEY • The key creates an even distribution of data • Joins are performed between large fact/dimension tables • Optimizing merge joins and group by
ALL • Small and medium size dimension tables (< 2-3M)
EVEN • When key cannot produce an even distribution
![Page 27: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/27.jpg)
Storage Deep Dive
![Page 28: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/28.jpg)
Storage Deep Dive: Disks
• Amazon Redshift utilizes locally attached storage devices • Compute nodes have 2.5-3x the advertised storage capacity
• 1, 3, 8, or 24 disks depending on node type • Each disk is split into two partitions
• Local data storage, accessed by local CN • Mirrored data, accessed by remote CN
• Partitions are raw devices • Local storage devices are ephemeral in nature • Tolerant to multiple disk failures on a single node
![Page 29: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/29.jpg)
Storage Deep Dive: Blocks
Column data is persisted to 1 MB immutable blocks Each block contains in-memory metadata:
• Zone Maps (MIN/MAX value) • Location of previous/next block • Blocks are individually compressed with 1 of 11 encodings
A full block contains between 16 and 8.4 million values
![Page 30: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/30.jpg)
Storage Deep Dive: Columns
• Column: Logical structure accessible via SQL • Physical structure is a doubly linked list of blocks • These blockchains exist on each slice for each column • All sorted & unsorted blockchains compose a column • Column properties include:
• Distribution Key • Sort Key • Compression Encoding
• Columns shrink and grow independently, 1 block at a time • Three system columns per table-per slice for MVCC
![Page 31: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/31.jpg)
Block Properties: Design Considerations
• Small writes: • Batch processing system, optimized for processing massive data • For 1MB size + immutable blocks, we clone blocks on write to avoid
fragmentation • Small write (~1-10 rows) has similar cost to a larger write (~100 K
rows)
• UPDATE and DELETE: • Immutable blocks means that we only logically delete rows on
UPDATE or DELETE • Must VACUUM or DEEP COPY to remove ghost rows from table
![Page 32: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/32.jpg)
Column Properties: Design Considerations
• Compression: • COPY automatically analyzes and compresses data when loading into empty
tables • ANALYZE COMPRESSION checks existing tables and proposes optimal
compression algorithms for each column • Changing column encoding requires a table rebuild
• DISTKEY and SORTKEY influence performance (orders of magnitude) • Distribution keys:
• A poor DISTKEY can introduce data skew and an unbalanced workload • A query completes only as fast as the slowest slice completes
• Sort keys: • A sort key is only effective as the data profile allows it to be • Selectivity needs to be considered
![Page 33: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/33.jpg)
Query Lifecycle
![Page 34: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/34.jpg)
Terminology and Concepts: Slices
A slice can be thought of like a “virtual compute node” • Unit of data partitioning • Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices • Table rows are distributed to slices • A slice processes only its own data
![Page 35: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/35.jpg)
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
Leader Node
• Parser & Rewriter • Planner & Optimizer • Code Generator
• Input: Optimized plan • Output: >=1 C++
functions • Compiler
• Task Scheduler • WLM
• Admission • Scheduling
• PostgreSQL Catalog Tables
• Amazon Redshift System Tables (STV)
![Page 36: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/36.jpg)
Query Execution Terminology
Step: • An individual operation needed during query execution • Steps are combined to allow compute nodes to perform a join • Examples: scan, sort, hash, aggr Segment: • A combination of several steps that can be done by a single process • Smallest compilation unit executable by a slice • Segments within a stream run in parallel Stream: • A collection of combined segments • Output to the next stream or SQL client
![Page 37: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/37.jpg)
Visualizing Streams, Segments, and Steps Stream 0
Segment 0 Step 0 Step 1 Step 2
Segment 1 Step 0 Step 1 Step 2 Step 3 Step 4
Segment 2 Step 0 Step 1 Step 2 Step 3
Segment 3 Step 0 Step 1 Step 2 Step 3 Step 4 Step 5
Stream 1 Segment 4
Step 0 Step 1 Step 2 Step 3 Segment 5
Step 0 Step 1 Step 2 Segment 6
Step 0 Step 1 Step 2 Step 3 Step 4 Stream 2
Segment 7 Step 0 Step 1
Segment 8 Step 0 Step 1
Time
![Page 38: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/38.jpg)
Query Lifecycle client
JDBC ODBC
Leader Node
Parser
Query Planner
Code Generator
Final Computations
Generate code for all segments of one stream
Explain Plans
Compute Node
Receive Compiled Code
Run the Compiled Code
Return results to Leader
Compute Node
Receive Compiled Code
Run the Compiled Code
Return results to Leader
Return results to client
Segments in a stream are executed concurrently. Each step in a segment is executed serially.
![Page 39: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/39.jpg)
Query Execution Deep Dive: Leader Node
• Leader node receives a query and parses the SQL • Parser produces a logical representation of original query • This query tree is input into the query optimizer (volt) • Volt rewrites the query to maximize its efficiency • Sometimes a single query is rewritten as several dependent statements in the
background • Rewritten query is sent to the planner which generates 1+ query plans for the
execution with the best estimated performance • Query plan is sent to execution engine, where it’s translated into steps, segments,
and streams • Translated plan is sent to the code generator, which generates a C++ function for
each segment • Generated C++ is compiled with gcc to a .o file and distributed to the compute nodes.
![Page 40: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/40.jpg)
Query Execution Deep Dive: Compute Nodes
• Slices execute query segments in parallel • Executable segments are created for one stream at a
time in sequence • When the compute nodes are done, they return the
query results to leader node for final processing • Leader node merges data into a single result set and
addresses any needed sorting or aggregation • Leader node then returns results to the client
![Page 41: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/41.jpg)
Visualizing Streams, Segments, and Steps Stream 0
Segment 0 Step 0 Step 1 Step 2
Segment 1 Step 0 Step 1 Step 2 Step 3 Step 4
Segment 2 Step 0 Step 1 Step 2 Step 3
Segment 3 Step 0 Step 1 Step 2 Step 3 Step 4 Step 5
Stream 1 Segment 4
Step 0 Step 1 Step 2 Step 3 Segment 5
Step 0 Step 1 Step 2 Segment 6
Step 0 Step 1 Step 2 Step 3 Step 4 Stream 2
Segment 7 Step 0 Step 1
Segment 8 Step 0 Step 1
Time
![Page 42: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/42.jpg)
Query Execution Stream 0
Segment 0 Step 0 Step 1 Step 2
Segment 1 Step 0 Step 1 Step 2 Step 3 Step 4
Segment 2 Step 0 Step 1 Step 2 Step 3
Segment 3 Step 0 Step 1 Step 2 Step 3 Step 4 Step 5
Stream 1 Segment 4
Step 0 Step 1 Step 2 Step 3 Segment 5
Step 0 Step 1 Step 2 Segment 6
Step 0 Step 1 Step 2 Step 3 Step 4 Stream 2
Segment 7 Step 0 Step 1
Segment 8 Step 0 Step 1
Time
Stream 0 Segment 0
Step 0 Step 1 Step 2 Segment 1
Step 0 Step 1 Step 2 Step 3 Step 4 Segment 2
Step 0 Step 1 Step 2 Step 3 Segment 3
Step 0 Step 1 Step 2 Step 3 Step 4 Step 5 Stream 1
Segment 4 Step 0 Step 1 Step 2 Step 3
Segment 5 Step 0 Step 1 Step 2
Segment 6 Step 0 Step 1 Step 2 Step 3 Step 4
Stream 2 Segment 7
Step 0 Step 1 Segment 8
Step 0 Step 1
Stream 0 Segment 0
Step 0 Step 1 Step 2 Segment 1
Step 0 Step 1 Step 2 Step 3 Step 4 Segment 2
Step 0 Step 1 Step 2 Step 3 Segment 3
Step 0 Step 1 Step 2 Step 3 Step 4 Step 5 Stream 1
Segment 4 Step 0 Step 1 Step 2 Step 3
Segment 5 Step 0 Step 1 Step 2
Segment 6 Step 0 Step 1 Step 2 Step 3 Step 4
Stream 2 Segment 7
Step 0 Step 1 Segment 8
Step 0 Step 1
Stream 0 Segment 0
Step 0 Step 1 Step 2 Segment 1
Step 0 Step 1 Step 2 Step 3 Step 4 Segment 2
Step 0 Step 1 Step 2 Step 3 Segment 3
Step 0 Step 1 Step 2 Step 3 Step 4 Step 5 Stream 1
Segment 4 Step 0 Step 1 Step 2 Step 3
Segment 5 Step 0 Step 1 Step 2
Segment 6 Step 0 Step 1 Step 2 Step 3 Step 4
Stream 2 Segment 7
Step 0 Step 1 Segment 8
Step 0 Step 1
Slic
es
0
1
2
3
![Page 43: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/43.jpg)
Design considerations for Redshift slices
DS2.8XL Compute Node
Ingestion Throughput: • Each slice’s query processors can load one file at a time:
• Streaming decompression • Parse • Distribute • Write
Realizing only partial node usage as 6.25% of slices are active
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
![Page 44: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/44.jpg)
Design considerations for Amazon Redshift slices
Use at least as many input files as there are slices in the cluster With 16 input files, all slices are working so you maximize throughput COPY continues to scale linearly as you add nodes
16 Input Files
DS2.8XL Compute Node
0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15
![Page 45: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/45.jpg)
Data Preparation for Redshift COPY
Export Data from Source System • CSV Recommend (Simple Delimiter ',' or '|')
• Be aware of UTF-8 varchar columns (UTF-8 take 4 bytes per char) • Be aware of your NULL character (\N)
• GZIP Compress Files • Split Files (1MB – 1GB after gzip compression)
Useful COPY Options for PoC Data • MAXERRORS • ACCEPTINVCHARS • NULL AS
![Page 46: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/46.jpg)
Open source tools
https://github.com/awslabs/amazon-redshift-utils https://github.com/awslabs/amazon-redshift-monitoring https://github.com/awslabs/amazon-redshift-udfs Admin scripts
Collection of utilities for running diagnostics on your cluster Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc. ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established schema with data already loaded
Amazon Redshift Engineering’s Advanced Table Design Playbook https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
![Page 47: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/47.jpg)
New & Upcoming Features
![Page 48: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/48.jpg)
Recently Released Features
New Data Type – TIMESTAMPTZ Support for Timestamp with Time zone Multi-byte Object Names Support for Multi-byte (UTF-8) characters for tables, columns, and other database object names
User Connection Limits You can now set a limit on the number of database connections a user is permitted to have open concurrently
Automatic Data Compression for CTAS All newly created tables will leverage default encoding
New Column Encoding - ZSTD Approximate Percentile Functions
![Page 49: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/49.jpg)
Recently Released Features
Performance Enhancements • Vacuum (10x faster for deletes) • Snapshot Restore (2x faster) • Queries (Up to 5x faster)
Copy Can Extend Sorted Region on Single Sort Key • No need to vacuum when loading in sorted order
Enhanced VPC Routing • Restrict S3 Bucket Access
Schema Conversion Tool - One-Time Data Exports • Oracle • Teradata
Schema Conversion Tool • Vertica • SQL Server
![Page 50: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/50.jpg)
Monitor and control cluster resources consumed by a query
Get notified, abort and reprioritize long-running / bad queries
Pre-defined templates for common use cases
Coming Soon: Query Monitoring Rules
![Page 51: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/51.jpg)
BI tools SQL clients Analytics tools
Client AWS
Redshift
ADFS
Corporate Active Directory IAM
Amazon Redshift ODBC/JDBC
User groups Individual user
Single Sign-On
Identity providers
New Redshift ODBC/JDBC drivers. Grab the ticket (userid) and get a SAML assertion.
Coming Soon: IAM Authentication
![Page 52: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/52.jpg)
Automatic and Incremental Background VACUUM
• Reclaims space and sorts when Redshift clusters are idle • Vacuum is initiated when performance can be enhanced • Improves ETL and query performance
Coming Soon: Lots More …
![Page 53: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/53.jpg)
Amazon Redshift Spectrum Run SQL queries directly against data in S3
High Concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Redshift SQL Support
S3
SQL
Check out the coming session on Amazon Redshift Spectrum
![Page 54: SRV405 Deep Dive on Amazon Redshift](https://reader031.fdocuments.us/reader031/viewer/2022021421/58f9b3e91a28ab46608b458f/html5/thumbnails/54.jpg)
Thank you! Tony Gibbs