Amazon Redshift - Ianni Vamvadelis
Uploaded by: huguk
Transcript of Amazon Redshift - Ianni Vamvadelis
Amazon Redshift: Intro and Details
Ianni Vamvadelis, Solutions Architect
AWS Database Services
Scalable, High-Performance Application Storage in the Cloud
• Amazon DynamoDB - Fast, Predictable, Highly Scalable NoSQL Data Store
• Amazon RDS - Managed Relational Database Service for MySQL, Oracle and SQL Server
• Amazon ElastiCache - In-Memory Caching Service
• Amazon Redshift - Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service

[Diagram: AWS platform stack - Application Services, Deployment & Administration, Database, Compute, Storage, Networking, on the AWS Global Infrastructure]
Design Objectives
Amazon Redshift: a petabyte-scale data warehouse service that was…
• A Whole Lot Simpler
• A Lot Cheaper
• A Lot Faster
Redshift Dramatically Reduces I/O
• Direct-attached storage
• Large data block sizes
• Columnar storage
• Data compression
• Zone maps

Row storage vs. column storage:
Id   Age  State
123  20   CA
345  25   WA
678  40   FL
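The columnar layout also means each column can carry its own compression encoding. A minimal DDL sketch of how that might look (the table and encoding choices are illustrative, not from the slides):

```sql
-- Hypothetical table matching the Id/Age/State example above.
-- Each column gets an encoding suited to its data:
CREATE TABLE users (
  id    INTEGER ENCODE delta,     -- mostly-increasing ids: store small deltas
  age   INTEGER ENCODE mostly8,   -- values usually fit in one byte
  state CHAR(2) ENCODE bytedict   -- few distinct values: one-byte dictionary
);
```

Because a scan only reads the columns it needs, and each column compresses well against its own value distribution, disk I/O per query drops sharply compared with row storage.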
Redshift Runs on Optimized Hardware
• Optimized for I/O-intensive workloads
• HS1.8XL available on Amazon EC2
• Runs in HPC - fast network
• High disk density

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
[Diagram: cluster scales out with HS1.XL nodes (16GB RAM, 2TB disk, 2 cores each)]
Click to grow… to 1.6PB
Redshift Parallelizes and Distributes Everything
• Load, Query, Resize, Backup, Restore

[Architecture diagram: SQL Clients/BI Tools connect via JDBC/ODBC to a Leader Node; Compute Nodes (128GB RAM, 16TB disk, 16 cores each) on a 10 GigE (HPC) network; ingestion, backup, and restore against Amazon S3]
Point and Click Resize
[Diagram: SQL Clients/BI Tools connected to a Leader Node and Compute Nodes (128GB RAM, 48TB disk, 16 cores each)]
Resize your cluster while remaining online
• New target provisioned in the background
• Only charged for the source cluster
[Diagram: target cluster being provisioned alongside the source cluster]
Resize your cluster while remaining online
• Fully automated - data automatically redistributed
• Read-only mode during resize
• Parallel node-to-node data copy
• Automatic DNS-based endpoint cut-over
• Only charged for one cluster
[Diagram: SQL Clients/BI Tools cut over to the resized cluster]
Amazon Redshift has security built in
• SSL to secure data in transit
• Encryption to secure data at rest
  - AES-256
  - All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
[Architecture diagram: SQL Clients/BI Tools connect via JDBC/ODBC from the Customer VPC to the Leader Node; Compute Nodes run in an internal VPC on a 10 GigE (HPC) network; ingestion, backup, and restore against Amazon S3]
Continuous Backup, Automated Recovery
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
[Chart: data volume over time - a growing gap between data generated and data available for analysis, closed only at a cost + effort]
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
Redshift is Priced to Analyze All Your Data
• $0.85 per hour on-demand (2TB node)
• $999 per TB per year (3-year reservation)
Integrates With Existing BI Tools
[Diagram: BI tools connect to Amazon Redshift via JDBC/ODBC]
Scenarios
Reporting Warehouse
• Accelerated operational reporting
• Support for short-time use cases
• Data compression, index redundancy
[Diagram: OLTP/ERP → RDBMS → Redshift → Reporting and BI, via Data Integration Partners*]
On-Premises Integration
[Diagram: OLTP/ERP → RDBMS → Redshift → Reporting and BI]
Live Archive for (Structured) Big Data
• Direct integration with the COPY command
• High-velocity data
• Data ages into Redshift
• Low-cost, high-scale option for new apps
[Diagram: OLTP/Web Apps → DynamoDB → Redshift → Reporting and BI]
Cloud ETL for Big Data
• Maintain online SQL access to historical logs
• Transformation and enrichment with EMR
• Longer history ensures better insight
[Diagram: S3 → Elastic MapReduce → Redshift → Reporting and BI]
Ingestion - Best Practices
§ Goal: leverage all the compute nodes and minimize overhead
§ Best Practices
  § Preferred method: COPY from S3
    § Loads data in sorted order through the compute nodes
    § Single COPY command; split data into multiple files
    § Strongly recommend that you gzip large datasets
  § If you must ingest through SQL
    § Multi-row inserts
    § Avoid large numbers of singleton insert/update/delete operations
  § To copy from another table
    § CREATE TABLE AS or INSERT INTO SELECT

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

copy time from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
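Because COPY treats the S3 object name as a key prefix, pre-splitting and gzipping the input lets every node slice load a file in parallel. A hedged sketch (the bucket, prefix, and table names are illustrative):

```sql
-- Hypothetical: venue data pre-split into venue.txt.1.gz … venue.txt.4.gz.
-- The prefix 's3://mybucket/data/venue.txt' matches all four parts, so a
-- single COPY command loads them in parallel across the compute nodes.
copy venue from 's3://mybucket/data/venue.txt'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
```

Aim for a number of files that is a multiple of the slices in the cluster so no slice sits idle during the load.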
Choose a Sort Key
§ Goal
  § Skip over data blocks to minimize I/O
§ Best Practice
  § Sort based on range or equality predicates (WHERE clause)
  § If you access recent data frequently, sort based on TIMESTAMP
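In DDL the sort key is declared at table creation. A minimal sketch assuming a sales table that is mostly filtered by recent date (table and column names are illustrative):

```sql
-- Hypothetical sales table: queries filter on recent sale_time, so sorting
-- on it lets zone maps skip entire blocks of older data during scans.
CREATE TABLE sales (
  sale_id    INTEGER,
  product_id INTEGER,
  quantity   INTEGER,
  price      DECIMAL(8,2),
  sale_time  TIMESTAMP
)
SORTKEY (sale_time);
```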
Choose a Distribution Key
§ Goal
  § Distribute data evenly across nodes
  § Minimize data movement among nodes: co-located joins and co-located aggregates
§ Best Practice
  § Consider using the join key as the distribution key (JOIN clause)
  § If there are multiple joins, use the foreign key of the largest dimension as the distribution key
  § Consider using the GROUP BY column as the distribution key (GROUP BY clause)
§ Avoid
  § Keys used as an equality filter as your distribution key
  § If tables are de-normalized and there are no aggregates, do not specify a distribution key - Redshift will use round robin
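In DDL terms, distributing both sides of the most frequent join on the same column co-locates matching rows on the same node. A sketch consistent with the SALES/CATEGORY example (column types are assumptions):

```sql
-- Hypothetical: SALES and CATEGORY are both distributed on ProductId, so
-- joining them on that column requires no cross-node data movement.
CREATE TABLE category (
  productid  INTEGER DISTKEY,
  categoryid VARCHAR(20)
);

CREATE TABLE sales (
  productid   INTEGER DISTKEY,
  franchiseid INTEGER,
  quantity    INTEGER,
  price       DECIMAL(8,2),
  saledate    DATE SORTKEY
);
```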
Example
-- Total Produce sold in Washington in January 2013
SELECT SUM(S.Price * S.Quantity)
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce'
  AND F.State = 'WA'
  AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
Workload Manager
§ Allows you to manage and adjust query concurrency
§ WLM allows you to
  § Increase query concurrency up to 15
  § Define user groups and query groups
  § Segregate short- and long-running queries
  § Help improve performance of individual queries
§ Be aware: query workload is distributed to every compute node
  § Increasing concurrency may not always help due to resource contention (CPU, memory, and I/O)
  § Total throughput may increase by letting one query complete first and allowing other queries to wait
§ Default: 1 queue with a concurrency of 5
§ Define up to 8 queues with a total concurrency of 15
§ Redshift has a super user queue internally
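Queues defined in WLM are targeted from a session by setting a query group. A minimal sketch (the group name 'short_queries' is an assumption about the WLM configuration, not from the slides):

```sql
-- Route this session's queries to the WLM queue whose configuration
-- lists the (hypothetical) query group 'short_queries'.
SET query_group TO 'short_queries';
SELECT COUNT(*) FROM sales;  -- runs in that queue's slots
RESET query_group;           -- subsequent queries use the default queue
```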
Query Performance - Best Practices
§ Encode date and time using the TIMESTAMP data type instead of CHAR
§ Specify constraints
  § Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
  § Loading and/or applications need to be aware
§ Specify a redundant predicate on the sort column:

SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';

§ WLM settings
Summary
§ Avoid large numbers of singleton DML statements if possible
§ Use COPY for uploading large datasets
§ Choose sort and distribution keys with care
§ Encode date and time with the TIMESTAMP data type
§ Experiment with WLM settings
More Information
Best Prac=ces for Designing Tables http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
Best Prac=ces for Data Loading http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/
Thanks.
aws.amazon.com/big-data