Amazon RedShift - Ianni Vamvadelis

Amazon Redshift Intro, Details Ianni Vamvadelis Solutions Architect

description

In this talk, Ian will talk about Amazon Redshift, a managed petabyte-scale data warehouse, give an overview of its integration with Amazon Elastic MapReduce, a managed Hadoop environment, and cover some exciting new developments in the analytics space.

Transcript of Amazon RedShift - Ianni Vamvadelis

Page 1: Amazon RedShift - Ianni Vamvadelis

Amazon Redshift Intro, Details

Ianni Vamvadelis Solutions Architect

Page 2: Amazon RedShift - Ianni Vamvadelis

AWS Database Services

Scalable High Performance Application Storage in the Cloud

•  Amazon DynamoDB: Fast, Predictable, Highly-Scalable NoSQL Data Store
•  Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
•  Amazon ElastiCache: In-Memory Caching Service
•  Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service

[Diagram: AWS platform layers labeled AWS Global Infrastructure, Compute, Storage, Database, Networking, Application Services, Deployment & Administration]

Page 4: Amazon RedShift - Ianni Vamvadelis

Design Objectives

Amazon Redshift: a petabyte-scale data warehouse service that was…

•  A Whole Lot Simpler
•  A Lot Cheaper
•  A Lot Faster

Page 5: Amazon RedShift - Ianni Vamvadelis

Redshift Dramatically Reduces I/O

•  Direct-attached storage
•  Large data block sizes
•  Columnar storage
•  Data compression
•  Zone maps

Example table, shown on the slide in both row storage and column storage:

Id    Age   State
123   20    CA
345   25    WA
678   40    FL
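The row-versus-column contrast can be sketched in a few lines of Python. This is an illustration only, not how Redshift lays out blocks on disk: the function names are made up for the example, and "values read" stands in for I/O. It uses the three rows from the slide and counts how many values an AVG(age) query has to touch under each layout.

```python
# Illustrative only: the three rows from the slide, laid out two ways.
rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]
COLUMNS = ("id", "age", "state")


def to_row_storage(rows):
    """Row storage: all values of one row sit next to each other."""
    return [value for row in rows for value in row]


def to_column_storage(rows):
    """Column storage: all values of one column sit next to each other."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def values_read_for_avg_age(rows):
    """Count the values touched to compute AVG(age) under each layout."""
    row_layout = to_row_storage(rows)
    col_layout = to_column_storage(rows)
    # A row store walks past every value of every row to reach each age.
    row_values_read = len(row_layout)
    # A column store reads only the 'age' column.
    col_values_read = len(col_layout["age"])
    return row_values_read, col_values_read


row_read, col_read = values_read_for_avg_age(rows)
print(f"row storage reads {row_read} values, column storage reads {col_read}")
```

With more columns in the table the gap widens, since the column layout still reads only the one column the query needs.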

Page 6: Amazon RedShift - Ianni Vamvadelis

Redshift Runs on Optimized Hardware

•  Optimized for I/O intensive workloads
•  HS1.8XL available on Amazon EC2
•  Runs in HPC - fast network
•  High disk density

HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate

[Diagram: a cluster of nodes, each labeled 16GB RAM, 2TB disk, 2 cores. Caption: Click to grow …to 1.6PB]

Page 7: Amazon RedShift - Ianni Vamvadelis

Redshift Parallelizes and Distributes Everything

•  Load
•  Query
•  Resize
•  Backup
•  Restore

[Diagram: SQL Clients/BI Tools connect over JDBC/ODBC to a Leader Node, which fronts three Compute Nodes over 10 GigE (HPC); each node is labeled 128GB RAM, 16TB disk, 16 cores; Ingestion, Backup and Restore flow between the cluster and Amazon S3]

Page 8: Amazon RedShift - Ianni Vamvadelis

Point and Click Resize

Page 9: Amazon RedShift - Ianni Vamvadelis

Resize your cluster while remaining online

•  New target provisioned in the background
•  Only charged for source cluster

[Diagram: SQL Clients/BI Tools with a source cluster (one Leader Node, three Compute Nodes) and a target cluster (one Leader Node, four Compute Nodes); each node is labeled 128GB RAM, 48TB disk, 16 cores]

Page 10: Amazon RedShift - Ianni Vamvadelis

Resize your cluster while remaining online

•  Fully automated
   – Data automatically redistributed
•  Read only mode during resize
•  Parallel node-to-node data copy
•  Automatic DNS-based endpoint cut-over
•  Only charged for one cluster

[Diagram: SQL Clients/BI Tools with a cluster of one Leader Node and four Compute Nodes; each node is labeled 128GB RAM, 48TB disk, 16 cores]

Page 11: Amazon RedShift - Ianni Vamvadelis

Amazon Redshift has security built-in

•  SSL to secure data in transit
•  Encryption to secure data at rest
   – AES-256
   – All blocks on disks and in Amazon S3 encrypted
•  No direct access to compute nodes
•  Amazon VPC support

[Diagram: the cluster architecture (SQL Clients/BI Tools, JDBC/ODBC, Leader Node, three Compute Nodes on 10 GigE (HPC), Amazon S3 for Ingestion, Backup and Restore) with Customer VPC and Internal VPC boundaries marked; each node is labeled 128GB RAM, 16TB disk, 16 cores]

Page 12: Amazon RedShift - Ianni Vamvadelis

Continuous Backup, Automated Recovery

•  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
•  Backups to Amazon S3 are continuous, automatic, and incremental
•  Continuous monitoring and automated recovery from failures of drives and nodes
•  Able to restore snapshots to any Availability Zone within a region

Page 13: Amazon RedShift - Ianni Vamvadelis

[Chart: data volume, with curves for "data generated" and "data available for analysis"; the space between them is labeled "Gap" (cost + effort)]

Sources:
•  Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
•  IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 14: Amazon RedShift - Ianni Vamvadelis

Redshift is Priced to Analyze All Your Data

•  $0.85 per hour for on-demand (2TB)
•  $999 per TB per year (3-yr reservation)
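To put the two price points on the same footing, here is a small worked calculation. It assumes a node that runs around the clock for a full 365-day year and uses only the figures quoted on this slide, which may no longer match current pricing. The function names are made up for the example.

```python
HOURS_PER_YEAR = 24 * 365          # 8760, ignoring leap years

ON_DEMAND_PER_HOUR = 0.85          # USD per hour for one 2TB node (from the slide)
NODE_STORAGE_TB = 2
RESERVED_PER_TB_YEAR = 999         # USD per TB per year, 3-yr reservation (from the slide)


def on_demand_per_tb_year(hourly=ON_DEMAND_PER_HOUR, tb=NODE_STORAGE_TB):
    """Annual on-demand cost per TB, assuming the node never stops."""
    return hourly * HOURS_PER_YEAR / tb


def reserved_per_node_hour(per_tb_year=RESERVED_PER_TB_YEAR, tb=NODE_STORAGE_TB):
    """Effective hourly cost of one reserved 2TB node."""
    return per_tb_year * tb / HOURS_PER_YEAR


print(round(on_demand_per_tb_year()))       # 3723 USD per TB per year on demand
print(round(reserved_per_node_hour(), 3))   # 0.228 USD per node-hour reserved
```

Under these assumptions, on-demand works out to roughly $3,723 per TB per year, against $999 with the 3-year reservation.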

Page 15: Amazon RedShift - Ianni Vamvadelis

Integrates With Existing BI Tools

[Diagram: BI tools connecting to Amazon Redshift over JDBC/ODBC]

Page 16: Amazon RedShift - Ianni Vamvadelis

Scenarios

Page 17: Amazon RedShift - Ianni Vamvadelis

Reporting Warehouse

•  Accelerated operational reporting
•  Support for short-time use cases
•  Data compression, index redundancy

[Diagram: RDBMS (OLTP, ERP) → Redshift (Reporting and BI)]

Page 18: Amazon RedShift - Ianni Vamvadelis

On-Premises Integration

[Diagram: RDBMS (OLTP, ERP) → Data Integration Partners* → Redshift (Reporting and BI)]

Page 19: Amazon RedShift - Ianni Vamvadelis

Live Archive for (Structured) Big Data

•  Direct integration with copy command
•  High velocity data
•  Data ages into Redshift
•  Low cost, high scale option for new apps

[Diagram: DynamoDB (OLTP, Web Apps) → Redshift (Reporting and BI)]

Page 20: Amazon RedShift - Ianni Vamvadelis

Cloud ETL for Big Data

•  Maintain online SQL access to historical logs
•  Transformation and enrichment with EMR
•  Longer history ensures better insight

[Diagram: S3 and Elastic MapReduce feeding Redshift (Reporting and BI)]

Page 21: Amazon RedShift - Ianni Vamvadelis

Ingestion – Best Practices

•  Goal: Leverage all the compute nodes and minimize overhead
•  Best Practices
   – Preferred method: COPY from S3
      – Loads data in sorted order through the compute nodes
      – Single COPY command, split data into multiple files
      – Strongly recommend that you gzip large datasets
   – If you must ingest through SQL
      – Multi-row inserts
      – Avoid large number of singleton insert/update/delete operations
   – To copy from another table
      – CREATE TABLE AS or INSERT INTO SELECT

Multi-row insert:

    insert into category_stage values
    (default, default, default, default),
    (20, default, 'Country', default),
    (21, 'Concerts', 'Rock', default);

COPY from S3:

    copy time from 's3://mybucket/data/timerows.gz'
    credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
    gzip delimiter '|';
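The "split data into multiple files" and "gzip large datasets" advice can be prepared on the client side before upload. Below is a minimal sketch using only the Python standard library. The function name, the `timerows.partNN.gz` naming scheme and the sample rows are assumptions for illustration, and uploading the parts to S3 is left out.

```python
import gzip
import os
import tempfile


def split_into_gzip_parts(lines, num_parts, out_dir, prefix="timerows"):
    """Write lines round-robin into num_parts gzip files; return the file paths."""
    paths = [
        os.path.join(out_dir, f"{prefix}.part{i:02d}.gz") for i in range(num_parts)
    ]
    handles = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        for i, line in enumerate(lines):
            handles[i % num_parts].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


# Hypothetical pipe-delimited rows, matching the delimiter '|' in the COPY example.
sample_rows = [f"{i}|2013-01-{(i % 28) + 1:02d}" for i in range(10)]

with tempfile.TemporaryDirectory() as out_dir:
    for path in split_into_gzip_parts(sample_rows, num_parts=4, out_dir=out_dir):
        print(os.path.basename(path))
```

Once the parts sit under a common S3 key prefix, a single COPY that names that prefix can load them in parallel. A part count that is a multiple of the number of slices in the cluster is a reasonable starting point.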

Page 22: Amazon RedShift - Ianni Vamvadelis

Choose a Sort Key

•  Goal
   – Skip over data blocks to minimize I/O
•  Best Practice
   – Sort based on range or equality predicate (WHERE clause)
   – If you access recent data frequently, sort based on TIMESTAMP
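Why sorting lets the engine skip blocks can be shown with a toy zone map: a (min, max) pair kept per block, checked against the predicate range before the block is read. This is a simplified model, not Redshift's actual block format. The block size, day numbers and function names are made up for the example.

```python
def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record (min, max) per block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, low, high):
    """Count blocks whose [min, max] range overlaps the predicate range [low, high]."""
    return sum(
        1 for block_min, block_max in zone_map
        if block_max >= low and block_min <= high
    )


# Hypothetical day numbers 1..100, stored in arrival order versus sorted order.
unsorted_days = [(i * 37) % 100 + 1 for i in range(100)]  # a fixed permutation of 1..100
sorted_days = sorted(unsorted_days)

BLOCK_SIZE = 10
unsorted_map = build_zone_map(unsorted_days, BLOCK_SIZE)
sorted_map = build_zone_map(sorted_days, BLOCK_SIZE)

# Predicate: WHERE day BETWEEN 91 AND 100 (the most recent ten days).
print("blocks scanned, unsorted:", blocks_to_scan(unsorted_map, 91, 100))
print("blocks scanned, sorted:  ", blocks_to_scan(sorted_map, 91, 100))
```

On sorted data the recent range falls inside a single block. On unsorted data recent values are scattered, so more blocks overlap the range and have to be read.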

Page 23: Amazon RedShift - Ianni Vamvadelis

Choose a Distribution Key

•  Goal
   – Distribute data evenly across nodes
   – Minimize data movement among nodes: Co-located Joins and Co-located Aggregates
•  Best Practice
   – Consider using Join key as distribution key (JOIN clause)
   – If multiple joins, use the foreign key of the largest dimension as distribution key
   – Consider using Group By column as distribution key (GROUP BY clause)
•  Avoid
   – Keys used as equality filter as your distribution key
•  If de-normalized tables and no aggregates, do not specify a distribution key - Redshift will use round robin
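The effect of a distribution key can be sketched with a toy hash distribution. Everything here is an assumption for illustration: Redshift's real hash function is not CRC32, the table shapes are invented, and the function names are hypothetical. The sketch shows that two tables distributed on the same join key end up co-located, while distributing on a low-cardinality filter column leaves most nodes idle.

```python
import zlib


def node_for(key, num_nodes):
    """Map a distribution-key value to a node with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_index, num_nodes):
    """Return one list of rows per node, placed by the hash of the key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes


def join_is_colocated(left_nodes, left_key, right_nodes, right_key):
    """True if every left row already sits on the node holding its matching right row."""
    right_home = {
        row[right_key]: n for n, rows in enumerate(right_nodes) for row in rows
    }
    for n, rows in enumerate(left_nodes):
        for row in rows:
            home = right_home.get(row[left_key])
            if home is not None and home != n:
                return False
    return True


# Hypothetical rows: SALES(product_id, franchise_id, state), CATEGORY(product_id, name).
sales = [(p, f, "WA" if f % 2 else "CA") for p in range(1, 9) for f in range(1, 4)]
category = [(p, "Produce" if p % 2 else "Dairy") for p in range(1, 9)]

NUM_NODES = 4
sales_by_product = distribute(sales, 0, NUM_NODES)
category_by_product = distribute(category, 0, NUM_NODES)
print("join on product_id co-located:",
      join_is_colocated(sales_by_product, 0, category_by_product, 0))

# Distributing on a filter column with two distinct values uses at most two nodes.
sales_by_state = distribute(sales, 2, NUM_NODES)
print("rows per node when keyed on state:", [len(rows) for rows in sales_by_state])
```

The second printout is the reason behind the "Avoid" bullet: a query filtering on `state = 'WA'` would land on a single node, while the others sit idle.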

Page 24: Amazon RedShift - Ianni Vamvadelis

Example

    -- Total Produce sold in Washington in January 2013
    SELECT SUM(S.Price * S.Quantity)
    FROM SALES S
    JOIN CATEGORY C ON C.ProductId = S.ProductId
    JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
    WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
    AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

•  Dist key (S) = ProductID
•  Dist key (C) = ProductID
•  Dist key (F) = FranchiseID
•  Sort key (S) = Date

Page 25: Amazon RedShift - Ianni Vamvadelis

Workload Manager

•  Allows you to manage and adjust query concurrency
•  WLM allows you to
   – Increase query concurrency up to 15
   – Define user groups and query groups
   – Segregate short and long running queries
   – Help improve performance of individual queries
•  Be aware: query workload is distributed to every compute node
   – Increasing concurrency may not always help due to resource contention
      – CPU, Memory and I/O
   – Total throughput may increase by letting one query complete first and allowing other queries to wait
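The point about letting one query finish first can be illustrated with a toy model. This is not a model of Redshift's scheduler. It assumes each query needs a fixed amount of work, that concurrent queries share the cluster equally, and that an invented `contention_penalty` factor slows everything down as more queries run at once.

```python
def completion_times_serial(work_units):
    """Run queries one at a time; each gets the whole cluster while it runs."""
    finished, clock = [], 0.0
    for work in work_units:
        clock += work
        finished.append(clock)
    return finished


def completion_times_shared(work_units, contention_penalty=0.0):
    """Run all queries at once with equal shares.

    contention_penalty is a made-up knob: each extra active query slows
    everything down by that fraction (0.0 means perfect sharing).
    """
    remaining = sorted(work_units)
    finished, clock, done = [], 0.0, 0.0
    total = len(remaining)
    for i, work in enumerate(remaining):
        active = total - i
        slowdown = active * (1.0 + contention_penalty * (active - 1))
        clock += (work - done) * slowdown
        done = work
        finished.append(clock)
    return finished


queries = [10, 10, 10]  # three queries, 10 time units of work each
print("serial:             ", completion_times_serial(queries))
print("shared, no penalty: ", completion_times_shared(queries))
print("shared, 10% penalty:", completion_times_shared(queries, contention_penalty=0.1))
```

With perfect sharing, every query finishes at time 30, whereas serial execution finishes them at 10, 20 and 30. With any contention penalty the shared run also finishes later overall, which is the throughput effect the slide warns about.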

Page 26: Amazon RedShift - Ianni Vamvadelis

Workload Manager

•  Default: 1 queue with a concurrency of 5
•  Define up to 8 queues with a total concurrency of 15
•  Redshift has a super user queue internally
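A planned queue layout can be checked against the limits quoted on this slide before it is applied. The sketch below is a hypothetical helper: the dictionary of queue name to concurrency is not the real WLM configuration format, and the limits are the ones stated here, which may differ in current Redshift releases.

```python
MAX_QUEUES = 8               # from the slide
MAX_TOTAL_CONCURRENCY = 15   # from the slide

DEFAULT_LAYOUT = {"default": 5}  # 1 queue with a concurrency of 5


def validate_wlm_layout(queues):
    """Return a list of problems for a {queue_name: concurrency} plan (empty if OK)."""
    problems = []
    if len(queues) > MAX_QUEUES:
        problems.append(f"{len(queues)} queues defined, limit is {MAX_QUEUES}")
    for name, concurrency in queues.items():
        if concurrency < 1:
            problems.append(f"queue '{name}' needs a concurrency of at least 1")
    total = sum(queues.values())
    if total > MAX_TOTAL_CONCURRENCY:
        problems.append(
            f"total concurrency {total} exceeds limit of {MAX_TOTAL_CONCURRENCY}"
        )
    return problems


# A plan that separates short and long running queries, as suggested earlier.
plan = {"short_queries": 10, "long_queries": 3, "etl": 2}
print(validate_wlm_layout(plan) or "plan fits the limits")
```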

Page 27: Amazon RedShift - Ianni Vamvadelis

Query Performance – Best Practices

•  Encode date and time using "TIMESTAMP" data type instead of "CHAR"
•  Specify Constraints
   – Redshift does not enforce constraints (primary key, foreign key, unique values) but the optimizer uses them
   – Loading and/or applications need to be aware
•  Specify redundant predicate on the sort column

    SELECT * FROM tab1, tab2
    WHERE tab1.key = tab2.key
    AND tab1.timestamp > '1/1/2013'
    AND tab2.timestamp > '1/1/2013';

•  WLM settings

Page 28: Amazon RedShift - Ianni Vamvadelis

Summary

•  Avoid large number of singleton DML statements if possible
•  Use COPY for uploading large datasets
•  Choose Sort and Distribution keys with care
•  Encode date and time with TIMESTAMP data type
•  Experiment with WLM settings

Page 29: Amazon RedShift - Ianni Vamvadelis

More Information

Best Practices for Designing Tables: http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html

Best Practices for Data Loading: http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html

View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/

Page 30: Amazon RedShift - Ianni Vamvadelis

Thanks.

aws.amazon.com/big-data