AWS Redshift Introduction - Big Data Analytics

Redshift Introduction. Keeyong Han, [email protected]

Description

An introduction to Redshift based on my experience: it is really scalable, easy to use, and a relatively inexpensive solution.

Transcript of AWS Redshift Introduction - Big Data Analytics

Page 1

Redshift Introduction

Keeyong Han, [email protected]

Page 2

Table of Contents

1. What is Redshift?
2. Redshift in Action
   1. How to Upload?
   2. How to Query?
3. Recommendation
4. Q&A

Page 3

WHAT IS REDSHIFT?

Page 4

Brief Introduction (1)

• A scalable SQL engine in AWS
  – Available in all regions except N. California and São Paulo as of Sep 2014
  – Up to 1.6 PB of data in a cluster of servers
  – Fast, but big joins still take minutes
  – Columnar storage
    • Adding or deleting a column is very fast!!
    • Supports per-column compression (see the sketch after this list)
  – Supports bulk updates
    • Upload a gzipped TSV/CSV file to S3 and then run the bulk load command (called "copy")
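
For example, per-column compression is declared with ENCODE in the table definition. A minimal sketch, assuming a hypothetical pageview table; the column names and encodings are illustrative, not taken from the slides:

create table pageview (
    browser_id  decimal(20,0) encode delta,     -- delta encoding for a slowly varying numeric id
    url         varchar(2048) encode lzo,       -- LZO works well for long, repetitive strings
    referrer    varchar(2048) encode lzo,
    useragent   varchar(512)  encode text255,   -- dictionary encoding for a low-cardinality string
    viewed_on   timestamp     encode delta32k
);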

Page 5

Brief Introduction (2)

• Compatible with PostgreSQL 8.x
  – But not all PostgreSQL features are supported
  – Accessible through an ODBC/JDBC interface
    • You can use any tool/library that supports ODBC/JDBC
  – Table schema still matters!
    • It is still SQL

Page 6

Brief Introduction (3)

• Dense Compute vs. Dense Storage

  Node type      vCPU   ECU   Memory (GiB)   Storage        Price
  DW1 – Dense Storage
  dw1.xlarge      2     4.4      15           2 TB HDD      $0.85/hour
  dw1.8xlarge    16    35       120          16 TB HDD      $6.80/hour
  DW2 – Dense Compute
  dw2.xlarge      2     7        15        0.16 TB SSD      $0.25/hour
  dw2.8xlarge    32   104       244        2.56 TB SSD      $4.80/hour

Page 7

Brief Introduction (4)

• Cost Analysis
  – If you need an 8 TB Redshift cluster, you will need 4 dw1.xlarge instances
    • That is about $2,448 per 30 days and about $30K per year (the arithmetic is spelled out below)
  – At a minimum you will also store the input records for Redshift in S3, so there will be S3 cost as well
    • 1 TB with "reduced redundancy" storage would cost $24.5 per month
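
As a quick check of those numbers, using the dw1.xlarge price from the table on the previous page:

  4 instances x $0.85/hour x 24 hours x 30 days = $2,448 per 30 days
  $2,448 x 12 months = $29,376 per year, i.e. roughly $30K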

Page 8

Brief Introduction (5)

• Tightly coupled with other AWS services
  – S3, EMR (Elastic MapReduce), Kinesis, DynamoDB, RDS, and so on
  – Backup and snapshot to S3
• No automatic resizing
  – You have to resize manually, and it takes a while
    • Doubling from 2 nodes to 4 took 8 hours; the other way around took 18 hours or so (done in the summer of 2013, though)
  – But during resizing, read operations still work
• 30-minute maintenance window every week
  – You have to plan around this window

Page 9

Brief Summary

• Redshift is a large-scale SQL engine which can be used as a data warehouse/analytics solution
  – You don't stall your production database!
  – Smoother migration for anyone who knows SQL
  – It supports a SQL interface, but behind the scenes it behaves more like a distributed NoSQL engine than a traditional RDBMS
• Redshift isn't a realtime query engine
  – Semi-realtime data consumption might be doable, but querying can take a while

Page 10

Difference from MySQL (1)

• No guarantee of primary key uniqueness
  – There can be many duplicates if you are not careful
    • You had better delete before inserting (based on a date/time range); see the sketch after this list
  – The primary key is just a hint for the query optimizer
• You need to define a distkey and a sortkey per table
  – distkey determines which node a record is stored on
  – sortkey determines the order in which records are stored on a node

create table session_attribute (
    browser_id  decimal(20,0) not null distkey sortkey,
    session_id  int,
    name        varchar(48),
    value       varchar(48),
    primary key(browser_id, session_id, name)
);
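
A minimal sketch of the "delete before inserting" pattern, reusing the illustrative pageview table from the earlier compression example (the dates are placeholders):

-- drop any rows from the range that is about to be reloaded
delete from pageview
 where viewed_on >= '2014-09-01' and viewed_on < '2014-09-02';
-- then bulk load that range again with a "copy" command (see "How to Upload?")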

Page 11

Difference from MySQL (2)

• char/varchar lengths are in bytes, not in characters
• "\r\n" is counted as two characters
• No text type; the maximum number of bytes in char/varchar is 65535
• Addition/deletion of a column is very fast
• Some keywords are reserved (user, tag and so on)
• LIKE is case-sensitive (ILIKE is case-insensitive)
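
For example (the tables and columns here are illustrative, not from the slides):

-- case-sensitive: matches 'Chrome' but not 'chrome'
select count(*) from session where useragent like '%Chrome%';
-- case-insensitive: matches both
select count(*) from session where useragent ilike '%chrome%';
-- reserved words such as user must be double-quoted when used as identifiers
select "user" from account_table;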

Page 12

Supported Data Type in RedShift

• SMALLINT (INT2)
• INTEGER (INT, INT4)
• BIGINT (INT8)
• DECIMAL (NUMERIC)
• REAL (FLOAT4)
• DOUBLE PRECISION (FLOAT8)
• BOOLEAN (BOOL)
• CHAR (CHARACTER)
• VARCHAR (CHARACTER VARYING)
• DATE
• TIMESTAMP

Page 13

REDSHIFT IN ACTION

Page 14

What can be stored?

• Log files
  – Web access logs
  – But you need to define a schema; it is better to also add session-level tables
• Relational database tables
  – MySQL tables
  – Almost a one-to-one mapping
• Any structured data
  – Any data you can represent as CSV

Page 15

A bit more about Session Table

• Hadoop can be used to aggregate pageviews into sessions (on top of the pageview data):
  – Group pageviews by session key
  – Order the pageviews within each session by timestamp
• This aggregated information becomes the session table
• Example columns of a session table (a table sketch follows this list):
  – Session ID, Browser ID, User ID, IP, UserAgent, Referrer info, Start time, Duration, …
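
A possible definition of such a session table. This is a sketch only; the column names, types, and key choices are illustrative assumptions, not taken from the slides:

create table session (
    session_id  decimal(20,0) not null distkey sortkey,
    browser_id  decimal(20,0),
    user_id     bigint,
    ip          varchar(45),       -- long enough for the IPv6 text form
    useragent   varchar(512),
    referrer    varchar(2048),
    start_time  timestamp,
    duration    int,               -- e.g. seconds
    primary key(session_id)
);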

Page 16

How to Upload?

• Define the schema of your data
• Create a table (again, it is a SQL engine)
• Generate TSV or CSV file(s) from your source data
• Compress the file(s)
• Upload the file(s) to S3
  – This S3 bucket should be in the same region as the Redshift cluster (but it is no longer a must)
• Run a bulk insert (called "copy"); a fuller example follows this list
  – copy session_attribute [fields] from 's3://your_bucket/…' options
  – Options include AWS keys, whether the files are gzipped, the delimiter used, the maximum number of errors to tolerate, and so on
• Regular insert/update SQL statements can also be used
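
For example, a copy command for the session_attribute table shown earlier might look like this (the S3 path and credentials are placeholders, and the options shown are only the common ones):

copy session_attribute (browser_id, session_id, name, value)
from 's3://your_bucket/session_attribute/2014-09-01/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '\t'    -- tab-separated input
gzip              -- the input files are gzipped
maxerror 10;      -- tolerate up to 10 bad rows before aborting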

Page 17

Update Workflow

[Diagram] A cron job on the data source server periodically uploads input files to S3; a bulk insert ("copy") then loads them from S3 into Redshift.

You can introduce a queue to which the S3 locations of all incoming input files are pushed. A consumer of this queue reads from it and bulk inserts into Redshift.

You might have to do ETL on your source data first, using Hadoop and so on.

Page 18

Incremental Update from MySQL

• Change your table schema if possible
  – You need an updatedon field in your table (an example extraction query follows this list)
  – Never delete a record; mark it as inactive instead
• Monitor your table changes and propagate them to Redshift
  – For example, use Databus from LinkedIn
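
With an updatedon column, each sync can pull only the rows that changed since the previous run. A minimal sketch on the MySQL side (the table name and timestamp are illustrative):

-- export only the rows changed since the last successful sync;
-- the result is then written to a file, uploaded to S3, and copied into Redshift
select *
  from orders
 where updatedon > '2014-09-01 00:00:00';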

Page 19

HOW TO ACCESS REDSHIFT

Page 20

Different Ways to Access (1)

1. JDBC/ODBC desktop tools, such as
   – SQLWorkBench, Navicat, and so on
   – Requires IP registration for outside access
2. A JDBC/ODBC library
   – Anything PostgreSQL 8.0.x-compatible should work

In both cases you use SQL statements (an example connection string is shown below).
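
For instance, a PostgreSQL 8.x JDBC driver can connect with a URL of roughly this shape (the endpoint, database, and credentials are placeholders; Redshift clusters listen on port 5439 by default):

jdbc:postgresql://your-cluster.xxxxxxxxxx.us-east-1.redshift.amazonaws.com:5439/yourdb?user=your_user&password=your_password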

Page 21

Different Ways to Access (2)

3. Use an analytics tool such as Tableau or Birst
   – But these have too many features
   – You will likely need a dedicated analyst

Page 22

RECOMMENDATION

Page 23

Things to Consider

• How big are your tables?
• Would dumping your tables cause issues?
  – The site's stability and so on
  – Or do you have a backup instance to dump from?
• Are your tables friendly to incremental updates?
  – An "updatedon" field
  – No deletion of records

Page 24

Steps

• Start with a daily update
  – A daily full refresh is fine to begin with, to set up the end-to-end cycle (a sketch follows this list)
  – If the tables are big, dumping them can take a while
• Implement an incremental update mechanism
  – This will require either a table schema change or the use of some database change-tracking mechanism
• Go for a shorter update interval
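
A daily full refresh can be as simple as reloading the table inside a transaction, so readers keep seeing the old data until the commit. A sketch only; the table name and S3 path are illustrative, and the copy options are as in the earlier example:

begin;
delete from session_attribute;     -- remove the previous load
copy session_attribute from 's3://your_bucket/session_attribute/latest/'
    credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    delimiter '\t' gzip;
commit;
-- (truncate would be faster than delete, but in Redshift it commits immediately)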