AWS Redshift Introduction - Big Data Analytics


An introduction to Redshift based on my experience. It is a scalable, easy-to-use, and relatively inexpensive solution.


Redshift Introduction

Keeyong Han (keeyonghan@hotmail.com)

Table of Contents

1. What is Redshift?
2. Redshift In Action
   1. How to Upload?
   2. How to Query?
3. Recommendation
4. Q&A

WHAT IS REDSHIFT?

Brief Introduction (1)

• A scalable SQL engine in AWS
  – Available in all regions except N. California and São Paulo as of Sep 2014
  – Up to 1.6 PB of data in a cluster of servers
  – Fast, but big joins still take minutes
  – Columnar storage
    • Adding or deleting a column is very fast!
    • Supports per-column compression
  – Supports bulk loading
    • Upload a gzipped TSV/CSV file to S3 and then run the bulk-load command (called "COPY")

Brief Introduction (2)

• Supports PostgreSQL 8.x
  – But not all PostgreSQL features
  – Accessible through an ODBC/JDBC interface
    • You can use any tool or library that supports ODBC/JDBC
  – Table schema still matters!
    • It is still SQL

Brief Introduction (3)

• Dense Compute vs. Dense Storage

  DW1 – Dense Storage
    Node type      vCPU  ECU   Memory   Storage      Price
    dw1.xlarge        2  4.4    15 GB    2 TB HDD    $0.85/hour
    dw1.8xlarge      16   35   120 GB   16 TB HDD    $6.80/hour

  DW2 – Dense Compute
    Node type      vCPU  ECU   Memory   Storage      Price
    dw2.xlarge        2    7    15 GB   0.16 TB SSD  $0.25/hour
    dw2.8xlarge      32  104   244 GB   2.56 TB SSD  $4.80/hour

Brief Introduction (4)

• Cost analysis
  – If you need an 8 TB Redshift cluster, you will need 4 dw1.xlarge instances
    • That is 4 × $0.85/hour × 720 hours = $2,448 per 30 days, or about $30K per year
  – You will need to keep the input records for Redshift in S3 at a minimum, so there will be S3 cost as well
    • 1 TB with "reduced redundancy" storage would cost about $24.50 per month

Brief Introduction (5)

• Tightly coupled with other AWS services
  – S3, EMR (Elastic MapReduce), Kinesis, DynamoDB, RDS, and so on
  – Backup and snapshot to S3
• No automatic resizing
  – You have to resize manually, and it takes a while
    • Doubling from 2 nodes to 4 took 8 hours; going back the other way took 18 hours or so (done in the summer of 2013, though)
  – But read operations still work during resizing
• 30 minutes of maintenance every week
  – You have to avoid this window

Brief Summary

• Redshift is a large-scale SQL engine that can be used as a data warehouse / analytics solution
  – You don't stall your production database!
  – Smoother migration for anyone who knows SQL
  – It exposes a SQL interface, but behind the scenes it is closer to a NoSQL engine
• Redshift isn't a realtime query engine
  – Semi-realtime data consumption might be doable, but querying can take a while

Difference from MySQL (1)

• No guarantee of primary key uniqueness
  – There can be many duplicates if you are not careful
    • You had better delete before inserting, based on a date/time range (see the sketch after this list)
  – The primary key is just a hint for the query optimizer
• Need to define a distkey and sortkey per table
  – distkey determines which node a record is stored on
  – sortkey determines the order in which records are stored on a node

create table session_attribute (
    browser_id decimal(20,0) not null distkey sortkey,
    session_id int,
    name varchar(48),
    value varchar(48),
    primary key(browser_id, session_id, name)
);
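For example, here is a minimal sketch of the delete-before-insert pattern, assuming a hypothetical pageview table with a created_at timestamp column and gzipped TSV files already uploaded to S3; the table, column, and S3 path are illustrative, not from the slides:

-- Hedged sketch: reload one day of data without creating duplicates.
begin;

-- Remove any rows already loaded for the target date range.
delete from pageview
where created_at >= '2014-09-01' and created_at < '2014-09-02';

-- Re-load the same range from S3 (credentials are placeholders).
copy pageview
from 's3://your_bucket/pageview/2014-09-01/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '\t' gzip;

commit;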

Difference from MySQL (2)

• char/varchar lengths are in bytes, not characters
• "\r\n" is counted as two characters
• No TEXT type; the maximum length of a varchar is 65535 bytes
• Adding or deleting a column is very fast
• Some keywords are reserved (user, tag, and so on)
• LIKE is case-sensitive (ILIKE is case-insensitive)
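To illustrate two of these points, here is a small hedged example; the users table and its name column are assumptions, not from the slides:

-- LIKE is case-sensitive, ILIKE is not.
select count(*) from users where name like 'kim%';    -- matches 'kim...' only
select count(*) from users where name ilike 'kim%';   -- also matches 'Kim...', 'KIM...'

-- varchar length is in bytes, so multi-byte UTF-8 text needs extra headroom.
create table notes (
    body varchar(65535)    -- 65535 bytes is the maximum
);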

Supported Data Type in RedShift

• SMALLINT (INT2)
• INTEGER (INT, INT4)
• BIGINT (INT8)
• DECIMAL (NUMERIC)
• REAL (FLOAT4)
• DOUBLE PRECISION (FLOAT8)
• BOOLEAN (BOOL)
• CHAR (CHARACTER)
• VARCHAR (CHARACTER VARYING)
• DATE
• TIMESTAMP

REDSHIFT IN ACTION

What can be stored?

• Log files
  – Web access logs
  – But you need to define a schema; better to also add session-level tables
• Relational database tables
  – MySQL tables
  – Almost a one-to-one mapping
• Any structured data
  – Any data you can represent as CSV

A bit more about Session Table

• Hadoop can be used to aggregate pageviews into sessions (on top of the pageview data):
  – Group by session key
  – Order the pageviews within each session by timestamp
• This aggregated info becomes the session table
• Example session table columns (a schema sketch follows below):
  – Session ID, Browser ID, User ID, IP, UserAgent, Referrer info, Start time, Duration, …
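As a hedged sketch, such a session table might be declared as follows; all column names, types, and key choices here are assumptions based on the list above, not taken from the slides:

-- Hypothetical session table; distkey/sortkey choices are illustrative only.
create table session (
    session_id  decimal(20,0) not null sortkey,
    browser_id  decimal(20,0) not null distkey,
    user_id     int,
    ip          varchar(45),      -- long enough for an IPv6 address in text form
    useragent   varchar(512),
    referrer    varchar(2048),
    start_time  timestamp,
    duration    int,              -- e.g. seconds
    primary key(session_id)
);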

How to Upload?

• Define the schema of your data
• Create a table (again, it is a SQL engine)
• Generate TSV or CSV file(s) from your source data
• Compress the file(s)
• Upload the file(s) to S3
  – This S3 bucket should be in the same region as the Redshift cluster (but it is no longer a must)
• Run a bulk insert (called "COPY"); a hedged example follows below
  – copy session_attribute [fields] from 's3://your_bucket/…' options
  – Options include AWS keys, whether the files are gzipped, the delimiter used, how many errors to tolerate, and so on
• Regular INSERT/UPDATE SQL statements can also be used
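For instance, a COPY statement for the session_attribute table might look like the sketch below; the bucket path and credentials are placeholders and the option values are illustrative:

-- Bulk-load gzipped TSV files from S3 into session_attribute.
copy session_attribute
from 's3://your_bucket/session_attribute/2014-09-01/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
delimiter '\t'      -- tab-separated input
gzip                -- input files are gzipped
maxerror 10;        -- tolerate up to 10 bad rows before failing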

Update Workflow

[Diagram: a cronjob on the data source server periodically uploads input files to S3, and a bulk insert (COPY) loads them from S3 into Redshift]

You can introduce a queue to which the S3 location of every incoming input file is pushed. A consumer of this queue reads from it and bulk-inserts the files into Redshift.

You might have to do ETL on your source data using Hadoop and so on

Incremental Update from MySQL

• Change your table schema if possible
  – You need an updatedon field in your table
  – Never delete a record; mark it as inactive instead
• Monitor your table changes and propagate them to Redshift (a hedged merge sketch follows below)
  – For example, use Databus from LinkedIn
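As a hedged sketch of how the propagated changes could be merged, assume the rows changed since the last sync (those with updatedon greater than the last sync time) have already been exported from MySQL and COPY-ed into a staging table; the users and users_staging names and the id column are assumptions:

-- Merge staged changes into the target table without creating duplicates.
begin;

-- Drop the old versions of any rows that were updated at the source.
delete from users
using users_staging
where users.id = users_staging.id;

-- Insert the new versions.
insert into users
select * from users_staging;

-- Empty the staging table for the next batch (delete, unlike truncate, stays in the transaction).
delete from users_staging;

commit;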

HOW TO ACCESS REDSHIFT

Different Ways to Access (1)

1. JDBC/ODBC desktop tools
   – SQL Workbench, Navicat, and so on
   – Requires IP registration for outside access
2. JDBC/ODBC libraries
   – Any PostgreSQL 8.0.x-compatible driver should work

In both cases, you use SQL statements.
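For instance, a typical analytic query issued over JDBC/ODBC might look like the hedged sketch below, which uses the hypothetical session table sketched earlier:

-- Daily unique browsers and session counts.
select date_trunc('day', start_time) as day,
       count(distinct browser_id)    as unique_browsers,
       count(*)                      as sessions
from session
group by 1
order by 1;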

Different Ways to Access (2)

3. Analytics tools such as Tableau or Birst
   – But these have too many features
   – You will likely need a dedicated analyst

RECOMMENDATION

Things to Consider

• How big are your tables?
• Would dumping your tables cause issues?
  – Site stability and so on
  – Or do you have a backup instance to dump from?
• Are your tables friendly to incremental updates?
  – An "updatedon" field
  – No deletion of records

Steps

• Start with a daily update
  – A daily full refresh is fine to begin with, to set up the end-to-end cycle
  – If the tables are big, dumping them can take a while
• Implement an incremental update mechanism
  – This will require either a table schema change or the use of some database change-tracking mechanism
• Go for a shorter update interval