AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AWS Summit 2013 Tel Aviv Oct 16 Tel Aviv, Israel Guy Ernest Solutions Architecture, Amazon Web Services Data Warehouse on AWS

Transcript of AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Page 1: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel

Guy Ernest

Solutions Architecture, Amazon Web Services

Data Warehouse on AWS

Page 2: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DATAWAREHOUSE

ERP

ANALYST CRM

DB

Page 3: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DATAWAREHOUSE

ERP

ANALYST CRM

DB

OLTP

OLTP

OLTP

OLAP

Page 4: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Transactional Processing    Analytical Processing
Transactional context       Global context
Latency                     Throughput
Indexed access              Full table scans
Random IO                   Sequential IO
Disk seek times             Disk transfer rate
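The contrast above between seek-bound and transfer-bound access can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the 10 ms seek time, the table size, and the row size are assumptions, and the 200 MB/s transfer rate is the "today" figure quoted later in this deck.

```python
# Assumed figures for illustration; only the transfer rate comes from the deck.
SEEK_TIME_S = 0.010           # assumed average seek time per random read
TRANSFER_RATE_BPS = 200e6     # 200 MB/s sequential transfer rate

def random_read_seconds(rows: int) -> float:
    """Time to fetch `rows` records when each one costs a disk seek (OLTP style)."""
    return rows * SEEK_TIME_S

def full_scan_seconds(table_rows: int, row_bytes: int) -> float:
    """Time to read the whole table sequentially (OLAP style)."""
    return table_rows * row_bytes / TRANSFER_RATE_BPS

TABLE_ROWS = 100_000_000      # assumed table size
ROW_BYTES = 100               # assumed row width

scan_seconds = full_scan_seconds(TABLE_ROWS, ROW_BYTES)
# Number of random reads that take as long as one full scan.
break_even_rows = scan_seconds / SEEK_TIME_S

print(f"full scan: {scan_seconds:.0f} s")
print(f"break-even: {break_even_rows:.0f} random reads")
```

Under these assumptions, fetching a handful of records by index is far cheaper than scanning, while touching a large share of the table is cheaper as one sequential scan. That is why transactional systems optimize for seek time and analytical systems for transfer rate.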

Page 5: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

OLTP

OLAP

Page 6: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DATAWAREHOUSE ANALYST

BUSINESS INTELLIGENCE REPORTS, DASHBOARD, …

PRODUCTION OFFLOAD DIFFERENT DATA STRUCTURE, USING ETLs, …

Page 7: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 8: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 9: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

BIG ENTERPRISES

VERY EXPENSIVE (ROI)

DIFFICULT TO MAINTAIN

NOT SCALABLE

Page 10: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

BIG ENTERPRISES SME

WAY TOO EXPENSIVE !

VERY EXPENSIVE (ROI)

DIFFICULT TO MAINTAIN

NOT SCALABLE

Page 11: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Jeff Bezos

Page 12: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Data Sources

Queries

Value

Page 13: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

+ ELASTIC CAPACITY
+ NO CAPEX
+ PAY FOR WHAT YOU USE
+ DISPOSE ON DEMAND

= NO CONSTRAINTS

Page 14: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

COLLECT STORE ANALYZE SHARE

ACCELERATION

AMAZON REDSHIFT

Page 15: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AMAZON REDSHIFT

Page 16: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

A data warehouse that scales to petabytes and…

AMAZON REDSHIFT

… WAY LESS EXPENSIVE

… WAY FASTER

…WAY SIMPLER

Page 17: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE

HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan

HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data

Page 18: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Extra Large Node (HS1.XL)

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL)

Cluster 2-100 Nodes (32 TB – 1.6 PB)
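The cluster ranges above follow from multiplying the node count by the per-node capacity (2 TB for HS1.XL, 16 TB for HS1.8XL). A minimal sketch of that arithmetic; the dictionary layout and function names are illustrative and not part of any AWS API:

```python
import math

# Per-node capacity and multi-node cluster limits as quoted on the slides.
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}

def cluster_capacity_tb(node_type: str, nodes: int) -> int:
    """Compressed-data capacity of a multi-node cluster, in TB."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= nodes <= spec["max_nodes"]:
        raise ValueError(f"{node_type} clusters have {spec['min_nodes']}-{spec['max_nodes']} nodes")
    return nodes * spec["tb_per_node"]

def nodes_needed(node_type: str, data_tb: float) -> int:
    """Smallest multi-node cluster of this type that holds `data_tb` of data."""
    spec = NODE_TYPES[node_type]
    nodes = max(spec["min_nodes"], math.ceil(data_tb / spec["tb_per_node"]))
    if nodes > spec["max_nodes"]:
        raise ValueError(f"{data_tb} TB does not fit in a {node_type} cluster")
    return nodes

print(cluster_capacity_tb("HS1.8XL", 100))  # 100 nodes x 16 TB, i.e. 1.6 PB
print(nodes_needed("HS1.8XL", 100))
```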

Page 19: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

10 GigE (HPC)

Ingestion Backup

Restoration

JDBC/ODBC

Page 20: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 21: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

…WAY SIMPLER

Page 22: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

LOADING DATA

Parallel Loading

Data sorted and distributed automatically

Linear Growth
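Parallel loading in Redshift is typically driven by the COPY command, which can read all files under an Amazon S3 prefix so that the input is split across the cluster. The sketch below only builds such a statement as a string. The table name, bucket, prefix, and IAM role ARN are hypothetical, and the file format options (gzip, pipe-delimited) are assumptions about the input.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads every object under an S3 prefix.

    The arguments are interpolated directly, so they must be trusted constants,
    never user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "GZIP\n"
        "DELIMITER '|';"
    )

# Hypothetical names, for illustration only.
statement = build_copy_statement(
    table="cpu_events",
    s3_prefix="s3://example-bucket/cpu-events/part-",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole",
)
print(statement)
```

Splitting the input into many files under one prefix, rather than one large file, is what lets the load run in parallel.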

Page 23: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DATA SNAPSHOTS

Automatic and Incremental snapshots in Amazon S3

Configurable Retention Period

Manual Snapshots

“Streaming” Restore

Page 24: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

REPLICATION IN CLUSTER +

AUTOMATIC SNAPSHOT IN AMAZON S3 +

MONITORING OF CLUSTER NODES

Page 25: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AUTOMATIC RESIZING

Page 26: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Read-only mode while resizing

New cluster is created in the background

Parallel node-to-node data copy

Only charged for a single cluster

Page 27: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Automatic DNS based endpoint cut-over

Deletion of source cluster

Page 28: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 29: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

CREATE A DATAWAREHOUSE IN MINUTES

Page 30: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 31: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 32: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 33: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 34: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 35: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 36: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 37: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

…WAY FASTER

Page 38: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS

DISK PERFORMANCE DOUBLES EVERY 10 YEARS

Page 39: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Progress is not evenly distributed

                 1980              Today       Change
Cost             $14,000,000/TB    $30/TB      ÷ 450,000
Capacity         100 MB            3 TB        × 30,000
Transfer rate    4 MB/s            200 MB/s    × 50
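The ratios on this slide can be checked from the quoted figures. The sketch below also derives one number that is not on the slide: how long it takes to read an entire disk at its full transfer rate. Decimal units (1 TB = 1,000,000 MB) are an assumption.

```python
# Figures quoted on the slide.
COST_PER_TB = {"1980": 14_000_000, "today": 30}   # dollars per TB
CAPACITY_MB = {"1980": 100, "today": 3_000_000}   # 100 MB vs 3 TB (decimal units assumed)
TRANSFER_MB_S = {"1980": 4, "today": 200}         # MB per second

cost_ratio = COST_PER_TB["1980"] / COST_PER_TB["today"]          # slide rounds this to 450,000
capacity_ratio = CAPACITY_MB["today"] / CAPACITY_MB["1980"]
transfer_ratio = TRANSFER_MB_S["today"] / TRANSFER_MB_S["1980"]

# Derived, not on the slide: seconds needed to read a whole disk sequentially.
full_read_seconds = {
    era: CAPACITY_MB[era] / TRANSFER_MB_S[era] for era in ("1980", "today")
}

print(f"cost: /{cost_ratio:,.0f}  capacity: x{capacity_ratio:,.0f}  transfer: x{transfer_ratio:,.0f}")
print(f"full disk read: {full_read_seconds['1980']:.0f} s then, {full_read_seconds['today']:.0f} s now")
```

Capacity grew about 600 times faster than transfer rate, so reading a full disk takes far longer today than in 1980. This is the imbalance behind the next slide's claim that I/O is the main factor for performance.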

Page 40: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

I/O IS THE MAIN FACTOR FOR PERFORMANCE

Page 41: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

• COLUMNAR STORAGE

• COMPRESSION PER COLUMN

• ZONE MAPS

• HARDWARE OPTIMIZED

• LARGE DATA BLOCK SIZE

Id     Age    State
123    20     CA
345    25     WA
678    40     FL
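Columnar storage and zone maps can be illustrated with the three sample rows above. This is a toy sketch, not how Redshift is implemented: the block size of 2 values and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# The sample rows from the slide: (Id, Age, State).
rows = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]

# Row storage keeps each record together; columnar storage keeps each column
# together, so a query on Age never has to read Id or State.
ids, ages, states = (list(column) for column in zip(*rows))

def build_zone_map(values: List[int], block_size: int) -> List[Tuple[int, int]]:
    """Return the (min, max) of each fixed-size block of a column."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]

def blocks_to_read(zone_map: List[Tuple[int, int]], lower_bound: int) -> List[int]:
    """Indexes of the blocks that may hold a value >= lower_bound."""
    return [i for i, (_, block_max) in enumerate(zone_map) if block_max >= lower_bound]

age_zone_map = build_zone_map(ages, block_size=2)
candidate_blocks = blocks_to_read(age_zone_map, lower_bound=30)

print(ages)
print(age_zone_map)
print(candidate_blocks)
```

A filter such as `Age >= 30` can skip any block whose maximum is below 30 without reading it, which is how zone maps reduce I/O. Storing similar values side by side is also what makes per-column compression effective.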

Page 42: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 43: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 44: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 45: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 46: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

TEST:

2 BILLION RECORDS

6 REPRESENTATIVE QUERIES

Page 47: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AMAZON REDSHIFT 2xHS1.8XL

Vs.

32 NODES, 4.2TB RAM, 1.6PB

Page 48: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

12x - 150x FASTER

Page 49: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 50: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

30 MINUTES

12 SECONDS

Page 51: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

…WAY LESS EXPENSIVE

Page 52: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

2x HS1.8XL: $3.65 / HOUR

$32,000 / YEAR

Page 53: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

HS1.XL instance        Per hour    Hourly price per TB    Yearly price per TB
On-Demand              $0.850      $0.425                 $3,723
1 Year Reservation     $0.500      $0.250                 $2,190
3 Years Reservation    $0.228      $0.114                 $999
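The per-TB columns follow from the hourly instance price: an HS1.XL node holds 2 TB, and a year has 8,760 hours. A short sketch of that arithmetic, using only prices quoted on the slides (the variable and function names are illustrative):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
HS1_XL_CAPACITY_TB = 2      # compressed data per HS1.XL node, as quoted earlier

# Hourly HS1.XL prices from the slide, in dollars.
HOURLY_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Years Reservation": 0.228,
}

def hourly_price_per_tb(hourly_price: float) -> float:
    return hourly_price / HS1_XL_CAPACITY_TB

def yearly_price_per_tb(hourly_price: float) -> int:
    return round(hourly_price_per_tb(hourly_price) * HOURS_PER_YEAR)

for plan, price in HOURLY_PRICE.items():
    print(f"{plan}: ${hourly_price_per_tb(price):.3f}/TB/hour, ${yearly_price_per_tb(price):,}/TB/year")

# The 2x HS1.8XL figure from the previous slide: $3.65 per hour over a full year.
two_node_yearly = round(3.65 * HOURS_PER_YEAR)
print(f"2x HS1.8XL: ${two_node_yearly:,}/year")
```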

Page 54: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 55: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Intel Analytics on AWS

Assaf Araki

October, 2013

Page 56: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Agenda

• Advanced Analytics @ Intel

• Enterprise on the Cloud

• Use Case

Page 57: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Advanced Analytics

• Vision: Make analytics a competitive advantage for Intel

• Mission:

• Solve strategic high value business line problems

• Leverage analytics to grow Intel revenue

• About the team:

• ~100 employees - corporate ownership of advanced analytics

• Big data and Machine Learning are key focus areas

• Skills: Software Engineering / Decision Science / Business Acumen

• Value driven – ROI>$10M and/or key corporate problem as defined by VPs

• Part of the Israel Academy Computational research center

Intel AA Team

Page 58: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Big Data Analytics Platform

• Highly scalable, hybrid platform to support a range of business use cases

MPP High Speed Data Loader

Rich advanced analytics and real-time, in-database data mining capabilities

Heterogeneous data, batch oriented on advanced analytics

Prediction Module

AA Overview

Page 59: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Why Cloud ?

• Known reasons

– Reduce cost

– Universal access

– Scale fast

• Additional reasons

– Flexible & Agile platform – no need to certify each tool by engineering team

– Development accelerator – R&D team can start developing while engineering teams implement the platform on premise

Enterprise On the Cloud

Page 60: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Use Case

• Characteristics:

– CPU behavior data

– Size: 30TB of data per month

– Type: Structured data

– Processing:

• Create aggregation facts and enable ad hoc analysis

• Create ML solutions

• Current Status:

– Data is sampled and processed on SMP RDBMS

– Takes almost 24 hours to process the entire data set

• Problem Statement

– Limited ability to analyze all data

Use Case

Page 61: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Platforms

• On premise

– HBase – Hadoop platform exists

• No HBase

– MPP DB – Exists with Machine Learning capabilities

• Lower-cost platform to evaluate and purchase

• Cloud

– HBase - EMR

– MPP DB - AWS Redshift

Enterprise On the Cloud

Go for POC on the Cloud

Page 62: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Evaluation Criteria

• Capabilities

– Create statistics calculations

• Cost of HW per TB

– Replication

– Compression

• Performance

– Load, transformation, querying

• Scalability

• Ability to execute

Enterprise On the Cloud

Page 63: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Preliminary Results

• Dataset example

– 34GB compressed data divided into files

– ~1,500,000,000 records

– 24B compressed, 240B per record ( ~15 columns )

• Performance & Scalability - 8 x 1XL nodes

– Load time – for 32 files – 2 hours ( 4 files – 5 hours )

– Table size – 202GB (compression rate ~1.5:1)

– SQL aggregation statements

• 38K records – 6 minutes

• 14M records – 7 minutes

• 66M records – 11 minutes ( on 4 x 1XL – 22 minutes )

• 939M records – 34 minutes ( on 4 x 1XL – 77 minutes )
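The timings above allow a rough scalability check: if performance scales linearly, doubling the node count from 4 to 8 should halve the run time, giving a speedup of 2.0. The sketch below computes the ratios from the quoted figures only; the function and variable names are illustrative.

```python
def speedup(slow_minutes: float, fast_minutes: float) -> float:
    """How many times faster the second configuration is than the first."""
    return slow_minutes / fast_minutes

# Aggregation query times in minutes, keyed by result size, from the slide.
QUERY_MINUTES = {
    66_000_000: {"4_nodes": 22, "8_nodes": 11},
    939_000_000: {"4_nodes": 77, "8_nodes": 34},
}

query_speedups = {
    records: speedup(times["4_nodes"], times["8_nodes"])
    for records, times in QUERY_MINUTES.items()
}

# Load time on 8 nodes: 5 hours with 4 input files, 2 hours with 32 input files.
load_speedup = speedup(5 * 60, 2 * 60)

for records, ratio in query_speedups.items():
    print(f"{records:,} records: {ratio:.2f}x faster on 8 nodes than on 4")
print(f"load: {load_speedup:.1f}x faster with 32 files than with 4")
```

Both queries speed up by at least a factor of 2 when the cluster doubles. Splitting the same data into more files also shortens the load, which fits the parallel loading described earlier in the deck.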

Use Case

Page 64: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Capabilities and Cost

• No current ability to write code (Java/C++/Python/R)

– Implement statistics and algorithm in SQL

• Compression is not straightforward

• Cost sensitive for actual compression

– 2.6 : 1 is break even

• 8XL vs. High Storage instance (16 cores 48TB)

• 3 years with 100% utilization

Use Case

Page 65: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

[email protected]

Page 66: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Intel Confidential

Thank You!

Page 67: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

USE CASE

Page 68: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

AMAZON ELASTIC MAPREDUCE

AMAZON DYNAMODB

AMAZON EC2

AWS STORAGE GATEWAY

AMAZON S3

DATA CENTER

AMAZON RDS

AMAZON REDSHIFT

Page 69: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

UPLOAD TO AMAZON S3

AWS IMPORT/EXPORT

AWS DIRECT CONNECT

DATA

INTEGRATION

INTEGRATION

SYSTEMS

Page 70: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 71: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 72: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

2 million

15 million

MEMBER REGISTRATIONS

2011 2012 2013

Page 73: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

1,500,000+ NEW MEMBERS EACH MONTH

Page 74: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

1,200,000,000+ SOCIAL CONNECTIONS IMPORTED

Page 75: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Data Analyst

Raw Data

Get Data

Join via Facebook

Add a Skill Page

Invite Friends

Web Servers Amazon S3 User Action Trace Events

EMR Hive Scripts Process Content

• Process log files with regular expressions to parse out the info we need.

• Process cookies into useful, searchable data such as Session, UserId, and API security token.

• Filter surplus info such as internal Varnish logging.
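The three processing steps above can be sketched in a few lines. The slides describe Hive scripts on EMR; the Python below only illustrates the same logic. The log line format, the cookie names, and the `/varnish` path prefix used to spot internal traffic are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical log format: ip [timestamp] "METHOD path" status "cookie string"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>[^"\s]+)" '
    r'(?P<status>\d{3}) "(?P<cookie>[^"]*)"'
)

def parse_cookie(raw: str) -> Dict[str, str]:
    """Split 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    pairs = (item.split("=", 1) for item in raw.split("; ") if "=" in item)
    return {name: value for name, value in pairs}

def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return a searchable event, or None for unparseable or internal lines."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    if match.group("path").startswith("/varnish"):  # assumed marker for internal logging
        return None
    cookie = parse_cookie(match.group("cookie"))
    return {
        "timestamp": match.group("timestamp"),
        "path": match.group("path"),
        "session": cookie.get("Session", ""),
        "user_id": cookie.get("UserId", ""),
        "api_token": cookie.get("ApiToken", ""),
    }

SAMPLE = '10.0.0.1 [16/Oct/2013:10:00:00 +0000] "GET /skills/add" 200 "Session=abc123; UserId=42; ApiToken=t0k3n"'
INTERNAL = '10.0.0.2 [16/Oct/2013:10:00:01 +0000] "GET /varnish-health" 200 ""'

event = parse_line(SAMPLE)
print(event)
print(parse_line(INTERNAL))
```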

Amazon S3

Aggregated Data

Raw Events

Internal Web

Excel Tableau

Amazon Redshift

Page 76: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

ELASTIC DATA WAREHOUSE

Page 77: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 78: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 79: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 80: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 81: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 82: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 83: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Monthly Reports on a new cluster

Page 84: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Redshift

Reporting and BI

EMR

S3

Page 85: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DynamoDB Redshift

OLTP Web Apps

Reporting and BI

Page 86: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

RDBMS Redshift

OLTP ERP

Reporting & BI

Page 87: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

+

RDBMS Redshift

OLTP ERP

Reporting & BI

Page 88: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

JDBC/ODBC

Amazon Redshift

Page 89: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 90: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 91: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 92: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 93: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 94: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse
Page 95: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

DATAWAREHOUSE BY AWS

Pay per use, no CAPEX

Low cost for high performance

Open, and integrates with existing BI tools

Simple to use and scalable

Page 96: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Speed and Agility

Frequent Experiments

Low Cost of Failure

More Innovation

“On Premise”

Fewer Experiments

High Cost of Failure

Less Innovation

Page 97: AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

Thank you very much