AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data


Transcript of AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel

Jan Borch | AWS Solutions Architect

Data Analytics on Big Data

GENERATE STORE ANALYZE SHARE

THE COST OF DATA GENERATION IS FALLING

Progress is not evenly distributed

1980 vs. today:

  Cost per TB:     $14,000,000  ->  $30       (÷ 450,000)
  Drive capacity:  100 MB       ->  3 TB      (× 30,000)
  Throughput:      4 MB/s       ->  200 MB/s  (× 50)

THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT

GENERATE STORE ANALYZE SHARE

Generation and storage: lower cost, higher throughput. Analysis: highly constrained.

[Chart: generated data vs. data available for analysis, by data volume; sources below]

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

GENERATE STORE ANALYZE SHARE

ACCELERATE

+ ELASTIC AND HIGHLY SCALABLE

+ NO UPFRONT CAPITAL EXPENSE

+ ONLY PAY FOR WHAT YOU USE

+ AVAILABLE ON-DEMAND

= REMOVE CONSTRAINTS

GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon CloudFront

• Fluentd

• Flume

• Scribe

• Chukwa

• LogStash

A Logstash output that ships collected logs to S3 (settings in a Logstash config are newline-separated, not comma-separated; size_file takes bytes):

output {
  s3 {
    bucket => "myBucket"
    aws_credentials_file => "~/cred.json"
    size_file => 125829120   # bytes, roughly 120 MB per uploaded chunk
  }
}

“Poor man’s analytics”: embed a 1×1 tracking pixel and encode the analytics payload in its query string:

http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
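If the pixel is served through CloudFront, the access logs capture each hit's full query string, so the payload can be mined from the logs. A minimal Python sketch of decoding the utm* fields from one log line; the cs-uri-query column position is an assumption based on the standard CloudFront access-log format:

import urllib.parse

def parse_pixel_hit(log_line: str) -> dict:
    """Extract utm* analytics fields from one CloudFront access-log line."""
    fields = log_line.split("\t")
    query = fields[11]  # cs-uri-query column (assumed position, adjust per log version)
    params = urllib.parse.parse_qs(query)  # parse_qs percent-decodes the values
    return {k: v[0] for k, v in params.items() if k.startswith("utm")}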

GENERATE STORE ANALYZE SHARE

AWS Import / Export

AWS Direct Connect

Amazon Elastic MapReduce

Moving data into AWS:

• Generated and stored in AWS
• Inbound data transfer is free
• Multipart upload to S3 (see the sketch after this list)
• Physical media via AWS Import/Export
• AWS Direct Connect
• Regional replication of AMIs and snapshots
• Aggregation with S3DistCp
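A minimal sketch of the multipart-upload item above, using today's boto3 (bucket and file names are illustrative); boto3's managed transfer switches to parallel multipart uploads once a file crosses the size threshold:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # upload in 64 MB parts, in parallel
)
s3.upload_file("webserver.log.gz", "myawsbucket", "logs/webserver.log.gz",
               Config=config)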

S3DistCp on EMR, job sample:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
  /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args \
  '--src,s3://myawsbucket/cf,\
  --dest,s3://myoutputbucket/aggregate,\
  --groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
  --targetSize,128,\
  --outputCodec,lzo,\
  --deleteOnSuccess'

GENERATE STORE ANALYZE SHARE

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, data on Amazon EC2

AMAZON S3: SIMPLE STORAGE SERVICE

AMAZON DYNAMODB: HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE

• DURABLE & AVAILABLE: consistent, disk-only writes (SSD)
• LOW LATENCY: average reads < 5 ms, writes < 10 ms
• NO ADMINISTRATION

Ads table (not many rows; frequent, near-real-time updates):

  ad-id | advertiser | max-price | imps-to-deliver | imps-delivered
  1     | AAA        | 100       | 50000           | 1200
  2     | BBB        | 150       | 30000           | 2500

Profiles table (so many rows; updated in a batch manner):

  user-id | attribute1 | attribute2 | attribute3 | attribute4
  A       | XXX        | XXX        | XXX        | XXX
  B       | YYY        | YYY        | YYY        | YYY

Very general table structure.
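For the frequent-update side, DynamoDB's atomic ADD avoids read-modify-write races when many ad servers bump the same counter. A hedged boto3 sketch; the table and attribute names mirror the example rows above but are otherwise illustrative:

import boto3

ads = boto3.resource("dynamodb").Table("Ads")

# Atomically add one delivered impression to ad 1; ADD is an atomic
# counter update, so concurrent writers never clobber each other.
ads.update_item(
    Key={"ad-id": 1},
    UpdateExpression="ADD #delivered :one",
    ExpressionAttributeNames={"#delivered": "imps-delivered"},
    ExpressionAttributeValues={":one": 1},
)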

500,000 WRITES PER SECOND DURING SUPER BOWL

AMAZON GLACIER: RELIABLE LONG-TERM ARCHIVING

S3 lifecycle policies drive the archive flow:
• If an object is older than 5 months: archive it from Amazon S3 to Amazon Glacier
• If an object is older than 1 year: delete it from S3 (/dev/null)
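Expressed as an S3 lifecycle configuration, the two rules above might look like this boto3 sketch (the bucket name is illustrative; "5 months" is approximated as 150 days because lifecycle rules are day-based):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Filter": {"Prefix": ""},   # apply to every object in the bucket
            "Status": "Enabled",
            "Transitions": [{"Days": 150,            # ~5 months
                             "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},             # delete after 1 year
        }]
    },
)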

AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS

DESIGN OBJECTIVES: a petabyte-scale data warehouse service that was…

AMAZON REDSHIFT

A Whole Lot Simpler

A Lot Cheaper

A Lot Faster

AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE

• HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed user storage, 2 GB/sec scan rate
• HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed user storage

30 MINUTES DOWN TO 12 SECONDS

AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG

• Extra Large Node (HS1.XL): single node (2 TB), or cluster of 2-32 nodes (4 TB – 64 TB)
• Eight Extra Large Node (HS1.8XL): cluster of 2-100 nodes (32 TB – 1.6 PB)
• Access via JDBC/ODBC

Pricing:

                        Price per hour for    Effective hourly    Effective annual
                        HS1.XL single node    price per TB        price per TB
  On-Demand             $0.850                $0.425              $3,723
  1-Year Reservation    $0.500                $0.250              $2,190
  3-Year Reservation    $0.228                $0.114              $999

DATA WAREHOUSING DONE THE AWS WAY

No upfront costs, pay as you go

Really fast performance at a really low price

Open and flexible with support for popular tools

Easy to provision and scale up massively

USAGE SCENARIOS

Reporting warehouse (RDBMS → Redshift → reporting and BI, fed by OLTP/ERP):
• Accelerated operational reporting
• Support for short-time use cases
• Data compression, index redundancy

On-premises integration (RDBMS → Redshift via data integration partners*, fed by OLTP/ERP → reporting and BI)

Live archive for (structured) big data (DynamoDB → Redshift, fed by OLTP web apps → reporting and BI):
• Direct integration with the COPY command
• High-velocity data
• Data ages into Redshift
• Low-cost, high-scale option for new apps

Cloud ETL for big data (S3 → Elastic MapReduce → Redshift → reporting and BI):
• Maintain online SQL access to historical logs
• Transformation and enrichment with EMR
• Longer history ensures better insight

Create a table for the CloudFront logs:

create table cf_logs (
  d date,
  t char(8),
  edge char(4),
  bytes int,
  cip varchar(15),
  verb char(3),
  distro varchar(MAX),
  object varchar(MAX),
  status int,
  referer varchar(MAX),
  agent varchar(MAX),
  qs varchar(MAX)
);

COPY into Amazon Redshift

copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';


GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon Elastic MapReduce

AMAZON EC2: ELASTIC COMPUTE CLOUD

EC2 instance families:

• General purpose (m1.small): 1 virtual core, 1.7 GiB memory, moderate I/O performance
• Compute optimized (cc2.8xlarge): 32 virtual cores (2 x Intel Xeon), 60.5 GiB memory, 10 Gbit I/O
• Memory optimized (cr1.8xlarge): 32 virtual cores (2 x Intel Xeon), 240 GiB memory, 10 Gbit I/O, 240 GB SSD instance store
• Storage optimized (hi1.4xlarge): 16 virtual cores, 60.5 GiB memory, 10 Gbit I/O, 2 x 1 TB SSD instance store
• Storage optimized (hs1.8xlarge): 16 virtual cores, 117 GiB memory, 10 Gbit I/O, 24 x 2 TB instance store

On a single instance: compute time 4 h, cost 4 h x $2.10 = $8.40.
On four instances: compute time 1 h, cost 1 h x 4 x $2.10 = $8.40.

3 hours at $4,828.85/hr instead of $20+ million in infrastructure.

• A FRAMEWORK

• SPLITS DATA INTO PIECES

• LETS PROCESSING OCCUR

• GATHERS THE RESULTS
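A toy, single-process Python sketch of that flow (not EMR code), just to make the split/process/gather steps concrete with a word count:

from itertools import groupby

def map_phase(chunk):
    # emit (word, 1) for every word in one piece of the input
    for word in chunk.split():
        yield (word.lower(), 1)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]   # data split into pieces
mapped = [kv for chunk in chunks for kv in map_phase(chunk)]  # processing occurs per piece
mapped.sort(key=lambda kv: kv[0])                             # shuffle/sort by key
results = {word: sum(v for _, v in group)                     # gather the results
           for word, group in groupby(mapped, key=lambda kv: kv[0])}
print(results)  # {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}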

AMAZON ELASTIC MAPREDUCE: HADOOP AS A SERVICE

How an EMR job flows between your corporate data center and the elastic data center:

1. Application data and logs for analysis are pushed to S3.
2. Amazon Elastic MapReduce starts a master node to control the analysis.
3. A Hadoop cluster is started by Elastic MapReduce.
4. Many hundreds or thousands of nodes are added as needed.
5. The cluster is disposed of when the job completes.
6. Results of the analysis are pulled back into your systems.

Your spreadsheet does not scale…

PIG

A real Pig script (used at Twitter). Run it on a sample dataset on your laptop:

$ pig -f myPigFile.q

Run the same script on a 50-node Hadoop cluster:

$ ./elastic-mapreduce --create \
    --name "$USER's Pig JobFlow" \
    --pig-script \
    --args s3://myawsbucket/mypigquery.q \
    --instance-type m1.xlarge --instance-count 50

Grow a running job flow by adding a task instance group:

$ elastic-mapreduce -j j-21IMWIA28LRK1 \
    --add-instance-group task \
    --instance-count 10 \
    --instance-type m1.xlarge

GENERATE STORE ANALYZE SHARE

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, data on Amazon EC2

PUBLIC DATA SETS http://aws.amazon.com/publicdatasets

GENERATE STORE ANALYZE SHARE

AWS Data Pipeline

• Data-intensive orchestration and automation
• Reliable and scheduled
• Easy to use, drag and drop
• Execution and retry logic
• Map data dependencies
• Create and manage compute resources

GENERATE STORE ANALYZE SHARE

Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, data on Amazon EC2

AWS Import/Export, AWS Direct Connect

Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, data on Amazon EC2

Amazon EC2, Amazon Elastic MapReduce

AWS Data Pipeline

FROM DATA TO ACTIONABLE INFORMATION

Shlomi Vaknin | Oct 16, 2013

Amazon AWS as the big data core component for Ginger Software

English writing assistant

An open platform for personal assistants

• Users talk naturally with any mobile application, Ginger understands and executes their command

• An end-to-end Speech-to-Action solution

• First open platform for creating personal assistants


Natural language speech interface for mobile apps

[Architecture diagram: Proofreader, Rephrase, Writing Assistant, and Personal Coach applications on top of a PA platform (DB, semantic model, query understanding), a speech engine, and NLP/NLU algorithms backed by a language model over web, domain, and user corpora]

Our platform depends on scanning and indexing all the language we can find on the internet:

• A collection of all the language we found on the internet, accessible and pre-processed
• Has to contain lots and lots of sentences
• Needs to represent “common written language”
• Accessible both for offline (research) and online (service) uses


1. Crawling [own cluster, EMR+S3]
   • Generated about 50 TB of raw data
   • Reduced to about 5 TB of text data
2. Post-processing [EMR+S3]
   • Tokenize, normalize, split into n-grams
   • Generalize, count, filter
3. Indexing/serving [EMR+S3]
   • Key/value: has to be super fast
   • Full-text search
4. Archiving [S3+Glacier]
   • Keeping data available for later research while minimizing cost


• Mainly an NLP task
• So we picked Clojure: it’s a Lisp, and it integrates very well with EMR, S3, etc.
• n-gram counting: “How are you” yields “How are you”, “How are”, “are you”, “How”, “are”, “you”
  • Lots of grams are repeated
  • Generalize contextually similar tokens
• Fits the map-reduce paradigm very well: most parts can be trivially parallelized; one part is sequential by grams
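Ginger’s production code was Clojure on EMR; the sketch below is only a minimal Python illustration of the n-gram counting idea from the bullets above:

from collections import Counter

def ngrams(tokens, max_n=3):
    # yield every 1..max_n-gram, e.g. "How are you" ->
    # "How", "are", "you", "How are", "are you", "How are you"
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

counts = Counter()
for sentence in ["How are you", "How are things"]:
    counts.update(ngrams(sentence.split()))

print(counts["How are"])  # 2 -- repeated grams collapse into one counted entry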


• EMR cluster node types: master, task, core
• Ratio between core and task nodes: we expected a very large output (~100 TB)
• Each m2.4xlarge core node offers about 1,690 GB of storage, which sets the number of core nodes needed to hold the output
• Estimate the number of total map tasks
• Final specs:

  Node type | Instance    | Count
  MASTER    | cc2.8xlarge | 1
  CORE      | m2.4xlarge  | 200
  TASK      | m2.2xlarge  | 500
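A back-of-envelope sketch of that core-node sizing; the HDFS replication factor is my assumption, not stated in the talk:

# ~100 TB of expected output, HDFS replication factor 3 (assumed),
# 1,690 GB of instance storage per m2.4xlarge core node
expected_output_tb = 100
replication = 3
per_node_tb = 1.69

core_nodes = expected_output_tb * replication / per_node_tb
print(round(core_nodes))  # ~178, in line with the 200 core nodes used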


• Job took about 30 hours to complete

• We generated nearly 100TB of output data

• During map phase, the cluster achieved nearly 100% utilization

• After initial filtration, 20TB remained


• Stay up to date with AMI releases: don’t stick to an old AMI just because it previously worked
• Use the JobTracker; use custom progress notification; increase mapred.task.timeout
• Limit the number of concurrent map tasks: use the minimum number that gets you close to 100% CPU
• Beware of spot nodes: if you ask for too many, you might compete against your own price


• Stash the data for later use, to reduce cost
• Glacier offers very cheap storage
• Important things to know about Glacier:
  • Restoring the data can be VERY expensive
  • The key to reducing restore costs: restore SLOWLY
  • There is no built-in mechanism to restore slowly: use a 3rd-party application or do it manually
• Glacier is very useful if your use case matches its design
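One way to “restore slowly” by hand is to request restores in small daily batches, capping the peak retrieval rate that drives the cost. A boto3 sketch for objects archived from S3 to Glacier; the bucket, prefix, and batch size are illustrative:

import itertools
import boto3

s3 = boto3.client("s3")
pages = s3.get_paginator("list_objects_v2").paginate(
    Bucket="my-archive", Prefix="corpus/")
keys = (obj["Key"] for page in pages for obj in page.get("Contents", []))

DAILY_BATCH = 500  # tune so the peak restore rate stays within budget
for key in itertools.islice(keys, DAILY_BATCH):
    s3.restore_object(
        Bucket="my-archive",
        Key=key,
        RestoreRequest={"Days": 7},  # keep the thawed copy around for a week
    )
# Schedule this (e.g. via cron) to run daily until everything is restored.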


• EMR/S3 provides great power and elasticity, to grow and shrink as required

• Do your homework before running large jobs!


• Our platform depends on scanning and indexing all the language we can find on the internet

• To achieve this Ginger Software makes heavy use of Amazon EMR

• With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed

• This gives Ginger Software the ability to create the world’s most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods

We are hiring! shlomiv@gingersoftware.com

Thank You!