AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AWS Summit 2013 Tel Aviv Oct 16 Tel Aviv, Israel Jan Borch | AWS Solutions Architect Data Analytics on BigData


Transcript of AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Page 1: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel

Jan Borch | AWS Solutions Architect

Data Analytics on BigData

Page 2: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Page 3: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

THE COST OF DATA

GENERATION IS FALLING

Page 4: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 5: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 6: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Progress is not evenly distributed

1980              Today       Change

$14,000,000/TB    $30/TB      450,000 ÷
100 MB            3 TB        30,000 X
4 MB/s            200 MB/s    50 X
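The three factors on this slide follow from the 1980 and today figures. The sketch below recomputes them, assuming decimal units (1 TB = 1,000,000 MB) and reading the three rows as cost per TB, capacity, and transfer rate; those row names are our interpretation, not part of the transcript. It also derives how long a full end-to-end read takes, which is one way to see why progress is "not evenly distributed".

```python
# Figures as given on the slide; decimal units are an assumption.
MB_PER_TB = 1_000_000

cost_1980_per_tb = 14_000_000        # USD per TB, 1980
cost_today_per_tb = 30               # USD per TB, today
capacity_1980_mb = 100               # 100 MB
capacity_today_mb = 3 * MB_PER_TB    # 3 TB
throughput_1980 = 4                  # MB/s
throughput_today = 200               # MB/s

cost_drop = cost_1980_per_tb / cost_today_per_tb
capacity_gain = capacity_today_mb / capacity_1980_mb
throughput_gain = throughput_today / throughput_1980

# Time needed to read the whole device once, in seconds.
full_scan_1980_s = capacity_1980_mb / throughput_1980
full_scan_today_s = capacity_today_mb / throughput_today

print(f"cost per TB fell by a factor of about {cost_drop:,.0f}")
print(f"capacity grew {capacity_gain:,.0f}x, throughput only {throughput_gain:,.0f}x")
print(f"full scan: {full_scan_1980_s:,.0f} s in 1980, {full_scan_today_s:,.0f} s today")
```

The cost ratio comes out near 467,000, which the slide rounds to 450,000. Because capacity grew far faster than throughput, reading a full device takes longer today than in 1980.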

Page 7: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

THE MORE DATA YOU COLLECT

THE MORE VALUE YOU CAN

DERIVE FROM IT

Page 8: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 9: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 10: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Lower cost,

higher throughput

Page 11: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Lower cost,

higher throughput

Highly

constrained

Page 12: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Generated data

Available for analysis

DATA VOLUME

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011

IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares

Page 13: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Page 14: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

ACCELERATE

Page 15: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

+ ELASTIC AND HIGHLY SCALABLE

+ NO UPFRONT CAPITAL EXPENSE

+ ONLY PAY FOR WHAT YOU USE

+ AVAILABLE ON-DEMAND

= REMOVE CONSTRAINTS

Page 16: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon CloudFront

Page 17: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 18: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Fluentd

• Flume

• Scribe

• Chukwa

• LogStash

output {
  s3 {
    bucket => "myBucket"
    aws_credential_file => "~/cred.json"
    size_file => 125829120   # 120 MB, in bytes
  }
}

Page 19: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 20: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 21: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 22: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

“Poor man’s Analytics”

Page 23: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 24: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Embed poor-man pixel

http://www.poor-man-analytics.com/__track.gif?idt=5.1.5&idc=5&utmn=1532897343&utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.3%20r181&utmdt=%E8%B1%86%E7%93%A3&utmhid=571356425&utmr=-&utmp=%2F&utmac=UA-7019765-1&utmcc=__utma%3D30149280.1785629903.1314674330.1315290610.1315452707.10%3B%2B__utmz%3D30149280.1315452707.10.7.utmcsr%3Dbiaodianfu.com%7Cutmccn%3D(referral)%7Cutmcmd%3Dreferral%7Cutmcct%3D%2Fpoor-man-analytics-architecture.html%3B%2B__utmv%3D30149280.162%3B&utmu=qBM~
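Every page view then produces one request like the one above, and the interesting data lives in the query string. A minimal sketch of turning such a request into fields, using only the Python standard library; the helper name `parse_pixel_request` and the shortened sample URL are ours, for illustration only:

```python
from urllib.parse import urlsplit, parse_qs

def parse_pixel_request(url: str) -> dict:
    """Return the query-string fields of a tracking-pixel request as a flat dict."""
    query = urlsplit(url).query
    # parse_qs decodes percent-escapes and returns a list per key; keep the first value.
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}

# Shortened version of the request shown on the slide.
sample = (
    "http://www.poor-man-analytics.com/__track.gif"
    "?utmhn=www.douban.com&utmcs=UTF-8&utmsr=1440x900&utmul=en-us&utmp=%2F"
)
fields = parse_pixel_request(sample)
print(fields["utmhn"], fields["utmsr"], fields["utmp"])
```

In a setup like the one sketched in the talk, the same parsing would presumably run over access-log lines rather than live requests.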

Page 25: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Page 26: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

AWS Import / Export

AWS Direct Connect

Amazon Elastic MapReduce

Page 27: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Generated and stored in AWS

Inbound data transfer is free

Multipart upload to S3

Physical media

AWS Direct Connect

Regional replication of AMIs and snapshots

Page 28: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Aggregation with S3DistCp

Page 29: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

S3DistCp on EMR job sample

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--args \
'--src,s3://myawsbucket/cf,\
--dest,s3://myoutputbucket/aggregate,\
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,\
--targetSize,128,\
--outputCodec,lzo,\
--deleteOnSuccess'
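The `--groupBy` argument is what turns many small log files into a few large ones: files whose names yield the same captured text end up in the same output group. The following is only a local illustration of that grouping idea with Python's `re` module, not S3DistCp itself; the object names are hypothetical and the helper `group_keys` is ours.

```python
import re
from collections import defaultdict

def group_keys(keys, pattern):
    """Group object keys by the text captured by the first group of `pattern`."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match:  # keys that do not match are simply left out of this sketch
            groups[match.group(1)].append(key)
    return dict(groups)

# Hypothetical CloudFront log object names, for illustration only.
keys = [
    "cf/XABCD12345678.2013-10-16-09.aaaa1111.gz",
    "cf/XABCD12345678.2013-10-16-09.bbbb2222.gz",
    "cf/XABCD12345678.2013-10-16-10.cccc3333.gz",
]
pattern = r".*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*"
grouped = group_keys(keys, pattern)
for hour, members in grouped.items():
    print(hour, len(members))
```

With the pattern from the slide, the capture group is the date-and-hour part of the file name, so the result is one group per hour of logs.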

Page 30: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon Glacier,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

AWS Storage Gateway,

Data on Amazon EC2

Page 31: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 32: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 33: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON S3 SIMPLE STORAGE SERVICE

Page 34: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 35: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 36: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON DYNAMODB

HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE

Page 37: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

DURABLE & AVAILABLE

CONSISTENT, DISK-ONLY WRITES (SSD)

Page 38: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

LOW LATENCY

AVERAGE READS < 5 MS, WRITES < 10 MS

Page 39: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

NO ADMINISTRATION

Page 40: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Very general table structure

Ads (not many rows; frequent update, near realtime)

ad-id   advertiser   max-price   imps to deliver   imps delivered
1       AAA          100         50000             1200
2       BBB          150         30000             2500

Profiles (so many rows; batch manner update)

user-id   attribute1   attribute2   attribute3   attribute4
A         XXX          XXX          XXX          XXX
B         YYY          YYY          YYY          YYY

Page 41: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

500,000 WRITES PER SECOND

DURING SUPER BOWL

Page 42: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 43: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 44: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON GLACIER

Reliable long-term archiving

Page 45: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

S3 Lifecycle policies

Amazon S3 → Amazon Glacier: archive if object is older than 5 months

Page 46: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

S3 Lifecycle policies

Amazon S3 → Amazon Glacier: archive if object is older than 5 months

Amazon S3 → /dev/null: delete object from S3 if object is older than 1 year
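The two rules on this slide can be written down as a single lifecycle rule. The sketch below builds the configuration as a plain Python dictionary in the shape that, to our knowledge, the S3 lifecycle API accepts (for example via boto3's `put_bucket_lifecycle_configuration`). Treating 5 months as 150 days, the rule ID, and the helper name are our assumptions.

```python
import json

def lifecycle_rules(archive_after_days: int, delete_after_days: int, prefix: str = "") -> dict:
    """Build one rule: move objects to Glacier first, then expire (delete) them."""
    if delete_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they are deleted")
    return {
        "Rules": [
            {
                "ID": "archive-then-delete",          # hypothetical rule name
                "Filter": {"Prefix": prefix},         # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }

# 5 months approximated as 150 days (assumption), 1 year as 365 days.
config = lifecycle_rules(archive_after_days=150, delete_after_days=365)
print(json.dumps(config, indent=2))
```

Nothing here talks to AWS; the dictionary would still have to be submitted to the bucket with the tool of your choice.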

Page 47: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 48: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 49: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON REDSHIFT

FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS

Page 50: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 51: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was…

AMAZON REDSHIFT

A Whole Lot Simpler

A Lot Cheaper

A Lot Faster

Page 52: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON REDSHIFT

RUNS ON OPTIMIZED HARDWARE

HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate

HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage

Page 53: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 54: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 55: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

30 MINUTES

DOWN TO

12 SECONDS

Page 56: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 57: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG

Extra Large Node (HS1.XL)
Single Node (2 TB)
Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL)
Cluster 2-100 Nodes (32 TB – 1.6 PB)

Page 58: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

JDBC/ODBC

Page 59: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 60: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

                     Price per hour for    Effective hourly    Effective annual
                     HS1.XL single node    price per TB        price per TB
On-Demand            $0.850                $0.425              $3,723
1 Year Reservation   $0.500                $0.250              $2,190
3 Year Reservation   $0.228                $0.114              $999
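The "effective" columns are derived from the node price: an HS1.XL node holds 2 TB, and a year has 8,760 hours. A short check of that arithmetic, using only the prices shown on the slide (the function name is ours):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
NODE_STORAGE_TB = 2         # HS1.XL: 2 TB of compressed storage per node

# Hourly price of one HS1.XL node, as listed on the slide.
hourly_node_price = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}

def per_tb_prices(node_price: float) -> tuple:
    """Return (effective hourly price per TB, effective annual price per TB)."""
    hourly_per_tb = node_price / NODE_STORAGE_TB
    return hourly_per_tb, hourly_per_tb * HOURS_PER_YEAR

table = {plan: per_tb_prices(price) for plan, price in hourly_node_price.items()}
for plan, (hourly, annual) in table.items():
    print(f"{plan:<20} ${hourly:.3f}/TB/hour   ${annual:,.0f}/TB/year")
```

Rounded to whole dollars, the annual figures match the slide: $3,723, $2,190 and $999 per TB.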

Page 61: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

DATA WAREHOUSING DONE THE AWS WAY

No upfront costs, pay as you go

Really fast performance at a really low price

Open and flexible with support for popular tools

Easy to provision and scale up massively

Page 62: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

USAGE SCENARIOS

Page 63: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Reporting Warehouse

Accelerated operational reporting

Support for short-time use cases

Data compression, index redundancy

RDBMS (OLTP, ERP) → Redshift (Reporting and BI)

Page 64: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Data Integration Partners*

On-Premises Integration

RDBMS (OLTP, ERP) → Redshift (Reporting and BI)

Page 65: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Live Archive for (Structured) Big Data

Direct integration with copy command

High velocity data

Data ages into Redshift

Low cost, high scale option for new apps

DynamoDB (OLTP, Web Apps) → Redshift (Reporting and BI)

Page 66: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Cloud ETL for Big Data

Maintain online SQL access to historical logs

Transformation and enrichment with EMR

Longer history ensures better insight

S3 → Elastic MapReduce → Redshift (Reporting and BI)

Page 67: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

COPY into Amazon Redshift

create table cf_logs (
  d date,
  t char(8),
  edge char(4),
  bytes int,
  cip varchar(15),
  verb char(3),
  distro varchar(MAX),
  object varchar(MAX),
  status int,
  referer varchar(MAX),
  agent varchar(MAX),
  qs varchar(MAX)
);

Page 68: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

COPY into Amazon Redshift

copy cf_logs from 's3://cfri/cflogs-sm/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD';
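To see what the COPY options mean, it helps to apply them by hand to one log file: decompress (GZIP), skip the first two lines (IGNOREHEADER 2), and split on tabs (DELIMITER '\t'). The sketch below does that locally with the standard library and maps the fields onto the `cf_logs` columns. The sample line is invented for illustration and the function name is ours; this is a local analogy, not how Redshift loads data internally.

```python
import csv
import gzip
import io

# Column names of the cf_logs table, in order.
COLUMNS = ["d", "t", "edge", "bytes", "cip", "verb",
           "distro", "object", "status", "referer", "agent", "qs"]

def read_cf_log(raw: bytes, ignore_header: int = 2) -> list:
    """Decompress a gzip'd, tab-delimited log and return one dict per data line."""
    text = gzip.decompress(raw).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for line_no, fields in enumerate(reader):
        if line_no < ignore_header:      # same effect as IGNOREHEADER 2
            continue
        row = dict(zip(COLUMNS, fields))
        row["bytes"] = int(row["bytes"])     # int columns in the table
        row["status"] = int(row["status"])
        rows.append(row)
    return rows

# Invented sample with two header lines and one data line.
sample = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) "
    "cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query\n"
    "2013-10-16\t09:15:02\tFRA2\t4512\t192.0.2.10\tGET\t"
    "d111111abcdef8.cloudfront.net\t/__track.gif\t200\t-\tMozilla/5.0\tutmp=%2F\n"
)
rows = read_cf_log(gzip.compress(sample.encode("utf-8")))
print(rows[0]["d"], rows[0]["verb"], rows[0]["status"])
```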

Page 69: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Amazon EC2

Amazon Elastic

MapReduce

Page 70: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 71: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 72: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON EC2 ELASTIC COMPUTE CLOUD

Page 73: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

EC2 instance families – General purpose

m1.small
Virtual core: 1
Memory: 1.7 GiB
I/O performance: Moderate

Page 74: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

EC2 instance families – Compute optimized

cc2.8xlarge
Virtual cores: 32 (2 x Intel Xeon)
Memory: 60.5 GiB
I/O performance: 10 Gbit

Page 75: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

EC2 instance families – Memory optimized

cr1.8xlarge
Virtual cores: 32 (2 x Intel Xeon)
Memory: 240 GiB
I/O performance: 10 Gbit
SSD instance store: 240 GB

Page 76: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

EC2 instance families – Storage optimized

hi1.4xlarge
Virtual cores: 16
Memory: 60.5 GiB
I/O performance: 10 Gbit
SSD instance store: 2 x 1 TB

hs1.8xlarge
Virtual cores: 16
Memory: 117 GiB
I/O performance: 10 Gbit
Instance store: 24 x 2 TB

Page 77: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

ON A SINGLE INSTANCE

COMPUTE TIME: 4h

COST: 4h x $2.1 = $8.4

Page 78: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

ON MULTIPLE INSTANCES

COMPUTE TIME: 1h

COST: 1h x 4 x $2.1 = $8.4
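The point of these two slides is that, with per-hour pricing, spreading the same work over more instances shortens the wait without raising the bill. A small sketch of the arithmetic, assuming the job parallelizes perfectly (an idealization; real jobs rarely scale exactly linearly):

```python
HOURLY_PRICE = 2.1          # USD per instance-hour, from the slide
WORK_INSTANCE_HOURS = 4     # total work: 4 hours on a single instance

def run(instances: int) -> tuple:
    """Return (wall-clock hours, total cost) assuming perfect linear scaling."""
    hours = WORK_INSTANCE_HOURS / instances
    cost = hours * instances * HOURLY_PRICE
    return hours, cost

for n in (1, 4):
    hours, cost = run(n)
    print(f"{n} instance(s): {hours:.0f} h, ${cost:.2f}")
```

Both configurations cost $8.40; only the time to result changes, from 4 hours to 1 hour.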

Page 79: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 80: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

3 HOURS FOR $4828.85/hr

Page 81: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Instead of $20+ MILLION in infrastructure

Page 82: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 83: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• A FRAMEWORK

• SPLITS DATA INTO PIECES

• LETS PROCESSING OCCUR

• GATHERS THE RESULTS
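The three steps above (split, process, gather) can be sketched in a few lines. This is a single-process word count in plain Python, meant only to show the shape of the idea; in Hadoop the pieces would be processed in parallel on different nodes. All function names are ours.

```python
from collections import Counter
from functools import reduce

def split_into_pieces(records, n_pieces):
    """Split the input into n_pieces roughly equal pieces."""
    return [records[i::n_pieces] for i in range(n_pieces)]

def map_piece(piece):
    """Process one piece independently: count the words it contains."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts

def gather(partials):
    """Combine the partial results into one final result."""
    return reduce(lambda a, b: a + b, partials, Counter())

lines = ["big data on AWS", "data analytics on big data", "AWS summit"]
pieces = split_into_pieces(lines, 2)
totals = gather(map_piece(piece) for piece in pieces)
print(totals.most_common(3))
```

Because each piece is processed without looking at the others, adding more workers only changes how the pieces are distributed, not the result.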

Page 84: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AMAZON ELASTIC MAPREDUCE

HADOOP AS A SERVICE

Page 85: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 86: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 87: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Corporate Data Center | Elastic Data Center

Page 88: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Application data and logs for analysis pushed to S3

Page 89: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Amazon Elastic MapReduce master node (M) to control analysis

Page 90: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Hadoop cluster started by Elastic MapReduce

Page 91: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Adding many hundreds or thousands of nodes

Page 92: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Disposed of when job completes

Page 93: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Results of analysis pulled back into your systems

Page 94: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 95: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Your spreadsheet does not scale …

Page 96: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 97: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

PIG

Page 98: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

A real Pig script

(used at Twitter)

Page 99: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Run on a sample dataset on your laptop

Page 100: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

$ pig -f myPigFile.q

Page 101: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Run the same script on a 50-node Hadoop cluster

Page 102: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

$ ./elastic-mapreduce --create \
    --name "$USER's Pig JobFlow" \
    --pig-script \
    --args s3://myawsbucket/mypigquery.q \
    --instance-type m1.xlarge --instance-count 50

Page 103: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

$ elastic-mapreduce -j j-21IMWIA28LRK1 \
    --add-instance-group task \
    --instance-count 10 \
    --instance-type m1.xlarge

Page 104: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

Data on Amazon EC2

Page 105: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

PUBLIC DATA SETS http://aws.amazon.com/publicdatasets

Page 106: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 107: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 108: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 109: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

AWS Data Pipeline

Page 110: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

AWS Data Pipeline

Data-intensive orchestration and automation

Reliable and scheduled

Easy to use, drag and drop

Execution and retry logic

Map data dependencies

Create and manage compute resources

Page 111: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 112: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 113: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data
Page 114: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

GENERATE STORE ANALYZE SHARE

Amazon S3,

Amazon Glacier,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

AWS Storage Gateway,

Data on Amazon EC2

AWS Import / Export

AWS Direct Connect

Amazon S3,

Amazon DynamoDB,

Amazon RDS,

Amazon Redshift,

Data on Amazon EC2

Amazon EC2

Amazon Elastic

MapReduce

AWS Data Pipeline

Page 115: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

FROM DATA TO

ACTIONABLE

INFORMATION

Page 116: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Shlomi Vaknin

Page 117: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Amazon AWS generates big data core component for Ginger Software

Shlomi Vaknin

Oct 16, 2013

Page 118: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

English writing assistant

An open platform for personal assistants

Page 119: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Page 120: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Natural language speech interface for mobile apps

• Users talk naturally with any mobile application; Ginger understands and executes their command

• An end-to-end Speech-to-Action solution

• First open platform for creating personal assistants

Page 121: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

Proofreader

Speech Engine

Rephrase

PA Platform DB

Semantic Model

Writing Assistant Personal Coach

Query Understanding

NLP/NLU Algorithms

Web Corpus Language model

Domain Corpus

User Corpus

Page 122: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• A collection of all the language we found on the internet, accessible and pre-processed

• Has to contain lots and lots of sentences

• Needs to represent “common written language”

• Accessible both for offline (research) and online (service) uses

Our platform depends on scanning and indexing all the language we can find on the internet

Page 123: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

1. Crawling [own cluster, EMR+S3]
   • Generated about 50 TB of raw data
   • Reduced to about 5 TB of text data

2. Post processing [EMR+S3]
   • Tokenize • Normalize • Split to n-grams
   • Generalize • Count • Filter

3. Indexing/Serving [EMR+S3]
   • Key/Value: has to be super fast
   • Full-text search

4. Archiving (Glacier) [S3+Glacier]
   • Keeping data available for later research while minimizing cost

Page 124: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Mainly an NLP task

• So we picked up
  • It’s a Lisp!
  • Integrates very well with EMR, S3, etc.

• n-Gram Counting
  • How are you, How are, are you, How, are, you
  • Lots of grams are repeated
  • Generalize contextually similar tokens

• Fits map-reduce paradigm very well
  • Most parts can be trivially parallelized
  • One part is sequential by grams
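The n-gram example on this slide ("How are you" expanding into six grams) can be reproduced with a short sketch. Tokenization here is a plain whitespace split, which is our simplification; the talk does not describe Ginger's actual tokenizer or normalization, and the function names are ours.

```python
from collections import Counter

def ngrams(tokens, max_n):
    """All contiguous n-grams of length 1..max_n, longest first."""
    grams = []
    for n in range(min(max_n, len(tokens)), 0, -1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

def count_ngrams(sentences, max_n=3):
    """Count every n-gram across all sentences (the 'reduce' side of the job)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), max_n))
    return counts

grams = ngrams("How are you".split(), 3)
counts = count_ngrams(["How are you", "How are they"])
print(grams)
print(counts["How are"], counts["you"])
```

Expanding each sentence is independent of every other sentence, which is the part that parallelizes trivially; summing the counts per gram is the part that has to bring equal grams together.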

Page 125: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• EMR cluster node types
  • Master, Task, Core

• Ratio between Core and Task nodes
  • We expected a very large output (100 TB)

• m2.4xlarge core output 1690 GB

• core nodes

• Estimate number of total map tasks

• Final specs:

Node Type   Instance      Count
MASTER      cc2.8xlarge   1
CORE        m2.4xlarge    200
TASK        m2.2xlarge    500
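The transcript does not say how the core-node count was derived, but the figures on this slide allow a plausible back-of-the-envelope check. The sketch assumes that the output is written to HDFS on the core nodes (task nodes do not hold HDFS data) and that HDFS keeps 3 replicas of every block; the replication factor and the function names are our assumptions.

```python
import math

CORE_NODES = 200
STORAGE_PER_CORE_GB = 1690      # m2.4xlarge figure from the slide
EXPECTED_OUTPUT_TB = 100        # expected output from the slide
HDFS_REPLICATION = 3            # assumption, not stated in the talk

def usable_hdfs_tb(nodes: int, per_node_gb: float, replication: int) -> float:
    """Usable HDFS capacity in TB once every block is stored `replication` times."""
    return nodes * per_node_gb / replication / 1000

def core_nodes_needed(output_tb: float, per_node_gb: float, replication: int) -> int:
    """Smallest number of core nodes whose disks can hold the replicated output."""
    return math.ceil(output_tb * 1000 * replication / per_node_gb)

usable = usable_hdfs_tb(CORE_NODES, STORAGE_PER_CORE_GB, HDFS_REPLICATION)
needed = core_nodes_needed(EXPECTED_OUTPUT_TB, STORAGE_PER_CORE_GB, HDFS_REPLICATION)
print(f"usable HDFS capacity: {usable:.1f} TB, minimum core nodes: {needed}")
```

Under these assumptions, 200 core nodes give a little over 112 TB of usable space, and 178 nodes would be the bare minimum for 100 TB, so the chosen count leaves some headroom.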

Page 126: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Job took about 30 hours to complete

• We generated nearly 100TB of output data

• During map phase, the cluster achieved nearly 100% utilization

• After initial filtration, 20TB remained

Page 127: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Stay up to date with AMI releases
  • Don't stick to an old AMI just because it previously worked

• Use the Job-Tracker
  • Use custom progress notification
  • Increase mapred.task.timeout

• Limit number of concurrent map tasks
  • Use the minimum number that gets you close to 100% CPU

• Beware of spot nodes
  • If you ask for too many you might compete against your own price

Page 128: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Stash the data for later use, to reduce cost

• Glacier offers very cheap storage

• Important things to know about Glacier:
  • Restoring the data could be VERY expensive
  • The key to reducing restore costs: restore SLOWLY
  • There is no built-in mechanism to restore slowly
    • 3rd party application
    • do it manually

• Glacier is very useful if your use case matches its design
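"Restore slowly" in practice means spreading retrieval requests over time instead of asking for everything at once. The sketch below only plans such a schedule: it packs archives, in order, into hourly batches that stay under a size budget. The budget value, the archive sizes, and the function name are hypothetical, and the sketch does not model Glacier's actual pricing or call any AWS API.

```python
def plan_restore_batches(archive_sizes_gb, max_gb_per_hour):
    """Pack archives, in order, into batches of at most max_gb_per_hour each.

    An archive larger than the budget gets a batch of its own.
    """
    batches, current, current_size = [], [], 0.0
    for size in archive_sizes_gb:
        if current and current_size + size > max_gb_per_hour:
            batches.append(current)          # close the current hour
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Hypothetical archive sizes (GB) and hourly budget.
plan = plan_restore_batches([40, 40, 30, 80, 10], max_gb_per_hour=100)
for hour, batch in enumerate(plan):
    print(f"hour {hour}: restore {batch} ({sum(batch)} GB)")
```

A script or third-party tool would then submit one batch per hour, which is the manual mechanism the slide alludes to.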

Page 129: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• EMR/S3 provides great power and elasticity, to grow and shrink as required

• Do your homework before running large jobs!

Page 130: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

• Our platform depends on scanning and indexing all the language we can find on the internet

• To achieve this Ginger Software makes heavy use of Amazon EMR

• With Amazon EMR, Ginger Software can scale up vast amounts of computing power and scale back down when it is not needed

• This gives Ginger Software the ability to create the world’s most accurate language enhancement technology without the need to have expensive hardware lying idle during quiet periods

Page 131: AWS Summit Tel Aviv - Startup Track - Data Analytics & Big Data

We are hiring! [email protected]

Thank You!