© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jonathan Fritz, Sr. Product Manager, Amazon EMR
Yekesa Kosuru, VP of Engineering, DataXu
Dong Jiang, Sr. Principal Software Engineer, DataXu
Saket Mengle, Sr. Principal Data Scientist, DataXu
AWS re:Invent 2016
BDM301
Best Practices for Apache Spark on Amazon EMR
What to Expect from the Session
• Spark on EMR architecture
• Spark performance and using S3
• Spark security
• Spark for ETL and data science at DataXu
• Demo of DataXu’s Spark workflow
Spark on YARN
• Run the Spark driver in client or cluster mode
• A Spark application runs as a YARN application
• SparkContext runs as a library in your program, one instance per Spark application
• Spark executors run in YARN containers on NodeManagers in your cluster
• Access the Spark UI through the ResourceManager or the Spark History Server
Spark UI
Configuring Executors – Dynamic Allocation
• Optimal resource utilization
• YARN dynamically creates and shuts down executors based on the resource needs of the Spark application
• Spark uses the executor memory and executor cores settings in the configuration for each executor
• Amazon EMR uses dynamic allocation by default, and calculates the default executor size to use based on the instance family of your Core Group
• Or, use maximizeResourceAllocation for single-tenancy
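A minimal sketch of the two configuration styles above, as an EMR `--configurations` JSON. The classification names are standard EMR classifications; the executor sizes are illustrative only, and in practice you would choose either dynamic allocation or maximizeResourceAllocation, not both:

```json
[
  {
    "Classification": "spark",
    "Properties": { "maximizeResourceAllocation": "true" }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true",
      "spark.executor.memory": "4g",
      "spark.executor.cores": "4"
    }
  }
]
```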
Options to submit jobs – on cluster
• Notebooks and IDEs: Apache Zeppelin, Jupyter, RStudio, and more! Develop fast using notebooks and IDEs – included in EMR releases or installed with Bootstrap Actions
• Connect with ODBC/JDBC to the Spark Thrift Server (start it using start-thriftserver.sh)
• Use Spark Actions in your Apache Oozie workflow to create DAGs of jobs
• Or, use spark-submit or the Spark shell
Options to submit jobs – off cluster
• Amazon EMR Step API: submit a Spark application to Amazon EMR
• AWS Data Pipeline, Airflow, Luigi, or other schedulers on EC2: create a pipeline to schedule job submission or create complex workflows
• AWS Lambda: use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster
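As a sketch of the Lambda route above: the handler below submits a Spark application through the EMR Step API using `command-runner.jar`, the standard way to run spark-submit as an EMR step. The bucket path, step name, and the assumption that the cluster ID arrives in the triggering event are all hypothetical:

```python
def build_spark_step(name, spark_args):
    """Build an EMR step definition that runs spark-submit via command-runner."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster"] + spark_args,
        },
    }

def lambda_handler(event, context):
    # boto3 is available in the Lambda runtime; in a real deployment the
    # cluster ID would come from the triggering event or an environment variable.
    import boto3
    emr = boto3.client("emr")
    step = build_spark_step("nightly-etl", ["s3://my-bucket/jobs/etl.py"])
    return emr.add_job_flow_steps(JobFlowId=event["cluster_id"], Steps=[step])
```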
Many storage layers to choose from
• Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, and Amazon S3, all accessible from Amazon EMR

Decouple compute and storage by using S3 as your data layer
• S3 is designed for 11 9's of durability and is massively scalable
• Intermediates are stored in HDFS, in memory, or on local disk on the cluster's EC2 instances
• Amazon Aurora holds the metadata for external tables on S3
S3 tips: Partitions, compression, and file formats
• Avoid key names in lexicographical order
  • Improve throughput and S3 list performance
  • Use hashing/random prefixes or reverse the date-time
• Compress data sets to minimize bandwidth from S3 to EC2
  • Make sure you use splittable compression, or have each file be the optimal size for parallelization on your cluster
• Columnar file formats like Parquet can give increased performance on reads
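The prefix advice above can be sketched in a few lines. Both key layouts below are illustrative (the `logs/` path and hash length are arbitrary choices): one salts the key with a short hash so writes spread across S3 key ranges, the other reverses the date-time so recent keys do not pile onto one ever-growing lexicographic prefix.

```python
import hashlib

def hashed_key(date_str, event_id):
    """Prefix the key with a short hash so writes spread across key ranges
    instead of landing on one lexicographic range."""
    prefix = hashlib.md5(event_id.encode()).hexdigest()[:4]
    return f"{prefix}/logs/{date_str}/{event_id}.gz"

def reversed_date_key(date_str, event_id):
    """Alternative: reverse the date-time so the hottest (most recent) keys
    do not share a common prefix."""
    return f"logs/{date_str[::-1]}/{event_id}.gz"
```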
Coming Soon: Advanced Spot provisioning
• Master node, core instance fleet, and task instance fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal AZ based on capacity/price
• Spot Block support
Using Spark at DataXu – Overview
• Who is DataXu
• Why Spark
• DataXu Processing Flows
  • Benchmarks
  • Tips & Lessons Learned
  • Demo
• DataXu Science
  • Benchmarks
  • Tips & Lessons Learned
  • Demo
DataXu
• Who
  • Spun out of MIT Labs
  • A petabyte-scale digital marketing platform
  • One of the fastest growing companies in the Inc. 5000
• What
  • Helps the world's most valuable brands understand and engage with their consumers
  • Maximizes ROI
Quick Statistics
• 2M+ bid requests per second
• Billions of impressions,
Petabytes of data
• ~10ms round trip response time
• 180+TB logs per day
• 2PB data analyzed
• 3000+ servers powering the
platform
• 13 regions, 24x7
Why Spark?
• Performance
  • Speed of execution: Spark sorted 100 TB of data (1 trillion records) 3X faster using 10X fewer machines than traditional MapReduce
  • Massively parallel
  • In-memory vs. disk IO
  • Partition-aware shuffles
• Speaks your language: Java, Scala, Python, R
• Streaming use cases: Streaming API
• Spark on EMR
  • EMR as common foundation: mature platform
  • Elasticity, security, Spot Instances, decoupled storage and compute
  • Low TCO: Spot Instances, low OPEX
DataXu Flows
• Sources: CDN, Real Time Bidding, Retargeting Platform
• Ingestion: Amazon Kinesis Streams
• Storage: all data in S3
• Processing: ETL, Attribution
• Consumers: Amazon Machine Learning, advanced analytics (third party), reporting tools (third party)
• An ecosystem of tools and services
DataXu Spark Pipeline
• Stages: ETL, Attribution, Machine Learning
• Storage and ingestion: S3 and Amazon Kinesis
• Parquet, Datasets, DataFrames
• Common backend: Spark on Amazon EMR (Spark SQL, MLlib, Streaming)
DataXu Flows – New Generation
• Sources: CDN, Real Time Bidding, Retargeting Platform
• Ingestion: Amazon Kinesis
• Processing: ETL (Spark SQL), Attribution & ML, orchestrated with Data Pipeline
• Storage: S3
• Consumers: reporting, data visualization
• An ecosystem of tools and services
ETL Pipeline
Inputs to Spark on EMR (ETL):
• Event data: impressions, activities, attributions (facts)
• Reference data (dimensions)
• Application logs
Outputs:
• Exceptions data
• Reporting data
Cost/Performance Benchmarks
ETL Workload: Spark on EMR vs MPP Database

| Cluster                | Instance Type, Count       | Execution Time (mins)    | Monthly Costs                  |
|------------------------|----------------------------|--------------------------|--------------------------------|
| Shared MPP* (COLO)     | 48 nodes (24 cores, 64 GB) | 20, 30, 54               | $20,000 (excluding SW license) |
| Single-node MPP (AWS)  | 1 i2.8xlarge               | ~40                      | $4,992 (excluding SW license)  |
| Spark on EMR (AWS)     | 7 m3.xlarge / 7 m4.xlarge  | 20, 20, 20 / 18, 18, 18  | $875 / $766                    |
| Spark on EMR (AWS)     | 3 m3.xlarge / 3 m4.xlarge  | 50, 51, 51 / 44, 45, 46  | $375 / $328                    |
Scalability
[Chart: "Scalability w/ Data Volume" – execution time in minutes (actual vs. ideal) as data volume scales from 1x to 10x]
[Chart: "Execution Time and Cost vs Number of Core Nodes" – m4.xlarge execution time in minutes and monthly cost for 2 to 6 core nodes]
Tips
Spark SQL
• BroadcastHashJoin: easy in Spark 2.0
• UDFs for readability, modularity, and testability, with no performance penalty
• Complex SQL queries may need to be materialized
• Watch out for bottlenecks, like Spark driver operations
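As a sketch of the broadcast-join tip: the table and column names below are hypothetical, and the `/*+ BROADCAST(...) */` SELECT hint requires a later Spark 2.x release (in Spark 2.0 itself, the DataFrame `broadcast()` function and the `spark.sql.autoBroadcastJoinThreshold` setting serve the same purpose).

```sql
-- Ask the optimizer to broadcast the small dimension table, turning a
-- shuffle join into a BroadcastHashJoin.
SELECT /*+ BROADCAST(d) */ f.impression_id, d.campaign_name
FROM impressions f
JOIN campaigns d ON f.campaign_id = d.campaign_id;
```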
Lessons Learned
Big Picture
Re-think infrastructure (cloud vs. hosted solution)
• Think cloud native instead of forklifting
Re-think the ETL process (Spark vs. MPP database vs. Hadoop MR)
• Spark Core and SQL are rock solid; MR as fallback
• Streamline processing
Re-think in-memory processing
• Minimize incoming data footprint
• Cut unnecessary joins
EMR is an excellent common backend
• Default settings just work
• Consistent performance
DataXu Machine Learning
• Learn: models are trained from impressions, clicks, and activities
• Calibrate and evaluate the models
• Deploy the resulting models to the Real Time Bidding platform
• Data path: Amazon Kinesis → S3 → ETL → Attribution → Machine Learning → S3
Why is this hard?
• Huge scale
  • 2 petabytes processed daily
  • 2 million bid decisions per second
  • Runs 24x7 on 5 continents
  • Thousands of ML models trained per day
• Unattended operation: model training and deployment runs automatically every day
• Changing industry: need the ability to adapt quickly to new customer requirements
Benchmark – Training Time
• Training times are linear using Spark on AWS
• Three different algorithms tested at scale: Decision Tree, Random Forest, Logistic Regression
• Linear scalability is important
For training/evaluating we used a 6-node r3.2xlarge cluster: 150 GB memory, 20 cores.
[Chart: "Training Time Comparison" – training time (in sec) vs. number of training records (millions) for Logistic Regression, Decision Tree, and Random Forest]
Benchmarks – Bidding Time
[Chart: average latency in milliseconds – current DataXu model: 0.8; Spark Random Forest: 1.7]
[Chart: model size in memory (KB) – Random Forest, Logistic Regression, Naive Bayes, Decision Trees, DataXu model]
• Out-of-the-box Spark performance is impressive
  • Latency
  • Model size
• DataXu models are faster and smaller, but they were optimized over years
Takeaways
Big Picture
One common big data platform
Spark on Amazon EMR works well out of the box
Very little code to train and evaluate models
Automated & unattended ML at scale
Thank you!
ykosuru@dataxu.com
djiang@dataxu.com
smengle@dataxu.com
We are hiring!
www.dataxu.com/careers
jonfritz@amazon.com
aws.amazon.com/emr/
aws.amazon.com/blogs/big-data/