Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Scaling Your Analytics with Amazon Elastic MapReduce
Peter Sirota, General Manager - Amazon Elastic MapReduce
November 14, 2013
Agenda
• Amazon EMR: Hadoop in the cloud
• Hadoop ecosystem on Amazon EMR
• Customer use cases
Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Challenges with Hadoop
On premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
On Amazon EC2
• Difficult to integrate with AWS storage services
• Independently manage and monitor clusters
Amazon EMR is the easiest way to run Hadoop in the cloud
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features and ecosystem support
Why Amazon EMR?
[Diagram, built up across several slides: input data in S3, DynamoDB, or Redshift; your code submitted to the Elastic MapReduce name node; an elastic cluster backed by S3/HDFS; queries and BI run via JDBC, Pig, or Hive; output written back to S3, DynamoDB, or Redshift.]
Elastic clusters Customize size and type to reduce costs
Choose your instance types
Try out different configurations to find your optimal architecture
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge
Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need
Resizable clusters
Easy to add and remove compute capacity on your cluster
[Diagram, built up across several slides: compute demands matched with cluster sizing — peak capacity added and removed so a 10-hour workload completes in 6 hours.]
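The resizing idea can be made concrete with back-of-the-envelope arithmetic. All prices, node counts, and hours below are assumed for illustration; they are not figures from the talk.

```python
# Illustrative cost comparison (all numbers hypothetical): a cluster sized
# for peak and held fixed, vs. one resized to match demand.
rate = 0.24  # $/instance-hour, assumed on-demand price

# Fixed cluster: 20 nodes held for the full 10-hour workload.
fixed_cost = 20 * 10 * rate

# Resized cluster: 20 nodes for 4 peak hours, then shrunk to 5 nodes
# for the remaining 6 hours of lighter processing.
resized_cost = (20 * 4 + 5 * 6) * rate

print(fixed_cost, resized_cost)  # 48.0 vs. 26.4 under these assumptions
```

Because you only pay for instance-hours actually running, matching cluster size to demand cuts the bill even when the total wall-clock time is unchanged.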
Use Spot and Reserved Instances
Minimize costs by supplementing on-demand pricing
Easy to use Spot Instances
Name-your-price supercomputing to minimize costs
• Spot for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
24/7 clusters on Reserved Instances
Minimize cost for consistent capacity
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
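The split the slide describes — stable on-demand capacity for the HDFS-bearing core nodes, interruptible Spot capacity for task nodes — can be sketched as an EMR instance-group specification. The group names, instance types, counts, and bid price below are hypothetical, not taken from the talk; the dict shape follows the EMR API's instance-group configuration.

```python
# A minimal sketch (assumed names and prices) of an EMR instance-group
# layout mixing on-demand core nodes with Spot task nodes.

def build_instance_groups(core_count, task_count, spot_bid):
    """Return an EMR-style instance-group spec as plain dicts."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m1.large", "InstanceCount": 1,
         "Market": "ON_DEMAND"},
        # HDFS lives on core nodes, so keep them on stable capacity.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "m1.large", "InstanceCount": core_count,
         "Market": "ON_DEMAND"},
        # Task nodes hold no HDFS data; losing a Spot node only costs
        # re-running its tasks, so they can ride the cheaper Spot market.
        {"Name": "Task", "InstanceRole": "TASK",
         "InstanceType": "m1.large", "InstanceCount": task_count,
         "Market": "SPOT", "BidPrice": "0.08" if spot_bid is None else spot_bid},
    ]

groups = build_instance_groups(core_count=4, task_count=8, spot_bid="0.08")
```

The asymmetry is the point: interruption of a task node is cheap to recover from, while losing a core node forces HDFS re-replication.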
Your data, your choice
Easy to integrate Amazon EMR with your data stores
Using Amazon S3 and HDFS
[Diagram: data sources feed a transient EMR cluster for batch map/reduce jobs and daily reports, and a long running EMR cluster holding data in HDFS for Hive interactive queries; data aggregated and stored in Amazon S3 serves weekly reports and ad-hoc queries.]
Use Amazon EMR with Amazon Redshift and Amazon S3
[Diagram: daily data from your data sources is aggregated in Amazon S3; an Amazon EMR cluster processes the data; processed data is loaded into the Amazon Redshift data warehouse.]
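The last step of this pipeline — bulk-loading EMR output from S3 into Redshift — is typically a single COPY statement. A hypothetical sketch follows (table name, bucket path, and credential placeholders are all illustrative, not from the talk); the SQL is carried as a Python string since the deck itself shows no code.

```python
# Hypothetical Redshift COPY for the final pipeline step: Amazon EMR has
# written tab-delimited, gzipped output to S3, and Redshift loads it in
# parallel straight from the bucket. Placeholders are illustrative only.
copy_sql = (
    "COPY daily_metrics "
    "FROM 's3://example-bucket/emr-output/2013-11-14/' "
    "CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<secret>' "
    "DELIMITER '\\t' GZIP;"
)
print(copy_sql)
```

COPY pulls directly from S3 across the Redshift slices, which is why staging EMR output in S3 (rather than streaming rows through a client) is the natural hand-off point.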
Use the Hadoop ecosystem on Amazon EMR
Leverage a diverse set of tools to get the most out of your data
• Databases
• Machine learning
• Metadata stores
• Exchange formats
• Diverse query languages
• Hadoop 2.x — and much more...
Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3
• Data warehouse for Hadoop
• Integration with Amazon S3 for better performance reading and writing to Amazon S3
• SQL-like query language to make iterative queries easier
• Easy to scale in HDFS on a persistent Amazon EMR cluster
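The S3 integration above usually takes the form of a Hive EXTERNAL table whose LOCATION points at a bucket, so transient clusters can query the data without first loading it into HDFS. A minimal HiveQL sketch (table, columns, and bucket name are hypothetical), carried as a Python string since the deck shows no code:

```python
# Hypothetical HiveQL for the pattern described on the slide: an external
# table over S3 data. Dropping the table leaves the S3 objects untouched,
# which is what makes it safe to use from short-lived clusters.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-bucket/logs/page_views/';
"""
print(hiveql)
```

Because the table is EXTERNAL, any number of clusters (transient batch jobs, a long running interactive cluster) can share the same S3-resident data without copies.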
Use HBase on a persistent Amazon EMR cluster as a column-oriented scalable data store
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Modify your HBase tables when adding new data to your system
Use ad-hoc queries on your cluster to drive insights in real-time
Spark / Shark
• In-memory MapReduce for faster queries
• Use HiveQL to interact with your data
Impala (coming soon!)
• Parallel database engine for Hadoop
• Use SQL to query data in HDFS on your cluster in real-time
“Hadoop-as-a-Service [Amazon EMR] offers a better price-performance ratio [than bare-metal Hadoop].”
1. Elastic clusters and cost optimization
2. Rapid, tuned provisioning
3. Agility for experimentation
4. Easy integration with diverse datastores
Diverse set of partners to build on Amazon EMR
• BI / visualization and business intelligence
• Hadoop distribution, data transfer, encryption, and data transformation
• Monitoring, performance tuning, and graphical IDEs
• ETL tools
Available on AWS Marketplace and as a distribution in Amazon Elastic MapReduce
Thousands of customers
How Netflix scales Big Data Platform on Amazon EMR
Eva Tse, Director of Big Data Platform, Netflix
November 14, 2013
Hadoop ecosystem as our Data Analytics platform in the cloud
How did we get here?
How do we scale?
Separate compute and storage layers
Amazon S3 as our DW
[Diagram: S3, S3mper-enabled, as the source of truth.]
Multiple clusters
[Diagram, built up across several slides: S3 as the source of truth feeding clusters in zones x, y, and z — an SLA cluster, an ad hoc cluster, and bonus clusters.]
Unified and global big data collection pipeline
[Diagram: cloud apps, Suro, Ursula (events pipeline), Aegisthus (dimension pipeline), S3 as the source of truth, and the SLA, bonus, and ad hoc clusters.]
Innovate – services and tools
• CLIs
• Gateways
• Sting
Putting it into perspective…
• Billions of viewing hours of data
• ~3,000-node clusters
• A hundred billion events / day
• A few petabytes of DW on Amazon S3
• Thousands of jobs / day
[Diagram: use cases spanning ad hoc querying, simple reporting, ETL, analytics and statistical modeling, and Open Connect.]
What works for us?
• Scalability
• Hadoop integration on Amazon EC2 / AWS
• Lets us focus on innovation and build a solution
• Tight engagement with the Amazon EMR & Amazon EC2 teams for tactical issues and strategic roadmap
Next steps…
• Heterogeneous node clusters
• Auto expand/shrink
• Richer monitoring infrastructure
We strive to build the best-in-class big data platform in the cloud
Big Data at Channel 4 Amazon Elastic MapReduce for Competitive Advantage
Bob Harris – Channel 4 Television
14th November 2013
Channel 4 – Background
• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.
• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.
• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters, and increasingly with emerging, Internet-based providers.
• Our content is available across our portfolio of around 10 core and time-shift channels, and our on demand service 4oD is accessible across multiple devices and platforms.
Why Big Data at C4
Business Intelligence at C4
• Well established Business Intelligence capability
• Based on industry standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability
Big Data Technology at C4
• 2011 – Embarked on Big Data initiative
  – Ran in-house and cloud-based PoCs
  – Selected Amazon EMR
• 2012 – Ran Amazon EMR in parallel with conventional BI
  – Hive deployed to Data Analysts
  – Amazon EMR workflows deployed to production
• 2013 – Amazon EMR confirmed as primary Big Data platform
  – Amazon EMR usage growing, focus on automation
  – Experimenting with Mahout for Machine Learning
What problems are we solving?
• Single view of the viewer, recognising them across devices and serving relevant content
• Personalising the viewer experience
How are we doing this?
• Principal tasks…
  – Audience segmentation
  – Personalisation
  – Recommendations
• What data do we process…
  – Website clickstream logs
  – 4oD activity and viewing history
  – Over 9m registered users
  – Majority of activity now from “logged-in” users
High-Level Architecture
• Amazon EMR and existing BI technology are complementary
• Process billions of data rows in Amazon EMR, store millions of result rows in RDBMS
• No need to “rip and replace”; existing technology investment is protected
• Amazon EMR will continue to underpin major growth in data volumes and processing complexity
Where Next?
• Continued growth in usage of Amazon EMR
• Migrate to Hadoop 2.x
• Adopt Amazon Redshift
• Improved integration between C4 and AWS
• Shift toward “near real-time” processing
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
BDT301