Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Scaling Your Analytics with Amazon Elastic MapReduce
Peter Sirota, General Manager - Amazon Elastic MapReduce
November 14, 2013
Agenda
• Amazon EMR: Hadoop in the cloud
• Hadoop ecosystem on Amazon EMR
• Customer use cases
Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Challenges with Hadoop
On premises
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions
On Amazon EC2
• Difficult to integrate with AWS storage services
• Independently manage and monitor clusters
Amazon EMR is the easiest way to run Hadoop in the cloud
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features and ecosystem support
Why Amazon EMR?
[Diagram, built up across several slides: input data in S3, DynamoDB, or Redshift; your code submitted to the Elastic MapReduce name node; an elastic cluster backed by S3/HDFS; queries and BI run via JDBC, Pig, or Hive; output written back to S3, DynamoDB, or Redshift.]
Elastic clusters Customize size and type to reduce costs
Choose your instance types
Try out different configurations to find your optimal architecture
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge
Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need
Resizable clusters
Easy to add and remove compute capacity on your cluster
[Diagram, built up across several slides: compute demands matched with cluster sizing — peak capacity added and removed so a 10-hour workload completes in 6 hours.]
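The resizing idea can be made concrete with back-of-the-envelope arithmetic. All prices, node counts, and hours below are assumed for illustration; they are not figures from the talk.

```python
# Illustrative cost comparison (all numbers hypothetical): a cluster sized
# for peak and held fixed, vs. one resized to match demand.
rate = 0.24  # $/instance-hour, assumed on-demand price

# Fixed cluster: 20 nodes held for the full 10-hour workload.
fixed_cost = 20 * 10 * rate

# Resized cluster: 20 nodes for 4 peak hours, then shrunk to 5 nodes
# for the remaining 6 hours of lighter processing.
resized_cost = (20 * 4 + 5 * 6) * rate

print(fixed_cost, resized_cost)  # 48.0 vs. 26.4 under these assumptions
```

Because you only pay for instance-hours actually running, matching cluster size to demand cuts the bill even when the total wall-clock time is unchanged.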
Use Spot and Reserved Instances
Minimize costs by supplementing on-demand pricing
Easy to use Spot Instances
Name-your-price supercomputing to minimize costs
• Spot for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
24/7 clusters on Reserved Instances
Minimize cost for consistent capacity
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
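The split the slide describes — stable on-demand capacity for the HDFS-bearing core nodes, interruptible Spot capacity for task nodes — can be sketched as an EMR instance-group specification. The group names, instance types, counts, and bid price below are hypothetical, not taken from the talk; the dict shape follows the EMR API's instance-group configuration.

```python
# A minimal sketch (assumed names and prices) of an EMR instance-group
# layout mixing on-demand core nodes with Spot task nodes.

def build_instance_groups(core_count, task_count, spot_bid):
    """Return an EMR-style instance-group spec as plain dicts."""
    return [
        {"Name": "Master", "InstanceRole": "MASTER",
         "InstanceType": "m1.large", "InstanceCount": 1,
         "Market": "ON_DEMAND"},
        # HDFS lives on core nodes, so keep them on stable capacity.
        {"Name": "Core", "InstanceRole": "CORE",
         "InstanceType": "m1.large", "InstanceCount": core_count,
         "Market": "ON_DEMAND"},
        # Task nodes hold no HDFS data; losing a Spot node only costs
        # re-running its tasks, so they can ride the cheaper Spot market.
        {"Name": "Task", "InstanceRole": "TASK",
         "InstanceType": "m1.large", "InstanceCount": task_count,
         "Market": "SPOT", "BidPrice": "0.08" if spot_bid is None else spot_bid},
    ]

groups = build_instance_groups(core_count=4, task_count=8, spot_bid="0.08")
```

The asymmetry is the point: interruption of a task node is cheap to recover from, while losing a core node forces HDFS re-replication.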
Your data, your choice
Easy to integrate Amazon EMR with your data stores
Using Amazon S3 and HDFS
[Diagram: data sources feed a transient EMR cluster for batch map/reduce jobs and daily reports, and a long running EMR cluster holding data in HDFS for Hive interactive queries; data aggregated and stored in Amazon S3 serves weekly reports and ad-hoc queries.]
Use Amazon EMR with Amazon Redshift and Amazon S3
[Diagram: daily data from your data sources is aggregated in Amazon S3; an Amazon EMR cluster processes the data; processed data is loaded into the Amazon Redshift data warehouse.]
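The last step of this pipeline — bulk-loading EMR output from S3 into Redshift — is typically a single COPY statement. A hypothetical sketch follows (table name, bucket path, and credential placeholders are all illustrative, not from the talk); the SQL is carried as a Python string since the deck itself shows no code.

```python
# Hypothetical Redshift COPY for the final pipeline step: Amazon EMR has
# written tab-delimited, gzipped output to S3, and Redshift loads it in
# parallel straight from the bucket. Placeholders are illustrative only.
copy_sql = (
    "COPY daily_metrics "
    "FROM 's3://example-bucket/emr-output/2013-11-14/' "
    "CREDENTIALS 'aws_access_key_id=<id>;aws_secret_access_key=<secret>' "
    "DELIMITER '\\t' GZIP;"
)
print(copy_sql)
```

COPY pulls directly from S3 across the Redshift slices, which is why staging EMR output in S3 (rather than streaming rows through a client) is the natural hand-off point.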
Use the Hadoop ecosystem on Amazon EMR
Leverage a diverse set of tools to get the most out of your data
• Databases
• Machine learning
• Metadata stores
• Exchange formats
• Diverse query languages
• Hadoop 2.x — and much more...
Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3
• Data warehouse for Hadoop
• Integration with Amazon S3 for better performance reading and writing to Amazon S3
• SQL-like query language to make iterative queries easier
• Easy to scale in HDFS on a persistent Amazon EMR cluster
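The S3 integration above usually takes the form of a Hive EXTERNAL table whose LOCATION points at a bucket, so transient clusters can query the data without first loading it into HDFS. A minimal HiveQL sketch (table, columns, and bucket name are hypothetical), carried as a Python string since the deck shows no code:

```python
# Hypothetical HiveQL for the pattern described on the slide: an external
# table over S3 data. Dropping the table leaves the S3 objects untouched,
# which is what makes it safe to use from short-lived clusters.
hiveql = """
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-bucket/logs/page_views/';
"""
print(hiveql)
```

Because the table is EXTERNAL, any number of clusters (transient batch jobs, a long running interactive cluster) can share the same S3-resident data without copies.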
Use HBase on a persistent Amazon EMR cluster as a column-oriented scalable data store
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Modify your HBase tables when adding new data to your system
Use ad-hoc queries on your cluster to drive insights in real-time
Spark / Shark
• In-memory MapReduce for faster queries
• Use HiveQL to interact with your data
Impala (coming soon!)
• Parallel database engine for Hadoop
• Use SQL to query data in HDFS on your cluster in real-time
“Hadoop-as-a-Service [Amazon EMR] offers a better price-performance ratio [than bare-metal Hadoop].”
1. Elastic clusters and cost optimization
2. Rapid, tuned provisioning
3. Agility for experimentation
4. Easy integration with diverse datastores
Diverse set of partners to build on Amazon EMR
• BI / visualization and business intelligence
• Hadoop distribution, data transfer, encryption, and data transformation
• Monitoring, performance tuning, and graphical IDEs
• ETL tools
Available on AWS Marketplace and as a distribution in Amazon Elastic MapReduce
Thousands of customers
How Netflix scales Big Data Platform on Amazon EMR
Eva Tse, Director of Big Data Platform, Netflix
November 14, 2013
Hadoop ecosystem as our Data Analytics platform in the cloud
How did we get here?
How do we scale?
Separate compute and storage layers
Amazon S3 as our DW
[Diagram: S3, S3mper-enabled, as the source of truth.]
Multiple clusters
[Diagram, built up across several slides: S3 as the source of truth feeding clusters in zones x, y, and z — an SLA cluster, an ad hoc cluster, and bonus clusters.]
Unified and global big data collection pipeline
[Diagram: cloud apps, Suro, Ursula (events pipeline), Aegisthus (dimension pipeline), S3 as the source of truth, and the SLA, bonus, and ad hoc clusters.]
Innovate – services and tools
• CLIs
• Gateways
• Sting
Putting it into perspective…
• Billions of viewing hours of data
• ~3,000-node clusters
• A hundred billion events / day
• A few petabytes of DW on Amazon S3
• Thousands of jobs / day
[Diagram: use cases spanning ad hoc querying, simple reporting, ETL, analytics and statistical modeling, and Open Connect.]
What works for us?
• Scalability
• Hadoop integration on Amazon EC2 / AWS
• Lets us focus on innovation and build a solution
• Tight engagement with the Amazon EMR & Amazon EC2 teams for tactical issues and strategic roadmap
Next steps…
• Heterogeneous node clusters
• Auto expand/shrink
• Richer monitoring infrastructure
We strive to build the best-in-class big data platform in the cloud
Big Data at Channel 4 Amazon Elastic MapReduce for Competitive Advantage
Bob Harris – Channel 4 Television
14th November 2013
Channel 4 – Background
• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.
• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.
• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters, and increasingly with emerging, Internet-based providers.
• Our content is available across our portfolio of around 10 core and time-shift channels, and our on demand service 4oD is accessible across multiple devices and platforms.
Why Big Data at C4
Business Intelligence at C4
• Well established Business Intelligence capability
• Based on industry standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability
Big Data Technology at C4
• 2011 – Embarked on Big Data initiative
  – Ran in-house and cloud-based PoCs
  – Selected Amazon EMR
• 2012 – Ran Amazon EMR in parallel with conventional BI
  – Hive deployed to Data Analysts
  – Amazon EMR workflows deployed to production
• 2013 – Amazon EMR confirmed as primary Big Data platform
  – Amazon EMR usage growing, focus on automation
  – Experimenting with Mahout for Machine Learning
What problems are we solving?
• Single view of the viewer, recognising them across devices and serving relevant content
• Personalising the viewer experience
How are we doing this?
• Principal tasks…
  – Audience segmentation
  – Personalisation
  – Recommendations
• What data do we process…
  – Website clickstream logs
  – 4oD activity and viewing history
  – Over 9m registered users
  – Majority of activity now from “logged-in” users
High-Level Architecture
• Amazon EMR and existing BI technology are complementary
• Process billions of data rows in Amazon EMR, store millions of result rows in RDBMS
• No need to “rip and replace”; existing technology investment is protected
• Amazon EMR will continue to underpin major growth in data volumes and processing complexity
Where Next?
• Continued growth in usage of Amazon EMR
• Migrate to Hadoop 2.x
• Adopt Amazon Redshift
• Improved integration between C4 and AWS
• Shift toward “near real-time” processing
Please give us your feedback on this presentation
As a thank you, we will select prize winners daily for completed surveys!
BDT301