Amazon Elastic MapReduce (EMR): Hadoop as a Service

Post on 12-Aug-2015


Amazon Elastic MapReduce

Ville Seppänen | Jari Voutilainen | @Vilsepi @Zharktas @GoforeOy

Agenda

1. Introduction to Hadoop Streaming and Elastic MapReduce
2. Simple EMR web interface demo
3. Introduction to our dataset
4. Using EMR from command line with boto

All presentation material is available at https://github.com/gofore/aws-emr

Hadoop Streaming

Utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input my/Input/Directories \
    -output my/Output/Directory \
    -mapper myMapperProgram.py \
    -reducer myReducerProgram.py

cat input_data.txt | mapper.py | sort | reducer.py > output_data.txt

(The sort between the mapper and the reducer mimics Hadoop's shuffle phase, which delivers each reducer its input sorted by key.)

Elastic MapReduce (EMR)

Amazon Elastic MapReduce

Hadoop-based MapReduce cluster as a service

Can run either Amazon-optimized Hadoop or MapR

Managed from a web UI or through API

Hadoop streaming in EMR

Quick EMR demo

The endlessly fascinating example of counting words in Hadoop

Cluster creation steps

Cluster: name, logging

Tags: keywords for the cluster

Software: Hadoop distribution and version, pre-installed applications (Hive, Pig,...)

File System: encryption, consistency

Hardware: number and type of instances

Security and Access: ssh keys, node access roles

Bootstrap Actions: scripts to customize the cluster

Steps: a queue of mapreduce jobs for the cluster

(mapper) WordSplitter.py

#!/usr/bin/python
import sys
import re

pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
for line in sys.stdin:
    for word in pattern.findall(line):
        print "LongValueSum:" + word.lower() + "\t" + "1"

LongValueSum:i 1
LongValueSum:count 1
LongValueSum:words 1
LongValueSum:with 1
LongValueSum:hadoop 1
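Later in the deck the job passes reducer="aggregate", i.e. Hadoop's built-in aggregate library sums the LongValueSum-prefixed keys. A hand-written stand-in for that summing step (our own sketch, not the actual aggregate implementation) could look like:

```python
#!/usr/bin/env python
import sys
from itertools import groupby

def parse(lines):
    # Split the "key\tvalue" pairs produced by the mapper.
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, int(value)

def reduce_counts(pairs):
    # The streaming framework sorts by key before the reduce phase,
    # so identical keys arrive consecutively and groupby can sum them.
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        word = key.replace("LongValueSum:", "", 1)
        results.append((word, sum(v for _, v in group)))
    return results

if __name__ == "__main__":
    for word, total in reduce_counts(parse(sys.stdin)):
        print("%s\t%d" % (word, total))
```

Like the mapper, this is just an executable reading stdin and writing stdout, which is all Hadoop Streaming requires.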

Filesystems

EMRFS is an implementation of the HDFS interface that reads and writes files directly to and from S3.

HDFS should be used to cache results of intermediate steps.

The legacy s3 filesystem is block-based just like HDFS. S3n is file-based, and its objects can be accessed with other tools, but file size is limited to 5 GB.

S3 is not a file system; it is a RESTish object storage.

S3 has eventual consistency: files written to S3 might not be immediately available for reading.
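Because of that eventual consistency, a job that expects to read back a recent write may need to retry. A minimal generic sketch (the helper name and backoff policy are ours, not an AWS API):

```python
import time

def read_with_retry(fetch, attempts=5, delay=1.0):
    """Retry a read that may transiently return None while S3
    propagates a recent write. `fetch` is any zero-argument callable;
    with boto 2 it could be e.g. lambda: bucket.get_key("some/key"),
    which returns None for keys that are not (yet) visible."""
    for attempt in range(attempts):
        obj = fetch()
        if obj is not None:
            return obj
        time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise RuntimeError("object not visible after %d attempts" % attempts)
```

EMRFS's consistent-view feature automates this kind of bookkeeping, which is what the next bullet refers to.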

EMRFS can be configured to encrypt files in S3 and to monitor the consistency of files, which can detect operations that try to use inconsistent files.

http://wiki.apache.org/hadoop/AmazonS3

Our dataset

Digitraffic is a service offering real-time information and data about traffic, weather and road conditions on the Finnish main roads.

The service is provided by the Finnish Transport Agency (Liikennevirasto), and produced by Gofore and Infotripla.

Traffic fluency

Our data consists of traffic fluency information, i.e. how quickly vehicles have been identified to pass through a road segment (a link).

Data is gathered with camera-based Automatic License Plate Recognition (ALPR), and more recently with mobile-device-based Floating Car Data (FCD).

<link>
  <linkno>310102</linkno>
  <startsite>1108</startsite>
  <endsite>1107</endsite>
  <name language="en">Hallila -> Kaukajärvi</name>
  <name language="fi">Hallila -> Kaukajärvi</name>
  <name language="sv">Hallila -> Kaukajärvi</name>
  <distance>
    <value>3875.000</value>
    <unit>m</unit>
  </distance>
</link>

Static link information (271 kB XML): 642 one-way links, 243 sites

<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
    <link id="410102" data_source="1">
      <recognition offset_seconds="6" travel_time="126"></recognition>
      <recognition offset_seconds="45" travel_time="152"></recognition>
    </link>
    <link id="810502" data_source="1">
      <recognition offset_seconds="25" travel_time="66"></recognition>
      <recognition offset_seconds="34" travel_time="79"></recognition>
      <recognition offset_seconds="35" travel_time="67"></recognition>
      <recognition offset_seconds="53" travel_time="58"></recognition>
    </link>
  </recognitions>
</ivjtdata>

Each file contains the finished passthroughs for each road segment during one minute.
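Such a per-minute file can be unpacked with the standard library's ElementTree. A sketch (function name ours) that pulls out (link id, travel time) pairs:

```python
import xml.etree.ElementTree as ET

def parse_recognitions(xml_text):
    """Yield (link_id, travel_time_seconds) for every finished
    passthrough in one per-minute ivjtdata document."""
    root = ET.fromstring(xml_text)
    for link in root.find("recognitions"):
        link_id = link.get("id")
        for rec in link.findall("recognition"):
            yield link_id, int(rec.get("travel_time"))
```

This is the kind of per-file work a streaming mapper would do when fed the raw XML.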

Some numbers

6.5 years worth of data, from January 2008 to June 2014

3.9 million XML files (525,600 files per year)

6.3 GB of compressed archives (with 7.5 GB of additional median data as CSV)

42 GB of data as XML (and 13 GB as CSV)

Potential research questions

1. Do people drive faster during the night?

2. Does winter have fewer recognitions (either due to fewer cars or snowy plates)?

3. How well does the number of recognitions correlate with speed (rush hour probably slows travel, but are speeds higher during days with less traffic)?

4. Is it possible to identify speed limits from the travel times? How much dispersion is there in speeds?

5. When do speed limits change (winter and summer limits)?

Munging

The small files problem

Unpacked the tar.gz archives and uploaded the XML files as such to S3 (using the AWS CLI tools).

Turns out 4 million 11 kB small files with Hadoop is not fun. Hadoop does not cope well with files significantly smaller than the HDFS block size (default 64 MB).

And well, XML is not fun either, so...

JSONify all the things!

Wrote scripts to parse, munge and upload the data.

Concatenated data into bigger files, calculated some extra data, and converted it into JSON. Size was reduced to 60% of the original XML.

First munged 1-day files (10-20 MB each) and later 1-month files (180-540 MB each).

Munging 6.5 years worth of XML takes 8.5 hours on a single t2.medium instance.
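The per-link aggregation behind that conversion could be sketched as follows, assuming the compact field names of the JSON format shown below ("id", "tt", "cars", "itts"). Whether the original scripts truncated or rounded the mean travel time is our assumption; integer division reproduces the sample record.

```python
import json

def munge_link(link_id, travel_times):
    """Collapse one link's recognitions from a time window into the
    compact JSON record: individual travel times (itts), the car
    count, and the mean travel time in whole seconds (integer
    division, an assumption that matches the sample data)."""
    return {
        "id": link_id,
        "tt": sum(travel_times) // len(travel_times),
        "cars": len(travel_times),
        "itts": list(travel_times),
    }
```

Applied to every link in every per-minute file and concatenated, this yields the line-oriented JSON that the later mapreduce steps consume.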

Static link information (120 kB JSON)

{
  "sites": [
    {
      "id": "1205",
      "name": "Viinikka",
      "lat": 61.488282,
      "lon": 23.779057,
      "rno": "3495",
      "tro": "3495/1-2930"
    }
  ],
  "links": [
    {
      "id": "99001041",
      "name": "Hallila -> Viinikka",
      "dist": 5003.0,
      "site_start": "1108",
      "site_end": "1205"
    }
  ]
}

{
  "date": "2014-06-01T02:52:00Z",
  "recognitions": [
    {
      "id": "4510201",
      "tt": 117,
      "cars": 4,
      "itts": [100, 139, 121, 110]
    }
  ]
}
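A streaming mapper over these JSON documents, in the spirit of the 05-car-speed-for-time-of-day step run later (our sketch, not the actual script; the pairing of link ids with distances here is illustrative), could key speeds by hour of day:

```python
import json

def map_doc(line, link_distances):
    """Map one munged JSON document to (hour-of-day, speed km/h)
    pairs. link_distances maps link id -> length in metres, loaded
    from the static link information JSON."""
    doc = json.loads(line)
    hour = int(doc["date"][11:13])  # "2014-06-01T02:52:00Z" -> 2
    pairs = []
    for rec in doc["recognitions"]:
        dist = link_distances.get(rec["id"])
        if dist and rec["tt"] > 0:
            pairs.append((hour, dist / rec["tt"] * 3.6))  # m/s -> km/h
    return pairs
```

Emitting these pairs as tab-separated lines on stdout is all that is needed to plug this into Hadoop Streaming.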

Programming EMR

Alternatives for the web interface

AWS SDKs, like boto for Python

Command line tools

Connect to EMR

#!/usr/bin/env python

import boto.emr
from boto.emr.instance_group import InstanceGroup

# Requires that AWS API credentials have been exported as env variables
connection = boto.emr.connect_to_region('eu-west-1')

Specify EC2 instances

instance_groups = []
instance_groups.append(InstanceGroup(
    role="MASTER", name="Main node",
    type="m1.medium", num_instances=1,
    market="ON_DEMAND"))
instance_groups.append(InstanceGroup(
    role="CORE", name="Worker nodes",
    type="m1.medium", num_instances=3,
    market="ON_DEMAND"))
instance_groups.append(InstanceGroup(
    role="TASK", name="Optional spot-price nodes",
    type="m1.medium", num_instances=20,
    market="SPOT", bidprice=0.012))

Start EMR cluster

cluster_id = connection.run_jobflow(
    "Our awesome cluster",
    instance_groups=instance_groups,
    action_on_failure='CANCEL_AND_WAIT',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://our-s3-bucket/logs/",
    ami_version="3.3.1",
    bootstrap_actions=[],
    ec2_keyname="name-of-our-ssh-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

Add job step to cluster

steps = []
steps.append(boto.emr.step.StreamingStep(
    "Our awesome streaming app",
    input="s3://our-s3-bucket/our-input-data",
    output="s3://our-s3-bucket/our-output-path/",
    mapper="our-mapper.py",
    reducer="aggregate",
    cache_files=[
        "s3://our-s3-bucket/programs/our-mapper.py#our-mapper.py",
        "s3://our-s3-bucket/data/our-dictionary.json#our-dictionary.json"
    ],
    action_on_failure='CANCEL_AND_WAIT',
    jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar'))
connection.add_jobflow_steps(cluster_id, steps)

Recap

#!/usr/bin/env python

import boto.emr
from boto.emr.instance_group import InstanceGroup

connection = boto.emr.connect_to_region('eu-west-1')
cluster_id = connection.run_jobflow(**cluster_parameters)
connection.add_jobflow_steps(cluster_id, steps)
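With keep_alive=True the cluster stays up between steps, so before queuing work it can be useful to poll its state via boto 2's describe_jobflow; the wait helper itself is our sketch, not part of boto:

```python
import time

def wait_until_ready(connection, cluster_id, poll_seconds=30):
    """Poll the jobflow until it leaves its startup states; with
    keep_alive=True an idle cluster settles into WAITING. `connection`
    is a boto.emr connection as created above."""
    while True:
        state = connection.describe_jobflow(cluster_id).state
        if state not in ("STARTING", "BOOTSTRAPPING"):
            return state
        time.sleep(poll_seconds)
```

The same polling loop can watch a step run to completion by checking for RUNNING versus WAITING or TERMINATED states.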

Step 1 of 2: Run mapreduce

# Create new cluster
aws-tools/run-jobs.py create-cluster "Car speed counting cluster"

Starting cluster j-F0K0A4Q9F5O0 Car speed counting cluster

# Add job step to the cluster
aws-tools/run-jobs.py run-step j-F0K0A4Q9F5O0 05-car-speed-for-time-of-day_map.py digitraffic/munged/links-by-month/2014

Step will output data to s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/

Step 2 of 2: Analyze results

# Download and concatenate output
aws s3 cp s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/ /tmp/emr --recursive --profile hadoop-seminar-emr
cat /tmp/emr/part-* > /tmp/emr/output

# Analyze results
result-analysis/05_speed_during_day/05-car-speed-for-time-of-day_output.py /tmp/emr/output example-data/locationdata.json

Some statistics

We experimented with different input files and cluster sizes.

Execution time was about half an hour with a small cluster and 30 small 15-20 MB files.

The same input parsed with a simple Python script took about 5 minutes.

A larger cluster and 6 larger 500 MB files took 17 minutes.

"Too small a problem for EMR/Hadoop"

Summary

Takeaways

Make sure your problem is big enough for Hadoop

Munging wisely makes streaming programs easier and faster

Always use Spot instances with EMR