Amazon Elastic MapReduce (EMR): Hadoop as a Service

Amazon Elastic MapReduce Ville Seppänen | Jari Voutilainen | @Vilsepi @Zharktas @GoforeOy

Transcript of Amazon Elastic MapReduce (EMR): Hadoop as a Service

Page 1: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Amazon Elastic MapReduce

Ville Seppänen | Jari Voutilainen | @Vilsepi @Zharktas @GoforeOy

Page 2: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Agenda

1. Introduction to Hadoop Streaming and Elastic MapReduce

2. Simple EMR web interface demo

3. Introduction to our dataset

4. Using EMR from command line with boto

All presentation material is available at https://github.com/gofore/aws-emr

Page 3: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Hadoop Streaming

Utility that allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input my/Input/Directories \
    -output my/Output/Directory \
    -mapper myMapperProgram.py \
    -reducer myReducerProgram.py

cat input_data.txt | mapper.py | reducer.py > output_data.txt
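The shell pipeline above can also be mimicked inside one Python process. In a real job Hadoop sorts the mapper output by key before the reducer sees it, so this sketch includes that sort step explicitly. The function names are illustrative only and are not part of Hadoop or of the presentation material.

```python
from itertools import groupby


def mapper(lines):
    """Emit one (word, 1) pair per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts of consecutive pairs that share a key."""
    for key, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def run_local(lines):
    """Map, sort by key (Hadoop does this between the phases), then reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["Hadoop streams text", "text in text out"]))
```

The reducer relies on equal keys arriving next to each other, which is why the sort cannot be left out.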

Page 4: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Elastic MapReduce (EMR)

Page 5: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Amazon Elastic MapReduce

Hadoop-based MapReduce cluster as a service

Can run either Amazon-optimized Hadoop or MapR

Managed from a web UI or through API

Page 6: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Hadoop streaming in EMR

Page 7: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Quick EMR demo

Page 8: Amazon Elastic MapReduce (EMR): Hadoop as a Service

The endlessly fascinating example of counting words in Hadoop

Page 9: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Cluster creation steps

Cluster: name, logging

Tags: keywords for the cluster

Software: Hadoop distribution and version, pre-installed applications (Hive, Pig,...)

File System: encryption, consistency

Hardware: number and type of instances

Security and Access: ssh keys, node access roles

Bootstrap Actions: scripts to customize the cluster

Steps: a queue of mapreduce jobs for the cluster

Page 10: Amazon Elastic MapReduce (EMR): Hadoop as a Service

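(mapper) WordSplitter.py

#!/usr/bin/python
import sys
import re

# A word starts with a letter and continues with letters or digits
pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
for line in sys.stdin:
    for word in pattern.findall(line):
        # The LongValueSum: prefix asks the aggregate reducer to sum the values per key
        print("LongValueSum:" + word.lower() + "\t" + "1")

LongValueSum:i	1
LongValueSum:count	1
LongValueSum:words	1
LongValueSum:with	1
LongValueSum:hadoop	1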
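The `LongValueSum:` prefix is aimed at Hadoop's built-in `aggregate` reducer, which adds up the values emitted for each key. The sketch below is a plain-Python approximation of that summing for local testing, not Hadoop's implementation; the function name is made up for illustration.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def long_value_sum(lines):
    """Sum the values of all 'LongValueSum:<key><TAB><value>' lines per key."""
    totals = defaultdict(int)
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


if __name__ == "__main__":
    mapper_output = [
        "LongValueSum:hadoop\t1",
        "LongValueSum:words\t1",
        "LongValueSum:hadoop\t1",
    ]
    for word, count in sorted(long_value_sum(mapper_output).items()):
        print(word + "\t" + str(count))
```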

Page 11: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Filesystems

EMRFS is an implementation of HDFS, with reading and writing of files directly to S3.

HDFS should be used to cache results of intermediate steps.

S3 is block-based just like HDFS. S3n is file-based, so its files can be accessed with other tools, but file size is limited to 5 GB.

Page 12: Amazon Elastic MapReduce (EMR): Hadoop as a Service

S3 is not a file system, it is a RESTish object storage.

S3 has eventual consistency: files written to S3 might not be immediately available for reading.

EMRFS can be configured to encrypt files in S3 and to monitor the consistency of files, which can detect events that try to use inconsistent files.

http://wiki.apache.org/hadoop/AmazonS3

Page 13: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Our dataset

Page 14: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Digitraffic is a service offering real-time information and data about the traffic, weather and condition information on the Finnish main roads.

The service is provided by the Finnish Transport Agency (Liikennevirasto), and produced by Gofore and Infotripla.

Page 15: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Traffic fluency

Our data consists of traffic fluency information, i.e. how quickly vehicles have been identified to pass through a road segment (a link).

Data is gathered with camera-based Automatic License Plate Recognition (ALPR), and more recently with mobile-device-based Floating Car Data (FCD).

Page 17: Amazon Elastic MapReduce (EMR): Hadoop as a Service

<link>
  <linkno>310102</linkno>
  <startsite>1108</startsite>
  <endsite>1107</endsite>
  <name language="en">Hallila -> Kaukajärvi</name>
  <name language="fi">Hallila -> Kaukajärvi</name>
  <name language="sv">Hallila -> Kaukajärvi</name>
  <distance>
    <value>3875.000</value>
    <unit>m</unit>
  </distance>
</link>

Static link information (271 kB XML)

642 one-way links, 243 sites

Page 18: Amazon Elastic MapReduce (EMR): Hadoop as a Service

<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
    <link id="410102" data_source="1">
      <recognition offset_seconds="6" travel_time="126"></recognition>
      <recognition offset_seconds="45" travel_time="152"></recognition>
    </link>
    <link id="810502" data_source="1">
      <recognition offset_seconds="25" travel_time="66"></recognition>
      <recognition offset_seconds="34" travel_time="79"></recognition>
      <recognition offset_seconds="35" travel_time="67"></recognition>
      <recognition offset_seconds="53" travel_time="58"></recognition>
    </link>
  </recognitions>
</ivjtdata>

Each file contains finished passthroughs for each road segment during one minute.
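A minimal sketch of reading one such per-minute file with Python's standard library. The element and attribute names come from the sample above; the function name is hypothetical, and the unit of `travel_time` (presumably seconds) is not stated on the slide.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ivjtdata duration="60" periodstart="2014-06-24T02:55:00Z">
  <recognitions>
    <link id="110302" data_source="1">
      <recognition offset_seconds="8" travel_time="152"></recognition>
      <recognition offset_seconds="36" travel_time="155"></recognition>
    </link>
    <link id="810502" data_source="1">
      <recognition offset_seconds="25" travel_time="66"></recognition>
    </link>
  </recognitions>
</ivjtdata>"""


def parse_minute(xml_text):
    """Return (period start, {link id: [travel times]}) for one file."""
    root = ET.fromstring(xml_text)
    times = {}
    for link in root.iter("link"):
        times[link.get("id")] = [
            int(rec.get("travel_time")) for rec in link.iter("recognition")
        ]
    return root.get("periodstart"), times


if __name__ == "__main__":
    print(parse_minute(SAMPLE))
```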

Page 19: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Some numbers

6.5 years' worth of data from January 2008 to June 2014

3.9 million XML files (525600 files per year)

6.3 GB of compressed archives (with 7.5 GB of additional median data as CSV)

42 GB of data as XML (and 13 GB as CSV)

Page 20: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Potential research questions

1. Do people drive faster during the night?

2. Does winter time have fewer recognitions (either due to fewer cars or snowy plates)?

3. How well does the number of recognitions correlate with speed (rush hour probably slows travel, but are speeds higher during days with less traffic)?

4. Is it possible to identify speed limits from the travel times? How much dispersion is there in speeds?

5. When do speed limits change (winter and summer limits)?

Page 21: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Munging

Page 22: Amazon Elastic MapReduce (EMR): Hadoop as a Service

The small files problem

Unpacked the tar.gz archives and uploaded the XML files as such to S3 (using AWS CLI tools).

Turns out that small files (4 million files of about 11 kB each) are not fun with Hadoop. Hadoop does not cope well with files significantly smaller than the HDFS block size (default 64 MB). [1] [2] [3]

And well, XML is not fun either, so...
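A rough back-of-the-envelope calculation shows the scale of the mismatch. The file count, file size and block size are the figures from this slide; the one-split-per-file figure assumes the default file-based input formats, which create at least one input split per non-empty file.

```python
import math

FILE_COUNT = 4_000_000             # "4 million" files, from the slide
FILE_SIZE_BYTES = 11 * 1000        # about 11 kB each, from the slide
BLOCK_SIZE_BYTES = 64 * 1024 ** 2  # 64 MB default HDFS block size, from the slide

total_bytes = FILE_COUNT * FILE_SIZE_BYTES
# Assumption: one input split (and so one map task) per small file.
splits_small_files = FILE_COUNT
# The same data packed into files that fill whole blocks.
splits_packed = math.ceil(total_bytes / BLOCK_SIZE_BYTES)

print("total data: %.0f GB" % (total_bytes / 1e9))
print("splits, one per small file: %d" % splits_small_files)
print("splits, data packed into full blocks: %d" % splits_packed)
```

Under these assumptions the same data needs millions of splits as small files but only a few hundred when packed into block-sized files.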

Page 23: Amazon Elastic MapReduce (EMR): Hadoop as a Service

JSONify all the things!

Wrote scripts to parse, munge and upload data

Concatenated data into bigger files, calculated some extra data, and converted it into JSON. Size reduced to 60% of the original XML.

First munged 1-day files (10-20 MB each) and later 1-month files (180-540 MB each)

Munging 6.5 years' worth of XML takes 8.5 hours on a single t2.medium instance

Page 24: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Static link information (120kb json)

{
  "sites": [
    {
      "id": "1205",
      "name": "Viinikka",
      "lat": 61.488282,
      "lon": 23.779057,
      "rno": "3495",
      "tro": "3495/1-2930"
    }
  ],
  "links": [
    {
      "id": "99001041",
      "name": "Hallila -> Viinikka",
      "dist": 5003.0,
      "site_start": "1108",
      "site_end": "1205"
    }
  ]
}

Page 25: Amazon Elastic MapReduce (EMR): Hadoop as a Service

{
  "date": "2014-06-01T02:52:00Z",
  "recognitions": [
    {
      "id": "4510201",
      "tt": 117,
      "cars": 4,
      "itts": [ 100, 139, 121, 110 ]
    }
  ]
}
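A sketch of how one per-minute record of this shape could be assembled from per-link travel times. The field names come from the record above; the meaning of the fields is an assumption. In particular, "tt" is computed here as the truncated mean of "itts", which happens to reproduce 117 for the example values but may not be how the original munging scripts calculate it. The function name is hypothetical.

```python
import json


def build_record(period_start, travel_times_by_link):
    """Build one per-minute record shaped like the munged JSON on the slide."""
    recognitions = []
    for link_id, times in sorted(travel_times_by_link.items()):
        if not times:
            continue  # nothing was recognized on this link
        recognitions.append({
            "id": link_id,
            # Assumption: "tt" is the truncated mean of the individual times.
            "tt": sum(times) // len(times),
            # Assumption: "cars" is the number of individual travel times.
            "cars": len(times),
            "itts": list(times),
        })
    return {"date": period_start, "recognitions": recognitions}


if __name__ == "__main__":
    record = build_record("2014-06-01T02:52:00Z", {"4510201": [100, 139, 121, 110]})
    # Writing one record per line keeps concatenated files easy to split.
    print(json.dumps(record))
```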

Page 26: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Programming EMR

Page 27: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Alternatives for the web interface

AWS Command line tools

SDKs like boto for Python

Page 28: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Connect to EMR

#!/usr/bin/env python

import boto.emr
from boto.emr.instance_group import InstanceGroup

# Requires that AWS API credentials have been exported as env variables
connection = boto.emr.connect_to_region('eu-west-1')

Page 29: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Specify EC2 instances

instance_groups = []
instance_groups.append(InstanceGroup(
    role="MASTER", name="Main node",
    type="m1.medium", num_instances=1,
    market="ON_DEMAND"))
instance_groups.append(InstanceGroup(
    role="CORE", name="Worker nodes",
    type="m1.medium", num_instances=3,
    market="ON_DEMAND"))
instance_groups.append(InstanceGroup(
    role="TASK", name="Optional spot-price nodes",
    type="m1.medium", num_instances=20,
    market="SPOT", bidprice=0.012))

Page 30: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Start EMR cluster

cluster_id = connection.run_jobflow(
    "Our awesome cluster",
    instance_groups=instance_groups,
    action_on_failure='CANCEL_AND_WAIT',
    keep_alive=True,
    enable_debugging=True,
    log_uri="s3://our-s3-bucket/logs/",
    ami_version="3.3.1",
    bootstrap_actions=[],
    ec2_keyname="name-of-our-ssh-key",
    visible_to_all_users=True,
    job_flow_role="EMR_EC2_DefaultRole",
    service_role="EMR_DefaultRole")

Page 31: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Add job step to cluster

steps = []
steps.append(boto.emr.step.StreamingStep(
    "Our awesome streaming app",
    input="s3://our-s3-bucket/our-input-data",
    output="s3://our-s3-bucket/our-output-path/",
    mapper="our-mapper.py",
    reducer="aggregate",
    # The part after '#' is the file name the step sees on each node
    cache_files=[
        "s3://our-s3-bucket/programs/our-mapper.py#our-mapper.py",
        "s3://our-s3-bucket/data/our-dictionary.json#our-dictionary.json",
    ],
    action_on_failure='CANCEL_AND_WAIT',
    jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar'))
connection.add_jobflow_steps(cluster_id, steps)

Page 32: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Recap

#!/usr/bin/env python

import boto.emr
from boto.emr.instance_group import InstanceGroup

connection = boto.emr.connect_to_region('eu-west-1')
cluster_id = connection.run_jobflow(**cluster_parameters)
connection.add_jobflow_steps(cluster_id, **steps_parameters)

Page 33: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Step 1 of 2: Run mapreduce

# Create new cluster
aws-tools/run-jobs.py create-cluster "Car speed counting cluster"

Starting cluster j-F0K0A4Q9F5O0 Car speed counting cluster

# Add job step to the cluster
aws-tools/run-jobs.py run-step j-F0K0A4Q9F5O0 05-car-speed-for-time-of-day_map.py digitraffic/munged/links-by-month/2014

Step will output data to s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/
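The source of the mapper named in the step, 05-car-speed-for-time-of-day_map.py, is not shown on the slides. The sketch below is a hypothetical stand-in, not the authors' script: it assumes munged records shaped like the JSON shown earlier, a dictionary of link lengths in metres, and the `aggregate` reducer. It emits total metres and total travel time per hour of day, so that an average speed can be derived afterwards as metres divided by seconds. The link length used in the demo is invented.

```python
from datetime import datetime


def map_record(record, link_lengths_m):
    """Yield aggregate-reducer lines: metres and seconds per hour of day."""
    # The timestamps end in "Z", so the hour is in UTC, not Finnish local time.
    hour = datetime.strptime(record["date"], "%Y-%m-%dT%H:%M:%SZ").hour
    for rec in record["recognitions"]:
        if rec["id"] not in link_lengths_m:
            continue  # no static information for this link
        # Every recognized car travelled the whole link once.
        metres = int(round(link_lengths_m[rec["id"]] * len(rec["itts"])))
        # Assumption: "itts" holds individual travel times in seconds.
        seconds = sum(rec["itts"])
        yield "LongValueSum:%02d_metres\t%d" % (hour, metres)
        yield "LongValueSum:%02d_seconds\t%d" % (hour, seconds)


if __name__ == "__main__":
    demo_record = {
        "date": "2014-06-01T02:52:00Z",
        "recognitions": [
            {"id": "4510201", "tt": 117, "cars": 4, "itts": [100, 139, 121, 110]}
        ],
    }
    # Hypothetical link length, for illustration only.
    for line in map_record(demo_record, {"4510201": 5003.0}):
        print(line)
```

Summing distance and time separately lets a sum-only reducer do the heavy lifting; the division into a speed happens in the analysis step.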

Page 34: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Step 2 of 2: Analyze results

# Download and concatenate output
aws s3 cp s3://hadoop-seminar-emr/digitraffic/outputs/2015-02-18_11-08-24_05-car-speed-for-time-of-day_map.py/ /tmp/emr --recursive --profile hadoop-seminar-emr

cat /tmp/emr/part-* > /tmp/emr/output

# Analyze results
result-analysis/05_speed_during_day/05-car-speed-for-time-of-day_output.py /tmp/emr/output example-data/locationdata.json

Page 35: Amazon Elastic MapReduce (EMR): Hadoop as a Service
Page 36: Amazon Elastic MapReduce (EMR): Hadoop as a Service
Page 37: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Some statistics

We experimented with different input files and cluster sizes

Execution time was about half an hour with a small cluster and 30 small 15-20 MB files

The same input parsed with a simple Python script took about 5 minutes

A larger cluster and 6 larger 500 MB files took 17 minutes.

"Too small problem for EMR/Hadoop"

Page 38: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Summary

Page 39: Amazon Elastic MapReduce (EMR): Hadoop as a Service

Takeaways

Make sure your problem is big enough for Hadoop

Munging wisely makes streaming programs easier and faster

Always use Spot instances with EMR