Practice and Applications of Data Management
CMPSCI 345
Lecture 19-20: Amazon Web Services
Extra credit: project part 3
- Open-ended additional features.
- Presentations on Dec 7.
- Need to sign up by Nov 30!
This week
- No class on Wednesday (enjoy Thanksgiving!)
- Office hours on Tuesday 2-3pm.
Map-Reduce Summary
- Hides scheduling and parallelization details
- However, very limited queries
  - Difficult to write more complex tasks
  - Need multiple map-reduce operations
- Solution: use MapReduce as a runtime for higher-level languages, as sketched below
  - Pig (Yahoo!, now an Apache project): SQL-like operators
  - Hive (Apache project): SQL
  - Scope (Microsoft): SQL! But proprietary…
  - DryadLINQ (Microsoft): LINQ! But also proprietary…
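To make the gap concrete, here is a minimal word-count sketch in Pig Latin (the input file name, field names, and aliases are made up for illustration); the equivalent hand-written MapReduce program needs separate mapper, reducer, and driver classes:

-- A minimal word-count sketch; 'input.txt' and the aliases are assumptions.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words) AS cnt;
STORE counts INTO 'wordcount-out';

TOKENIZE and COUNT are Pig built-ins; the runtime compiles these five lines into MapReduce jobs for you.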
Homework assignment
- Amazon Web Services — you need to sign up!
- Practice large-scale unstructured data processing on Hadoop
- This (and next) week:
  - Overview of AWS in class
  - Guiding you through the first steps of the assignment
Amazon Web Services (AWS)
- A cloud computing platform
Why cloud computing?
What will we learn?
A sample of a search log:

BED75271605EBD0C   970916201045   yahoo chat
824F413FA37520BF   970916184818   garter belts
824F413FA37520BF   970916184939   lingerie
824F413FA37520BF   970916185051   spiderman
824F413FA37520BF   970916185155   tommy hilfiger
824F413FA37520BF   970916185257   calgary
824F413FA37520BF   970916185513   calgary
824F413FA37520BF   970916185605   exhibitionists
...

We will analyze search logs like this one.
What is Pig?
- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
- An Apache open source project: http://hadoop.apache.org/pig/
Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25.

The dataflow:

- Load Users; Load Pages
- Filter Users by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
170 lines of code, 4 hours to write.
In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write.
But how good is it?
Essence of Pig

- Map-Reduce is too low a level to program; SQL is too high
- Pig Latin is a language intended to sit between the two:
  - Imperative
  - Provides standard relational transforms (join, sort, etc.)
  - Schemas are optional: used when available, and can be defined at runtime
- User Defined Functions (UDFs) are first-class citizens (see the sketch below)
- Opportunities for an advanced optimizer, but optimizations by the programmer are also possible
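As a quick illustration of UDFs as first-class citizens, a script can register user code and call it inline like a built-in; a minimal sketch, where myudfs.jar and the UPPER function are hypothetical:

-- myudfs.jar and myudfs.UPPER are hypothetical user code, not built-ins.
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);

The same mechanism is what tutorial.jar uses in the assignment scripts below.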
Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
The resulting dataflow: load users → filter nulls, then the plan splits into two branches: group by age, gender → apply UDFs → store into 'bydemo', and group by state → apply UDFs → store into 'bystate'. Note that the load and filter steps are shared by both branches.
What are people doing with Pig

- At Yahoo!, ~70% of Hadoop jobs are Pig jobs
- Being used at Twitter, LinkedIn, and other companies
- Available as part of the Amazon EMR web service and the Cloudera Hadoop distribution
- What users use Pig for:
  - Search infrastructure
  - Ad relevance
  - Model training
  - User intent analysis
  - Web log processing
  - Image processing
  - Incremental processing of large datasets
What will we learn?
We will analyze search logs — but for practice, we first analyze small search logs: a small sample of the log on your own machine, and then the full log on AWS.
AWS assignment
The assignment document contains:

- Information on Pig, Hadoop, and AWS
- Help with getting set up
- The actual assignment
Running Hadoop on your machines
Setting up – Part A:

1. Extract hw3.zip
2. Extract pigtmp.zip
3. Extract hadoop-0.18.3.zip
Setting up
Make sure hadoop is executable:

$ chmod u+x ~/hw3/hadoop-0.18.3/bin/hadoop
Setting up
Set environment variables:

$ export PIGDIR=~/hw3/pigtmp
$ export HADOOP=~/hw3/hadoop-0.18.3
$ export HADOOPSITEPATH=~/hw3/hadoop-0.18.3/conf/
$ export PATH=$HADOOP/bin/:$PATH

In Windows:

$ set PIGDIR=~/hw3/pigtmp
…etc…
Setting up
The variable JAVA_HOME should be set to point to your system's Java directory (system dependent).

In OS X:

$ export JAVA_HOME=$(/usr/libexec/java_home)

In Windows, it should point to your JDK folder (you should have that from project part 2).
The data: search query logs
Excite: an old search engine (something like Google).
The data
- Take a peek inside excite-small.log
BED75271605EBD0C   970916201045   yahoo chat
824F413FA37520BF   970916184818   garter belts
824F413FA37520BF   970916184939   lingerie
824F413FA37520BF   970916185051   spiderman
824F413FA37520BF   970916185155   tommy hilfiger
824F413FA37520BF   970916185257   calgary
824F413FA37520BF   970916185513   calgary
824F413FA37520BF   970916185605   exhibitionists

The three tab-separated columns are: user ID, time (format YYMMDDHHMMSS), and query.
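If you want types attached while loading, the same file can be loaded with an explicit schema; a minimal sketch (the chararray types are an assumption — the tutorial script loads these fields untyped):

-- Sketch: loading with an explicit schema; the types are assumptions.
raw = LOAD 'excite-small.log' USING PigStorage('\t')
      AS (user:chararray, time:chararray, query:chararray);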
script1-local.pig
- Objective: find query phrases that occur with high frequency during certain times of day
- Open script1-local.pig
script1-local.pig
-- Register the jar to access the UDFs
REGISTER ./tutorial.jar;

-- Load the raw data
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

-- Remove records where the query is empty or a URL
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

-- Change the query to lowercase
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;

...
script1-local.pig
...
-- Extract the hour from the timestamp
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;

-- Generate n-grams from the query string
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;

-- Get unique n-grams
ngramed2 = DISTINCT ngramed1;

-- Group by n-gram and hour
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
...
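To see what the GROUP step produces, here is a sketch of the output shape (the tuples shown are hypothetical):

-- Each tuple of hour_frequency1 has two fields:
--   group:    the (ngram, hour) key, e.g. (calgary, 18)
--   ngramed2: the bag of ngramed2 tuples sharing that key,
--             e.g. {(824F413FA37520BF, 18, calgary), ...}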
script1-local.pig
...
-- Count the occurrences of each (n-gram, hour) pair
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;

-- Group by n-gram alone, across all hours
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;

-- Use a UDF to compute a popularity score for the n-gram
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));

-- Assign names to the fields
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
...
script1-local.pig
...
-- Keep only scores higher than 2.0
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;

-- Sort the records by hour and score
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;

-- Store the results
STORE ordered_uniq_frequency INTO 'script1-local-results.txt' USING PigStorage();
Execute your Pig script
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig

$ ls -l script1-local-results.txt

$ cat script1-local-results.txt
Explore what happens
Start grunt (Pig's interactive shell):

$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local
grunt>

Copy and paste commands from the script, and explore the created tables with the commands DESCRIBE and DUMP.
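For example, a short grunt session might look like the sketch below (output omitted; the LIMIT alias is made up):

grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grunt> DESCRIBE raw;
grunt> few = LIMIT raw 5;
grunt> DUMP few;

DESCRIBE prints the schema of an alias, and DUMP materializes it to the screen.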
Sign in to the AWS management console

- https://console.aws.amazon.com
Check your S3 storage
- https://console.aws.amazon.com
Go to Elastic MapReduce
- https://console.aws.amazon.com
Starting the job
- The job may take a few minutes to start.
Cluster list
The cluster list shows the elapsed time.
The console shows: a control that terminates the job, the DNS name of the Master node, and SSH instructions.
Connecting to the Master
Find your Master's DNS name in the console, then:

$ ssh -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
Use the name of the master, and the path to your EC2 keypair.
On the Master
Create a directory on the HDFS system:

% hadoop dfs -mkdir /user/hadoop
Edit script1-hadoop.pig
...
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
...
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING PigStorage();
...

Change the location of the data to the one in your S3 bucket (s3n://<name_of_your_bucket>/excite.log.bz2), and change the location of the output (/user/hadoop/script1-hadoop-results).
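After the edits, the two lines should look like this (the bucket name is a placeholder for your own):

raw = LOAD 's3n://<name_of_your_bucket>/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
...
STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING PigStorage();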
Upload files to the Master
$ scp -i </path/to/saved/keypair/file.pem> script1-hadoop.pig hadoop@<master.public-dns-name.amazonaws.com>:~/.

$ scp -i </path/to/saved/keypair/file.pem> tutorial.jar hadoop@<master.public-dns-name.amazonaws.com>:~/.

Again, use the name of the master, and the path to your EC2 keypair.
On the Master
Execute the script:

% pig -l . script1-hadoop.pig
The console gives instructions to enable monitoring connections.
Monitoring job flows
In a new terminal window:

$ ssh -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>

Use the name of the master, and the path to your EC2 keypair. This starts a proxy listening on port 8157.
Enable FoxyProxy in your browser
Monitoring jobs
Access the monitoring URLs.
Load the jobtracker
http://<master.public-dns-name.amazonaws.com>:9100/
Retrieving results
On the Master:

% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results

On your machine:

$ scp -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:~/script1-hadoop-results/ .
Terminate all jobs when you are done! If you leave jobs running, costs will rack up. You are responsible for your usage.
Relational DB on AWS
- https://console.aws.amazon.com
Pick a name and a description.
Connect to the cloud database
psql --host=<your_RDS_instance> --port=5432 --username=<username> --password --dbname=cloud_db

- Use the DB instance address from your console.
- Type the command on a single line.
- Type the username you chose.
Import data to RDS
psql -f initialize.sql --host=<your_RDS_instance> --port=5432 --username=<username> --password --dbname=cloud_db

The initialize.sql file is in your phpExample code.
Update your configuration file
Enter the proper values in config.php.
Start a local HTTP server. E.g., with PHP 5.4:

php -S localhost:8000
Remember to delete your instance when you no longer need it.