Practice and Applications of Data Management
CMPSCI 345
Lecture 19-20: Amazon Web Services
Extra credit: project part 3
- Open-ended additional features.
- Presentations on Dec 7.
- Need to sign up by Nov 30!
This week
- No class on Wednesday (enjoy Thanksgiving!)
- Office hours on Tuesday 2-3pm.
Map-Reduce Summary
- Hides scheduling and parallelization details
- However, very limited queries
  - Difficult to write more complex tasks
  - Need multiple map-reduce operations
- Solution: use MapReduce as a runtime for higher-level languages, as sketched below
  - Pig (Yahoo!, now an Apache project): SQL-like operators
  - Hive (Apache project): SQL
  - Scope (Microsoft): SQL! But proprietary…
  - DryadLINQ (Microsoft): LINQ! But also proprietary…
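To make the gap concrete, here is a minimal word-count sketch in Pig Latin (the input file name, field names, and aliases are made up for illustration); the equivalent hand-written MapReduce program needs separate mapper, reducer, and driver classes:

-- A minimal word-count sketch; 'input.txt' and the aliases are assumptions.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words) AS cnt;
STORE counts INTO 'wordcount-out';

TOKENIZE and COUNT are Pig built-ins; the runtime compiles these five lines into MapReduce jobs for you.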
Homework assignment
- Amazon Web Services — you need to sign up!
- Practice large-scale unstructured data processing on Hadoop
- This (and next) week:
  - Overview of AWS in class
  - Guiding you through the first steps of the assignment
Amazon Web Services (AWS)
- A cloud computing platform
Why cloud computing?
What will we learn?
A sample of a search log:

BED75271605EBD0C   970916201045   yahoo chat
824F413FA37520BF   970916184818   garter belts
824F413FA37520BF   970916184939   lingerie
824F413FA37520BF   970916185051   spiderman
824F413FA37520BF   970916185155   tommy hilfiger
824F413FA37520BF   970916185257   calgary
824F413FA37520BF   970916185513   calgary
824F413FA37520BF   970916185605   exhibitionists
...

We will analyze search logs like this one.
What is Pig?
- An engine for executing programs on top of Hadoop
- It provides a language, Pig Latin, to specify these programs
- An Apache open source project: http://hadoop.apache.org/pig/
Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18-25.

The dataflow:

- Load Users; Load Pages
- Filter Users by age
- Join on name
- Group on url
- Count clicks
- Order by clicks
- Take top 5
In MapReduce

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and store it
            // accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1') first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
        implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable)val, (Text)key);
        }
    }

    public static class LimitClicks extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp, new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu, new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join, new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join, new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group, new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group, new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100, new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100, new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
170 lines of code, 4 hours to write.
In Pig Latin

Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

9 lines of code, 15 minutes to write.
But how good is it?
Essence of Pig

- Map-Reduce is too low a level to program; SQL is too high
- Pig Latin is a language intended to sit between the two:
  - Imperative
  - Provides standard relational transforms (join, sort, etc.)
  - Schemas are optional: used when available, and can be defined at runtime
- User Defined Functions (UDFs) are first-class citizens (see the sketch below)
- Opportunities for an advanced optimizer, but optimizations by the programmer are also possible
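As a quick illustration of UDFs as first-class citizens, a script can register user code and call it inline like a built-in; a minimal sketch, where myudfs.jar and the UPPER function are hypothetical:

-- myudfs.jar and myudfs.UPPER are hypothetical user code, not built-ins.
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name:chararray);
B = FOREACH A GENERATE myudfs.UPPER(name);

The same mechanism is what tutorial.jar uses in the assignment scripts below.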
Multi-store script

A = load 'users' as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into 'bydemo';
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into 'bystate';
The resulting dataflow: load users → filter nulls, then the plan splits into two branches: group by age, gender → apply UDFs → store into 'bydemo', and group by state → apply UDFs → store into 'bystate'. Note that the load and filter steps are shared by both branches.
What are people doing with Pig

- At Yahoo!, ~70% of Hadoop jobs are Pig jobs
- Being used at Twitter, LinkedIn, and other companies
- Available as part of the Amazon EMR web service and the Cloudera Hadoop distribution
- What users use Pig for:
  - Search infrastructure
  - Ad relevance
  - Model training
  - User intent analysis
  - Web log processing
  - Image processing
  - Incremental processing of large datasets
What will we learn?
We will analyze search logs — but for practice, we first analyze small search logs: a small sample of the log on your own machine, and then the full log on AWS.
AWS assignment
The assignment document contains:

- Information on Pig, Hadoop, and AWS
- Help with getting set up
- The actual assignment
Running Hadoop on your machines
Setting up – Part A:

1. Extract hw3.zip
2. Extract pigtmp.zip
3. Extract hadoop-0.18.3.zip
Setting up
Make sure hadoop is executable:

$ chmod u+x ~/hw3/hadoop-0.18.3/bin/hadoop
Setting up
Set environment variables:

$ export PIGDIR=~/hw3/pigtmp
$ export HADOOP=~/hw3/hadoop-0.18.3
$ export HADOOPSITEPATH=~/hw3/hadoop-0.18.3/conf/
$ export PATH=$HADOOP/bin/:$PATH

In Windows:

$ set PIGDIR=~/hw3/pigtmp
…etc…
Setting up
The variable JAVA_HOME should be set to point to your system's Java directory (system dependent).

In OS X:

$ export JAVA_HOME=$(/usr/libexec/java_home)

In Windows, it should point to your JDK folder (you should have that from project part 2).
The data: search query logs
Excite: an old search engine (something like Google).
The data
- Take a peek inside excite-small.log
BED75271605EBD0C   970916201045   yahoo chat
824F413FA37520BF   970916184818   garter belts
824F413FA37520BF   970916184939   lingerie
824F413FA37520BF   970916185051   spiderman
824F413FA37520BF   970916185155   tommy hilfiger
824F413FA37520BF   970916185257   calgary
824F413FA37520BF   970916185513   calgary
824F413FA37520BF   970916185605   exhibitionists

The three tab-separated columns are: user ID, time (format YYMMDDHHMMSS), and query.
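If you want types attached while loading, the same file can be loaded with an explicit schema; a minimal sketch (the chararray types are an assumption — the tutorial script loads these fields untyped):

-- Sketch: loading with an explicit schema; the types are assumptions.
raw = LOAD 'excite-small.log' USING PigStorage('\t')
      AS (user:chararray, time:chararray, query:chararray);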
script1-local.pig
- Objective: find query phrases that occur with high frequency during certain times of day
- Open script1-local.pig
script1-local.pig
-- Register the jar to access the UDFs
REGISTER ./tutorial.jar;

-- Load the raw data
raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

-- Remove records where the query is empty or a URL
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

-- Change the query to lowercase
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;

...
script1-local.pig
...
-- Extract the hour from the timestamp
houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, query;

-- Generate n-grams from the query string
ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;

-- Get unique n-grams
ngramed2 = DISTINCT ngramed1;

-- Group by n-gram and hour
hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
...
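To see what the GROUP step produces, here is a sketch of the output shape (the tuples shown are hypothetical):

-- Each tuple of hour_frequency1 has two fields:
--   group:    the (ngram, hour) key, e.g. (calgary, 18)
--   ngramed2: the bag of ngramed2 tuples sharing that key,
--             e.g. {(824F413FA37520BF, 18, calgary), ...}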
script1-local.pig
...
-- Count the occurrences of each (n-gram, hour) pair
hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;

-- Group by n-gram alone, across all hours
uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;

-- Use a UDF to compute a popularity score for the n-gram
uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));

-- Assign names to the fields
uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
...
script1-local.pig
...
-- Keep only scores higher than 2.0
filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;

-- Sort the records by hour and score
ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;

-- Store the results
STORE ordered_uniq_frequency INTO 'script1-local-results.txt' USING PigStorage();
Execute your Pig script
$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local script1-local.pig

$ ls -l script1-local-results.txt

$ cat script1-local-results.txt
Explore what happens
Start grunt (Pig's interactive shell):

$ java -cp $PIGDIR/pig.jar org.apache.pig.Main -x local
grunt>

Copy and paste commands from the script, and explore the created tables with the commands DESCRIBE and DUMP.
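For example, a short grunt session might look like the sketch below (output omitted; the LIMIT alias is made up):

grunt> raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
grunt> DESCRIBE raw;
grunt> few = LIMIT raw 5;
grunt> DUMP few;

DESCRIBE prints the schema of an alias, and DUMP materializes it to the screen.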
Sign in to the AWS management console

- https://console.aws.amazon.com
Check your S3 storage
- https://console.aws.amazon.com
Go to Elastic MapReduce
- https://console.aws.amazon.com
Starting the job
- The job may take a few minutes to start.
Cluster list
The cluster list shows the elapsed time.
The console shows: a control that terminates the job, the DNS name of the Master node, and SSH instructions.
Connecting to the Master
Find your Master's DNS name in the console, then:

$ ssh -i </path/to/saved/keypair/file.pem> hadoop@<master.public-dns-name.amazonaws.com>
Use the name of the master, and the path to your EC2 keypair.
On the Master
Create a directory on the HDFS system:

% hadoop dfs -mkdir /user/hadoop
Edit script1-hadoop.pig
...
raw = LOAD 'excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
...
STORE ordered_uniq_frequency INTO 'script1-hadoop-results' USING PigStorage();
...

Change the location of the data to the one in your S3 bucket (s3n://<name_of_your_bucket>/excite.log.bz2), and change the location of the output (/user/hadoop/script1-hadoop-results).
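After the edits, the two lines should look like this (the bucket name is a placeholder for your own):

raw = LOAD 's3n://<name_of_your_bucket>/excite.log.bz2' USING PigStorage('\t') AS (user, time, query);
...
STORE ordered_uniq_frequency INTO '/user/hadoop/script1-hadoop-results' USING PigStorage();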
Upload files to the Master
$ scp -i </path/to/saved/keypair/file.pem> script1-hadoop.pig hadoop@<master.public-dns-name.amazonaws.com>:~/.

$ scp -i </path/to/saved/keypair/file.pem> tutorial.jar hadoop@<master.public-dns-name.amazonaws.com>:~/.

Again, use the name of the master, and the path to your EC2 keypair.
On the Master
Execute the script:

% pig -l . script1-hadoop.pig
The console gives instructions to enable monitoring connections.
Monitoring job flows
In a new terminal window:

$ ssh -i </path/to/saved/keypair/file.pem> -ND 8157 hadoop@<master.public-dns-name.amazonaws.com>

Use the name of the master, and the path to your EC2 keypair. This starts a proxy listening on port 8157.
Enable FoxyProxy in your browser
Monitoring jobs
Access the monitoring URLs.
Load the jobtracker
http://<master.public-dns-name.amazonaws.com>:9100/
Retrieving results
On the Master:

% hadoop dfs -copyToLocal /user/hadoop/script1-hadoop-results script1-hadoop-results

On your machine:

$ scp -i </path/to/saved/keypair/file.pem> -r hadoop@<master.public-dns-name.amazonaws.com>:~/script1-hadoop-results/ .
Terminate all jobs when you are done! If you leave jobs running, costs will rack up. You are responsible for your usage.
Relational DB on AWS
- https://console.aws.amazon.com
Pick a name and a description.
Connect to the cloud database
psql --host=<your_RDS_instance> --port=5432 --username=<username> --password --dbname=cloud_db

- Use the DB instance address from your console.
- Type the command on a single line.
- Type the username you chose.
Import data to RDS
psql -f initialize.sql --host=<your_RDS_instance> --port=5432 --username=<username> --password --dbname=cloud_db

The initialize.sql file is in your phpExample code.
Update your configuration file
Enter the proper values in config.php.
Start a local HTTP server. E.g., with PHP 5.4:

php -S localhost:8000
Remember to delete your instance when you no longer need it.