Pig Installation Guide and Practical Example
Presented by Priagung Khusumanegara
Prof. Kyungbaek Kim
Installation Guide
• Requirements
- Java 1.6 or later (this example uses java-7-openjdk)
- Hadoop 0.23.x, 1.2.x, or 2.5.x (this example uses Hadoop 1.2.1)
Configuration
• Make sure Hadoop is installed and runs correctly
• Download the stable Pig release (0.13.0)
$ wget http://apache.tt.co.kr/pig/pig-0.13.0/pig-0.13.0.tar.gz
• Unpack the downloaded Pig distribution and move it to your preferred directory (this example uses /usr/local/pig/)
$ tar -xvzf pig-0.13.0.tar.gz
$ mv pig-0.13.0 /usr/local/pig
• Edit ~/.bashrc and add the following statements at the end of the file
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
• Reload the shell configuration so the new variables take effect
$ source ~/.bashrc
• Test the Pig installation with a simple command
$ pig -help
Practical Example
Objective: Sum the packet lengths sent from each source IP to each destination IP in the network traffic
• Start Hadoop
• Download the input file and copy it to HDFS
- $ wget https://www.dropbox.com/s/k6li67bha12geet/input.txt?dl=1 -O input.txt
- $ hadoop dfs -copyFromLocal input.txt /input/input.txt
Note: the input file was captured with tcpdump: tcpdump -n -i wlan0 >> input.txt
Screenshot Input File (input.txt)
• Enter the Grunt shell in MapReduce mode
$ pig -x mapreduce
• Load the text file into a bag, putting each entire line into the element 'line' of type 'chararray'
RAW_LOGS = LOAD '/input/input.txt' AS (line:chararray);
• Apply a schema to the raw data
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        (tuple(CHARARRAY,CHARARRAY,LONG))
        REGEX_EXTRACT_ALL(line, '.+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+length\\s+(\\d+)')
    ) AS (IPS:chararray, IPD:chararray, S:long);
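To see which parts of a line the pattern captures before running it in Pig, the expression can be tried locally on a single line. The sketch below is only an illustration. The sample line is made up in tcpdump -n format, and the Pig pattern is translated to POSIX syntax for sed (\d becomes [0-9], \s becomes [[:space:]]), so it approximates what REGEX_EXTRACT_ALL does rather than reproducing it. The pattern is anchored at both ends on the assumption that REGEX_EXTRACT_ALL has to match the whole line.

```shell
# Hypothetical sample line in tcpdump -n format (made up for illustration).
line='10:15:30.100000 IP 192.168.0.7.51234 > 10.0.0.1.80: Flags [P.], seq 1:101, ack 1, win 229, length 100'

# POSIX translation of the Pig pattern: \d -> [0-9], \s -> [[:space:]].
ip='[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
pattern="^.+[[:space:]]($ip).+[[:space:]]($ip).+length[[:space:]]+([0-9]+)\$"

# Print the three captured groups: source IP, destination IP, packet length.
# With -n and the p flag, nothing is printed when the line does not match.
fields=$(printf '%s\n' "$line" | sed -nE "s/$pattern/\1 \2 \3/p")
echo "$fields"
```

The three values correspond to the IPS, IPD and S fields of LOGS_BASE. A line without a trailing "length N" part produces no output here.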
• Group traffic information by source IP addresses and destination IP addresses
FLOW = GROUP LOGS_BASE BY (IPS, IPD);
• Sum the packet lengths for each (source IP, destination IP) pair
TRAFFIC = FOREACH FLOW {
    sorted = ORDER LOGS_BASE BY S DESC;
    GENERATE group, SUM(LOGS_BASE.S);
}
• Store the output data in HDFS (/output)
STORE TRAFFIC INTO '/output';
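The group-and-sum steps above can be cross-checked on a few lines without a cluster. The following is a rough sketch and not part of the Pig job. It uses awk field splitting instead of the regular expression, the three sample lines are made up in tcpdump -n format, and it assumes every packet line has the form "time IP src.port > dst.port: ... length N".

```shell
# Hypothetical sample of tcpdump -n output (made up for illustration).
sample='10:15:30.100000 IP 192.168.0.7.51234 > 10.0.0.1.80: Flags [P.], seq 1:101, ack 1, win 229, length 100
10:15:30.200000 IP 10.0.0.1.80 > 192.168.0.7.51234: Flags [.], ack 101, win 227, length 0
10:15:30.300000 IP 192.168.0.7.51234 > 10.0.0.1.80: Flags [P.], seq 101:151, ack 1, win 229, length 50'

# Sum the packet lengths per (source IP, destination IP) pair.
totals=$(printf '%s\n' "$sample" | awk '
  $2 == "IP" && $(NF-1) == "length" {
    src = $3; dst = $5
    sub(/\.[0-9]+$/, "", src)    # drop the trailing .port
    sub(/\.[0-9]+:$/, "", dst)   # drop the trailing .port and colon
    sum[src "\t" dst] += $NF     # accumulate the length for this pair
  }
  END { for (k in sum) print k "\t" sum[k] }
' | sort)
echo "$totals"
```

For these sample lines the result is 150 for traffic from 192.168.0.7 to 10.0.0.1 and 0 for the reverse direction. The files Pig writes under /output should carry the same sums, although Pig is expected to write the group key as a tuple such as (192.168.0.7,10.0.0.1) rather than as two tab-separated columns.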
SCREENSHOT EACH PROCESS
• Screenshot Output File