APACHE PIG Presented by Priagung Khusumanegara Prof. Kyungbaek Kim.
Agenda
• Introducing Pig
- Pig Characteristics
- Pig Elements
• Pig Latin Foundation
- Data Flow
- Pig Feature
- Data Types
• Pig Operators and Functions
Pig Characteristics
• A platform for analyzing large data sets that runs on top of Hadoop
• Provides a high-level language for expressing data analysis
• Uses both HDFS (to read and write files) and MapReduce (to execute jobs)
Pig Elements
• Pig Latin
- High-level scripting language
- Designed specifically for data transformation and flow expression
• Grunt
- The environment in which Pig Latin commands are executed
- Currently supports Local and Hadoop modes
• Pig Interpreter
- Converts Pig Latin to MapReduce
Pig Latin Data Flow
• A LOAD statement to read data from the file system.
• A series of "transformation" statements to process the data.
• A DUMP statement to view results or a STORE statement to save the results.
LOAD -> TRANSFORM -> DUMP or STORE
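The three-stage flow can be modelled outside Pig. The Python sketch below is not Pig Latin and does not call Pig; the sample records, the field layout (name, score) and the helper names load, transform and store are assumptions made purely for illustration.

```python
import csv
import io

def load(text):
    """LOAD stage: parse comma-separated lines into (name, score) tuples."""
    return [(name, int(score)) for name, score in csv.reader(io.StringIO(text))]

def transform(rows):
    """TRANSFORM stage: keep only rows whose score is at least 50."""
    return [row for row in rows if row[1] >= 50]

def store(rows):
    """STORE stage: serialise the rows back to comma-separated text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

# Hypothetical input standing in for a file on HDFS or the local file system.
raw = "alice,72\nbob,41\ncarol,50\n"
result = store(transform(load(raw)))
print(result, end="")
```

Each stage consumes the output of the previous one, which is the shape a Pig Latin script follows as well.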
Running Pig
• Script
- Execute commands in a file
- $ pig scriptFile.pig
• Grunt
- Interactive shell for executing Pig commands
- Started when a script file is NOT provided
Running Modes
• Local mode
- Executes in a single JVM
- Works exclusively with the local file system
- Great for development, experimentation and prototyping
- $ pig -x local
• Hadoop mode
- Also known as MapReduce mode
- Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
- Can execute against a pseudo-distributed or fully distributed Hadoop installation
- $ pig -x mapreduce
Pig Relation
Pig Latin statements work with relations.
• A field is a piece of data, e.g. 19
• A tuple is an ordered set of fields, e.g. (19,2)
• A bag is an unordered collection of tuples, e.g. {(19,2), (18,1)}
• A relation is a bag
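The nesting of field, tuple and bag can be mirrored with ordinary Python values. This is only an analogy, not how Pig stores data: a Python tuple stands in for a Pig tuple, and a collections.Counter stands in for a bag because it ignores order while still allowing duplicate tuples, as Pig bags do.

```python
from collections import Counter

field = 19                              # a field: a single piece of data
tup = (19, 2)                           # a tuple: an ordered set of fields
bag = Counter([(19, 2), (18, 1)])       # a bag: an unordered collection of tuples

# Order of tuples inside a bag does not matter ...
same_bag = Counter([(18, 1), (19, 2)])
bags_equal = bag == same_bag

# ... but order of fields inside a tuple does.
tuples_equal = (19, 2) == (2, 19)

print(bags_equal, tuples_equal)
```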
Data Type
• int - Signed 32-bit integer. Example: 10
• long - Signed 64-bit integer. Data: 10L or 10l. Display: 10L
• float - 32-bit floating point. Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F. Display: 10.5F or 1050.0F
• double - 64-bit floating point. Data: 10.5 or 10.5e2 or 10.5E2. Display: 10.5 or 1050.0
• chararray - Character array (string) in Unicode UTF-8 format. Example: hello world
• boolean - boolean. Example: true/false (case insensitive)
• datetime - datetime. Example: 1970-01-01T00:00:00.000+00:00
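Two details of the numeric types are easy to check by hand: the value range implied by "signed 32-bit" and "signed 64-bit", and why 10.5e2 is displayed as 1050.0. The Python sketch below does the arithmetic only; the helper fits_signed is a hypothetical name and nothing here involves Pig itself.

```python
def fits_signed(value, bits):
    """Return True if value lies in the range of a signed integer of that width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1

int_max = 2 ** 31 - 1                               # largest value an int can hold
fits_int = fits_signed(int_max, 32)                 # still an int
overflows_int = not fits_signed(int_max + 1, 32)    # one more needs a long
fits_long = fits_signed(int_max + 1, 64)

# Scientific notation: 10.5e2 means 10.5 x 10^2.
scientific = 10.5e2

print(int_max, fits_int, overflows_int, fits_long, scientific)
```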
LOAD operator
Load the contents of text files into a bag named data; the AS clause of the LOAD statement supplies the schema.
DUMP and STORE operator
• No action is taken until a DUMP or STORE command is encountered
- Pig will parse, validate and analyze statements but not execute them
• DUMP – display the results on screen
• STORE – save results to a file
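The deferred execution described for DUMP and STORE can be illustrated with a small Python model. It is a sketch of the idea only, not Pig's planner: the Relation class, its filter and dump methods, and the sample rows are all hypothetical.

```python
class Relation:
    """A deferred relation: it records steps and runs them only on dump()."""

    def __init__(self, source, steps=()):
        self.source = source          # callable that actually reads the data
        self.steps = tuple(steps)     # recorded filter predicates

    def filter(self, predicate):
        # Recording a step does not touch the data.
        return Relation(self.source, self.steps + (predicate,))

    def dump(self):
        # Only here is the source read and every recorded step applied.
        rows = list(self.source())
        for predicate in self.steps:
            rows = [row for row in rows if predicate(row)]
        return rows

reads = []  # tracks how often the source was really read

def source():
    reads.append("read")
    return [("alice", 72), ("bob", 41)]

data = Relation(source)
passed = data.filter(lambda row: row[1] >= 50)
reads_before_dump = len(reads)   # nothing has executed yet
rows = passed.dump()             # execution happens now
print(reads_before_dump, rows)
```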
FILTER and GROUP operator
Filter the data bag
Group the filtered bag by score
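As a rough picture of what filtering and then grouping produce, here is a Python sketch on made-up rows. The (name, score) layout, the threshold of 50 and the Pig Latin lines in the comments are assumptions, since the slide's own example did not survive.

```python
from collections import defaultdict

data = [("alice", 72), ("bob", 41), ("carol", 72), ("dave", 50)]

# Assumed Pig Latin equivalent: filtered = FILTER data BY score >= 50;
filtered = [row for row in data if row[1] >= 50]

# Assumed Pig Latin equivalent: grouped = GROUP filtered BY score;
# Each group key maps to the bag of rows that share that score.
groups = defaultdict(list)
for row in filtered:
    groups[row[1]].append(row)
grouped = dict(groups)

print(grouped)
```

In Pig the result of GROUP is a relation of (group, bag) tuples; the dictionary above plays that role.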
ORDER operator
Note: For descending order:
Sorted = ORDER data BY score DESC;
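The effect of ascending versus descending order can be sketched in Python as follows. The rows are invented and the scores are deliberately distinct, so the result does not depend on how ties are handled.

```python
data = [("alice", 72), ("bob", 41), ("dave", 50)]

# Ascending by score (the default direction).
ascending = sorted(data, key=lambda row: row[1])

# Descending by score, corresponding to ORDER data BY score DESC.
descending = sorted(data, key=lambda row: row[1], reverse=True)

print(ascending)
print(descending)
```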
FOREACH operator
For each row, emit the score and status fields
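Projection of selected fields per row looks like this in a Python sketch. The three-field layout (name, score, status) and the Pig Latin line in the comment are assumptions for illustration.

```python
data = [("alice", 72, "pass"), ("bob", 41, "fail")]

# Assumed Pig Latin equivalent: projected = FOREACH data GENERATE score, status;
# Every input row yields one output row holding only the chosen fields.
projected = [(score, status) for _name, score, status in data]

print(projected)
```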
DISTINCT operator
Remove duplicate tuples in bag
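A minimal Python sketch of duplicate removal, on invented tuples. A set drops repeated tuples; the result is sorted here only to make the printed output predictable, since a bag has no inherent order.

```python
data = [(19, 2), (18, 1), (19, 2)]

# Assumed Pig Latin equivalent: unique = DISTINCT data;
unique = sorted(set(data))

print(unique)
```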
UNION operator
Merge the contents of two or more bags
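Merging two bags can be pictured in Python as below. The sample bags are made up. Note that a union keeps duplicates: a tuple present in both inputs appears twice in the result, and DISTINCT would be needed to collapse it.

```python
from collections import Counter

data1 = [(19, 2), (18, 1)]
data2 = [(19, 2), (17, 3)]

# Assumed Pig Latin equivalent: merged = UNION data1, data2;
merged = data1 + data2

# Count occurrences to show that the shared tuple is not collapsed.
counts = Counter(merged)

print(len(merged), counts[(19, 2)])
```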
JOIN operator
Bag data1 and data2 are joined by their first fields.
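An inner join on the first field can be sketched in Python with a nested loop. The contents of data1 and data2 are invented, and the Pig Latin line in the comment is an assumed example rather than the slide's original.

```python
data1 = [(1, "alice"), (2, "bob"), (3, "carol")]
data2 = [(1, 72), (3, 50), (4, 10)]

# Assumed Pig Latin equivalent: joined = JOIN data1 BY $0, data2 BY $0;
# Rows are paired when their first fields match; the output row carries
# all fields of the left row followed by all fields of the right row.
joined = [left + right
          for left in data1
          for right in data2
          if left[0] == right[0]]

print(joined)
```

Keys 2 and 4 each appear in only one input, so they are absent from the result.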
SUM, MIN, AVG Functions
Note:
- Find the minimum value: MIN
- Find the sum: SUM
- Find the average value: AVG
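What the three aggregates compute can be checked with a small Python sketch on invented scores. In Pig these functions are typically applied to a bag produced by GROUP; here a plain list stands in for that bag.

```python
scores = [72, 41, 52]

total = sum(scores)                  # SUM
lowest = min(scores)                 # MIN
average = total / len(scores)        # AVG

print(total, lowest, average)
```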