APACHE PIG Presented by Priagung Khusumanegara Prof. Kyungbaek Kim.

Agenda

• Introducing Pig
- Pig Characteristics
- Pig Elements

• Pig Latin Foundation
- Data Flow
- Pig Features
- Data Types

• Pig Operators and Functions

Pig Characteristics

• A platform for analyzing large data sets that runs on top of Hadoop

• Provides a high-level language for expressing data analysis

• Uses both HDFS (to read and write files) and MapReduce (to execute jobs)

Pig Elements

Pig Latin
- High-level scripting language
- Designed specifically for data transformation and flow expression

Grunt
- The environment in which Pig Latin commands are executed
- Currently supports Local and Hadoop modes

Pig Interpreter
- Converts Pig Latin to MapReduce jobs

Pig Latin Data Flow

• A LOAD statement to read data from the file system.

• A series of "transformation" statements to process the data.

• A DUMP statement to view results or a STORE statement to save the results.

LOAD → TRANSFORM → DUMP or STORE
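
The three stages above can be sketched as a short Pig Latin script. This is a minimal illustration, not taken from the slides: the input file 'scores.csv', its fields (name, score, status), the threshold of 50 and the output directory name are all assumptions.

```pig
-- Assumed input: a comma-separated file 'scores.csv' with name,score,status on each line.
-- LOAD: read data from the file system.
data = LOAD 'scores.csv' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- TRANSFORM: keep only rows with a score of at least 50 (threshold chosen for illustration).
passed = FILTER data BY score >= 50;

-- DUMP: print the result to the console ...
DUMP passed;

-- ... or STORE: write it to an output directory (hypothetical name).
STORE passed INTO 'passed_scores' USING PigStorage(',');
```

In practice a script would usually end with either DUMP or STORE, not both; both are shown here only to cover each branch of the flow.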

Running Pig

• Script
- Execute commands in a file
- $ pig scriptFile.pig

• Grunt
- Interactive shell for executing Pig commands
- Started when a script file is NOT provided

Running Modes

• Local
- Executes in a single JVM
- Works exclusively with the local file system
- Great for development, experimentation and prototyping

• Hadoop Mode
- Also known as MapReduce mode
- Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
- Can execute against a pseudo-distributed or fully distributed cluster

Running Modes

- $ pig -x local

- $ pig -x mapreduce

Hadoop Mode

Pig Relation

Pig Latin statements work with relations.

• A field is a piece of data, e.g. 19

• A tuple is an ordered set of fields, e.g. (19,2)

• A bag is an unordered collection of tuples, e.g. {(19,2), (18,1)}

• A relation is a bag

[Figure: diagram labeling the Fields, a Tuple and a Bag]
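
A small sketch of how these terms show up in a script. The file 'pairs.txt' and the field names a and b are hypothetical; the sample values mirror the (19,2) and (18,1) tuples above.

```pig
-- Assumed input: 'pairs.txt' with two tab-separated integers per line, e.g. 19<TAB>2.
-- 'pairs' is a relation, i.e. a bag of tuples such as {(19,2), (18,1)}.
pairs = LOAD 'pairs.txt' AS (a:int, b:int);

-- Each tuple has two fields; $0 and $1 address them by position.
firsts = FOREACH pairs GENERATE $0;

-- GROUP yields tuples whose second field is itself a bag of the matching tuples.
by_b = GROUP pairs BY b;

-- DESCRIBE prints the schema of a relation, including nested bags.
DESCRIBE by_b;
```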

Data Type

Data Type | Description | Example
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello world
boolean | boolean | true/false (case insensitive)
datetime | datetime | 1970-01-01T00:00:00.000+00:00
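
As an illustration of where these types appear, the sketch below declares one field of each type in a LOAD schema and uses typed constants in a filter. The file name 'events.tsv' and all field names are assumptions made for the example.

```pig
-- Hypothetical tab-separated input; one field per data type from the table above.
events = LOAD 'events.tsv' AS (
    id:int,
    views:long,
    ratio:float,
    avg_time:double,
    title:chararray,
    active:boolean,
    created:datetime
);

-- Typed constants: 10L is a long, 10.5F is a float, 'hello world' is a chararray.
subset = FILTER events BY views > 10L AND ratio < 10.5F AND title != 'hello world';

-- Print the declared schema to confirm the types.
DESCRIBE events;
```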

LOAD operator

Load the contents of text files into a bag named data, with a schema.
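
The original slide's code is not preserved in this transcript. A minimal sketch of such a LOAD statement, assuming a comma-separated file 'data.txt' and a hypothetical schema of name, score and status, might look like this:

```pig
-- Assumed input: 'data.txt', comma-separated, one record per line.
-- The AS clause supplies the schema: a name and a type for each field.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);
```

PigStorage is Pig's built-in loader for delimited text; without the USING clause it splits fields on tabs.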

DUMP and STORE operator

• No action is taken until a DUMP or STORE command is encountered
- Pig will parse, validate and analyze statements but not execute them

• DUMP – display the results on screen

• STORE – save the results to a file

DUMP and STORE operator

DUMP Example

STORE Example
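
The slide's example code did not survive extraction. The following is a stand-in sketch under assumed names (input file 'data.txt', fields name, score, status, output directory 'output'), showing both operators on the same bag:

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- DUMP example: triggers execution and prints every tuple to the screen.
DUMP data;

-- STORE example: triggers execution and writes the tuples to the 'output' directory.
STORE data INTO 'output' USING PigStorage(',');
```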

FILTER and GROUP operator

Filter the data bag

Group the filtered bag by score
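
A possible version of these two steps is sketched below. The input file, schema and filter condition (score above 50) are assumptions, since the slide's code is missing from the transcript.

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- Filter the data bag: keep tuples whose score exceeds 50 (illustrative condition).
filtered = FILTER data BY score > 50;

-- Group the filtered bag by score: each output tuple holds the score (the group key)
-- and a bag of all tuples from 'filtered' that share it.
grouped = GROUP filtered BY score;

DUMP grouped;
```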

ORDER operator

Note: For descending order:
Sorted = ORDER data BY score DESC;
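
For comparison, a sketch of the default ascending sort. The input file and schema are assumed; only the bag name data and the field score come from the slide.

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- ORDER sorts ascending unless DESC is given; ASC may be written explicitly.
sorted_data = ORDER data BY score ASC;

DUMP sorted_data;
```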

FOREACH operator

For each row, emit the score and status fields
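
A minimal sketch of this projection, assuming a three-field input (the file name and the name field are hypothetical; score and status are the fields named on the slide):

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- For each tuple in 'data', generate a new tuple containing only score and status.
projected = FOREACH data GENERATE score, status;

DUMP projected;
```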

DISTINCT operator

Remove duplicate tuples from a bag
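
A short sketch with an assumed input file and schema:

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- DISTINCT compares whole tuples, so a tuple is dropped only if every field matches another.
distinct_data = DISTINCT data;

DUMP distinct_data;
```

To deduplicate on a subset of fields, one option is to project those fields with FOREACH first and then apply DISTINCT.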

UNION operator

Merge the contents of two or more bags
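
A sketch of a two-bag union. The file names are hypothetical, and both inputs are assumed to share the same schema.

```pig
-- Two assumed inputs with identical schemas.
data1 = LOAD 'data1.txt' USING PigStorage(',') AS (name:chararray, score:int);
data2 = LOAD 'data2.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- UNION concatenates the bags; it keeps duplicates and does not guarantee tuple order.
combined = UNION data1, data2;

DUMP combined;
```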

JOIN operator

Bags data1 and data2 are joined on their first fields.
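
One way to write such a join, using positional notation for the first field. The file names and schemas are assumptions; the bag names data1 and data2 come from the slide.

```pig
-- Two assumed inputs whose first field holds a shared key.
data1 = LOAD 'data1.txt' USING PigStorage(',') AS (id:int, name:chararray);
data2 = LOAD 'data2.txt' USING PigStorage(',') AS (id:int, score:int);

-- Inner join on the first field of each bag ($0 means "field at position 0").
joined = JOIN data1 BY $0, data2 BY $0;

DUMP joined;
```

Each output tuple contains all fields of the matching data1 tuple followed by all fields of the matching data2 tuple.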

SUM, MIN, AVG Function

Note:
- Find the minimum value: MIN
- Find the sum: SUM
- Find the average value: AVG
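
These functions operate on a bag, so a common pattern is to group first and then aggregate. The sketch below assumes an input file 'data.txt' with name and score fields and computes all three values over the whole data set.

```pig
-- Assumed input and schema, for illustration only.
data = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- GROUP ... ALL puts every tuple into a single group, so the aggregates cover all rows.
all_rows = GROUP data ALL;

-- data.score is the bag of score values inside the group.
stats = FOREACH all_rows GENERATE
            MIN(data.score) AS min_score,
            SUM(data.score) AS sum_score,
            AVG(data.score) AS avg_score;

DUMP stats;
```

Replacing GROUP data ALL with GROUP data BY some field would yield one row of aggregates per distinct value of that field.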