MapReduce vs Pig | MapReduce Pig Integration

Slide 1 © 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com MapReduce and Pig Comparison


Session Objectives

This session will help you with the following:

ᗍ Introduction to Big Data and Hadoop

ᗍ Understanding MapReduce and Pig Latin

ᗍ Comparative Analysis of MapReduce & Pig

ᗍ Big Data & Hadoop Course Syllabus

ᗍ Webinar by Skillspeed


Big Data and its Challenges


Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.

It is very difficult to manage such huge volumes of data.


Who Generates Big Data?

Have you ever wondered how Google, Facebook or LinkedIn manage to store and use such huge volumes of data?

Today, managing such big data is becoming a problem for all of us.


Hadoop can be used to process such huge volumes of data.

We will see how.

Before that, let's understand what Hadoop is.


Hadoop and its Characteristics

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

It is an open-source data management technology with scale-out storage and distributed processing.

Hadoop Characteristics

ᗍ Flexible

ᗍ Reliable

ᗍ Economical

ᗍ Scalable


Hadoop Ecosystem

ᗍ Flume: import or export of unstructured or semi-structured data

ᗍ Sqoop: import or export of structured data

ᗍ Apache Oozie (Workflow)

ᗍ Pig Latin: Data Analysis

ᗍ Hive: DW System

ᗍ HBase

ᗍ MapReduce Framework

ᗍ Other YARN Frameworks (MPI, Giraph)

ᗍ YARN: Cluster Resource Management

ᗍ HDFS (Hadoop Distributed File System)


Map Reduce


MapReduce Use Cases

Problem Statement:

Find maximum stock market levels recorded in a span of 5 years

Problem Statement:

De-identify personal identifier information


Traditional Solution

[Figure: Very Big Data is divided into Split Data chunks; grep runs on each chunk and produces matches; cat combines the matches into All matches.]
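The split, grep, cat pipeline in the figure can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names (`split_data`, `grep`, `cat`, `traditional_search`) and the sample lines are hypothetical and not from the slides, `n_splits` is assumed to be at least 1, and the chunks are scanned one after another here, whereas in the figure each grep works on its own split.

```python
import re
from typing import List


def split_data(lines: List[str], n_splits: int) -> List[List[str]]:
    """Divide the input into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep(pattern: str, chunk: List[str]) -> List[str]:
    """Return the lines of one chunk that match the pattern."""
    rx = re.compile(pattern)
    return [line for line in chunk if rx.search(line)]


def cat(parts: List[List[str]]) -> List[str]:
    """Concatenate the per-chunk matches into a single list."""
    return [line for part in parts for line in part]


def traditional_search(pattern: str, lines: List[str], n_splits: int = 4) -> List[str]:
    chunks = split_data(lines, n_splits)                   # Very Big Data -> Split Data
    matches = [grep(pattern, chunk) for chunk in chunks]   # grep per split
    return cat(matches)                                    # cat -> All matches


# Hypothetical sample input, not from the slides.
log_lines = ["error: disk", "ok", "error: net", "ok", "error: cpu"]
all_matches = traditional_search("error", log_lines, n_splits=2)
print(all_matches)
```

In this hand-rolled approach the splitting, the distribution of work and the final merge all have to be managed by whoever writes the script.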


MapReduce Solution

[Figure: Very Big Input is divided into Split Data chunks; inside the MapReduce Framework, MAP processes the splits and REDUCE produces All matches.]


Understanding MapReduce Paradigm

MapReduce Word Count Process Flow

Stages: Input, Splitting, Mapping, Shuffling, Reducing, Final Result

Input (K1, V1):
Jack Bill Joe
Don Don Joe
Jack Don Bill

Splitting:
Split 1: Jack Bill Joe
Split 2: Don Don Joe
Split 3: Jack Don Bill

Mapping, List(K2, V2):
Split 1: Jack, 1; Bill, 1; Joe, 1
Split 2: Don, 1; Don, 1; Joe, 1
Split 3: Jack, 1; Don, 1; Bill, 1

Shuffling, K2, List(V2):
Bill, (1,1)
Don, (1,1,1)
Jack, (1,1)
Joe, (1,1)

Reducing, List(K3, V3):
Bill, 2
Don, 3
Jack, 2
Joe, 2

Final Result:
Bill, 2; Don, 3; Jack, 2; Joe, 2
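The word count flow can be traced with a small in-memory Python sketch. It only mimics the map, shuffle and reduce stages on the three input lines from the slide; it is not Hadoop code, and the function names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(split: str) -> List[Tuple[str, int]]:
    """Mapping: emit (word, 1) for every word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffling: group all emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reducing: sum the list of ones for each word."""
    return {key: sum(values) for key, values in grouped.items()}


# The three input lines from the slide, one per split.
splits = ["Jack Bill Joe", "Don Don Joe", "Jack Don Bill"]

mapped = [map_phase(s) for s in splits]
grouped = shuffle(mapped)
counts = reduce_phase(grouped)

print(grouped)
print(counts)
```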


MapReduce Anatomy

Key-Value

Map: (K1, V1) → List(K2, V2)

Reduce: (K2, List(V2)) → List(K3, V3)
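The two signatures can be written down as a small generic driver in Python. This is a single-process sketch under assumptions, not the Hadoop API: `run_map_reduce`, the mapper and reducer names, and the sample stock records are all hypothetical. The example reuses the earlier "maximum stock level" use case, with each input record assumed to be a `symbol,date,price` line.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_map_reduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], List[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], List[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Apply map to every (K1, V1), group by K2, then reduce each group."""
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):      # Map: (K1, V1) -> List(K2, V2)
            grouped[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(grouped):             # assumes the keys can be ordered
        output.extend(reducer(k2, grouped[k2]))  # Reduce: (K2, List(V2)) -> List(K3, V3)
    return output


def max_price_mapper(offset: int, line: str) -> List[Tuple[str, float]]:
    symbol, _date, price = line.split(",")
    return [(symbol, float(price))]


def max_price_reducer(symbol: str, prices: List[float]) -> List[Tuple[str, float]]:
    return [(symbol, max(prices))]


# Hypothetical sample records: (line offset, "symbol,date,price").
stock_records = [
    (0, "AAA,2010-01-04,10.0"),
    (1, "BBB,2010-01-04,7.0"),
    (2, "AAA,2012-06-01,12.5"),
    (3, "BBB,2014-03-10,6.5"),
]

max_prices = run_map_reduce(stock_records, max_price_mapper, max_price_reducer)
print(max_prices)
```

Only the mapper and reducer change from one problem to the next; the grouping step in the middle stays the same.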


MapReduce Advantages

[Figure: Map Task, HDFS Block (a, b, c), Node, Rack, Data Center]

The two biggest advantages of MapReduce are:

ᗍ It takes the processing to the data: map tasks are scheduled, where possible, on the nodes that hold the HDFS blocks

ᗍ It allows data to be processed in parallel


Input Splits in MapReduce

Input Data

ᗍ HDFS Block: Physical Division

ᗍ Input Splits: Logical Division
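The difference between the physical and the logical division can be illustrated with a simplified Python sketch. It assumes newline-delimited text records and follows the rule commonly described for line-oriented input: a reader whose split does not start at offset 0 skips the first line it sees, and every reader may run past its end offset to finish the last record it started. The function names and the tiny block size are assumptions for illustration; this is not Hadoop's implementation.

```python
from typing import List, Tuple


def physical_blocks(size: int, block_size: int) -> List[Tuple[int, int]]:
    """Physical division: fixed-size byte ranges that ignore record boundaries."""
    return [(s, min(s + block_size, size)) for s in range(0, size, block_size)]


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Logical division: return only whole records for the byte range [start, end)."""
    pos = start
    if start != 0:
        # The first (possibly partial) line is read by the previous split.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # A record that starts at or before `end` is read in full, even past `end`.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        records.append(data[pos:nl])
        pos = nl + 1
    return records


data = b"alpha\nbravo\ncharlie\ndelta\n"
blocks = physical_blocks(len(data), block_size=8)  # tiny blocks, for illustration only
per_split = [read_split(data, start, end) for start, end in blocks]

print(blocks)
print(per_split)
```

The byte ranges cut records in the middle, yet every record is read exactly once and by exactly one split.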


Demonstration

Sequence Files Processing


Need for Pig

ᗍ Java is not a preferred language for many data analysts

ᗍ About 200 lines of Java code correspond to roughly 10 lines of Pig code

ᗍ Many built-in operations are available for common data operations such as join, grouping and filtering


ᗍ Useful for creating ad-hoc MapReduce jobs on very large data sets

ᗍ Java knowledge is optional

ᗍ Much shorter development time

ᗍ Fewer LOC = Easier Maintenance

ᗍ Easily extensible whenever required

ᗍ Easy to learn and user-friendly
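To make the "common data operations" mentioned above concrete, here is a plain Python sketch of a filter, join and group data flow. It is neither Pig Latin nor Pig's implementation; the helper names and the sample `users` and `orders` rows are hypothetical. The point is that each step is a short, named operation on a whole data set, which is the style of work that a handful of Pig statements is meant to express.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


def filter_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def group_by(rows: List[Row], key: str) -> Dict[object, List[Row]]:
    """Collect rows into lists keyed by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def join(left: List[Row], right: List[Row], left_key: str, right_key: str) -> List[Row]:
    """Inner join: merge every left row with each right row that shares its key."""
    index = group_by(right, right_key)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical sample data.
users = [{"id": 1, "name": "Jack"}, {"id": 2, "name": "Don"}, {"id": 3, "name": "Joe"}]
orders = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 1, "amount": 5.0},
    {"user_id": 2, "amount": 25.0},
    {"user_id": 3, "amount": 10.0},
    {"user_id": 2, "amount": 40.0},
]

# Data flow: filter the orders, join them with the users, group by name, then sum.
big_orders = filter_rows(orders, lambda row: row["amount"] >= 20)
joined = join(users, big_orders, "id", "user_id")
totals = {
    name: sum(row["amount"] for row in rows)
    for name, rows in group_by(joined, "name").items()
}
print(totals)
```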


Pig Vs M/R

[Chart: Hadoop vs Pig, lines of code. Pig needs about 1/20 the lines of code.]

[Chart: Hadoop vs Pig, development time in minutes. Pig needs about 1/16 the development time.]


MapReduce

ᗍ Provides a powerful mechanism for parallel computation

ᗍ Gives more control over algorithm execution

ᗍ Very rigid in structure

Pig

ᗍ Acts as a higher-level DSL over MapReduce

ᗍ Insulates programmers from underlying Hadoop concepts

ᗍ Provides seamless integration with a range of underlying Hadoop versions


Where to use Pig?

Pig is a data flow language, so it is most suitable for:

ᗍ Quickly changing data processing requirements

ᗍ Processing data from multiple channels

ᗍ Quick hypothesis testing

ᗍ Time-sensitive data refreshes

ᗍ Data profiling using sampling


Where NOT to use Pig?

Pig might NOT be a preferred choice when:

ᗍ The input data is in a hard-to-parse format (video, audio, free-form text, etc.)

ᗍ We need more fine-grained control over processing

ᗍ The job needs loops or complex logic: Pig lacks control structures, so such logic often requires extending Pig

ᗍ Performance is critical: Pig adds processing overhead on top of the MapReduce logic, so Pig jobs tend to be slightly slower than equivalent MapReduce jobs


What is Expected?

In this section, we will discuss the questions on HDFS and MapReduce that are asked during interviews.

This will help you analyze the importance of the topics under study!


The Top 5 Interview Questions

What is the use of the NameNode in HDFS?

What is a DataNode in HDFS?

What is the JobTracker in Hadoop?

What is MapReduce?

What are the basic components of a Hadoop application?

And many more...


Job Trends – Hadoop


Why SkillSpeed?

ᗍ Course Curriculum from Industry Experts

ᗍ Instructor-Led Live Virtual Sessions

ᗍ Lifetime access to Course Content via LMS

ᗍ 100% Placement Assistance

ᗍ 24x7 Support


Course Topics

Module 1: Introduction to Big Data and Hadoop

Module 2: HDFS Internals, Hadoop Configurations and Data Loading

Module 3: Introduction to MapReduce

Module 4: Advanced MapReduce Concepts

Module 5: Introduction to Pig

Module 6: Advanced Pig and Introduction to Hive

Module 7: Advanced Hive Concepts

Module 8: Extending Hive and HBase Introduction

Module 9: Advanced HBase and Oozie Introduction

Module 10: Project Set-up Discussion


Corporate Partners


Contact Us

Lines open 24/7

To know more about the course, please contact:

IND +91-90660-20904

USA 1866-607-6547 (Toll Free)

Or reach us at

[email protected]
