Map Reduce, Hadoop & Pig

MAP/REDUCE, HADOOP & PIG

Data mining applied to the enterprise

DEFINITIONS

Data mining is the process of extracting patterns from data. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

A Framework is a re-usable design for a software system (or subsystem). A software framework may include support programs, code libraries, a scripting language, or other software to help develop and glue together the different components of a software project. Various parts of the framework may be exposed through an API.

MAP/REDUCE

Framework for processing huge datasets on certain kinds of distributable problems using a large number of computers.

MapReduce provides:

Automatic parallelization & distribution

Fault tolerance

I/O scheduling

Status monitoring

Use cases:

Document clustering

Machine learning

Inverted index construction

MapReduce was used to completely regenerate Google's index of the World Wide Web.

MAP/REDUCE

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure.

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve.

Both steps are defined with respect to data structured in (key, value) pairs.
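As a minimal sketch of the two steps, assume a word-count problem (an illustration, not taken from the slides) running in a single Python process rather than on a cluster. The names `map_fn`, `reduce_fn` and `run_map_reduce` are hypothetical.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values emitted for one key into a single result."""
    return word, sum(counts)

def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(grouped.items()))

# Input records are (key, value) pairs: here (line number, line text).
lines = [(0, "pig runs on hadoop"), (1, "hadoop runs map reduce")]
word_counts = run_map_reduce(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real cluster the grouping in the middle is what the framework does between the two steps; only `map_fn` and `reduce_fn` would be user code.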

MAP/REDUCE – DATA FLOW SECTIONS

Input reader

Map function

Partition function

Compare function

Reduce function

Output writer
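The six sections above can be sketched as six small Python functions chained in one process. This is an illustration under stated assumptions: word counting as the problem, two reducers, and a byte-sum partitioner chosen only because it is deterministic. None of the function names come from Hadoop or from the slides.

```python
from functools import cmp_to_key
from itertools import groupby

def input_reader(text):
    """Input reader: split raw input into (key, value) records."""
    for offset, line in enumerate(text.splitlines()):
        yield offset, line

def map_fn(_offset, line):
    """Map function: emit one (word, 1) pair per word."""
    for word in line.split():
        yield word, 1

def partition_fn(key, num_reducers):
    """Partition function: pick the reducer responsible for a key."""
    return sum(key.encode("utf-8")) % num_reducers

def compare_fn(a, b):
    """Compare function: order the pairs by key before reducing."""
    return (a[0] > b[0]) - (a[0] < b[0])

def reduce_fn(key, values):
    """Reduce function: fold all values of one key into one result."""
    return key, sum(values)

def output_writer(results):
    """Output writer: render results as tab-separated lines."""
    return "\n".join(f"{key}\t{value}" for key, value in results)

def run(text, num_reducers=2):
    """Chain the six sections; each partition stands in for one reducer."""
    partitions = [[] for _ in range(num_reducers)]
    for offset, line in input_reader(text):
        for key, value in map_fn(offset, line):
            partitions[partition_fn(key, num_reducers)].append((key, value))
    results = []
    for pairs in partitions:
        pairs.sort(key=cmp_to_key(compare_fn))
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            results.append(reduce_fn(key, [v for _, v in group]))
    return output_writer(sorted(results))

report = run("a b a\nc b a")
print(report)
```

The partition function guarantees that all pairs with the same key reach the same reducer, which is why each reducer can sort and group its own partition independently.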

MAP/REDUCE -> HADOOP

Google calls it…    Hadoop equivalent…
MapReduce           Hadoop MapReduce
GFS                 HDFS
Sawzall             Hive, Pig
BigTable            HBase
Chubby              ZooKeeper

HADOOP

Java Map/Reduce implementation

Framework that schedules tasks, monitors them, and re-executes the failed ones.

Single master JobTracker

Several slave TaskTrackers, one per node

Hadoop DFS (HDFS), not strictly required

Add-ons: Hive (developed at Facebook), Pig (developed at Yahoo!)
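The "schedule, monitor, re-execute failed tasks" idea can be sketched with a thread pool standing in for the worker nodes. This is not Hadoop's API or its scheduling algorithm: `run_with_retries`, `flaky_task` and the task ids are hypothetical, and the retry limit of 3 is an arbitrary choice for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_retries(tasks, max_attempts=3, workers=4):
    """Run every task on a worker pool and re-submit the ones that fail.

    `tasks` maps a task id to a callable taking the attempt number.
    Returns (results, attempts), both keyed by task id.
    """
    results, attempts = {}, {task_id: 0 for task_id in tasks}
    pending = set(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {}
            for task_id in pending:
                attempts[task_id] += 1
                futures[task_id] = pool.submit(tasks[task_id], attempts[task_id])
            pending = set()
            for task_id, future in futures.items():
                try:
                    results[task_id] = future.result()
                except Exception:
                    if attempts[task_id] >= max_attempts:
                        raise
                    pending.add(task_id)  # schedule the failed task again
    return results, attempts

def flaky_task(attempt):
    """Hypothetical task that fails on its first attempt only."""
    if attempt == 1:
        raise RuntimeError("simulated worker failure")
    return "done"

results, attempts = run_with_retries({
    "map-0": lambda attempt: "done",
    "map-1": flaky_task,
})
print(results, attempts)
```

Re-execution is safe here because each task is a pure function of its input, which is the same property that lets a Map/Reduce framework rerun a failed map or reduce task on another node.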

HADOOP EXAMPLE

A program that takes web server access log files and counts the number of hits in each minute slot over a week

Differentiate input & output phases: Map & Reduce

Map phase: access log files

Reduce phase: key set + iterator over each key subset
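The logic of this example can be sketched in Python, run in a single process. The sketch assumes access log lines in Common Log Format and uses the timestamp truncated to the minute as the key; both are assumptions, as are the names `map_phase`, `reduce_phase`, `count_hits` and the sample log lines. It is not the Java code of the original slides.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumes Common Log Format timestamps such as [10/Oct/2010:13:55:36 +0000].
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def map_phase(line):
    """Map: emit (minute slot, 1) for each access log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        return  # skip lines without a timestamp
    moment = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    yield moment.strftime("%Y-%m-%d %H:%M"), 1

def reduce_phase(minute_slot, hits):
    """Reduce: receive one key plus an iterator over its values; sum them."""
    return minute_slot, sum(hits)

def count_hits(lines):
    """Main code: group the map output by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for minute_slot, one in map_phase(line):
            grouped[minute_slot].append(one)
    return dict(reduce_phase(slot, iter(hits)) for slot, hits in grouped.items())

# Made-up sample lines, for illustration only.
sample_log = [
    '10.0.0.1 - - [10/Oct/2010:13:55:36 +0000] "GET / HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2010:13:55:59 +0000] "GET /a HTTP/1.1" 200 512',
    '10.0.0.1 - - [10/Oct/2010:13:56:02 +0000] "GET /b HTTP/1.1" 404 128',
]
hits_per_minute = count_hits(sample_log)
print(hits_per_minute)
```

The reduce function sees exactly what the slide describes: one key and an iterator over the values that share it.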

HADOOP EXAMPLE

[Diagram: Reduce and Output stages of the example]

HADOOP EXAMPLE – MAP PHASE

HADOOP EXAMPLE – REDUCE PHASE

HADOOP EXAMPLE – MAIN CODE

PROBLEMS

Hadoop Map/Reduce is very powerful, but…

Requires a Java Programmer

User has to reinvent the wheel every time a common operation is needed (join, filter, etc.)

Harder to write, harder to maintain

Optimization is left to the user

PIG

Platform for analyzing large data sets

High-level language + infrastructure (compiler)

Pig Latin: data flow language rather than procedural or declarative

Ease of programming

Optimization opportunities

Extensibility

PIG - ADVANTAGES

Increases productivity. In one test, 10 lines of Pig Latin ≈ 200 lines of Java. What took 4 hours to write in Java took 15 minutes in Pig Latin.

Opens the system to non-Java programmers.

Provides common operations like join, group, filter, sort.

PIG – HOW IT WORKS

PIG EXAMPLE

Start a terminal and run:

$ cd /usr/share/cloudera/pig/
$ bin/pig -x local

You should see a prompt like: grunt>

PIG – EXAMPLE - AGGREGATION

Let's count the number of times each user appears in a given data set.

log = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';

Results:

002BB5A52580A8ED 18
005BD9CD3AC6BB38 18
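For readers who do not know Pig Latin, the same aggregation can be sketched in plain Python. The sketch assumes tab-separated fields (Pig's default load delimiter); `sample_rows` is made-up data in the same (user, timestamp, query) layout, not an excerpt of excite-small.log, and `count_by_user` is a hypothetical name.

```python
from collections import Counter

# Made-up rows in the (user, timestamp, query) layout, for illustration only.
sample_rows = [
    "userA\t970916000001\tfirst query",
    "userA\t970916000002\tsecond query",
    "userB\t970916000003\tanother query",
]

def count_by_user(rows):
    """Mirror LOAD, GROUP ... BY user and COUNT for tab-separated rows."""
    # LOAD ... AS (user, timestamp, query)
    log = [tuple(row.split("\t")) for row in rows]
    # GROUP log BY user, then COUNT(log) for each group
    cntd = Counter(user for user, _timestamp, _query in log)
    return sorted(cntd.items())

user_counts = count_by_user(sample_rows)
# STORE cntd INTO 'output' (printed here instead of written to a file)
for user, count in user_counts:
    print(f"{user}\t{count}")
```

The Python version only works while the data fits in one machine's memory; the Pig script expresses the same data flow but is compiled to Map/Reduce jobs that run across the cluster.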

Supports several functions:

Aggregation

Grouping

Filtering

Ordering

Joins & Anti-Joins

Cogrouping (grouping generalization)

Several data types:

Scalar: int, long, double, chararray, bytearray

Complex: Maps, Tuples, Bags

PIG - COMMANDS

Pig Command      What it does
load             Read data from file system.
store            Write data to file system.
foreach          Apply expression to each record and output one or more records.
filter           Apply predicate and remove records that do not return true.
group/cogroup    Collect records with the same key from one or more inputs.
join             Join two or more inputs based on a key.
order            Sort records based on a key.
distinct         Remove duplicate records.
union            Merge two data sets.
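To make the semantics of these commands concrete, here is a rough Python analogue of each one on small in-memory lists of tuples. The relations `visits` and `users` and all their values are made up for the illustration, and the Python operations only approximate what Pig does on distributed data.

```python
from itertools import groupby

# Hypothetical relations: visits(user, url) and users(user, country).
visits = [("ann", "/home"), ("bob", "/docs"), ("ann", "/docs"), ("ann", "/home")]
users = [("ann", "ES"), ("bob", "US")]

# distinct: remove duplicate records.
distinct_visits = sorted(set(visits))

# filter: keep only the records for which the predicate is true.
docs_visits = [v for v in distinct_visits if v[1] == "/docs"]

# foreach: apply an expression to each record.
urls = [url for _user, url in distinct_visits]

# order + group: sort by key, then collect records sharing that key.
by_user = {
    user: [url for _u, url in records]
    for user, records in groupby(sorted(visits), key=lambda v: v[0])
}

# join: pair records from two inputs that share a key.
countries = dict(users)
joined = [
    (user, url, countries[user])
    for user, url in distinct_visits
    if user in countries
]

# union: merge two data sets.
merged = visits + [("eve", "/home")]

print(by_user)
print(joined)
```

Load and store are omitted because they are plain file reads and writes in this setting.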

POSSIBLE APPLICATIONS AT VLEX

Faster and improved, parallelized document indexing

Targeted advertisement

Recommendation system
