Big Data Fundamentals


DECODING THE BIG DATA HYPE
From A Layman's Perspective

ABSTRACT: In 2001, META Group (now Gartner) published a report by Doug Laney, wherein the first analysis of Big Data challenges was documented. This document is an attempt to understand those challenges and the solutions offered. It has been prepared using multiple sources, which are cited accordingly.

By: Smarak Das [EMP Id: 391485], Teradata Certified Database Administrator & Technical Specialist


INDEX

(a) Fundamentals Of Big Data
- Characteristics Of Big Data
- Warehouse Vs Hadoop
- Use Cases Of Big Data
- Big Data Demographics

(b) All About Hadoop
- HDFS
- Assumptions & Goals
- DataNodes & NameNodes
- Data Replication & Integrity
- Data Blocks Organization & Pipelining
- File Permission Guide

(c) Basics Of MapReduce

(d) Hadoop Common Components
- Pig & PigLatin
- HIVE
- JAQL
- FLUME
- ZooKeeper
- Oozie
- Lucene
- Avro


Fundamentals Of BIG DATA

You Are Part Of It Every Day

Wikipedia quotes “Big Data is the collection of data sets so large and complex that

it becomes difficult to process using on-hand database management tools or

traditional data processing applications.”

The term “Big Data” is a misnomer since it implies that pre-existing data is somehow

small (it isn’t) or the only challenge is its sheer size (size is one of them, but there

are often more). In short, Big Data applies to information that can’t be processed or

analyzed using traditional tools. The effect of e-commerce, social media, rise in

merger/acquisition, increasing collaboration/partnership etc. is driving enterprises to

higher levels of consciousness about how the data is being managed at its basic level.

Today, every organization has access to a wealth of information, yet they don’t know

how to get value out of it because it is sitting in its most raw format or in semi-

structured, unstructured format, and as a result, they don’t even know whether it’s

worth keeping. Also, organizations are getting overwhelmed by the volume of data generated, the variety of data available, and the velocity at which data arrives.

Companies have the ability to store anything and they are generating data like never

before, yet as this potential gold mine of data piles up, the percentage of data the

business can process is going down.

Today’s world is changing. Through instrumentation and sensors, we are able to

track and sense more things, and if we can track and sense it, we tend to store it. Big

Data is the game changer for the overall effectiveness of your data centers, because

of its potential as a powerful tool in your information management repertoire.

The practices and tools of Big Data and Data Science don't stand alone in the data ecosystem. They rely on the usability of data, and a platform for future discovery

and innovation. As Big Data grows over 2014, we will see more of Big Data

acceptability, maturity as an industry, and adoption across industries.


Characteristics Of Big Data

Three characteristics define Big Data: VOLUME, VARIETY, and VELOCITY.

Source: Google Images

Fig: 3V Of Big Data

Other "V"s have been added to the Big Data characteristics over time, with "Veracity" and "Variability" being two common additions. However, we shall concentrate on

“Volume”, “Variety” and “Velocity”. Each of these characteristics introduces its

own set of complexities concerning data processing. All these 03 features have

created the need for a new class of capabilities to augment the way things are done

today to provide a better line of sight and controls over our existing knowledge

domains and the ability to act on them. Together, handling these 03 challenges

effectively will decide the efficiency and success of any Big Data initiative.


VOLUME: The Volume of data is exploding. The year 2000 had roughly 800,000 Petabytes of data, and by 2020 we expect to reach 35 Zettabytes. Twitter generates 7 TB of data every day, while Facebook generates 10 TB. If the figures of Twitter and Facebook didn't amaze you, the list below gives numbers on an even larger scale:

(a) Large Hadron Collider: About 15 PB of data per year.

(b) YouTube: 72 hours of video uploaded every minute.

(c) Human Genomics: 7000 PB

(d) Large Synoptic Survey Telescope: 30 TB of Images per day.

(e) Annual Email Traffic (No SPAM): 300 PB +

These numbers will already be out of date by the time this document is finished, and further outdated by the time you read it.

Today, we are storing everything: environmental data, financial data, medical data,

surveillance data, click stream data, and the list goes on and on. The term “Big Data”

means organizations are facing massive volumes of data, and this volume is

increasing with each day. As the amount of data available to business is on the rise,

the percent of data it can process, understand and analyze is on the decline, thus

creating a Blind Zone. This Blind Zone creates an uncertainty concerning the value

of all the captured and yet-unexplored data.

Source: Google Images

Fig: Discrepancy between Data Storage & Data Analysis


VARIETY: Of all the "V"s, Variety holds the most potential for exploitation. While not everybody has huge volumes of data like Twitter, Facebook, eBay etc., even medium and small scale industries have multiple data sources which can be integrated for organizational benefit. With the explosion of sensors, smart phones,

social collaboration technologies, data in an organization is becoming complex,

because it includes not only relational-database-suitable structured data, but also raw,

unstructured, semi-structured data. Traditional systems are struggling to store and

perform the required analytics to gain understanding from these varieties of data.

Only 20 percent of today's data is traditional, structured data. The remaining 80 percent of the world's data is unstructured or, at best, semi-structured. Videos and pictures aren't easily stored in relational databases. Variety is harder to grasp and analyze than "bigger" (Volume) or "faster" (Velocity). To capitalize on the Big Data opportunity,

enterprises must be able to analyze all types of data, both relational & non-relational:

text, sensor data, audio, video, transactional, and more.

Figure: Share Of Structured & Un-Structured Data Source: Google Images

Figure: Difference Between Variety of Data Source: Relational Source


VELOCITY: Velocity refers to the speed at which data is generated, stored, and processed. Earlier, the typical process was to fire a batch job against data and wait for the results to arrive. It used to work because the incoming data rate was slower than the batch processing rate. Today, data is streaming into the server in real time and has a very short shelf life. As such, the need to analyze data-in-motion, rather than only data-at-rest, is critical. Sometimes, the competitive advantage for an organization is decided by identifying a trend, problem, or opportunity a few minutes or even a few seconds

before someone else. There is a difference between “How many people live in

London” and “How many people are currently in London”. Dealing effectively with

Big Data requires performing analytics against the volume and variety of data while

it is still in motion, not just after it is at rest.

Source: Scale DB

Fig: Velocity & Big Data


Warehouse Vs. Hadoop (The Versus Thing)

Gartner in its 2014 Data Warehouse Database Management Systems Magic

Quadrant said “Entering 2014, the hype around replacing the data warehouse

gives way to the more sensible strategy of augmenting it.” Data in a warehouse

goes through multiple quality rigors: Cleaning, enrichment, matching, glossary,

metadata, master data management, modeling, and other services before it’s ready

for analysis. This is a very expensive and time consuming process. Having said that,

Business realizes that the data in the Data Warehouse is required for reporting and

BI purposes, which is essential for its functioning. Hence, the "high compute per byte" (high computation cost) is associated with the "high value per byte" warehouse data.

The difference between traditional BI analytics and Big Data analytics is shown in

the below figure:

Source: Storage Networking Industry Association (SNIA)

Fig: Big Data Is Different From Business Intelligence


In contrast, Big Data repositories rarely undergo the full quality rigors applied to Data Warehouse data. Hadoop data might seem to be of "low value per byte", but it also has a "low compute per byte" cost. With the volume and velocity of today's data, we cannot afford to cleanse and document every piece of data properly, because it's not going to be economical. The data in Hadoop might sit for a while awaiting analysis, and when its value is discovered, it might migrate its way to the data warehouse after passing through the quality rigors associated with Data Warehouse data.

03 major considerations for Big Data technologies:

(a) Big Data solutions are ideal for analyzing not only raw structured data, but semi-structured and unstructured data as well.

(b) Big Data solutions are ideal when all of the data needs to be analyzed, rather than just a sample of the data.

(c) Big Data solutions are ideal for iterative and exploratory analysis, when business measures on the data are not predetermined.

Hadoop is not meant for high-performance interactive use, nor does it support database features like schemas, indexes, optimizers, data structures, data models etc.

Source: Google Images


Source: Storage Networking Industry Association (SNIA)

Fig: Business Requirement Case Study

Big Data solutions aren't a replacement for traditional and existing warehouse solutions. Data bound for the analytic warehouse has to be cleaned, documented, and trusted before it's neatly placed into the warehouse. A Big Data solution gives up some of those formalities and that strictness around the data.

The next figure, provided by Data Warehouse and Big Data market leader Teradata (Gartner 2014), explains the best approach by workload and data type, i.e. when to use what:

Legend:

(a) STABLE SCHEMA: Financial Analysis, OLAP, Enterprise-wide BI,

Reporting, Active Intelligence etc.

(b) EVOLVING SCHEMA: Interactive data discovery, web clickstream, social

feeds, set-top box analysis, sensor logs, JSON etc.

(c) FORMAT, NO SCHEMA: Image processing, audio/video storage and

refining, storage and batch transformation.


Source: Teradata

Fig: Teradata Aster [When To Use What] Mapping As Per Requirement

Your information platform shouldn't go into the future without these two important entities working together, because a cohesive analytic solution delivers premium results.

Source: SAS Best Practices 2013

Fig: Big Data & Data Warehouse Together = Premium Results


Use Cases Of Big Data

The early companies to embrace Big Data were Google, LinkedIn, Facebook, eBay

etc. These companies didn’t have to reconcile or integrate big data with the

traditional sources of data and perform analytics on them as they were built around

Big Data from the beginning. For these companies, Big Data could stand alone, Big

Data analytics could be the only focus of analytics and Big Data technology

architecture could be the only architecture.

However, large and well-established businesses should integrate their Big Data technologies with everything else going on in their company, i.e. analytics on Big Data should co-exist with analytics on other types of data.

Below, we list 05 instances of Big Data use by popular companies:

[Five use-case figures. Source: International Institute for Analytics Study Sponsored by SAS, May 2013]

Big Data Demographics

NVP [NewVantage Partners] conducted a survey on Big Data adoption in 2013 with the participating companies shown in the figure below. The survey participants included Chief Information Officers, Chief Analytics and Risk Officers, Chief Technology Officers, Chief Marketing Officers, Senior Line-of-Business Executives (EVP/SVP), Chief Architects, and Heads of Big Data and Analytics.

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


(a) BIG Data Acceptance In Terms Of Industry

[Financial Services' usage of Big Data stems from the development of highly sophisticated customer analytics and predictive behavior models, fraud detection, risk analytics etc. Health Care & Life Sciences firms are in a nascent stage of Big Data adoption, with only 17% of LS & HC executives reporting Big Data systems operational in production, compared with 33% for Financial Services]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(b) BIG Data Initiative Status

[In 2012, 85% of executives indicated they were embarking on initial forays into Big Data initiatives. In 2013, 91% of executives were planning or had embarked on a Big Data initiative. Also, 68% of executives reported investing more than $1MM in Big Data initiatives]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


(c) BIG Data Initiative Being Planned By Organization

[The most significant driver for Big Data initiatives was to enhance analytics power and capabilities as a means to compete more successfully and operate more efficiently, cited by 70% of executives. Effective integration of existing data, be it structured, un-structured, or semi-structured, was the driving factor for 69% of executives]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


(d) Primary Focus Of Big Data Analysis Initiative

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

Most Big Data initiatives do not currently require any ROI payback analysis to justify their investment in Big Data, with 50% of respondents indicating a long-term strategic investment. The figure below shows whether an ROI analysis was conducted before approving the Big Data investment:

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


(e) Factors Critical To Business Adoption Of Big Data Initiatives

[Executive Sponsorship is the most critical factor driving Big Data initiative]

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(f) Primary Business Benefit Expected By Big Data Analysis

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


(g) Big Data Solutions Used

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013

(h) Analytics & Visualization Solutions Used

Source: NewVantage Partners (NVP) Big Data Executive Survey 2013


All About Hadoop

The problem is obvious: There is a staggering amount of data lying around in every

enterprise of various formats and types, with more and more data being added to the

repository each and every moment. But enterprises aren't sure whether to continue storing the data, or analyze it, or whether there is any value in it.

It will not be wrong to say that Big Data is the culmination of technological advancements which have spiraled the volume, variety and velocity aspects of data beyond the organization's grasp. People and organizations have attempted to tackle

this problem from many different angles. The angle which is currently leading the

pack in terms of popularity is an open source project called Hadoop.

Source: Storage Networking Industry Association (SNIA)

Fig: Hadoop Adoption

Hadoop is a top-level Apache project in the Apache Software Foundation written in

Java. As a definition, we can describe Hadoop as a computing environment built on

top of a distributed clustered file system that was designed specifically for very

large-scale operations.


Hadoop is based on Google's work on the MapReduce programming paradigm. Unlike

traditional systems, Hadoop is designed to scan through large data sets to produce

its results through a highly scalable, distributed batch processing system. Hadoop is

not about speed-of-thought response times, real-time warehousing, or blazing

transactional speeds as mentioned in the Big Data vs. Data Warehouse section. It is

about discovery and making the once near-impossible possible from a scalable and

analytic perspective.

The Hadoop project has 03 major components:

(a) Hadoop Distributed File System (HDFS)

(b) Hadoop MapReduce.

(c) Hadoop Common.

Source: Google Images

Fig: Hadoop Base Components

One of the key aspects of Hadoop is the redundancy built into it. Hadoop

recognizes that failure is a norm rather than an exception. As mentioned earlier,

Hadoop achieves the performance and scalability via commodity based hardware

(ready-made and easily available inexpensive hardware). It is well known that

commodity based hardware will fail, especially when you have large numbers of

them. But the redundancy built into Hadoop provides the fault tolerance and the

capability of Hadoop to heal itself. This allows Hadoop to scale out workloads across

large clusters of inexpensive machines to work on Big Data problems.


Hadoop Distributed File System [HDFS]

The Hadoop Distributed File System [HDFS] is a distributed file system designed to

run on commodity hardware. HDFS is highly fault tolerant and is designed to be

deployed on low cost hardware.

Data in a Hadoop cluster is broken down into smaller pieces called Blocks and distributed across the cluster. In this way, the map & reduce functions can be executed on smaller subsets of your larger data sets, and this provides the scalability needed for Big Data processing.

Source: Apache Hadoop Org

Fig: Hadoop HDFS Architecture

The goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. As the probability of commodity hardware failure is high, Hadoop has built-in fault tolerance and fault-compensation facilities. Any data is divided into blocks, and copies of these blocks are stored on

other servers in the Hadoop Cluster. That is, an individual file is actually stored as

smaller blocks on several servers in the entire cluster.
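To make this concrete, below is a minimal sketch in Java (assuming the standard Hadoop client libraries and a hypothetical NameNode address hdfs://namenode:8020) of how an application writes a file into HDFS and reads it back through the FileSystem API. The splitting into blocks and the replication happen transparently behind this interface.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/employee.txt");

        // Write: HDFS splits the stream into blocks and replicates them for us.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("391485,HC\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches each block from the nearest DataNode holding a replica.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}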


In Hadoop, a file is broken down into multiple "n" blocks. Each of these "n" blocks is replicated across 03 servers by default [the block size and the replication factor can be customized on a per-file basis, e.g. a Development Hadoop cluster needn't have any replication]. Coordination amongst all the servers has a significant overhead, so the ability to process large chunks of data locally helps improve both performance and communication overhead.

For example: Imagine a file having all the Employee Ids of an organization. This file

is divided into say, 03 parts (Block 1, Block 2, and Block 3) and is stored across

multiple servers in a Hadoop Cluster.

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: Block Replication across HDFS

Here, Block 1 is replicated by Block 1` and Block 1``. Same applies for Block 2 and

Block 3. With the default replication factor of 3, each block is replicated thrice.

This redundancy has 02 major advantages:

(a) High availability

(b) It allows Hadoop to break work into smaller chunks and run those jobs on all servers in the cluster for better scalability.
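As mentioned above, the replication factor can be customized per file. The sketch below (the file path and factor are hypothetical, and the cluster configuration is assumed to be on the classpath via core-site.xml) shows how a client could lower the replication of a scratch file to a single copy through the same FileSystem API; the equivalent shell command is hdfs dfs -setrep.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml on the classpath points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical development file: keep a single copy instead of the default 3.
        boolean accepted = fs.setReplication(new Path("/dev/scratch/employee_ids.txt"), (short) 1);
        System.out.println("Replication change accepted: " + accepted);
    }
}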


Assumptions & Goals

(a) Hardware Failure: Hardware Failure is a norm rather than an exception. As

HDFS uses thousands of commodity hardware components, and each component has a non-trivial probability of failure, some component of HDFS is always at fault. However, this failure is expected and handled as part of the design of Hadoop.

(b) Streaming Data Access: HDFS is designed for batch processing, rather than

interactive use by users. It is not a general-purpose distributed file system for stand-alone data. The emphasis is on continuous, high-throughput streaming access to the data sets.

(c) Large Data Sets: A typical file in Hadoop is terabytes in size. HDFS is built

to support huge data sets.

(d) Simple Coherency Model: HDFS applications have a write-once-read-many

access model for files. A file once created, written, and closed need not be changed. This assumption greatly simplifies data coherency issues. There is a plan to introduce appending data to files in a future release of Hadoop.

(e) “Moving Computation To Data, Rather Than Data To Computation”:

Hadoop believes moving computation to data is much faster than moving data

to computation. This is very true for large data sets. Moving huge data sets to

the application greatly increases the network bandwidth usage.

(f) Portability: HDFS is designed to work on any platform supporting Java. With

highly portable Java at the helm, Hadoop facilitates widespread adoption as a platform of choice for applications with large data sets.


DataNodes & NameNodes

Hadoop has a Master/Slave architecture. All of Hadoop's data placement is managed by a special server called the NAMENODE. Each Hadoop cluster has 1 NameNode

Server assigned to it. This server keeps track of all the data in the HDFS. All the

NameNode’s information is stored in memory, which allows quick response time to

storage manipulation and read requests. In other words, NameNode deals with

CLUSTER METADATA. The obvious scenario that comes to mind is a Single Point of Failure (SPOF), with all these details stored on a single server. Hence, it is advisable to choose more robust server components for the NameNode as compared to other servers.

The initial version of Hadoop had only 1 NameNode server. Hadoop version 0.21 included the capability of a Backup Node, which acts as a cold standby for the NameNode.

Source: Storage Networking Industry Association (SNIA)

Fig: Hadoop Master Slave Architecture


DataNodes manage the storage attached to each node in the cluster. Usually, 1 DataNode is assigned to 1 node. Internally, when a file is divided into multiple blocks, each block is assigned to multiple DataNodes [03 by default]. Overall, the NameNode handles namespace operations like opening, closing, and renaming files, in addition to determining the mapping of blocks to DataNodes. The DataNodes are responsible for servicing read and write requests.

Source: Google Images

Fig: NameNode & DataNode Functionality

When you fire a job for inserting data into HDFS or for retrieving data from HDFS,

Hadoop has the responsibility of communicating with NameNode for necessary

information, be it storage location and replication details for INSERT operation or

the server from which the data for a SELECT operation is to be fetched. In other words, your application needn't reference the NameNode directly; Hadoop does it on your behalf.

Hadoop isn't UNIX (POSIX) compliant. This means that all the familiar commands for copying, deleting, inserting, opening, moving etc. are available in slightly different forms with HDFS. To work around this, either we can develop our own Java

applications to perform some of the functions, or we can use Hadoop components

available readily in the Apache Software Foundation.


Data Replication & Integrity

HDFS is designed to store huge files, and each file is divided into many blocks, with

all the blocks being the same size except the last block. All these blocks are replicated for fault tolerance. The block size and replication factor are configurable for each file. The replication factor for a file can also be changed later.

The NameNode makes all the decisions regarding the block’s replication. It also

receives a Heartbeat and a BlockReport from each DataNode. A Heartbeat signifies that the DataNode is functioning properly. The BlockReport contains the list of all blocks on a DataNode.

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: Heartbeat & Block Report.

The placement of replicas is crucial. Optimizing the replica placement requires lots of tuning and optimization. Hadoop uses a Rack Awareness

policy for replication. A group of nodes forms a Rack. By default, the Replication

Factor is 03. A simple policy will be to place one replica per rack. This policy will

ensure even distribution of blocks, but increases the cost of writes as a write needs

to transfer blocks to multiple racks.

Page 30: Big Data Fundamentals

29 | P a g e

DECODING THE BIG DATA HYPE

Hadoop's Rack Awareness policy is to put the first replica on one node in the local rack, another on a different node in the same local rack, and the last on a node in a different rack. One third of the replicas are on one node, two thirds of the replicas are in one rack, and the other third are evenly distributed across the remaining racks.

The necessity of block re-replication may arise due to many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased. Each DataNode's block information is sent to the NameNode periodically via the Heartbeat and BlockReport messages.

It is quite possible for a fetched block of data to arrive corrupted. Whenever a block of a file is stored on a DataNode, a checksum is computed for data integrity. This checksum is used to validate the data blocks whenever data is received from a DataNode by the client.
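As a toy illustration of the checksum idea (this is not HDFS's internal implementation, which keeps per-chunk checksums alongside each block), the sketch below computes a CRC32 over a block of bytes when it is written and verifies it again when the block is read back:

import java.util.zip.CRC32;

public class ChecksumSketch {
    // Compute a CRC32 checksum over a block of bytes.
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] block = "some block of employee data".getBytes();

        long storedAtWriteTime = checksum(block);    // recorded when the block is written
        long recomputedAtReadTime = checksum(block); // recomputed when the block is fetched

        // If the two values differ, the replica is treated as corrupt and another replica is read.
        System.out.println(storedAtWriteTime == recomputedAtReadTime ? "block OK" : "block corrupt");
    }
}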

Page 31: Big Data Fundamentals

30 | P a g e

DECODING THE BIG DATA HYPE

Data Blocks Organization & Pipelining

When a client requests a file write, it doesn't reach the NameNode immediately. In

fact, the HDFS Client caches the data into a temporary file until the accumulated

data is worth over one HDFS block size. At this point of time, the Client contacts

the NameNode. The NameNode inserts the file name into the file system hierarchy

and allocates a data block for it. The NameNode responds to the client request with

the identity of the DataNode and the destination data block. Then the client flushes

the block of data from the local temporary file to the specified DataNode. When a

file is closed, the remaining un-flushed data in the temporary local file is transferred

to the DataNode. The client then tells the NameNode that the file is closed. At this

point, the NameNode commits the file creation operation into a persistent store. If

the NameNode dies before the file is closed, the file is lost.

Source: Storage Networking Industry Association (SNIA)

Figure: Pipelined Flow of HDFS Read Operation

Page 32: Big Data Fundamentals

31 | P a g e

DECODING THE BIG DATA HYPE

Explanation of HDFS File Read Operation:

(a) The HDFS Client requests a file to be read for its operation.

(b) The NameNode is contacted for the file information. The information sought is the addresses of the file's blocks across the DataNodes.

(c) The NameNode fetches the block information based on the latest BlockReport sent by each DataNode.

(d) The NameNode sends this information back, listing for each block the DataNodes that hold a replica, ordered by their distance from the client (the node holding the nearest replica first).

Example: Assume a file has 2 blocks (B1 and B2) with a replication factor of 02. Node 1 contains B1 and B2`; Node 2 contains B2 and B1`. When the NameNode delivers the block ordering, it places the node containing the first replica first, followed by the other nodes carrying replicas of the same block. For the above example:

B1 (Node1, Node2)

B2 (Node2, Node1)

(e) The InputStream will fetch the data from the ordering specified, choosing the

first Node containing the block, and then checking the checksum to verify data

correctness. If correct, then the block from the first node is used; otherwise, the second node's block will be used.

Page 33: Big Data Fundamentals

32 | P a g e

DECODING THE BIG DATA HYPE

Source: Storage Networking Industry Association (SNIA)

Figure: Pipelined Flow of HDFS Write Operation

FIG: WRITE PIPELINE. The normal line represents communication between the Client & HDFS, the solid line represents data transfer, and the dotted line represents acknowledgement. Time t0: Client requests a WRITE operation. Time t1-t2: Packets of data are sent from the Client to HDFS in BLOCK-size units. Time t2: CLOSE signal. Time t3: Hadoop saves the file.

Page 34: Big Data Fundamentals

33 | P a g e

DECODING THE BIG DATA HYPE

The pipelined approach for the WRITE operation explained above is illustrated in the figure below:

Source: VMware’s Networking and Security Business Unit [Brad Hedlund]

Fig: HDFS Write Operation

The policy of putting the first 2 replicas in the same rack, and then the 3rd replica in

another rack is based on Hadoop Rack Awareness Policy, explained in the “Data

Replication & Integrity” section.

Page 35: Big Data Fundamentals

34 | P a g e

DECODING THE BIG DATA HYPE

HDFS File Permission Guide

The Hadoop Distributed File System (HDFS) implements a permissions model for

files and directories that shares much of the POSIX model. Each file and directory

is associated with an owner and a group. The file or directory has separate

permissions for the user that is the owner, for other users that are members of the

group, and for all other users. For files, the r permission is required to read the file,

and the w permission is required to write or append to the file. For directories, the r

permission is required to list the contents of the directory, the w permission is

required to create or delete files or directories, and the x permission is required to

access a child of the directory.

Each client process that accesses HDFS has a two-part identity composed of the user

name, and groups list. Whenever HDFS must do a permissions check for a file or

directory foo accessed by a client process,

(a) If the user name matches the owner of foo, then the owner permissions are

tested;

(b) Else if the group of foo matches any member of the groups list, then the

group permissions are tested;

(c) Otherwise the other permissions of foo are tested.

If a permissions check fails, the client operation fails.
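The three-step check above can be paraphrased in code. The sketch below (the user name and group list are hypothetical, and in reality the NameNode performs this check internally) uses Hadoop's FileStatus and FsAction types to decide whether a caller may read a path:

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionCheckSketch {
    // Mirrors the owner -> group -> other decision order described above.
    static boolean mayAccess(FileStatus stat, String user, List<String> groups, FsAction wanted) {
        FsPermission perm = stat.getPermission();
        if (user.equals(stat.getOwner())) {
            return perm.getUserAction().implies(wanted);   // owner permissions
        } else if (groups.contains(stat.getGroup())) {
            return perm.getGroupAction().implies(wanted);  // group permissions
        } else {
            return perm.getOtherAction().implies(wanted);  // other permissions
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml on the classpath points at the cluster.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus stat = fs.getFileStatus(new Path("/user/demo/foo"));

        // Hypothetical client identity.
        boolean canRead = mayAccess(stat, "smarak", Arrays.asList("analysts"), FsAction.READ);
        System.out.println("read allowed: " + canRead);
    }
}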

Page 36: Big Data Fundamentals

35 | P a g e

DECODING THE BIG DATA HYPE

Basics Of MapReduce

MapReduce is the heart of Hadoop. It is the programming paradigm that allows for

massive scalability across hundreds or thousands of servers in a Hadoop cluster.

The term MapReduce refers to two separate and distinct tasks: Map & Reduce. The

“Map” takes data as input and converts it into another set of data, where every

individual element is broken down into tuples (key/value pairs). The "Reduce"

takes the output of “Map” as input and combines those tuples into a smaller set of

tuples. As the sequence of the name MapReduce suggests, the "Reduce" job is always performed after the "Map" job.

For Example: Consider 3 files containing animal names, where we wish to count the number of occurrences of each animal. First, the input files are split into blocks. For each block, a Map task calculates the number of occurrences of each animal. The Shuffle stage of MapReduce takes the output of the Map tasks and directs it to the appropriate Reduce tasks for consolidation and delivery of the final result.

Source: Google Images

Fig: MapReduce Operation
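The animal-count example can be written directly in the classic Hadoop Java MapReduce API. The sketch below is a minimal version (the HDFS input and output paths are hypothetical, and one animal name per line is assumed): the Mapper emits (animal, 1) pairs, the shuffle groups them by animal, and the Reducer sums the counts.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AnimalCount {

    // Map: emit (animal, 1) for every input line.
    public static class AnimalMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text animal = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            animal.set(value.toString().trim().toLowerCase());
            context.write(animal, ONE);
        }
    }

    // Reduce: sum the 1s for each animal after the shuffle has grouped them.
    public static class AnimalReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text animal, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            context.write(animal, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "animal count");
        job.setJarByClass(AnimalCount.class);
        job.setMapperClass(AnimalMapper.class);
        job.setCombinerClass(AnimalReducer.class); // local pre-aggregation before the shuffle
        job.setReducerClass(AnimalReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the input files and the job output.
        FileInputFormat.addInputPath(job, new Path("/user/demo/animals"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/animal_counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}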

Page 37: Big Data Fundamentals

36 | P a g e

DECODING THE BIG DATA HYPE

In Hadoop, every MapReduce program is called a "JOB". A job is executed by breaking it down into smaller pieces called "TASKS". Every Hadoop cluster has a program

running called “JOBTRACKER”.

The JobTracker communicates with the NameNode to find out where all the data required

by the submitted job exists across the cluster and also breaks the job into map and reduce

tasks. These tasks are then scheduled on all the servers where the data exists. It is very

possible for a task to be scheduled on a server where the required data isn't available locally [even though each block is replicated across 3 servers by default, those servers may be busy]. In this case, the server will ask for

the necessary data to be transferred across the network interconnect to perform its task. As

this is not very efficient, the JobTracker tries to avoid this and attempts to schedule the job

on the servers where the data actually resides.

Another set of programs, called "TASKTRACKERS", is responsible for monitoring the status of every running task. If any task fails, the status of the failure is reported back to the JobTracker, which will then reschedule the task. We can decide how many times a failed

task will be attempted before the entire job is cancelled.

Source: Storage Networking Industry Association (SNIA)

Fig: MapReduce Basic Concepts
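The number of attempts is a job configuration setting. A small hedged sketch is shown below (it uses the property names from the newer MapReduce API; the classic JobTracker era used mapred.map.max.attempts):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Retry each failed map or reduce task up to 4 times before failing the whole job.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        Job job = Job.getInstance(conf, "job with explicit retry limits");
        // ... set the mapper, reducer, input and output paths as in the AnimalCount example above ...
    }
}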

Page 38: Big Data Fundamentals

37 | P a g e

DECODING THE BIG DATA HYPE

SHUFFLE & COMBINER are 02 features of Hadoop used in MapReduce. Shuffle takes the output of the map tasks and directs it to the appropriate reduce tasks. If we wish to perform some aggregation or other transformation on the output of the map tasks before sending it to the reduce tasks, then we can use a Combiner. The greater the number of reduce tasks, the more the coordination overhead, but the better the parallelism and overall performance.

Source: Storage Networking Industry Association (SNIA)

Fig: MapReduce & HDFS Together

Page 39: Big Data Fundamentals

38 | P a g e

DECODING THE BIG DATA HYPE

Hadoop Common Components

Hadoop Common Components are a set of libraries that support the various Hadoop

subprojects. As mentioned before, Hadoop isn't UNIX (POSIX) compliant. To interact with HDFS, we need to use the /bin/hdfs dfs <args> file system shell command interface, where <args> represents the command arguments.

Source: Google Images

Fig: HDFS Shell Commands

Page 40: Big Data Fundamentals

39 | P a g e

DECODING THE BIG DATA HYPE

Application Development in Hadoop

To use Hadoop natively, we need to write MapReduce programs in Java for interacting with HDFS. Programmers also need to develop and maintain MapReduce applications for business applications that require long and pipelined processing.

To abstract away some of the complexity of the Hadoop programming model, several application development languages have emerged that run on top of Hadoop. The popular ones are Pig, Hive, JAQL etc.

Pig & PigLatin

Pig was developed by Yahoo! to allow people using Hadoop to focus more on

analyzing data rather than writing the mapper and reducer programs. As the animal

Pig eats anything, the Pig programming language is designed to handle any kind of

data. Pig is made up of two components: the PigLatin language and the Pig runtime environment where PigLatin programs are executed. The relationship between PigLatin and the Pig runtime environment is similar to that between a Java application and the JVM.

Source: Google Images

Fig: Pig & PigLatin

The commonly used commands in Pig are:

(a) LOAD: Before writing programs via PigLatin to access data in HDFS, we

need to specify the data in HDFS which will be used. For this purpose, we use

the LOAD 'File' command (where 'File' refers to either an HDFS file or a directory). If a directory is specified, then all the files in it are loaded into the

PigLatin program.

Page 41: Big Data Fundamentals

40 | P a g e

DECODING THE BIG DATA HYPE

(b) TRANSFORM: The transformation logic is where all the data manipulation

occurs. You can use FILTER to remove rows as required, JOIN to join two sets of data, GROUP to aggregate data, ORDER to order results, etc.

For Example: To calculate the count of employees per manager belonging to

the HealthCare ISU with the input file being located in HDFS @ Employee

directory:

L = LOAD 'hdfs://node/Employee' AS (EmployeeID, ManagerID, Vertical);

FL = FILTER L BY Vertical == 'HC';

G = GROUP FL BY ManagerID;

RT = FOREACH G GENERATE group, COUNT(FL.EmployeeID);

(c) DUMP & STORE: If DUMP or STORE isn't specified, then the output of the Pig program isn't displayed or persisted. To display the output on the screen, use the DUMP command. To redirect the output to a file, use the STORE command.

After the PigLatin program is written, the Pig runtime environment translates the program into a set of map & reduce tasks and runs them under the covers on your behalf.

HIVE

Although Pig was a very powerful and simple language to understand and use, the

downside was that it was something new to learn and master. HIVE was developed

by Facebook with the intention of developing a runtime Hadoop support structure

which allows anyone fluent in SQL to leverage the power of Hadoop. Their creation was called HiveQL (Hive Query Language). HiveQL statements are broken down into MapReduce jobs by the HIVE service to be executed across the Hadoop

cluster.

For Example: Using the same scenario as above of calculating the count of

employees per manager id, the following HIVE Code performs the task of creating

a table, populating it, and then querying that table via HIVE:


CREATE TABLE Employee (EMPID BIGINT, MGRID BIGINT, VERTICAL STRING)
COMMENT 'The Above Is The Employee Table'
STORED AS SEQUENCEFILE;

LOAD DATA INPATH 'hdfs://node/Employee' INTO TABLE Employee;

SELECT MGRID, COUNT(EMPID) FROM Employee WHERE VERTICAL = 'HC' GROUP BY MGRID;

JAQL

JAQL was developed by IBM and allows processing of both structured and non-traditional data; its primary use is querying JSON data. JAQL was inspired by many programming languages like LISP, SQL, XQuery, Pig etc.

Source: Google Images

Fig: Comparison of Pig, Hive & JAQL


GETTING YOUR DATA INTO HADOOP

One of the biggest challenges with Hadoop is that it is not UNIX (POSIX) compliant. To get your data into Hadoop (HDFS), we need to use the HDFS shell commands:

(a) copyFromLocal: Copy a file from the local file system into HDFS.

(b) copyToLocal: Copy a file from HDFS into the local file system.

hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile

hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file

These commands are executed through the HDFS shell program, which is itself a Java application. The shell uses the Java APIs for getting the data into and out of HDFS.
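The same two operations can be performed from a Java program via the FileSystem API, which is roughly what the shell does underneath. The paths below are the hypothetical ones from the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://s1.n1.com");   // the NameNode used in the shell example above
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.n1.com/dir/hdfsfile
        fs.copyFromLocalFile(new Path("/user/dir/file"), new Path("/dir/hdfsfile"));

        // Equivalent of: hdfs dfs -copyToLocal hdfs://s1.n1.com/dir/hdfsfile /user/dir/file
        fs.copyToLocalFile(new Path("/dir/hdfsfile"), new Path("/user/dir/file"));
    }
}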

FLUME

Flume is an Apache project for flowing data from a source into your Hadoop

environment. In Flume, there are 03 main entities:

(a) SOURCE: A Source can be any data source. Flume has many predefined

source adapters. Some adapters allow anything coming off a TCP port to enter the flow. A number of text file source adapters give you granular control to grab a specific file and feed any new data written to that file into the flow. Look to Flume when you want data to flow from many

sources.

(b) SINK: A Sink is the target of a specific operation. There are 03 types of Sinks

in Flume. One Sink is basically the final flow destination known as

COLLECTOR TIER EVENT SINK. This is where you land a flow into the

HDFS File System. Another Sink type is AGENT TIER EVENT SINK, which

is used when you want the sink to be the input source of another operation.

When using such Sinks, Flume will ensure an acknowledgement is sent for arriving data. The final sink type is the BASIC SINK, which can be a text file, a console display, a simple HDFS path etc.


(c) DECORATOR: A Decorator is an operation on the stream that can transform

the stream in some manner, be it compressing or adding or removing

information. Very complex or enterprise-class transformations (of the kind done by IBM Information Server) aren't achievable with Flume's decorators.

Hadoop is more than just a single project; it is rather an ecosystem of projects aimed at simplifying, managing, coordinating, and analyzing large sets of data. Such

projects are listed in the following sections.

ZOOKEEPER

ZooKeeper is an open source Apache project that provides a centralized

infrastructure and services enabling synchronization across a cluster.

Imagine a Hadoop cluster spanning 500 or more servers. There is a need for

centralized management of the entire cluster in terms of name services, group

services, synchronization services, configuration management, and more. Having

ZooKeeper allows cross-node synchronization and ensures that the tasks across the cluster are serialized or synchronized. A very large Hadoop cluster can be supported by multiple ZooKeeper servers.
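As a minimal sketch of how an application might use ZooKeeper for this kind of shared state (the ensemble address and znode path below are hypothetical), the code publishes a small piece of cluster-wide configuration and reads it back:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical ensemble address; wait until the session is established.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of cluster-wide configuration as a znode...
        zk.create("/demo-config", "replication=3".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // ...which any node in the cluster can then read back consistently.
        System.out.println(new String(zk.getData("/demo-config", false, null)));

        zk.close();
    }
}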

Fig: Apache Zookeeper


OOZIE

Sometimes, many jobs need to be chained together to create a complex application.

Oozie is an open source project that simplifies the workflow and coordination

between jobs. It provides users with the ability to define actions and dependencies

between actions. Oozie will then schedule the actions to execute when the required

dependencies have been met.

A workflow in Oozie is defined via a DAG (Directed Acyclic Graph), where all the tasks and their dependencies are specified without any loops; the nodes of the graph represent actions and control-flow operations.

LUCENE

Lucene is an extremely popular open source Apache project for text search. Lucene

predates Hadoop and has been a top level Apache project since 2005. If you have

searched on the Internet, it’s very likely that you have used Lucene and yet not

known it.

In a nutshell, if you wish to search for text in a large file or a set of documents, then

Lucene breaks the documents into text fields and builds an index on these fields. The

index is the key component of Lucene, as it is the basis of its rapid search capabilities.
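A minimal sketch of that index-then-search flow using the Lucene Java API is shown below (it uses an in-memory index and a single hypothetical document, and assumes the lucene-core and lucene-queryparser libraries on the classpath):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();   // in-memory index for the sketch

        // Index: break the document into a text field and add it to the index.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                    "Hadoop scales batch analytics on commodity hardware", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search: the inverted index on the "body" field makes the lookup fast.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("hadoop");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}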

AVRO

AVRO is an Apache project offering data serialization.