1. GlusterFS and Hadoop
Shubhendu Tripathi, PSE, Red Hat
2. Agenda (06/22/15)
What is BigData
Hadoop and its Evolution
Hadoop Architecture and Components
Hadoop and GlusterFS (glusterfs-hadoop plugin)
Advantages of using GlusterFS with Hadoop
References
3. What is BigData
Software solutions mostly capture, maintain and manage data: storing data and processing data.
Data size keeps growing in today's world.
Big data generators: sensors, CC cameras, social networks, online shopping portals, airlines, hospitality.
5. What is BigData
90% of the data we have today was generated in the last 2 years.
1990: HDD 1-20 GB, RAM 14-128 MB, speed 10 kbps.
2014: HDD 0.5-1 TB, RAM 1-16 GB, speed 100 Mbps.
Three factors define BigData: volume, velocity, and variety (unstructured and semi-structured data).
6. What is BigData
One option is a SAN (Storage Area Network): store the data in data centers, fetch it on a need basis, and perform the computation on it locally.
Computation is processor bound, and there is a limit on it.
As the data grows, more and more computation is needed, and it is not possible to perform all of it on a local machine.
Solution: send the computation to the storage node and get back the processed data. This is the better option because the computation is small compared to the data.
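The trade-off above can be made concrete with a rough back-of-the-envelope sketch. The dataset size (1 TB) and job size (10 MB) are illustrative assumptions, not figures from the talk; the 100 Mbps link speed is the 2014 figure quoted earlier.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Time to push size_bytes over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


LINK_BPS = 100e6       # 100 Mbps, the 2014 link speed quoted in the deck
DATASET_BYTES = 1e12   # assumed: 1 TB of stored data
JOB_BYTES = 10e6       # assumed: 10 MB of job code and configuration

# Option 1: pull the data to the machine that computes.
move_data_seconds = transfer_seconds(DATASET_BYTES, LINK_BPS)
# Option 2: push the computation to the machine that stores the data.
move_code_seconds = transfer_seconds(JOB_BYTES, LINK_BPS)

print(f"move data: {move_data_seconds / 3600:.1f} h")
print(f"move code: {move_code_seconds:.1f} s")
```

Under these assumptions, moving the data takes on the order of a day, while moving the computation takes under a second.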
7. Hadoop Evolution
Started with the Google white papers: GFS (Google File System), 2003, for storage; MapReduce, 2004, for computation.
Yahoo: HDFS (Hadoop Distributed File System), 2006-07; MapReduce (computation mechanism), 2007-08.
Created by Doug Cutting (later at Yahoo) and Michael Cafarella.
Logo: elephant.
Apache Foundation (2005, Yahoo donated).
8. Hadoop Architecture / Components
A framework of tools, not a single application.
Supports running applications on BigData.
Open-source set of tools distributed under the Apache license.
Traditional approach to handling huge data: a powerful computer with big storage and computation capacity, limited by the processing power of that computer as the data grows.
Hadoop approach: break the data into smaller pieces and distribute them to multiple computers; break the computation into smaller pieces as well and distribute those; return the combined results.
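The Hadoop approach described above can be sketched in a few lines. This is a single-process toy, not Hadoop code: the function names are made up for illustration, and the "computers" are just list slices processed one after another.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_count):
    """Break the input into pieces, one per (imaginary) worker machine."""
    return [records[i::chunk_count] for i in range(chunk_count)]


def map_chunk(chunk):
    """The small computation each worker runs on its own piece: count words."""
    counts = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts


def combine(partials):
    """Merge the per-worker results into one answer."""
    return reduce(lambda left, right: left + right, partials, Counter())


lines = ["big data", "big storage", "data node"]
chunks = split_into_chunks(lines, 2)
partials = [map_chunk(chunk) for chunk in chunks]  # would run in parallel on a cluster
total = combine(partials)
print(dict(total))
```

In a real cluster each chunk would live on a different machine and `map_chunk` would run where the chunk is stored; only the small partial counts travel over the network.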
9. Hadoop Architecture / Components
MapReduce: Job Tracker, Task Tracker.
HDFS: Name Node, Data Node.
Applications contact the master node; a task is formed and submitted to the Job Tracker.
The Job Tracker maintains a queue of tasks and gets them processed using the Task Trackers and Data Nodes.
It consolidates the results and sends them back to the application.
10. Hadoop Architecture / Components
Hadoop works on a distributed model: numerous low-cost computers (commodity hardware).
Hadoop components:
Slaves: the Task Tracker processes the smaller piece of the task assigned to it; the Data Node manages the piece of data distributed to this node.
Master: the Job Tracker tracks the overall task; the Name Node maintains the index of the data blocks stored on different nodes.
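To illustrate what "maintaining the index of the data blocks" means, here is a minimal sketch of such an index. `ToyNameNode` and its methods are hypothetical names for this example and do not mirror the real HDFS Name Node API.

```python
from collections import defaultdict


class ToyNameNode:
    """Hypothetical in-memory index: which data node holds which block of a file."""

    def __init__(self):
        # file name -> list of (block number, data node name)
        self._index = defaultdict(list)

    def register_block(self, file_name, block_number, data_node):
        """Record that data_node stores the given block of file_name."""
        self._index[file_name].append((block_number, data_node))

    def locate(self, file_name):
        """Return the file's blocks in order, with the node storing each one."""
        return sorted(self._index.get(file_name, []))


name_node = ToyNameNode()
name_node.register_block("logs.txt", 1, "node-b")
name_node.register_block("logs.txt", 0, "node-a")
print(name_node.locate("logs.txt"))
```

The master only holds this small lookup table; the blocks themselves stay on the slaves.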
11. Hadoop Architecture / Components
[Diagram: Applications; Queue; Master (Job Tracker, Name Node); Slaves (five nodes, each with a Task Tracker and a Data Node)]
14. Hadoop and GlusterFS
GlusterFS is a general-purpose, scale-out distributed file system supporting thousands of clients.
Aggregates storage exports over a network interconnect to provide a single unified namespace.
The file system runs completely in userspace, on commodity hardware.
Layered on disk file systems that support extended attributes.
15. Hadoop and GlusterFS
Hadoop consists of a set of daemons running in the system: Name Node (centralized metadata node), Job Tracker (overall task distribution across data nodes), Task Tracker (runs on data nodes to maintain tasks), Data Node (stores data).
Hadoop = MapReduce framework + HDFS.
GlusterFS can be a replacement for HDFS.
glusterfs-hadoop-plugin: a Java module that implements the Hadoop file system interface; simply a JAR file that can be kept in the Hadoop libraries; replaces HDFS with GlusterFS.
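The idea of swapping the storage layer behind a common file system interface can be sketched as follows. This is a Python analogy, not the plugin's Java code: `FileSystem` and `MountedVolumeFileSystem` are hypothetical names, and a temporary directory stands in for a mounted volume.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical storage interface that the compute framework programs against."""

    @abstractmethod
    def read_bytes(self, path): ...

    @abstractmethod
    def list_dir(self, path): ...


class MountedVolumeFileSystem(FileSystem):
    """Serves the interface from a locally mounted volume (a plain directory here)."""

    def __init__(self, mount_point):
        self.mount_point = mount_point

    def _resolve(self, path):
        # Map an absolute path in the volume onto the local mount point.
        return os.path.join(self.mount_point, path.lstrip("/"))

    def read_bytes(self, path):
        with open(self._resolve(path), "rb") as handle:
            return handle.read()

    def list_dir(self, path):
        return sorted(os.listdir(self._resolve(path)))


# Demo: a temporary directory plays the role of the mounted volume.
demo_mount = tempfile.mkdtemp()
with open(os.path.join(demo_mount, "input.txt"), "wb") as handle:
    handle.write(b"hello")

fs = MountedVolumeFileSystem(demo_mount)
print(fs.list_dir("/"), fs.read_bytes("/input.txt"))
```

Because the framework only sees the interface, the implementation behind it can be replaced without changing the jobs that use it.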
16. Hadoop and GlusterFS
In Hadoop, data locality is ensured by the Job Tracker.
glusterfs-hadoop-plugin ensures data locality by getting the Gluster volumes mounted as FUSE mounts.
Effectively no name node is involved: only clients, where the MapReduce job runs, and data nodes to store data.
glusterfs-hadoop-plugin talks to GlusterFS using FUSE mounts.
In the absence of a name node, the plugin uses the xattrs (extended attributes) mechanism to get the details from the volume and consolidates the data using them.
It reads the data directly from the bricks, bypassing the volume, for improved performance.
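As a rough illustration of locating data through extended attributes, the sketch below reads the `trusted.glusterfs.pathinfo` attribute from a file on a FUSE mount and extracts the brick hosts. The sample string and the parsing pattern are assumptions about the attribute's layout made for this example; the exact format may differ between GlusterFS versions, and this is not the plugin's own code.

```python
import os
import re

PATHINFO_XATTR = "trusted.glusterfs.pathinfo"
# Assumed entry layout: <POSIX(brick_root):host:path_on_brick>
_POSIX_ENTRY = re.compile(r"<POSIX\(([^)]*)\):([^:>]+):([^>]+)>")


def read_pathinfo(path):
    """Read the pathinfo attribute of a file on a GlusterFS FUSE mount (Linux only)."""
    return os.getxattr(path, PATHINFO_XATTR).decode()


def brick_locations(pathinfo):
    """Return (host, path on brick) pairs found in a pathinfo string."""
    return [(host, brick_path)
            for _brick_root, host, brick_path in _POSIX_ENTRY.findall(pathinfo)]


# Illustrative value only; host and brick names are made up.
sample = ("(<DISTRIBUTE:demo-dht> (<REPLICATE:demo-replicate-0> "
          "<POSIX(/bricks/b1):node1:/bricks/b1/data.txt> "
          "<POSIX(/bricks/b2):node2:/bricks/b2/data.txt>))")
locations = brick_locations(sample)
print(locations)
```

A scheduler could use the host names to place a task on a node that already holds the file, which is the role the name node plays in HDFS.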
17. Hadoop and GlusterFS
Usage is as simple as starting the MapReduce daemons and then submitting the Hadoop task with GlusterFS as storage.
Analytics on HDFS requires moving files around the nodes, whereas GlusterFS only needs the volume to be FUSE mounted, with no moving of files.
18. Advantages
Elimination of the centralized metadata server (name node).
Compatibility with MapReduce and Hadoop-based applications.
Elimination of code rewrites for Hadoop enablement of GlusterFS.
Fault-tolerant file system.
Allows co-location of compute and data nodes, and the ability to run Hadoop jobs across multiple namespaces using multiple GlusterFS volumes.
Data access through several different mechanisms and protocols (FUSE, NFS, SMB, Swift and, of course, Hadoop).