Big Data & HadoopBy Mr.Nataraj
UNITS OF DATA

The smallest unit is the bit.
1 byte = 8 bits
1 KB (Kilobyte) = 1024 bytes = 1024 * 8 bits
1 MB (Megabyte) = 1024 KB = (1024)^2 * 8 bits
1 GB (Gigabyte) = 1024 MB = (1024)^3 * 8 bits
1 TB (Terabyte) = 1024 GB = (1024)^4 * 8 bits
1 PB (Petabyte) = 1024 TB = (1024)^5 * 8 bits
1 EB (Exabyte) = 1024 PB = (1024)^6 * 8 bits
1 ZB (Zettabyte) = 1024 EB = (1024)^7 * 8 bits
1 YB (Yottabyte) = 1024 ZB = (1024)^8 * 8 bits
1 XB (Xenottabyte, unofficial) = 1024 YB = (1024)^9 * 8 bits
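As a quick illustration (a minimal sketch, not from the original slides), the powers-of-1024 table above can be reproduced in Java, the language Hadoop itself is written in:

import java.math.BigInteger;

public class DataUnits {
    public static void main(String[] args) {
        String[] units = {"KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"};
        BigInteger bytes = BigInteger.ONE;
        for (String unit : units) {
            // each unit is 1024 times the previous one
            bytes = bytes.multiply(BigInteger.valueOf(1024));
            System.out.println("1 " + unit + " = " + bytes + " bytes = "
                    + bytes.multiply(BigInteger.valueOf(8)) + " bits");
        }
    }
}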
HOW BIG ARE THOSE NUMBERS

1 byte = a single character
1 KB = a very short story
1 MB = a small novel (6 seconds of TV-quality video)
1 GB = a pickup truck filled with paper
1 TB = 50,000 trees made into paper
2 PB = all US academic research libraries
5 EB = all words ever spoken by human beings
WHAT IS BIG DATA
SOME INTERESTING FACTS
• Google: 2,000,000 queries per second
• Facebook: 34,000 likes per minute
• Online shopping worth USD 300,000 per minute
• 100,000 tweets on Twitter per minute
• 600 new videos uploaded to YouTube per minute
• Barack Obama used Big Data to win an election
• Driverless cars use Big Data processing to drive vehicles
• AT&T transfers about 30 petabytes of data through its networks each day
• Google processed about 24 petabytes of data per day in 2009
• The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects
As of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translated to a total of 960 billion images and an estimated 357 petabytes of storage
PROCESSING CAPABILITY
Google processes 20 PB a day
Facebook: 2.5 PB of user data + 15 TB/day
eBay: 6.5 PB of data + 50 TB/day
Evolution of Hadoop
• Doug Cutting, working on the Lucene project (a search engine to index and search documents), ran into problems of storage and computation and was looking for distributed processing.
• Google published a paper on GFS (Google File System).
• Doug Cutting & Michael Cafarella implemented the ideas from GFS and came out with Hadoop.
WHAT IS HADOOP
• A framework written in Java for running applications on large clusters of commodity hardware.
• Mainly contains 2 parts:
  – HDFS for storing data
  – Map-Reduce for processing data
• Maintains fault tolerance using a replication factor.
• employee.txt (eno,ename,empAge,empSal,empDes)
• 101,prasad,20,1000,lead

Assume you have billions of such records and you would like to find all the employees above 60 years of age. How would you program that traditionally?
At 10 GB ≈ 10 minutes, 1 TB ≈ 1,000 minutes ≈ 16 hours on a single machine.
Google processes 20 PB of data per day. 20 PB ≈ 20,971,520 GB, so at that sequential rate it would take about 350,000 hours, roughly 40 years, on one machine. A sketch of the distributed alternative follows below.
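As a sketch of how this filter would look in Hadoop rather than as a traditional sequential program: the mapper below (class name AgeFilterMapper is illustrative, not from the slides) parses each line of employee.txt and emits only employees older than 60. Many copies of this mapper run in parallel, each on the node holding its block of the data.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AgeFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each line: eno,ename,empAge,empSal,empDes
        String[] fields = value.toString().split(",");
        int age = Integer.parseInt(fields[2].trim());
        if (age > 60) {
            context.write(NullWritable.get(), value); // emit matching records only
        }
    }
}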
INSPIRATION FOR HADOOP
To store huge data (unlimited)
To process huge data
Basic Terminology
• Node: a single computer with its own processor and memory.
• Cluster: a combination of nodes acting as a single unit.
• Commodity Hardware: cheap, not highly reliable hardware.
• Replication Factor: how many times data gets duplicated and saved in more than one place.
• Data Local Optimization: data will be processed locally, on the node where it is stored.
• Block: a part of the data. Example: 1 file of 200 MB is stored as blocks (50 MB + 50 MB + 50 MB + 50 MB) spread across node1, node2, node3, ...
• Block size: the size of data that can be stored as a single unit.
• Apache Hadoop default: 64 MB (configurable)
  – 1 GB in Apache Hadoop = 16 blocks
  – a 65 MB file = 64 MB + 1 MB (2 blocks)
• Replication: duplicating the data; the default replication factor is 3.
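The block arithmetic above is just a ceiling division; a minimal Java sketch (the numbers follow the slide's examples):

public class BlockCount {
    // number of blocks needed to hold a file at a given block size
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long blockSize = 64 * MB;                         // Apache Hadoop default (configurable)
        System.out.println(blocks(1024 * MB, blockSize)); // 1 GB  -> 16 blocks
        System.out.println(blocks(65 * MB, blockSize));   // 65 MB -> 2 blocks (64 MB + 1 MB)
    }
}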
NODE CLUSTER
SCALING
• Vertical Scaling: adding more powerful hardware to an existing system. Will scale only up to a certain limit.
• Horizontal Scaling: adding a completely new node to an existing cluster. Will scale out to many nodes.
3 V's of Big Data
• Volume: the amount of data generated
• Variety: structured data (e.g., database table data) and unstructured data
• Velocity: the frequency at which data is generated
HADOOP VS RDBMS
1. Hadoop believes in scale-out instead of scale-up: when needed, buy more oxen, don't grow your ox more powerful.
2. Hadoop works on structured as well as unstructured data; an RDBMS only works with structured data. (However, nowadays many NoSQL databases have come out in the market, like MongoDB and Couchbase.)
3. Hadoop believes in key-value pairs rather than data in columns, as sketched below.
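To make point 3 concrete, a minimal sketch (illustrative, not from the slides) of the same employee record seen as an RDBMS-style row of columns versus a Hadoop-style key-value pair:

import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyValueExample {
    public static void main(String[] args) {
        // RDBMS view: fixed columns (eno, ename, empAge, empSal, empDes)
        String row = "101,prasad,20,1000,lead";

        // Hadoop view: an opaque key-value pair, e.g. key = eno, value = the rest
        Map.Entry<String, String> kv = new SimpleEntry<>("101", "prasad,20,1000,lead");

        System.out.println("row:       " + row);
        System.out.println("key-value: " + kv.getKey() + " -> " + kv.getValue());
    }
}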
HADOOP ALTERNATIVES

No doubt Hadoop is a framework for processing big data, but it is not the only framework to do so. Below are a few alternatives:
Apache Spark
GraphLab
HPCC Systems (High Performance Computing Cluster)
Dryad
Stratosphere
Storm
R3
Disco
Phoenix
Plasma
DOWNLOADING HADOOP

You can download Hadoop from:
http://hadoop.apache.org/releases.html
http://apache.bytenet.in/hadoop/common/
· 18 November, 2014: Release 2.6.0 available
· 27 June, 2014: Release 0.23.11 available
· 1 Aug, 2013: Release 1.2.1 (stable) available
HADOOP CORE COMPONENTS

HADOOP does two things:
1. Storing huge data (HDFS)
2. Processing huge data (Map-Reduce)

Hadoop Daemons
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Task Tracker
5. Data Node
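A quick way to verify which daemons are running on a node is the JDK's jps tool, which lists running Java processes. On a Hadoop 1.x node with all five daemons started, the output looks something like the following (process IDs are illustrative):

$ jps
2481 NameNode
2601 DataNode
2725 SecondaryNameNode
2810 JobTracker
2934 TaskTracker
3001 Jps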
Modes in Hadoop
Standalone Mode
Pseudo Distributed Mode
Fully Distributed Mode
Standalone Mode
It is the default mode
1 node
No separate processes (daemons) will be running
Everything runs in a single JVM
Used for small development, testing, and debugging
Pseudo Distributed Mode
1. A single node, but a cluster will be simulated
2. Daemons will run as separate processes in separate JVMs
3. Used for development and debugging
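As a sketch, pseudo-distributed mode is typically configured by pointing Hadoop at a local HDFS and Job Tracker and reducing replication to 1, since there is only one node (Hadoop 1.x property names; the host/port values shown are the conventional defaults, adjust as needed):

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>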
Fully Distributed Mode
1. Multiple nodes
2. Hadoop will run on a cluster of machines/nodes
3. Used in production environments
Hadoop Architecture
Ecosystem Components
Hive
Pig
Sqoop
Avro
Flume
Oozie
HBase
Cassandra
Job Tracker
HDFS write