TheEdge10 : Big Data is Here - Hadoop to the Rescue
Big Data is Here – Hadoop to the Rescue!
Shay Sofer, AlphaCSP
Today we will:
» Understand what Big Data is
» Get to know Hadoop
» Experience some MapReduce magic
» Persist very large files
» Learn some nifty tricks
On Today's Menu...
Data is Everywhere
» IDC: “Total data in the universe: 1.2 Zettabytes” (May 2010)
» 1 ZB = 1 trillion gigabytes (or: 1,000,000,000,000,000,000,000 bytes = 10^21)
» 60% growth from 2009
» By 2020 we will reach 35 ZB
Facts and Numbers
Data is Everywhere
Facts and Numbers
Data is Everywhere
Source: www.idc.com
» 234M web sites
» 7M new sites in 2009
» New York Stock Exchange – 1 TB of data per day
» Web 2.0: 147M blogs (and counting…); Twitter – ~12 TB of data per day
Facts and Numbers
Data is Everywhere
» 500M users
» 40M photos per day
» More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums etc.) shared each month
Facts and Numbers - Facebook
Data is Everywhere
» Big Data: datasets that grow so large that they become awkward to work with using on-hand database management tools
» Where and how do we store this information?
» How do we perform analyses on such large datasets?
Why are you here?
Data is Everywhere
Scale-up Vs. Scale-out
Data is Everywhere
» Scale-up: adding resources to a single node in a system, typically more CPUs or memory in a single computer
» Scale-out: adding more nodes to a system, e.g. adding a new computer with commodity hardware to a distributed software application
Scale-up Vs. Scale-out
Data is Everywhere
Introducing…Hadoop!
» A framework for writing and running distributed applications that process large amounts of data
» Runs on large clusters of commodity hardware
» A cluster with hundreds of machines is standard
» Inspired by Google’s architecture: MapReduce and GFS
What is Hadoop?
Hadoop
» Robust – handles failures of individual nodes
» Scales linearly
» Open source
» A top-level Apache project
Why Hadoop?
Hadoop
» Facebook holds the largest known Hadoop storage cluster in the world:
  2000 machines
  12 TB per machine (some have 24 TB)
  32 GB of RAM per machine
» Total of more than 21 Petabytes (1 Petabyte = 1024 Terabytes)
Facebook (Again…)
Hadoop
History
Hadoop
2002 – Apache Nutch, an open-source web search engine, founded by Doug Cutting
2004 – Google’s GFS & MapReduce papers published
2006 – Cutting joins Yahoo!, forms Hadoop
2008 – Hadoop hits web scale, being used by Yahoo! for web indexing
2009 – Sorting 1 TB in 62 seconds
2010 – Computing the longest Pi yet
The Hadoop ecosystem:
» Common
» HDFS
» MapReduce
» Pig
» Hive
» HBase
» ZooKeeper
» Chukwa
IDE Plugin
Hadoop
Hadoop and MapReduce
» A programming model for processing and generating large data sets
» Introduced by Google
» Parallel processing of the map/reduce operations
Definition
MapReduce
Sam believed “An apple a day keeps a doctor away”
[images: Sam, his mother, and an apple]
MapReduce – The Story of Sam
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
Sam thought of “drinking” the apple
He used a [knife] to cut the [apple] and a [juicer] to make juice.
MapReduce – The Story of Sam
Sam applied his invention to all the fruits he could find in the fruit basket
(map ‘( … )) → ( … ) – a list of values mapped into another list of values
(reduce ‘( … )) – which gets reduced into a single value
MapReduce – The Story of Sam
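Not part of the original deck: the map-then-reduce idea above can be sketched in a few lines of plain Java (no Hadoop involved). The class and method names here are invented for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {
    // map: each fruit becomes a juice; reduce: the juices combine into one mix
    static String blend(List<String> fruits) {
        List<String> juices = fruits.stream()
                .map(f -> f + "-juice")            // the map step: value -> value
                .collect(Collectors.toList());
        return juices.stream()
                .reduce((a, b) -> a + "+" + b)     // the reduce step: many values -> one
                .orElse("");
    }

    public static void main(String[] args) {
        System.out.println(blend(List.of("apple", "orange", "pear")));
        // apple-juice+orange-juice+pear-juice
    }
}
```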
Sam got his first job for his talent in making juice
» Now, it’s not just one basket but a whole container of fruits
» Also, they produce a list of juice types separately
» But, Sam had just ONE [knife] and ONE [juicer]
Large data, and a list of values for output
MapReduce – The Story of Sam
Source: Saliya Ekanayake, SALSA HPC Group at Community Grids Labs
Sam implemented a parallel version of his innovation:
» Each map input: a list of <key, value> pairs, e.g. (<a, …>, <o, …>, <p, …>, …)
» Each map output: a list of <key, value> pairs, e.g. (<a’, …>, <o’, …>, <p’, …>, …)
» Grouped by key (shuffle)
» Each reduce input: <key, value-list>, e.g. <a’, ( … )>
» Reduced into a list of values
MapReduce – The Story of Sam
» Mapper – takes a series of key/value pairs, processes each and generates output key/value pairs: (k1, v1) → list(k2, v2)
» Reducer – iterates through the values that are associated with a specific key and generates output: (k2, list(v2)) → list(k3, v3)
» The Mapper takes the input data, filters and transforms it into something the Reducer can aggregate over
First Map, Then Reduce
MapReduce
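Not from the slides: a Hadoop-free sketch of the same (k1, v1) → list(k2, v2) → (k2, list(v2)) → list(k3, v3) pipeline in plain Java, so the three stages are visible in one place. All names are made up for illustration.

```java
import java.util.*;

public class LocalWordCount {
    // map emits (word, 1), the shuffle groups those pairs by key,
    // reduce sums each group into a single count.
    static Map<String, Integer> run(List<String> lines) {
        // map + shuffle: word -> list of 1s, grouped by key
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        // reduce: collapse each value list into one sum per key
        Map<String, Integer> counts = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Hello World Bye World")));
        // {Bye=1, Hello=1, World=2}
    }
}
```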
MapReduce
[diagram] Input → Map (many mappers in parallel) → Shuffle → Reduce (several reducers) → Output
» Hadoop comes with a number of predefined classes: BooleanWritable, ByteWritable, LongWritable, Text, etc.
» Supports pluggable serialization frameworks, e.g. Apache Avro
Hadoop Data Types
MapReduce
» TextInputFormat / TextOutputFormat
» KeyValueTextInputFormat
» SequenceFile – a Hadoop-specific compressed binary file format, optimized for passing data between two MapReduce jobs
Input / Output Formats
MapReduce
Word Count – The Mapper

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, new IntWritable(1));
    }
  }
}

<K1, Hello World Bye World>  →  <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Word Count – The Reducer

public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

<Hello, 1> <World, 1> <Bye, 1> <World, 1>  →  <Hello, 1> <World, 2> <Bye, 1>
Word Count – The Driver

public static void main(String[] args) throws IOException {
  JobConf job = new JobConf(WordCount.class);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setMapperClass(MapClass.class);
  job.setReducerClass(ReduceClass.class);

  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));

  //job.setInputFormat(KeyValueTextInputFormat.class);

  JobClient.runJob(job);
}
» Music discovery website
» Scrobbling / streaming via radio
» 40M unique visitors per month
» Over 40M scrobbles per day
» Each scrobble creates a log line
Hadoop @ Last.FM
MapReduce
» Goal: create a “Unique listeners per track” chart
Sample listening data
MapReduce

UserId | TrackId | Scrobbles | Radio | Skip
55551  | 100     | 5         | 10    | 0
55551  | 900     | 0         | 3     | 3
55552  | 101     | 0         | 5     | 0
55553  | 102     | 5         | 0     | 0
Unique Listens – The Mapper

public void map(LongWritable position, Text rawLine,
                OutputCollector<IntWritable, IntWritable> output,
                Reporter reporter) throws IOException {
  int scrobbles, radioListens;   // assume they are initialized - for verbosity
  IntWritable trackId, userId;   // likewise
  // if the track somehow is marked with zero plays - ignore it
  if (scrobbles <= 0 && radioListens <= 0) {
    return;
  }
  // output user id against track id
  output.collect(trackId, userId);
}
Unique Listens – The Reducer

public void reduce(IntWritable trackId, Iterator<IntWritable> values,
                   OutputCollector<IntWritable, IntWritable> output,
                   Reporter reporter) throws IOException {
  Set<Integer> usersSet = new HashSet<Integer>();
  // add all userIds to the set, duplicates removed
  while (values.hasNext()) {
    IntWritable userId = values.next();
    usersSet.add(userId.get());
  }
  // output: trackId -> number of unique listeners per track
  output.collect(trackId, new IntWritable(usersSet.size()));
}
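Not in the original deck: the same group-and-deduplicate logic, written as a small Hadoop-free Java sketch so it can be run directly. Names and the input representation (int[] of {trackId, userId}) are invented for illustration.

```java
import java.util.*;

public class UniqueListeners {
    // Group (trackId, userId) pairs by track, de-duplicate users with a Set,
    // then count the set per track - exactly what the reducer above does.
    static Map<Integer, Integer> uniquePerTrack(List<int[]> trackUserPairs) {
        Map<Integer, Set<Integer>> users = new TreeMap<>();
        for (int[] p : trackUserPairs)
            users.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
        Map<Integer, Integer> counts = new TreeMap<>();
        users.forEach((track, set) -> counts.put(track, set.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<int[]> pairs = List.of(new int[]{100, 55551}, new int[]{100, 55551},
                                    new int[]{101, 55552});
        System.out.println(uniquePerTrack(pairs)); // {100=1, 101=1}
    }
}
```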
» Complex tasks sometimes need to be broken down into subtasks
» The output of the previous job goes as input to the next job: job-a | job-b | job-c
» Simply launch the driver of the 2nd job after the 1st completes
Chaining
MapReduce
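A sketch, not from the slides: chaining is just function composition over job outputs, much like a shell pipeline. The two “jobs” here (jobA normalizes case, jobB filters) are hypothetical stand-ins, with no Hadoop dependency.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class JobChain {
    // Job A: normalize the records
    static Function<List<String>, List<String>> jobA =
            in -> in.stream().map(String::toLowerCase).collect(Collectors.toList());
    // Job B: keep only the records it cares about
    static Function<List<String>, List<String>> jobB =
            in -> in.stream().filter(s -> s.startsWith("h")).collect(Collectors.toList());

    // Chaining: output of job A becomes the input of job B (job-a | job-b)
    static List<String> chain(List<String> in) {
        return jobA.andThen(jobB).apply(in);
    }

    public static void main(String[] args) {
        System.out.println(chain(List.of("Hadoop", "HDFS", "Pig")));
        // [hadoop, hdfs]
    }
}
```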
» Hadoop supports other languages via an API called Streaming
» Use UNIX commands as mappers and reducers
» Or use any script that processes a line-oriented data stream from STDIN and outputs to STDOUT – Python, Perl etc.
Hadoop Streaming
MapReduce
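Not from the slides: the streaming contract described above (read line-oriented records from STDIN, emit tab-separated key/value on STDOUT) can be sketched in Java too. Real streaming mappers are usually scripts; this class and its names are invented for illustration, with the logic factored out of main so it is easy to exercise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class StreamingMapper {
    // Streaming contract: one input record per line, one "key<TAB>value" per output line.
    static List<String> mapLines(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines)
            for (String word : line.trim().split("\\s+"))
                if (!word.isEmpty()) out.add(word + "\t1");
        return out;
    }

    public static void main(String[] args) {
        // Drain STDIN and write the mapped records to STDOUT
        try (Scanner in = new Scanner(System.in)) {
            while (in.hasNextLine())
                mapLines(List.of(in.nextLine())).forEach(System.out::println);
        }
    }
}
```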
$ hadoop jar hadoop-streaming.jar \
    -input input/myFile.txt \
    -output output.txt \
    -mapper myMapper.py \
    -reducer myReducer.py
Hadoop Streaming
MapReduce
HDFS – Hadoop Distributed File System
» A large dataset can and will outgrow the storage capacity of a single physical machine
» Partition it across separate machines – distributed filesystems
» Network based – complex
» What happens when a node fails?
Distributed FileSystem
HDFS
» Designed for storing very large files, running on clusters of commodity hardware
» Highly fault-tolerant (via replication)
» A typical file is gigabytes to terabytes in size
» High throughput
HDFS - Hadoop Distributed FileSystem
HDFS
Running Hadoop = running a set of daemons on different servers in your network:
» NameNode
» DataNode
» Secondary NameNode
» JobTracker
» TaskTracker
Hadoop’s Building Blocks
HDFS
Topology of a Hadoop Cluster
» Master node: NameNode and JobTracker (the Secondary NameNode on its own node)
» Each slave node: a DataNode paired with a TaskTracker
» HDFS has a master/slave architecture; the NameNode acts as the master
» Single NameNode per HDFS
» Keeps track of:
  How the files are broken into blocks
  Which nodes store those blocks
  The overall health of the filesystem
» Memory and I/O intensive
The NameNode
HDFS
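A sketch, not part of the deck: one small piece of the bookkeeping above is knowing how many blocks a file occupies. With the 64 MB default block size quoted later in these slides, that is a single ceiling division; the class name is invented for illustration.

```java
public class BlockSplit {
    static final long BLOCK = 64L * 1024 * 1024; // default HDFS block size per these slides

    // Number of blocks a file of the given size occupies (the last block may be partial)
    static long blocks(long fileSize) {
        return (fileSize + BLOCK - 1) / BLOCK;
    }

    public static void main(String[] args) {
        System.out.println(blocks(200L * 1024 * 1024)); // a 200 MB file needs 4 blocks
    }
}
```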
» Each slave machine will host a DataNode daemon
» Serves read/write/delete requests from the NameNode
» Manages the storage attached to the node
» Sends a periodic Heartbeat to the NameNode
The DataNode
HDFS
» Failure is the norm rather than the exception
» Detection of faults and quick, automatic recovery
» Each file is stored as a sequence of blocks (default: 64 MB each)
» The blocks of a file are replicated for fault tolerance
» Block size and replica count are configurable per file
HDFS
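Not from the slides: a toy model of replication, assuming a naive round-robin placement (real HDFS placement is rack-aware and considerably smarter). Each block is copied to a fixed number of distinct nodes, so losing one node never loses a block. All names are invented for illustration.

```java
import java.util.*;

public class ReplicaPlacement {
    // Copy each block to `replicas` distinct nodes, chosen round-robin.
    static Map<Integer, List<String>> place(int numBlocks, List<String> nodes, int replicas) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++)
                holders.add(nodes.get((b + r) % nodes.size())); // distinct while replicas <= nodes
            placement.put(b, holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        System.out.println(place(2, List.of("dn1", "dn2", "dn3"), 2));
        // {0=[dn1, dn2], 1=[dn2, dn3]}
    }
}
```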
» An assistant daemon that should run on a dedicated node
» Takes snapshots of the HDFS metadata
» Doesn’t receive real-time changes
» Helps minimize downtime in case the NameNode crashes
Secondary NameNode
HDFS
» One per cluster – on the master node
» Receives job requests submitted by the client
» Schedules and monitors MapReduce jobs on TaskTrackers
JobTracker
HDFS
» Run map and reduce tasks
» Send progress reports to the JobTracker
TaskTracker
HDFS
» Via file commands:
$ hadoop fs -mkdir /user/chuck
$ hadoop fs -put hugeFile.txt
$ hadoop fs -get anotherHugeFile.txt

» Programmatically (HDFS API):
FileSystem hdfs = FileSystem.get(new Configuration());
FSDataOutputStream out = hdfs.create(filePath);
while (...) {
  out.write(buffer, 0, bytesRead);
}
Working with HDFS
HDFS
Tips & Tricks
Tip #1: Hadoop Configuration Types
Tips & Tricks
» Local mode – local machine; no daemons, everything in 1 JVM
» Pseudo-distributed mode – local machine; daemons running on separate JVMs (“a cluster of one”)
» Fully-distributed mode – a cluster with several nodes; daemons running on separate JVMs
» Monitoring events in the cluster can prove to be a bit difficult
» Web interface for our cluster
» Shows a summary of the cluster
» Details about the list of jobs currently running, completed and failed
Tip #2: JobTracker UI
Tips & Tricks
JobTracker Web UI (screenshot)
Tips & Tricks
» Digging through logs, or… re-running the exact same scenario with the same input on the same node?
» IsolationRunner can rerun the failed task to reproduce the problem
» Attach a debugger
» Set keep.failed.task.files = true
Tip #3: IsolationRunner – Hadoop’s Time Machine
Tips & Tricks
» The output of the map phase (which will be shuffled across the network) can be quite large
» Built-in support for compression
» Different codecs: gzip, bzip2 etc.
» Transparent to the developer:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
Tip #4: Compression
Tips & Tricks
» A node can experience a slowdown, thus slowing down the entire job
» If a task is identified as “slow”, it will be scheduled to run on another node in parallel
» As soon as one copy finishes successfully, the others will be killed
» An optimization – not a feature
Tip #5: Speculative Execution
Tips & Tricks
» Input can come from 2 (or more) different sources
» Hadoop has a contrib package called datajoin
» A generic framework for performing reduce-side joins
Tip #6: DataJoin Package
MapReduce
Hadoop in the Cloud – Amazon Web Services
» Cloud computing – shared resources and information are provided on demand
» Rent a cluster rather than buy it
» The best-known infrastructure for cloud computing is Amazon Web Services (AWS)
» Launched in July 2002
Cloud Computing and AWS
Hadoop in the Cloud
» Elastic Compute Cloud (EC2) – a large farm of VMs that a user can rent to run a computer application; a wide range of instance types to choose from (price varies)
» Simple Storage Service (S3) – online storage for persisting MapReduce data for future use
» Hadoop comes with built-in support for EC2 and S3:
$ hadoop-ec2 launch-cluster <cluster-name> <num-of-slaves>
Hadoop in the Cloud – Core Services
EC2 Data Flow
[diagram: our data is loaded into HDFS inside EC2, where the MapReduce tasks run]
EC2 & S3 Data Flow
[diagram: our data moves between S3 and HDFS on EC2, where the MapReduce tasks run]
Hadoop-Related Projects
» Thinking at the level of Map, Reduce and job chaining, instead of simple data flow operations, is non-trivial
» Pig simplifies Hadoop programming
» Provides a high-level data processing language: Pig Latin
» Used by Yahoo! (70% of production jobs), Twitter, LinkedIn, eBay etc.
» Problem: a Users file and a Pages file; find the top 5 most visited pages by users aged 18-25
Pig
Hadoop-Related Projects
Users = LOAD 'users.csv' AS (name, age);
Fltrd = FILTER Users BY age >= 18 AND age <= 25;

Pages = LOAD 'pages.csv' AS (user, url);

Jnd  = JOIN Fltrd BY name, Pages BY user;
Grpd = GROUP Jnd BY url;
Smmd = FOREACH Grpd GENERATE group, COUNT(Jnd) AS clicks;
Srtd = ORDER Smmd BY clicks DESC;
Top5 = LIMIT Srtd 5;

STORE Top5 INTO 'top5sites.csv';
Pig Latin – Data Flow Language
» A data warehousing package built on top of Hadoop
» SQL-like queries on large datasets
Hive
Hadoop-Related Projects
» A Hadoop database for random read/write access
» Uses HDFS as the underlying file system
» Supports billions of rows and millions of columns
» Facebook chose HBase as a framework for their new version of “Messages”
HBase
Hadoop-Related Projects
» A distribution of Hadoop that simplifies deployment by providing the most recent stable version of Apache Hadoop with patches and backports
Cloudera
Hadoop-Related Projects
» Machine learning algorithms for Hadoop
» Coming up next… (-:
Mahout
Hadoop-Related Projects
» Big Data can and will cause serious scalability problems for your application
» MapReduce for analysis, a distributed filesystem for storage
» Hadoop = MapReduce + HDFS, and much more
» AWS integration is easy
» Lots of documentation
Last words
Summary
» Hadoop in Action / Chuck Lam
» Hadoop: The Definitive Guide, 2nd Edition / Tom White (O’Reilly)
» Apache Hadoop Documentation
» Hadoop @ Last.FM presentation
» MapReduce in Simple Terms / Saliya Ekanayake
» Amazon Web Services
References
Thank you!