Case study: Hadoop - IOE Notes (ioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf)
![Page 1: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/1.jpg)
CASE STUDY: HADOOP
![Page 2: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/2.jpg)
OUTLINE
• Hadoop - Basics
• HDFS
  • Goals
  • Architecture
  • Other functions
• MapReduce
  • Basics
  • Word Count Example
  • Handy tools
  • Finding shortest path example
• Related Apache sub-projects (Pig, HBase, Hive)
![Page 3: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/3.jpg)
HBASE: PART OF HADOOP’S ECOSYSTEM
• HBase is built on top of HDFS
• HBase files are internally stored in HDFS
![Page 4: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/4.jpg)
HADOOP - WHY?
• Need to process huge datasets on large clusters of computers
• Very expensive to build reliability into each application
• Nodes fail every day
  • Failure is expected, rather than exceptional
  • The number of nodes in a cluster is not constant
• Need a common infrastructure
  • Efficient, reliable, easy to use
  • Open Source, Apache Licence
![Page 5: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/5.jpg)
WHO USES HADOOP?
• Amazon/A9
• Facebook
• Google
• New York Times
• Veoh
• Yahoo!
• … many more
![Page 6: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/6.jpg)
COMMODITY HARDWARE
• Typically in a 2-level architecture
  • Nodes are commodity PCs
  • 30-40 nodes/rack
  • Uplink from rack is 3-4 gigabit
  • Rack-internal is 1 gigabit
[Figure: an aggregation switch connected to the rack switches]
![Page 7: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/7.jpg)
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
![Page 8: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/8.jpg)
GOALS OF HDFS
• Very Large Distributed File System
  • 10K nodes, 100 million files, 10PB
• Assumes Commodity Hardware
  • Files are replicated to handle hardware failure
  • Detect failures and recover from them
• Optimized for Batch Processing
  • Data locations exposed so that computations can move to where data resides
  • Provides very high aggregate bandwidth
![Page 9: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/9.jpg)
DISTRIBUTED FILE SYSTEM
• Single Namespace for entire cluster
• Data Coherency
  • Write-once-read-many access model
  • Client can only append to existing files
• Files are broken up into blocks
  • Typically 64MB block size
  • Each block replicated on multiple DataNodes
• Intelligent Client
  • Client can find location of blocks
  • Client accesses data directly from DataNode
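As a rough illustration of "files are broken up into blocks", the sketch below cuts a file of a given size into 64MB blocks. The function name and the 200MB file are made up for the example; only the 64MB block size comes from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the typical block size quoted on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # the last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks

# A hypothetical 200MB file needs three full blocks plus one 8MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                       # 4
print(blocks[-1][1] // (1024 * 1024))    # 8
```

Each of these blocks would then be replicated on several DataNodes, which this sketch does not model.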
![Page 10: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/10.jpg)
HDFS ARCHITECTURE
![Page 11: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/11.jpg)
FUNCTIONS OF A NAMENODE
• Manages File System Namespace
  • Maps a file name to a set of blocks
  • Maps a block to the DataNodes where it resides
• Cluster Configuration Management
• Replication Engine for Blocks
![Page 12: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/12.jpg)
NAMENODE METADATA
• Metadata in Memory
  • The entire metadata is in main memory
  • No demand paging of metadata
• Types of metadata
  • List of files
  • List of Blocks for each file
  • List of DataNodes for each block
  • File attributes, e.g. creation time, replication factor
• A Transaction Log
  • Records file creations, file deletions etc
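One way to picture the NameNode's bookkeeping is as two in-memory maps plus an append-only log. The sketch below is a toy model under that assumption; the class and method names are invented for illustration and are not Hadoop's API.

```python
import time

class NameNodeMetadata:
    """Toy model of the in-memory metadata listed on the slide."""

    def __init__(self):
        self.files = {}            # file name -> blocks and file attributes
        self.block_locations = {}  # block id -> set of DataNode names
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, name, block_ids, replication=3):
        self.files[name] = {
            "blocks": list(block_ids),
            "replication": replication,
            "creation_time": time.time(),
        }
        self.transaction_log.append(("create", name))

    def delete_file(self, name):
        del self.files[name]
        self.transaction_log.append(("delete", name))

    def report_block(self, block_id, datanode):
        # In this sketch, block locations are learned from DataNode reports
        # and are not written to the transaction log.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, name):
        """Map a file name to its blocks, and each block to its DataNodes."""
        return [(b, sorted(self.block_locations.get(b, ())))
                for b in self.files[name]["blocks"]]

meta = NameNodeMetadata()
meta.create_file("/foodir/myfile.txt", ["blk_1", "blk_2"])
meta.report_block("blk_1", "dn1")
meta.report_block("blk_1", "dn2")
meta.report_block("blk_2", "dn3")
print(meta.locate("/foodir/myfile.txt"))
```

Because both maps are plain dictionaries, every lookup is served from memory, which mirrors the "no demand paging" point on the slide.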
![Page 13: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/13.jpg)
DATANODE
• A Block Server
  • Stores data in the local file system (e.g. ext3)
  • Stores metadata of a block (e.g. CRC)
  • Serves data and metadata to Clients
• Block Report
  • Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  • Forwards data to other specified DataNodes
![Page 14: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/14.jpg)
BLOCK PLACEMENT
• Current Strategy
  • One replica on local node
  • Second replica on a remote rack
  • Third replica on same remote rack
  • Additional replicas are randomly placed
• Clients read from nearest replicas
• Would like to make this policy pluggable
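The placement strategy above can be sketched as a small function. This is an illustrative model only: the function name, the rack/node names, and the assumption that the chosen remote rack has at least two nodes are all made up for the example.

```python
import random

def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block.

    topology maps a rack name to the list of nodes in that rack;
    writer is the node the client is writing from.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    # Second and third replicas go to two different nodes on the same remote rack.
    second, third = rng.sample(topology[remote_rack], 2)
    chosen = [writer, second, third][:replication]
    # Any additional replicas are placed on randomly chosen remaining nodes.
    spare = [n for nodes in topology.values() for n in nodes if n not in chosen]
    while len(chosen) < replication and spare:
        chosen.append(spare.pop(rng.randrange(len(spare))))
    return chosen

topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
replicas = place_replicas("n1", topology)
print(replicas)  # "n1" first, then two distinct nodes from rack2
```

Keeping two of the three replicas on one remote rack limits cross-rack write traffic while still surviving the loss of a whole rack.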
![Page 15: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/15.jpg)
HEARTBEATS
• DataNodes send a heartbeat to the NameNode
  • Once every 3 seconds
• NameNode uses heartbeats to detect DataNode failure
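Heartbeat-based failure detection boils down to remembering when each DataNode was last heard from. In the sketch below, the 3-second interval comes from the slide, while the timeout of ten missed heartbeats and all names are assumptions chosen for illustration.

```python
HEARTBEAT_INTERVAL = 3.0               # seconds, from the slide
DEAD_AFTER = 10 * HEARTBEAT_INTERVAL   # hypothetical timeout for this sketch

class HeartbeatMonitor:
    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}  # DataNode name -> time of its latest heartbeat

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def dead_nodes(self, now):
        """DataNodes whose latest heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.dead_after)

monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=27.0)   # dn2 has gone silent
print(monitor.dead_nodes(now=31.0))  # ['dn2']
```

Passing `now` explicitly keeps the example deterministic; a real monitor would read a clock instead.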
![Page 16: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/16.jpg)
REPLICATION ENGINE
• NameNode detects DataNode failures
  • Chooses new DataNodes for new replicas
  • Balances disk usage
  • Balances communication traffic to DataNodes
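A simplified picture of the replication engine: after a DataNode fails, find the blocks that lost a replica and pick the least-loaded live node for each new copy. The function, the usage counts, and the "least disk usage first" rule are assumptions for this sketch; balancing of communication traffic is not modelled.

```python
def re_replicate(block_locations, disk_usage, failed, target=3):
    """Return {block: [new DataNodes]} needed after `failed` drops out.

    disk_usage counts stored blocks per DataNode (a stand-in for real usage).
    """
    live = {n: used for n, used in disk_usage.items() if n != failed}
    plan = {}
    for block, nodes in sorted(block_locations.items()):
        holders = set(nodes) - {failed}
        missing = max(target - len(holders), 0)
        # Prefer the emptiest live nodes that do not already hold the block.
        candidates = sorted((n for n in live if n not in holders),
                            key=lambda n: (live[n], n))
        picks = candidates[:missing]
        for n in picks:
            live[n] += 1  # account for the replica we just scheduled
        if picks:
            plan[block] = picks
    return plan

locations = {"b1": {"dn1", "dn2", "dn3"}, "b2": {"dn2", "dn3", "dn4"}}
usage = {"dn1": 5, "dn2": 3, "dn3": 4, "dn4": 1, "dn5": 0}
print(re_replicate(locations, usage, failed="dn3"))
```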
![Page 17: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/17.jpg)
DATA CORRECTNESS
• Use Checksums to validate data
  • Use CRC32
• File Creation
  • Client computes checksum per 512 bytes
  • DataNode stores the checksum
• File access
  • Client retrieves the data and checksum from DataNode
  • If Validation fails, Client tries other replicas
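The checksum scheme can be tried out with Python's standard `zlib.crc32`. The sketch below computes one CRC32 per 512 bytes and falls back to another replica when validation fails; the function names and the in-memory "replicas" are illustrative assumptions, not HDFS client code.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size quoted on the slide

def compute_checksums(data, chunk=BYTES_PER_CHECKSUM):
    """One CRC32 value per 512-byte chunk of data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def read_with_fallback(replicas, checksums):
    """Return the first replica whose checksums match, trying them in order."""
    for data in replicas:
        if compute_checksums(data) == checksums:
            return data
    raise IOError("all replicas failed checksum validation")

original = bytes(range(256)) * 5          # 1280 bytes -> 3 chunks
stored_checksums = compute_checksums(original)

corrupted = bytearray(original)
corrupted[700] ^= 0xFF                    # flip bits in the second chunk

recovered = read_with_fallback([bytes(corrupted), original], stored_checksums)
print(recovered == original)              # True
```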
![Page 18: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/18.jpg)
NAMENODE FAILURE
• A single point of failure
• Transaction Log stored in multiple directories
  • A directory on the local file system
  • A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
![Page 19: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/19.jpg)
SECONDARY NAMENODE
• Copies FsImage and Transaction Log from NameNode to a temporary directory
• Merges FsImage and Transaction Log into a new FsImage in temporary directory
• Uploads new FsImage to the NameNode
  • Transaction Log on NameNode is purged
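The checkpoint cycle above can be mimicked with a dictionary for the FsImage and a list for the transaction log. This is a toy model: the data layout and function names are assumptions, and only "create" and "delete" operations are handled.

```python
def apply_log(fsimage, log):
    """Return a new namespace snapshot with every logged operation applied."""
    image = dict(fsimage)
    for op, path in log:
        if op == "create":
            image[path] = []
        elif op == "delete":
            image.pop(path, None)
    return image

def checkpoint(namenode):
    # 1. Copy FsImage and transaction log to a scratch area.
    image_copy = dict(namenode["fsimage"])
    log_copy = list(namenode["log"])
    # 2. Merge them into a new FsImage.
    new_image = apply_log(image_copy, log_copy)
    # 3. Upload the new FsImage to the NameNode.
    namenode["fsimage"] = new_image
    # 4. Purge only the log entries that were merged.
    namenode["log"] = namenode["log"][len(log_copy):]
    return new_image

namenode = {"fsimage": {"/a": []}, "log": [("create", "/b"), ("delete", "/a")]}
print(checkpoint(namenode))   # {'/b': []}
print(namenode["log"])        # []
```

Merging periodically keeps the log short, so replaying it at NameNode startup stays cheap.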
![Page 20: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/20.jpg)
USER INTERFACE
• Commands for HDFS User:
    hadoop dfs -mkdir /foodir
    hadoop dfs -cat /foodir/myfile.txt
    hadoop dfs -rm /foodir/myfile.txt
• Commands for HDFS Administrator:
    hadoop dfsadmin -report
    hadoop dfsadmin -decommission datanodename
• Web Interface:
    http://host:port/dfshealth.jsp
![Page 21: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/21.jpg)
PIG
![Page 22: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/22.jpg)
PIG
• Started at Yahoo! Research
• Now runs about 30% of Yahoo!’s jobs
• Features
  • Expresses sequences of MapReduce jobs
  • Data model: nested “bags” of items
  • Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
  • Easy to plug in Java functions
![Page 23: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/23.jpg)
AN EXAMPLE PROBLEM
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
![Page 24: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/24.jpg)
IN PIG LATIN
    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';
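To see what the Pig Latin script computes, the same dataflow can be written in a few lines of plain Python over in-memory lists. This is only a single-machine sketch with made-up sample data, and it assumes user names are unique; Pig would run the equivalent steps as MapReduce jobs.

```python
from collections import Counter

def top_pages(users, pages, n=5):
    """users: (name, age) pairs; pages: (user, url) pairs."""
    filtered = {name for name, age in users if 18 <= age <= 25}  # filter by age
    joined = [url for user, url in pages if user in filtered]    # join on name
    clicks = Counter(joined)                                     # group on url, count clicks
    return clicks.most_common(n)                                 # order by clicks, take top n

users = [("amy", 20), ("bob", 30), ("cat", 18)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("bob", "a.com"), ("cat", "a.com")]
print(top_pages(users, pages))  # [('a.com', 2), ('b.com', 1)]
```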
![Page 25: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/25.jpg)
EASE OF TRANSLATION
Each step in the dataflow maps to one Pig Latin statement:
• Load Users: Users = load …
• Filter by age: Fltrd = filter …
• Load Pages: Pages = load …
• Join on name: Joined = join …
• Group on url: Grouped = group …
• Count clicks: Summed = … count()…
• Order by clicks: Sorted = order …
• Take top 5: Top5 = limit …
![Page 26: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/26.jpg)
EASE OF TRANSLATION
The same eight steps and their Pig Latin statements as on the previous slide, now grouped into three MapReduce jobs:
• Job 1
• Job 2
• Job 3
![Page 27: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/27.jpg)
HBASE
![Page 28: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/28.jpg)
HBASE - WHAT?
• Modeled on Google’s Bigtable
• Row/column store
• Billions of rows/millions of columns
• Column-oriented - nulls are free
• Untyped - stores byte[]
![Page 29: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/29.jpg)
HBASE - DATA MODEL
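Column family animal: holds animal:type and animal:size; column family repairs: holds repairs:cost.

| Row | Timestamp | animal:type | animal:size | repairs:cost |
|---|---|---|---|---|
| enclosure1 | t2 | zebra | | 1000 EUR |
| enclosure1 | t1 | lion | big | |
| enclosure2 | … | … | … | … |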
![Page 30: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/30.jpg)
HBASE - DATA STORAGE
Column family animal:
• (enclosure1, t2, animal:type) → zebra
• (enclosure1, t1, animal:size) → big
• (enclosure1, t1, animal:type) → lion
Column family repairs:
• (enclosure1, t1, repairs:cost) → 1000 EUR
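The storage layout above, one store per column family keyed by (row, timestamp, column), can be imitated with a dictionary of versioned cells. The class below is a toy sketch with invented names and integer timestamps standing in for t1 and t2; it is not HBase's storage format.

```python
class ColumnFamilyStore:
    """Toy store for one column family; values are raw bytes, as in HBase."""

    def __init__(self):
        self.cells = {}  # (row, column) -> list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), []).append((timestamp, value))

    def get(self, row, column):
        """Return the newest version of a cell, or None if it was never written."""
        versions = self.cells.get((row, column))
        if not versions:
            return None  # absent cells take no space: "nulls are free"
        return max(versions, key=lambda v: v[0])[1]

animal = ColumnFamilyStore()
animal.put("enclosure1", "animal:type", 1, b"lion")
animal.put("enclosure1", "animal:size", 1, b"big")
animal.put("enclosure1", "animal:type", 2, b"zebra")

print(animal.get("enclosure1", "animal:type"))  # b'zebra'
print(animal.get("enclosure2", "animal:type"))  # None
```

Only cells that were actually written occupy an entry, which is why a sparse table with millions of possible columns stays cheap.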
![Page 31: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/31.jpg)
HBASE - CODE
    HTable table = …
    Text row = new Text("enclosure1");
    Text col1 = new Text("animal:type");
    Text col2 = new Text("animal:size");
    BatchUpdate update = new BatchUpdate(row);
    update.put(col1, "lion".getBytes("UTF-8"));
    update.put(col2, "big".getBytes("UTF-8"));
    table.commit(update);

    update = new BatchUpdate(row);
    update.put(col1, "zebra".getBytes("UTF-8"));
    table.commit(update);
![Page 32: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/32.jpg)
HBASE - QUERYING
• Retrieve a cell
    Cell = table.getRow("enclosure1").getColumn("animal:type").getValue();
• Retrieve a row
    RowResult = table.getRow("enclosure1");
• Scan through a range of rows
    Scanner s = table.getScanner(new String[] { "animal:type" });
![Page 33: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/33.jpg)
HIVE
![Page 34: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/34.jpg)
HIVE
• Developed at Facebook
• Used for majority of Facebook jobs
• “Relational database” built on Hadoop
  • Maintains list of table schemas
  • SQL-like query language (HiveQL)
  • Can call Hadoop Streaming scripts from HiveQL
  • Supports table partitioning, clustering, complex data types, some optimizations
![Page 35: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/35.jpg)
CREATING A HIVE TABLE
    CREATE TABLE page_views(
      viewTime INT,
      userid BIGINT,
      page_url STRING,
      referrer_url STRING,
      ip STRING COMMENT 'User IP address')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

• Partitioning breaks table into separate files for each (dt, country) pair
  Ex: /hive/page_view/dt=2008-06-08,country=USA
      /hive/page_view/dt=2008-06-08,country=CA
![Page 36: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/36.jpg)
A SIMPLE QUERY
• Find all page views coming from xyz.com during March 2008:
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01'
      AND page_views.date <= '2008-03-31'
      AND page_views.referrer_url like '%xyz.com';
• Hive only reads the partitions whose date falls in that range (starting with 2008-03-01,*) instead of scanning the entire table
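Partition pruning can be pictured as filtering the list of (dt, country) partitions before any file is opened. The sketch below follows the path layout shown on the table-creation slide; the function names and the sample partition list are assumptions for illustration, not Hive internals.

```python
def partition_path(table_dir, dt, country):
    """Build a partition path in the layout shown on the slide."""
    return f"{table_dir}/dt={dt},country={country}"

def prune_partitions(partitions, dt_from, dt_to):
    """Keep only partitions whose dt is in range (ISO dates sort as strings)."""
    return [(dt, country) for dt, country in partitions
            if dt_from <= dt <= dt_to]

partitions = [
    ("2008-02-29", "USA"),
    ("2008-03-01", "USA"),
    ("2008-03-01", "CA"),
    ("2008-04-01", "USA"),
]
to_read = prune_partitions(partitions, "2008-03-01", "2008-03-31")
paths = [partition_path("/hive/page_view", dt, c) for dt, c in to_read]
print(paths)
```

Only the two March partitions survive, so the February and April files are never read.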
![Page 37: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/37.jpg)
AGGREGATION AND JOINS
• Count users who visited each page by gender:
    SELECT pv.page_url, u.gender, COUNT(DISTINCT u.id)
    FROM page_views pv JOIN user u ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03'
    GROUP BY pv.page_url, u.gender;
• Sample output: [table not recovered from the slide]
![Page 38: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/38.jpg)
USING A HADOOP STREAMING MAPPER SCRIPT
    SELECT TRANSFORM(page_views.userid, page_views.date)
    USING 'map_script.py'
    AS dt, uid CLUSTER BY dt
    FROM page_views;
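The slides do not show `map_script.py` itself. As a guess at its shape, a streaming mapper reads tab-separated rows on standard input and writes tab-separated rows on standard output. The sketch below swaps (userid, date) into (dt, uid); the function name and sample rows are hypothetical.

```python
import io
import sys

def run_mapper(lines, out):
    """Read 'userid<TAB>date' rows and emit 'date<TAB>userid' rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        userid, date = line.split("\t")
        out.write(f"{date}\t{userid}\n")

# Demonstration on in-memory text instead of real standard input.
demo_out = io.StringIO()
run_mapper(io.StringIO("42\t2008-03-03\n7\t2008-03-04\n"), demo_out)
print(demo_out.getvalue(), end="")

# Used as an actual streaming script, the call would be:
#     run_mapper(sys.stdin, sys.stdout)
```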
![Page 39: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/39.jpg)
STORM
![Page 40: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/40.jpg)
STORM
• Developed by BackType, which was acquired by Twitter
• Lots of tools for data (i.e. batch) processing
  • Hadoop, Pig, HBase, Hive, …
• None of them are realtime systems, which is becoming a real requirement for businesses
• Storm provides realtime computation
  • Scalable
  • Guarantees no data loss
  • Extremely robust and fault-tolerant
  • Programming language agnostic
![Page 41: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/41.jpg)
BEFORE STORM
![Page 42: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/42.jpg)
BEFORE STORM – ADDING A WORKER
• Deploy
• Reconfigure/Redeploy
![Page 43: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/43.jpg)
PROBLEMS
• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
![Page 44: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/44.jpg)
WHAT WE WANT
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message passing
• “Just works” !!
![Page 45: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/45.jpg)
STORM CLUSTER
• Master node (similar to Hadoop JobTracker)
• Used for cluster coordination
• Run worker processes
![Page 46: Case study: Hadoop - IOE Notesioenotes.edu.np/media/notes/big-data/pulchowk-notes/Hadoop.pdf · OUTLINE Hadoop - Basics HDFS Goals Architecture Other functions MapReduce Basics Word](https://reader034.fdocuments.us/reader034/viewer/2022042219/5ec5c2d94b59e275ef4fa8f4/html5/thumbnails/46.jpg)
STREAMS
Unbounded sequence of tuples
[Figure: a stream drawn as a sequence of tuples]
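An unbounded sequence of tuples maps naturally onto a generator that never finishes. The sketch below is a plain-Python analogy, not Storm's API; the tuple contents are made up, and `itertools.islice` is used only so the demonstration terminates.

```python
import itertools

def tuple_stream():
    """Yield (sequence_number, value) tuples forever."""
    for n in itertools.count():
        yield (n, n * n)

# A consumer can only ever look at a finite window of the stream.
first_three = list(itertools.islice(tuple_stream(), 3))
print(first_three)  # [(0, 0), (1, 1), (2, 4)]
```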