Hadoop Distributed File Systemsdevans/7343... · A distributed file system (DFS) is a file system...
Distributed File Systems & Hadoop
Kevin Queenan
What is a Distributed File System (DFS)?
Simply...
A distributed file system (DFS) is a file system whose data is stored on one or more servers. The data is accessed and processed as if it were stored on the local client machine.
What is Hadoop?
Apache Hadoop is...
A framework, ecosystem, or set of open-source software tools that allows for the distributed storage and processing of extremely large data sets spread across clusters of commodity-grade hardware.
Why does Hadoop exist?
Consider current industry trends...
Data at a massive scale -> TB and PB
Facebook ingested 20 TB of data per day in 2011
NYSE generated 1 TB of data per day in 2010
This data is also heterogeneous:
Images, social network activity, log files, IoT sensors, etc.
TB and PB
80% unstructured, 20% structured
Heterogeneous data consisting of log files, audio, video, images, etc.
Good, bad, undefined, incomplete?
Time sensitive, real-time, etc.
Challenge: Read 1 TB of data
1 machine
4 I/O channels
Each channel operates at 100 MB/s
Time taken?
Roughly 45 minutes
10 machines
4 I/O channels per machine
Each channel operates at 100 MB/s
Time taken?
Roughly 4.5 minutes
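The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly parallel channels, and no coordination overhead; the function name `read_time_minutes` is made up for illustration. Under those assumptions the results come out slightly below the rounded slide figures.

```python
def read_time_minutes(data_bytes, machines, channels_per_machine, mb_per_s_per_channel):
    """Idealised read time in minutes, assuming every channel runs fully in parallel."""
    # Aggregate throughput in bytes per second (decimal megabytes assumed).
    throughput = machines * channels_per_machine * mb_per_s_per_channel * 10**6
    return data_bytes / throughput / 60


ONE_TB = 10**12  # assumption: decimal terabyte

one_machine = read_time_minutes(ONE_TB, machines=1, channels_per_machine=4, mb_per_s_per_channel=100)
ten_machines = read_time_minutes(ONE_TB, machines=10, channels_per_machine=4, mb_per_s_per_channel=100)

print(f"1 machine:   {one_machine:.1f} min")   # about 41.7 min
print(f"10 machines: {ten_machines:.1f} min")  # about 4.2 min
```

The point is the ratio rather than the exact numbers: ten times the machines gives ten times the aggregate I/O bandwidth, so the read takes a tenth of the time.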
Where was Hadoop developed?
Hadoop Origins
Three Google white papers and their Hadoop counterparts:
1. GFS -> HDFS
2. MapReduce -> MapReduce
3. BigTable -> HBase
Hadoop is the faithful, open-source implementation of Google’s MapReduce, GFS, and BigTable
Hadoop’s primary architect is Doug Cutting, who is also credited with creating Apache Lucene
Hadoop grew out of Nutch, an open-source web search project Cutting had started earlier; it was split out as its own project in 2006, around the time Cutting joined Yahoo!
Cutting’s son named a yellow stuffed elephant Hadoop, and Cutting adopted the name for the project
Hadoop’s Design Axioms
1. Store and process massive amounts of data (order of PB)
2. Performance must scale linearly
3. Failure is expected
4. Easily manageable
5. Self-healing file system
6. Run on commodity, off-the-shelf hardware
Hadoop vs RDBMS
A fundamental tenet of relational databases is the database schema -> the data is inherently structured
What about the massive amount of unstructured data we need to house and process?
Scaling commercial relational databases is incredibly expensive and limited
Hadoop cost per user is approx. $250/TB
RDBMS cost per user is approx. $100,000 - $200,000/TB
Hadoop Architecture
Master/Slave Model
Master
NameNode (HDFS)
JobTracker (MapReduce)
Slave
DataNode (HDFS)
TaskTracker (MapReduce)
NameNode
File metadata: /user/kevin/data1.txt -> blocks 1, 2, 3
Replication factor r = 3 (set in hdfs-site.xml)
DataNode: blocks 2, 3
DataNode: blocks 1, 3
DataNode: blocks 1, 2, 3
DataNode: blocks 1, 2
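The block layout on this slide can be modelled in a few lines to show what the NameNode's metadata makes possible. This is a toy sketch, not HDFS code: the DataNode names and the helper functions are hypothetical, and only the file path, the block IDs, the per-node block sets, and r = 3 come from the slide.

```python
from collections import Counter

REPLICATION = 3  # "r = 3" from the slide

# NameNode-side metadata: file path -> ordered list of block IDs.
file_blocks = {"/user/kevin/data1.txt": [1, 2, 3]}

# Blocks held by each DataNode (node names are hypothetical).
datanode_blocks = {
    "datanode-a": {2, 3},
    "datanode-b": {1, 3},
    "datanode-c": {1, 2, 3},
    "datanode-d": {1, 2},
}


def replica_counts(nodes):
    """Count how many DataNodes hold a copy of each block."""
    counts = Counter()
    for blocks in nodes.values():
        counts.update(blocks)
    return counts


def under_replicated(files, nodes, replication):
    """Return the block IDs that have fewer live copies than the replication factor."""
    counts = replica_counts(nodes)
    return sorted(
        block
        for blocks in files.values()
        for block in blocks
        if counts[block] < replication
    )


# With all four DataNodes alive, every block has exactly three copies.
print(under_replicated(file_blocks, datanode_blocks, REPLICATION))  # []

# Simulate losing one DataNode: every block it held drops to two copies.
survivors = {name: blocks for name, blocks in datanode_blocks.items() if name != "datanode-c"}
print(under_replicated(file_blocks, survivors, REPLICATION))  # [1, 2, 3]
```

A check like the second one is what lets a self-healing file system notice missing replicas and schedule new copies on the remaining nodes.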
Underlying Filesystem
Each physical drive in each slave DataNode machine is formatted with either ext3 or ext4
HDFS can be considered an abstract filesystem layered on top of these local filesystems: files are split into fixed-size blocks that are stored on the slave DataNodes, while the master NameNode tracks which DataNodes hold each block
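As a rough illustration of fixed-size blocks, the sketch below cuts a byte string into blocks of a given size. The function name and the tiny block size are assumptions chosen for readability; real HDFS block sizes are far larger (128 MB is a common default in recent releases).

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of block_size bytes; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


payload = b"x" * 10                     # stand-in for a file's contents
blocks = split_into_blocks(payload, 4)  # toy block size of 4 bytes

print([len(block) for block in blocks])  # [4, 4, 2]
```

Only the final block is allowed to be smaller than the block size, and joining the blocks in order reproduces the original file.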
MapReduce
Data Processing Paradigm
MapReduce is a framework for high-performance distributed data processing based on the divide-and-aggregate programming paradigm
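The divide-and-aggregate idea can be sketched in plain Python with the classic word-count example. This runs in a single process and does not use Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative only. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Divide: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


result = word_count(["hadoop stores data", "hadoop processes data"])
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Because each map call sees only its own line and each reduce call sees only one key, both phases can be spread across many machines without shared state.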
Thanks for your time!