O’Reilly – Hadoop : The Definitive Guide Ch.1 Meet Hadoop
May 28th, 2010
Taewhi Lee
Outline
Data!
Data Storage and Analysis
Comparison with Other Systems
– RDBMS
– Grid Computing
– Volunteer Computing
The Apache Hadoop Project
‘Digital Universe’ Nears a Zettabyte
Digital Universe: the total amount of data stored in the world’s computers
Zettabyte: 10^21 bytes >> Exabyte >> Petabyte >> Terabyte
Flood of Data
NYSE generates about 1 TB of new trade data per day
Facebook hosts about 10 billion photos (about 1 petabyte of storage)
The Internet Archive stores about 2 petabytes of data
Individuals’ Data are Growing Apace
It becomes easier to take more and more photos
LifeLog, my life in a terabyte
[Figure labels: SQL; Capture and encoding]
Microsoft Research’s MyLifeBits Project
Amount of Public Data Increases
Available Public Data Sets on AWS
– Annotated Human Genome
– Public database of chemical structures
– Various census data and labor statistics
Large Data!
How to store & analyze large data?
“More data usually beats better algorithms”
Current HDD
How long does it take to read all the data off the disk?
– Capacity: 1 TB
– Transfer rate: 100 MB/s
How about using multiple disks?
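The question on this slide can be worked through with a little arithmetic. The sketch below is only an illustration: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), an even split of the data across drives, and a hypothetical count of 100 drives, which the slide does not state. The function name `read_time_seconds` is made up for this example.

```python
def read_time_seconds(capacity_bytes: float, rate_bytes_per_s: float, disks: int = 1) -> float:
    """Time to read capacity_bytes spread evenly over `disks` drives read in parallel."""
    return capacity_bytes / (rate_bytes_per_s * disks)

TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

# Figures from the slide: 1 TB capacity, 100 MB/s transfer rate.
single = read_time_seconds(1 * TB, 100 * MB)

# Hypothetical: the same data spread evenly over 100 drives.
parallel = read_time_seconds(1 * TB, 100 * MB, disks=100)

print(f"1 disk:    {single / 3600:.2f} hours")
print(f"100 disks: {parallel:.0f} seconds")
```

Under these assumptions a single drive needs roughly two and three-quarter hours, while 100 drives working in parallel finish in under two minutes. This ignores seek time and coordination overhead.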
Problems with Multiple Disks
Hardware failure
Most analysis tasks need to combine the distributed data
What Hadoop Provides
– Reliable shared storage (HDFS)
– Reliable analysis system (MapReduce)
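To make "combining the distributed data" concrete, here is a minimal in-memory sketch of the map, group, and reduce steps. It is not Hadoop's API and is not taken from the slides. The names `map_reduce`, `count_words`, and `total` are hypothetical, and the three strings stand in for data held on separate disks.

```python
from collections import defaultdict

def map_reduce(chunks, mapper, reducer):
    """Run mapper over each chunk, group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    # Map phase: each chunk is processed independently, as if it lived on its own disk.
    for chunk in chunks:
        for key, value in mapper(chunk):
            # Grouping step: collect all values emitted for the same key.
            grouped[key].append(value)
    # Reduce phase: combine the values collected for each key.
    return {key: reducer(key, values) for key, values in grouped.items()}

def count_words(chunk):
    """Hypothetical mapper: emit (word, 1) for every word in the chunk."""
    for word in chunk.split():
        yield word.lower(), 1

def total(key, values):
    """Hypothetical reducer: sum the counts for one word."""
    return sum(values)

# Three strings standing in for data stored on three separate disks.
chunks = ["more data beats", "better algorithms", "more data"]
counts = map_reduce(chunks, count_words, total)
print(counts)
```

The point of the shape is that the mapper never needs to see more than one chunk, so chunks could in principle be processed on different machines. Only the grouped values for a single key have to be brought together for the reducer.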
RDBMS
* Low latency for point queries or updates
** Update times of a relatively small amount of data
***
Grid Computing
Shared storage (SAN)
Works well for predominantly CPU-intensive jobs
Becomes a problem when nodes need to access large data
Volunteer Computing
Volunteers donate CPU time from their idle computers
Work units are sent to computers around the world
Suitable for very CPU-intensive work with small data sets
Risky due to running work on untrusted machines
Brief History of Hadoop
Created by Doug Cutting
Originated in Apache Nutch (2002)
– Open source web search engine, a part of the Lucene project
NDFS (Nutch Distributed File System, 2004)
MapReduce (2005)
Doug Cutting joins Yahoo! (Jan 2006)
Official start of Apache Hadoop project (Feb 2006)
Adoption of Hadoop by Yahoo! Grid team (Feb 2006)
The Apache Hadoop Project
[Figure: Hadoop subprojects]
Pig, Chukwa, Hive, HBase
MapReduce, HDFS, ZooKeeper
Core, Avro