An introduction to Big Data processing using Hadoop
[Page 1]
An introduction to
Big Data processing
using Hadoop
A.Sedighi
hexican.com
[Page 2]
No single standard definition…
“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Big Data, Definition
[Page 3]
Information is powerful…
but it is how we use it that will
define us
[Page 4]
Data Explosion
relational
text
audio
video
images
[Page 5]
Big Data Era
- creates over 30 billion pieces of content per day
- stores 30 petabytes of data
- produces over 90 million tweets per day
[Page 6]
Log Files
- Log files contain data.
- Each banking transaction should be logged at different levels.
How much log data does a banking solution generate per day?
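As a rough sense of scale, the question can be answered with a back-of-envelope estimate. Every figure below (transaction volume, log lines per transaction, average line length) is an illustrative assumption, not a number from the slides:

```python
# Rough, illustrative estimate of daily log volume for a banking system.
# All numbers below are assumptions chosen only to make the arithmetic concrete.
transactions_per_day = 5_000_000      # assumed transaction volume
log_lines_per_transaction = 20        # assumed: several components/log levels each log the transaction
bytes_per_log_line = 200              # assumed average log-line length

daily_bytes = transactions_per_day * log_lines_per_transaction * bytes_per_log_line
daily_gib = daily_bytes / 2**30
print(f"~{daily_gib:.0f} GiB of logs per day")  # ~19 GiB under these assumptions
```

Even with conservative inputs, the volume lands in the tens of gigabytes per day — the kind of scale that motivates the rest of the deck.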
[Page 7]
Big Data: 3 V's
[Page 8]
Big Data: 3 V's
volume, velocity, variety
[Page 9]
Some make it 3 V's
[Page 10]
What is driving Big Data Industry?
Big Data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, from many sources
- Very large datasets
- More real-time
Traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
[Page 11]
Big Data Challenges
[Page 12]
Big Data Challenges
Sorting 10 TB of data:
- 1 node: ~2.5 days, O(N log N)
- 100 nodes: ~35 minutes, O(log N)
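The slide's numbers follow from a simple idealized model in which the sort work divides evenly across nodes, ignoring shuffle and coordination overhead:

```python
# Idealized scaling model for the slide's 10 TB sort example:
# a job that takes ~2.5 days on one node takes roughly 1/100 of
# that on 100 nodes if the work parallelizes perfectly.
single_node_hours = 2.5 * 24          # 2.5 days on a single node
nodes = 100
parallel_minutes = single_node_hours * 60 / nodes
print(f"{parallel_minutes:.0f} minutes on {nodes} nodes")  # 36 minutes, close to the slide's ~35
```

Real clusters pay extra for data movement between nodes, which is why the measured figure sits slightly off the perfect-scaling estimate.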
[Page 13]
Big Data Challenges
Problem: “Fat” servers imply high cost.
Solution: Use cheap commodity nodes instead.
Problem: A large number of cheap nodes implies frequent failures.
Solution: Leverage automatic fault tolerance.
[Page 14]
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines.
[Page 15]
What Technology Do We Have For Big Data?
[Page 16]
[Page 17]
MapReduce
[Page 18]
MapReduce
Published in 2004 by Google; popularized by the Apache Hadoop project.
Used by Yahoo!, Facebook, Twitter, Amazon, LinkedIn, and many other enterprises.
[Page 19]
Word Count Example
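The classic word-count flow can be sketched in a few lines of single-process Python. This illustrates the map/shuffle/reduce data flow only; it is not Hadoop's actual API, which distributes each phase across a cluster:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 2
```

The key point the example makes: map and reduce never see the whole dataset, so each phase can run on many machines in parallel while the framework handles the grouping in between.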
[Page 20]
MapReduce philosophy
-hide complexity
-make it scalable
-make it cheap
[Page 21]
MapReduce popularized by
Apache Hadoop project
[Page 22]
Hadoop Overview
An open-source implementation of Google's MapReduce and the Google File System (GFS).
First released in 2008 by Yahoo!
Widely adopted by Facebook, Twitter, Amazon, etc.
[Page 23]
[Page 24]
Everything Started By Searching
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
[Page 25]
Hadoop Sub Projects - 1
[Page 26]
Hadoop Sub Projects - 2
[Page 27]
Hadoop Distributed File System (HDFS) - 1
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
-“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
[Page 28]
Hadoop Distributed File System (HDFS) - 2
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
-HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. The time to read the whole dataset is more important than the latency in reading the first record.
[Page 29]
Hadoop Distributed File System (HDFS) - 3
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
-HDFS is designed to carry on working without noticeable interruption to the user in the face of hardware failure.
[Page 30]
Where HDFS doesn't work well:
● Low-latency data access
● Lots of small files
● Multiple writers, arbitrary file modifications
[Page 31]
MapReduce and HDFS
[Page 32]
HDFS Concepts - Blocks
Typical block sizes are 64 MB, 128 MB, or 256 MB.
If the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need a block size of around 100 MB.
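The block-size rule of thumb on this slide is a one-line calculation: the transfer time for a block is block_size / transfer_rate, and we want the seek time to be only 1% of it.

```python
# Derive the slide's ~100 MB block size from its stated assumptions.
seek_time_s = 0.010        # 10 ms seek time, from the slide
transfer_rate = 100e6      # 100 MB/s transfer rate, from the slide
target_ratio = 0.01        # seek time should be 1% of transfer time

# We want: seek_time_s == target_ratio * (block_size / transfer_rate)
# Solving for block_size:
block_size = seek_time_s * transfer_rate / target_ratio
print(f"{block_size / 1e6:.0f} MB")  # 100 MB
```

Large blocks amortize the fixed seek cost over a long sequential read, which is exactly the streaming access pattern HDFS is designed for.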
[Page 33]
Anatomy of a File Read
[Page 34]
Anatomy of a File Write
[Page 35]
[Page 36]
Replica Placement
[Page 37]
Machine Learning - 1
Mahout's goal is to build scalable machine learning libraries. Core algorithms for clustering, classification, and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.
[Page 38]
Machine Learning - 2
Mahout can be used as a recommender engine on top of Hadoop clusters.
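The collaborative-filtering idea behind such a recommender can be shown with a toy single-machine sketch. This is plain Python with made-up data, illustrating the concept only; Mahout's actual API and algorithms differ:

```python
# Toy collaborative filtering: recommend items liked by users whose
# tastes overlap with the target user's. All data here is hypothetical.
ratings = {
    "alice": {"item1", "item2", "item3"},
    "bob":   {"item1", "item2", "item4"},
    "carol": {"item5"},
}

def recommend(user):
    """Collect items liked by overlapping users that `user` hasn't seen."""
    mine = ratings[user]
    candidates = set()
    for other, theirs in ratings.items():
        if other != user and mine & theirs:   # tastes overlap at all
            candidates |= theirs - mine       # their items I don't have yet
    return candidates

print(recommend("alice"))  # {'item4'}: bob shares item1/item2 with alice
```

At cluster scale the same grouping and scoring work maps naturally onto map/reduce phases, which is what Mahout's batch recommenders exploit.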
[Page 39]
Using Hadoop for:
● ads and recommendations
● online travel
● processing mobile data
● energy savings and discovery
● infrastructure management
● image processing
● fraud detection
● IT security
● health care