An Intro to Hadoop
By matthew-mccullough, in Technology

Transcript of An Intro to Hadoop
![Page 1: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/1.jpg)
Hadoop: Divide and conquer gigantic data
Matthew McCullough, Ambient Ideas, LLC
![Page 2: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/2.jpg)
Talk Metadata
Twitter: @matthewmccull, #HadoopIntro
Matthew McCullough, Ambient Ideas, LLC
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
![Page 3: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/3.jpg)
![Page 4: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/4.jpg)
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
1 Introduction
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs. Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks.
Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004
![Page 5: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/5.jpg)
![Page 6: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/6.jpg)
![Page 7: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/7.jpg)
![Page 8: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/8.jpg)
MapReduce history
“A programming model and implementation for processing and generating large data sets”
![Page 9: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/9.jpg)
![Page 10: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/10.jpg)
Origins
MapReduce implementation
Founded by Doug Cutting
Open source at Apache
![Page 11: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/11.jpg)
Today
Version …
Many companies contributing
…s of companies using
![Page 12: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/12.jpg)
Today
![Page 13: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/13.jpg)
Today
![Page 14: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/14.jpg)
Why Hadoop?
![Page 15: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/15.jpg)
![Page 16: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/16.jpg)
![Page 17: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/17.jpg)
$74.85
1 TB
4 GB
![Page 18: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/18.jpg)
vs
![Page 19: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/19.jpg)
$10,000 vs. $1,000
![Page 20: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/20.jpg)
vs
![Page 21: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/21.jpg)
vs.
Buy our way out of failure
Failure is inevitable
Go cheap
![Page 22: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/22.jpg)
Bzzzt!
Crrrkt!
Sproinnnng!
![Page 23: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/23.jpg)
RDBMS
Structured
![Page 24: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/24.jpg)
Structured: RDBMS, relations spanning DBs
Unstructured: log files, images, documents
![Page 25: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/25.jpg)
NOSQL
![Page 26: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/26.jpg)
![Page 27: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/27.jpg)
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
![Page 28: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/28.jpg)
Applications
![Page 29: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/29.jpg)
![Page 30: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/30.jpg)
![Page 31: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/31.jpg)
![Page 32: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/32.jpg)
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
![Page 33: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/33.jpg)
Applications
![Page 34: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/34.jpg)
![Page 35: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/35.jpg)
![Page 36: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/36.jpg)
![Page 37: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/37.jpg)
Applications
Price search
Steganography
Analytics
Primes (code breaking)
![Page 38: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/38.jpg)
Particle Physics
![Page 39: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/39.jpg)
![Page 40: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/40.jpg)
Particle Physics
Large Hadron Collider
15 petabytes of data per year
![Page 41: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/41.jpg)
Financial Trends
![Page 42: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/42.jpg)
![Page 43: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/43.jpg)
![Page 44: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/44.jpg)
![Page 45: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/45.jpg)
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical
![Page 46: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/46.jpg)
Contextual Ads
![Page 47: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/47.jpg)
Contextual Ads
Diagram: a shopper (Customer A), looking at Product X, is told that …% of other customers who bought X also bought Y; another customer also bought Y, and another did not buy.
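In plain terms, this is co-purchase counting. A minimal sketch of the idea in Python, with made-up customers and products (none of these names come from the talk):

```python
# Hypothetical purchase data: customer -> set of products bought.
purchases = {
    "customer_a": {"X"},
    "customer_b": {"X", "Y"},
    "customer_c": {"X", "Y"},
    "customer_d": {"X"},
}

# Of everyone who bought X, how many also bought Y?
bought_x = [c for c, items in purchases.items() if "X" in items]
also_bought_y = [c for c in bought_x if "Y" in purchases[c]]

pct = 100 * len(also_bought_y) / len(bought_x)
print(f"{pct:.0f}% of customers who bought X also bought Y")  # 50%
```

At web scale this same count runs as a MapReduce job over billions of purchase records rather than an in-memory dictionary.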
![Page 48: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/48.jpg)
Hadoop Family
![Page 49: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/49.jpg)
Hadoop Components
Pig, Hive, Core, Common, Chukwa, HBase, HDFS, ZooKeeper
![Page 50: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/50.jpg)
the Players
![Page 51: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/51.jpg)
the PlayAs
![Page 52: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/52.jpg)
![Page 53: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/53.jpg)
MapReduce
![Page 54: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/54.jpg)
MapReduce: map… then… um… reduce.
![Page 55: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/55.jpg)
The process
![Page 56: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/56.jpg)
![Page 57: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/57.jpg)
![Page 58: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/58.jpg)
![Page 59: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/59.jpg)
![Page 60: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/60.jpg)
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) → list(k2, v2)
Collect and group pairs from all lists by key
Reduce in parallel on each group
Reduce(k2, list(v2)) → list(v3)
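On a single machine, that data flow can be sketched in a few lines of Python (illustrative names; a real Hadoop job also partitions work across nodes and handles failures):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process sketch of the MapReduce flow:
    Map(k1, v1) -> list(k2, v2); group by k2; Reduce(k2, list(v2)) -> v3."""
    # Map phase: every input record is an independent candidate for Map.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Shuffle phase: collect and group pairs from all lists by key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # Reduce phase: run Reduce on each group (in parallel on a real cluster).
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Toy use: total text length per file name.
records = [("a.txt", "hello world"), ("b.txt", "hi")]
result = map_reduce(records,
                    map_fn=lambda name, text: [(name, len(text))],
                    reduce_fn=lambda name, lengths: sum(lengths))
print(result)  # {'a.txt': 11, 'b.txt': 2}
```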
![Page 61: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/61.jpg)
Start
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
![Page 62: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/62.jpg)
Map
at, four, years, old, I, acted, out
glad, I, am, not, four, years, old
yet, I, still, act, like, someone, four
![Page 63: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/63.jpg)
Grouping
out
acted
I, I, I
old, old
years, years
four, four, four
at
not
am
glad
someone
like
act
still
yet
![Page 64: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/64.jpg)
Reduce
out: 1
acted: 1
years: 2
four: 3
at: 1
old: 2
not: 1
am: 1
I: 3
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
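The whole word-count walkthrough fits in a few lines of plain Python (a sketch of the data flow only, not the Hadoop API):

```python
from collections import defaultdict

sentences = [
    "at four years old I acted out",
    "glad I am not four years old",
    "yet I still act like someone four",
]

# Map: emit a (word, 1) pair for every word in the input.
pairs = [(word, 1) for line in sentences for word in line.split()]

# Grouping: collect the emitted values by key.
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)

# Reduce: sum each group to get the word counts.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["four"], counts["I"], counts["years"])  # 3 3 2
```

Hadoop runs exactly this shape of computation, but with the Map and Reduce steps spread across a cluster and the grouping done by a distributed shuffle.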
![Page 65: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/65.jpg)
MapReduce Demo
![Page 66: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/66.jpg)
HDFS
![Page 67: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/67.jpg)
![Page 68: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/68.jpg)
HDFS Basics
![Page 69: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/69.jpg)
![Page 70: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/70.jpg)
![Page 71: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/71.jpg)
HDFS Basics
Based on the Google File System (GFS)
Replicated data store
Stored in 64 MB blocks
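A quick back-of-envelope check of what those defaults imply (64 MB blocks; 3x replication is assumed here, the historical HDFS default):

```python
import math

# A 1 TB file on HDFS with 64 MB blocks and 3x replication.
file_bytes = 1 * 1024**4      # 1 TB
block_bytes = 64 * 1024**2    # 64 MB
replication = 3               # assumed default replica count

blocks = math.ceil(file_bytes / block_bytes)
stored = blocks * replication
print(blocks)  # 16384 blocks in the file
print(stored)  # 49152 block replicas spread across the cluster
```

Large blocks keep the NameNode's metadata small and favor long sequential reads over random access.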
![Page 72: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/72.jpg)
Data Overload
![Page 73: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/73.jpg)
Data Overload
A destination for overflow from a traditional RDBMS or log files
![Page 74: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/74.jpg)
![Page 75: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/75.jpg)
![Page 76: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/76.jpg)
![Page 77: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/77.jpg)
![Page 78: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/78.jpg)
HDFS Demo
![Page 79: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/79.jpg)
Pig
![Page 80: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/80.jpg)
Pig Basics
Yahoo-authored add-on tool
Analyzes large data sets
High-level language for expressing data analysis programs: Pig Latin
![Page 81: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/81.jpg)
Pig Sample
A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
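For comparison, the same field extraction sketched in plain Python over a couple of passwd-style sample lines (the Pig version would run over data in HDFS):

```python
# Equivalent of `B = foreach A generate $0 as id;` on colon-delimited lines.
lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
]

ids = [line.split(":")[0] for line in lines]  # field $0 of each record
print(ids)  # ['root', 'daemon']
```

Pig compiles its script into MapReduce jobs, so the same one-liner scales to files far larger than memory.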
![Page 82: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/82.jpg)
Pig Demo
![Page 83: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/83.jpg)
Hive
![Page 84: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/84.jpg)
Hive Basics
Authored by Facebook
SQL interface to Hadoop
Hive is low level
Hive-specific metadata
![Page 85: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/85.jpg)
Sqoop
Sqoop, by Cloudera, is higher-level
sqoop --connect jdbc:mysql://database.example.com
http://www.cloudera.com/hadoop-sqoop
![Page 86: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/86.jpg)
HBase
![Page 87: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/87.jpg)
HBase Basics
Structured data store
Similar to Google BigTable
Relies on ZooKeeper and HDFS
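The BigTable-style model behind HBase (row key → column family → qualifier → cell) can be sketched as nested maps; this toy illustrates the data model only, not the HBase API:

```python
# Toy column-family store: rows -> families -> qualifiers -> values.
table = {}

def put(row, family, qualifier, value):
    # Create intermediate maps on demand, then store the cell value.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    return table[row][family][qualifier]

put("row1", "info", "name", "Hadoop")
put("row1", "info", "year", "2010")
print(get("row1", "info", "name"))  # -> Hadoop
```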
![Page 88: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/88.jpg)
HBase Demo
![Page 89: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/89.jpg)
on
![Page 90: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/90.jpg)
![Page 91: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/91.jpg)
Amazon Elastic MapReduce
Hosted Hadoop clusters
True use of cloud computing
Easy to set up
Pay per use
Launched in April 2009
Save results to S3 buckets
http://aws.amazon.com/elasticmapreduce/#functionality
![Page 95: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/95.jpg)
EMR Languages
Java, Perl, Ruby, Python, PHP, R, C++
![Page 97: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/97.jpg)
EMR Pricing
EC2 instance pricing and the Elastic MapReduce charge are billed additively
![Page 99: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/99.jpg)
EMR Functions
RunJobFlow: Creates a job flow request, starts EC2 instances and begins processing
DescribeJobFlows: Provides status of your job flow request(s)
AddJobFlowSteps: Adds additional steps to an already running job flow
TerminateJobFlows: Terminates a running job flow and shuts down all instances
![Page 101: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/101.jpg)
![Page 102: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/102.jpg)
Final Thoughts
![Page 103: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/103.jpg)
Ha! Your Hadoop is
slower than my Hadoop!
Shut up! I’m reducing.
![Page 104: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/104.jpg)
The RDBMS is not dead
Has new friends, helpers
NoSQL is taking the world by storm
No more throwing away perfectly good historical data
![Page 105: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/105.jpg)
Does this Hadoop make my
list look ugly?
![Page 106: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/106.jpg)
Failure IS acceptable
Failure is inevitable
Go cheap
Go distributed
![Page 110: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/110.jpg)
Use Hadoop!
![Page 111: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/111.jpg)
Thanks
![Page 112: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/112.jpg)
Hadoop
Divide and conquer gigantic data
![Page 113: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/113.jpg)
Contact
Twitter: @matthewmccull #HadoopIntro
matthewm@ambientideas.com
http://ambientideas.com/blog
http://speakerrate.com/matthew.mccullough
![Page 114: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/114.jpg)
Credits
![Page 115: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/115.jpg)
http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
All others: iStockPhoto.com
Friday, January 15, 2010
Seminal paper on MapReduce
http://labs.google.com/papers/mapreduce.html
![Page 121: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/121.jpg)
MapReduce history
“A programming model and implementation for processing and generating large data sets”
![Page 122: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/122.jpg)
http://www.cern.ch/
![Page 123: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/123.jpg)
Origins
MapReduce implementation
Founded by Doug Cutting
Open source at Apache
http://en.wikipedia.org/wiki/Mapreduce
![Page 124: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/124.jpg)
Today
Version 0.20
Many companies contributing
100s of companies using
http://hadoop.apache.org
![Page 125: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/125.jpg)
Today
http://wiki.apache.org/hadoop/PoweredBy
![Page 126: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/126.jpg)
Today
![Page 127: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/127.jpg)
Why Hadoop?
![Page 128: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/128.jpg)
$74.85
1tb
4gb
![Page 129: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/129.jpg)
$10,000 vs. $1,000
![Page 130: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/130.jpg)
Buy our way out of failure
vs.
Failure is inevitable: go cheap
This concept doesn’t work well at weddings or dinner parties
![Page 131: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/131.jpg)
Bzzzt!
Crrrkt!
Sproinnnng!
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
![Page 132: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/132.jpg)
Structured: RDBMS, relations spanning DBs
Unstructured: log files, images, documents
![Page 133: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/133.jpg)
NOSQL
Death of the RDBMS is a lie
Big data tools are solving different issues than RDBMSes
http://nosql.mypopescu.com/
![Page 134: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/134.jpg)
Applications
Protein folding (pharmaceuticals)
Search engines
Sorting
Classification (government intelligence)
![Page 135: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/135.jpg)
Applications
Price search
Steganography
Analytics
Primes, code breaking
![Page 136: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/136.jpg)
Particle Physics
Large Hadron Collider: petabytes of data per year
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
![Page 137: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/137.jpg)
Financial Trends
Daily trade performance analysis
Market trending
Uses employee desktops during off hours
Fiscally responsible, economical
![Page 138: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/138.jpg)
Contextual Ads
A shopper (Customer A) is looking at Product X, but did not buy.
Customers B and C, who bought Product X, also bought Product Y.
Customer A is told that …% of other customers who bought X also bought Y.
![Page 139: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/139.jpg)
Hadoop Family
![Page 140: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/140.jpg)
Hadoop Components
Pig, Hive, Core (Common)
Chukwa, HBase, HDFS, ZooKeeper
![Page 141: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/141.jpg)
the Players
ZooKeeper, Hive, Chukwa, Common, HBase, HDFS
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
![Page 142: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/142.jpg)
MapReduce
map, then… um… reduce!
![Page 143: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/143.jpg)
The process
Every item in the dataset is a parallel candidate for Map
Map(k1, v1) -> list(k2, v2)
Pairs from all lists are collected and grouped by key
Reduce runs in parallel on each group
Reduce(k2, list(v2)) -> list(v3)
![Page 144: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/144.jpg)
Map
Start:
At four years old I acted out
Glad I am not four years old
Yet I still act like someone four
Map splits each line into individual words: out, acted, I, old, years, four, at, old, years, four, not, am, I, glad, four, someone, like, act, still, I, yet
![Page 145: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/145.jpg)
Grouping
Identical words are collected together: out; acted; I, I, I; old, old; years, years; four, four, four; at; not; am; glad; someone; like; act; still; yet
![Page 146: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/146.jpg)
Reduce
out: 1
acted: 1
years: 2
four: 3
at: 1
old: 2
not: 1
am: 1
I: 3
glad: 1
someone: 1
like: 1
act: 1
still: 1
yet: 1
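The Map → Grouping → Reduce walkthrough above can be reproduced in a few lines of plain Python (no Hadoop involved; "At" is lowercased here so it groups with "at"):

```python
from itertools import groupby

sentences = [
    "At four years old I acted out",
    "Glad I am not four years old",
    "Yet I still act like someone four",
]

# Map(k1, v1) -> list(k2, v2): emit a (word, 1) pair for every word
pairs = [(word.lower(), 1) for line in sentences for word in line.split()]

# Grouping: collect pairs from all lists by key
pairs.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in grp]
           for key, grp in groupby(pairs, key=lambda kv: kv[0])}

# Reduce(k2, list(v2)) -> list(v3): sum the ones in each group
counts = {key: sum(values) for key, values in grouped.items()}
print(counts["four"], counts["i"], counts["old"])  # -> 3 3 2
```

The sort-then-groupby step plays the role of Hadoop's shuffle: it guarantees every value for a given key lands in the same group before Reduce runs.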
![Page 147: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/147.jpg)
MapReduce Demo
![Page 148: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/148.jpg)
HDFS
![Page 149: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/149.jpg)
![Page 150: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/150.jpg)
HDFS Basics
Based on Google GFS
Replicated data store
Stored in 64MB blocks
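Back-of-envelope math for those 64MB blocks; the replication factor of 3 used here is HDFS's customary default, assumed for illustration:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64MB blocks, per the slide
REPLICATION = 3                 # assumed default replication factor

def blocks_for(file_size_bytes):
    """Number of HDFS blocks a file of this size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

one_gb = 1024 * 1024 * 1024
print(blocks_for(one_gb))                # -> 16 blocks
print(blocks_for(one_gb) * REPLICATION)  # -> 48 replicas cluster-wide
```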
![Page 151: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/151.jpg)
Data Overload
56DE:?2E:@?�
@G6C7=@H�7C@>EC25:E:@?2=�
)��$*�@C�=@8�7:=6D
36Friday, January 15, 2010
![Page 152: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/152.jpg)
37Friday, January 15, 2010
![Page 153: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/153.jpg)
HDFS Demo
38Friday, January 15, 2010
![Page 154: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/154.jpg)
Pig
39Friday, January 15, 2010
![Page 155: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/155.jpg)
Pig Basics
/29@@�2FE9@C65�255�@?�E@@=�?2=JK6D�=2C86�52E2�D6ED :89�=6G6=�=2?8F286�7@C�6IAC6DD:?8�52E2�2?2=JD:D�AC@8C2>D':8�#2E:?
http://hadoop.apache.org/pig/docs/r0.3.0/getstarted.html#Sample+Code
40Friday, January 15, 2010
![Page 156: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/156.jpg)
Pig Sample
A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id;dump B; store B into 'id.out';
41Friday, January 15, 2010
![Page 157: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/157.jpg)
Pig demo
42Friday, January 15, 2010
![Page 158: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/158.jpg)
Hive
43Friday, January 15, 2010
![Page 159: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/159.jpg)
Hive Basics
�FE9@C65�3J�*(#�:?E6C7246�E@� 25@@A :G6�:D�=@H�=6G6= :G6�DA64:7:4�>6E252E2
44Friday, January 15, 2010
![Page 160: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/160.jpg)
Sqoop
*B@@A�3J�����������������������:D�9:896C�=6G6=DB@@A���4@??64E�;534�>JDB=�52E232D6�6I2>A=6�4@>
http://www.cloudera.com/hadoop-sqoop45Friday, January 15, 2010
![Page 161: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/161.jpg)
HBase
46Friday, January 15, 2010
![Page 162: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/162.jpg)
HBase Basics
*ECF4EFC65�52E2�DE@C6*:>:=2C�E@��@@8=6��:8+23=6)6=:6D�@?�0@@"66A6C�2?5� ��*
47Friday, January 15, 2010
![Page 163: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/163.jpg)
HBase Demo
48Friday, January 15, 2010
![Page 164: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/164.jpg)
on
49Friday, January 15, 2010
![Page 165: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/165.jpg)
Amazon Elastic MapReduce
@DE65� 25@@A�4=FDE6CD+CF6�FD6�@7�4=@F5�4@>AFE:?8�2DJ�E@�D6E�FA'2J�A6C�FD6
Launched in April 2009Save results to S3 bucketshttp://aws.amazon.com/elasticmapreduce/#functionality
50Friday, January 15, 2010
![Page 166: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/166.jpg)
EMR Languages
���������������������������������������������!�������������������!!���������
51Friday, January 15, 2010
![Page 167: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/167.jpg)
EMR Pricing
Pay for both columns additively52Friday, January 15, 2010
![Page 168: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/168.jpg)
EMR Functions
�������� ���C62E6D�2�;@3�7=@H�C6BF6DE��DE2CED�����:?DE2?46D�2?5�368:?D�AC@46DD:?8�������������� ���'C@G:56D�DE2EFD�@7�J@FC�;@3�7=@H�C6BF6DE�D���������� ��������55D�255:E:@?2=�DE6A�E@�2?�2=C625J�CF??:?8�;@3�7=@H� ������������� ���+6C>:?2E6D�CF??:?8�;@3�7=@H�2?5�D9FE5@H?D�2==�:?DE2?46D�
53Friday, January 15, 2010
![Page 169: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/169.jpg)
54Friday, January 15, 2010
![Page 170: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/170.jpg)
Final Thoughts
55Friday, January 15, 2010
![Page 171: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/171.jpg)
Ha! Your Hadoop is
slower than my Hadoop!
Shut up!I’m reducing.
http://www.flickr.com/photos/robryb/14826486/sizes/l/56Friday, January 15, 2010
![Page 172: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/172.jpg)
+96�)��$*�:D�?@E�5625 2D�?6H�7C:6?5D��96=A6CD%@*(#�:D�E2<:?8�E96�H@C=5�3J�DE@C>%@�>@C6�E9C@H:?8�2H2J�A6C764E=J�8@@5�9:DE@C:42=�52E2
57Friday, January 15, 2010
![Page 173: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/173.jpg)
Does this Hadoop make my
list look ugly?
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/58Friday, January 15, 2010
![Page 174: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/174.jpg)
�2:=FC6�!*�2446AE23=6�2:=FC6�:D�:?6G:E23=6�@�4962AM�@�5:DEC:3FE65M
59Friday, January 15, 2010
![Page 175: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/175.jpg)
Use Hadoop!
http://www.flickr.com/photos/robryb/14826417/sizes/l/60Friday, January 15, 2010
![Page 176: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/176.jpg)
Thanks
61Friday, January 15, 2010
![Page 177: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/177.jpg)
Hadoop�:G:56�2?5�4@?BF6C�8:82?E:4�52E2
62Friday, January 15, 2010
![Page 178: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/178.jpg)
Contact
�>2EE96H>44F==� 25@@A!?EC@
>2EE96H>�2>3:6?E:562D�4@>9EEA�2>3:6?E:562D�4@>3=@89EEA�DA62<6CC2E6�4@>>2EE96H�>44F==@F89
63Friday, January 15, 2010
![Page 179: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/179.jpg)
Credits
64Friday, January 15, 2010
![Page 180: An Intro to Hadoop](https://reader033.fdocuments.us/reader033/viewer/2022052315/54550d5daf79591a248b5b90/html5/thumbnails/180.jpg)
9EEA�HHH�7@?EDA246�4@>52G:5�C2<@HD<:EC:36429EEA�HHH�46C?�499EEA�HHH�C@3:?>2;F>52C�4@>����8@@8=6�52==6D�52E2�46?EC6�92D�D6C:@FD�4@@=:?8�?665D9EEA�HHH�8C66?> �4@>���8@@8=6D�D64C6E�E@�677:4:6?E�52E2�46?E6C�56D:8?�23:=:EJ�E@�AC65:4E�A6C7@C>2?46�9E>=9EEA�FA=@25�H:<:>65:2�@C8H:<:A65:24@>>@?D774��)%1# �1+F??6=��;A89EEA�HHH�7=:4<C�4@>A9@E@D>2?5;�� �� ����9EEA�HHH�7=:4<C�4@>A9@E@D��� ����%� ������ 9EEA�HHH�7=:4<C�4@>A9@E@D;@:ED���������9EEA�HHH�7=:4<C�4@>A9@E@DDEC66E7=J1;K� ������ �9EEA�HHH�7=:4<C�4@>A9@E@DDJ3C6?DEFG6=����������9EEA�HHH�7=:4<C�4@>A9@E@D=24<=FDE6CD��������9EEA�HHH�7=:4<C�4@>A9@E@DDJ3C6?DEFG6=����������9EEA�HHH�7=:4<C�4@>A9@E@DC@3CJ3��������D:K6D=9EEA�HHH�7=:4<C�4@>A9@E@D>4<2JD2G286� ������D:K6D=9EEA�HHH�7=:4<C�4@>A9@E@DC@3CJ3��������D:K6D=�==�@E96CD��:*E@4<'9@E@�4@>
65Friday, January 15, 2010