An Introduction to Hadoop Presentation.pdf
7/29/2019 An Introduction to Hadoop Presentation.pdf
An Introduction to Hadoop
Mark Fei, Cloudera
Strata Hadoop World 2012, New York City, October 23, 2012
-
Who Am I?
Mark Fei, Cloudera
Durango, Colorado
Current: Senior Instructor at Cloudera
Past:
- Professional Services Education, VMware
- Senior Member Technical Staff, Hill Associates
- Sales Engineer, Nortel Networks
- Systems Programmer, large bank
- Banking applications software developer
-
What's Ahead?
- Solid introduction to Apache Hadoop
  - What it is
  - Why it's relevant
  - How it works
  - The ecosystem
- No prior experience needed
- Feel free to ask questions
-
What is Apache Hadoop?
- Scalable data storage and processing
  - Open source Apache project
  - Harnesses the power of commodity servers
  - Distributed and fault-tolerant
- Core Hadoop consists of two main parts
  - HDFS (storage)
  - MapReduce (processing)
-
A large ecosystem
-
Who uses Hadoop?
-
Vendor integration
- BI / Analytics
- ETL
- Database
- OS / Cloud / System Mgmt.
- Hardware
-
About Cloudera
- Cloudera is "The commercial Hadoop company"
- Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes several committers to Hadoop projects
-
Cloudera Software
- Cloudera's Distribution including Apache Hadoop (CDH)
  - A single, easy-to-install package from the Apache Hadoop core repository
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  - 100% open source
- Components
  - Apache Hadoop
  - Apache Hive
  - Apache Pig
  - Apache HBase
  - Apache ZooKeeper
  - Apache Flume, Apache Hue, Apache Oozie, Apache Sqoop, Apache Mahout
-
A Coherent Platform
Layers: Storage, Computation, Integration, Coordination, Access
Components of the CDH Stack
- Coordination: Apache ZooKeeper
- Data Integration: Apache Flume, Apache Sqoop
- Fast Read/Write Access: Apache HBase
- Languages / Compilers: Apache Pig, Apache Hive, Apache Mahout
- Workflow: Apache Oozie
- Scheduling: Apache Oozie
- Metadata: Apache Hive
- File System Mount: FUSE-DFS
- UI Framework: Hue
- SDK: Hue SDK
- Storage and Computation: HDFS, MapReduce
-
Cloudera Manager, Free Edition
- End-to-end deployment and management of your CDH cluster
  - "Zero to Hadoop" in 15 minutes
  - Supports up to 50 nodes
  - Free (but not open source)
-
Cloudera Enterprise
- Cloudera's Distribution including Apache Hadoop (CDH)
  - Big data storage, processing and analytics platform based on CDH
- Cloudera Manager (full version)
  - End-to-end deployment, management, and operation of CDH
  - Provides sophisticated cluster monitoring tools not present in the free version
- Production support
  - A team of experts on call to help you meet your Service Level Agreements (SLAs)
-
Cloudera University
- Training for the entire Hadoop stack
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Administrator Training for Apache Hadoop
  - Cloudera Training for Apache HBase
  - Cloudera Training for Apache Hive and Pig
  - Cloudera Essentials for Apache Hadoop
  - More courses coming
- Public and private classes offered
  - Including customized on-site private classes
- Industry-recognized certifications
  - Cloudera Certified Developer for Apache Hadoop (CCDH)
  - Cloudera Certified Administrator for Apache Hadoop (CCAH)
  - Cloudera Certified Specialist in Apache HBase (CCSHB)
-
Professional Services
- Solutions Architects provide guidance and hands-on expertise
  - Use Case Discovery
  - New Hadoop Deployment
  - Proof of Concept
  - Production Pilot
  - Process and Team Development
  - Hadoop Deployment Certification
-
How Did Apache Hadoop Originate?
- Heavily influenced by Google's architecture
  - Notably, the Google Filesystem and MapReduce papers
- Other Web companies quickly saw the benefits
  - Early adoption by Yahoo, Facebook and others
Timeline:
- 2002: Nutch spun off from Lucene
- 2003: Google publishes GFS paper
- 2004: Google publishes MapReduce paper
- 2005: Nutch rewritten for MapReduce
-
Why Do We Have So Much Data?
And what are we supposed to do with it?
-
Velocity
- Why we're generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online
-
Variety
- What types of data are we producing?
  - Application logs
  - Text messages
  - Social network connections
  - Tweets
  - Photos
- Not all of this maps cleanly to the relational model
-
Volume
- The result of this is that every single day
  - Twitter processes 340 million messages
  - Facebook stores 2.7 billion comments and Likes
  - Google processes about 24 petabytes of data
- And every single minute
  - More than 200 million e-mail messages are sent
  - Foursquare processes more than 2,000 check-ins
-
Where Does Data Come From?
- Science
  - Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
- Industry
  - Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
- Legacy
  - Sales data, customer behavior, product databases, accounting data, etc.
- System Data
  - Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters
-
Analyzing Data: The Challenges
- Huge volumes of data
- Mixed sources result in many different formats
  - XML, CSV, EDI, log files, objects, SQL, text, JSON, binary, etc.
-
What is Common Across Hadoop-able Problems?
- Nature of the data
  - Complex data
  - Multiple data sources
  - Lots of it
- Nature of the analysis
  - Batch processing
  - Parallel execution
  - Spread data over a cluster of servers and take the computation to the data
-
Benefits of Analyzing With Hadoop
- Previously impossible/impractical to do this analysis
- Analysis conducted at lower cost
- Analysis conducted in less time
- Greater flexibility
- Linear scalability
-
What Analysis is Possible With Hadoop?
- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment
-
Eight Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data sandbox
-
1. Modeling True Risk
Challenge:
- How much risk exposure does an organization really have with each customer?
  - Multiple sources of data and across multiple lines of business
Solution with Hadoop:
- Source and aggregate disparate data sources to build data picture
  - e.g. credit card records, call recordings, chat sessions, emails, banking activity
- Structure and analyze
  - Sentiment analysis, graph creation, pattern recognition
Typical Industry:
- Financial Services (banks, insurance companies)
-
2. Customer Churn Analysis
Challenge:
- Why is an organization really losing customers?
  - Data on these factors comes from different sources
Solution with Hadoop:
- Rapidly build behavioral model from disparate data sources
- Structure and analyze with Hadoop
  - Traversing
  - Graph creation
  - Pattern recognition
Typical Industry:
- Telecommunications, Financial Services
-
3. Recommendation Engine / Ad Targeting
Challenge:
- Using user data to predict which products to recommend
Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Collaborative filtering
  - Collecting "taste" information from many users
  - Utilizing information to predict what similar users like
Typical Industry:
- Ecommerce, Manufacturing, Retail
- Advertising
-
4. Point of Sale Transaction Analysis
Challenge:
- Analyzing Point of Sale (PoS) data to target promotions and manage operations
  - Sources are complex and data volumes grow across chains of stores and other sources
Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Pattern recognition
  - Optimizing over multiple data sources
  - Utilizing information to predict demand
Typical Industry:
- Retail
-
5. Analyzing Network Data to Predict Failure
Challenge:
- Analyzing real-time data series from a network of sensors
  - Calculating average frequency over time is extremely tedious because of the need to analyze terabytes
Solution with Hadoop:
- Take the computation to the data
  - Expand from simple scans to more complex data mining
- Better understand how the network reacts to fluctuations
  - Discrete anomalies may, in fact, be interconnected
- Identify leading indicators of component failure
Typical Industry:
- Utilities, Telecommunications, Data Centers
-
6. Threat Analysis / Trade Surveillance
Challenge:
- Detecting threats in the form of fraudulent activity or attacks
  - Large data volumes involved
  - Like looking for a needle in a haystack
Solution with Hadoop:
- Parallel processing over huge datasets
- Pattern recognition to identify anomalies, i.e., threats
Typical Industry:
- Security, Financial Services
- General: spam fighting, click fraud
-
7. Search Quality
Challenge:
- Providing real-time meaningful search results
Solution with Hadoop:
- Analyzing search attempts in conjunction with structured data
- Pattern recognition
  - Browsing pattern of users performing searches in different categories
Typical Industry:
- Web, Ecommerce
-
Hadoop: How does it work?
Moore's law ... and not
-
Disk Capacity and Price
- We're generating more data than ever before
- Fortunately, the size and cost of storage has kept pace
  - Capacity has increased while price has decreased

Year  Capacity (GB)  Cost per GB (USD)
1997  2.1            $157
2004  200            $1.05
2012  3,000          $0.05
-
Disk Capacity and Performance
- Disk performance has also increased in the last 15 years
- Unfortunately, transfer rates haven't kept pace with capacity

Year  Capacity (GB)  Transfer Rate (MB/s)  Disk Read Time
1997  2.1            16.6                  126 seconds
2004  200            56.5                  59 minutes
2012  3,000          210                   3 hours, 58 minutes
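The read times in the table follow from dividing capacity by transfer rate. A minimal sketch of that arithmetic, assuming 1 GB = 1000 MB and a purely sequential read (the function name is illustrative, not from the talk):

```python
def full_read_seconds(capacity_gb, transfer_mb_per_s):
    """Seconds needed to read an entire disk sequentially (1 GB = 1000 MB)."""
    return capacity_gb * 1000.0 / transfer_mb_per_s

# (year, capacity in GB, transfer rate in MB/s), taken from the table
DISKS = [(1997, 2.1, 16.6), (2004, 200, 56.5), (2012, 3000, 210)]

for year, capacity_gb, rate in DISKS:
    seconds = full_read_seconds(capacity_gb, rate)
    print("%d: %.1f seconds (%.1f minutes)" % (year, seconds, seconds / 60))
```

The results land within rounding of the figures on the slide: capacity grew by a factor of more than 1,000 while transfer rate grew by roughly 13, so reading a whole disk takes far longer than it used to.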
-
Architecture of a Typical HPC System
Storage System
Compute Nodes
Fast Network
-
Architecture of a Typical HPC System
Storage System
Compute Nodes
Step 1: Copy input data
Fast Network
-
Architecture of a Typical HPC System
Storage System
Compute Nodes
Step 2: Process the data
Fast Network
-
Architecture of a Typical HPC System
Storage System
Compute Nodes
Step 3: Copy output data
Fast Network
-
You Don't Just Need Speed
- The problem is that we have way more data than code
$ du -ks code/
1,083
$ du -ks data/
854,632,947,314
-
You Need Speed At Scale
Storage System
Compute Nodes
Bottleneck
-
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
Because 10,000 hard disks are better than one
-
Collocated Storage and Processing
- Solution: store and process data on the same nodes
  - Data locality: "Bring the computation to the data"
- Reduces I/O and boosts performance
"slave" nodes(storage and processing)
-
Hard Disk Latency
- Disk seeks are expensive
- Solution: Read lots of data at once to amortize the cost
[Diagram labels: "Current location of disk head" and "Where the data you need is stored"]
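To see why large reads amortize the seek cost, here is a rough sketch. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from the talk, and the model assumes exactly one seek per read:

```python
def seek_overhead(read_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of total read time spent seeking, assuming one seek per read.

    seek_ms and transfer_mb_per_s are hypothetical round numbers.
    """
    transfer_ms = read_mb / transfer_mb_per_s * 1000.0
    return seek_ms / (seek_ms + transfer_ms)

# Compare a 4 KB read, a 1 MB read and a 64 MB read
for read_mb in (0.004, 1, 64):
    print("%8.3f MB per seek -> %.1f%% of time spent seeking"
          % (read_mb, 100 * seek_overhead(read_mb)))
```

Under these assumptions a tiny read spends almost all of its time seeking, while a 64 MB read spends under 2% of its time on the seek.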
-
Introducing HDFS
- Hadoop Distributed File System
  - Scalable storage influenced by Google's file system paper
- It's not a general-purpose filesystem
  - HDFS is optimized for Hadoop
  - Values high throughput much more than low latency
- It's a user-space Java process
- Primarily accessed via command-line utilities and Java API
-
HDFS is (Mostly) UNIX-like
- In many ways, HDFS is similar to a UNIX filesystem
  - Hierarchical
  - UNIX-style paths (e.g. /foo/bar/myfile.txt)
  - File ownership and permissions
- There are also some major deviations from UNIX
  - No CWD
  - Cannot modify files once written
-
HDFS High-Level Architecture
- HDFS follows a master-slave architecture
- There are two essential daemons in HDFS
  - Master: NameNode
    - Responsible for namespace and metadata
    - Namespace: file hierarchy
    - Metadata: ownership, permissions, block locations, etc.
  - Slave: DataNode
    - Responsible for storing actual data blocks
-
Anatomy of a Small Hadoop Cluster
- The diagram shows the HDFS-related daemons on a small cluster
The "master" node will run
- NameNode daemon
Each "slave" node will run
- DataNode daemon
-
HDFS Blocks
- When a file is added to HDFS, it's split into blocks
  - This is a similar concept to native filesystems
  - HDFS uses a much larger block size (64 MB), for performance
150 MB input file
- Block #1 (64 MB)
- Block #2 (64 MB)
- Block #3 (remaining 22 MB)
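The block arithmetic from this slide can be sketched in a few lines. This only models the sizes; it is not how HDFS itself is implemented, and the function name is made up for illustration:

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block for a file of the given size."""
    sizes = []
    remaining = file_size_mb
    while remaining > 0:
        size = min(block_size_mb, remaining)  # last block may be smaller
        sizes.append(size)
        remaining -= size
    return sizes

print(split_into_blocks(150))  # [64, 64, 22]
```

Note that the final block only occupies the 22 MB it needs, as the slide indicates.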
-
HDFS Replication
- Those blocks are then replicated across machines
- The first block might be replicated to A, C and D
[Diagram: blocks #1, #2 and #3 placed on nodes A, B, C, D and E]
-
HDFS Replication (cont'd)
- The next block might be replicated to B, D and E
[Diagram: blocks #1, #2 and #3 placed on nodes A, B, C, D and E]
-
HDFS Replication (cont'd)
- The last block might be replicated to A, C and E
[Diagram: blocks #1, #2 and #3 placed on nodes A, B, C, D and E]
-
HDFS Reliability
- Replication helps to achieve reliability
  - Even when a node fails, two copies of the block remain
  - These will be re-replicated to other nodes automatically
[Diagram: nodes A, B, C, D and E, with one node marked X as failed]
- This failed node held blocks #1 and #3
- Blocks #1 and #3 are still available here
- Block #3 is still available here
- Block #1 is still available here
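The failure scenario can be replayed with the placement from the replication slides. This is a toy model, not HDFS code: the helper names are invented, and it assumes the failed node is C (node A holds the same two blocks and would behave identically):

```python
# Replica placement as described on the replication slides
PLACEMENT = {
    1: {"A", "C", "D"},
    2: {"B", "D", "E"},
    3: {"A", "C", "E"},
}

def blocks_on(node, placement):
    """Blocks that have a replica on the given node."""
    return sorted(b for b, nodes in placement.items() if node in nodes)

def after_failure(failed_node, placement):
    """Placement with every replica on the failed node removed."""
    return {b: nodes - {failed_node} for b, nodes in placement.items()}

def under_replicated(placement, target=3):
    """Blocks with fewer replicas than the target replication factor."""
    return sorted(b for b, nodes in placement.items() if len(nodes) < target)

failed = "C"  # assumption: the slide does not name the failed node
print(blocks_on(failed, PLACEMENT))    # [1, 3]
survivors = after_failure(failed, PLACEMENT)
print(under_replicated(survivors))     # [1, 3]
```

Blocks #1 and #3 each keep two replicas, so no data is lost; they are the blocks that would need to be copied to another node to get back to three replicas.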
-
DATA PROCESSING WITH MAPREDUCE
It not only works, it's functional
-
MapReduce High-Level Architecture
- Like HDFS, MapReduce has a master-slave architecture
- There are two daemons in "classical" MapReduce
  - Master: JobTracker
    - Responsible for dividing, scheduling and monitoring work
  - Slave: TaskTracker
    - Responsible for actual processing
-
Anatomy of a Small Hadoop Cluster
- The diagram shows both MapReduce and HDFS daemons
The "master" node will run
- NameNode daemon
- JobTracker daemon
Each "slave" node will run
- DataNode daemon
- TaskTracker daemon
-
Gentle Introduction to MapReduce
- MapReduce is conceptually like a UNIX pipeline
  - One function (Map) processes data
  - That output is ultimately input to another function (Reduce)
- Each piece is simple, but can be powerful when combined
$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
941 ERROR
78264 INFO
4312 WARN
-
The Map Function
- Operates on each record individually
- Typical uses include filtering, parsing, or transforming input
$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
Map
-
Intermediate Processing
- The Map function's output is grouped and sorted
  - This is the automatic "sort and shuffle" process in Hadoop
$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
Sort and Shuffle
Map
-
MapReduce History
- MapReduce is not a language, it's a programming model
  - A style of processing data you could implement in any language
- MapReduce has its roots in functional programming
  - Many languages have functions named map and reduce
  - These functions have largely the same purpose in Hadoop
- Popularized for large-scale data processing by Google
-
MapReduce Benefits
- Complex details are abstracted away from the developer
  - No file I/O
  - No networking code
  - No synchronization
- It's scalable because you process one record at a time
- A record consists of a key and corresponding value
  - We often care about only one of these
-
MapReduce Example in Python
- MapReduce code for Hadoop is typically written in Java
  - But possible to use nearly any language with Hadoop Streaming
- I'll show the log event counter using MapReduce in Python
  - It's very helpful to see the data as well as the code
-
Job Input
- Each mapper gets a chunk of job's input data to process
  - This "chunk" is called an InputSplit
  - In most cases, this corresponds to a block in HDFS
2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"
-
Python Code for Map Function
- Our map function will parse the event type
  - And then output that event (key) and a literal 1 (value)

#!/usr/bin/env python
# Boilerplate Python stuff
import sys

# Define the list of log levels
levels = ['TRACE', 'DEBUG', 'INFO',
          'WARN', 'ERROR', 'FATAL']

# Split every line (record) we receive on standard input
# into fields, normalized by case
for line in sys.stdin:
    fields = line.split()
    for field in fields:
        field = field.strip().upper()
        # If this field matches a log level, print it (and a 1)
        if field in levels:
            print("%s\t1" % field)
-
Output of Map Function
- The map function produces key/value pairs as output
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
-
Input to Reduce Function
- The Reducer receives a key and all values for that key
  - Keys are always passed to reducers in sorted order
  - Although it's not obvious here, values are unordered
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
-
Python Code for Reduce Function
- The Reducer first extracts the key and value it was passed

#!/usr/bin/env python
# Boilerplate Python stuff
import sys

# Initialize loop variables
previous_key = ''
sum = 0

# Extract the key and value passed via standard input
for line in sys.stdin:
    key, value = line.split()
    value = int(value)
    # continued on next slide
-
Python Code for Reduce Function
- Then simply adds up the value for each key

    # continued from previous slide (still inside the for loop)
    if key == previous_key:
        # If key unchanged, increment the count
        sum = sum + value
    else:
        # If key changed, print sum for previous key
        if previous_key != '':
            print('%s\t%i' % (previous_key, sum))
        # Re-init loop variables, starting from this record's value
        previous_key = key
        sum = value

# Print sum for final key
print('%s\t%i' % (previous_key, sum))
-
Output of Reduce Function
- The output of this Reduce function is a sum for each level
ERROR 1
INFO 4
WARN 2
-
Recap of Data Flow

Map input:
2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce input:
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output:
ERROR 1
INFO 4
WARN 2
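The whole flow can be imitated in a single Python process to check the numbers. This is only a sketch: `sorted` stands in for Hadoop's shuffle and sort, the function names are invented here, and nothing in it touches a real cluster:

```python
from itertools import groupby

LEVELS = {'TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'}

# The seven sample records from the Job Input slide
LOG_LINES = [
    '2012-09-06 22:16:49.391 CDT INFO "This can wait"',
    '2012-09-06 22:16:49.392 CDT INFO "Blah blah"',
    '2012-09-06 22:16:49.394 CDT WARN "Hmmm..."',
    '2012-09-06 22:16:49.395 CDT INFO "More blather"',
    '2012-09-06 22:16:49.397 CDT WARN "Hey there"',
    '2012-09-06 22:16:49.398 CDT INFO "Spewing data"',
    '2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"',
]

def map_line(line):
    """Map step: yield (level, 1) for every field that is a log level."""
    for field in line.split():
        field = field.strip().upper()
        if field in LEVELS:
            yield field, 1

def run_job(lines):
    """Run map, then sort (shuffle stand-in), then reduce by summing per key."""
    map_output = [pair for line in lines for pair in map_line(line)]
    reduce_input = sorted(map_output)
    return [(key, sum(value for _, value in group))
            for key, group in groupby(reduce_input, key=lambda kv: kv[0])]

for level, count in run_job(LOG_LINES):
    print("%s\t%d" % (level, count))
```

Running it prints ERROR 1, INFO 4 and WARN 2, matching the reduce output in the recap.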
-
Input Splits Feed the Map Tasks
- Input for the entire job is subdivided into InputSplits
  - An InputSplit usually corresponds to a single HDFS block
  - Each of these serves as input to a single Map task
Input for entire job (192 MB)
- 64 MB -> Mapper #1
- 64 MB -> Mapper #2
- 64 MB -> Mapper #3
-
Mappers Feed the Shuffle and Sort
- Output of all Mappers is partitioned, merged, and sorted (no code required: Hadoop does this automatically)
Mapper #1
Mapper #2
Mapper #N
WARN 1
WARN 1
WARN 1
WARN 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
ERROR 1
ERROR 1
ERROR 1
INFO 1
WARN 1
INFO 1
INFO 1
ERROR 1
WARN 1
INFO 1
INFO 1
INFO 1
ERROR 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1
-
Shuffle and Sort Feeds the Reducers
- All values for a given key are then collapsed into a list
  - The key and all its values are fed to reducers as input
Reducer #1
Reducer #2
WARN 1 1 1 1
INFO 1 1 1 1 1 1 1 1
ERROR 1 1 1
WARN 1
WARN 1
WARN 1
WARN 1
ERROR 1
ERROR 1
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
INFO 1
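A sketch of what the partition-and-group step does with the 15 pairs from these slides. Hadoop's default partitioner hashes the key modulo the number of reducers; `zlib.crc32` is used here only as a deterministic stand-in, so the assignment of keys to reducers in this sketch need not match the slide:

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer for a key (crc32 is a stand-in for Hadoop's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_output, num_reducers):
    """Group values by key within each reducer's partition, keys sorted."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_output:
        partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]

# Same totals as the slides: 8 INFO, 4 WARN and 3 ERROR pairs
MAP_OUTPUT = [("INFO", 1)] * 8 + [("WARN", 1)] * 4 + [("ERROR", 1)] * 3

for i, reducer_input in enumerate(shuffle(MAP_OUTPUT, 2), start=1):
    for key, values in reducer_input:
        print("Reducer #%d: %s %s -> %d" % (i, key, values, sum(values)))
```

Whatever the assignment, every pair for a given key ends up at the same reducer, which is what lets each reducer produce a complete sum for its keys.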
-
Each Reducer Has an Output File
- These are stored in HDFS below your output directory
  - Use hadoop fs -getmerge to combine them into a local copy
Reducer #1
Reducer #2
INFO 8
ERROR 3
WARN 4
-
Apache Hadoop Ecosystem: Overview
- "Core Hadoop" consists of HDFS and MapReduce
  - These are the kernel of a much broader platform
- Hadoop has many related projects
  - Some help you integrate Hadoop with other systems
  - Others help you analyze your data
  - Still others, like Oozie, help you use Hadoop more effectively
- Most are open source Apache projects like Hadoop
  - Also like Hadoop, they have funny names
- All of these are part of Cloudera's CDH distribution
-
Ecosystem: Apache Flume
[Diagram of sources: log files, syslog, custom source, program output, and many more]
-
Ecosystem: Apache Sqoop
- Integrates with any JDBC-compatible database
  - Retrieve all tables, a single table, or a portion to store in HDFS
  - Can also export data from HDFS back to the database
[Diagram: Database and Hadoop Cluster]
-
Ecosystem: Apache Hive
- Hive allows you to do SQL-like queries on data in HDFS
  - It turns this into MapReduce jobs that run on your cluster
- Reduces development time
- Makes Hadoop more accessible to non-engineers
SELECT customers.id, customers.name, SUM(orders.cost)
FROM customers INNER JOIN orders
ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;
-
Ecosystem: Apache Pig
- Apache Pig has a similar purpose to Hive
  - It has a high-level language (Pig Latin) for data analysis
  - Scripts yield MapReduce jobs that run on your cluster
- But Pig's approach is much different than Hive
-
Ecosystem: Apache HBase
- NoSQL database built on HDFS
- Low-latency and high-performance for reads and writes
- Extremely scalable
  - Tables can have billions of rows
  - And potentially millions of columns
-
You Should Be Using CDH
- Cloudera's Distribution including Apache Hadoop (CDH)
  - The most widely used distribution of Hadoop
  - A stable, proven and supported environment you can count on
- Combines Hadoop with many important ecosystem tools
  - Such as Hive, Pig, Sqoop, Flume and many more
  - All of these are integrated and work well together
- How much does it cost?
  - It's completely free
  - Apache licensed: it's 100% open source too
-
When is Hadoop (Not) a Good Choice
- Hadoop may be a great choice when
  - You need to process non-relational (unstructured) data
  - You are processing large amounts of data
  - You can run your jobs in batch mode
- Hadoop may not be a great choice when
  - You're processing small amounts of data
  - Your algorithms require communication among nodes
  - You need low latency or transactions
- As always, use the best tool for the job
  - And know how to integrate it with other systems
-
Managing The Elephant In The Room: Roles
- System Administrators
- Developers
- Analysts
- Data Stewards
-
System Administrators
- Required skills:
  - Strong Linux administration skills
  - Networking knowledge
  - Understanding of hardware
- Job responsibilities:
  - Install, configure and upgrade Hadoop software
  - Manage hardware components
  - Monitor the cluster
  - Integrate with other systems (e.g., Flume and Sqoop)
-
Developers
- Required skills:
  - Strong Java or scripting capabilities
  - Understanding of MapReduce and algorithms
- Job responsibilities:
  - Write, package and deploy MapReduce programs
  - Optimize MapReduce jobs and Hive/Pig programs
-
Data Analyst / Business Analyst
- Required skills:
  - SQL
  - Understanding data analytics/data mining
- Job responsibilities:
  - Extract intelligence from the data
  - Write Hive and/or Pig programs
-
Data Steward
- Required skills:
  - Data modeling and ETL
  - Scripting skills
- Job responsibilities:
  - Cataloging the data (analogous to a librarian for books)
  - Manage data lifecycle, retention
  - Data quality control with SLAs
-
Combining Roles
- System Administrator and Steward: analogous to DBA
- Required skills:
  - Data modeling and ETL
  - Scripting skills
  - Strong Linux administration skills
- Job responsibilities:
  - Manage data lifecycle, retention
  - Data quality control with SLAs
  - Install, configure and upgrade Hadoop software
  - Manage hardware components
  - Monitor the cluster
  - Integrate with other systems (e.g., Flume and Sqoop)
-
Conclusion
- Thanks for your time!
- Questions?