An Introduction to Hadoop Presentation.pdf


1/91


An Introduction to Hadoop
Mark Fei, Cloudera

Strata Hadoop World 2012, New York City, October 23, 2012


2/91

Who Am I?

Mark Fei, Cloudera
Durango, Colorado

Current: Senior Instructor at Cloudera
Past:
- Professional Services Education, VMware
- Senior Member Technical Staff, Hill Associates
- Sales Engineer, Nortel Networks
- Systems Programmer, large bank
- Banking applications software developer


3/91

What's Ahead?

- Solid introduction to Apache Hadoop
  - What it is
  - Why it's relevant
  - How it works
  - The Ecosystem
- No prior experience needed
- Feel free to ask questions


4/91

What is Apache Hadoop?

- Scalable data storage and processing
  - Open source Apache project
  - Harnesses the power of commodity servers
  - Distributed and fault-tolerant
- Core Hadoop consists of two main parts
  - HDFS (storage)
  - MapReduce (processing)


5/91

A large ecosystem


6/91

Who uses Hadoop?


7/91

Vendor integration

- BI / Analytics
- ETL
- Database
- OS / Cloud / System Mgmt.
- Hardware


8/91

About Cloudera

- Cloudera is "The commercial Hadoop company"
- Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
- Provides consulting and training services for Hadoop users
- Staff includes several committers to Hadoop projects


9/91

Cloudera Software

- Cloudera's Distribution including Apache Hadoop (CDH)
  - A single, easy-to-install package from the Apache Hadoop core repository
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
  - 100% open source
- Components
  - Apache Hadoop
  - Apache Hive
  - Apache Pig
  - Apache HBase
  - Apache ZooKeeper
  - Apache Flume, Apache Hue, Apache Oozie, Apache Sqoop, Apache Mahout


10/91

A Coherent Platform

- Storage
- Computation
- Integration
- Coordination
- Access

Components of the CDH Stack

- Coordination: Apache ZooKeeper
- Data Integration: Apache Flume, Apache Sqoop
- Fast Read/Write Access: Apache HBase
- Languages / Compilers: Apache Pig, Apache Hive, Apache Mahout
- Workflow: Apache Oozie
- Scheduling: Apache Oozie
- Metadata: Apache Hive
- File System Mount: FUSE-DFS
- UI Framework: Hue
- SDK: Hue SDK
- Storage and Computation: HDFS, MapReduce


11/91

Cloudera Manager, Free Edition

- End-to-end deployment and management of your CDH cluster
  - "Zero to Hadoop" in 15 minutes
  - Supports up to 50 nodes
  - Free (but not open source)


12/91

Cloudera Enterprise

- Cloudera's Distribution including Apache Hadoop (CDH)
  - Big data storage, processing and analytics platform based on CDH
- Cloudera Manager (full version)
  - End-to-end deployment, management, and operation of CDH
  - Provides sophisticated cluster monitoring tools not present in the free version
- Production support
  - A team of experts on call to help you meet your Service Level Agreements (SLAs)


13/91

Cloudera University

- Training for the entire Hadoop stack
  - Cloudera Developer Training for Apache Hadoop
  - Cloudera Administrator Training for Apache Hadoop
  - Cloudera Training for Apache HBase
  - Cloudera Training for Apache Hive and Pig
  - Cloudera Essentials for Apache Hadoop
  - More courses coming
- Public and private classes offered
  - Including customized on-site private classes
- Industry-recognized certifications
  - Cloudera Certified Developer for Apache Hadoop (CCDH)
  - Cloudera Certified Administrator for Apache Hadoop (CCAH)
  - Cloudera Certified Specialist in Apache HBase (CCSHB)


14/91

Professional Services

- Solutions Architects provide guidance and hands-on expertise
  - Use Case Discovery
  - New Hadoop Deployment
  - Proof of Concept
  - Production Pilot
  - Process and Team Development
  - Hadoop Deployment Certification


15/91

How Did Apache Hadoop Originate?

- Heavily influenced by Google's architecture
  - Notably, the Google Filesystem and MapReduce papers
- Other Web companies quickly saw the benefits
  - Early adoption by Yahoo, Facebook and others

Timeline:
- 2002: Nutch spun off from Lucene
- 2003: Google publishes GFS paper
- 2004: Google publishes MapReduce paper
- 2005: Nutch rewritten for MapReduce


16/91

Why Do We Have So Much Data?
And what are we supposed to do with it?


17/91

Velocity

- Why we're generating data faster than ever
  - Processes are increasingly automated
  - Systems are increasingly interconnected
  - People are increasingly interacting online


18/91

Variety

- What types of data are we producing?
  - Application logs
  - Text messages
  - Social network connections
  - Tweets
  - Photos
- Not all of this maps cleanly to the relational model


19/91

Volume

- The result of this is that every single day
  - Twitter processes 340 million messages
  - Facebook stores 2.7 billion comments and Likes
  - Google processes about 24 petabytes of data
- And every single minute
  - More than 200 million e-mail messages are sent
  - Foursquare processes more than 2,000 check-ins


20/91

Where Does Data Come From?

- Science
  - Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
- Industry
  - Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
- Legacy
  - Sales data, customer behavior, product databases, accounting data, etc.
- System Data
  - Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters


21/91

Analyzing Data: The Challenges

- Huge volumes of data
- Mixed sources result in many different formats
  - XML
  - CSV
  - EDI
  - Log files
  - Objects
  - SQL
  - Text
  - JSON
  - Binary
  - Etc.


22/91

What is Common Across Hadoop-able Problems?

- Nature of the data
  - Complex data
  - Multiple data sources
  - Lots of it
- Nature of the analysis
  - Batch processing
  - Parallel execution
  - Spread data over a cluster of servers and take the computation to the data


23/91

Benefits of Analyzing With Hadoop

- Previously impossible/impractical to do this analysis
- Analysis conducted at lower cost
- Analysis conducted in less time
- Greater flexibility
- Linear scalability


24/91

What Analysis is Possible With Hadoop?

- Text mining
- Index building
- Graph creation and analysis
- Pattern recognition
- Collaborative filtering
- Prediction models
- Sentiment analysis
- Risk assessment


25/91

Eight Common Hadoop-able Problems

1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. PoS transaction analysis
5. Analyzing network data to predict failure
6. Threat analysis
7. Search quality
8. Data "sandbox"


26/91

1. Modeling True Risk

Challenge:
- How much risk exposure does an organization really have with each customer?
  - Multiple sources of data and across multiple lines of business

Solution with Hadoop:
- Source and aggregate disparate data sources to build data picture
  - e.g. credit card records, call recordings, chat sessions, emails, banking activity
- Structure and analyze
  - Sentiment analysis, graph creation, pattern recognition

Typical Industry:
- Financial Services (banks, insurance companies)


27/91

2. Customer Churn Analysis

Challenge:
- Why is an organization really losing customers?
  - Data on these factors comes from different sources

Solution with Hadoop:
- Rapidly build behavioral model from disparate data sources
- Structure and analyze with Hadoop
  - Traversing
  - Graph creation
  - Pattern recognition

Typical Industry:
- Telecommunications, Financial Services


28/91

3. Recommendation Engine / Ad Targeting

Challenge:
- Using user data to predict which products to recommend

Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Collaborative filtering
  - Collecting "taste" information from many users
  - Utilizing information to predict what similar users like

Typical Industry:
- Ecommerce, Manufacturing, Retail
- Advertising


29/91

4. Point of Sale Transaction Analysis

Challenge:
- Analyzing Point of Sale (PoS) data to target promotions and manage operations
  - Sources are complex and data volumes grow across chains of stores and other sources

Solution with Hadoop:
- Batch processing framework
  - Allow execution in parallel over large datasets
- Pattern recognition
  - Optimizing over multiple data sources
  - Utilizing information to predict demand

Typical Industry:
- Retail


30/91

5. Analyzing Network Data to Predict Failure

Challenge:
- Analyzing real-time data series from a network of sensors
- Calculating average frequency over time is extremely tedious because of the need to analyze terabytes

Solution with Hadoop:
- Take the computation to the data
  - Expand from simple scans to more complex data mining
- Better understand how the network reacts to fluctuations
  - Discrete anomalies may, in fact, be interconnected
- Identify leading indicators of component failure

Typical Industry:
- Utilities, Telecommunications, Data Centers


31/91

6. Threat Analysis / Trade Surveillance

Challenge:
- Detecting threats in the form of fraudulent activity or attacks
  - Large data volumes involved
  - Like looking for a needle in a haystack

Solution with Hadoop:
- Parallel processing over huge datasets
- Pattern recognition to identify anomalies, i.e., threats

Typical Industry:
- Security, Financial Services
- General: spam fighting, click fraud


32/91

7. Search Quality

Challenge:
- Providing real-time meaningful search results

Solution with Hadoop:
- Analyzing search attempts in conjunction with structured data
- Pattern recognition
  - Browsing pattern of users performing searches in different categories

Typical Industry:
- Web, Ecommerce


33/91


34/91

Hadoop: How does it work?
Moore's law... and not


35/91

Disk Capacity and Price

- We're generating more data than ever before
- Fortunately, the size and cost of storage has kept pace
  - Capacity has increased while price has decreased

Year   Capacity (GB)   Cost per GB (USD)
1997   2.1             $157
2004   200             $1.05
2012   3,000           $0.05


36/91

Disk Capacity and Performance

- Disk performance has also increased in the last 15 years
- Unfortunately, transfer rates haven't kept pace with capacity

Year   Capacity (GB)   Transfer Rate (MB/s)   Disk Read Time
1997   2.1             16.6                   126 seconds
2004   200             56.5                   59 minutes
2012   3,000           210                    3 hours, 58 minutes
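The "Disk Read Time" column can be reproduced by dividing capacity by transfer rate. The sketch below assumes 1 GB = 1,000 MB and a fully sequential read at the quoted rate; the function name is illustrative only.

```python
# Rows taken from the table above: (year, capacity in GB, transfer rate in MB/s).
DISKS = [
    (1997, 2.1, 16.6),
    (2004, 200, 56.5),
    (2012, 3000, 210),
]


def read_time_seconds(capacity_gb, rate_mb_per_s):
    """Seconds needed to read the whole disk once, assuming 1 GB = 1000 MB."""
    return capacity_gb * 1000 / rate_mb_per_s


for year, capacity_gb, rate in DISKS:
    seconds = read_time_seconds(capacity_gb, rate)
    hours, remainder = divmod(int(seconds), 3600)
    minutes = remainder // 60
    print(f"{year}: {seconds:,.0f} s (about {hours} h {minutes} min)")
```

Capacity grew by a factor of roughly 1,400 between 1997 and 2012, while the transfer rate grew by a factor of about 13, which is why a full read went from minutes to hours.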


37/91

Architecture of a Typical HPC System

[Diagram: storage system, compute nodes, fast network]


38/91


[Diagram: storage system, compute nodes, fast network. Step 1: Copy input data]


39/91


[Diagram: storage system, compute nodes, fast network. Step 2: Process the data]


40/91


[Diagram: storage system, compute nodes, fast network. Step 3: Copy output data]


41/91

You Don't Just Need Speed...

- The problem is that we have way more data than code

$ du -ks code/
1,083
$ du -ks data/
854,632,947,314


42/91

You Need Speed At Scale

[Diagram: storage system, compute nodes, bottleneck]


43/91

HDFS: HADOOP DISTRIBUTED FILE SYSTEM
Because 10,000 hard disks are better than one


44/91

Collocated Storage and Processing

- Solution: store and process data on the same nodes
  - Data locality: "Bring the computation to the data"
  - Reduces I/O and boosts performance

[Diagram: "slave" nodes (storage and processing)]


45/91

Hard Disk Latency

- Disk seeks are expensive
- Solution: Read lots of data at once to amortize the cost

[Diagram: current location of disk head; where the data you need is stored]


46/91

Introducing HDFS

- Hadoop Distributed File System
  - Scalable storage influenced by Google's file system paper
- It's not a general-purpose filesystem
  - HDFS is optimized for Hadoop
  - Values high throughput much more than low latency
- It's a user-space Java process
  - Primarily accessed via command-line utilities and Java API


47/91

HDFS is (Mostly) UNIX-like

- In many ways, HDFS is similar to a UNIX filesystem
  - Hierarchical
  - UNIX-style paths (e.g. /foo/bar/myfile.txt)
  - File ownership and permissions
- There are also some major deviations from UNIX
  - No CWD
  - Cannot modify files once written


48/91

HDFS High-Level Architecture

- HDFS follows a master-slave architecture
- There are two essential daemons in HDFS
  - Master: NameNode
    - Responsible for namespace and metadata
    - Namespace: file hierarchy
    - Metadata: ownership, permissions, block locations, etc.
  - Slave: DataNode
    - Responsible for storing actual data blocks


49/91

Anatomy of a Small Hadoop Cluster

- The diagram shows the HDFS-related daemons on a small cluster
- Each "slave" node will run
  - DataNode daemon
- The "master" node will run
  - NameNode daemon


50/91

HDFS Blocks

- When a file is added to HDFS, it's split into blocks
- This is a similar concept to native filesystems
  - HDFS uses a much larger block size (64 MB), for performance

150 MB input file:
- Block #1 (64 MB)
- Block #2 (64 MB)
- Block #3 (remaining 22 MB)
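The block split above is plain integer division. A minimal sketch, assuming the 64 MB block size quoted on the slide (the function name is illustrative and is not part of any Hadoop API):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the leftover data; it is not padded to 64 MB.
        sizes.append(remainder)
    return sizes


print(split_into_blocks(150))  # [64, 64, 22]
```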


51/91

HDFS Replication

- Those blocks are then replicated across machines
- The first block might be replicated to A, C and D

[Diagram: blocks #1, #2 and #3; nodes A, B, C, D and E]


52/91

HDFS Replication (cont'd)

- The next block might be replicated to B, D and E

[Diagram: blocks #1, #2 and #3; nodes A, B, C, D and E]


53/91

HDFS Replication (cont'd)

- The last block might be replicated to A, C and E

[Diagram: blocks #1, #2 and #3; nodes A, B, C, D and E]


54/91

HDFS Reliability

- Replication helps to achieve reliability
  - Even when a node fails, two copies of the block remain
  - These will be re-replicated to other nodes automatically

[Diagram: nodes A to E, with one node marked X as failed. This failed node held blocks #1 and #3. Blocks #1 and #3 are still available on a second node, block #3 on a third, and block #1 on a fourth.]
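The failure scenario can be checked with a small simulation. This is only a sketch: the placement comes from the replication slides, the choice of C as the failed node is an assumption (A would fit the diagram equally well), and none of these names correspond to real HDFS APIs.

```python
# Replica placement from the replication slides (three copies per block).
PLACEMENT = {
    "block1": {"A", "C", "D"},
    "block2": {"B", "D", "E"},
    "block3": {"A", "C", "E"},
}


def surviving_replicas(placement, failed_node):
    """Replica locations that remain once failed_node is lost."""
    return {block: nodes - {failed_node} for block, nodes in placement.items()}


def under_replicated(placement, failed_node, target=3):
    """Blocks that now have fewer than `target` copies and need re-replication."""
    survivors = surviving_replicas(placement, failed_node)
    return sorted(block for block, nodes in survivors.items() if len(nodes) < target)


# Assumption: node C is the one that failed.
for block, nodes in sorted(surviving_replicas(PLACEMENT, "C").items()):
    print(block, sorted(nodes))
print("needs re-replication:", under_replicated(PLACEMENT, "C"))
```

With three replicas per block, losing any single node leaves at least two copies of every block, which is the property the slide relies on.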


55/91

DATA PROCESSING WITH MAPREDUCE
It not only works, it's functional


56/91

MapReduce High-Level Architecture

- Like HDFS, MapReduce has a master-slave architecture
- There are two daemons in "classical" MapReduce
  - Master: JobTracker
    - Responsible for dividing, scheduling and monitoring work
  - Slave: TaskTracker
    - Responsible for actual processing


57/91

Anatomy of a Small Hadoop Cluster

- The diagram shows both MapReduce and HDFS daemons
- Each "slave" node will run
  - DataNode daemon
  - TaskTracker daemon
- The "master" node will run
  - NameNode daemon
  - JobTracker daemon


58/91

Gentle Introduction to MapReduce

- MapReduce is conceptually like a UNIX pipeline
  - One function (Map) processes data
  - That output is ultimately input to another function (Reduce)
- Each piece is simple, but can be powerful when combined

$ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
941 ERROR
78264 INFO
4312 WARN
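For readers less familiar with the shell, here is a rough Python equivalent of the pipeline, one step per stage. It is a sketch only: the sample lines are hypothetical, and they assume the log level sits in the third tab-separated field, as `cut -f3` implies.

```python
from collections import Counter


def count_levels(lines, levels=("INFO", "WARN", "ERROR")):
    """Roughly mirror: egrep 'INFO|WARN|ERROR' | cut -f3 | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        # egrep step: keep only lines that mention one of the levels.
        if not any(level in line for level in levels):
            continue
        # cut -f3 step: take the third tab-separated field.
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    # sort | uniq -c step: one (value, count) pair per distinct value, sorted.
    return sorted(counts.items())


# Hypothetical tab-separated log lines, for illustration only.
SAMPLE = [
    "2012-09-06\t22:16:49\tINFO\tstarting up",
    "2012-09-06\t22:16:50\tWARN\tdisk nearly full",
    "2012-09-06\t22:16:51\tDEBUG\tcache state dumped",
    "2012-09-06\t22:16:52\tINFO\trequest served",
    "2012-09-06\t22:16:53\tERROR\trequest failed",
]

for level, count in count_levels(SAMPLE):
    print(count, level)
```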


61/91


62/91

MapReduce History

- MapReduce is not a language, it's a programming model
  - A style of processing data you could implement in any language
- MapReduce has its roots in functional programming
  - Many languages have functions named map and reduce
  - These functions have largely the same purpose in Hadoop
- Popularized for large-scale data processing by Google
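As an illustration of those functional roots, Python itself ships a built-in `map` and `functools.reduce`. The word list below is made up for the example; this runs on a single machine and only shows the style, not Hadoop itself.

```python
from functools import reduce

words = ["hadoop", "hdfs", "mapreduce"]

# map: apply one function to every element independently.
lengths = list(map(len, words))

# reduce: fold the mapped values into a single result.
total = reduce(lambda running, value: running + value, lengths, 0)

print(lengths)  # [6, 4, 9]
print(total)    # 19
```

Because each `map` call looks at one element in isolation, the work can be spread across machines, which is the property Hadoop exploits.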


63/91

MapReduce Benefits

- Complex details are abstracted away from the developer
  - No file I/O
  - No networking code
  - No synchronization
- It's scalable because you process one record at a time
- A record consists of a key and corresponding value
  - We often care about only one of these


64/91

MapReduce Example in Python

- MapReduce code for Hadoop is typically written in Java
  - But possible to use nearly any language with Hadoop Streaming
- I'll show the log event counter using MapReduce in Python
  - It's very helpful to see the data as well as the code


65/91

Job Input

- Each mapper gets a chunk of job's input data to process
  - This "chunk" is called an InputSplit
  - In most cases, this corresponds to a block in HDFS

2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"


66/91

Python Code for Map Function

- Our map function will parse the event type
  - And then output that event (key) and a literal 1 (value)

    #!/usr/bin/env python
    # Boilerplate Python stuff
    import sys

    # Define list of log levels
    levels = ['TRACE', 'DEBUG', 'INFO',
              'WARN', 'ERROR', 'FATAL']

    # Split every line (record) we receive on standard input
    # into fields, normalized by case
    for line in sys.stdin:
        fields = line.split()
        for field in fields:
            field = field.strip().upper()
            # If this field matches a log level, print it (and a 1)
            if field in levels:
                print("%s\t1" % field)


67/91

Output of Map Function

- The map function produces key/value pairs as output

INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1


68/91

Input to Reduce Function

- The Reducer receives a key and all values for that key
  - Keys are always passed to reducers in sorted order
  - Although it's not obvious here, values are unordered

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1


69/91

Python Code for Reduce Function

- The Reducer first extracts the key and value it was passed

    #!/usr/bin/env python
    # Boilerplate Python stuff
    import sys

    # Initialize loop variables
    previous_key = ''
    sum = 0

    # Extract the key and value passed via standard input
    for line in sys.stdin:
        key, value = line.split()
        value = int(value)
        # continued on next slide


70/91

Python Code for Reduce Function

- Then simply adds up the value for each key

        # continued from previous slide
        # If key unchanged, increment the count
        if key == previous_key:
            sum = sum + value
        else:
            # If key changed, print sum for previous key
            if previous_key != '':
                print('%s\t%i' % (previous_key, sum))
            # Re-init loop variables, starting the new sum from this record's value
            previous_key = key
            sum = value

    # Print sum for final key
    print('%s\t%i' % (previous_key, sum))


71/91

Output of Reduce Function

- The output of this Reduce function is a sum for each level

ERROR 1
INFO 4
WARN 2


72/91

Recap of Data Flow

Map input:
2012-09-06 22:16:49.391 CDT INFO "This can wait"
2012-09-06 22:16:49.392 CDT INFO "Blah blah"
2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
2012-09-06 22:16:49.395 CDT INFO "More blather"
2012-09-06 22:16:49.397 CDT WARN "Hey there"
2012-09-06 22:16:49.398 CDT INFO "Spewing data"
2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

Map output:
INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Reduce input:
ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output:
ERROR 1
INFO 4
WARN 2
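The whole flow can be tried on a laptop without a cluster. The sketch below runs the same map, sort and reduce logic in one process on the seven sample records. It is a local simulation under simplifying assumptions (one mapper, one reducer, everything in memory), not how Hadoop Streaming runs the scripts, and the function names are illustrative.

```python
from itertools import groupby

LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

LOG_LINES = [
    '2012-09-06 22:16:49.391 CDT INFO "This can wait"',
    '2012-09-06 22:16:49.392 CDT INFO "Blah blah"',
    '2012-09-06 22:16:49.394 CDT WARN "Hmmm..."',
    '2012-09-06 22:16:49.395 CDT INFO "More blather"',
    '2012-09-06 22:16:49.397 CDT WARN "Hey there"',
    '2012-09-06 22:16:49.398 CDT INFO "Spewing data"',
    '2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"',
]


def map_line(line):
    """Map step: emit (level, 1) for every field that is a log level."""
    for field in line.split():
        field = field.strip().upper()
        if field in LEVELS:
            yield field, 1


def run_job(lines):
    """Run map, then sort by key, then reduce, all in memory."""
    map_output = [pair for line in lines for pair in map_line(line)]
    # Stand-in for shuffle and sort: order the pairs by key.
    reduce_input = sorted(map_output, key=lambda pair: pair[0])
    # Reduce step: sum the values of each run of identical keys.
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(reduce_input, key=lambda pair: pair[0])
    ]


for key, total in run_job(LOG_LINES):
    print("%s\t%i" % (key, total))
```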


73/91

Input Splits Feed the Map Tasks

- Input for the entire job is subdivided into InputSplits
  - An InputSplit usually corresponds to a single HDFS block
  - Each of these serves as input to a single Map task

[Diagram: input for entire job (192 MB) divided into three 64 MB splits, feeding Mapper #1, Mapper #2 and Mapper #3]


74/91

Mappers Feed the Shuffle and Sort

- Output of all Mappers is partitioned, merged, and sorted (No code required: Hadoop does this automatically)

[Diagram: Mapper #1, Mapper #2, ... Mapper #N each emit five unsorted records:
  INFO 1, WARN 1, INFO 1, INFO 1, ERROR 1
  WARN 1, INFO 1, INFO 1, INFO 1, ERROR 1
  WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
After the shuffle and sort, the records are grouped by key: WARN 1 (4 records), INFO 1 (8 records), ERROR 1 (3 records)]


75/91

Shuffle and Sort Feeds the Reducers

- All values for a given key are then collapsed into a list
  - The key and all its values are fed to reducers as input

[Diagram: the grouped records WARN 1 (4 records), ERROR 1 (3 records) and INFO 1 (8 records) are collapsed into one list per key and passed to Reducer #1 and Reducer #2:
  WARN 1 1 1 1
  INFO 1 1 1 1 1 1 1 1
  ERROR 1 1 1]
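The partition-and-group step can be sketched in a few lines. This is an illustration under stated assumptions, not Hadoop's implementation: the CRC32-based `partition` function is a stand-in for whatever partitioner the job uses, and the sample mapper outputs are arranged only to match the record counts on the slides.

```python
import zlib
from collections import defaultdict

# Sample outputs of three mappers (4 WARN, 8 INFO and 3 ERROR records in total).
MAPPER_OUTPUTS = [
    [("INFO", 1), ("WARN", 1), ("INFO", 1), ("INFO", 1), ("ERROR", 1)],
    [("WARN", 1), ("INFO", 1), ("INFO", 1), ("INFO", 1), ("ERROR", 1)],
    [("WARN", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1), ("ERROR", 1)],
]


def partition(key, num_reducers):
    """Pick a reducer for a key (hypothetical stand-in: CRC32 of the key)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Group values by key and route each key to exactly one reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            reducers[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, with all their values.
    return [dict(sorted(groups.items())) for groups in reducers]


assignments = shuffle(MAPPER_OUTPUTS, num_reducers=2)
for number, groups in enumerate(assignments, start=1):
    for key, values in groups.items():
        print("Reducer #%d: %s %s" % (number, key, " ".join(map(str, values))))
```

The point to notice is that every record for a given key lands on the same reducer, so each reducer can total its keys without talking to the others.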


76/91

Each Reducer Has an Output File

- These are stored in HDFS below your output directory
- Use hadoop fs -getmerge to combine them into a local copy

[Diagram: Reducer #1 and Reducer #2 write the output records INFO 8, ERROR 3 and WARN 4]


77/91

Apache Hadoop Ecosystem: Overview

- "Core Hadoop" consists of HDFS and MapReduce
  - These are the kernel of a much broader platform
- Hadoop has many related projects
  - Some help you integrate Hadoop with other systems
  - Others help you analyze your data
  - Still others, like Oozie, help you use Hadoop more effectively
- Most are open source Apache projects like Hadoop
  - Also like Hadoop, they have funny names
- All of these are part of Cloudera's CDH distribution


78/91

Ecosystem: Apache Flume

[Diagram: data sources include log files, syslog, program output, custom source, and many more]


79/91

Ecosystem: Apache Sqoop

- Integrates with any JDBC-compatible database
  - Retrieve all tables, a single table, or a portion to store in HDFS
  - Can also export data from HDFS back to the database

[Diagram: Database and Hadoop Cluster]


80/91

Ecosystem: Apache Hive

- Hive allows you to do SQL-like queries on data in HDFS
  - It turns this into MapReduce jobs that run on your cluster
  - Reduces development time
  - Makes Hadoop more accessible to non-engineers

SELECT customers.id, customers.name, sum(orders.cost)
FROM customers INNER JOIN orders
ON (customers.id = orders.customer_id)
WHERE customers.zipcode = '63105'
GROUP BY customers.id, customers.name;


81/91

Ecosystem: Apache Pig

- Apache Pig has a similar purpose to Hive
  - It has a high-level language (Pig Latin) for data analysis
  - Scripts yield MapReduce jobs that run on your cluster
- But Pig's approach is much different than Hive


82/91

Ecosystem: Apache HBase

- NoSQL database built on HDFS
- Low-latency and high-performance for reads and writes
- Extremely scalable
  - Tables can have billions of rows
  - And potentially millions of columns


83/91

You Should Be Using CDH

- Cloudera's Distribution including Apache Hadoop (CDH)
  - The most widely used distribution of Hadoop
  - A stable, proven and supported environment you can count on
- Combines Hadoop with many important ecosystem tools
  - Such as Hive, Pig, Sqoop, Flume and many more
  - All of these are integrated and work well together
- How much does it cost?
  - It's completely free
  - Apache licensed: it's 100% open source too


84/91

When is Hadoop (Not) a Good Choice

- Hadoop may be a great choice when
  - You need to process non-relational (unstructured) data
  - You are processing large amounts of data
  - You can run your jobs in batch mode
- Hadoop may not be a great choice when
  - You're processing small amounts of data
  - Your algorithms require communication among nodes
  - You need low latency or transactions
- As always, use the best tool for the job
  - And know how to integrate it with other systems


85/91

Managing The Elephant In The Room - Roles

- System Administrators
- Developers
- Analysts
- Data Stewards


86/91

System Administrators

Required skills:
- Strong Linux administration skills
- Networking knowledge
- Understanding of hardware

Job responsibilities:
- Install, configure and upgrade Hadoop software
- Manage hardware components
- Monitor the cluster
- Integrate with other systems (e.g., Flume and Sqoop)


87/91

Developers

Required skills:
- Strong Java or scripting capabilities
- Understanding of MapReduce and algorithms

Job responsibilities:
- Write, package and deploy MapReduce programs
- Optimize MapReduce jobs and Hive/Pig programs


88/91

Data Analyst / Business Analyst

Required skills:
- SQL
- Understanding data analytics/data mining

Job responsibilities:
- Extract intelligence from the data
- Write Hive and/or Pig programs


89/91

Data Steward

Required skills:
- Data modeling and ETL
- Scripting skills

Job responsibilities:
- Cataloging the data (analogous to a librarian for books)
- Manage data lifecycle, retention
- Data quality control with SLAs


90/91

Combining Roles

System Administrator + Steward: analogous to DBA

Required skills:
- Data modeling and ETL
- Scripting skills
- Strong Linux administration skills

Job responsibilities:
- Manage data lifecycle, retention
- Data quality control with SLAs
- Install, configure and upgrade Hadoop software
- Manage hardware components
- Monitor the cluster
- Integrate with other systems (e.g., Flume and Sqoop)


91/91

Conclusion

Thanks for your time!
Questions?
