An Introduction to Hadoop Presentation.pdf


  • 7/29/2019 An Introduction to Hadoop Presentation.pdf


    An Introduction to Hadoop
    Mark Fei, Cloudera

    Strata Hadoop World 2012, New York City, October 23, 2012


    Who Am I?

    Mark Fei, Cloudera
    Durango, Colorado

    Current: Senior Instructor at Cloudera
    Past:
    - Professional Services Education, VMware
    - Senior Member Technical Staff, Hill Associates
    - Sales Engineer, Nortel Networks
    - Systems Programmer, large bank
    - Banking applications software developer


    What's Ahead?
    - Solid introduction to Apache Hadoop
      - What it is
      - Why it's relevant
      - How it works
      - The ecosystem
    - No prior experience needed
    - Feel free to ask questions


    What is Apache Hadoop?
    - Scalable data storage and processing
      - Open source Apache project
      - Harnesses the power of commodity servers
      - Distributed and fault-tolerant
    - Core Hadoop consists of two main parts
      - HDFS (storage)
      - MapReduce (processing)


    A large ecosystem


    Who uses Hadoop?


    Vendor integration
    - BI / Analytics
    - ETL
    - Database
    - OS / Cloud / System Mgmt.
    - Hardware


    About Cloudera
    - Cloudera is the commercial Hadoop company
    - Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
    - Provides consulting and training services for Hadoop users
    - Staff includes several committers to Hadoop projects


    Cloudera Software
    - Cloudera's Distribution including Apache Hadoop (CDH)
      - A single, easy-to-install package from the Apache Hadoop core repository
      - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
      - 100% open source
    - Components
      - Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache ZooKeeper
      - Apache Flume, Apache Hue, Apache Oozie, Apache Sqoop, Apache Mahout


    A Coherent Platform

    [Diagram: Components of the CDH Stack, grouped into Storage, Computation, Integration, Coordination and Access]
    - Storage and computation: HDFS, MapReduce
    - Coordination: Apache ZooKeeper
    - Data integration: Apache Flume, Apache Sqoop
    - Fast read/write access: Apache HBase
    - Languages / compilers: Apache Pig, Apache Hive, Apache Mahout
    - Workflow: Apache Oozie
    - Scheduling: Apache Oozie
    - Metadata: Apache Hive
    - File system mount: FUSE-DFS
    - UI framework: Hue
    - SDK: Hue SDK


    Cloudera Manager, Free Edition
    - End-to-end deployment and management of your CDH cluster
    - Zero to Hadoop in 15 minutes
    - Supports up to 50 nodes
    - Free (but not open source)


    Cloudera Enterprise
    - Cloudera's Distribution including Apache Hadoop (CDH)
    - Big data storage, processing and analytics platform based on CDH
    - Cloudera Manager (full version)
      - End-to-end deployment, management, and operation of CDH
      - Provides sophisticated cluster monitoring tools not present in the free version
    - Production support
      - A team of experts on call to help you meet your Service Level Agreements (SLAs)


    Cloudera University
    - Training for the entire Hadoop stack
      - Cloudera Developer Training for Apache Hadoop
      - Cloudera Administrator Training for Apache Hadoop
      - Cloudera Training for Apache HBase
      - Cloudera Training for Apache Hive and Pig
      - Cloudera Essentials for Apache Hadoop
      - More courses coming
    - Public and private classes offered
      - Including customized on-site private classes
    - Industry-recognized certifications
      - Cloudera Certified Developer for Apache Hadoop (CCDH)
      - Cloudera Certified Administrator for Apache Hadoop (CCAH)
      - Cloudera Certified Specialist in Apache HBase (CCSHB)


    Professional Services
    - Solutions Architects provide guidance and hands-on expertise
      - Use Case Discovery
      - New Hadoop Deployment
      - Proof of Concept
      - Production Pilot
      - Process and Team Development
      - Hadoop Deployment Certification


    How Did Apache Hadoop Originate?
    - Heavily influenced by Google's architecture
      - Notably, the Google Filesystem and MapReduce papers
    - Other Web companies quickly saw the benefits
      - Early adoption by Yahoo, Facebook and others

    [Timeline]
    - 2002: Nutch spun off from Lucene
    - 2003: Google publishes GFS paper
    - 2004: Google publishes MapReduce paper
    - 2005: Nutch rewritten for MapReduce


    Why Do We Have So Much Data?
    And what are we supposed to do with it?


    Velocity
    - Why we're generating data faster than ever
      - Processes are increasingly automated
      - Systems are increasingly interconnected
      - People are increasingly interacting online


    Variety
    - What types of data are we producing?
      - Application logs
      - Text messages
      - Social network connections
      - Tweets
      - Photos
    - Not all of this maps cleanly to the relational model


    Volume
    - The result of this is that every single day
      - Twitter processes 340 million messages
      - Facebook stores 2.7 billion comments and Likes
      - Google processes about 24 petabytes of data
    - And every single minute
      - More than 200 million e-mail messages are sent
      - Foursquare processes more than 2,000 check-ins


    Where Does Data Come From?
    - Science
      - Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
    - Industry
      - Financial, pharmaceutical, manufacturing, insurance, online, energy, retail data
    - Legacy
      - Sales data, customer behavior, product databases, accounting data, etc.
    - System Data
      - Log files, health & status feeds, activity streams, network messages, Web analytics, intrusion detection, spam filters


    Analyzing Data: The Challenges
    - Huge volumes of data
    - Mixed sources result in many different formats
      - XML, CSV, EDI, log files, objects, SQL, text, JSON, binary, etc.


    What is Common Across Hadoop-able Problems?
    - Nature of the data
      - Complex data
      - Multiple data sources
      - Lots of it
    - Nature of the analysis
      - Batch processing
      - Parallel execution
      - Spread data over a cluster of servers and take the computation to the data


    Benefits of Analyzing With Hadoop
    - Previously impossible/impractical to do this analysis
    - Analysis conducted at lower cost
    - Analysis conducted in less time
    - Greater flexibility
    - Linear scalability


    What Analysis is Possible With Hadoop?
    - Text mining
    - Index building
    - Graph creation and analysis
    - Pattern recognition
    - Collaborative filtering
    - Prediction models
    - Sentiment analysis
    - Risk assessment


    Eight Common Hadoop-able Problems
    1. Modeling true risk
    2. Customer churn analysis
    3. Recommendation engine
    4. PoS transaction analysis
    5. Analyzing network data to predict failure
    6. Threat analysis
    7. Search quality
    8. Data sandbox


    1. Modeling True Risk
    Challenge:
    - How much risk exposure does an organization really have with each customer?
      - Multiple sources of data and across multiple lines of business
    Solution with Hadoop:
    - Source and aggregate disparate data sources to build data picture
      - e.g. credit card records, call recordings, chat sessions, emails, banking activity
    - Structure and analyze
      - Sentiment analysis, graph creation, pattern recognition
    Typical Industry:
    - Financial Services (banks, insurance companies)


    2. Customer Churn Analysis
    Challenge:
    - Why is an organization really losing customers?
      - Data on these factors comes from different sources
    Solution with Hadoop:
    - Rapidly build behavioral model from disparate data sources
    - Structure and analyze with Hadoop
      - Traversing
      - Graph creation
      - Pattern recognition
    Typical Industry:
    - Telecommunications, Financial Services


    3. Recommendation Engine / Ad Targeting
    Challenge:
    - Using user data to predict which products to recommend
    Solution with Hadoop:
    - Batch processing framework
      - Allow execution in parallel over large datasets
    - Collaborative filtering
      - Collecting taste information from many users
      - Utilizing information to predict what similar users like
    Typical Industry:
    - Ecommerce, Manufacturing, Retail
    - Advertising


    4. Point of Sale Transaction Analysis
    Challenge:
    - Analyzing Point of Sale (PoS) data to target promotions and manage operations
      - Sources are complex and data volumes grow across chains of stores and other sources
    Solution with Hadoop:
    - Batch processing framework
      - Allow execution in parallel over large datasets
    - Pattern recognition
      - Optimizing over multiple data sources
      - Utilizing information to predict demand
    Typical Industry:
    - Retail


    5. Analyzing Network Data to Predict Failure
    Challenge:
    - Analyzing real-time data series from a network of sensors
    - Calculating average frequency over time is extremely tedious because of the need to analyze terabytes
    Solution with Hadoop:
    - Take the computation to the data
      - Expand from simple scans to more complex data mining
    - Better understand how the network reacts to fluctuations
      - Discrete anomalies may, in fact, be interconnected
    - Identify leading indicators of component failure
    Typical Industry:
    - Utilities, Telecommunications, Data Centers


    6. Threat Analysis / Trade Surveillance
    Challenge:
    - Detecting threats in the form of fraudulent activity or attacks
      - Large data volumes involved
      - Like looking for a needle in a haystack
    Solution with Hadoop:
    - Parallel processing over huge datasets
    - Pattern recognition to identify anomalies, i.e., threats
    Typical Industry:
    - Security, Financial Services
    - General: spam fighting, click fraud


    7. Search Quality
    Challenge:
    - Providing real-time meaningful search results
    Solution with Hadoop:
    - Analyzing search attempts in conjunction with structured data
    - Pattern recognition
      - Browsing pattern of users performing searches in different categories
    Typical Industry:
    - Web, Ecommerce


    Hadoop: How does it work?
    Moore's law... and not


    Disk Capacity and Price
    - We're generating more data than ever before
    - Fortunately, the size and cost of storage has kept pace
      - Capacity has increased while price has decreased

    Year   Capacity (GB)   Cost per GB (USD)
    1997   2.1             $157
    2004   200             $1.05
    2012   3,000           $0.05


    Disk Capacity and Performance
    - Disk performance has also increased in the last 15 years
    - Unfortunately, transfer rates haven't kept pace with capacity

    Year   Capacity (GB)   Transfer Rate (MB/s)   Disk Read Time
    1997   2.1             16.6                   126 seconds
    2004   200             56.5                   59 minutes
    2012   3,000           210                    3 hours, 58 minutes
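The Disk Read Time column can be reproduced by dividing capacity by transfer rate. The sketch below does that for the three rows of the table; it assumes 1 GB = 1,000 MB and a purely sequential read, and the function name is only illustrative.

```python
# Full-disk read time = capacity / sustained transfer rate.
# Rows are (year, capacity in GB, transfer rate in MB/s) from the table above.
DISKS = [
    (1997, 2.1, 16.6),
    (2004, 200, 56.5),
    (2012, 3000, 210),
]


def read_time_seconds(capacity_gb, rate_mb_per_s):
    """Seconds needed to read the whole disk once, assuming 1 GB = 1,000 MB."""
    return capacity_gb * 1000 / rate_mb_per_s


for year, capacity_gb, rate in DISKS:
    seconds = read_time_seconds(capacity_gb, rate)
    print("%d: %.0f seconds (about %.1f minutes)" % (year, seconds, seconds / 60))
```

Capacity grew by a factor of roughly 1,400 between 1997 and 2012 while the transfer rate grew by a factor of about 13, which is why the time to scan one full disk went from minutes to hours.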


    Architecture of a Typical HPC System

    [Diagram: a storage system connected to compute nodes over a fast network]
    - Step 1: Copy input data
    - Step 2: Process the data
    - Step 3: Copy output data


    You Don't Just Need Speed
    - The problem is that we have way more data than code

    $ du -ks code/
    1,083
    $ du -ks data/
    854,632,947,314


    You Need Speed At Scale

    [Diagram: the link between the storage system and the compute nodes is the bottleneck]


    HDFS: HADOOP DISTRIBUTED FILE SYSTEM
    Because 10,000 hard disks are better than one


    Collocated Storage and Processing
    - Solution: store and process data on the same nodes
      - Data locality: Bring the computation to the data
      - Reduces I/O and boosts performance

    [Diagram: "slave" nodes (storage and processing)]


    Hard Disk Latency
    - Disk seeks are expensive
    - Solution: Read lots of data at once to amortize the cost

    [Diagram: current location of disk head, and where the data you need is stored]
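A rough back-of-the-envelope calculation shows why reading more data per seek amortizes the cost. The seek time and transfer rate below are assumptions picked for illustration, not figures from the presentation.

```python
# Hypothetical drive characteristics, chosen only for illustration.
SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_MB_S = 100.0   # assumed sustained transfer rate: 100 MB/s


def seek_overhead(read_size_mb):
    """Fraction of total time spent seeking when every read costs one seek."""
    transfer_time = read_size_mb / TRANSFER_RATE_MB_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)


# Compare a 4 KB read, a 1 MB read and a 64 MB read.
for size_mb in (0.004, 1, 64):
    print("%8.3f MB per seek: %5.1f%% of the time is spent seeking"
          % (size_mb, 100 * seek_overhead(size_mb)))
```

Under these assumed numbers, small reads spend almost all of their time seeking, while a 64 MB read spends only a small fraction of it seeking.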


    Introducing HDFS
    - Hadoop Distributed File System
      - Scalable storage influenced by Google's file system paper
    - It's not a general-purpose filesystem
      - HDFS is optimized for Hadoop
      - Values high throughput much more than low latency
    - It's a user-space Java process
      - Primarily accessed via command-line utilities and Java API


    HDFS is (Mostly) UNIX-like
    - In many ways, HDFS is similar to a UNIX filesystem
      - Hierarchical
      - UNIX-style paths (e.g. /foo/bar/myfile.txt)
      - File ownership and permissions
    - There are also some major deviations from UNIX
      - No CWD
      - Cannot modify files once written


    HDFS High-Level Architecture
    - HDFS follows a master-slave architecture
    - There are two essential daemons in HDFS
      - Master: NameNode
        - Responsible for namespace and metadata
        - Namespace: file hierarchy
        - Metadata: ownership, permissions, block locations, etc.
      - Slave: DataNode
        - Responsible for storing actual data blocks


    Anatomy of a Small Hadoop Cluster
    - The diagram shows the HDFS-related daemons on a small cluster
    - The "master" node will run
      - NameNode daemon
    - Each "slave" node will run
      - DataNode daemon


    HDFS Blocks
    - When a file is added to HDFS, it's split into blocks
    - This is a similar concept to native filesystems
      - HDFS uses a much larger block size (64 MB), for performance

    [Diagram: 150 MB input file]
    - Block #1 (64 MB)
    - Block #2 (64 MB)
    - Block #3 (remaining 22 MB)
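The block arithmetic for the 150 MB example can be sketched in a few lines. This is only a model of the sizes involved, not how HDFS itself is implemented, and `split_into_blocks` is a hypothetical helper name.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block a file of the given size would occupy."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full-sized except possibly the last one.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks


print(split_into_blocks(150))
```

For the 150 MB file this yields two full 64 MB blocks and a final 22 MB block, matching the diagram.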


    HDFS Replication
    - Those blocks are then replicated across machines
    - The first block might be replicated to A, C and D
    - The next block might be replicated to B, D and E
    - The last block might be replicated to A, C and E

    [Diagram: blocks #1, #2 and #3 distributed across nodes A, B, C, D and E]


    HDFS Reliability
    - Replication helps to achieve reliability
      - Even when a node fails, two copies of the block remain
      - These will be re-replicated to other nodes automatically

    [Diagram: nodes A, B, C, D and E, with one node marked X as failed]
    - This failed node held blocks #1 and #3
    - Blocks #1 and #3 are still available here
    - Block #3 is still available here
    - Block #1 is still available here
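The failure scenario can be modelled with the placement given on the replication slides (block #1 on A, C and D; block #2 on B, D and E; block #3 on A, C and E). The sketch below is a toy model only: the function names are hypothetical, the choice of node A as the failed node is an assumption (the slide only says the failed node held blocks #1 and #3), and real HDFS placement and re-replication are decided by the NameNode.

```python
# Block number -> set of nodes holding a replica (from the replication slides).
placement = {
    1: {"A", "C", "D"},
    2: {"B", "D", "E"},
    3: {"A", "C", "E"},
}


def surviving_replicas(placement, failed_node):
    """Return block -> nodes that still hold a replica after one node fails."""
    return {block: nodes - {failed_node} for block, nodes in placement.items()}


def under_replicated(placement, failed_node, target=3):
    """Blocks left with fewer than `target` replicas once the node is gone."""
    survivors = surviving_replicas(placement, failed_node)
    return sorted(b for b, nodes in survivors.items() if len(nodes) < target)


# Assumption: node A is the one that failed.
print(surviving_replicas(placement, "A"))
print(under_replicated(placement, "A"))
```

Every block keeps at least two replicas after a single node failure, and the blocks reported as under-replicated are the ones that would need to be copied to another node.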


    DATA PROCESSING WITH MAPREDUCE
    It not only works, it's functional


    MapReduce High-Level Architecture
    - Like HDFS, MapReduce has a master-slave architecture
    - There are two daemons in classical MapReduce
      - Master: JobTracker
        - Responsible for dividing, scheduling and monitoring work
      - Slave: TaskTracker
        - Responsible for actual processing


    Anatomy of a Small Hadoop Cluster
    - The diagram shows both MapReduce and HDFS daemons
    - The "master" node will run
      - NameNode daemon
      - JobTracker daemon
    - Each "slave" node will run
      - DataNode daemon
      - TaskTracker daemon


    Gentle Introduction to MapReduce
    - MapReduce is conceptually like a UNIX pipeline
      - One function (Map) processes data
      - That output is ultimately input to another function (Reduce)
    - Each piece is simple, but can be powerful when combined

    $ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c
        941 ERROR
      78264 INFO
       4312 WARN
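For readers less familiar with the shell tools, the pipeline above can be approximated in Python. This sketch assumes tab-separated log lines with the level in the third field (which is what `cut -f3` implies); the sample lines are invented, and lines with fewer than three fields are skipped, which is a simplification of what `cut` does.

```python
import re
from collections import Counter

LEVEL_PATTERN = re.compile(r"INFO|WARN|ERROR")


def count_levels(lines):
    """Count the third tab-separated field of every line that mentions a level."""
    counts = Counter()
    for line in lines:
        if LEVEL_PATTERN.search(line):              # egrep 'INFO|WARN|ERROR'
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1              # cut -f3 | sort | uniq -c
    return counts


# Hypothetical sample input, invented for this example.
sample = [
    "2012-09-06\t22:16:49\tINFO\tThis can wait\n",
    "2012-09-06\t22:16:50\tERROR\tOh boy!\n",
    "2012-09-06\t22:16:51\tINFO\tBlah blah\n",
    "2012-09-06\t22:16:52\tDEBUG\tignored by the filter\n",
]

counts = count_levels(sample)
for level in sorted(counts):
    print("%7d %s" % (counts[level], level))
```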


    The Map Function
    - Operates on each record individually
    - Typical uses include filtering, parsing, or transforming input

    $ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c

    [Pipeline stage highlighted: Map]


    Intermediate Processing
    - The Map function's output is grouped and sorted
      - This is the automatic sort and shuffle process in Hadoop

    $ egrep 'INFO|WARN|ERROR' app.log | cut -f3 | sort | uniq -c

    [Pipeline stages highlighted: Map, Sort and Shuffle]


    MapReduce History
    - MapReduce is not a language, it's a programming model
      - A style of processing data you could implement in any language
    - MapReduce has its roots in functional programming
      - Many languages have functions named map and reduce
      - These functions have largely the same purpose in Hadoop
    - Popularized for large-scale data processing by Google
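Python is one of the languages with functions named map and reduce, so the functional roots are easy to demonstrate. The following is a small in-memory illustration, not Hadoop code; the list of levels is sample data chosen for this example.

```python
from functools import reduce

levels = ["INFO", "INFO", "WARN", "INFO", "WARN", "INFO", "ERROR"]

# map: apply one function to every record independently.
pairs = list(map(lambda level: (level, 1), levels))

# reduce: fold many values into a single result.
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)

print(pairs[:2], total)
```

The Hadoop versions differ mainly in scale: map calls run in parallel across the cluster, and reduce is applied once per key instead of once over the whole list.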


    MapReduce Benefits
    - Complex details are abstracted away from the developer
      - No file I/O
      - No networking code
      - No synchronization
    - It's scalable because you process one record at a time
    - A record consists of a key and corresponding value
      - We often care about only one of these


    MapReduce Example in Python
    - MapReduce code for Hadoop is typically written in Java
      - But possible to use nearly any language with Hadoop Streaming
    - I'll show the log event counter using MapReduce in Python
      - It's very helpful to see the data as well as the code


    Job Input
    - Each mapper gets a chunk of the job's input data to process
      - This chunk is called an InputSplit
      - In most cases, this corresponds to a block in HDFS

    2012-09-06 22:16:49.391 CDT INFO "This can wait"
    2012-09-06 22:16:49.392 CDT INFO "Blah blah"
    2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
    2012-09-06 22:16:49.395 CDT INFO "More blather"
    2012-09-06 22:16:49.397 CDT WARN "Hey there"
    2012-09-06 22:16:49.398 CDT INFO "Spewing data"
    2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"


    Python Code for Map Function
    - Our map function will parse the event type
    - And then output that event (key) and a literal 1 (value)

    #!/usr/bin/env python
    # Boilerplate Python stuff
    import sys

    # Define list of log levels
    levels = ['TRACE', 'DEBUG', 'INFO',
              'WARN', 'ERROR', 'FATAL']

    # Split every line (record) we receive on standard input
    # into fields, normalized by case
    for line in sys.stdin:
        fields = line.split()
        for field in fields:
            field = field.strip().upper()
            # If this field matches a log level, print it (and a 1)
            if field in levels:
                print("%s\t1" % field)


    Output of Map Function
    - The map function produces key/value pairs as output

    INFO 1
    INFO 1
    WARN 1
    INFO 1
    WARN 1
    INFO 1
    ERROR 1


    Input to Reduce Function
    - The Reducer receives a key and all values for that key
      - Keys are always passed to reducers in sorted order
      - Although it's not obvious here, values are unordered

    ERROR 1

    INFO 1

    INFO 1

    INFO 1

    INFO 1

    WARN 1

    WARN 1


    Python Code for Reduce Function
    - The Reducer first extracts the key and value it was passed
    - Then simply adds up the value for each key

    #!/usr/bin/env python
    # Boilerplate Python stuff
    import sys

    # Initialize loop variables
    previous_key = ''
    total = 0

    for line in sys.stdin:
        # Extract the key and value passed via standard input
        key, value = line.split()
        value = int(value)

        if key == previous_key:
            # If key unchanged, increment the count
            total = total + value
        else:
            # If key changed, print sum for previous key
            if previous_key != '':
                print('%s\t%i' % (previous_key, total))
            # Re-init loop variables
            previous_key = key
            total = value

    # Print sum for final key (skipped when there was no input at all)
    if previous_key != '':
        print('%s\t%i' % (previous_key, total))


    Output of Reduce Function
    - The output of this Reduce function is a sum for each level

    ERROR 1

    INFO 4

    WARN 2


    Recap of Data Flow

    Map input:
      2012-09-06 22:16:49.391 CDT INFO "This can wait"
      2012-09-06 22:16:49.392 CDT INFO "Blah blah"
      2012-09-06 22:16:49.394 CDT WARN "Hmmm..."
      2012-09-06 22:16:49.395 CDT INFO "More blather"
      2012-09-06 22:16:49.397 CDT WARN "Hey there"
      2012-09-06 22:16:49.398 CDT INFO "Spewing data"
      2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"

    Map output:
      INFO 1
      INFO 1
      WARN 1
      INFO 1
      WARN 1
      INFO 1
      ERROR 1

    Reduce input:
      ERROR 1
      INFO 1
      INFO 1
      INFO 1
      INFO 1
      WARN 1
      WARN 1

    Reduce output:
      ERROR 1
      INFO 4
      WARN 2


    InputSplits Feed the Map Tasks
    - Input for the entire job is subdivided into InputSplits
      - An InputSplit usually corresponds to a single HDFS block
      - Each of these serves as input to a single Map task

    [Diagram: input for entire job (192 MB) divided into three 64 MB splits, one each for Mapper #1, Mapper #2 and Mapper #3]


    Mappers Feed the Shuffle and Sort
    - Output of all Mappers is partitioned, merged, and sorted (no code required: Hadoop does this automatically)

    [Diagram: Mapper #1, Mapper #2 ... Mapper #N each emit unsorted pairs such as "INFO 1", "WARN 1" and "ERROR 1"; after partitioning, merging and sorting these become WARN 1 (x4), INFO 1 (x8) and ERROR 1 (x3)]


    Shuffle and Sort Feeds the Reducers
    - All values for a given key are then collapsed into a list
    - The key and all its values are fed to reducers as input

    [Diagram: the sorted pairs are collapsed per key and sent to Reducer #1 and Reducer #2]
      WARN 1 1 1 1
      INFO 1 1 1 1 1 1 1 1
      ERROR 1 1 1
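The whole flow (map, shuffle and sort, reduce) can be simulated in memory to see how the pieces fit. This is a single-process sketch using the seven sample log lines from the Job Input slide; the function names are hypothetical, and Hadoop performs the middle step across the cluster without any user code.

```python
from itertools import groupby
from operator import itemgetter

LOG_LINES = [
    '2012-09-06 22:16:49.391 CDT INFO "This can wait"',
    '2012-09-06 22:16:49.392 CDT INFO "Blah blah"',
    '2012-09-06 22:16:49.394 CDT WARN "Hmmm..."',
    '2012-09-06 22:16:49.395 CDT INFO "More blather"',
    '2012-09-06 22:16:49.397 CDT WARN "Hey there"',
    '2012-09-06 22:16:49.398 CDT INFO "Spewing data"',
    '2012-09-06 22:16:49.399 CDT ERROR "Oh boy!"',
]
LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"}


def map_record(line):
    """Emit (level, 1) for each field of the line that is a log level."""
    for field in line.split():
        if field.upper() in LEVELS:
            yield field.upper(), 1


def shuffle_and_sort(pairs):
    """Sort pairs by key, then collapse the values for each key into a list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_values(key, values):
    """Sum all the values that were collected for one key."""
    return key, sum(values)


map_output = [pair for line in LOG_LINES for pair in map_record(line)]
reduce_input = shuffle_and_sort(map_output)
reduce_output = [reduce_values(key, values) for key, values in reduce_input]
print(reduce_output)
```

Only `map_record` and `reduce_values` correspond to code a developer writes for a job; `shuffle_and_sort` stands in for what the framework does between the two.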


    Each Reducer Has an Output File
    - These are stored in HDFS below your output directory
    - Use hadoop fs -getmerge to combine them into a local copy

    [Diagram: Reducer #1 and Reducer #2 write the output records INFO 8, ERROR 3 and WARN 4]


    Apache Hadoop Ecosystem: Overview
    - "Core Hadoop" consists of HDFS and MapReduce
      - These are the kernel of a much broader platform
    - Hadoop has many related projects
      - Some help you integrate Hadoop with other systems
      - Others help you analyze your data
      - Still others, like Oozie, help you use Hadoop more effectively
    - Most are open source Apache projects like Hadoop
      - Also like Hadoop, they have funny names
    - All of these are part of Cloudera's CDH distribution


    Ecosystem: Apache Flume

    [Diagram: data sources including log files, syslog, program output, custom source, and many more]


    Ecosystem: Apache Sqoop
    - Integrates with any JDBC-compatible database
      - Retrieve all tables, a single table, or a portion to store in HDFS
      - Can also export data from HDFS back to the database

    [Diagram: Database and Hadoop Cluster]


    Ecosystem: Apache Hive
    - Hive allows you to do SQL-like queries on data in HDFS
      - It turns this into MapReduce jobs that run on your cluster
      - Reduces development time
      - Makes Hadoop more accessible to non-engineers

    SELECT customers.id, customers.name, sum(orders.cost)
    FROM customers INNER JOIN orders
    ON (customers.id = orders.customer_id)
    WHERE customers.zipcode = '63105'
    GROUP BY customers.id, customers.name;


    Ecosystem: Apache Pig
    - Apache Pig has a similar purpose to Hive
      - It has a high-level language (Pig Latin) for data analysis
      - Scripts yield MapReduce jobs that run on your cluster
    - But Pig's approach is much different than Hive


    Ecosystem: Apache HBase
    - NoSQL database built on HDFS
    - Low-latency and high-performance for reads and writes
    - Extremely scalable
      - Tables can have billions of rows
      - And potentially millions of columns


    You Should Be Using CDH
    - Cloudera's Distribution including Apache Hadoop (CDH)
      - The most widely used distribution of Hadoop
      - A stable, proven and supported environment you can count on
    - Combines Hadoop with many important ecosystem tools
      - Such as Hive, Pig, Sqoop, Flume and many more
      - All of these are integrated and work well together
    - How much does it cost?
      - It's completely free
      - Apache licensed: it's 100% open source too


    When is Hadoop (Not) a Good Choice
    - Hadoop may be a great choice when
      - You need to process non-relational (unstructured) data
      - You are processing large amounts of data
      - You can run your jobs in batch mode
    - Hadoop may not be a great choice when
      - You're processing small amounts of data
      - Your algorithms require communication among nodes
      - You need low latency or transactions
    - As always, use the best tool for the job
      - And know how to integrate it with other systems


    Managing The Elephant In The Room - Roles
    - System Administrators
    - Developers
    - Analysts
    - Data Stewards


    System Administrators
    Required skills:
    - Strong Linux administration skills
    - Networking knowledge
    - Understanding of hardware
    Job responsibilities:
    - Install, configure and upgrade Hadoop software
    - Manage hardware components
    - Monitor the cluster
    - Integrate with other systems (e.g., Flume and Sqoop)


    Developers
    Required skills:
    - Strong Java or scripting capabilities
    - Understanding of MapReduce and algorithms
    Job responsibilities:
    - Write, package and deploy MapReduce programs
    - Optimize MapReduce jobs and Hive/Pig programs


    Data Analyst / Business Analyst
    Required skills:
    - SQL
    - Understanding data analytics/data mining
    Job responsibilities:
    - Extract intelligence from the data
    - Write Hive and/or Pig programs


    Data Steward
    Required skills:
    - Data modeling and ETL
    - Scripting skills
    Job responsibilities:
    - Cataloging the data (analogous to a librarian for books)
    - Manage data lifecycle, retention
    - Data quality control with SLAs


    Combining Roles
    - System Administrator and Steward: analogous to DBA
    Required skills:
    - Data modeling and ETL
    - Scripting skills
    - Strong Linux administration skills
    Job responsibilities:
    - Manage data lifecycle, retention
    - Data quality control with SLAs
    - Install, configure and upgrade Hadoop software
    - Manage hardware components
    - Monitor the cluster
    - Integrate with other systems (e.g., Flume and Sqoop)


    Conclusion
    - Thanks for your time!
    - Questions?