
Apache Flume: Distributed Log Collection for Hadoop Second Edition

Table of Contents

Apache Flume: Distributed Log Collection for Hadoop Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview and Architecture
Flume 0.9
Flume 1.X (Flume-NG)
The problem with HDFS and streaming data/logs
Sources, channels, and sinks
Flume events
Interceptors, channel selectors, and sink processors
Tiered data collection (multiple flows and/or agents)
The Kite SDK
Summary
2. A Quick Start Guide to Flume
Downloading Flume
Flume in Hadoop distributions
An overview of the Flume configuration file
Starting up with "Hello, World!"
Summary
3. Channels
The memory channel
The file channel
Spillable Memory Channel
Summary
4. Sinks and Sink Processors
HDFS sink
Path and filename
File rotation
Compression codecs
Event Serializers
Text output
Text with headers
Apache Avro
User-provided Avro schema
File type
SequenceFile
DataStream
CompressedStream
Timeouts and workers
Sink groups
Load balancing
Failover
MorphlineSolrSink
Morphline configuration files
Typical SolrSink configuration
Sink configuration
ElasticSearchSink
LogStash Serializer
Dynamic Serializer
Summary
5. Sources and Channel Selectors
The problem with using tail
The Exec source
Spooling Directory Source
Syslog sources
The syslog UDP source
The syslog TCP source
The multiport syslog TCP source
JMS source
Channel selectors
Replicating
Multiplexing
Summary
6. Interceptors, ETL, and Routing
Interceptors
Timestamp
Host
Static
Regular expression filtering
Regular expression extractor
Morphline interceptor
Custom interceptors
The plugins directory
Tiering flows
The Avro source/sink
Compressing Avro
SSL Avro flows
The Thrift source/sink
Using command-line Avro
The Log4J appender
The Log4J load-balancing appender
The embedded agent
Configuration and startup
Sending data
Shutdown
Routing
Summary
7. Putting It All Together
Web logs to searchable UI
Setting up the web server
Configuring log rotation to the spool directory
Setting up the target – Elasticsearch
Setting up Flume on collector/relay
Setting up Flume on the client
Creating more search fields with an interceptor
Setting up a better user interface – Kibana
Archiving to HDFS
Summary
8. Monitoring Flume
Monitoring the agent process
Monit
Nagios
Monitoring performance metrics
Ganglia
Internal HTTP server
Custom monitoring hooks
Summary
9. There Is No Spoon – the Realities of Real-time Distributed Data Collection
Transport time versus log time
Time zones are evil
Capacity planning
Considerations for multiple data centers
Compliance and data expiry
Summary
Index

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-217-8

www.packtpub.com

Credits

Author
Steve Hoffman

Reviewers
Sachin Handiekar
Michael Keane
Stefan Will

Commissioning Editor
Dipika Gaonkar

Acquisition Editor
Reshma Raman

Content Development Editor
Neetu Ann Mathew

Technical Editor
Menza Mathew

Copy Editors
Vikrant Phadke
Stuti Srivastava

Project Coordinator
Mary Alex

Proofreader
Simran Bhogal
Safis Editing

Indexer
Rekha Nair

Graphics
Sheetal Aute
Abhinash Sahu

Production Coordinator
Komal Ramchandani

Cover Work
Komal Ramchandani

About the Author

Steve Hoffman has 32 years of experience in software development, ranging from embedded software development to the design and implementation of large-scale, service-oriented, object-oriented systems. For the last 5 years, he has focused on infrastructure as code, including automated Hadoop and HBase implementations and data ingestion using Apache Flume. Steve holds a BS in computer engineering from the University of Illinois at Urbana-Champaign and an MS in computer science from DePaul University. He is currently a senior principal engineer at Orbitz Worldwide (http://orbitz.com/).

More information on Steve can be found at http://bit.ly/bacoboy and on Twitter at @bacoboy.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing.

I'd again like to dedicate this updated book to my loving and supportive wife, Tracy. She puts up with a lot, and that is very much appreciated. I couldn't ask for a better friend daily by my side.

My terrific children, Rachel and Noah, are a constant reminder that hard work does pay off and that great things can come from chaos.

I also want to give a big thanks to my parents, Alan and Karen, for molding me into the somewhat satisfactory human I've become. Their dedication to family and education above all else guides me daily as I attempt to help my own children find their happiness in the world.

About the Reviewers

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.

Sachin has a lot of interest in open source projects. He has contributed code to Apache Camel and developed plugins for Spring Social, which can be found at GitHub (https://github.com/sachin-handiekar).

He also actively writes about enterprise application development on his blog (http://sachinhandiekar.com).

Michael Keane has a BS in computer science from the University of Illinois at Urbana-Champaign. He has worked as a software engineer, coding almost exclusively in Java since JDK 1.1. He has also worked in the mission-critical medical device software, e-commerce, transportation, navigation, and advertising domains. He is currently a development leader for Conversant, where he maintains Flume flows of nearly 100 billion log lines per day.

Michael is a father of three, and besides work, he spends most of his time with his family and coaching youth softball.

Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn, Germany. For over a decade, he has worked for several start-ups in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and real-time analytics platform at Zendesk, a provider of customer service software.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.

What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.

What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."

A block of code is set as follows:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Overview and Architecture

If you are reading this book, chances are you are swimming in oceans of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of being connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, this same data can provide valuable insight into your business and customers if you know how to pan gold out of your river of data.

Furthermore, as you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.

Getting data into and out of Hadoop (in this case, the Hadoop File System, or HDFS) isn't hard; it is just a simple command, such as:

% hadoop fs -put data.csv .

This works great when you have all your data neatly packaged and ready to upload.

However, your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks, "Can you get me the data sooner?" What you really need is a solution that can deal with streaming logs/data.

Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually only had one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stores undeliverable data to the local disk (or sometimes, a local database) and would keep retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June 2011, Cloudera moved control of the Flume project to the Apache Foundation. It came out of the incubator status a year later in 2012. During the incubation year, work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG (Flume the Next Generation).

Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the mainline of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.

The first edition of this book covered all the versions of Flume up to Version 1.3.1. This second edition will cover up to Version 1.5.2 (the current version at the time of writing this).

The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job runtimes.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.

Sources, channels, and sinks

The Flume agent's architecture can be viewed in this simple diagram. Inputs are called sources and outputs are called sinks. Channels provide the glue between sources and sinks. All of these run inside a daemon called an agent.

Note

Keep in mind:

A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.

Flume events

The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or the hostname of the server from which the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input is comprised of tailed log files, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (like when a source adds the hostname where the data is sourced or creating an event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.

Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event. If you are familiar with the AOP Spring Framework, think MethodInterceptor. In Java Servlets, it's similar to ServletFilter. Here's an example of what using four chained interceptors on a source might look like:

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases you might have, although you can write your own if need be. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on some header information. Combined with some interceptor logic, this duo forms the foundation for routing input to different channels, as shown in the sketch that follows.
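To make that concrete, here is a rough sketch of a multiplexing selector configuration (the source, channel, and header names are made up purely for illustration; the properties follow the multiplexing selector just described):

agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=datatype
agent.sources.s1.selector.mapping.square=c1
agent.sources.s1.selector.mapping.triangle=c2
agent.sources.s1.selector.default=c1

With this, events whose datatype header is square would be written to channel c1, triangle events to c2, and anything else to the default channel.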

Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.

Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

In the following diagram, you can see that there are two places where data is created (on the left-hand side) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right-hand side). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent, we use a multiplexing channel selector to split the two kinds of data into different channels. The rectangle channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data transformations can occur after any source. How all of these components can be used to build complicated data workflows will become clear as we proceed.

The Kite SDK

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note

To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, MorphlineSolrSink.

Morphlines are just one component of the Kite SDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note

Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the Kite SDK at http://kitesdk.org.

Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for data processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load) library, starting with Version 1.4 of Flume.

The next chapter will cover these in more detail, specifically, the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.

Chapter 2. A Quick Start Guide to Flume

As we covered some of the basics in the previous chapter, this chapter will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume.

Downloading Flume

Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell.

The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution.

Go ahead, download and verify the binary distribution to save us some time in getting started.

Flume in Hadoop distributions

Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation as well as provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition.

If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework.

Note

Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links:

Cloudera: http://cloudera.com/
Hortonworks: http://hortonworks.com/
MapR: http://mapr.com/

An overview of the Flume configuration file

Now that we've downloaded Flume, let's spend some time going over how to configure an agent.

A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent.

Note

By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps).

If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.
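For example, here is what passing that flag might look like when starting the "Hello, World!" agent we'll build later in this chapter (the agent name and configuration path are just the ones used in that example):

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf --no-reload-conf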

Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type=memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type=memory
agent.channels.access.capacity=200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels=access

Let's look at a complete example using the canonical Hello, World! example.

Starting up with "Hello, World!"

No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:

agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1

Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1.

The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event.

First, explode the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help
Usage: ./bin/flume-ng <command> [options]...

commands:
  help                    display this help text
  agent                   run a Flume agent
  avro-client             run an avro Flume client
  version                 show Flume version info

global options:
  --conf,-c <conf>        use configs in <conf> directory
  --classpath,-C <cp>     append to the classpath
  --dryrun,-d             do not actually start Flume, just print the command
  --plugins-path <dirs>   colon-separated list of plugins.d directories. See the
                          plugins.d section in the user guide for more details.
                          Default: $FLUME_HOME/plugins.d
  -Dproperty=value        sets a Java system property value
  -Xproperty=value        sets a Java -X option

agent options:
  --conf-file,-f <file>   specify a config file (required)
  --name,-n <name>        the name of this agent (required)
  --help,-h               display help text

avro-client options:
  --rpcProps,-P <file>    RPC client properties file with server connection params
  --host,-H <host>        hostname to which events will be sent
  --port,-p <port>        port of the avro source
  --dirname <dir>         directory to stream to avro source
  --filename,-F <file>    text file to stream to avro source (default: std input)
  --headerFile,-R <file>  File containing event headers as key/value pairs on each new line
  --help,-h               display help text

Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first in the classpath.

As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents).

Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell.

Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender.

If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the classpath.

If you left the -c parameter off the command, you'd see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Butyoudidn’tdothatsoyoushouldseethesekeyloglines:

2014-10-0515:39:06,109(conf-file-poller-0)[INFO-

org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfigu

ration.java:140)]Post-validationflumeconfigurationcontains

configurationforagents:[agent]

Thislinetellsyouthatyouragentstartswiththenameagent.

Usuallyyou’dlookforthislineonlytobesureyoustartedtherightconfigurationwhenyouhavemultipleconfigurationsdefinedinyourconfigurationfile.

2014-10-0515:39:06,076(conf-file-poller-0)[INFO-

org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatche

rRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)]

Reloadingconfigurationfile:conf/hw.conf

Thisisanothersanitychecktomakesureyouareloadingthecorrectfile,inthiscaseourhw.conffile.

2014-10-0515:39:06,221(conf-file-poller-0)[INFO-

org.apache.flume.node.Application.startAllComponents(Application.java:138)]

Startingnewconfiguration:{sourceRunners:{s1=EventDrivenSourceRunner:{

source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE}}}

sinkRunners:{k1=SinkRunner:{

policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47counterGroup:{

name:nullcounters:{}}}}channels:

{c1=org.apache.flume.channel.MemoryChannel{name:c1}}}

Oncealltheconfigurationshavebeenparsed,youwillseethismessage,whichshowsyoueverythingthatwasconfigured.Youcansees1,c1,andk1,andwhichJavaclassesareactuallydoingthework.Asyouprobablyguessed,netcatisaconveniencefororg.apache.flume.source.NetcatSource.Wecouldhaveusedtheclassnameifwewanted.Infact,ifIhadmyowncustomsourcewritten,Iwoulduseitsclassnameforthesource’stypeparameter.YoucannotdefineyourownshortnameswithoutpatchingtheFlumedistribution.

2014-10-0515:39:06,427(lifecycleSupervisor-1-0)[INFO-

org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)]Created

serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here,weseethatoursourceisnowlisteningonport12345fortheinput.So,let’ssendsomedatatoit.

Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World
OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64    Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).

If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20    The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.
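As a rough sketch of that alternative (the sink name and directory are placeholders, not part of the Hello, World! example), a file_roll sink might be configured like this:

agent.sinks.k2.type=file_roll
agent.sinks.k2.channel=c1
agent.sinks.k2.sink.directory=/tmp/flume-debug

Each event body would then be written in full to files under /tmp/flume-debug rather than being truncated in the agent's log.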

Summary

In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.

We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.

For each type, we'll discuss all the configuration knobs available to you, when and why you might want to deviate from the defaults, and most importantly, why to use one over the other.

Chapter 3. Channels

In Flume, a channel is the construct used between sources and sinks. It provides a buffer for your in-flight events after they are read from sources until they can be written to sinks in your data processing pipelines.

The primary types we'll cover here are a memory-backed/nondurable channel and a local-filesystem-backed/durable channel. Starting with Flume 1.5, an experimental hybrid memory and file channel called the Spillable Memory Channel is introduced. The durable file channel flushes all changes to disk before acknowledging the receipt of the event to the sender. This is considerably slower than using the nondurable memory channel, but it provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but failure results in data loss and it has much lower storage capacity when compared to the multiterabyte disks backing the file channel. This is why the Spillable Memory Channel was created. In theory, you get the benefits of memory speed until the memory fills up due to flow backpressure. At this point, the disk will be used to store the events—and with that comes much larger capacity. There are trade-offs here as well, as performance is now variable depending on how the entire flow is performing. Ultimately, the channel you choose depends on your specific use cases, failure scenarios, and risk tolerance.

That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel and throw a ChannelException. What your source does or doesn't do with that ChannelException is source-specific, but in some cases, data loss is possible, so you'll want to avoid filling channels by sizing things properly. In fact, you always want your sink to be able to write faster than your source input. Otherwise, you might get into a situation where once your sink falls behind, you can never catch up. If your data volume tracks with the site usage, you can have higher volumes during the day and lower volumes at night, giving your channels time to drain. In practice, you'll want to try and keep the channel depth (the number of events currently in the channel) as low as possible because time spent in the channel translates to a time delay before reaching the final destination.

The memory channel

A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than the disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points isn't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory.

agent.channels.c1.type=memory

This defines a memory channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key                            Required   Type            Default
type                           Yes        String          memory
capacity                       No         int             100
transactionCapacity            No         int             100
byteCapacityBufferPercentage   No         int (percent)   20%
byteCapacity                   No         long (bytes)    80% of JVM heap
keep-alive                     No         int             3 (seconds)

The default capacity of this channel is 100 events. This can be adjusted by setting the capacity property as follows:

agent.channels.c1.capacity=200

Remember that if you increase this value, you will most likely have to increase your Java heap space using the -Xmx, and optionally -Xms, parameters.
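One way to do this (a minimal sketch, assuming you use the conf/flume-env.sh file that ships as a template with the binary distribution; the heap sizes are placeholders) is to export larger heap settings there:

JAVA_OPTS="-Xms512m -Xmx1g"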

Another capacity-related setting you can set is transactionCapacity. This is the maximum number of events that can be written, also called a put, by a source's ChannelProcessor, the component responsible for moving data from the source to the channel, in a single transaction. This is also the number of events that can be read, also called a take, in a single transaction by the SinkProcessor, which is the component responsible for moving data from the channel to the sink. You might want to set this higher in order to decrease the overhead of the transaction wrapper, which might speed things up. The downside to increasing this is that a source would have to roll back more data in the event of a failure.
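For example (the value is purely illustrative), raising the batch size on our c1 channel would look like this:

agent.channels.c1.transactionCapacity=1000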

Tip

Flume only provides transactional guarantees for each channel in each individual agent. In a multiagent, multichannel configuration, duplicates and out-of-order delivery are likely but should not be considered the norm. If you are getting duplicates in nonfailure conditions, it means that you need to continue tuning your Flume configurations.

If you are using a sink that writes someplace that benefits from larger batches of work (such as HDFS), you might want to set this higher. Like many things, the only way to be sure is to run performance tests with different values. This blog post from Flume committer Mike Percy should give you some good starting points: http://bit.ly/flumePerfPt1.

The byteCapacityBufferPercentage and byteCapacity parameters were introduced in https://issues.apache.org/jira/browse/FLUME-1535 as a means to size the memory channel capacity using the number of bytes used rather than the number of events as well as trying to avoid OutOfMemoryErrors. If your events have a large variance in size, you might be tempted to use these settings to adjust the capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values.
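As an illustrative sketch (the byte count is arbitrary), capping the channel at roughly 100 MB of estimated event body data instead of counting events might look like this:

agent.channels.c1.byteCapacity=104857600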

Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full, before giving up. As data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source. You might be tempted to set this value very high, but remember that waiting for a write to a channel will block the data flowing into your source, which might cause data to back up in an upstream agent. Eventually, this might result in events being dropped. You need to size for periodic spikes in traffic as well as temporary planned (and unplanned) maintenance.
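For instance (an arbitrary value for illustration only), letting a blocked source wait up to 10 seconds for space before giving up would be:

agent.channels.c1.keep-alive=10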

The file channel

A file channel is a channel that stores events to the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues and should be used in use cases where a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache.org/FlumeUserGuide.html). Keep in mind that using encryption will reduce the throughput of your file channel.

To use the file channel, set the type parameter on your named channel to file.

agent.channels.c1.type=file

This defines a file channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key                     Required   Type                           Default
type                    Yes        String                         file
checkpointDir           No         String                         ~/.flume/file-channel/checkpoint
useDualCheckpoints      No         boolean                        false
backupCheckpointDir     No         String                         No default, but must be different from checkpointDir
dataDirs                No         String (comma-separated list)  ~/.flume/file-channel/data
capacity                No         int                            1000000
keep-alive              No         int                            3 (seconds)
transactionCapacity     No         int                            10000
checkpointInterval      No         long                           30000 (milliseconds)
maxFileSize             No         long                           2146435071 (bytes)
minimumRequiredSpace    No         long                           524288000 (bytes)

To specify the location where the Flume agent should hold data, set the checkpointDir and dataDirs properties:

agent.channels.c1.checkpointDir=/flume/c1/checkpoint
agent.channels.c1.dataDirs=/flume/c1/data

Technically, these properties are not required and have sensible default values for development. However, if you have more than one file channel configured in your agent, only the first channel will start. For production deployments and development work with multiple file channels, you should use distinct directory paths for each file channel storage area and consider placing different channels on different disks to avoid IO contention. Additionally, if you are sizing a large machine, consider using some form of RAID that contains striping (RAID 10, 50, or 60) to achieve higher disk performance rather than buying more expensive 10K or 15K drives or SSDs. If you don't have RAID striping but have multiple disks, set dataDirs to a comma-separated list of each storage location. Using multiple disks will spread the disk traffic almost as well as striped RAID but without the computational overhead associated with RAID 50/60 as well as the 50 percent space waste associated with RAID 10. You'll want to test your system to see whether the RAID overhead is worth the speed difference. As hard drive failures are a reality, you might prefer certain RAID configurations to single disks in order to protect yourself from the data loss associated with single drive failures. RAID 6 across the maximum number of disks can provide the highest performance for the minimal amount of data protection when combined with redundant and reliable power sources (such as an uninterruptable power supply or UPS).
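For instance (the mount points are made up for illustration), spreading one file channel's data across two physical disks might look like this:

agent.channels.c1.dataDirs=/disk1/flume/c1/data,/disk2/flume/c1/data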

Using the JDBC channel is a bad idea as it would introduce a bottleneck and single point of failure in what should be designed as a highly distributed system. NFS storage should be avoided for the same reason.

Tip

Be sure to set the HADOOP_PREFIX and JAVA_HOME environment variables when using the file channel. While we seemingly haven't used anything Hadoop-specific (such as writing to HDFS), the file channel uses Hadoop Writables as an on-disk serialization format. If Flume can't find the Hadoop libraries, you might see this in your startup, so check your environment variables:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

Starting with Flume 1.4, the file channel supports a secondary checkpoint directory. In situations where a failure occurs while writing the checkpoint data, that information could become unusable and a full replay of the logs is necessary in order to recover the state of the channel. As only one checkpoint is updated at a time before flipping to the other, one should always be in a consistent state, thus shortening restart times. Without valid checkpoint information, the Flume agent can't know what has been sent and what has not been sent in dataDirs. As files in the data directories might contain large amounts of data already sent but not yet deleted, a lack of checkpoint information would result in a large number of records being resent as duplicates.

Incoming data from sources is written and acknowledged at the end of the most current file in the data directory. The files in the checkpoint directory keep track of this data when it's taken by a sink.

If you want to use this feature, set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property. For performance reasons, it is always preferred that this be on a different disk from the other directories used by the file channel:

agent.channels.c1.useDualCheckpoints=true
agent.channels.c1.backupCheckpointDir=/flume/c1/checkpoint2

The default file channel capacity is one million events regardless of the size of the event contents. If the channel capacity is reached, a source will no longer be able to ingest the data. This default should be fine for low volume cases. You'll want to size this higher if your ingestion is so heavy that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher still to handle even larger amounts of downtime. At some point, you'll run into the fact that your disk space is a finite resource, so you will have to pick some upper limit (or buy bigger disks).

The keep-alive parameter is similar to memory channels. It is the maximum time the source will wait when trying to write into a full channel before giving up. If space becomes available before the timeout, the write is successful; otherwise, a ChannelException is thrown back to the source.

The transactionCapacity property is the maximum number of events allowed in a single transaction. This might become important for certain sources that batch together events and pass them to the channel in a single call. Most likely, you won't need to change this from the default. Setting this higher allocates additional resources internally, so you shouldn't increase it unless you run into performance issues.

The checkpointInterval property is the number of milliseconds between performing a checkpoint (which also rolls the log files written to logDirs). If you do not set this, 30 seconds will be used.
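For example (an arbitrary value for illustration), checkpointing every minute instead of every 30 seconds would look like this:

agent.channels.c1.checkpointInterval=60000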

Checkpoint files also roll based on the volume of data written to them using the maxFileSize property. You can lower this value for low traffic channels if you want to try and save some disk space. Let's say your maximum file size is 50,000 bytes but your channel only writes 500 bytes a day; it would take 100 days to fill a single log. Let's say that you were on day 100 and 2,000 bytes came in all at once. Some data would be written to the old file and a new file would be started with the overflow. After the roll, Flume tries to remove any log files that aren't needed anymore. As the full log has unprocessed records, it cannot be removed yet. The next chance to clean up that old log file might not come for another 100 days. It probably doesn't matter if that old 50,000 byte file sticks around longer, but as the default is around 2 GB, you could have twice that (4 GB) disk space used per channel. Depending on how much disk you have available and the number of channels configured in your agent, this might or might not be a problem. If your machines have plenty of storage space, the default should be fine.
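As a sketch (the size is arbitrary), capping each data file at roughly 100 MB for a low-traffic channel might look like this:

agent.channels.c1.maxFileSize=104857600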

Finally, the minimumRequiredSpace property is the amount of space you do not want to use for writing logs. The default configuration will throw an exception if you attempt to use the last 500 MB of the disk associated with the dataDir path. This limit applies across all channels, so if you have three file channels configured, the upper limit is still 500 MB and not 1.5 GB. You can set this value as low as 1 MB, but generally speaking, bad things tend to happen when you push disk utilization towards 100 percent.

Spillable Memory Channel

Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart but runs at the speed of your disks (which means orders of magnitude slower).

Note

The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a variable channel speed that changes depending on how downstream entities in your data pipe behave makes for difficult capacity planning. As a memory channel is used under good conditions, this implies that the data contained in it can be lost. So why would I go through extra trouble to save some of it to the disk? The data is either very important for me to spool it to disk with a file-backed channel, or it's less important and can be lost, so I can get away with less hardware and use a faster memory-backed channel. If I really need memory speed but with the capacity of a hard drive, Solid State Drive (SSD) prices have come down enough in recent years for a file channel on SSD to now be a viable option for you rather than using this hybrid channel type. I do not use this channel myself for these reasons.

To use this channel configuration, set the type parameter on your named channel to spillablememory:

agent.channels.c1.type=spillablememory

This defines a Spillable Memory Channel named c1 for the agent named agent.

Hereisatableofconfigurationparametersyoucanadjustfromthedefaultvalues:

Key | Required | Type | Default
type | Yes | String | spillablememory
memoryCapacity | No | int | 10000
overflowCapacity | No | int | 100000000
overflowTimeout | No | int | 3 (seconds)
overflowDeactivationThreshold | No | int | 5 (percent)
byteCapacityBufferPercentage | No | int | 20 (percent)
byteCapacity | No | long (bytes) | 80% of JVM heap
checkpointDir | No | String | ~/.flume/file-channel/checkpoint
dataDirs | No | String (comma-separated list) | ~/.flume/file-channel/data
useDualCheckpoints | No | boolean | false
backupCheckpointDir | No | String | No default, but must be different than checkpointDir
transactionCapacity | No | int | 10000
checkpointInterval | No | long | 30000 (milliseconds)
maxFileSize | No | long | 2146435071 (bytes)
minimumRequiredSpace | No | long | 524288000 (bytes)

As you can see, many of the fields match against the memory channel and file channel's properties, so there should be no surprises here. Let's start with the memory side.

The memoryCapacity property determines the maximum number of events held in memory (this was just called capacity for the memory channel but was renamed here to avoid ambiguity). Also, the default value, if unspecified, is 10,000 records instead of 100. If you wanted to double the default capacity, the configuration might look something like this:

agent.channels.c1.memoryCapacity=20000

As mentioned previously, you will most likely need to increase the Java heap space allocated using the -Xmx and -Xms parameters. Like most Java programs, more memory usually helps the garbage collector run more efficiently, especially as an application such as Flume generates a lot of short-lived objects. Be sure to do your research and pick an appropriate JVM garbage collector based on your available hardware.

Tip: You can set memoryCapacity to zero, which effectively turns it into a file channel. Don't do this. Just use the file channel and be done with it.

The overflowCapacity property determines the maximum number of events that can be written to the disk before an error is thrown back to the source feeding it. The default value is a generous 100,000,000 events (far larger than the file channel's default of 100,000). Most servers nowadays have multiterabyte disks, so space should not be a problem, but do the math against your average event size to be sure you don't fill your disks by accident. If your channels are filling this much, you are probably dealing with another issue downstream, and the last thing you need is your data buffer layer filling up completely. For example, if you had a large 1 megabyte event payload, 100 million of these add up to 100 terabytes, which is probably bigger than the disk space on an average server. A 1 kilobyte payload would only take 100 gigabytes, which is probably fine. Just do the math ahead of time so you are not surprised.

Tip: You can set overflowCapacity to zero, which effectively turns it into a memory channel. Don't do this. Just use the memory channel and be done with it.

The transactionCapacity property adjusts the batch size of events written to a channel in a single transaction. If this is set too low, it will lower the throughput of the agent on high volume flows because of the transaction overhead. For a high volume channel, you will probably need to set this higher, but the only way to be sure is to test your particular workflow. See Chapter 8, Monitoring Flume, to learn how to accomplish this.

Finally, the byteCapacityBufferPercentage and byteCapacity parameters are identical in functionality and defaults to those of the memory channel, so I won't waste your time repeating them here.

What is important is the overflowTimeout property. This is the number of seconds to wait after the memory part of the channel fills before data starts getting written to the disk-backed portion of the channel. If you want writes to start occurring immediately, you can set this to zero. You might wonder why you need to wait before starting to write to the disk portion of the channel. This is where the undocumented overflowDeactivationThreshold property comes into play. This is the amount of time that space has to be available in the memory path before it can switch back from disk writing. I believe this is an attempt to prevent flapping back and forth between the two. Of course, there really are no ordering guarantees in Flume, so I don't know why you would choose to append to the disk buffer if a spot is available in faster memory. Perhaps they are trying to avoid some kind of starvation condition, although the code appears to attempt to remove events in the order of arrival even when using both memory and disk queues. Perhaps it will be explained to us should it ever come out of experimental status.
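
To make the behavior concrete, here is a hedged example that enlarges the memory portion, caps the disk overflow, and starts spilling after one second; the numbers are only for illustration:

agent.channels.c1.type=spillablememory
agent.channels.c1.memoryCapacity=20000
agent.channels.c1.overflowCapacity=5000000
agent.channels.c1.overflowTimeout=1

With this configuration, the channel holds up to 20,000 events in memory; once that fills for a full second, additional events spill to disk, up to five million of them, before the source starts seeing errors.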

Note: The overflowDeactivationThreshold property is stated to be for internal use only, so adjust it at your own peril. If you are considering it, be sure to get familiar with the source code so that you understand the implications of altering the default.

The rest of the properties on this channel are identical in name and functionality to their file channel counterparts, so please refer to the previous section.

Summary

In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.

The memory channel offers speed at the cost of data loss in the event of failure. Alternatively, the file channel provides a more reliable transport in that it can tolerate agent failures and restarts, at a performance cost.

You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel because if you encounter a problem, you can always rerun the import.

Finally, we covered the experimental Spillable Memory Channel. Personally, I think its creation is a bad idea, but like most things in computer science, everybody has an opinion on what is good or bad. I feel that the added complexity and nondeterministic performance make for difficult capacity planning, as you should always size things for the worst-case scenario. If your data is critical enough for you to overflow to the disk rather than discard the events, then you aren't going to be okay with losing even the small amount held in memory.

In the next chapter, we'll look at sinks, specifically, the HDFS sink to write events to HDFS, the ElasticSearch sink to write events to Elasticsearch, and the MorphlineSolr sink to write events to Solr. We will also cover Event Serializers, which specify how Flume events are translated into output that's more suitable for the sink. Finally, we will cover sink processors and how to set up load balancing and failure paths in a tiered configuration for more robust data transport.

Chapter 4. Sinks and Sink Processors

By now, you should have a pretty good idea where the sink fits into the Flume architecture. In this chapter, we will first learn about the most-used sink with Hadoop, the HDFS sink. We will then cover two of the newer sinks that support common Near Real Time (NRT) log processing: the ElasticSearchSink and the MorphlineSolrSink. As you'd expect, the first writes data into Elasticsearch and the latter to Solr. The general architecture of Flume supports many other sinks we won't have space to cover in this book. Some come bundled with Flume and can write to HBase, IRC, and, as we saw in Chapter 2, A Quick Start Guide to Flume, a log4j and file sink. Other sinks are available on the Internet and can be used to write data to MongoDB, Cassandra, RabbitMQ, Redis, and just about any other data store you can think of. If you can't find a sink that suits your needs, you can write one easily by extending the org.apache.flume.sink.AbstractSink class.

HDFS sink

The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.

To use the HDFS sink, set the type parameter on your named sink to hdfs:

agent.sinks.k1.type=hdfs

This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:

agent.sinks.k1.hdfs.path=/path/in/hdfs

This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

absolute | /Users/flume/mydata
absolute with server | hdfs://namenode/Users/flume/mydata
relative | mydata

I prefer to configure any server I'm installing Flume on with a working hadoop command line by setting the fs.default.name property in Hadoop's core-site.xml file. I don't keep persistent data in HDFS user directories but prefer to use absolute paths with some meaningful path name (for example, /logs/apache/access). The only time I would specify a NameNode specifically is if the target was a different Hadoop cluster entirely. This allows you to move configurations you've already tested in one environment into another without unintended consequences, such as your production server writing data to your staging Hadoop cluster because somebody forgot to edit the target in the configuration. I consider externalizing environment specifics a good best practice to avoid situations such as these.

One final required parameter for the HDFS sink, actually any sink, is the channel that it will be doing take operations from. For this, set the channel parameter with the channel name to read from:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

Here is a mostly complete table of configuration parameters you can adjust from the default values:

Key | Required | Type | Default
type | Yes | String | hdfs
channel | Yes | String |
hdfs.path | Yes | String |
hdfs.filePrefix | No | String | FlumeData
hdfs.fileSuffix | No | String |
hdfs.minBlockReplicas | No | int | See the dfs.replication property in your inherited Hadoop configuration; usually 3
hdfs.maxOpenFiles | No | long | 5000
hdfs.closeTries | No | int | 0 (0 = try forever, otherwise a count)
hdfs.retryInterval | No | int | 180 seconds (0 = don't retry)
hdfs.round | No | boolean | false
hdfs.roundValue | No | int | 1
hdfs.roundUnit | No | String (second, minute, or hour) | second
hdfs.timeZone | No | String | Local time
hdfs.useLocalTimeStamp | No | boolean | false
hdfs.inUsePrefix | No | String | Blank
hdfs.inUseSuffix | No | String | .tmp
hdfs.rollInterval | No | long (seconds) | 30 seconds (0 = disable)
hdfs.rollSize | No | long (bytes) | 1024 bytes (0 = disable)
hdfs.rollCount | No | long | 10 (0 = disable)
hdfs.batchSize | No | long | 100
hdfs.codeC | No | String |

Remember to always check the Flume User Guide for the version you are using at http://flume.apache.org/, as things might change between the release of this book and the version you are actually using.

Path and filename

Each time Flume starts a new file at hdfs.path in HDFS to write data into, the filename is composed of the hdfs.filePrefix, a period character, the epoch timestamp at which the file was started, and optionally, a file suffix specified by the hdfs.fileSuffix property (if set), for example:

agent.sinks.k1.hdfs.path=/logs/apache/access

The preceding command would result in a file such as /logs/apache/access/FlumeData.1362945258.

However, in the following configuration, your filenames would be more like /logs/apache/access/access.1362945258.log:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

Over time, the hdfs.path directory will get very full, so you will want to add some kind of time element into the path to partition the files into subdirectories. Flume supports various time-based escape sequences, such as %Y to specify a four-digit year. I like to use sequences in the year/month/day/hour form (so that they are sorted oldest to newest), so I often use this for a path:

agent.sinks.k1.hdfs.path=/logs/apache/access/%Y/%m/%d/%H

This says I want a path like /logs/apache/access/2013/03/10/18/.

Note: For a complete list of time-based escape sequences, see the Flume User Guide.

Another handy escape sequence mechanism is the ability to use Flume header values in your path. For instance, if there was a header with a key of logType, I could split Apache access and error logs into different directories while using the same channel, by escaping the header's key as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%{logType}/%Y/%m/%d/%H

The preceding line of code would result in access logs going to /logs/apache/access/2013/03/10/18/, and error logs going to /logs/apache/error/2013/03/10/18/. However, if I preferred both log types in the same directory path, I could have used logType in my hdfs.filePrefix instead, as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
agent.sinks.k1.hdfs.filePrefix=%{logType}

Obviously, it is possible for Flume to write to multiple files at once. The hdfs.maxOpenFiles property sets the upper limit for how many can be open at once, with a default of 5000. If you should exceed this limit, the oldest file that's still open is closed. Remember that every open file incurs overhead both at the OS level and in HDFS (NameNode and DataNode connections).
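
If you know your header-based paths will only ever fan out to a handful of directories, you might lower this limit so that a runaway header value can't quietly exhaust resources. This is only a hedged illustration of the property; the right value depends entirely on how many distinct paths your configuration can generate:

agent.sinks.k1.hdfs.maxOpenFiles=500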

Another set of properties you might find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining these elements in file paths. Let's say you had a path specification as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H%M

However, if you wanted only four subdirectories per day (at 00, 15, 30, and 45 past the hour, each containing 15 minutes of data), you could accomplish this by setting the following:

agent.sinks.k1.hdfs.round=true
agent.sinks.k1.hdfs.roundValue=15
agent.sinks.k1.hdfs.roundUnit=minute

This would result in logs between 01:15:00 and 01:29:59 on March 10, 2013 being written to files contained in /logs/apache/2013/03/10/0115/. Logs from 01:30:00 to 01:44:59 would be written in files contained in /logs/apache/2013/03/10/0130/.

The hdfs.timeZone property is used to specify the time zone that you want time interpreted for your escape sequences. The default is your computer's local time. If your local time is affected by daylight savings time adjustments, you will have twice as much data when %H == 02 (in the fall) and no data when %H == 02 (in the spring). I think it is a bad idea to introduce time zones into things that are meant for computers to read. I believe time zones are a concern for humans alone and computers should only converse in universal time. For this reason, I set this property on my Flume agents to make the time zone issue just go away:

-Duser.timezone=UTC

If you don't agree, you are free to use the default (local time) or set hdfs.timeZone to whatever you like. The value you pass is used in a call to java.util.TimeZone.getTimeZone(...), so check the Javadocs for acceptable values to be used here.

The other time-related property is the hdfs.useLocalTimeStamp boolean property. By default, its value is false, which tells the sink to use the event's timestamp header when calculating date-based escape sequences in file paths, as shown previously. If you set the property to true, the current system time will be used instead, effectively telling Flume to use the transport arrival time rather than the original event time. You would not set this in cases where HDFS was the final target for the streamed events. This way, delayed events will still be placed correctly (where users would normally look for them) regardless of their arrival time. However, there may be a use case where events are temporarily written to Hadoop and processed in batches, on some interval (perhaps daily). In this case, the transport time would be preferred, so your postprocessing job doesn't need to scan older folders for delayed data.
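
For that batch-oriented case, where arrival time is what you want, the setting is simply (as a hedged example):

agent.sinks.k1.hdfs.useLocalTimeStamp=true

This also means the date-based escape sequences can be resolved even if an event happens to lack a timestamp header.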

Remember that files in HDFS are broken into file blocks that are replicated across the DataNodes. The default number of replicas is usually three (as set in the Hadoop base configuration). You can override this value up or down for this sink with the hdfs.minBlockReplicas property. For example, if I have a data stream that I feel only needs two replicas instead of three, I can override this as follows:

agent.sinks.k1.hdfs.minBlockReplicas=2

Note: Don't set the minimum replica count higher than the number of data nodes you have, otherwise you'll create a degraded-state HDFS. You also don't want to set it so high that a downed box for maintenance would trigger this situation. Personally, I've never set this higher than the default of three, but I have set it lower on less important data in order to save space.

Finally, while files are being written to HDFS, a .tmp extension is added. When the file is closed, the extension is removed. You can change the extension used by setting the hdfs.inUseSuffix property, but I've never had a reason to do so:

agent.sinks.k1.hdfs.inUseSuffix=flumeiswriting

This allows you to see which files are being written to simply by looking at a directory listing in HDFS. As you typically specify a directory for input in your MapReduce job (or because you are using Hive), the temporary files will often be picked up as empty or garbled input by mistake. To avoid having your temporary files picked up before being closed, set the prefix to either a dot or an underscore character as follows:

agent.sinks.k1.hdfs.inUsePrefix=_

That said, there are occasions where files were not closed properly due to some HDFS glitch, so you might see files with the in-use prefix/suffix that haven't been used in some time. A few new properties were added in Version 1.5 to change the default behavior of closing files. The first is the hdfs.closeTries property. The default of zero actually means "try forever", so it is a little confusing. Setting it to 4 means try 4 times before giving up. You can adjust the interval between retries by setting the hdfs.retryInterval property. Setting it too low could swamp your NameNode with too many requests, so be careful if you lower this from the default of 3 minutes. Of course, if you are opening files too quickly, you might need to lower this just to keep from going over the hdfs.maxOpenFiles setting, which was covered previously. If you actually didn't want any retries, you can set hdfs.retryInterval to zero seconds (again, not to be confused with closeTries=0, which means try forever). Hopefully, in a future version, they will use the more commonly used convention of a negative number (usually, -1) when infinite is desired.
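
A hedged example of bounding the close retries might look like this; the exact numbers depend on how busy your NameNode is:

agent.sinks.k1.hdfs.closeTries=4
agent.sinks.k1.hdfs.retryInterval=60

This attempts to close each stuck file four times, waiting 60 seconds between attempts, instead of retrying forever on the default 3-minute interval.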

File rotation

By default, Flume will rotate actively written-to files every 30 seconds, 10 events, or 1024 bytes. This is done by setting the hdfs.rollInterval, hdfs.rollCount, and hdfs.rollSize properties, respectively. One or more of these can be set to zero to disable that particular rolling mechanism. For instance, if you only wanted a time-based roll of 1 minute, you would set the following:

agent.sinks.k1.hdfs.rollInterval=60
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.rollSize=0

If your output contains any amount of header information, the HDFS size per file can be larger than what you expect, because the hdfs.rollSize rotation scheme only counts the event body length. Clearly, you might not want to disable all three mechanisms for rotation at the same time, or you will have one directory in HDFS overflowing with files.

Finally, a related parameter is hdfs.batchSize. This is the number of events that the sink will read per transaction from the channel. If you have a large volume of data in your channel, you might see a performance increase by setting this higher than the default of 100, which decreases the transaction overhead per event.
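
For instance, a busy channel might benefit from pulling a thousand events per transaction; treat the number as a starting point to test rather than a recommendation:

agent.sinks.k1.hdfs.batchSize=1000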

Now that we've discussed the way files are managed and rolled in HDFS, let's look into how the event contents get written.

Compression codecs

Codecs (Coder/Decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.

If you want the HDFS sink to write compressed files, specify the compression by setting the hdfs.codeC property. The property is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:

agent.sinks.k1.hdfs.codeC=gzip

The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand, using snappy or lzo results in faster compression performance but a lower compression ratio. Keep in mind that the splittability of the file, especially if you are using plain text files, will greatly affect the performance of your MapReduce jobs. Go pick up a copy of Hadoop Beginner's Guide, Garry Turkington, Packt Publishing (http://amzn.to/14Dh6TA) or Hadoop: The Definitive Guide, Tom White, O'Reilly (http://amzn.to/16OsfIf) if you aren't sure what I'm talking about.

Event Serializers

An Event Serializer is the mechanism by which a Flume event is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.

Text output

As mentioned previously, the default serializer is the text serializer. This will output only the Flume event body, with the headers discarded. Each event has a newline character appended unless you override this default behavior by setting the serializer.appendNewLine property to false.

Key | Required | Type | Default
serializer | No | String | text
serializer.appendNewLine | No | boolean | true
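
If, say, your events already end in a newline and you don't want a blank line between records, a hedged example of keeping the text serializer while disabling the extra newline would be:

agent.sinks.k1.serializer=text
agent.sinks.k1.serializer.appendNewLine=false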

Text with headers

The text_with_headers serializer allows you to save the Flume event headers rather than discard them. The output format consists of the headers, followed by a space, then the body payload, and finally, terminated by an optionally disabled newline character, for instance:

{key1=value1, key2=value2} body text here

Key | Required | Type | Default
serializer | No | String | text_with_headers
serializer.appendNewLine | No | boolean | true

Apache Avro

The Apache Avro project (http://avro.apache.org/) provides a serialization format that is similar in functionality to Google Protocol Buffers but is more Hadoop friendly, as the container is based on Hadoop's SequenceFile and has some MapReduce integration. The format is also self-describing using JSON, making for a good long-term data storage format, as your data format might evolve over time. If your data has a lot of structure and you want to avoid turning it into Strings, only to then parse them in your MapReduce job, you should read more about Avro to see whether you want to use it as a storage format in HDFS.

The avro_event serializer creates Avro data based on the Flume event schema. It has no formatting parameters, as Avro dictates the format of the data, and the structure of the Flume event dictates the schema used:

Key | Required | Type | Default
serializer | No | String | avro_event
serializer.compressionCodec | No | String (gzip, bzip2, lzo, or snappy) |
serializer.syncIntervalBytes | No | int (bytes) | 2048000 (bytes)

If you want your data compressed before being written to the Avro container, you should set the serializer.compressionCodec property to the file extension of an installed codec. The serializer.syncIntervalBytes property determines the size of the data buffer used before flushing the data to HDFS, and therefore, this setting can affect your compression ratio when using a codec. Here is an example using snappy compression on Avro data with a 4 MB buffer:

agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=snappy
agent.sinks.k1.serializer.syncIntervalBytes=4194304
agent.sinks.k1.hdfs.fileSuffix=.avro

For Avro files to work in an Avro MapReduce job, they must end in .avro or they will be ignored as input. For this reason, you need to explicitly set the hdfs.fileSuffix property. Furthermore, you would not set the hdfs.codeC property on an Avro file.

User-provided Avro schema

If you want to use a different schema from the Flume event schema used with the avro_event type, starting in Version 1.4, the closely named AvroEventSerializer will let you do this. Keep in mind that with this implementation, only the event's body is serialized and headers are not passed on.

Set the serializer type to the fully qualified org.apache.flume.sink.hdfs.AvroEventSerializer class name:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer

Unlike the other serializers that take additional parameters in the Flume configuration file, this one requires that you pass the schema information via a Flume header. This is a byproduct of one of the Avro-aware sources we'll see in Chapter 6, Interceptors, ETL, and Routing, where schema information is sent from the source to the final destination via the event header. You can fake this if you are using a source that doesn't set these headers by using a static header interceptor. We'll talk more about interceptors in Chapter 6, Interceptors, ETL, and Routing, so flip back to this part later on.

To specify the schema directly in the Flume configuration file, use the flume.avro.schema.literal header as shown in this example (using a map-of-strings schema):

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sinks.k1.interceptors=i1
agent.sinks.k1.interceptors.i1.type=static
agent.sinks.k1.interceptors.i1.key=flume.avro.schema.literal
agent.sinks.k1.interceptors.i1.value="{\"type\":\"map\",\"values\":\"string\"}"

If you prefer to put the schema file in HDFS, use the flume.avro.schema.url header instead, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sinks.k1.interceptors=i1
agent.sinks.k1.interceptors.i1.type=static
agent.sinks.k1.interceptors.i1.key=flume.avro.schema.url
agent.sinks.k1.interceptors.i1.value=hdfs://path/to/schema.avsc

Actually, in this second form, you can pass any URL, including a file:// URL, but this would indicate a file local to where you are running the Flume agent, which might create additional setup work for your administrators. This is also true of configuration served up by an HTTP web server or farm. Rather than creating additional setup dependencies, just use the dependency you cannot remove, which is HDFS, using an hdfs:// URL.

Be sure to set either the flume.avro.schema.literal header or the flume.avro.schema.url header, but not both.

File type

By default, the HDFS sink writes data to HDFS as Hadoop's SequenceFile. This is a common Hadoop wrapper that consists of a key and a value field separated by binary field and record delimiters. Usually, text files on a computer make assumptions such as a newline character terminating each record. So, what do you do if your data contains a newline character, such as some XML? Using a sequence file can solve this problem because it uses nonprintable characters for delimiters. Sequence files are also splittable, which makes for better locality and parallelism when running MapReduce jobs on your data, especially on large files.

SequenceFile

When using a SequenceFile file type, you need to specify how you want the key and value to be written on the record in the SequenceFile. The key on each record will always be a LongWritable type and will contain the current timestamp, or if the timestamp event header is set, it will be used instead. By default, the format of the value is an org.apache.hadoop.io.BytesWritable type, which corresponds to the byte[] Flume body:

Key | Required | Type | Default
hdfs.fileType | No | String | SequenceFile
hdfs.writeFormat | No | String | writable

However, if you want the payload interpreted as a String, you can override the hdfs.writeFormat property, so org.apache.hadoop.io.Text will be used as the value field:

Key | Required | Type | Default
hdfs.fileType | No | String | SequenceFile
hdfs.writeFormat | No | String | text
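
In configuration form, the Text variant of the preceding table would look like this (a hedged sketch using the values shown in the table):

agent.sinks.k1.hdfs.fileType=SequenceFile
agent.sinks.k1.hdfs.writeFormat=text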

DataStream

If you do not want to output a SequenceFile file because your data doesn't have a natural key, you can use a DataStream to output only the uncompressed value. Simply override the hdfs.fileType property:

agent.sinks.k1.hdfs.fileType=DataStream

This is the file type you would use with Avro serialization, as any compression should have been done in the Event Serializer. To serialize gzip-compressed Avro files, you would set these properties:

agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=gzip
agent.sinks.k1.hdfs.fileType=DataStream
agent.sinks.k1.hdfs.fileSuffix=.avro

CompressedStream

A CompressedStream is similar to a DataStream, except that the data is compressed when it's written. You can think of this as running the gzip utility on an uncompressed file, but all in one step. This differs from a compressed Avro file, whose contents are compressed and then written into an uncompressed Avro wrapper:

agent.sinks.k1.hdfs.fileType=CompressedStream

Remember that only certain compressed formats are splittable in MapReduce, should you decide to use CompressedStream. The compression algorithm selection doesn't have a Flume configuration but is dictated by the zlib.compress.strategy and zlib.compress.level properties in core Hadoop instead.

Timeouts and workers

Finally, there are two miscellaneous properties related to timeouts and two for worker pools that you can change:

Key | Required | Type | Default
hdfs.callTimeout | No | long (milliseconds) | 10000
hdfs.idleTimeout | No | int (seconds) | 0 (0 = disable)
hdfs.threadsPoolSize | No | int | 10
hdfs.rollTimerPoolSize | No | int | 1

The hdfs.callTimeout property is the amount of time the HDFS sink will wait for HDFS operations to return a success (or failure) before giving up. If your Hadoop cluster is particularly slow (for instance, a development or virtual cluster), you might need to set this value higher in order to avoid errors. Keep in mind that your channel will overflow if you cannot sustain higher write throughput than the input rate of your channel.
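
For example, on a slow virtual cluster you might triple the default timeout; this is only an illustration of the property, not a tuning recommendation:

agent.sinks.k1.hdfs.callTimeout=30000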

The hdfs.idleTimeout property, if set to a nonzero value, is the time Flume will wait before automatically closing an idle file. I have never used this, as hdfs.rollInterval handles the closing of files for each roll period, and if the channel is idle, it will not open a new file. This setting seems to have been created as an alternative roll mechanism to the size, time, and event count mechanisms that have already been discussed. You might want as much data written to a file as possible and only close it when there really is no more data. In this case, you can use hdfs.idleTimeout to accomplish this rotation scheme if you also set hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount to zero.
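
That idle-only rotation scheme would look something like this sketch, here closing a file after five minutes with no new data:

agent.sinks.k1.hdfs.rollInterval=0
agent.sinks.k1.hdfs.rollSize=0
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.idleTimeout=300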

The first property you can set to adjust the number of workers is hdfs.threadsPoolSize, and it defaults to 10. This is the maximum number of files that can be written to at the same time. If you are using event headers to determine file paths and names, you might have more than 10 files open at once, but be careful when increasing this value too much so as not to overwhelm HDFS.

The last property related to worker pools is hdfs.rollTimerPoolSize. This is the number of workers processing timeouts set by the hdfs.idleTimeout property. The amount of work to close the files is pretty small, so increasing this value from the default of one worker is unlikely to be necessary. If you do not use a rotation based on hdfs.idleTimeout, you can ignore the hdfs.rollTimerPoolSize property, as it is not used.

Sink groups

In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.

There is a default sink processor that contains a single sink, which is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.

In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. Similar to sources, channels, and sinks, you prefix the property with the agent name:

agent.sinkgroups=sg1

Here, we have defined a sink group called sg1 for the agent named agent.

For each named sink group, you need to specify the sinks it contains using the sinks property, consisting of a space-delimited list of sink names:

agent.sinkgroups.sg1.sinks=k1 k2

This defines that the k1 and k2 sinks are part of the sg1 sink group for the agent named agent.

Often, sink groups are used in conjunction with the tiered movement of data to route around failures. However, they can also be used to write to different Hadoop clusters, as even a well-maintained cluster has periodic maintenance.

Load balancing

Continuing the preceding example, let's say you want to load balance traffic to k1 and k2 evenly. There are some additional properties you need to specify, as listed in this table:

Key | Type | Default
processor.type | String | load_balance
processor.selector | String (round_robin, random) | round_robin
processor.backoff | boolean | false

When you set processor.type to load_balance, round robin selection will be used, unless otherwise specified by the processor.selector property. This can be set to either round_robin or random. You can also specify your own load balancing selector mechanism, which we won't cover here. Consult the Flume documentation if you need this custom control.

The processor.backoff property specifies whether an exponential backoff should be used when retrying a sink that threw an exception. The default is false, which means that after a thrown exception, the sink will be tried again the next time its turn is up based on round robin or random selection. If set to true, then the wait time for each failure is doubled, starting at 1 second up to a limit of around 18 hours (2^16 seconds).

Note: In an earlier version of Flume, the default in the code for processor.backoff was stated as false, but the documentation stated it as true. This error has been fixed; however, it may save you a headache to specify what you want for property settings rather than relying on the defaults.
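
Pulling the pieces together, a hedged end-to-end example of a load-balancing group over the two sinks defined earlier might read:

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=load_balance
agent.sinkgroups.sg1.processor.selector=random
agent.sinkgroups.sg1.processor.backoff=true

Here, events are sent to k1 or k2 at random, and a sink that throws an exception is backed off exponentially rather than retried on its very next turn.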

Failover

If you would rather try one sink and, if that one fails, try another, then you want to set processor.type to failover. Next, you'll need to set additional properties to specify the order, by setting the processor.priority property followed by the sink name:

Key | Type | Default
processor.type | String | failover
processor.priority.NAME | int |
processor.maxpenalty | int (milliseconds) | 30000

Let's look at the following example:

agent.sinkgroups.sg1.sinks=k1 k2 k3
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=20
agent.sinkgroups.sg1.processor.priority.k3=20

Lower priority numbers come first, and in the case of a tie, order is arbitrary. You can use any numbering system that makes sense to you (by ones, fives, tens, whatever). In this example, the k1 sink will be tried first, and if an exception is thrown, either k2 or k3 will be tried next. If k3 was selected first for trial and it failed, k2 will still be tried. If all sinks in the sink group fail, the transaction with the channel is rolled back.

Finally, processor.maxpenalty sets an upper limit to the exponential backoff for failed sinks in the group. After the first failure, it will be 1 second before the sink can be used again. Each subsequent failure doubles the wait time until processor.maxpenalty is reached.

MorphlineSolrSink

HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data, so full text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.

Like most streaming data flows, you not only transport the data, but you also often reformat it into a form more consumable to the target of the flow. Typically, this is done in a Flume-only workflow by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data, instead of interceptors.

Internally, each Flume event is converted into a Morphline record and passed to the first command in the Morphline command chain. A record can be thought of as a set of key/value pairs with string keys and arbitrary object values. Each of the Flume headers is passed as a record field with the same header keys. A special record field key, _attachment_body, is used for the Flume event body. Keep in mind that the body is still a byte array (Java byte[]) at this point and must be specifically processed in the Morphline command chain.

Each command processes the record in turn, passing the output to the input of the next command in line, with the final command responsible for terminating the flow. In many ways, it is similar in functionality to Flume's Interceptor functionality, which we'll see in Chapter 6, Interceptors, ETL, and Routing. In the case of writing to Solr, we use the loadSolr command to convert the Morphline record into a SolrDocument and write it to the Solr cluster.

Morphline configuration files

Morphline configuration files use the HOCON format, which is similar to JSON but has a less strict syntax, making them less error-prone when used for configuration files than JSON.

Note: HOCON is an acronym for Human Optimized Configuration Object Notation. You can read more about HOCON on this GitHub page: https://github.com/typesafehub/config/blob/master/HOCON.md

The configuration file contains a single key with the morphlines value. The value is an array of Morphline configurations. Each individual entry is comprised of three keys:

id
importCommands
commands

If your configuration contains multiple Morphlines, the value of id must be provided to the Flume sink by way of the morphlineId property. The value of importCommands specifies the Java classes to import when the Morphline is evaluated. The double star indicates that all paths and classes from that point in the package hierarchy should be included. All classes that implement com.cloudera.cdk.morphline.api.CommandBuilder are interrogated for their names via the getNames() method. These names are the command names you use in the next section. Don't worry; you don't need to sift through the source code to find them, as they have a well-documented reference guide online. Finally, the commands key references a list of command dictionaries. Each command dictionary has a single key consisting of the name of the Morphline command followed by its specific properties.

Note: For a list of Morphline commands and associated configuration properties, see the reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Here is what a skeleton configuration file might look like:

morphlines: [
  {
    id: transform_my_data
    importCommands: [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands: [
      {
        COMMAND_NAME1: {
          property1: value1
          property2: value2
        }
      }
      {
        COMMAND_NAME2: {
          property1: value1
        }
      }
    ]
  }
]

Typical SolrSink configuration

Here is the preceding skeleton configuration applied to our Solr use case. This is not meant to be complete, but it is sufficient to discuss the flow described previously:

morphlines: [
  {
    id: solr_flow
    importCommands: [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands: [
      {
        readLine: {
          charset: UTF-8
        }
      }
      {
        grok: {
          GROK_PROPERTIES_HERE
        }
      }
      {
        loadSolr: {
          solrLocator: {
            collection: my_collection
            zkHost: "solr.example.com:2181/solr"
          }
        }
      }
    ]
  }
]

You can see the same boilerplate configuration where we define a single Morphline with the solr_flow identifier. The command sequence starts with the readLine command. This simply reads the event body from the _attachment_body field and converts byte[] to String using the configured encoding (in this case, UTF-8). The resulting String value is set to the field with the key message. The next command in the sequence, the grok command, uses regular expressions to extract additional fields to make a more interesting SolrDocument. I couldn't possibly do this command justice by trying to explain everything you can do with it. For that, please see the Kite SDK documentation.

Note: See the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Suffice to say, grok lets me take a web server log line such as this:

10.4.240.176 - - [14/Mar/2014:12:02:17 -0500] "POST http://mysite.com/do_stuff.php HTTP/1.1" 500 834

Then, it lets me turn it into more structured data like this:

{
  ip: 10.4.240.176
  timestamp: 1413306137
  method: POST
  url: http://mysite.com/do_stuff.php
  protocol: HTTP/1.1
  status_code: 500
  length: 834
}

If you wanted to search for all the times this page threw a 500 status code, having these fields broken out makes the task easy for Solr.

Finally, we call the loadSolr command to insert the record into our Solr cluster. The solrLocator property indicates the target Solr cluster (by way of its Zookeeper server(s)) and the data collection to write these documents into.

Sink configuration

Now that you have a basic idea of how to create a Morphline configuration file, let's apply this to the actual sink configuration.

The following table details the sink's parameters and default values:

Key | Required | Type | Default
type | Yes | String | org.apache.flume.sink.solr.morphline.MorphlineSolrSink
channel | Yes | String |
morphlineFile | Yes | String |
morphlineId | No | String | Required if the Morphline configuration file contains more than one Morphline
batchSize | No | int | 1000
batchDurationMillis | No | long | 1000 (milliseconds)
handlerClass | No | String | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl

The MorphlineSolrSink does not have a short type alias, so set the type parameter on your named sink to org.apache.flume.sink.solr.morphline.MorphlineSolrSink:

agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink

This defines a MorphlineSolrSink named k1 for the agent named agent.

The next required parameter is the channel property. This specifies which channel to read events from for processing:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

The only other required parameter is the relative or absolute path to the Morphline configuration file. This cannot be a path in HDFS; it must be accessible on the server the Flume agent is running on (local disk, NFS disk, and so on).

To specify the configuration file path, set the morphlineFile property:

agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf

As a Morphline configuration file can contain multiple Morphlines, you must specify the identifier if more than one exists, using the morphlineId property:

agent.sinks.k1.morphlineId=transform_my_data

The next two properties are fairly common among sinks. They specify how many events to remove at a time for processing, also known as a batch. The batchSize property defaults to 1000 events, but you might need to set this higher if you aren't consuming events from the channel faster than they are being inserted. Clearly, you can only increase this so much, as the thing you are writing to, in this case, Solr, will have some record consumption limit. Only through testing will you be able to stress your systems to see where the limits are.

The related batchDurationMillis property specifies the maximum time to wait before the sink proceeds with the processing when fewer than the batchSize number of events have been read. The default value is 1 second and is specified in milliseconds in the configuration properties. In a situation with a light data flow (using the defaults, less than 1,000 records per second), setting batchDurationMillis higher can make things worse. For instance, if you are using a memory channel with this sink, your Flume agent could be sitting there with data to write to the sink's target but waiting for more, only for a crash to happen, resulting in lost data. That said, your downstream entity might perform better on larger batches, which might push both these configuration values higher, so there is no universally correct answer. Start with the defaults if you are unsure, and use hard data that you'll collect using techniques in Chapter 8, Monitoring Flume, to adjust based on facts and not guesses.
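
If testing showed Solr comfortably keeping up, a hedged tuning sketch that doubles both batch knobs might look like this; treat the values as placeholders to be validated with your own monitoring data:

agent.sinks.k1.batchSize=2000
agent.sinks.k1.batchDurationMillis=2000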

Finally, you should never need to touch the handlerClass property unless you plan to write an alternate implementation of the Morphline processing class. As there is only one Morphline engine implementation to date, I'm not really sure why this is a documented property in Flume. I'm just mentioning it for completeness.

ElasticSearchSink

Another common target to stream data to, for searching in NRT, is Elasticsearch. Elasticsearch is also a clustered searching platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).

Note: Here are the project home pages for the ELK stack that can give you a much better overview than I can in a few short pages:
Elasticsearch: http://elasticsearch.org/
Logstash: http://logstash.net/
Kibana: http://www.elasticsearch.org/overview/kibana/

In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database; so, each Flume event will become a single document in Elasticsearch. Documents have one or more fields (just like columns in a database).

This is by no means a complete introduction to Elasticsearch, but it should be enough to get you started, assuming you already have an Elasticsearch cluster at your disposal. As events get mapped to documents by the sink's serializer, the actual sink configuration needs only a few configuration items: where the cluster is located, which index to write to, and what type the record is.

This table summarizes the settings for the ElasticSearchSink:

Key | Required | Type | Default
type | Yes | String | org.apache.flume.sink.elasticsearch.ElasticSearchSink
hostNames | Yes | String | A comma-separated list of Elasticsearch nodes to connect to. If the port is specified, use a colon after the name. The default port is 9300.
clusterName | No | String | elasticsearch
indexName | No | String | flume
indexType | No | String | log
ttl | No | String | Defaults to never expire. Specify the number and unit (5m = 5 minutes).
batchSize | No | int | 100

With this information in mind, let's start by setting the sink's type property:

agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink

Next, we need to set the list of servers and ports to establish connectivity, using the hostNames property. This is a comma-separated list of hostname:port pairs. If you are using the default port of 9300, you can just specify the server name or IP, for example:

agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345

Now that we can communicate with the Elasticsearch servers, we need to tell them which cluster, index, and type to write our documents to. The cluster is specified using the clusterName property. This corresponds with the cluster.name property in Elasticsearch's elasticsearch.yml configuration file. It needs to be specified, as an Elasticsearch node can participate in more than one cluster. Here is how I would specify a nondefault cluster name called production:

agent.sinks.k1.clusterName=production

The indexName property is really a prefix used to create a daily index. This keeps any single index from becoming too large over time. If you use the default index name, the index for September 30, 2014 will be named flume-2014-09-30.

Lastly, the indexType property specifies the Elasticsearch type. If unspecified, the log default value will be used.
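
As a hedged example, if these Apache web logs were to be indexed under their own name and type (weblogs and apache_access are hypothetical values, not defaults):

agent.sinks.k1.indexName=weblogs
agent.sinks.k1.indexType=apache_access

With the daily suffix added by the sink, documents written on September 30, 2014 would then land in an index named weblogs-2014-09-30.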

By default, data written into Elasticsearch will never expire. If you want the data to automatically expire, you can specify a time-to-live value on the records with the ttl property. Values are either a bare number, interpreted as milliseconds, or a number with units. The units are given in this table:

Unit string | Definition | Example
ms | Milliseconds | 5ms = 5 milliseconds
not specified | Milliseconds | 10000 = 10 seconds
m | Minutes | 10m = 10 minutes
h | Hours | 1h = 1 hour
d | Days | 7d = 7 days
w | Weeks | 4w = 4 weeks

Keep in mind that you also need to enable the TTL feature on the Elasticsearch cluster, as it is disabled by default. See the Elasticsearch documentation for how to do this.

Finally, like the HDFS sink, the batchSize property is the number of events per transaction that the sink will read from the channel. If you have a large volume of data in your channel, you should see a performance increase by setting this higher than the default of 100, due to the reduced overhead per transaction.

The sink's serializer does the work of transforming the Flume event into the Elasticsearch document. There are two Elasticsearch serializers that come packaged with Flume, neither of which has additional configuration properties, since they mostly use existing headers to dictate field mappings.

We'll see more of this sink in action in Chapter 7, Putting It All Together.

LogStash Serializer

The default serializer, if not specified, is the ElasticSearchLogStashEventSerializer:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

It writes data in the same format that Logstash uses in conjunction with Kibana. Here is a table of the commonly used fields and their associated mappings from Flume events:

Elasticsearch field | Taken from the Flume header | Notes
@timestamp | timestamp | From the header, if present
@source | source | From the header, if present
@source_host | source_host | From the header, if present
@source_path | source_path | From the header, if present
@type | type | From the header, if present
@host | host | From the header, if present
@fields | all headers | A dictionary of all Flume headers, including the ones that might have been mapped to the other fields, such as the host
@message | Flume body |

While you might think the document's @type field will be automatically set to the sink's indexType configuration property, you'd be incorrect. If you had only one type of log, it would be wasteful to write this over and over again for every document. However, if you had more than one log type going through your Flume channel, you can designate its type in Elasticsearch using the Static interceptor we'll see in Chapter 6, Interceptors, ETL, and Routing, to set the type (or @type) Flume header on the event.

Dynamic Serializer

Another serializer is the ElasticSearchDynamicSerializer serializer. If you use this serializer, the event's body is written to a field called body. All other Flume header keys are used as field names. Clearly, you want to avoid having a Flume header key called body, as this will conflict with the actual event's body when transformed into the Elasticsearch document. To use this serializer, specify the fully qualified class name, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer

For completeness, here is a table that shows you the breakdown of how Flume headers and body get mapped to Elasticsearch fields:

Flume entity | Elasticsearch field
All headers | Same as the Flume headers
Body | body

As the version of Elasticsearch can be different for each user, Flume doesn't package the Elasticsearch client and corresponding Lucene libraries. Find out from your administrator which versions should be included on the Flume classpath, or check out the Maven pom.xml file on GitHub for the corresponding version tag or branch at https://github.com/elasticsearch/elasticsearch/blob/master/pom.xml. Make sure the library versions used by Flume match with Elasticsearch or you might see serialization errors.

Note: As Solr and Elasticsearch have similar capabilities, check out Kelvin Tan's appropriately named side-by-side detailed feature breakdown web page. It should help get you started with what is most appropriate for your specific use case: http://solr-vs-elasticsearch.com/

Summary

In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or the contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.

Compression was discussed as a means to reduce storage requirements in HDFS, and should be used when possible. Besides storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.

Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization (with optional compression).

Next, various file formats, including sequence files (Hadoop key/value files), Data Streams (uncompressed data files, like Avro containers), and Compressed Data Streams, were discussed.

Next, we covered sink groups as a means to route events to different sinks using load balancing or failover paths, which can be used to eliminate single points of failure in routing data to its destination.

Finally, we covered two new sinks added in Flume 1.4 to write data to Apache Solr and Elasticsearch in a Near Real Time (NRT) way. For years, MapReduce jobs have served us well, and will continue to do so, but sometimes it still isn't fast enough to search large datasets quickly and look at things from different angles without reprocessing data. Kite SDK Morphlines were also introduced as a way to prepare data for writing to Solr. We will revisit Morphlines again in Chapter 6, Interceptors, ETL, and Routing, when we look at a Morphline-powered interceptor.

In the next chapter, we will discuss various input mechanisms (sources) that will feed your configured channels, which were covered back in Chapter 3, Channels.

Chapter 5. Sources and Channel Selectors

Now that we have covered channels and sinks, we will cover some of the more common ways to get data into your Flume agents. As discussed in Chapter 1, Overview and Architecture, the source is the input point for the Flume agent. There are many sources available with the Flume distribution, as well as many open source options available. Like most open source software, if you can't find what you need, you can always write your own by extending the org.apache.flume.source.AbstractSource class. Since the primary focus of this book is ingesting files of logs into Hadoop, we'll cover a few of the more appropriate sources to accomplish this.

The problem with using tail

If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).

As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back" as rollback semantics dictate.

Furthermore, if the rate of data written to a file exceeds the rate at which Flume can read the data, it is possible to lose one or more log files of input outright. For example, say you were tailing /var/log/app.log. When that file reaches a certain size, it is rotated, or renamed, to /var/log/app.log.1, and a new file called /var/log/app.log is created. Let's say you had a favorable review in the press and your application logs are much higher than usual. Flume may still be reading from the rotated file (/var/log/app.log.1) when another rotation occurs, moving /var/log/app.log to /var/log/app.log.1. The file Flume is reading is now renamed to /var/log/app.log.2. When Flume finishes with this file, it will move to what it thinks is the next file (/var/log/app.log), thus skipping the file that now resides at /var/log/app.log.1. This kind of data loss would go completely unnoticed and is something we want to avoid if possible.

For these reasons, it was decided to remove the tail functionality from Flume when it was refactored. There are some workarounds for TailSource after its removal, but it should be noted that no workaround can eliminate the possibility of data loss under load that occurs under these conditions.

The Exec source

The Exec source provides a mechanism to run a command outside Flume and then turn the output into Flume events. To use the Exec source, set the type property to exec:

agent.sources.s1.type=exec

All sources in Flume are required to specify the list of channels to write events to using the channels (plural) property. This is a space-separated list of one or more channel names:

agent.sources.s1.channels=c1

The only other required parameter is the command property, which tells Flume what command to pass to the operating system. Here is an example of the use of this property:

agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/app.log

Here, I have configured a single source s1 for an agent named agent. The source, an Exec source, will tail the /var/log/app.log file and follow any rotations that outside applications may perform on that log file. All events are written to the c1 channel. This is an example of one of the workarounds for the lack of TailSource in Flume 1.x. This is not my preferred workaround to use tail, but just a simple example of the exec source type. I will show my preferred method in Chapter 7, Putting It All Together.

Note: Should you use the tail -F command in conjunction with the Exec source, it is probable that the forked process will not shut down 100 percent of the time when the Flume agent shuts down or restarts. This will leave orphaned tail processes that will never exit. The tail -F command, by definition, has no end. Even if you delete the file being tailed (at least in Linux), the running tail process will keep the file handle open indefinitely. This keeps the file's space from actually being reclaimed until the tail process exits, which won't happen. I think you are beginning to see why Flume developers don't like tailing files.

If you go this route, be sure to periodically scan the process tables for tail -F processes whose parent PID is 1. These are effectively dead processes and need to be killed manually.

Here is a list of other properties you can use with the Exec source:

Key | Required | Type | Default
type | Yes | String | exec
channels | Yes | String | Space-separated list of channels
command | Yes | String |
shell | No | String | Shell command
restart | No | boolean | false
restartThrottle | No | long (milliseconds) | 10000 (milliseconds)
logStdErr | No | boolean | false
batchSize | No | int | 20
batchTimeout | No | long | 3000 (milliseconds)

Not every command keeps running, either because it fails (for example, when the channel it is writing to is full) or because it is designed to exit immediately. In this example, we want to record the system load via the Linux uptime command, which prints out some system information to stdout and exits:

agent.sources.s1.command=uptime

This command will immediately exit, so you can use the restart and restartThrottle properties to run it periodically:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

This will produce one event per minute. In the tail example, should the channel fill, causing the Exec source to fail, you can use these properties to restart the Exec source. In this case, setting the restart property will start the tailing of the file from the beginning of the current file, thus producing duplicates. Depending on how long the restartThrottle property is, you may have missed some data due to a file rotation outside Flume. Furthermore, the channel may still be unable to accept data, in which case the source will fail again. Setting this value too low means giving less time to the channel to drain, and unlike some of the sinks we saw, there is no option for exponential backoff.

If you need to use shell-specific features such as wildcard expansion, you can set the shell property, as in this example:

agent.sources.s1.command=grep -i apache lib/*.jar | wc -l
agent.sources.s1.shell=/bin/bash -c
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

This example will find the number of times the case-insensitive apache string is found in all the JAR files in the lib directory. Once per minute, that count will be sent as a Flume event payload.

While the command output written to stdout becomes the Flume event body, errors are sometimes written to stderr. If you want these lines included in the Flume agent's system logs, set the logStdErr property to true. Otherwise, they will be silently ignored, which is the default behavior.

Finally, you can specify the number of events to write per transaction by changing the batchSize property. You may need to set this value higher than the default of 20 if your input data is large and you realize that you cannot write to your channel fast enough. Using a higher batch size reduces the overall average transaction overhead per event. Testing with different values and monitoring the channel's put rate is the only way to know this for sure. The related batchTimeout property sets the maximum time to wait, when fewer than the batch size's number of records have been seen, before flushing a partial batch to the channel. The default setting for this is 3 seconds (specified in milliseconds).
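
A hedged illustration of both batching properties together, for a chatty command whose channel can keep up, might be:

agent.sources.s1.batchSize=100
agent.sources.s1.batchTimeout=1000

This writes up to 100 events per transaction, flushing whatever has accumulated after one second if the batch hasn't filled.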

SpoolingDirectorySourceInanefforttoavoidalltheassumptionsinherentintailingafile,anewsourcewasdevisedtokeeptrackofwhichfileshavebeenconvertedintoFlumeeventsandwhichstillneedtobeprocessed.TheSpoolingDirectorySourceisgivenadirectorytowatchfornewfilesappearing.Itisassumedthatfilescopiedtothisdirectoryarecomplete.Otherwise,thesourcemighttryandsendapartialfile.Italsoassumesthatfilenamesneverchange.Otherwise,onrestarting,thesourcewouldforgetwhichfileshavebeensentandwhichhavenot.Thefilenameconditioncanbemetinlog4jusingDailyRollingFileAppenderratherthanRollingFileAppender.However,thecurrentlyopenfilewouldneedtobewrittentoonedirectoryandcopiedtothespooldirectoryafterbeingclosed.Noneofthelog4jappendersshippinghavethiscapability.

Thatsaid,ifyouareusingtheLinuxlogrotateprograminyourenvironment,thismightbeofinterest.Youcanmovecompletedfilestoaseparatedirectoryusingapostrotatescript.Thefinalflowmightlooksomethinglikethis:

TocreateaSpoolingDirectorySource,setthetypepropertytospooldir.YoumustspecifythedirectorytowatchbysettingthespoolDirproperty:

agent.sources=s1

agent.sources.channels=c1

agent.sources.s1.type=spooldir

agent.sources.s1.spoolDir=/path/to/files

Here is a summary of the properties for the Spooling Directory Source:

Key Required Type Default
type Yes String spooldir
channels Yes String space-separated list of channels
spoolDir Yes String path to directory to spool
fileSuffix No String .COMPLETED
deletePolicy No String (never or immediate) never
fileHeader No boolean false
fileHeaderKey No String file
basenameHeader No boolean false
basenameHeaderKey No String basename
ignorePattern No String ^$
trackerDir No String ${spoolDir}/.flumespool
consumeOrder No String (oldest, youngest, or random) oldest
batchSize No int 10
bufferMaxLines No int 100
maxBufferLineLength No int 5000
maxBackoff No int 4000 (milliseconds)

When a file has been completely transmitted, it will be renamed with a .COMPLETED extension, unless overridden by setting the fileSuffix property, like this:

agent.sources.s1.fileSuffix=.DONE

Starting with Flume 1.4, a new property, deletePolicy, was created to remove completed files from the filesystem rather than just marking them as done. In a production environment, this is critical because your spool disk will fill up over time. Currently, you can only set this for immediate deletion or to leave the files forever. If you want delayed deletion, you'll need to implement your own periodic (cron) job, perhaps using the find command to locate files in the spool directory with the COMPLETED file suffix whose modification time is older than some threshold, for example:

find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;

This will find all files completed more than seven days ago and delete them.
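If you schedule this with cron, the crontab entry might look something like this hypothetical nightly job running at 2 a.m.:

0 2 * * * find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;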

If you want the absolute file path attached to each event, set the fileHeader property to true. This will create a header with the file key, unless you choose something else using the fileHeaderKey property. For example, the following would add the {sourceFile=/path/to/files/foo.1234.log} header if the event was read from the /path/to/files/foo.1234.log file:

agent.sources.s1.fileHeader=true

agent.sources.s1.fileHeaderKey=sourceFile

The related property, basenameHeader, if set to true, will add a header with the basename key, which contains just the filename. The basenameHeaderKey property allows you to change the header key, as shown here:

agent.sources.s1.basenameHeader=true

agent.sources.s1.basenameHeaderKey=justTheName

This configuration would add the {justTheName=foo.1234.log} header if the event was read from the same file located at /path/to/files/foo.1234.log.

If there are certain file patterns that you do not want this source to read as input, you can pass a regular expression using the ignorePattern property. Personally, I wouldn't copy files I don't want transferred to Flume into the spoolDir in the first place, but if that situation cannot be avoided, use ignorePattern to match the filenames that should not be transferred as data. Furthermore, subdirectories and files that start with a "." (period) character are ignored, so you can avoid costly regular expression processing by using this convention instead.

While Flume is sending data, it keeps track of how far it has gotten in each file by keeping a metadata file in the directory specified by the trackerDir property. By default, this file will be .flumespool under spoolDir. Should you want a location other than inside spoolDir, you can specify an absolute file path.

Files in the directory are processed in an "oldest first" manner, as calculated by looking at the modification times of the files. You can change this behavior by setting the consumeOrder property. If you set this property to youngest, the newest files will be processed first. This may be desired if the data is time sensitive. If you'd rather give equal precedence to all files, you can set this property to a value of random.

The batchSize property allows you to tune the number of events per transaction for writes to the channel. Increasing this may provide better throughput at the cost of larger transactions (and possibly larger rollbacks). The bufferMaxLines property is used to set the size of the memory buffer used in reading files, by multiplying it with maxBufferLineLength. If your data is very short, you might consider increasing bufferMaxLines while reducing maxBufferLineLength. In this case, it will result in better throughput without increasing your memory overhead. That said, if you have events longer than 5,000 characters, you'll want to set maxBufferLineLength higher.

If there is a problem writing data to the channel, a ChannelException is thrown back to the source, where it'll retry after an initial wait time of 250 ms. Each failed attempt will double this time up to a maximum of 4 seconds. To set this maximum time higher, set the maxBackoff property. For instance, if I wanted a maximum of 5 minutes, I can set it in milliseconds like this:

agent.sources.s1.maxBackoff=300000

Finally, you'll want to ensure that whatever mechanism is writing new files to your spooling directory creates unique filenames, such as adding a timestamp (and possibly more). Reusing a filename will confuse the source, and your data may not be processed.

As always, remember that restarts and errors will create duplicates due to retransmission of files partially sent but not marked as complete, or because the metadata is incomplete.

SyslogsourcesSysloghasbeenaroundfordecadesandisoftenusedasanoperating-system-levelmechanismtocaptureandmovelogsaroundsystems.Inmanyways,thereareoverlapswithsomeofthefunctionalityFlumeprovides.ThereisevenaHadoopmoduleforrsyslog,oneofthemoremodernvariantsofsyslog(http://www.rsyslog.com/doc/rsyslog_conf_modules.html/omhdfs.html).Generally,Idon’tlikesolutionsthatcoupletechnologiesthatmayversionindependently.Ifyouusethisrsyslog/Hadoopintegration,youwouldberequiredtoupdatetheversionofHadoopyoucompiledintorsyslogatthesametimeyouupgradedyourHadoopclustertoanewmajorversion.Thismaybelogisticallydifficultifyouhavealargenumberofserversand/orenvironments.BackwardcompatibilityinHadoopwireprotocolsissomethingthatisbeingactivelyworkedonintheHadoopcommunity,butcurrently,itisn’tthenorm.We’lltalkmoreaboutthisinChapter8,MonitoringFlume,whenwediscusstieringdataflows.

SysloghasanolderUDPtransportaswellasanewerTCPprotocolthatcanhandledatalargerthanasingleUDPpacketcantransmit(about64KB)anddealwithnetwork-relatedcongestioneventsthatmightrequirethedatatoberetransmitted.

Finally,therearesomeundocumentedpropertiesofsyslogsourcesthatallowustoaddmoreregular-expressionpattern-matchingformessagesthatdonotconformtoRFCstandards.Iwon’tbediscussingtheseadditionalsettings,butyoushouldbeawareofthemifyourunintofrequentparsingerrors.Inthiscase,takealookatthesourcefororg.apache.flume.source.SyslogUtilsforimplementationdetailstofindthecause.

Moredetailsonsyslogterms(suchasafacility)andstandardformatscanbefoundinRFC3164athttp://tools.ietf.org/html/rfc3164.

ThesyslogUDPsourceTheUDPversionofsyslogisusuallysafetousewhenyouarereceivingdatafromtheserver’slocalsyslogprocess,providedthedataissmallenough(lessthanabout64KB).

NoteTheimplementationforthissourcehaschosen2,500bytesasthemaximumpayloadsizeregardlessofwhatyournetworkcanactuallyhandle.Soifyourpayloadwillbelargerthanthis,useoneoftheTCPsourcesinstead.

TocreateaSyslogUDPsource,setthetypepropertytosyslogudp.Youmustsettheporttolistenonusingtheportproperty.Theoptionalhostpropertyspecifiesthebindaddress.Ifnohostisspecified,allIPsfortheserverwillbeused,whichisthesameasspecifying0.0.0.0.Inthisexample,wewillonlylistenforlocalUDPconnectionsonport5140:

agent.sources=s1

agent.sources.s1.channels=c1

agent.sources.s1.type=syslogudp

agent.sources.s1.host=localhost

agent.sources.s1.port=5140

Ifyouwantsyslogtoforwardatailedfile,youcanaddalinelikethistoyoursyslogconfigurationfile:

*.err;*.alert;*.crit;*.emerg;kern.* @localhost:5140

Thiswillsendallerrorpriority,criticalpriority,emergencypriority,andkernelmessagesofanypriorityintoyourFlumesource.Thesingle@symboldesignatesthatUDPprotocolshouldbeused.

HereisasummaryofthepropertiesoftheSyslogUDPsource:

Key Required Type Default

type Yes String syslogudp

channels Yes String space-separated list of channels

port Yes int

host No String 0.0.0.0

keepFields No boolean false

The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.

The Flume headers created by the Syslog UDP source are summarized here:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event, translated into an epoch timestamp. It's omitted if it's not parsed from one of the standard RFC formats.
hostname This is the parsed hostname in the syslog message. It is omitted if it is not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. This value is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the eventSize value (for UDP, this is set internally to 2,500 bytes). It's omitted if everything is fine.

ThesyslogTCPsourceAspreviouslymentioned,theSyslogTCPsourceprovidesanendpointformessagesoverTCP,allowingforalargerpayloadsizeandTCPretrysemanticsthatshouldbeusedforanyreliableinter-servercommunications.

TocreateaSyslogTCPsource,setthetypepropertytosyslogtcp.Youmuststillsetthebindaddressandporttolistenon:

agent.sources=s1

agent.sources.s1.type=syslogtcp

agent.sources.s1.host=0.0.0.0

agent.sources.s1.port=12345

IfyoursyslogimplementationsupportssyslogoverTCP,theconfigurationisusuallythesame,exceptthatadouble@symbolisusedtoindicateTCPtransport.HereisthesameexampleusingTCP,whereIamforwardingthevaluestoaFlumeagentthatisrunningonadifferentservernamedflume-1.

*.err;*.alert;*.crit;*.emerg;kern.* @@flume-1:12345

TherearesomeoptionalpropertiesfortheSyslogTCPsource,aslistedhere:

Key Required Type Default

type Yes String syslogtcp

channels Yes String space-separated list of channels

port Yes int

host No String 0.0.0.0

keepFields No boolean false

eventSize No int(bytes) 2500bytes

ThekeepFieldspropertytellsthesourcetoincludethesyslogfieldsaspartofthebody.Bydefault,thesearesimplyremoved,astheybecomeFlumeheadervalues.

TheFlumeheaderscreatedbytheSyslogTCPsourcearesummarizedhere:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event translated into an epoch timestamp. It's omitted if not parsed from one of the standard RFC formats.
hostname The parsed hostname in the syslog message. It's omitted if not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. It's set to Invalid if the payload didn't conform to the RFCs and set to Incomplete if the message was longer than the configured eventSize. It's omitted if everything is fine.

ThemultiportsyslogTCPsourceTheMultiportSyslogTCPsourceisnearlyidenticalinfunctionalitytotheSyslogTCPsource,exceptthatitcanlistentomultipleportsforinput.Youmayneedtousethiscapabilityifyouareunabletochangewhichportsyslogwilluseinitsforwardingrules(itmaynotbeyourserveratall).Itismorelikelythatyouwillusethistoreadmultipleformatsusingonesourcetowritetodifferentchannels.We’llcoverthatinamomentintheChannelSelectorssection.Underthehood,ahigh-performanceasynchronousTCPlibrarycalledMina(https://mina.apache.org/)isused,whichoftenprovidesbetterthroughputonmulticoreserversevenwhenconsumingonlyasingleTCPport.

Toconfigurethissource,setthetypepropertytomultiport_syslogtcp:

agent.sources.s1.type=multiport_syslogtcp

Like the other syslog sources, you need to specify the port, but in this case it is a space-separated list of ports. You can use this source even if you only have one port to listen on. The property for this is ports (plural):

agent.sources.s1.type=multiport_syslogtcp

agent.sources.s1.channels=c1

agent.sources.s1.ports=33333 44444

agent.sources.s1.host=0.0.0.0

ThiscodeconfigurestheMultiportSyslogTCPsourcenameds1tolistentoanyincomingconnectionsonports33333and44444andsendthemtochannelc1.

Inordertotellwhicheventcamefromwhichport,youcansettheoptionalportHeaderpropertytothenameofthekeywhosevaluewillbetheportnumber.Let’saddthispropertytotheconfiguration:

agent.sources.s1.portHeader=port

Then,anyeventsreceivedfromport33333wouldhaveaheaderkey/valueof{"port"="33333"}.AsyousawinChapter4,SinksandSinkProcessors,youcannowusethisvalue(oranyheader)asapartofyourHDFSSinkfilepathconvention,likethis:

agent.sinks.k1.hdfs.path=/logs/%{hostname}/%{port}/%Y/%m/%d/%H

Hereisacompletetableoftheproperties:

Key Required Type Default

type Yes String multiport_syslogtcp
channels Yes String Space-separated list of channels
ports Yes int Space-separated list of port numbers
host No String 0.0.0.0
keepFields No boolean false
eventSize No int 2500 (bytes)
portHeader No String
batchSize No int 100
readBufferSize No int (bytes) 1024
numProcessors No int automatically detected
charset.default No String UTF-8
charset.port.PORT# No String

ThisTCPsourcehassomeadditionaltunableoptionsoverthestandardTCPsyslogsource.ThefirstisthebatchSizeproperty.Thisisthenumberofeventsprocessedpertransactionwiththechannel.ThereisalsothereadBufferSizeproperty.ItspecifiestheinternalbuffersizeusedbyaninternalMinalibrary.Finally,thenumProcessorspropertyisusedtosizetheworkerthreadpoolinMina.Beforeyoutunetheseparameters,youmaywanttofamiliarizeyourselfwithMina(http://mina.apache.org/),andlookatthesourcecodebeforedeviatingfromthedefaults.

Finally,youcanspecifythedefaultandper-portcharacterencodingtousewhenconvertingbetweenStringsandbytes:

agent.sources.s1.charset.default=UTF-16

agent.sources.s1.charset.port.33333=UTF-8

ThissampleconfigurationshowsthatallportswillbeinterpretedusingUTF-16encoding,exceptforport33333traffic,whichwilluseUTF-8.

Asyou’vealreadyseenintheothersyslogsources,thekeepFieldspropertytellsthesourcetoincludethesyslogfieldsaspartofthebody.Bydefault,thesearesimplyremoved,astheybecomeFlumeheadervalues.

TheFlumeheaderscreatedbythissourcearesummarizedhere:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event translated into an epoch timestamp. Omitted if not parsed from one of the standard RFC formats.
hostname This is the parsed hostname in the syslog message. Omitted if not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, set to Incomplete if the message was longer than the configured eventSize, and omitted if everything is fine.

JMSsourceSometimes,datacanoriginatefromasynchronousmessagequeues.Forthesecases,youcanuseFlume’sJMSsourcetocreateeventsreadfromaJMSQueueorTopic.WhileitistheoreticallypossibletouseanyJavaMessageService(JMS)implementation,FlumehasonlybeentestedwithActiveMQ,sobesuretotestthoroughlyifyouuseadifferentprovider.

LikethepreviouslycoveredElasticSearchsinkinChapter4,SinksandSinkProcessors,FlumedoesnotcomepackagedwiththeJMSimplementationyou’llbeusing,astheversionsneedtomatchup,soyou’llneedtoincludethenecessaryimplementationJARfilesontheFlumeagent’sclasspath.Thepreferredmethodistousethe--plugins-dirparametermentionedinChapter2,AQuickStartGuidetoFlume,whichwe’llcoverinmoredetailinthenextchapter.

NoteActiveMQisjustoneprovideroftheJavaMessageServiceAPI.Formoreinformation,seetheprojecthomepageathttp://activemq.apache.org.

Toconfigurethissource,setthetypepropertytojms:

agent.sources.s1.type=jms

The first three properties are used to establish a connection with the JMS server. They are initialContextFactory, connectionFactory, and providerURL. The initialContextFactory property for ActiveMQ will be org.apache.activemq.jndi.ActiveMQInitialContextFactory. The connectionFactory property will be the registered JNDI name, which defaults to ConnectionFactory if unspecified. Finally, providerURL is the connection String passed to the connection factory to actually establish a network connection. It is usually a URL-like String consisting of the server name and port information. If you aren't familiar with your JMS configuration, ask somebody who is what values to use in your environment.
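As a hedged example, the connection portion of a configuration pointing at a hypothetical ActiveMQ broker on its default port might look like this (the destination properties described next are still required):

agent.sources.s1.type=jms
agent.sources.s1.channels=c1
agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.s1.connectionFactory=ConnectionFactory
agent.sources.s1.providerURL=tcp://activemq.example.com:61616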

Thistablesummarizesthesesettingsandotherswe’lldiscussinamoment:

Key Required Type Default
type Yes String jms
channels Yes String space-separated list of channels
initialContextFactory Yes String
connectionFactory No String ConnectionFactory
providerURL Yes String
userName No String username for authentication
passwordFile No String path to file containing password for authentication
destinationName Yes String
destinationType Yes String
messageSelector No String
errorThreshold No int 10
pollTimeout No long 1000 (milliseconds)
batchSize No int 100

IfyourJMSServerrequiresauthentication,passtheuserNameandpasswordFileproperties:

agent.sources.s1.userName=jms_bot_user

agent.sources.s1.passwordFile=/path/to/password.txt

Putting the password in a file you reference, rather than directly in the Flume configuration, allows you to keep the permissions on the configuration open for inspection, while storing the more sensitive password data in a separate file that is accessible only to the Flume agent (restrictions are commonly provided by the operating system's permissions system).
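For example, you might create and lock down the password file like this (the path and the flume user are assumptions about your environment):

$ echo 'jms_bot_password' > /path/to/password.txt
$ chown flume: /path/to/password.txt
$ chmod 600 /path/to/password.txt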

ThedestinationNamepropertydecideswhichmessagetoreadfromourconnection.SinceJMSsupportsbothqueuesandtopics,youneedtosetthedestinationTypepropertytoqueueortopic,respectively.Here’swhatthepropertiesmightlooklikeforaqueue:

agent.sources.s1.destinationName=my_cool_data

agent.sources.s1.destinationType=queue

Foratopic,itwilllookasfollows:

agent.sources.s1.destinationName=restart_events

agent.sources.s1.destinationType=topic

Thedifferencebetweenaqueueandatopicisbasicallythenumberofentitiesthatwillbereadfromthenameddestination.Ifyouarereadingfromaqueue,themessagewillberemovedfromthequeueonceitiswrittentoFlumeasanevent.Atopic,onceread,isstillavailableforotherentitiestoread.

Shouldyouneedonlyasubsetofmessagespublishedtoatopicorqueue,theoptionalmessageSelectorStringpropertyprovidesthiscapability.Forexample,tocreateFlumeeventsonmessageswithafieldcalledAgethatislargerthan10,IcanspecifyamessageSelectorfilter:

agent.sources.s1.messageSelector="Age > 10"

Note: Describing the selector capabilities and syntax in detail is far beyond the scope of this book. See http://docs.oracle.com/javaee/1.4/api/javax/jms/Message.html or pick up a book on JMS to become more familiar with JMS message selectors.

ShouldtherebeaproblemincommunicatingwiththeJMSserver,theconnectionwillresetitselfafteranumberoffailuresspecifiedbytheerrorThresholdproperty.Thedefaultvalueof10isreasonableandwillmostlikelynotneedtobechanged.

Next,theJMSsourcehasapropertyusedtoadjustthevalueofbatchSize,whichdefaultsto100.Bynow,youshouldbefairlyfamiliarwithadjustingbatchsizeswithothersourcesandsinks.Forhigh-volumeflows,you’llwanttosetthishighertoconsumedatainlargerchunksfromyourJMSservertogethigherthroughput.Settingthistoohighforlowvolumeflowscoulddelayprocessing.Asalways,testingistheonlysurewaytoadjustthisproperly.TherelatedpollTimeoutpropertyspecifieshowlongtowaitfornewmessagestoappearbeforeattemptingtoreadabatch.Thedefault,specifiedinmilliseconds,is1second.Ifnomessagesarereadbeforethepolltimeout,theSourcewillgointoanexponentialbackoffmodebeforeattemptinganotherread.Thiscoulddelaytheprocessingofmessagesuntilitwakesupagain.Chancesarethatyouwon’tneedtochangethisvaluefromthedefault.

WhenthemessagegetsconvertedintoaFlumeevent,themessagepropertiesbecomeFlumeheaders.ThemessagepayloadbecomestheFlumebody,butsinceJMSusesserializedJavaobjects,weneedtotelltheJMSsourcehowtointerpretthepayload.Wedothisbysettingtheconverter.typeproperty,whichdefaultstotheonlyimplementationthatispackagedwithFlumeusingtheDEFAULTString(implementedbytheDefaultJMSMessageConvererclass).ItcandeserializeJMSBytesMessages,TextMessages,andObjectMessages(Javaobjectsthatimplementthejava.io.DataOutputinterface).ItdoesnothandleStreamMessagesorMapMessages,soifyouneedtoprocessthem,you’llneedtoimplementyourownconvertertype.Todothis,you’llimplementtheorg.apache.flume.source.jms.JMSMessageConverterinterface,anduseitsfullyqualifiedclassnameastheconverter.typepropertyvalue.

Forcompleteness,herearethesepropertiesintabularform:

Key Required Type Default

converter.type No String DEFAULT

converter.charset No String UTF-8

ChannelselectorsAswediscussedinChapter1,OverviewandArchitecture,asourcecanwritetooneormorechannels.Thisiswhythepropertyisplural(channelsinsteadofchannel).Therearetwowaysmultiplechannelscanbehandled.Theeventcanbewrittentoallthechannelsortojustonechannel,basedonsomeFlumeheadervalue.TheinternalmechanismforthisinFlumeiscalledachannelselector.

Theselectorforanychannelcanbespecifiedusingtheselector.typeproperty.Allselector-specificpropertiesbeginwiththeusualSourceprefix:theagentname,keywordsources,andsourcename:

agent.sources.s1.selector.type=replicating

ReplicatingIfyoudonotspecifyaselectorforasource,replicatingisthedefault.Thereplicatingselectorwritesthesameeventtoallchannelsinthesource’schannelslist:

agent.sources.s1.channels=c1 c2 c3

agent.sources.s1.selector.type=replicating

Inthisexample,everyeventwillbewrittentoallthreechannels:c1,c2,andc3.

Thereisanoptionalpropertyonthisselector,calledoptional.Itisaspace-separatedlistofchannelsthatareoptional.Considerthismodifiedexample:

agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3

Now,anyfailuretowritetochannelsc2orc3willnotcausethetransactiontofail,andanydatawrittentoc1willbecommitted.Intheearlierexamplewithnooptionalchannels,anysinglechannelfailurewouldrollbackthetransactionforallchannels.

MultiplexingIfyouwanttosenddifferenteventstodifferentchannels,youshoulduseamultiplexingchannelselectorbysettingthevalueofselector.typetomultiplexing.Youalsoneedtotellthechannelselectorwhichheadertousebysettingtheselector.headerproperty:

agent.sources.s1.selector.type=multiplexing

agent.sources.s1.selector.header=port

Let’sassumeweusedtheMultiportSyslogTCPsourcetolistenonfourports—11111,22222,33333,and44444—withaportHeadersettingofport:

agent.sources.s1.selector.default=c2

agent.sources.s1.selector.mapping.11111=c1 c2

agent.sources.s1.selector.mapping.44444=c2

agent.sources.s1.selector.optional.44444=c3

This configuration will result in the traffic from ports 22222 and 33333 going to the c2 channel only (the default). The traffic from port 11111 will go to the c1 and c2 channels; a failure on either channel would result in nothing being added to either. The traffic from port 44444 will go to channels c2 and c3. However, a failure to write to c3 will still commit the transaction to c2, and c3 will not be attempted again with that event.
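Pulling those pieces together, the complete selector configuration for this example would look like this:

agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3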

SummaryInthischapter,wecoveredindepththevarioussourcesthatwecanusetoinsertlogdataintoFlume,includingtheExecsource,theSpoolingDirectorySource,Syslogsources(UDP,TCP,andmultiportTCP),andtheJMSsource.

WediscussedreplicatingtheoldTailSourcefunctionalityinFlume0.9andproblemswithusingtailsemanticsingeneral.

Wealsocoveredchannelselectorsandsendingeventstooneormorechannels,specificallythereplicatingandmultiplexingchannelselectors.

Optionalchannelswerealsodiscussedasawaytoonlyfailaputtransactionforonlysomeofthechannelswhenmorethanonechannelisused.

Inthenextchapter,we’llintroduceinterceptorsthatwillallowin-flightinspectionandtransformationofevents.Usedinconjunctionwithchannelselectors,interceptorsprovidethefinalpiecetocreatecomplexdataflowswithFlume.Additionally,wewillcoverRPCmechanisms(source/sinkpairs)betweenFlumeagentsusingbothAvroandThrift,whichcanbeusedtocreatecomplexdataflows.

Chapter6.Interceptors,ETL,andRoutingThefinalpieceoffunctionalityrequiredinyourdataprocessingpipelineistheabilitytoinspectandtransformeventsinflight.Thiscanbeaccomplishedusinginterceptors.Interceptors,aswediscussedinChapter1,OverviewandArchitecture,canbeinsertedafterasourcecreatesanevent,butbeforewritingtothechanneloccurs.

InterceptorsAninterceptor’sfunctionalitycanbesummedupwiththismethod:

publicEventintercept(Eventevent);

AFlumeeventispassedtoit,anditreturnsaFlumeevent.Itmaydonothing,inwhichcase,thesameunalteredeventisreturned.Often,italterstheeventinsomeusefulway.Ifnullisreturned,theeventisdropped.

Toaddinterceptorstoasource,simplyaddtheinterceptorspropertytothenamedsource,forexample:

agent.sources.s1.interceptors=i1 i2 i3

Thisdefinesthreeinterceptors:i1,i2,andi3onthes1sourcefortheagentnamedagent.

NoteInterceptorsarerunintheorderinwhichtheyarelisted.Intheprecedingexample,i2willreceivetheoutputfromi1.Then,i3willreceivetheoutputfromi2.Finally,thechannelselectorreceivestheoutputfromi3.

Nowthatwehavedefinedtheinterceptorbyname,weneedtospecifyitstypeasfollows:

agent.sources.s1.interceptors.i1.type=TYPE1

agent.sources.s1.interceptors.i1.additionalProperty1=VALUE

agent.sources.s1.interceptors.i2.type=TYPE2

agent.sources.s1.interceptors.i3.type=TYPE3

Let’slookatsomeoftheinterceptorsthatcomebundledwithFlumetogetabetterideaofhowtoconfigurethem.

TimestampTheTimestampinterceptor,asitsnamesuggests,addsaheaderwiththetimestampkeytotheFlumeeventifonedoesn’talreadyexist.Touseit,setthetypepropertytotimestamp.

Iftheeventalreadycontainsatimestampheader,itwillbeoverwrittenwiththecurrenttimeunlessconfiguredtopreservetheoriginalvaluebysettingthepreserveExistingpropertytotrue.

HereisatablesummarizingthepropertiesoftheTimestampinterceptor:

Key Required Type Default

type Yes String timestamp

preserveExisting No boolean false

Hereiswhatatotalconfigurationforasourcemightlooklikeifweonlywantittoaddatimestampheaderifnoneexists:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.i1.preserveExisting=true

RecallthisHDFSSinkpathfromChapter4,SinksandSinkProcessors,utilizingtheeventdate:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H

Thetimestampheaderiswhatdeterminesthispath.Ifitismissing,youcanbesureFlumewillnotknowwheretocreatethefiles,andyouwillnotgettheresultyouarelookingfor.

HostSimilarinsimplicitytotheTimestampinterceptor,theHostinterceptorwilladdaheadertotheeventcontainingtheIPaddressofthecurrentFlumeagent.Touseit,setthetypepropertytohost:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

ThekeyforthisheaderwillbehostunlessyouspecifysomethingelseusingthehostHeaderproperty.Likebefore,anexistingheaderwillbeoverwritten,unlessyousetthepreserveExistingpropertytotrue.Finally,ifyouwantareverseDNSlookupofthehostnametobeusedinsteadoftheIPasavalue,settheuseIPpropertytofalse.Rememberthatreverselookupswilladdprocessingtimetoyourdataflow.

HereisatablesummarizingthepropertiesoftheHostinterceptor:

Key Required Type Default

type Yes String host

hostHeader No String host

preserveExisting No boolean false

useIP No boolean true

HereiswhatatotalconfigurationforasourcemightlooklikeifweonlywantittoaddarelayHostheadercontainingtheDNShostnameofthisagenttoeveryevent:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

agent.sources.s1.interceptors.i1.hostHeader=relayHost

agent.sources.s1.interceptors.i1.useIP=false

Thisinterceptormightbeusefulifyouwantedtorecordthepathyoureventstookthoughyourdataflow,forinstance.Chancesareyouaremoreinterestedintheoriginoftheeventratherthanthepathittook,whichiswhyIhaveyettousethis.

StaticTheStaticinterceptorisusedtoinsertasinglekey/valueheaderintoeachFlumeeventprocessed.Ifmorethanonekey/valueisdesired,yousimplyaddadditionalStaticinterceptors.Unliketheinterceptorswe’velookedatsofar,thedefaultbehavioristopreserveexistingheaderswiththesamekey.Asalways,myrecommendationistoalwaysspecifywhatyouwantandnotrelyonthedefaults.

Idonotknowwhythekeyandvaluepropertiesarenotrequired,asthedefaultsarenotterriblyuseful.

HereisatablesummarizingthepropertiesoftheStaticinterceptor:

Key Required Type Default

type Yes String static

key No String key

value No String value

preserveExisting No boolean true

Finally,let’slookatanexampleconfigurationthatinsertstwonewheaders,providedtheydon’talreadyexistintheevent:

agent.sources.s1.interceptors=pos env

agent.sources.s1.interceptors.pos.type=static

agent.sources.s1.interceptors.pos.key=pointOfSale

agent.sources.s1.interceptors.pos.value=US

agent.sources.s1.interceptors.env.type=static

agent.sources.s1.interceptors.env.key=environment

agent.sources.s1.interceptors.env.value=staging

Regular expression filtering

If you want to filter events based on the content of the body, the regular expression filtering interceptor is your friend. Based on a regular expression you provide, it will either filter out the matching events or keep only the matching events. Start by setting the interceptor type to regex_filter. The pattern you want to match is specified using Java-style regular expression syntax. See the javadocs for usage details at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. The pattern string is set in the regex property. Be sure to escape backslashes in Java Strings. For instance, the \d+ pattern would need to be written with two backslashes: \\d+. If you wanted to match a backslash, the documentation says to type two backslashes, but each needs to be escaped, resulting in four, that is, \\\\. You will see the use of escaped backslashes throughout this chapter. Finally, you need to tell the interceptor if you want to exclude matching records by setting the excludeEvents property to true. The default (false) indicates that you want to only keep events that match the pattern.
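For instance, this hedged example keeps only events whose bodies contain a numeric error code of the Error: N form, using the doubled-backslash escaping described above:

agent.sources.s1.interceptors=errOnly
agent.sources.s1.interceptors.errOnly.type=regex_filter
agent.sources.s1.interceptors.errOnly.regex=Error:\\s\\d+
agent.sources.s1.interceptors.errOnly.excludeEvents=false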

Hereisatablesummarizingthepropertiesoftheregularexpressionfilteringinterceptor:

Key Required Type Default

type Yes String regex_filter

regex No String .*

excludeEvents No boolean false

Inthisexample,anyeventscontainingtheNullPointerExceptionstringwillbedropped:

agent.sources.s1.interceptors=npe

agent.sources.s1.interceptors.npe.type=regex_filter

agent.sources.s1.interceptors.npe.regex=NullPointerException

agent.sources.s1.interceptors.npe.excludeEvents=true

RegularexpressionextractorSometimes,you’llwanttoextractbitsofyoureventbodyintoFlumeheaderssothatyoucanperformroutingviaChannelSelectors.Youcanusetheregularexpressionextractorinterceptortoperformthisfunction.Startbysettingthetypeinterceptortoregex_extractor:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

Liketheregularexpressionfilteringinterceptor,theregularexpressionextractorinterceptortoousesaJava-styleregularexpressionsyntax.Inordertoextractoneormorefields,youstartbyspecifyingtheregexpropertywithgroupmatchingparentheses.Let’sassumewearelookingforerrornumbersinoureventsintheError:Nform,whereNisanumber:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

Asyoucansee,Iputcaptureparenthesesaroundthenumber,whichmaybeoneormoredigit.NowthatI’vematchedmydesiredpattern,IneedtotellFlumewhattodowithmymatch.Here,weneedtointroduceserializers,whichprovideapluggablemechanismforhowtointerpreteachmatch.Inthisexample,I’veonlygotonematch,somyspace-separatedlistofserializernameshasonlyoneentry:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

Thenamepropertyspecifiestheeventkeytouse,wherethevalueisthematchingtextfromtheregularexpression.Thetypeofdefaultvalue(alsothedefaultifnotspecified)isasimplepass-throughserializer.Forthiseventbody,lookatthefollowing:

NullPointerException:Aproblemoccurred.Error:123.TxnID:5X2T9E.

Thefollowingheaderwouldbeaddedtotheevent:

{"error_no":"123"}

IfIwantedtoaddtheTxnIDvalueasaheader,I’dsimplyaddanothermatchingpatterngroupandserializer:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+).*TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e1.serializers=ser1 ser2

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e1.serializers.ser2.type=default

agent.sources.s1.interceptors.e1.serializers.ser2.name=txnid

Then,Iwouldcreatetheseheadersfortheprecedinginput:

{"error_no":"123", "txnid":"5X2T9E"}

However,takealookatwhatwouldhappenifthefieldswerereversedasfollows:

NullPointerException:Aproblemoccurred.TxnID:5X2T9E.Error:123.

Iwouldwindupwithonlyaheaderfortxnid.Abetterwaytohandlethiskindoforderingwouldbetousemultipleinterceptorssothattheorderdoesn’tmatter:

agent.sources.s1.interceptors=e1 e2

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e2.type=regex_extractor

agent.sources.s1.interceptors.e2.regex=TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e2.serializers=ser1

agent.sources.s1.interceptors.e2.serializers.ser1.type=default

agent.sources.s1.interceptors.e2.serializers.ser1.name=txnid

TheonlyothertypeofserializerimplementationthatshipswithFlume,otherthanthepass-through,istospecifythefullyqualifiedclassnameoforg.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer.Thisserializerisusedtoconverttimesintomilliseconds.Youneedtospecifyapatternpropertybasedonorg.joda.time.format.DateTimeFormatpatterns.

Forinstance,let’ssayyouwereingestingApacheWebServeraccesslogs,forexample:

192.168.1.42 - - [29/Mar/2013:15:27:09 -0600] "GET /index.html HTTP/1.1" 200 1037

The complete regular expression for this might look like this (in the form of a Java String, with backslashes and quotes escaped with an extra backslash):

^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)

The time pattern matched corresponds to this org.joda.time.format.DateTimeFormat pattern:

dd/MMM/yyyy:HH:mm:ss Z

Take a look at what would happen if we make our configuration something like this:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
agent.sources.s1.interceptors.e1.serializers=ip dt url sc bc
agent.sources.s1.interceptors.e1.serializers.ip.name=ip_address
agent.sources.s1.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.s1.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
agent.sources.s1.interceptors.e1.serializers.dt.name=timestamp
agent.sources.s1.interceptors.e1.serializers.url.name=http_request
agent.sources.s1.interceptors.e1.serializers.sc.name=status_code
agent.sources.s1.interceptors.e1.serializers.bc.name=bytes_xfered

Thiswouldcreatethefollowingheadersfortheprecedingsample:

{"ip_address":"192.168.1.42", "timestamp":"1364588829", "http_request":"GET /index.html HTTP/1.1", "status_code":"200", "bytes_xfered":"1037"}

Thebodycontentisunaffected.You’llalsonoticethatIdidn’tspecifydefaultfortheothertypeofserializers,asthatisthedefault.

NoteThereisnooverwritecheckinginthisinterceptortype.Forinstance,usingthetimestampkeywilloverwritetheevent’sprevioustimevalueiftherewasone.

Youcanimplementyourownserializersforthisinterceptorbyimplementingtheorg.apache.flume.interceptor.RegexExtractorInterceptorSerializerinterface.However,ifyourgoalistomovedatafromthebodyofaneventtotheheader,you’llprobablywanttoimplementacustominterceptorsothatyoucanalterthebodycontentsinadditiontosettingtheheadervalue,otherwisethedatawillbeeffectivelyduplicated.

Tosummarize,let’sreviewthepropertiesforthisinterceptor:

Key Required Type Default

type Yes String regex_extractor

regex Yes String

serializers Yes Space-separated list of serializer names

serializers.NAME.name Yes String

serializers.NAME.type No Default or FQCN of implementation default

serializers.NAME.PROP No Serializer-specific properties

MorphlineinterceptorAswesawinChapter4,SinksandSinkProcessors,apowerfullibraryoftransformationsbackstheMorphlineSolrSinkfromtheKiteSDKproject.Itshouldcomeasnosurprisethatyoucanalsousetheselibrariesinmanyplaceswhereyou’dbeforcedtowriteacustominterceptor.Similarinconfigurationtoitssinkcounterpart,youonlyneedtospecifytheMorphlineconfigurationfile,andoptionally,theMorphlineuniqueidentifier(iftheconfigurationspecifiesmorethanoneMorphline).HereisasummarytableoftheFlumeinterceptorconfiguration:

Key Required Type Default

type Yes String org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder

morphlineFile Yes String

morphlineId No String Picks the first morphline if not specified; set it when more than one morphline exists in the file.

YourFlumeconfigurationmightlooksomethinglikethis:

agent.sources.s1.interceptors=i1m1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.m1.type=org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder

agent.sources.s1.interceptors.m1.morphlineFile=/path/to/morph.conf

agent.sources.s1.interceptors.m1.morphlineId=goMorphy

Inthisexample,wehavespecifiedthes1sourceontheagentnamedagent,whichcontainstwointerceptors,i1andm1,processedinthatorder.Thefirstinterceptorisastandardinterceptorthatinsertsatimestampheaderifnoneexists.ThesecondwillsendtheeventthroughtheMorphlineprocessorspecifiedbytheMorphlineconfigurationfilecorrespondingtothegoMorphyID.

Eventsprocessedasaninterceptormustonlyoutputoneeventforeveryinputevent.IfyouneedtooutputmultiplerecordsfromasingleFlumeevent,youmustdothisintheMorphlineSolrSink.Withinterceptors,itisstrictlyoneeventinandoneeventout.

The first command will most likely be the readLine command (or readBlob, readCSV, and so on) to convert the event into a Morphline Record. From there, you run any other Morphline commands you like, with the final Record at the end of the Morphline chain being converted back into a Flume event. We know from Chapter 4, Sinks and Sink Processors, that the _attachment_body special Record key should be byte[] (the same as the event's body). This means that the last command needs to do the conversion (such as toByteArray or writeAvroToByteArray). All other Record keys are converted to String values and set as headers with the same keys.
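To make this concrete, here is a rough sketch of a morphline configuration file (the id matches the goMorphy example above; command names are from the Kite reference guide linked below) that parses each event body as a line and converts the final Record back into an event body:

morphlines : [
  {
    id : goMorphy
    importCommands : ["org.kitesdk.**"]
    commands : [
      { readLine { charset : UTF-8 } }
      # other morphline commands operating on the Record would go here
      { toByteArray { field : _attachment_body } }
    ]
  }
]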

NoteTakealookatthereferenceguideforacompletelistofMorphlinecommands,theirpropertiesandusageinformationathttp://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

TheMorphlineconfigurationfilesyntaxhasalreadybeendiscussedindetailintheMorphlineconfigurationfilesectioninChapter4,SinksandSinkProcessors.

CustominterceptorsIfthereisonepieceofcustomcodeyouwilladdtoyourFlumeimplementation,itwillmostlikelybeacustominterceptor.Asmentionedearlier,youimplementtheorg.apache.flume.interceptor.Interceptorinterfaceandtheassociatedorg.apache.flume.interceptor.Interceptor.Builderinterface.

Let's say I needed to URL-decode my event body. The code would look something like this:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class URLDecode implements Interceptor {

  public void initialize() {}

  public Event intercept(Event event) {
    try {
      byte[] decoded = URLDecoder.decode(new String(event.getBody(), "UTF-8"), "UTF-8").getBytes("UTF-8");
      event.setBody(decoded);
    } catch (UnsupportedEncodingException e) {
      // Shouldn't happen. Fall through to the unaltered event.
    }
    return event;
  }

  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public void close() {}

  public static class Builder implements Interceptor.Builder {

    public Interceptor build() {
      return new URLDecode();
    }

    public void configure(Context context) {}
  }
}

Then,toconfiguremynewinterceptor,usethefullyqualifiedclassnamefortheBuilderclassasthetype:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=com.example.URLDecode$Builder

Formoreexamplesofhowtopassandvalidateproperties,lookatanyoftheexistinginterceptorimplementationsintheFlumesourcecode.

Keepinmindthatanyheavyprocessinginyourcustominterceptorcanaffecttheoverallthroughput,sobemindfulofobjectchurnorcomputationallyintensiveprocessinginyourimplementations.

ThepluginsdirectoryCustomcode(sources,interceptors,andsoon)canalwaysbeinstalledalongsidetheFlumecoreclassesinthe$FLUME_HOME/libdirectory.YoucouldalsospecifyadditionalpathsforCLASSPATHonstartupbywayoftheflume-env.shshellscript(whichisoftensourcedatstartuptimewhenusingapackageddistributionofFlume).StartingwithFlume1.4,thereisnowacommand-lineoptiontospecifyadirectorythatcontainsthecustomcode.Bydefault,ifnotspecified,the$FLUME_HOME/plugins.ddirectoryisused,butyoucanoverridethisusingthe--plugins-pathcommand-lineparameter.

Withinthisdirectory,eachpieceofcustomcodeisseparatedintoasubdirectorywhosenameisofyourchoosing—picksomethingeasyforyoutokeeptrackofthings.Inthisdirectory,youcanincludeuptothreesubdirectories:lib,libext,andnative.

ThelibdirectoryshouldcontaintheJARfileforyourcustomcomponent.Anydependenciesitusesshouldbeaddedtothelibextsubdirectory.BothofthesepathsareaddedtotheJavaCLASSPATHvariableatstartup.

TipTheFlumedocumentationimpliesthatthisdirectoryseparationbycomponentallowsforconflictingJavalibrariestocoexist.Intruth,thisisnotpossibleunlesstheunderlyingimplementationmakesuseofdifferentclassloaderswithintheJVM.Inthiscase,theFlumestartupcodesimplyappendsallofthesepathstothestartupCLASSPATHvariable,sotheorderinwhichthesubdirectoriesareprocessedwilldeterminetheprecedence.Youcannotevenbeguaranteedthatthesubdirectorieswillbeprocessedinalexicographicorder,astheunderlyingbashshellforloopcangivenosuchguarantees.Inpractice,youshouldalwaystryandavoidconflictingdependencies.Codethatdependstoomuchonorderingtendstobebuggy,especiallyifyourclasspathreordersitselffromserverinstallationtoserverinstallation.

Thethirddirectory,native,iswhereyouputanynativelibrariesassociatedwithyourcustomcomponent.Thispath,ifitexists,getsaddedtoLD_LIBRARY_PATHatstartupsothattheJVMcanlocatethesenativecomponents.

So,ifIhadthreecustomcomponents,mydirectorystructuremightlooksomethinglikethis:

$FLUME_HOME/plugins.d/base64-enc/lib/base64interceptor.jar

$FLUME_HOME/plugins.d/base64-enc/libext/base64-2.0.0.jar

$FLUME_HOME/plugins.d/base64-enc/native/libFoo.so

$FLUME_HOME/plugins.d/uuencode/lib/uuEncodingInterceptor.jar

$FLUME_HOME/plugins.d/my-avro-serializer/lib/myAvroSerializer.jar

$FLUME_HOME/plugins.d/my-avro-serializer/native/libAvro.so

Keepinmindthatallthesepathsgetmashedtogetheratstartup,soconflictingversionsoflibrariesshouldbeavoided.Thisstructureisanorganizationmechanismforeaseofdeploymentforhumans.Personally,Ihavenotusedityet,asIuseChef(orPuppet)toinstallcustomcomponentsonmyFlumeagentsdirectlyinto$FLUME_HOME/lib,butifyouprefertousethismechanism,itisavailable.

TieringflowsInChapter1,OverviewandArchitecture,wetalkedabouttieringyourdataflows.Thereareseveralreasonsforyoutowanttodothis.YoumaywanttolimitthenumberofFlumeagentsthatdirectlyconnecttoyourHadoopcluster,tolimitthenumberofparallelrequests.YoumayalsolacksufficientdiskspaceonyourapplicationserverstostoreasignificantamountofdatawhileyouareperformingmaintenanceonyourHadoopcluster.Whateveryourreasonorusecase,themostcommonmechanismtochainFlumeagentsistousetheAvrosource/sinkpair.

TheAvrosource/sinkWecoveredAvroabitinChapter4,SinksandSinkProcessors,whenwediscussedhowtouseitasanon-diskserializationformatforfilesstoredinHDFS.Here,we’llputittouseincommunicationbetweenFlumeagents.Atypicalconfigurationmightlooksomethinglikethis:

TousetheAvrosource,youspecifythetypepropertywithavalueofavro.Youneedtoprovideabindaddressandportnumbertolistenon:

collector.sources=av1

collector.sources.av1.type=avro

collector.sources.av1.bind=0.0.0.0

collector.sources.av1.port=42424

collector.sources.av1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs

Here,wehaveconfiguredtheagentinthemiddlethatlistensonport42424,usesamemorychannel,andwritestoHDFS.I’veusedthememorychannelforbrevityinthisexampleconfiguration.AlsonotethatI’vegiventhisagentadifferentname,collector,justtoavoidconfusion.

Theagentsonthetopandbottomsidesfeedingthecollectortiermighthaveaconfigurationsimilartothis.Ihaveleftthesourcesoffthisconfigurationforbrevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42424

Thehostname,collector.example.com,hasnothingtodowiththeagentnameonthismachine;itisthehostname(oryoucanuseanIP)ofthetargetmachinewiththereceivingAvrosource.Thisconfiguration,namedclient,wouldbeappliedtobothagentsonthetopandbottomsides,assumingbothhadsimilarsourceconfigurations.

AsIdon’tlikesinglepointsoffailure,Iwouldconfiguretwocollectoragentswiththeprecedingconfigurationandinstead,seteachclientagenttoroundrobinbetweenthetwo,usingasinkgroup.Again,I’veleftoffthesourcesforbrevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1 k2

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collectorA.example.com

client.sinks.k1.port=42424

client.sinks.k2.type=avro

client.sinks.k2.channel=ch1

client.sinks.k2.hostname=collectorB.example.com

client.sinks.k2.port=42424

client.sinkgroups=g1

client.sinkgroups.g1.sinks=k1 k2

client.sinkgroups.g1.processor.type=load_balance

client.sinkgroups.g1.processor.selector=round_robin

client.sinkgroups.g1.processor.backoff=true

TherearefouradditionalpropertiesassociatedwiththeAvrosinkthatyoumayneedtoadjustfromtheirsensibledefaults.

Thefirstisthebatch-sizeproperty,whichdefaultsto100.Inheavyloads,youmayseebetterthroughputbysettingthishigher,forexample:

client.sinks.k1.batch-size=1024

The next two properties control network connection timeouts. The connect-timeout property, which defaults to 20 seconds (specified in milliseconds), is the amount of time allowed to establish a connection with an Avro source (receiver). The related request-timeout property, which also defaults to 20 seconds (specified in milliseconds), is the amount of time allowed for a sent message to be acknowledged. If I wanted to increase these values to 1 minute, I could add these additional properties:

client.sinks.k1.connect-timeout=60000

client.sinks.k1.request-timeout=60000

Finally,thereset-connection-intervalpropertycanbesettoforceconnectionstoreestablishthemselvesaftersometimeperiod.ThiscanbeusefulwhenyoursinkisconnectingthroughaVIP(VirtualIPAddress)orhardwareloadbalancertokeepthingsbalanced,asservicesofferedbehindtheVIPmaychangeovertimeduetofailuresorchangesincapacity.Bydefault,theconnectionswillnotberesetexceptincasesoffailure.Ifyouwantedtochangethissothattheconnectionsresetthemselveseveryhour,forexample,youcanspecifythisbysettingthispropertywiththenumberofseconds,asfollows:

client.sinks.k1.reset-connection-interval=3600

CompressingAvroCommunicationbetweentheAvrosourceandsinkcanbecompressedbysettingthecompression-typepropertytodeflate.OntheAvrosink,youcanadditionally,setthecompression-levelpropertytoanumberbetween1and9,withthedefaultbeing6.Typically,therearediminishingreturnsathighercompressionlevels,buttestingmayproveanondefaultvaluethatworksbetterforyou.Clearly,youneedtoweightheadditionalCPUcostsagainsttheoverheadtoperformahigherlevelofcompression.Typically,thedefaultisfine.

Compressedcommunicationsareespeciallyimportantwhenyouaredealingwithhighlatencynetworks,suchassendingdatabetweentwodatacenters.Anothercommonusecaseforcompressioniswhereyouarechargedforthebandwidthconsumed,suchasmostpubliccloudservices.Inthesecases,youwillprobablychoosetospendCPUcycles,compressinganddecompressingyourflowratherthansendinghighlycompressibledatauncompressed.

TipItisveryimportantthatifyousetthecompression-typepropertyinasource/sinkpair,yousetthispropertyatbothends.Otherwise,asinkcouldbesendingdatathesourcecan’tconsume.

Continuingtheprecedingexample,toaddcompression,youwouldaddtheseadditionalpropertyfieldsonbothagents:

collector.sources.av1.compression-type=deflate

client.sinks.k1.compression-type=deflate

client.sinks.k1.compression-level=7

client.sinks.k2.compression-type=deflate

client.sinks.k2.compression-level=7

SSL Avro flows

Sometimes, communication is of a sensitive nature or may traverse untrusted network paths. You can encrypt your communications between an Avro sink and an Avro source by setting the ssl property to true. Additionally, you will need to pass SSL certificate information to the source (the receiver) and optionally, the sink (the sender).

Iwon’tclaimtobeanexpertinSSLandSSLcertificates,butthegeneralideaisthatacertificatecaneitherbeself-signedorsignedbyatrustedthirdpartysuchasVeriSign(calledacertificateauthorityorCA).Generally,becauseofthecostassociatedwithgettingaverifiedcertificate,peopleonlydothisforwebbrowsercertificatesacustomermightseeinawebbrowser.SomeorganizationswillhaveaninternalCAwhocansigncertificates.Forthepurposeofthisexample,we’llbeusingself-signedcertificates.Thisbasicallymeansthatwe’llgenerateacertificatebutnothaveitsignedbyanyauthority.Forthis,we’llusethekeytoolutilitythatcomeswithallJavainstallations(thereferencecanbefoundathttp://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html).Here,IcreateaJavaKeyStore(JKS)filethatcontainsa2048bitkey.Whenpromptedforapassword,I’llusethepasswordstring.Makesureyouusearealpasswordinyournontestenvironments:

% keytool -genkey -alias flumey -keyalg RSA -keystore keystore.jks -keysize 2048
Enter keystore password: password
Re-enter new password: password

What is your first and last name?
  [Unknown]: Steve Hoffman
What is the name of your organizational unit?
  [Unknown]: Operations
What is the name of your organization?
  [Unknown]: Me, Myself and I
What is the name of your City or Locality?
  [Unknown]: Chicago
What is the name of your State or Province?
  [Unknown]: Illinois
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Steve Hoffman, OU=Operations, O="Me, Myself and I", L=Chicago, ST=Illinois, C=US correct?
  [no]: yes
Enter key password for <flumey>
  (RETURN if same as keystore password):

I can verify the contents of the JKS by running the list subcommand:

% keytool -list -keystore keystore.jks

Enter keystore password: password
Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

flumey, Nov 11, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1): 5C:BC:3C:7F:7A:E7:77:EB:B5:54:FA:E2:8B:DD:D3:66:36:86:DE:E4

NowthatIhaveakeyinmykeystorefile,Icansettheadditionalpropertiesonthereceivingsource:

collector.sources.av1.ssl=true

collector.sources.av1.keystore=/path/to/keystore.jks

collector.sources.av1.keystore-password=password

AsIamusingaself-signedcertificate,thesinkwon’tbesurethatcommunicationscanbetrusted,soIhavetwooptions.Thefirstistotellittojusttrustallcertificates.Clearly,youwouldonlydothisonaprivatenetworkthathassomereasonableassurancesaboutitssecurity.Toignorethedubiousoriginofmycertificate,Icansetthetrust-all-certspropertytotrueasfollows:

client.sinks.k1.ssl=true

client.sinks.k1.trust-all-certs=true

If you didn't set this, you'd see something like this in the logs:

org.apache.flume.EventDeliveryException: Failed to send events…
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem…
Caused by: sun.security.validator.ValidatorException: No trusted certificate found…

ForcommunicationsoverthepublicInternetorinmultitenantcloudenvironments,morecareshouldbetaken.Inthiscase,youcanprovideatruststoretothesink.Atruststoreissimilartoakeystore,exceptthatitcontainscertificateauthoritiesyouaretellingFlumecanbetrusted.Forwell-knowncertificateauthorities,suchasVeriSignandothers,youdon’tneedtospecifyatruststore,astheiridentitiesarealreadyincludedinyourJavadistribution(atleastforthemajorones).Youwillneedtohaveyourcertificatesignedbyacertificateauthorityandthenaddthesignedcertificatefiletothesource’skeystorefile.YouwillalsoneedtoincludethesigningCA’scertificateinthekeystoresothatthecertificatechaincanbefullyresolved.

Onceyouhaveaddedthekey,signedcertificate,andsigningCA’scertificatetothesource,youneedtoconfigurethesink(thesender)totrusttheseauthoritiessothatitwillpassthevalidationstepduringtheSSLhandshake.Tospecifythetruststore,setthetruststoreandtruststore-passwordpropertiesasfollows:

client.sinks.k1.ssl=true

client.sinks.k1.truststore=/path/to/truststore.jks

client.sinks.k1.truststore-password=password
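If trusting all certificates is too loose for your environment but you still want to use the self-signed certificate created earlier, one approach is to export that certificate and import it into a truststore file that you then reference with the properties above. A sketch using keytool (the file names are arbitrary) might look like this:

% keytool -export -alias flumey -keystore keystore.jks -file flumey.crt
% keytool -import -alias flumey -file flumey.crt -keystore truststore.jks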

Chancesarethereissomebodyinyourorganizationresponsibleforobtainingthethird-partycertificatesorsomeonewhocanissueanorganizationalCA-signed-certificatetoyou,soI’mnotgoingtogointodetailsabouthowtocreateyourowncertificateauthority.ThereisplentyofinformationontheInternetifyouchoosetotakethisroute.Rememberthatcertificateshaveanexpirationdate(usually,ayear),soyou’llneedtorepeatthisprocesseveryyearorcommunicationswillabruptlystopwhencertificatesorcertificateauthoritiesexpire.

TheThriftsource/sinkAnothersource/sinkpairyoucanusetotieryourdataflowsisbasedonThrift(http://thrift.apache.org/).UnlikeAvrodata,whichisself-documenting,Thriftusesanexternalschema,whichcanbecompiledintojustabouteveryprogramminglanguageontheplanet.TheconfigurationisalmostidenticaltotheAvroexamplealreadycovered.TheprecedingcollectorconfigurationthatusesThriftwouldnowlooksomethinglikethis:

collector.sources=th1

collector.sources.th1.type=thrift

collector.sources.th1.bind=0.0.0.0

collector.sources.th1.port=42324

collector.sources.th1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs

Thereisoneadditionalpropertythatsetsthemaximumworkerthreadsintheunderlyingthreadpool.Ifunset,avalueofzeroisassumed,whichmakesthethreadpoolunbounded,soyoushouldprobablysetthisinyourproductionconfigurations.Testingshouldprovideyouwithareasonableupperlimitforyourenvironment.Tosetthethreadpoolsizeto10,forexample,youwouldaddthethreadsproperty:

collector.sources.th1.threads=10

The“client”agentswouldusethecorrespondingThriftsinkasfollows:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=thrift

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42324

LikeitsAvrocounterpart,thereareadditionalsettingsforthebatchsize,connectiontimeouts,andconnectionresetintervalsthatyoumaywanttoadjustbasedonyourtestingresults.RefertotheTheAvrosource/sinksectionearlierinthischapterfordetails.

Usingcommand-lineAvroTheAvrosourcecanalsobeusedinconjunctionwithoneofthecommand-lineoptionsyoumayhavenoticedbackinChapter2,AQuickStartGuidetoFlume.Ratherthanrunningflume-ngwiththeagentparameter,youcanpasstheavro-clientparametertosendoneormorefilestoanAvrosource.Thesearetheoptionsspecifictoavro-clientfromthehelptext:

avro-client options:
  --dirname <dir>          directory to stream to avro source
  --host,-H <host>         hostname to which events will be sent (required)
  --port,-p <port>         port of the avro source (required)
  --filename,-F <file>     text file to stream to avro source [default: std input]
  --headerFile,-R <file>   headerFile containing headers as key/value pairs on each new line
  --help,-h                display help text

Thisvariationisveryusefulfortesting,resendingdatamanuallyduetoerrors,orimportingolderdatastoredelsewhere.

JustlikeanAvrosink,youhavetospecifythehostnameandportyouwillbesendingdatato.Youcansendasinglefilewiththe--filenameoptionorallthefilesinadirectorywiththe--dirnameoption.Ifyouspecifyneitherofthese,stdinwillbeused.Hereishowyoumightsendafilenamedfoo.logtotheFlumeagentwepreviouslyconfigured:

$ ./flume-ng avro-client --filename foo.log --host collector.example.com --port 42424

EachlineoftheinputwillbeconvertedintoasingleFlumeevent.

Optionally,youcanspecifyafilecontainingkey/valuepairstosetFlumeheadervalues.ThefileusesJavapropertyfilesyntax.SupposeIhadafilenamedheaders.propertiescontaining:

pointOfSale=US

environment=staging

Then,includingthe--headerFileoptionwouldsetthesetwoheadersoneveryeventcreated:

$ ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424

TheLog4JappenderAswediscussedinChapter5,SourcesandChannelSelectors,thereareissuesthatmayarisefromusingafilesystemfileasasource.OnewaytoavoidthisproblemistousetheFlumeLog4JAppenderinyourJavaapplication(s).Underthehood,itusesthesameAvrocommunicationthattheAvrosinkuses,soyouneedonlyconfigureittosenddatatoanAvrosource.

TheAppenderhastwoproperties,whichareshownhereinXML:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.Log4jAppender">
  <param name="Hostname" value="collector.example.com"/>
  <param name="Port" value="42424"/>
</appender>

TheformatofthebodywillbedictatedbytheAppender’sconfiguredlayout(notshown).Thelog4jfieldsthatgetmappedtoFlumeheadersaresummarizedinthistable:

Flume header key Log4J logging event field
flume.client.log4j.logger.name event.getLoggerName()
flume.client.log4j.log.level event.getLevel() as a number. See org.apache.log4j.Level for mappings.
flume.client.log4j.timestamp event.getTimeStamp()
flume.client.log4j.message.encoding N/A (always UTF8)
flume.client.log4j.logger.other Will only be seen if there was a problem mapping one of the above fields, so normally, this won't be present.

Refertohttp://logging.apache.org/log4j/1.2/formoredetailsonusingLog4J.

You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender.
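If your application configures Log4J with a properties file instead of XML, the equivalent setup might look something like this sketch (the appender name and logging level are up to you):

log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=collector.example.com
log4j.appender.flume.Port=42424
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m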

KeepinmindthatifthereisaproblemsendingdatatotheAvrosource,theappenderwillthrowanexceptionandthelogmessagewillbedropped,asthereisnoplacetoputit.KeepingitinmemorycouldquicklyoverloadyourJVMheap,whichisusuallyconsideredworsethandroppingthedatarecord.

TheLog4Jload-balancingappenderI’msureyounoticedthattheprecedingLog4jAppenderonlyhasasinglehostname/portinitsconfiguration.Ifyouwantedtospreadtheloadacrossmultiplecollectoragents,eitherforadditionalcapacityorforfaulttolerance,youcanusetheLoadBalancingLog4jAppender.ThisappenderhasasinglerequiredpropertynamedHosts,whichisaspace-separatedlistofhostnamesandportnumbersseparatedbyacolon,asfollows:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
  <param name="Hosts" value="server1:42424 server2:42424"/>
</appender>

Thereisanoptionalproperty,Selector,whichspecifiesthemethodthatyouwanttoloadbalance.ValidvaluesareRANDOMandROUND_ROBIN.Ifnotspecified,thedefaultisROUND_ROBIN.Youcanimplementyourownselector,butthatisoutsidethescopeofthisbook.Ifyouareinterested,gohavealookatthewell-documentedsourcecodefortheLoadBalancingLog4jAppenderclass.

Finally,thereisanotheroptionalpropertytooverridethemaximumtimeforexponentialbackoffwhenaservercannotbecontacted.Initially,ifaservercannotbecontacted,1secondwillneedtopassbeforethatserveristriedagain.Eachtimetheserverisunavailable,theretrytimedoubles,uptoadefaultmaximumof30seconds.Ifwewantedtoincreasethismaximumto2minutes,wecanspecifyaMaxBackoffpropertyinmillisecondsasfollows:

<appendername="FLUME"

class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">

<paramname="Hosts"value="server1:42424server2:42424"/>

<paramname="Selector"value="RANDOM"/>

<paramname="MaxBackoff"value="120000"/>

</appender>

Inthisexample,wehavealsooverriddenthedefaultround_robinselectortousearandomselection.

The embedded agent

If you are writing a Java program that creates data, you may choose to send the data directly as structured data using a special mode of Flume called the Embedded Agent. It is basically a simple single source/single channel Flume agent that you run inside your JVM.

There are benefits and drawbacks to this approach. On the positive side, you don't need to monitor an additional process on your servers to relay data. The embedded channel also allows the data producer to continue executing its code immediately after queuing the event to the channel. The SinkRunner thread handles taking events from the channel and sending them to the configured sinks. Even if you didn't use embedded Flume to perform this handoff from the calling thread, you would most likely use some kind of synchronized queue (such as BlockingQueue) to isolate the sending of the data from the main execution thread. Using Embedded Flume provides the same functionality without having to worry whether you've written your multithreaded code correctly.
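To make that comparison concrete, here is a rough, hypothetical sketch of the hand-rolled alternative the embedded agent saves you from writing: a bounded queue decouples the producing thread from a background sender thread, which is essentially what the embedded channel and the SinkRunner thread do for you.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical do-it-yourself forwarder, shown only for comparison.
public class HandRolledForwarder implements Runnable {
    // A bounded queue plays the role of the embedded agent's channel.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);

    // Called from the application's main execution path; returns immediately,
    // much like EmbeddedAgent.put() returns once the event is queued.
    public boolean offer(String record) {
        return queue.offer(record);
    }

    // Runs on its own thread, analogous to Flume's SinkRunner thread.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String record = queue.take();
                send(record);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void send(String record) {
        // Network I/O, batching, retries, and failover would all live here;
        // this is the code you avoid writing by using the embedded agent.
    }
}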

The major drawback to embedding Flume in your application is added memory pressure on your JVM's garbage collector. If you are using an in-memory channel, any unsent events are held in the heap and get in the way of cheap garbage collection by way of short-lived objects. However, this is why you set a channel size: to keep the maximum memory footprint to a known quantity. Furthermore, any configuration changes will require an application restart, which can be problematic if your overall system doesn't have sufficient redundancy built in to tolerate restarts.

Configuration and startup

Assuming you are not dissuaded (and you shouldn't be), you will first need to include the flume-ng-embedded-agent library (and dependencies) in your Java project. Depending on what build system you are using (Maven, Ivy, Gradle, and so on), the exact format will differ, so I'll just show you the Maven configuration here. You can look up the alternative formats at http://mvnrepository.com/:

<dependency>
  <groupId>org.apache.flume</groupId>
  <artifactId>flume-ng-embedded-agent</artifactId>
  <version>1.5.2</version>
</dependency>

Start by creating an EmbeddedAgent object in your Java code by calling the constructor (and passing a string name, used only in error messages):

EmbeddedAgent toHadoop = new EmbeddedAgent("myData");

Next, you have to set properties for the channel, sinks, and sink processor via the configure() method, as shown in this example. Most of this configuration should look very familiar to you at this point:

Map<String, String> config = new HashMap<String, String>();
config.put("channel.type", "memory");
config.put("channel.capacity", "75");
config.put("sinks", "s1 s2");
config.put("sink.s1.type", "avro");
config.put("sink.s1.hostname", "foo.example.com");
config.put("sink.s1.port", "12345");
config.put("sink.s1.compression-type", "deflate");
config.put("sink.s2.type", "avro");
config.put("sink.s2.hostname", "bar.example.com");
config.put("sink.s2.port", "12345");
config.put("sink.s2.compression-type", "deflate");
config.put("processor.type", "failover");
config.put("processor.priority.s1", "10");
config.put("processor.priority.s2", "20");
toHadoop.configure(config);

Here, we define a memory channel with a capacity of 75 events along with two Avro sinks in an active/standby configuration (first, to foo.example.com, and if that fails, to bar.example.com) using Avro serialization with compression. Refer to Chapter 3, Channels, for specific settings for memory- or file-backed channel properties.

The sink processor only comes into play if you have more than one sink defined in your sinks property. Unlike the optional sink groups options covered in Chapter 4, Sinks and Sink Processors, you need to specify a sink list, even if it is just one. Clearly, there is no behavior difference between failover and load balancing when there is only one sink. Refer to each specific sink's properties in Chapter 4, Sinks and Sink Processors, for specific configuration parameters.

Finally, before you start using this agent, you need to call the start() method to instantiate everything based on your properties and start all the background processing threads as follows:

toHadoop.start();

The class is now ready to start receiving and forwarding data.

You'll want to keep a reference to this object around for cleanup later, as well as to pass it to other objects that will be sending data, as the underlying channel provides thread safety. Personally, I use Spring Framework's dependency injection to configure and pass references at startup rather than doing it programmatically, as I've shown in this example. Refer to the Spring website for more information (http://spring.io/), as proper use of Spring is a whole book unto itself, and it is not the only dependency injection framework available to you.

Sending data

Data can be sent either as single events or in batches. Batches can sometimes be more efficient in high-volume data streams.

To send a single event, just call the put() method, as shown in this example:

Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF8"));
toHadoop.put(e);

Here, I'm using one of many methods available on the org.apache.flume.event.EventBuilder class. Here is a complete list of methods you can use to construct events with this helper class:

EventBuilder.withBody(byte[] body, Map<String, String> headers);
EventBuilder.withBody(byte[] body);
EventBuilder.withBody(String body, Charset charset, Map<String, String> headers);
EventBuilder.withBody(String body, Charset charset);

In order to send several events in one batch, you can call the putAll() method with a list of events, as shown in this example, which also includes a time header:

Map<String, String> headers = new HashMap<String, String>();
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("First".getBytes("UTF-8"), headers));
events.add(EventBuilder.withBody("Second".getBytes("UTF-8"), headers));
toHadoop.putAll(events);

As interceptors are not supported, you will need to add any headers you want to add to the events programmatically in Java before you call put() or putAll().
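For example, if you wanted the equivalent of the Timestamp and Host interceptors, a small sketch like this would do the work before the event is queued. The header names mirror those interceptors' defaults, but the lookup code itself is your own and purely illustrative:

Map<String, String> headers = new HashMap<String, String>();
// Equivalent of the Timestamp interceptor: epoch milliseconds as a string.
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
try {
    // Equivalent of the Host interceptor: this machine's hostname.
    headers.put("host", java.net.InetAddress.getLocalHost().getHostName());
} catch (java.net.UnknownHostException e) {
    headers.put("host", "unknown");
}

Event tagged = EventBuilder.withBody("some record", Charset.forName("UTF-8"), headers);
toHadoop.put(tagged);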

Shutdown

Finally, as there are background threads waiting for data to arrive on your embedded channel, which are most likely holding persistent connections to configured destinations, you'll want to have Flume do its cleanup when it is time to shut down your application. You simply call the stop() method on the configured EmbeddedAgent as follows:

toHadoop.stop();

Routing

The routing of data to different destinations based on content should be fairly straightforward now that you've been introduced to all the various mechanisms in Flume.

The first step is to get the data you want to switch on into a Flume header by means of a source-side interceptor if the header isn't already available. The second step is to use a Multiplexing Channel Selector on that header value to switch the data to an alternate channel.

For instance, let's say you wanted to capture all exceptions to HDFS. In this configuration, you can see events coming in on the s1 source via Avro on port 42424. The event is tested to see whether the body contains the text Exception. If it does, it creates an exception header key (with the value of Exception). This header is used to switch these events to channel c1, and ultimately, HDFS. If the event didn't match the pattern, it would not have the exception header and would get passed to the c2 channel via the default selector, where it would be forwarded via Avro serialization to port 12345 on the server foo.example.com:

agent.sources=s1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=42424
agent.sources.s1.channels=c1 c2
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(Exception)
agent.sources.s1.interceptors.i1.serializers=ex
agent.sources.s1.interceptors.i1.serializers.ex.name=exception
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=exception
agent.sources.s1.selector.mapping.Exception=c1
agent.sources.s1.selector.default=c2

agent.channels=c1 c2
agent.channels.c1.type=memory
agent.channels.c2.type=memory

agent.sinks=k1 k2
agent.sinks.k1.type=hdfs
agent.sinks.k1.channel=c1
agent.sinks.k1.hdfs.path=/logs/exceptions/%y/%m/%d/%H

agent.sinks.k2.type=avro
agent.sinks.k2.channel=c2
agent.sinks.k2.hostname=foo.example.com
agent.sinks.k2.port=12345

Summary

In this chapter, we covered various interceptors shipped with Flume, including:

Timestamp: These are used to add a timestamp header, possibly overwriting an existing one.
Host: This is used to add the Flume agent hostname or IP as a header in the event.
Static: This is used to add static String headers.
Regular expression filtering: This is used to include or exclude events based on a matched regular expression.
Regular expression extractor: This is used to create headers from matched regular expressions. It's useful for routing with Channel Selectors.
Morphline: This is used to delegate transformation to a Morphline command chain.
Custom: This is used to create any custom transformations you need that you can't find elsewhere.

We also covered tiering data flows using the Avro source and sink. Optional compression and SSL with Avro flows were covered as well. Finally, Thrift sources and sinks were briefly covered, as some environments may already have Thrift dataflows to integrate with.

Next, we introduced two Log4J Appenders, a single path and a load-balancing version, for direct integration with Java applications.

The Embedded Flume Agent was covered for those wishing to directly integrate basic Flume functionality into their Java applications.

Finally, we gave you an example of using interceptors in conjunction with a Channel Selector to provide routing decision logic.

In the next chapter, we will dive into an example to stitch together everything we have covered so far.

Chapter 7. Putting It All Together

Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, nor does it cover every possible scenario you might need, but I think it should cover a couple of common use cases I've seen over and over:

Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing

In the first situation, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single place that you can search means getting your systems restored quickly. In the second scenario, you are interested in capturing data in the long term for analytics and machine learning.

Web logs to searchable UI

Let's simulate a web application by setting up a simple web server whose logs we want to stream into some searchable application. In this case, we'll be using a Kibana UI to perform ad hoc queries against Elasticsearch.

For this example, I'll start three servers in Amazon's Elastic Compute Cloud (EC2), as shown in this diagram:

Each server has a public IP (starting with 54) and a private IP (starting with 172). For interserver communication, I'll be using the private IPs in my configurations. My personal interaction with the web server (to simulate traffic) and with Kibana and Elasticsearch will require the public IPs, since I'm sitting in my house and not in Amazon's data centers.

Note

Pay careful attention to the IP addresses in the shell prompts, as we will be jumping from machine to machine, and I don't want you to get lost. For instance, on the Collector box, the prompt will contain its private IP: [ec2-user@ip-172-31-26-205 ~]$

If you try this out yourself in EC2, you will get different IP assignments, so adjust the configuration files and URLs referenced as needed. For this example, I'll be using Amazon's Linux AMI for the operating system and the t1.micro size server. Also, be sure to adjust the security groups to allow network traffic between these servers (and yourself). When in doubt about connectivity, use utilities such as telnet or nc to test connections between servers and ports.

Tip

While going through this exercise, I made mistakes in my security groups more than once, so perform these tests when you run into connectivity exceptions.

Setting up the web server

Let's start by setting up the web server. For this, we'll install the popular Nginx web server (http://nginx.org). Since web server logs are pretty much standardized nowadays, I could have chosen any web server, but the point here is that this application writes its logs (by default) to a file on the disk. Since I've warned you away from trying to use the tail program to stream them, we'll be using a combination of logrotate and the Spooling Directory Source to ingest the data. First, let's log into the server. The default account with sudo for an Amazon Linux server is ec2-user. You'll need to pass this in your ssh command:

% ssh 54.69.112.61 -l ec2-user
The authenticity of host '54.69.112.61 (54.69.112.61)' can't be established.
RSA key fingerprint is dc:ee:7a:1a:05:e1:36:cd:a4:81:03:97:48:8d:b3:cc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.69.112.61' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2014.09-release-notes/
18 package(s) needed for security, out of 41 available
Run "sudo yum update" to apply all updates.

Now that you are logged into the server, let's install Nginx (I'm not going to show all of the output to save paper):

[ec2-user@ip-172-31-18-146 ~]$ sudo yum -y install nginx
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 1:1.6.2-1.22.amzn1 will be installed
[SNIP]
Installed:
  nginx.x86_64 1:1.6.2-1.22.amzn1

Dependency Installed:
  GeoIP.x86_64 0:1.4.8-1.5.amzn1             gd.x86_64 0:2.0.35-11.10.amzn1
  gperftools-libs.x86_64 0:2.0-11.5.amzn1    libXpm.x86_64 0:3.5.10-2.9.amzn1
  libunwind.x86_64 0:1.1-2.1.amzn1

Complete!

Next, let's start the Nginx web server:

[ec2-user@ip-172-31-18-146 ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]

At this point, the web server should be running, so let's use our computer's web browser to go to the public IP, http://54.69.112.61/ in this case. If it is working, you should see the default welcome page, like this:

To help generate a sample input, you can use a load generator program such as Apache benchmark (http://httpd.apache.org/docs/2.4/programs/ab.html) or wrk (https://github.com/wg/wrk). Here is the command I used to generate data for 20 minutes at a time:

% wrk -c2 -d20m http://54.69.112.61
Running 20m test @ http://54.69.112.61
  2 threads and 2 connections

Configuring log rotation to the spool directory

By default, the server's access logs are written to /var/log/nginx/access.log. The problem we need to overcome here is moving the file to a separate directory while it is still open by the nginx process. This is where logrotate comes into the picture. Let's first create a spool directory that will become the input path for the Spooling Directory Source later on. For the purpose of this example, I'll just create the spool directory in the home directory of ec2-user. In a production environment, you would place the spool directory somewhere else:

[ec2-user@ip-172-31-18-146 ~]$ pwd
/home/ec2-user
[ec2-user@ip-172-31-18-146 ~]$ mkdir spool

Let's also create a logrotate script in the home directory of ec2-user. Open an editor and paste these contents (after the prompt) into a file called accessRotate.conf:

[ec2-user@ip-172-31-18-146 ~]$ cat accessRotate.conf
/var/log/nginx/access.log {
  missingok
  notifempty
  rotate 0
  copytruncate
  sharedscripts
  olddir /home/ec2-user/spool
  postrotate
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
  endscript
}

Let's go over this so that you understand what it is doing. This is by no means a complete introduction to the logrotate utility. Read the online documentation at http://linuxconfig.org/logrotate for more details.

When logrotate runs, it will copy /var/log/nginx/access.log to the /home/ec2-user/spool directory, but only if it has a nonzero length (meaning, there is data to send). The rotate 0 command tells logrotate not to remove any files from the target directory (since the Flume source will do that after it has successfully transmitted each file). We'll make use of the copytruncate feature because it keeps the file open as far as the nginx process is concerned, while resetting it to zero length. This avoids the need to signal nginx that a rotation has just occurred. The destination of the rotated file in the spool directory is /home/ec2-user/spool/access.log.1. Next, in the postrotate script section, we will change the ownership of the file from the root user to ec2-user so that the Flume agent can read it, and later, delete it. Finally, we rename the file with a unique timestamp so that when the next rotation occurs, the access.log.1 file will not exist. This also means that we'll need to add an exclusion rule to the Flume source later to keep it from reading the access.log.1 file until the copying process has completed. Once renamed, it is ready to be consumed by Flume.

Nowlet’stryrunningitindebugmode,whichwillbejustadryrunsothatwecanseewhatitcando.Normally,logrotatehasadailyrotationsettingthatpreventsitfromdoinganythingunlessenoughtimehaspassed:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-d/home/ec2-

user/accessRotate.conf

readingconfigfile/home/ec2-user/accessRotate.conf

readingconfiginfofor/var/log/nginx/access.log

olddirisnow/home/ec2-user/spool

Handling1logs

rotatingpattern:/var/log/nginx/access.log1048576bytes(nooldlogs

willbekept)

olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs

areremoved

consideringlog/var/log/nginx/access.log

logdoesnotneedrotating

notrunningpostrotatescript,sincenologswererotated

Therefore, we will also include the -f (force) flag to override the preceding behavior:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
  log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 0
dateext suffix '-20141207'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
renaming /home/ec2-user/spool/access.log.1 to /home/ec2-user/spool/access.log.2 (rotatecount 1, logstart 1, i 1),
renaming /home/ec2-user/spool/access.log.0 to /home/ec2-user/spool/access.log.1 (rotatecount 1, logstart 1, i 0),
copying /var/log/nginx/access.log to /home/ec2-user/spool/access.log.1
truncating /var/log/nginx/access.log
running postrotate script
running script with arg /var/log/nginx/access.log: "
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
"
removing old log /home/ec2-user/spool/access.log.2

As you can see, logrotate will copy the log to the spool directory with the .1 extension as expected, followed by our script block at the end to change permissions and rename it with a unique timestamp.

Nowlet’srunitagain,butthistimeforreal,withoutthedebugflag,andthenwewilllistthesourceandtargetdirectoriestoseewhathappened:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-f/home/ec2-

user/accessRotate.conf

[ec2-user@ip-172-31-18-146~]$ls-lspool/

total188

-rw-r--r--1ec2-userec2-user189344Dec717:27access.log.1417973241

[ec2-user@ip-172-31-18-146~]$ls-l/var/log/nginx/

total4

-rw-r--r--1rootroot0Dec717:27access.log

-rw-r--r--1rootroot520Dec716:44error.log

[ec2-user@ip-172-31-18-146~]$

Youcanseetheaccesslogisemptyandtheoldcontentshavebeencopiedtothespooldirectory,withcorrectpermissionsandthefilenameendinginauniquetimestamp.

Let’salsoverifythattheemptyaccesslogwillresultinnoactionifrunwithoutanydata.You’llneedtoincludethedebugflagtoseewhattheprocessisthinking:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-df/home/ec2-

user/accessRotate.conf

readingconfigfileaccessRotate.conf

readingconfiginfofor/var/log/nginx/access.log

olddirisnow/home/ec2-user/spool

Handling1logs

rotatingpattern:/var/log/nginx/access.logforcedfromcommandline(1

rotations)

olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs

areremoved

consideringlog/var/log/nginx/access.log

logdoesnotneedrotating

Now that we have a working rotation script, we need something to run it periodically (not in debug mode) at some chosen interval. For this, we will use the cron daemon (http://en.wikipedia.org/wiki/Cron) by creating a file in the /etc/cron.d directory:

[ec2-user@ip-172-31-18-146 ~]$ cat /etc/cron.d/rotateLogsToSpool
# Move files to spool directory every 5 minutes
*/5 * * * * root /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf

Here, I've indicated a five-minute interval and to run as the root user. Since the root user owns the original log file, we need elevated access to reassign ownership to ec2-user so that it can be removed later by Flume.

Once you've saved this cron configuration file, you should see our script running every 5 minutes (as well as other processes) by inspecting the cron daemon's log file:

[ec2-user@ip-172-31-18-146 ~]$ sudo tail /var/log/cron
Dec  7 17:45:01 ip-172-31-18-146 CROND[22904]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:50:01 ip-172-31-18-146 CROND[22929]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' started
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' terminated
Dec  7 17:55:01 ip-172-31-18-146 CROND[22955]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:00:01 ip-172-31-18-146 CROND[23046]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:01:01 ip-172-31-18-146 CROND[23060]: (root) CMD (run-parts /etc/cron.hourly)
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23060]: starting 0anacron
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23069]: finished 0anacron
Dec  7 18:05:01 ip-172-31-18-146 CROND[23117]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)

An inspection of the spool directory also shows new files created on our arbitrarily chosen 5-minute interval:

[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 1072
-rw-r--r-- 1 ec2-user ec2-user 145029 Dec  7 18:02 access.log.1417975373
-rw-r--r-- 1 ec2-user ec2-user 164169 Dec  7 18:05 access.log.1417975501
-rw-r--r-- 1 ec2-user ec2-user 166779 Dec  7 18:10 access.log.1417975801
-rw-r--r-- 1 ec2-user ec2-user 613785 Dec  7 18:15 access.log.1417976101

Keep in mind that the Nginx RPM package also installed a rotation configuration located at /etc/logrotate.d/nginx to perform daily rotations. If you are going to use this example in production, you'll want to remove that configuration, since we don't want it clashing with our more frequently running cron script. I'll leave handling error.log as an exercise for you. You'll either want to send it someplace (and have Flume remove it) or rotate it periodically so that your disk doesn't fill up over time.

Setting up the target – Elasticsearch

Now let's move to the other server in our diagram and set up Elasticsearch, our destination for searchable data. I'm going to be using the instructions found at http://www.elasticsearch.org/overview/elkdownloads/. First, let's download and install the RPM package:

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 18:25:35-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'

100%[======================================>] 26,326,154  9.92MB/s   in 2.5s

2014-12-07 18:25:38 (9.92 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-120 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

The installation is kind enough to tell me how to configure the service to automatically start on system boot-up, and it also tells me how to launch the service now, so let's do that:

[ec2-user@ip-172-31-26-120 ~]$ sudo /sbin/chkconfig --add elasticsearch
[ec2-user@ip-172-31-26-120 ~]$ sudo service elasticsearch start
Starting elasticsearch:                                    [  OK  ]

Let's perform a quick test with curl to verify that it is running on the default port, which is 9200:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/
{
  "status" : 200,
  "name" : "Siege",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.1",
    "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
    "build_timestamp" : "2014-11-26T15:49:29Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}

Since the indexes are created as data comes in, and we haven't sent any data, we expect to see no indexes created yet:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size

All we see is the header with no indexes listed, so let's head over to the collector server and set up the Flume relay which will write data to Elasticsearch.

Setting up Flume on collector/relay

The Flume configuration on the Flume collector will be compressed Avro coming in and Elasticsearch going out. Go ahead and log into the collector server (172.31.26.205).

Let's start by downloading the Flume binary from the Apache website. Follow the download link and select a mirror:

[ec2-user@ip-172-31-26-205 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 19:50:30-- http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'

100%[==========================>] 25,323,459  8.38MB/s   in 2.9s

2014-12-07 19:50:33 (8.38 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]

Next, expand and change directories:

[ec2-user@ip-172-31-26-205 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$

Create the configuration file, called collector.conf, in Flume's configuration directory using your favorite editor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es

collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1

collector.channels.m1.type=memory
collector.channels.m1.capacity=10000

collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120

Here, you can see that we are using a simple memory channel configured with a capacity of 10,000 events.

The source is configured to accept compressed Avro on port 12345 and pass it to our memory channel.

Finally, the sink is configured to write to Elasticsearch on the server we just set up at the 172.31.26.120 private IP. We are using the default settings, which means it'll write to the index named flume-YYYY-MM-DD with the log type.

Let's try running the Flume agent:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console

You'll see an exception in the log, including something like this:

2014-12-07 20:00:13,184 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/elasticsearch/common/io/BytesStream

The Flume agent can't find the Elasticsearch classes. Remember that these are not packaged with Flume, as the libraries need to be compatible with the version of Elasticsearch you are running. Looking back at the Elasticsearch server, we can get an idea of what we need. Remember that this list includes many runtime server dependencies, so it is probably more than what you'll need for a functional Elasticsearch client:

[ec2-user@ip-172-31-26-120 ~]$ rpm -qil elasticsearch | grep jar

/usr/share/elasticsearch/lib/elasticsearch-1.4.1.jar

/usr/share/elasticsearch/lib/groovy-all-2.3.2.jar

/usr/share/elasticsearch/lib/jna-4.1.0.jar

/usr/share/elasticsearch/lib/jts-1.13.jar

/usr/share/elasticsearch/lib/log4j-1.2.17.jar

/usr/share/elasticsearch/lib/lucene-analyzers-common-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-expressions-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-grouping-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-highlighter-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-join-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-memory-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-misc-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-queries-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-queryparser-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-sandbox-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-spatial-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-suggest-4.10.2.jar

/usr/share/elasticsearch/lib/sigar/sigar-1.6.4.jar

/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar

Really, you only need the elasticsearch.jar file and its dependencies, but we are going to be lazy and just download the RPM again on the collector machine, and copy the JAR files to Flume:

[ec2-user@ip-172-31-26-205 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 20:03:38-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'

100%[======================================>] 26,326,154  9.70MB/s   in 2.6s

2014-12-07 20:03:41 (9.70 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

This time we will not configure the service to start. Instead, we'll copy the JAR files we need to Flume's plugins directory architecture, which we learned about in the previous chapter:

[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ mkdir -p plugins.d/elasticsearch/libext
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cp /usr/share/elasticsearch/lib/*.jar plugins.d/elasticsearch/libext/

Now try running the Flume agent again:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console

No exceptions this time around, but still no data. Let's go back to the web server machine and set up the final Flume agent.

Setting up Flume on the client

On the first server, the web server, let's download the Flume binaries again and expand the package:

[ec2-user@ip-172-31-18-146 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 20:12:53-- http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'

100%[=========================>] 25,323,459  8.13MB/s   in 3.0s

2014-12-07 20:12:56 (8.13 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]

[ec2-user@ip-172-31-18-146 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-18-146 ~]$ cd apache-flume-1.5.2-bin

This time, our Flume configuration takes the spool directory we set up before, filled by logrotate, as input, and it needs to write compressed Avro to the collector server. Open an editor and create the client.conf file using your favorite editor:

[ec2-user@ip-172-31-18-146 apache-flume-1.5.2-bin]$ cat conf/client.conf
client.sources=sd
client.channels=m1
client.sinks=av

client.sources.sd.type=spooldir
client.sources.sd.spoolDir=/home/ec2-user/spool
client.sources.sd.deletePolicy=immediate
client.sources.sd.ignorePattern=access.log.1$
client.sources.sd.channels=m1

client.channels.m1.type=memory
client.channels.m1.capacity=10000

client.sinks.av.type=avro
client.sinks.av.hostname=172.31.26.205
client.sinks.av.port=12345
client.sinks.av.compression-type=deflate
client.sinks.av.channel=m1

Again, for simplicity, we are using a memory channel with a 10,000-record capacity.

For the source, we configure the Spooling Directory Source with /home/ec2-user/spool as the input. Additionally, we configure the deletion policy to remove the files after sending is complete. This is also where we set up the exclusion rule for the access.log.1 filename pattern mentioned earlier. Note the dollar sign at the end of the filename, denoting the end of the line. Without this, the exclusion pattern would also exclude valid files, such as access.log.1417975373.

Finally, an Avro sink is configured to point at the collector's private IP and port 12345. Additionally, we set the compression so that it matches the receiving Avro source's settings.

Nowlet’stryrunningtheagent:

[ec2-user@ip-172-31-18-146apache-flume-1.5.2-bin]$./bin/flume-ngagent-n

client-cconf-fconf/client.conf-Dflume.root.logger=INFO,console

Noexceptions!Butmoreimportantly,Iseethelogfilesinthespooldirectorybeingprocessedanddeleted:

2014-12-0720:59:04,041(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417976401

2014-12-0720:59:05,319(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417976701

2014-12-0720:59:06,245(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417977001

2014-12-0720:59:06,245(pool-4-thread-1)[INFO-

org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(Spo

olDirectorySource.java:254)]SpoolingDirectorySourcerunnerhasshutdown.

2014-12-0720:59:06,746(pool-4-thread-1)[INFO-

org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(Spo

olDirectorySource.java:254)]SpoolingDirectorySourcerunnerhasshutdown.

Onceallthefileshavebeenprocessed,thelastlinesarerepeatedevery500milliseconds.ThisisaknownbuginFlume(https://issues.apache.org/jira/browse/FLUME-2385).Ithasalreadybeenfixedandisslatedforthe1.6.0release,sobesuretosetuplogrotationonyourFlumeagentandcleanupthismessbeforeyourunoutofdiskspace.

On the collector box, we see writes occurring to Elasticsearch:

2014-12-07 22:10:03,694 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.execute(ElasticSearchTransportClient.java:181)] Sending bulk to elasticsearch cluster

Querying the Elasticsearch REST API, we can see an index with records:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index            pri rep docs.count docs.deleted store.size pri.store.size
yellow open   flume-2014-12-07   5   1      32102            0      1.3mb          1.3mb

Let's read the first 5 records:

[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 32102,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVVbObD75ecNqJg",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJj",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJo",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJt",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJy",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    } ]
  }
}

As you can see, the log lines are in there under the @message field, and they can now be searched. However, we can do better. Let's break that message down into searchable fields.

Creating more search fields with an interceptor

Let's borrow some code from what we covered earlier in this book to extract some Flume headers from this common log format, knowing that all Flume headers will become fields in Elasticsearch. Since we are creating fields to be searched by in Elasticsearch, I'm going to add them to the collector's configuration rather than the web server's Flume agent.

Change the agent configuration on the collector to include a Regular Expression Extractor interceptor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es

collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.sources.av.interceptors=e1
collector.sources.av.interceptors.e1.type=regex_extractor
collector.sources.av.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
collector.sources.av.interceptors.e1.serializers=ip dt url sc bc
collector.sources.av.interceptors.e1.serializers.ip.name=source
collector.sources.av.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
collector.sources.av.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
collector.sources.av.interceptors.e1.serializers.dt.name=timestamp
collector.sources.av.interceptors.e1.serializers.url.name=http_request
collector.sources.av.interceptors.e1.serializers.sc.name=status_code
collector.sources.av.interceptors.e1.serializers.bc.name=bytes_xfered

collector.channels.m1.type=memory
collector.channels.m1.capacity=10000

collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120

To follow the Logstash convention, I've renamed the hostname header to source. Now, to make the new format easy to find, I'm going to delete the existing index:

[ec2-user@ip-172-31-26-120 ~]$ curl -XDELETE 'http://localhost:9200/flume-2014-12-07/'
{"acknowledged":true}
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "error" : "IndexMissingException[[flume-2014-12-07] missing]",
  "status" : 404
}

Next, I create more traffic on my web server and wait for it to appear at the other end. Then I query some records to see what it looks like:

[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12083,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_H",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:15 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:15.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975995000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_M",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_R",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_W",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_a",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    } ]
  }
}

As you can see, we now have additional fields that we can search by. Additionally, you can see that Elasticsearch has taken our millisecond-based timestamp field and created its own @timestamp field in ISO-8601 format.

Now we can do some more interesting queries, such as finding out how many successful (status 200) pages we saw:

[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_count' -d '{"query":{"term":{"status_code":"200"}}}'
{"count":46018,"_shards":{"total":5,"successful":5,"failed":0}}

Setting up a better user interface – Kibana

While it appears as if we are done, there is one more thing we should do. The data is all there, but you need to be an expert in querying Elasticsearch to make good use of the information. After all, if the data is difficult to consume and gets ignored, then why bother collecting it at all? What you really need is a nice, searchable web interface that humans with non-technical backgrounds can use. For this, we are going to set up Kibana. In a nutshell, Kibana is a web application that runs as a dynamic HTML page in your browser, making calls for data to Elasticsearch when necessary. The result is an interactive web interface that doesn't require you to learn the details of the Elasticsearch query API. This is not the only option available to you; it is just what I'm using in this example. Let's download Kibana 3 from the Elasticsearch website, and install it on the same server (although you could easily serve this from another HTTP server):

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
--2014-12-07 18:31:16-- https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074306 (1.0M) [application/octet-stream]
Saving to: 'kibana-3.1.2.tar.gz'

100%[======================================>] 1,074,306   1.33MB/s   in 0.8s

2014-12-07 18:31:17 (1.33 MB/s) - 'kibana-3.1.2.tar.gz' saved [1074306/1074306]

[ec2-user@ip-172-31-26-120 ~]$ tar -zxf kibana-3.1.2.tar.gz
[ec2-user@ip-172-31-26-120 ~]$ cd kibana-3.1.2

Note

At the time of writing this book, a newer version of Kibana (Kibana 4) is in beta. These instructions may be outdated by the time this book is released, but rest assured that somewhere in the Kibana setup, you can edit a configuration file to point to your Elasticsearch server. I have not tried out the newer version yet, but there is a good overview of it on the Elasticsearch blog at http://www.elasticsearch.org/blog/kibana-4-beta-3-now-more-filtery.

Open the config.js file to edit the line with the elasticsearch key. This needs to be the URL of the public name or IP of the Elasticsearch API. For our example, this line should look as shown here:

elasticsearch: 'http://ec2-54-148-230-252.us-west-2.compute.amazonaws.com:9200',

Now we need to serve this directory to a web browser. Let's download and install Nginx again:

[ec2-user@ip-172-31-26-120 ~]$ sudo yum install nginx

By default, the root directory is /usr/share/nginx/html. We can change this configuration in Nginx, but to make things easy, let's just create a symbolic link to point to the right location. First, move the original path out of the way by renaming it:

[ec2-user@ip-172-31-26-120 ~]$ sudo mv /usr/share/nginx/html /usr/share/nginx/html.dist

Next, link the configured Kibana directory as the new web root:

[ec2-user@ip-172-31-26-120 ~]$ sudo ln -s ~/kibana-3.1.2 /usr/share/nginx/html

Finally, start the web server:

[ec2-user@ip-172-31-26-120 ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]

From your computer, go to http://54.148.230.252/. If you see this page, it means you may have made a mistake in your Kibana configuration:

This error page means your web browser can't connect to Elasticsearch. You may need to clear your browser cache if you fixed the configuration, as web pages are typically cached locally for some period of time. If you got it right, the screen should look like this:

Go ahead and select Sample Dashboard, the first option. You should see something like what is shown in the next screenshot. This includes some of the data we ingested, record counts in the center, and the filtering fields in the left-hand margin.

I'm not going to claim to be a Kibana expert (I'm far from that), so I'll leave further customization of this to you. Use this as a base to go back and make additional modifications to the data to make it easier to consume, search, or filter. To do that, you'll probably need to get more familiar with how Elasticsearch works, but that's okay because knowledge isn't a bad thing. A copious amount of documentation is waiting for you at http://www.elasticsearch.org/guide/en/kibana/current/index.html.

At this point, we have completed an end-to-end implementation of data from a web server streamed in a near real-time fashion to a web-based tool for searching. Since both the source and target formats were dictated by others, we used an interceptor to transform the data en route. This use case is very good for short-term troubleshooting, but it's clearly not very "Hadoopy" since we have yet to use core Hadoop.

Archiving to HDFS

When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.

So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:

I used Cloudera's one-line installation instructions to speed up the setup. The instructions can be found at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html

Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM repository and imported the corresponding GPG key:

[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Retrieving http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:cloudera-cdh-5-0                 ################################# [100%]
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Next, I installed the pseudo-distributed configuration to run in a single-node Hadoop cluster:

[ec2-user@ip-172-31-7-175 ~]$ sudo yum install -y hadoop-0.20-conf-pseudo

This might take a while as it downloads all the Cloudera Hadoop distribution dependencies.

Since this configuration is for single-node use, we need to adjust the fs.defaultFS property in /etc/hadoop/conf/core-site.xml to advertise our private IP instead of localhost. If we don't do this, the namenode process will bind to 127.0.0.1, and other servers, such as our collector's Flume agent, will not be able to contact it:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://172.31.7.175:8020</value>
</property>

Next, we format the new HDFS volume and start the HDFS daemons (since that is all we need for this example):

[ec2-user@ip-172-31-7-175 ~]$ sudo -u hdfs hdfs namenode -format
[ec2-user@ip-172-31-7-175 ~]$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-7-175.out
Started Hadoop datanode (hadoop-hdfs-datanode):            [  OK  ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-7-175.out
Started Hadoop namenode:                                   [  OK  ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-172-31-7-175.out
Started Hadoop secondarynamenode:                          [  OK  ]
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -df
Filesystem                      Size   Used   Available  Use%
hdfs://172.31.7.175:8020  8318783488  24576  6661824512    0%

Now that HDFS is running, let's go back to the collector box configuration to create a second channel and an HDFS sink by adding these lines:

collector.channels.h1.type=memory
collector.channels.h1.capacity=10000

collector.sinks.hadoop.type=hdfs
collector.sinks.hadoop.channel=h1
collector.sinks.hadoop.hdfs.path=hdfs://172.31.7.175/access_logs/%Y/%m/%d/%H
collector.sinks.hadoop.hdfs.filePrefix=access
collector.sinks.hadoop.hdfs.rollInterval=60
collector.sinks.hadoop.hdfs.rollSize=0
collector.sinks.hadoop.hdfs.rollCount=0

Then we modify the top-level channels and sinks keys:

collector.channels=m1 h1
collector.sinks=es hadoop

As you can see, I've gone with a simple memory channel again, but feel free to use a durable file channel if you need it. For the HDFS configuration, I'll be using a dated file path under the /access_logs root directory with a 60-second rotation regardless of size. We are not altering the source just yet, so don't worry.

If we attempt to start the collector now, we see this exception:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType

Remember that for Apache Flume to speak to HDFS, we need compatible HDFS classes and dependencies for the version of Hadoop we are speaking to. Let's get some help from our friends at Cloudera and install the Hadoop client RPM (output removed to save paper):

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
[ec2-user@ip-172-31-26-205 ~]$ sudo yum install hadoop-client

Test whether the RPM works by creating the destination directory for our data. We'll also set permissions to match the account we'll be running the Flume agent under (ec2-user in this case):

[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -mkdir hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -chown ec2-user:ec2-user hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ hadoop fs -ls hdfs://172.31.7.175/
Found 1 items
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 04:35 hdfs://172.31.7.175/access_logs

Now, if we run the collector Flume agent, we see no exceptions due to missing HDFS classes. The Flume startup script detects that we have a local Hadoop installation, and appends its classpath to its own.

Finally, we can go back to the Flume configuration file and split the incoming data at the source by listing both channels on the source as destination channels:

collector.sources.av.channels=m1 h1

By default, a replicating channel selector is used. This is what we want, so no further configuration is needed. Save the configuration file and restart the Flume agent.

When data is flowing, you should see the expected HDFS activity in the Flume collector agent's logs:

2014-12-11 05:05:04,141 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:261)] Creating hdfs://172.31.7.175/access_logs/2014/12/11/05/access.1418274302083.tmp

You should also see data appearing in HDFS:

[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -ls -R /access_logs
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12/11
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:06 /access_logs/2014/12/11/05
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302082
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302083
-rw-r--r--   3 ec2-user ec2-user   6429 2014-12-11 05:06 /access_logs/2014/12/11/05/access.1418274302084

If you run this command from a server other than the Hadoop node, you'll need to specify the full HDFS URI (hdfs://172.31.7.175/access_logs) or set the default.FS property in the core-site.xml configuration file in Hadoop.

Note

In retrospect, I probably should have configured the default.FS property on any installed Hadoop client that will use HDFS to save myself a lot of typing. Live and learn!

Like the preceding Elasticsearch implementation, you can now use this example as a base end-to-end configuration for HDFS. The next step will be to go back and modify the format, compression, and so on, on the HDFS Sink to better match what you'll do with the data later.

Summary

In this chapter, we iteratively assembled an end-to-end data flow. We started by setting up an Nginx web server to create access logs. We also configured cron to execute a logrotate configuration periodically to safely rotate old logs to a spooling directory.

Next, we installed and configured a single-node Elasticsearch server and tested some insertions and deletions. Then we configured a Flume client to read input from our spooling directory filled with web logs, and relay them to a Flume collector using compressed Avro serialization. The collector then relayed the incoming data to our Elasticsearch server.

Once we saw data flowing from one end to another, we set up a single-node HDFS server and modified our collector configuration to split the input data feed and relay a copy of the messages to HDFS, simulating archival storage. Finally, we set up a Kibana UI in front of our Elasticsearch instance to provide an easy search function for nontechnical consumers.

In the next chapter, we will cover monitoring Flume data flows using Ganglia.

Chapter 8. Monitoring Flume

The user guide for Flume states:

"Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole."

While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try and cover several of the items that are important, as we cover monitoring options in this chapter.

Monitoring the agent process

The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.

Monit

If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit have a paid version that provides more bells and whistles you may want to consider. Even in the free form, it can provide you with a way to check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.

Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will add checks on the disk, CPU, and memory usage as a minimum, in addition to what we cover in this chapter.

Nagios

Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins for other available applications.

My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it does provide more information to the overall monitoring of our Hadoop ecosystem.

Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.

MonitoringperformancemetricsNowthatwehavecoveredsomeoptionsforprocessmonitoring,howdoyouknowwhetheryourapplicationisactuallydoingtheworkyouthinkitis?Onmanyoccasions,I’veseenastucksyslog-ngprocessthatappearstoberunning,butitjustwasn’tsendinganydata.I’mnotpickingonsyslog-ngspecifically;allsoftwaredoesthiswhenconditionsthatarenotdesignedforoccur.

WhentalkingaboutFlumedataflows,youneedtomonitorthefollowing:

DataenteringsourcesiswithinexpectedratesDataisn’toverflowingyourchannelsDataisexitingsinksatexpectedrates

Flumehasapluggablemonitoringframework,butasmentionedatthebeginningofthechapter,itisstillverymuchaworkinprogress.Thisdoesnotmeanyoushouldn’tuseit,asthatwouldbefoolish.Itmeansyou’llwanttoprepareextratestingandintegrationtimeanytimeyouupgrade.

Note

While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
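Enabling remote JMX is just a matter of standard JVM flags passed to the agent, for example via JAVA_OPTS in conf/flume-env.sh. The port here is an arbitrary example, and you will want real authentication/SSL outside a lab:

export JAVA_OPTS="$JAVA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=5445 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"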

Ganglia

One of the available monitoring options to watch Flume internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

Java property                    Value                     Description
flume.monitoring.type            ganglia                   Set to ganglia
flume.monitoring.hosts           host1:port1,host2:port2   A comma-separated list of host:port pairs for your gmond process(es)
flume.monitoring.pollFrequency   60                        The number of seconds between sending of data (the default is 60 seconds)
flume.monitoring.isGanglia3      false                     Set to true if using the older Ganglia 3 protocol. The default is to send data using the v3.1 protocol.

Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:

udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}

In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:

-Dflume.monitoring.type=ganglia

-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649

Here, I am using defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.
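Putting it together, a full agent startup might look something like the following sketch. The agent name and the configuration file path are assumptions taken from the examples in this book; adjust them to your own layout:

flume-ng agent -n agent -c conf -f conf/flume.conf \
  -Dflume.monitoring.type=ganglia \
  -Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649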

Note

While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large/complicated network setup, you'll want to get educated by your network engineers if things don't work as you expect.

Internal HTTP server

You can configure the Flume agent to start an HTTP server that will output JSON that can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components, to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at the startup of the Flume agent:

Java property            Value   Description
flume.monitoring.type    http    The value is set to http
flume.monitoring.port    PORT    The port number to bind the HTTP server to

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.

Let's look at the following Flume configuration:

agent.sources=s1

agent.channels=c1

agent.sinks=k1

agent.sources.s1.type=avro

agent.sources.s1.bind=0.0.0.0

agent.sources.s1.port=12345

agent.sources.s1.channels=c1

agent.channels.c1.type=memory

agent.sinks.k1.type=avro

agent.sinks.k1.hostname=192.168.33.33

agent.sinks.k1.port=9999

agent.sinks.k1.channel=c1

Start the Flume agent with these properties:

-Dflume.monitoring.type=http

-Dflume.monitoring.port=44444

Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:

{

"SOURCE.s1":{

"OpenConnectionCount":"0",

"AppendBatchAcceptedCount":"0",

"AppendBatchReceivedCount":"0",

"Type":"SOURCE",

"EventAcceptedCount":"0",

"AppendReceivedCount":"0",

"StopTime":"0",

"EventReceivedCount":"0",

"StartTime":"1365128622891",

"AppendAcceptedCount":"0"},

"CHANNEL.c1":{

"EventPutSuccessCount":"0",

"ChannelFillPercentage":"0.0",

"Type":"CHANNEL",

"StopTime":"0",

"EventPutAttemptCount":"0",

"ChannelSize":"0",

"StartTime":"1365128621890",

"EventTakeSuccessCount":"0",

"ChannelCapacity":"100",

"EventTakeAttemptCount":"0"},

"SINK.k1":{

"BatchCompleteCount":"0",

"ConnectionFailedCount":"4",

"EventDrainAttemptCount":"0",

"ConnectionCreatedCount":"0",

"BatchEmptyCount":"0",

"Type":"SINK",

"ConnectionClosedCount":"0",

"EventDrainSuccessCount":"0",

"StopTime":"0",

"StartTime":"1365128622325",

"BatchUnderflowCount":"0"}

}

As you can see, each source, sink, and channel is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, that is, the number of connected clients (who are most likely sending data in). This may help you decide whether you have the expected number of clients relying on that data or, perhaps, too many clients, in which case you need to start tiering your agents.

Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether the data is coming in faster than it is going out. It will also tell you whether you have it set large enough for maintenance/outages of your data volume.

Looking at the sink, EventDrainSuccessCount versus EventDrainAttemptCount will tell you how often output is successful when compared to the times tried. In this example, I am configuring an Avro sink to a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.

Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
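If you want to keep an eye on that one number from a shell script or a cron job, something like this works against the HTTP metrics endpoint shown earlier, assuming the jq utility is installed and the agent was started with monitoring on port 44444:

curl -s http://localhost:44444/metrics | jq -r '."CHANNEL.c1".ChannelFillPercentage'

Anything you can script against this endpoint, you can alert on.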

Custom monitoring hooks

If you already have a monitoring system, you may want to take the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do this, but looking at the interface, you will only see a start() and stop() method. Unlike the more obvious interceptor paradigm, the agent expects that your MonitorService implementation will start/stop a thread to send data on the expected or configured interval if it is the type to send data to a receiving service. If you are going to operate a service, such as the HTTP service, then start/stop would be used to start and stop your listening service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors using object names that start with org.apache.flume. Your implementation will need to read these from MBeanServer.

Note

The best advice I can give you, should you decide to implement your own, is to look at the source of the two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.
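For illustration only, here is a minimal sketch of what such an implementation might look like. It polls the Flume MBeans on a fixed interval and just prints them; you would replace the print with a push to your own system. The JMXPollUtil helper is what the bundled reporters use to read the MBeans, but verify that class and its signature against your Flume version, and treat the ConsoleMetricsServer name as a placeholder:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flume.instrumentation.MonitorService;
import org.apache.flume.instrumentation.util.JMXPollUtil;

public class ConsoleMetricsServer implements MonitorService {

  private ScheduledExecutorService scheduler;

  @Override
  public void start() {
    // The agent only calls start()/stop(), so we schedule our own polling thread.
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        // Returns a map keyed by component name, for example
        // "CHANNEL.c1" -> {"ChannelSize":"0", "ChannelFillPercentage":"0.0", ...}
        Map<String, Map<String, String>> metrics = JMXPollUtil.getAllMBeans();
        for (Map.Entry<String, Map<String, String>> entry : metrics.entrySet()) {
          // Replace this print with a call to your own metrics shipper.
          System.out.println(entry.getKey() + " => " + entry.getValue());
        }
      }
    }, 60, 60, TimeUnit.SECONDS);
  }

  @Override
  public void stop() {
    if (scheduler != null) {
      scheduler.shutdown();
    }
  }
}

The compiled class needs to be on the agent's classpath, for instance via the plugins.d directory covered in Chapter 6, with flume.monitoring.type set to its fully qualified name as described in the preceding note.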

Summary

In this chapter, we covered monitoring Flume agents both at the process level and through their internal metrics (whether they are actually doing work).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent internal monitoring metrics with the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to directly integrate with some other tool that's not supported by Flume by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.

Chapter 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

In this last chapter, I thought we should cover some of the less concrete, random thoughts I have around data collection into Hadoop. There's no hard science behind some of this, and you should feel perfectly alright to disagree with me.

While Hadoop is a great tool to consume vast quantities of data, I often think of a picture of the logjam that occurred in 1886 in the St. Croix River in Minnesota (http://www.nps.gov/sacn/historyculture/stories.htm). When dealing with too much data, you want to make sure you don't jam your river. Be sure to take the previous chapter on monitoring seriously and not just as nice-to-have information.

Transport time versus log time

I had a situation where data was being placed using date patterns in the filename and/or the path in HDFS, and it didn't match the contents of the directories. The expectation was that the data in the 2014/12/29 directory path contained all the data for December 29, 2014. However, the reality was that the date was being pulled from the transport. It turns out that the version of syslog we were using was rewriting the header, including the date portion, causing the data to take on the transport time and not reflect the original time of the record. Usually, the offsets were tiny, just a second or two, so nobody really took notice. However, one day, one of the relay servers died and when the data that had got stuck on upstream servers was finally sent, it had the current time. In this case, it was shifted by a couple of days, causing a significant data cleanup effort.

Be sure this isn't happening to you if you are placing data by date. Check the date edge cases to see that they are what you expect, and make sure you test your outage scenarios before they happen for real in production.

As I mentioned previously, these retransmits due to planned or unplanned maintenance (or even a tiny network hiccup) will most likely cause duplicate and out-of-order events to arrive, so be sure to account for this when processing raw data. There are no single-delivery or ordering guarantees in Flume. If you need that, use a transactional database or a distributed transaction log such as Apache Kafka (http://kafka.apache.org/) instead. Of course, if you are going to use Kafka, you would probably only use Flume for the final leg of your data path, with your source consuming events from Kafka (https://github.com/baniuyao/flume-ng-kafka-source).

Note

Remember that you can always work around duplicates in your data at query time as long as you can uniquely identify your events from one another. If you cannot distinguish events easily, you can add a Universally Unique Identifier (UUID) (http://en.wikipedia.org/wiki/Universally_unique_identifier) header using the bundled interceptor, UUIDInterceptor (configuration details are in the Flume User Guide).
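As a sketch only, the configuration looks something like this on the source where you want the header added. Verify the interceptor's class name and properties against the Flume User Guide for your version; the headerName value here is just an example:

agent.sources.s1.interceptors = uuid
agent.sources.s1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.s1.interceptors.uuid.headerName = eventId
agent.sources.s1.interceptors.uuid.preserveExisting = true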

Time zones are evil

In case you missed my bias against using local time in Chapter 4, Sinks and Sink Processors, I'll repeat it here a little stronger: time zones are evil, evil like Dr. Evil (http://en.wikipedia.org/wiki/Dr._Evil), and let's not forget about his Mini-Me counterpart (http://en.wikipedia.org/wiki/Mini-Me), Daylight Savings Time.

We live in a global world now. You are pulling data from all over the place into your Hadoop cluster. You may even have multiple data centers in different parts of the country (or the world). The last thing you want to be doing while trying to analyze your data is to deal with skewed data. Daylight Savings Time changes at least somewhere on Earth a dozen times in a year. Just look at the history: ftp://ftp.iana.org/tz/releases/. Save yourself the headache and just normalize the data to UTC. If you want to convert it to "local time" on its way to human eyeballs, feel free. However, while it lives in your cluster, keep it normalized to UTC.

Note

Consider adopting UTC everywhere via this Java startup parameter (if you can't set it system-wide): -Duser.timezone=UTC
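For a Flume agent specifically, one convenient place to set this is the agent's environment script, assuming you use the standard conf/flume-env.sh:

# conf/flume-env.sh
export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"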

Also, use the ISO 8601 (http://en.wikipedia.org/wiki/ISO_8601) time standard where possible, and be sure to include time zone information (even if it is UTC), for example, 2014-12-29T18:10:00Z. Every modern tool on the planet supports this format and it will save you pain down the road.

I live in Chicago, and our computers at work use Central Time, which adjusts for daylight savings. In our Hadoop cluster, we like to keep data in a YYYY/MM/DD/HH directory layout. Twice a year, some things break slightly. In the fall, we have twice as much data in our 2 a.m. directory. In the spring, there is no 2 a.m. directory. Madness!

Capacity planning

Regardless of how much data you think you have, things will change over time. New projects will pop up and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change over time.

There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster (we use the totally unscientific value of 20 percent, which means that we usually plan for 80 percent full when ordering additional hardware, but don't start to panic until we hit the 85-90 percent utilization number). Generally, you want to keep enough extra space so that the failure and/or maintenance of a server or two won't cause the HDFS block replication to consume all the remaining space.

You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is some limit to what tuning batch sizes can accomplish when under heavy data volumes. Be very careful in situations where you split your data flow at the source, using a replicating channel selector, to multiple channels/sinks. If one of the path's channels fills up, an exception is thrown back to the source. If that full channel is not marked as optional (in which case the data to it would simply be dropped), the source will stop consuming new data. This effectively jams the agent for all other channels attached to that source. You may not want to drop the data (by marking the channel as optional) because the data is important. Unfortunately, this is the only fan-out mechanism provided in Flume to send to multiple destinations, so make sure you catch issues quickly so that all your data flows are not impaired due to a cascading backup of events.
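For reference, marking a channel as optional on a replicating selector is a one-line addition. In this sketch, c1 might feed your critical HDFS path and c2 a best-effort copy that is allowed to drop when full:

agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = replicating
agent.sources.s1.selector.optional = c2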

As for the number of Flume agents feeding Hadoop, this too should be adjusted based on real numbers. Watch the channel size to see how well the writes are keeping up under normal loads. Adjust the maximum channel capacity to handle whatever amount of overhead makes you feel good. You can always purchase way more hardware than you need, but even a prolonged outage may overflow even the most conservative estimates. This is when you have to pick and choose which data is more important to you and adjust your channel capacities to reflect that. This way, if you exceed your limits, the least important data will be the first to be dropped.
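Capacity is just another channel property; for example, a file channel sized to absorb a chunk of your peak input might look like this sketch (the numbers are purely illustrative):

agent.channels.c1.type = file
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000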

Chances are your company doesn't have an infinite amount of money and at some point, the value of the data versus the cost of continuing to expand your cluster will start to be questioned. This is why setting limits on the volume of data collected is very important. This is just one aspect of your data retention policy, where cost is the driving factor. In a moment, we'll discuss some of the compliance aspects of this policy. Suffice to say, any project sending data into Hadoop should be able to say what the value of that data is and what the loss is if we delete the older stuff. This is the only way the people writing the checks can make an informed decision.

Considerations for multiple data centers

If you run your business out of multiple data centers and have a large volume of data collected, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. There may be regulatory implications regarding data crossing certain geographic boundaries. Chances are there is somebody in your company who knows much more about compliance than you or I, so seek them out before you start copying data across borders. Of course, not collating your data will make it more difficult to analyze it, as you can't just run one MapReduce job against all the data. Instead, you would have to run parallel jobs and then combine the results in a second pass. Adjusting your data processing procedures is better than potentially breaking the law. Be sure to do your homework.

Pulling all your data into a single cluster may also be more than your networking can handle. Depending on how your data centers are connected to each other, you simply may not be able to transmit the desired volume of data. If you use public cloud services, there are surely data transfer costs between data centers. Finally, consider that a complete cluster failure or corruption may wipe out everything, as most clusters are usually too big to back up everything except high value data. Having some of the old data in this case is sometimes better than having nothing. With multiple Hadoop clusters, you have the ability to use a Failover Sink Processor to forward data to a different cluster if you don't want to wait to send it to the local one.
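A sink group with the failover processor, as covered in Chapter 4, might look like this sketch, where the sink names are placeholders for Avro sinks pointing at each destination:

agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = localDC remoteDC
agent.sinkgroups.sg1.processor.type = failover
agent.sinkgroups.sg1.processor.priority.localDC = 10
agent.sinkgroups.sg1.processor.priority.remoteDC = 5
agent.sinkgroups.sg1.processor.maxpenalty = 10000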

If you do choose to send all your data to a single destination, consider adding a large disk capacity machine as a relay server for the data center. This way, if there is a communication issue or extended cluster maintenance, you can let data pile up on a machine that's different from the ones trying to service your customers. This is sound advice even in a single data center situation.

Compliance and data expiry

Remember that the data your company is collecting from your customers should be considered sensitive information. You may be bound by additional regulatory limitations on accessing data such as:

Personally identifiable information (PII): How you handle and safeguard customers' identities. http://en.wikipedia.org/wiki/Personally_identifiable_information
Payment Card Industry Data Security Standard (PCI DSS): How you safeguard credit card information. http://en.wikipedia.org/wiki/PCI_DSS
Service Organization Control (SOC-2): How you control access to information/systems. http://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/Pages/AICPASOC2Report.aspx
Statements on Standards for Attestation Engagements (SSAE-16): How you manage changes. http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AT-00801.pdf
Sarbanes-Oxley (SOX): http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act

This is by no means a definitive list, so be sure to seek out your company's compliance experts for what does and doesn't apply to your situation. If you aren't properly handling access to this data in your cluster, the government will lean on you, or worse, you won't have customers anymore if they feel you aren't protecting their personal information. Consider scrambling, trimming, or obfuscating your data of personal information. Chances are the business insight you are looking for falls more into the category of "how many people who search for 'hammer' actually buy one?" rather than "how many customers are named Bob?" As you saw in Chapter 6, Interceptors, ETL, and Routing, it would be very easy to write an interceptor to obfuscate PII as you move it around.

Your company probably has a document retention policy that includes the data you are putting into Hadoop. Make sure you remove data that your policy says you aren't supposed to be keeping around anymore. The last thing you want is a visit from the lawyers.

Summary

In this chapter, we covered several real-world considerations you need to think about when planning your Flume implementation, including:

Transport time does not always match event time
The mayhem introduced with Daylight Savings Time to certain time-based logic
Capacity planning considerations
Items to consider when you have more than one data center
Data compliance
Data retention and expiration

I hope you enjoyed this book. Hopefully, you will be able to apply much of this information directly in your application/Hadoop integration efforts.

Thanks, this was fun!
