Apache Flume: Distributed Log Collection for Hadoop Second Edition

Table of Contents

Apache Flume: Distributed Log Collection for Hadoop Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview and Architecture
Flume 0.9
Flume 1.X (Flume-NG)
The problem with HDFS and streaming data/logs
Sources, channels, and sinks
Flume events
Interceptors, channel selectors, and sink processors
Tiered data collection (multiple flows and/or agents)
The Kite SDK
Summary
2. A Quick Start Guide to Flume
Downloading Flume
Flume in Hadoop distributions
An overview of the Flume configuration file
Starting up with “Hello, World!”
Summary
3. Channels
The memory channel
The file channel
Spillable Memory Channel
Summary
4. Sinks and Sink Processors
HDFS sink
Path and filename
File rotation
Compression codecs
Event Serializers
Text output
Text with headers
Apache Avro
User-provided Avro schema
File type
SequenceFile
DataStream
CompressedStream
Timeouts and workers
Sink groups
Load balancing
Failover
MorphlineSolrSink
Morphline configuration files
Typical SolrSink configuration
Sink configuration
ElasticSearchSink
LogStash Serializer
Dynamic Serializer
Summary
5. Sources and Channel Selectors
The problem with using tail
The Exec source
Spooling Directory Source
Syslog sources
The syslog UDP source
The syslog TCP source
The multiport syslog TCP source
JMS source
Channel selectors
Replicating
Multiplexing
Summary
6. Interceptors, ETL, and Routing
Interceptors
Timestamp
Host
Static
Regular expression filtering
Regular expression extractor
Morphline interceptor
Custom interceptors
The plugins directory
Tiering flows
The Avro source/sink
Compressing Avro
SSL Avro flows
The Thrift source/sink
Using command-line Avro
The Log4J appender
The Log4J load-balancing appender
The embedded agent
Configuration and startup
Sending data
Shutdown
Routing
Summary
7. Putting It All Together
Web logs to searchable UI
Setting up the web server
Configuring log rotation to the spool directory
Setting up the target – Elasticsearch
Setting up Flume on collector/relay
Setting up Flume on the client
Creating more search fields with an interceptor
Setting up a better user interface – Kibana
Archiving to HDFS
Summary
8. Monitoring Flume
Monitoring the agent process
Monit
Nagios
Monitoring performance metrics
Ganglia
Internal HTTP server
Custom monitoring hooks
Summary
9. There Is No Spoon – the Realities of Real-time Distributed Data Collection
Transport time versus log time
Time zones are evil
Capacity planning
Considerations for multiple data centers
Compliance and data expiry
Summary
Index
Apache Flume: Distributed Log Collection for Hadoop Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-217-8

www.packtpub.com
Credits

Author
Steve Hoffman

Reviewers
Sachin Handiekar
Michael Keane
Stefan Will

Commissioning Editor
Dipika Gaonkar

Acquisition Editor
Reshma Raman

Content Development Editor
Neetu Ann Mathew

Technical Editor
Menza Mathew

Copy Editors
Vikrant Phadke
Stuti Srivastava

Project Coordinator
Mary Alex

Proofreader
Simran Bhogal
Safis Editing

Indexer
Rekha Nair

Graphics
Sheetal Aute
Abhinash Sahu

Production Coordinator
Komal Ramchandani

Cover Work
Komal Ramchandani
About the Author

Steve Hoffman has 32 years of experience in software development, ranging from embedded software development to the design and implementation of large-scale, service-oriented, object-oriented systems. For the last 5 years, he has focused on infrastructure as code, including automated Hadoop and HBase implementations and data ingestion using Apache Flume. Steve holds a BS in computer engineering from the University of Illinois at Urbana-Champaign and an MS in computer science from DePaul University. He is currently a senior principal engineer at Orbitz Worldwide (http://orbitz.com/).

More information on Steve can be found at http://bit.ly/bacoboy and on Twitter at @bacoboy.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing.

I'd again like to dedicate this updated book to my loving and supportive wife, Tracy. She puts up with a lot, and that is very much appreciated. I couldn't ask for a better friend daily by my side.

My terrific children, Rachel and Noah, are a constant reminder that hard work does pay off and that great things can come from chaos.

I also want to give a big thanks to my parents, Alan and Karen, for molding me into the somewhat satisfactory human I've become. Their dedication to family and education above all else guides me daily as I attempt to help my own children find their happiness in the world.
About the Reviewers

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.

Sachin has a lot of interest in open source projects. He has contributed code to Apache Camel and developed plugins for Spring Social, which can be found at GitHub (https://github.com/sachin-handiekar).

He also actively writes about enterprise application development on his blog (http://sachinhandiekar.com).

Michael Keane has a BS in computer science from the University of Illinois at Urbana-Champaign. He has worked as a software engineer, coding almost exclusively in Java since JDK 1.1. He has also worked in the mission-critical medical device software, e-commerce, transportation, navigation, and advertising domains. He is currently a development leader for Conversant, where he maintains Flume flows of nearly 100 billion log lines per day.

Michael is a father of three, and besides work, he spends most of his time with his family and coaching youth softball.

Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn, Germany. For over a decade, he has worked for several start-ups in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and real-time analytics platform at Zendesk, a provider of customer service software.
www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.
What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.
What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.
Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."

A block of code is set as follows:

agent.sinks.k1.hdfs.path = /logs/apache/access
agent.sinks.k1.hdfs.filePrefix = access
agent.sinks.k1.hdfs.fileSuffix = .log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command = uptime
agent.sources.s1.restart = true
agent.sources.s1.restartThrottle = 60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Overview and Architecture

If you are reading this book, chances are you are swimming in oceans of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of being connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, this same data can provide a valuable insight into your business and customers if you know how to pan gold out of your river of data.

Furthermore, as you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.

Getting data into and out of Hadoop (in this case, the Hadoop File System, or HDFS) isn't hard; it is just a simple command, such as:

% hadoop fs -put data.csv .

This works great when you have all your data neatly packaged and ready to upload.

However, your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks, "Can you get me the data sooner?" What you really need is a solution that can deal with streaming logs/data.

Turns out you aren't alone in this need. Cloudera, a provider of professional services for Hadoop as well as their own distribution of Hadoop, saw this need over and over when working with their customers. Flume was created to fill this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.
Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually had only one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stores undeliverable data to the local disk (or sometimes, a local database) and would keep retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June 2011, Cloudera moved control of the Flume project to the Apache Foundation. It came out of incubator status a year later in 2012. During the incubation year, work had already begun to refactor Flume under the Star Trek-themed tag, Flume-NG (Flume the Next Generation).
Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.

The first edition of this book covered all the versions of Flume up to Version 1.3.1. This second edition covers up to Version 1.5.2 (the current version at the time of writing).
The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job runtimes.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.
Sources, channels, and sinks

The Flume agent's architecture can be viewed in this simple diagram. Inputs are called sources and outputs are called sinks. Channels provide the glue between sources and sinks. All of these run inside a daemon called an agent.

[Diagram: sources feeding channels, and channels feeding sinks, all running inside a single agent daemon]

Note
Keep in mind:
A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
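To make these rules concrete, here is a minimal sketch of that wiring in the properties-file configuration format used throughout this book (the component names s1, c1, c2, k1, and k2 are placeholders):

agent.sources = s1
agent.channels = c1 c2
agent.sinks = k1 k2

# A source may write to more than one channel (note the plural property name)
agent.sources.s1.channels = c1 c2

# Each sink reads from exactly one channel (singular property name)
agent.sinks.k1.channel = c1
agent.sinks.k2.channel = c2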
Flume events

The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or the hostname of the server from which the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input is composed of tailed log files, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (for example, when a source adds the hostname where the data is sourced or creates an event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.
Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event. If you are familiar with the AOP Spring Framework, think MethodInterceptor. In Java Servlets, it's similar to ServletFilter. Here's an example of what using four chained interceptors on a source might look like:

[Diagram: a source followed by four chained interceptors feeding a channel]

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases you might have, although you can write your own if need be. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on some header information. Combined with some interceptor logic, this duo forms the foundation for routing input to different channels.
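As an illustrative sketch (the shape header and the channel names are invented for this example), a multiplexing channel selector might be configured like this:

agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = multiplexing
# Route based on the value of the "shape" header set by an interceptor
agent.sources.s1.selector.header = shape
agent.sources.s1.selector.mapping.square = c1
agent.sources.s1.selector.mapping.triangle = c2
# Events with no matching header value fall through to c1
agent.sources.s1.selector.default = c1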
Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
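For example, a failover path is expressed through a sink group; the following sketch is illustrative only (the group name, sink names, and priority values are made up), with higher-priority sinks tried first:

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = k1 k2
agent.sinkgroups.g1.processor.type = failover
# k1 is preferred; k2 takes over if k1 fails
agent.sinkgroups.g1.processor.priority.k1 = 10
agent.sinkgroups.g1.processor.priority.k2 = 5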
Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

[Diagram: tiered data collection, with two source agents on the left, HDFS in Datacenter 1 and ElasticSearch in Datacenter 2 on the right]

In the preceding diagram, you can see that there are two places where data is created (on the left-hand side) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right-hand side). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent, we use a multiplexing channel selector to split the two kinds of data into different channels. The square channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data transformations can occur after any source. How all of these components can be used to build complicated data workflows will become clear as we proceed.
The Kite SDK

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note
To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns.

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, in the MorphlineSolrSink section.

Morphlines are just one component of the Kite SDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note
Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the Kite SDK at http://kitesdk.org.
Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for data processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load) library, starting with Version 1.4 of Flume.

The next chapter will cover these in more detail, specifically, the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.
Chapter 2. A Quick Start Guide to Flume

As we covered some of the basics in the previous chapter, this chapter will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume.
Downloading Flume

Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell.

The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution.

Go ahead, download and verify the binary distribution to save us some time in getting started.
Flume in Hadoop distributions

Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache Bigtop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation as well as provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition.

If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework.

Note
Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links:
Cloudera: http://cloudera.com/
Hortonworks: http://hortonworks.com/
MapR: http://mapr.com/
An overview of the Flume configuration file

Now that we've downloaded Flume, let's spend some time going over how to configure an agent.

A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent.

Note
By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps).

If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.
Each agent is configured, starting with three parameters:

agent.sources = <list of sources>
agent.channels = <list of channels>
agent.sinks = <list of sinks>

Tip
Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type = memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type = memory
agent.channels.access.capacity = 200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels = access

Let's look at a complete example using the canonical Hello, World! example.
Starting up with “Hello, World!”

No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:

agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind = 0.0.0.0
agent.sources.s1.port = 12345

agent.channels.c1.type = memory

agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1.

The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event.
First, explode the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help
Usage: ./bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --conf-file,-f <file>     specify a config file (required)
  --name,-n <name>          the name of this agent (required)
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>      RPC client properties file with server connection params
  --host,-H <host>          hostname to which events will be sent
  --port,-p <port>          port of the avro source
  --dirname <dir>           directory to stream to avro source
  --filename,-F <file>      text file to stream to avro source (default: std input)
  --headerFile,-R <file>    File containing event headers as key/value pairs on each new line
  --help,-h                 display help text

Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first in the classpath.
As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents).

Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell.

Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender.

If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the classpath.
If you left the -c parameter off the command, you'll see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
But you didn't do that, so you should see these key log lines:

2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]

This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file.

2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file: conf/hw.conf

This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file.

2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration: { sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }

Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution.
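For instance, a custom source is referenced by its fully qualified class name; the class in this sketch is hypothetical:

# com.example.flume.MyCustomSource is a made-up class name for illustration
agent.sources.s1.type = com.example.flume.MyCustomSource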
2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it.
Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World
OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64    Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).
If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20    The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.
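A minimal file_roll sketch, assuming the target directory already exists (the path and roll interval here are illustrative):

agent.sinks.k1.type = file_roll
agent.sinks.k1.channel = c1
# Write events to local files under this directory
agent.sinks.k1.sink.directory = /tmp/flume-debug
# Roll to a new output file every 60 seconds (0 disables time-based rolling)
agent.sinks.k1.sink.rollInterval = 60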
Summary

In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.

We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.

For each type, we'll discuss all the configuration knobs available to you, when and why you might want to deviate from the defaults, and most importantly, why to use one over the other.
Chapter 3. Channels

In Flume, a channel is the construct used between sources and sinks. It provides a buffer for your in-flight events after they are read from sources until they can be written to sinks in your data processing pipelines.

The primary types we'll cover here are a memory-backed/nondurable channel and a local-filesystem-backed/durable channel. Starting with Flume 1.5, an experimental hybrid memory and file channel called the Spillable Memory Channel is introduced. The durable file channel flushes all changes to disk before acknowledging the receipt of the event to the sender. This is considerably slower than using the nondurable memory channel, but it provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but failure results in data loss, and it has much lower storage capacity when compared to the multiterabyte disks backing the file channel. This is why the Spillable Memory Channel was created. In theory, you get the benefits of memory speed until the memory fills up due to flow backpressure. At this point, the disk will be used to store the events—and with that comes much larger capacity. There are trade-offs here as well, as performance is now variable depending on how the entire flow is performing. Ultimately, the channel you choose depends on your specific use cases, failure scenarios, and risk tolerance.

That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel, and a ChannelException will be thrown. What your source does or doesn't do with that ChannelException is source-specific, but in some cases, data loss is possible, so you'll want to avoid filling channels by sizing things properly. In fact, you always want your sink to be able to write faster than your source input. Otherwise, you might get into a situation where once your sink falls behind, you can never catch up. If your data volume tracks with the site usage, you can have higher volumes during the day and lower volumes at night, giving your channels time to drain. In practice, you'll want to try and keep the channel depth (the number of events currently in the channel) as low as possible, because time spent in the channel translates to a time delay before reaching the final destination.
The memory channel

A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than the disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points isn't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory:

agent.channels.c1.type = memory

This defines a memory channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                          | Required | Type          | Default
type                         | Yes      | String        | memory
capacity                     | No       | int           | 100
transactionCapacity          | No       | int           | 100
byteCapacityBufferPercentage | No       | int (percent) | 20%
byteCapacity                 | No       | long (bytes)  | 80% of JVM heap
keep-alive                   | No       | int           | 3 (seconds)
The default capacity of this channel is 100 events. This can be adjusted by setting the capacity property as follows:

agent.channels.c1.capacity = 200

Remember that if you increase this value, you will most likely have to increase your Java heap space using the -Xmx, and optionally -Xms, parameters.

Another capacity-related setting you can set is transactionCapacity. This is the maximum number of events that can be written, also called a put, by a source's ChannelProcessor, the component responsible for moving data from the source to the channel, in a single transaction. This is also the number of events that can be read, also called a take, in a single transaction by the SinkProcessor, which is the component responsible for moving data from the channel to the sink. You might want to set this higher in order to decrease the overhead of the transaction wrapper, which might speed things up. The downside to increasing this is that a source would have to roll back more data in the event of a failure.

Tip
Flume only provides transactional guarantees for each channel in each individual agent. In a multiagent, multichannel configuration, duplicates and out-of-order delivery are likely but should not be considered the norm. If you are getting duplicates in nonfailure conditions, it means that you need to continue tuning your Flume configurations.

If you are using a sink that writes some place that benefits from larger batches of work (such as HDFS), you might want to set this higher. Like many things, the only way to be sure is to run performance tests with different values. This blog post from Flume committer Mike Percy should give you some good starting points: http://bit.ly/flumePerfPt1.

The byteCapacityBufferPercentage and byteCapacity parameters were introduced in https://issues.apache.org/jira/browse/FLUME-1535 as a means to size the memory channel capacity using the number of bytes used rather than the number of events, as well as trying to avoid OutOfMemoryErrors. If your events have a large variance in size, you might be tempted to use these settings to adjust the capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values.

Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full, before giving up. As data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source. You might be tempted to set this value very high, but remember that waiting for a write to a channel will block the data flowing into your source, which might cause data to back up in an upstream agent. Eventually, this might result in events being dropped. You need to size for periodic spikes in traffic as well as temporary planned (and unplanned) maintenance.
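Pulling these knobs together, here is a hedged sketch of a more production-oriented memory channel (the values are illustrative starting points, not recommendations; test against your own workload):

agent.channels.c1.type = memory
# Hold up to 100,000 events in memory (requires a larger -Xmx heap setting)
agent.channels.c1.capacity = 100000
# Allow puts/takes of up to 1,000 events per transaction for batching sinks
agent.channels.c1.transactionCapacity = 1000
# Wait up to 5 seconds for space to open up before throwing back to the source
agent.channels.c1.keep-alive = 5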
The file channel

A file channel is a channel that stores events to the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues and should be used in use cases where a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache.org/FlumeUserGuide.html). Keep in mind that using encryption will reduce the throughput of your file channel.

To use the file channel, set the type parameter on your named channel to file:

agent.channels.c1.type = file

This defines a file channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                  | Required | Type                          | Default
type                 | Yes      | String                        | file
checkpointDir        | No       | String                        | ~/.flume/file-channel/checkpoint
useDualCheckpoints   | No       | boolean                       | false
backupCheckpointDir  | No       | String                        | No default, but must be different from checkpointDir
dataDirs             | No       | String (comma-separated list) | ~/.flume/file-channel/data
capacity             | No       | int                           | 1000000
keep-alive           | No       | int                           | 3 (seconds)
transactionCapacity  | No       | int                           | 10000
checkpointInterval   | No       | long                          | 30000 (milliseconds)
maxFileSize          | No       | long                          | 2146435071 (bytes)
minimumRequiredSpace | No       | long                          | 524288000 (bytes)
To specify the location where the Flume agent should hold data, set the checkpointDir and dataDirs properties:

agent.channels.c1.checkpointDir = /flume/c1/checkpoint
agent.channels.c1.dataDirs = /flume/c1/data

Technically, these properties are not required and have sensible default values for development. However, if you have more than one file channel configured in your agent, only the first channel will start. For production deployments and development work with multiple file channels, you should use distinct directory paths for each file channel storage area and consider placing different channels on different disks to avoid IO contention. Additionally, if you are sizing a large machine, consider using some form of RAID that contains striping (RAID 10, 50, or 60) to achieve higher disk performance rather than buying more expensive 10K or 15K drives or SSDs. If you don't have RAID striping but have multiple disks, set dataDirs to a comma-separated list of each storage location, as shown in the sketch after this paragraph. Using multiple disks will spread the disk traffic almost as well as striped RAID, but without the computational overhead associated with RAID 50/60 as well as the 50 percent space waste associated with RAID 10. You'll want to test your system to see whether the RAID overhead is worth the speed difference. As hard drive failures are a reality, you might prefer certain RAID configurations to single disks in order to protect yourself from the data loss associated with single drive failures. RAID 6 across the maximum number of disks can provide the highest performance for the minimal amount of data protection when combined with redundant and reliable power sources (such as an uninterruptable power supply, or UPS).
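For instance, spreading one file channel across three data disks (the /disk1 through /disk3 mount points are illustrative) might look like this:

agent.channels.c1.dataDirs = /disk1/flume/c1/data,/disk2/flume/c1/data,/disk3/flume/c1/data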
Using the JDBC channel is a bad idea, as it would introduce a bottleneck and single point of failure in what should be designed as a highly distributed system. NFS storage should be avoided for the same reason.

Tip
Be sure to set the HADOOP_PREFIX and JAVA_HOME environment variables when using the file channel. While we seemingly haven't used anything Hadoop-specific (such as writing to HDFS), the file channel uses Hadoop Writables as an on-disk serialization format. If Flume can't find the Hadoop libraries, you might see this in your startup, so check your environment variables:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

Starting with Flume 1.4, the file channel supports a secondary checkpoint directory. In situations where a failure occurs while writing the checkpoint data, that information could become unusable, and a full replay of the logs would be necessary in order to recover the state of the channel. As only one checkpoint is updated at a time before flipping to the other, one should always be in a consistent state, thus shortening restart times. Without valid checkpoint information, the Flume agent can't know what has been sent and what has not been sent in dataDirs. As files in the data directories might contain large amounts of data already sent but not yet deleted, a lack of checkpoint information would result in a large number of records being resent as duplicates.

Incoming data from sources is written and acknowledged at the end of the most current file in the data directory. The files in the checkpoint directory keep track of this data when it's taken by a sink.
If you want to use this feature, set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property. For performance reasons, it is always preferred that this be on a different disk from the other directories used by the file channel:

agent.channels.c1.useDualCheckpoints = true
agent.channels.c1.backupCheckpointDir = /flume/c1/checkpoint2
The default file channel capacity is one million events, regardless of the size of the event contents. If the channel capacity is reached, a source will no longer be able to ingest the data. This default should be fine for low volume cases. You'll want to size this higher if your ingestion is so heavy that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher still to handle even larger amounts of downtime. At some point, you'll run into the fact that your disk space is a finite resource, so you will have to pick some upper limit (or buy bigger disks).

The keep-alive parameter is similar to memory channels. It is the maximum time the source will wait when trying to write into a full channel before giving up. If space becomes available before the timeout, the write is successful; otherwise, ChannelException is thrown back to the source.

The transactionCapacity property is the maximum number of events allowed in a single transaction. This might become important for certain sources that batch together events and pass them to the channel in a single call. Most likely, you won't need to change this from the default. Setting this higher allocates additional resources internally, so you shouldn't increase it unless you run into performance issues.

The checkpointInterval property is the number of milliseconds between performing a checkpoint (which also rolls the log files written to dataDirs). If you do not set this, 30 seconds will be used.

Log files also roll based on the volume of data written to them using the maxFileSize property. You can lower this value for low traffic channels if you want to try and save some disk space. Let's say your maximum file size is 50,000 bytes but your channel only writes 500 bytes a day; it would take 100 days to fill a single log. Let's say that you were on day 100 and 2,000 bytes came in all at once. Some data would be written to the old file and a new file would be started with the overflow. After the roll, Flume tries to remove any log files that aren't needed anymore. As the full log has unprocessed records, it cannot be removed yet. The next chance to clean up that old log file might not come for another 100 days. It probably doesn't matter if that old 50,000 byte file sticks around longer, but as the default is around 2 GB, you could have twice that (4 GB) of disk space used per channel. Depending on how much disk you have available and the number of channels configured in your agent, this might or might not be a problem. If your machines have plenty of storage space, the default should be fine.

Finally, the minimumRequiredSpace property is the amount of space you do not want to use for writing logs. The default configuration will throw an exception if you attempt to use the last 500 MB of the disk associated with the dataDir path. This limit applies across all channels, so if you have three file channels configured, the upper limit is still 500 MB and not 1.5 GB. You can set this value as low as 1 MB, but generally speaking, bad things tend to happen when you push disk utilization towards 100 percent.
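Putting these pieces together, here is a hedged file channel sketch sized to ride out a cluster restart (the paths and values are illustrative only):

agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /flume/c1/checkpoint
agent.channels.c1.dataDirs = /flume/c1/data
# Buffer up to two million events to tolerate a longer downstream outage
agent.channels.c1.capacity = 2000000
# Checkpoint every 60 seconds instead of the default 30
agent.channels.c1.checkpointInterval = 60000
# Refuse writes when less than 1 GB of disk remains
agent.channels.c1.minimumRequiredSpace = 1073741824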
Spillable Memory Channel

Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart, but runs at the speed of your disks (which means orders of magnitude slower).

Note
The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a variable channel speed that changes depending on how downstream entities in your data pipe behave makes for difficult capacity planning. As a memory channel is used under good conditions, this implies that the data contained in it can be lost. So why would I go through extra trouble to save some of it to the disk? The data is either very important for me to spool it to disk with a file-backed channel, or it's less important and can be lost, so I can get away with less hardware and use a faster memory-backed channel. If I really need memory speed but with the capacity of a hard drive, Solid State Drive (SSD) prices have come down enough in recent years for a file channel on SSD to now be a viable option for you rather than using this hybrid channel type. I do not use this channel myself for these reasons.
To use this channel configuration, set the type parameter on your named channel to spillablememory:

agent.channels.c1.type = spillablememory

This defines a Spillable Memory Channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                           | Required | Type                          | Default
type                          | Yes      | String                        | spillablememory
memoryCapacity                | No       | int                           | 10000
overflowCapacity              | No       | int                           | 100000000
overflowTimeout               | No       | int                           | 3 (seconds)
overflowDeactivationThreshold | No       | int                           | 5 (percent)
byteCapacityBufferPercentage  | No       | int                           | 20 (percent)
byteCapacity                  | No       | long (bytes)                  | 80% of JVM heap
checkpointDir                 | No       | String                        | ~/.flume/file-channel/checkpoint
dataDirs                      | No       | String (comma-separated list) | ~/.flume/file-channel/data
useDualCheckpoints            | No       | boolean                       | false
backupCheckpointDir           | No       | String                        | No default, but must be different from checkpointDir
transactionCapacity           | No       | int                           | 10000
checkpointInterval            | No       | long                          | 30000 (milliseconds)
maxFileSize                   | No       | long                          | 2146435071 (bytes)
minimumRequiredSpace          | No       | long                          | 524288000 (bytes)
As you can see, many of the fields match the memory channel's and file channel's properties, so there should be no surprises here. Let's start with the memory side.
The memoryCapacity property determines the maximum number of events held in memory (this was just called capacity for the memory channel, but was renamed here to avoid ambiguity). Also, the default value, if unspecified, is 10,000 records instead of 100. If you wanted to double the default capacity, the configuration might look something like this:
agent.channels.c1.memoryCapacity=20000
As mentioned previously, you will most likely need to increase the Java heap space allocated using the -Xmx and -Xms parameters. Like most Java programs, more memory usually helps the garbage collector run more efficiently, especially as an application such as Flume generates a lot of short-lived objects. Be sure to do your research and pick an appropriate JVM garbage collector based on your available hardware.
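For instance, heap settings can be passed to the agent's JVM through the JAVA_OPTS variable in conf/flume-env.sh; the sizes here are only a sketch and should be tuned against your actual channel capacities:
export JAVA_OPTS="-Xms512m -Xmx1g"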
Tip
You can set memoryCapacity to zero, which effectively turns it into a file channel. Don't do this. Just use the file channel and be done with it.
The overflowCapacity property determines the maximum number of events that can be written to the disk before an error is thrown back to the source feeding it. The default value is a generous 100,000,000 events (far larger than the file channel's default of 100,000). Most servers nowadays have multiterabyte disks, so space should not be a problem, but do the math against your average event size to be sure you don't fill your disks by accident. If your channels are filling this much, you are probably dealing with another issue downstream, and the last thing you need is your data buffer layer filling up completely. For example, if you had a large 1 megabyte event payload, 100 million of these add up to 100 terabytes, which is probably bigger than the disk space of an average server. A 1 kilobyte payload would only take 100 gigabytes, which is probably fine. Just do the math ahead of time so you are not surprised.
Tip
You can set overflowCapacity to zero, which effectively turns it into a memory channel. Don't do this. Just use the memory channel and be done with it.
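If you do use this channel, a configuration holding roughly 20,000 events in memory with a one-million-event disk overflow (arbitrary numbers, purely for illustration) might look like this:
agent.channels.c1.type=spillablememory
agent.channels.c1.memoryCapacity=20000
agent.channels.c1.overflowCapacity=1000000
agent.channels.c1.checkpointDir=/flume/spill/checkpoint
agent.channels.c1.dataDirs=/flume/spill/data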
The transactionCapacity property adjusts the batch size of events written to a channel in a single transaction. If this is set too low, it will lower the throughput of the agent on high-volume flows because of the transaction overhead. For a high-volume channel, you will probably need to set this higher, but the only way to be sure is to test your particular workflow. See Chapter 8, Monitoring Flume, to learn how to accomplish this.
Finally, the byteCapacityBufferPercentage and byteCapacity parameters are identical in functionality and defaults to those of the memory channel, so I won't waste your time repeating them here.
What is important is the overflowTimeout property. This is the number of seconds to wait, once the memory part of the channel fills, before data starts getting written to the disk-backed portion of the channel. If you want writes to start occurring immediately, you can set this to zero. You might wonder why you need to wait before starting to write to the disk portion of the channel. This is where the undocumented overflowDeactivationThreshold property comes into play. This is the amount of time that space has to be available in the memory path before the channel can switch back from disk writing. I believe this is an attempt to prevent flapping back and forth between the two. Of course, there really are no ordering guarantees in Flume, so I don't know why you would choose to append to the disk buffer if a spot is available in faster memory. Perhaps they are trying to avoid some kind of starvation condition, although the code appears to attempt to remove events in order of arrival, even when using both the memory and disk queues. Perhaps it will be explained to us should it ever come out of experimental status.
Note
The overflowDeactivationThreshold property is stated to be for internal use only, so adjust it at your own peril. If you are considering it, be sure to get familiar with the source code so that you understand the implications of altering the default.
The rest of the properties on this channel are identical in name and functionality to their file channel counterparts, so please refer to the previous section.
Summary
In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.
The memory channel offers speed at the cost of data loss in the event of failure. Alternatively, the file channel provides a more reliable transport, in that it can tolerate agent failures and restarts, at a performance cost.
You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel, because if you encounter a problem, you can always rerun the import.
Finally, we covered the experimental Spillable Memory Channel. Personally, I think its creation is a bad idea, but like most things in computer science, everybody has an opinion on what is good or bad. I feel that the added complexity and nondeterministic performance make for difficult capacity planning, as you should always size things for the worst-case scenario. If your data is critical enough for you to overflow to the disk rather than discard the events, then you aren't going to be okay with losing even the small amount held in memory.
In the next chapter, we'll look at sinks: specifically, the HDFS sink to write events to HDFS, the Elasticsearch sink to write events to Elasticsearch, and the MorphlineSolrSink to write events to Solr. We will also cover Event Serializers, which specify how Flume events are translated into output that's more suitable for the sink. Finally, we will cover sink processors and how to set up load balancing and failure paths in a tiered configuration for more robust data transport.
Chapter 4. Sinks and Sink Processors
By now, you should have a pretty good idea where the sink fits into the Flume architecture. In this chapter, we will first learn about the most-used sink with Hadoop, the HDFS sink. We will then cover two of the newer sinks that support common Near Real Time (NRT) log processing: the ElasticSearchSink and the MorphlineSolrSink. As you'd expect, the first writes data into Elasticsearch and the latter into Solr. The general architecture of Flume supports many other sinks we won't have space to cover in this book. Some come bundled with Flume and can write to HBase, IRC, and, as we saw in Chapter 2, A Quick Start Guide to Flume, log4j and file sinks. Other sinks are available on the Internet and can be used to write data to MongoDB, Cassandra, RabbitMQ, Redis, and just about any other data store you can think of. If you can't find a sink that suits your needs, you can write one easily by extending the org.apache.flume.sink.AbstractSink class.
HDFS sink
The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.
To use the HDFS sink, set the type parameter on your named sink to hdfs:
agent.sinks.k1.type=hdfs
This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:
agent.sinks.k1.hdfs.path=/path/in/hdfs
This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

Absolute              /Users/flume/mydata
Absolute with server  hdfs://namenode/Users/flume/mydata
Relative              mydata
I prefer to configure any server I'm installing Flume on with a working hadoop command line by setting the fs.default.name property in Hadoop's core-site.xml file. I don't keep persistent data in HDFS user directories, but prefer to use absolute paths with some meaningful path name (for example, /logs/apache/access). The only time I would specify a NameNode explicitly is if the target was a different Hadoop cluster entirely. This allows you to move configurations you've already tested in one environment into another without unintended consequences, such as your production server writing data to your staging Hadoop cluster because somebody forgot to edit the target in the configuration. I consider externalizing environment specifics a good best practice to avoid situations such as these.
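For reference, pointing the hadoop command line at your cluster by way of core-site.xml looks something like the following; the NameNode hostname and port here are placeholders for your own environment:
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>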
One final required parameter for the HDFS sink, actually for any sink, is the channel that it will be doing take operations from. For this, set the channel parameter with the name of the channel to read from:
agent.sinks.k1.channel=c1
This tells the k1 sink to read events from the c1 channel.
Here is a mostly complete table of configuration parameters you can adjust from the default values:
Key                     Required  Type                              Default
type                    Yes       String                            hdfs
channel                 Yes       String
hdfs.path               Yes       String
hdfs.filePrefix         No        String                            FlumeData
hdfs.fileSuffix         No        String
hdfs.minBlockReplicas   No        int                               See the dfs.replication property in your inherited Hadoop configuration; usually 3
hdfs.maxOpenFiles       No        long                              5000
hdfs.closeTries         No        int                               0 (0 = try forever, otherwise a count)
hdfs.retryInterval      No        int                               180 seconds (0 = don't retry)
hdfs.round              No        boolean                           false
hdfs.roundValue         No        int                               1
hdfs.roundUnit          No        String (second, minute, or hour)  second
hdfs.timeZone           No        String                            Local time
hdfs.useLocalTimeStamp  No        boolean                           false
hdfs.inUsePrefix        No        String                            Blank
hdfs.inUseSuffix        No        String                            .tmp
hdfs.rollInterval       No        long (seconds)                    30 seconds (0 = disable)
hdfs.rollSize           No        long (bytes)                      1024 bytes (0 = disable)
hdfs.rollCount          No        long                              10 (0 = disable)
hdfs.batchSize          No        long                              100
hdfs.codeC              No        String
Remember to always check the Flume User Guide for the version you are using at http://flume.apache.org/, as things might change between the release of this book and the version of Flume you are actually using.
Path and filename
Each time Flume starts a new file at hdfs.path in HDFS to write data into, the filename is composed of the hdfs.filePrefix, a period character, the epoch timestamp at which the file was started, and optionally, a file suffix specified by the hdfs.fileSuffix property (if set), for example:
agent.sinks.k1.hdfs.path=/logs/apache/access
The preceding configuration would result in a file such as /logs/apache/access/FlumeData.1362945258.
However, with the following configuration, your filenames would be more like /logs/apache/access/access.1362945258.log:
agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log
Over time, the hdfs.path directory will get very full, so you will want to add some kind of time element to the path to partition the files into subdirectories. Flume supports various time-based escape sequences, such as %Y to specify a four-digit year. I like to use sequences in the year/month/day/hour form (so that they sort from oldest to newest), so I often use this for a path:
agent.sinks.k1.hdfs.path=/logs/apache/access/%Y/%m/%d/%H
This says that I want a path like /logs/apache/access/2013/03/10/18/.
Note
For a complete list of time-based escape sequences, see the Flume User Guide.
Another handy escape sequence mechanism is the ability to use Flume header values in your path. For instance, if there was a header with a key of logType, I could split Apache access and error logs into different directories while using the same channel, by escaping the header's key as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%{logType}/%Y/%m/%d/%H
The preceding line would result in access logs going to /logs/apache/access/2013/03/10/18/ and error logs going to /logs/apache/error/2013/03/10/18/. However, if I preferred both log types in the same directory path, I could have used logType in my hdfs.filePrefix instead, as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
agent.sinks.k1.hdfs.filePrefix=%{logType}
Obviously, it is possible for Flume to write to multiple files at once. The hdfs.maxOpenFiles property sets the upper limit for how many can be open at once, with a default of 5000. If you exceed this limit, the oldest file that's still open is closed. Remember that every open file incurs overhead both at the OS level and in HDFS (NameNode and DataNode connections).
Another set of properties you might find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining those elements in file paths. Let's say you had a path specification as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H%M
However, if you wanted only four subdirectories per day (at 00, 15, 30, and 45 past the hour, each containing 15 minutes of data), you could accomplish this by setting the following:
agent.sinks.k1.hdfs.round=true
agent.sinks.k1.hdfs.roundValue=15
agent.sinks.k1.hdfs.roundUnit=minute
This would result in logs between 01:15:00 and 01:29:59 on March 10, 2013 being written to files contained in /logs/apache/2013/03/10/0115/. Logs from 01:30:00 to 01:44:59 would be written to files contained in /logs/apache/2013/03/10/0130/.
The hdfs.timeZone property is used to specify the time zone in which you want time interpreted for your escape sequences. The default is your computer's local time. If your local time is affected by daylight saving time adjustments, you will have twice as much data when %H == 02 (in the fall) and no data when %H == 02 (in the spring). I think it is a bad idea to introduce time zones into things that are meant for computers to read. I believe time zones are a concern for humans alone, and computers should converse only in universal time. For this reason, I set this property on my Flume agents to make the time zone issue just go away:
-Duser.timezone=UTC
If you don't agree, you are free to use the default (local time), or to set hdfs.timeZone to whatever you like. The value you pass is used in a call to java.util.TimeZone.getTimeZone(...), so check the Javadocs for acceptable values.
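One convenient place to set this system property is the JAVA_OPTS variable in conf/flume-env.sh, so it is applied every time the agent starts:
export JAVA_OPTS="-Duser.timezone=UTC"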
The other time-related property is the hdfs.useLocalTimeStamp boolean property. By default, its value is false, which tells the sink to use the event's timestamp header when calculating date-based escape sequences in file paths, as shown previously. If you set the property to true, the current system time will be used instead, effectively telling Flume to use the transport arrival time rather than the original event time. You would not set this in cases where HDFS was the final target for the streamed events. This way, delayed events will still be placed correctly (where users would normally look for them) regardless of their arrival time. However, there may be a use case where events are temporarily written to Hadoop and processed in batches on some interval (perhaps daily). In this case, the transport time would be preferred, so your postprocessing job doesn't need to scan older folders for delayed data.
Remember that files in HDFS are broken into file blocks that are replicated across the DataNodes. The default number of replicas is usually three (as set in the Hadoop base configuration). You can override this value up or down for this sink with the hdfs.minBlockReplicas property. For example, if I have a data stream that I feel only needs two replicas instead of three, I can override this as follows:
agent.sinks.k1.hdfs.minBlockReplicas=2
Note
Don't set the minimum replica count higher than the number of DataNodes you have; otherwise, you'll create a degraded-state HDFS. You also don't want to set it so high that a single box down for maintenance would trigger this situation. Personally, I've never set this higher than the default of three, but I have set it lower on less important data in order to save space.
Finally, while files are being written to HDFS, a .tmp extension is added. When the file is closed, the extension is removed. You can change the extension used by setting the hdfs.inUseSuffix property, but I've never had a reason to do so:
agent.sinks.k1.hdfs.inUseSuffix=flumeiswriting
This allows you to see which files are being written to simply by looking at a directory listing in HDFS. As you typically specify a directory for input in your MapReduce job (or because you are using Hive), the temporary files will often be picked up as empty or garbled input by mistake. To avoid having your temporary files picked up before being closed, set the prefix to either a dot or an underscore character, as follows:
agent.sinks.k1.hdfs.inUsePrefix=_
That said, there are occasions where files were not closed properly due to some HDFS glitch, so you might see files with the in-use prefix or suffix that haven't been touched in some time. A few new properties were added in Version 1.5 to change the default behavior of closing files. The first is the hdfs.closeTries property. The default of zero actually means "try forever", so it is a little confusing. Setting it to 4 means try four times before giving up. You can adjust the interval between retries by setting the hdfs.retryInterval property. Setting it too low could swamp your NameNode with too many requests, so be careful if you lower this from the default of 3 minutes. Of course, if you are opening files too quickly, you might need to lower it just to keep from going over the hdfs.maxOpenFiles setting, which was covered previously. If you actually didn't want any retries, you can set hdfs.retryInterval to zero seconds (again, not to be confused with closeTries=0, which means try forever). Hopefully, a future version will use the more common convention of a negative number (usually -1) when infinite is desired.
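For example, to give up after four close attempts spaced 5 minutes apart (values chosen purely for illustration), you might set:
agent.sinks.k1.hdfs.closeTries=4
agent.sinks.k1.hdfs.retryInterval=300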
File rotation
By default, Flume will rotate actively written-to files every 30 seconds, 10 events, or 1024 bytes. This is done by setting the hdfs.rollInterval, hdfs.rollCount, and hdfs.rollSize properties, respectively. One or more of these can be set to zero to disable that particular rolling mechanism. For instance, if you only wanted a time-based roll of 1 minute, you would set the following:
agent.sinks.k1.hdfs.rollInterval=60
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.rollSize=0
If your output contains any amount of header information, the size per file in HDFS can be larger than what you expect, because the hdfs.rollSize rotation scheme only counts the event body length. Clearly, you might not want to disable all three mechanisms for rotation at the same time, or you will have one directory in HDFS overflowing with files.
Finally, a related parameter is hdfs.batchSize. This is the number of events that the sink will read per transaction from the channel. If you have a large volume of data in your channel, you might see a performance increase by setting this higher than the default of 100, which decreases the transaction overhead per event.
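For instance, a high-volume flow might raise the batch to 1,000 events per transaction; the number is illustrative, and only testing against your own flow will reveal the right value:
agent.sinks.k1.hdfs.batchSize=1000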
Now that we've discussed the way files are managed and rolled in HDFS, let's look at how the event contents get written.
Compression codecs
Codecs (coder/decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.
If you want compression for your data, set the hdfs.codeC property so that the HDFS sink writes compressed files. The property value is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:
agent.sinks.k1.hdfs.codeC=gzip
The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand, using snappy or lzo results in faster compression performance, but a lower compression ratio. Keep in mind that the splittability of the file, especially if you are using plain text files, will greatly affect the performance of your MapReduce jobs. Go pick up a copy of Hadoop Beginner's Guide, Garry Turkington, Packt Publishing (http://amzn.to/14Dh6TA) or Hadoop: The Definitive Guide, Tom White, O'Reilly (http://amzn.to/16OsfIf) if you aren't sure what I'm talking about.
Event Serializers
An Event Serializer is the mechanism by which a Flume event is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.
Text output
As mentioned previously, the default serializer is the text serializer. This will output only the Flume event body, with the headers discarded. Each event has a newline character appended, unless you override this default behavior by setting the serializer.appendNewLine property to false.
Key                       Required  Type     Default
serializer                No        String   text
serializer.appendNewLine  No        boolean  true
Text with headers
The text_with_headers serializer allows you to save the Flume event headers rather than discard them. The output format consists of the headers, followed by a space, then the body payload, and finally, an (optionally disabled) newline character, for instance:
{key1=value1, key2=value2} body text here

Key                       Required  Type     Default
serializer                No        String   text_with_headers
serializer.appendNewLine  No        boolean  true
Apache Avro
The Apache Avro project (http://avro.apache.org/) provides a serialization format that is similar in functionality to Google Protocol Buffers, but is more Hadoop friendly, as the container is based on Hadoop's SequenceFile and has some MapReduce integration. The format is also self-describing using JSON, making for a good long-term data storage format, as your data format might evolve over time. If your data has a lot of structure and you want to avoid turning it into Strings, only to then parse them in your MapReduce job, you should read more about Avro to see whether you want to use it as a storage format in HDFS.
The avro_event serializer creates Avro data based on the Flume event schema. It has no formatting parameters, as Avro dictates the format of the data, and the structure of the Flume event dictates the schema used:
Key                           Required  Type                                  Default
serializer                    No        String                                avro_event
serializer.compressionCodec   No        String (gzip, bzip2, lzo, or snappy)
serializer.syncIntervalBytes  No        int (bytes)                           2048000 (bytes)
If you want your data compressed before being written to the Avro container, you should set the serializer.compressionCodec property to the file extension of an installed codec. The serializer.syncIntervalBytes property determines the size of the data buffer used before flushing the data to HDFS; therefore, this setting can affect your compression ratio when using a codec. Here is an example using Snappy compression on Avro data with a 4 MB buffer:
agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=snappy
agent.sinks.k1.serializer.syncIntervalBytes=4194304
agent.sinks.k1.hdfs.fileSuffix=.avro
For Avro files to work in an Avro MapReduce job, they must end in .avro or they will be ignored as input. For this reason, you need to explicitly set the hdfs.fileSuffix property. Furthermore, you would not set the hdfs.codeC property on an Avro file.
User-provided Avro schema
If you want to use a schema different from the Flume event schema used by the avro_event type, starting in Version 1.4, the closely named AvroEventSerializer will let you do this. Keep in mind that with this implementation, only the event's body is serialized; headers are not passed on.
Set the serializer type to the fully qualified org.apache.flume.sink.hdfs.AvroEventSerializer class name:
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
Unlike the other serializers that take additional parameters in the Flume configuration file, this one requires that you pass the schema information via a Flume header. This is a byproduct of one of the Avro-aware sources we'll see in Chapter 6, Interceptors, ETL, and Routing, where schema information is sent from the source to the final destination via the event header. If you are using a source that doesn't set these headers, you can fake it by using a static header interceptor. We'll talk more about interceptors in Chapter 6, Interceptors, ETL, and Routing, so flip back to this part later on.
To specify the schema directly in the Flume configuration file, use the flume.avro.schema.literal header, as shown in this example (using a map-of-strings schema):
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=flume.avro.schema.literal
agent.sources.s1.interceptors.i1.value="{\"type\":\"map\",\"values\":\"string\"}"
If you prefer to put the schema file in HDFS, use the flume.avro.schema.url header instead, as shown in this example:
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=flume.avro.schema.url
agent.sources.s1.interceptors.i1.value=hdfs://path/to/schema.avsc
Actually, in this second form, you can pass any URL, including a file:// URL, but this would indicate a file local to where you are running the Flume agent, which might create additional setup work for your administrators. This is also true of configuration served up by an HTTP web server or farm. Rather than creating additional setup dependencies, just use the dependency you cannot remove, which is HDFS, by way of an hdfs:// URL.
Be sure to set either the flume.avro.schema.literal header or the flume.avro.schema.url header, but not both.
File type
By default, the HDFS sink writes data to HDFS as Hadoop SequenceFiles. This is a common Hadoop wrapper that consists of a key and a value field separated by binary field and record delimiters. Usually, text files on a computer make assumptions such as a newline character terminating each record. So, what do you do if your data contains a newline character, such as some XML? Using a sequence file can solve this problem, because it uses nonprintable characters for delimiters. Sequence files are also splittable, which makes for better locality and parallelism when running MapReduce jobs on your data, especially on large files.
SequenceFile
When using the SequenceFile file type, you need to specify how you want the key and value written for each record in the SequenceFile. The key on each record will always be a LongWritable type and will contain the current timestamp or, if the timestamp event header is set, that value will be used instead. By default, the format of the value is an org.apache.hadoop.io.BytesWritable type, which corresponds to the byte[] Flume body:
Key               Required  Type    Default
hdfs.fileType     No        String  SequenceFile
hdfs.writeFormat  No        String  writable
However, if you want the payload interpreted as a String, you can override the hdfs.writeFormat property, so org.apache.hadoop.io.Text will be used as the value field:
Key               Required  Type    Default
hdfs.fileType     No        String  SequenceFile
hdfs.writeFormat  No        String  text
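Expressed as agent configuration, reusing the k1 sink from the earlier examples, that override looks like this:
agent.sinks.k1.hdfs.fileType=SequenceFile
agent.sinks.k1.hdfs.writeFormat=text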
DataStream
If you do not want to output a SequenceFile, because your data doesn't have a natural key, you can use a DataStream to output only the uncompressed value. Simply override the hdfs.fileType property:
agent.sinks.k1.hdfs.fileType=DataStream
This is the file type you would use with Avro serialization, as any compression should have been done in the Event Serializer. To serialize gzip-compressed Avro files, you would set these properties:
agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=gzip
agent.sinks.k1.hdfs.fileType=DataStream
agent.sinks.k1.hdfs.fileSuffix=.avro
CompressedStream
A CompressedStream is similar to a DataStream, except that the data is compressed when it's written. You can think of this as running the gzip utility on an uncompressed file, but all in one step. This differs from a compressed Avro file, whose contents are compressed and then written into an uncompressed Avro wrapper:
agent.sinks.k1.hdfs.fileType=CompressedStream
Remember that only certain compressed formats are splittable in MapReduce, should you decide to use a CompressedStream. The compression algorithm selection doesn't have a Flume configuration; it is instead dictated by the zlib.compress.strategy and zlib.compress.level properties in core Hadoop.
Timeouts and workers
Finally, there are two miscellaneous properties related to timeouts and two related to worker pools that you can change:

Key                     Required  Type                 Default
hdfs.callTimeout        No        long (milliseconds)  10000
hdfs.idleTimeout        No        int (seconds)        0 (0 = disable)
hdfs.threadsPoolSize    No        int                  10
hdfs.rollTimerPoolSize  No        int                  1
The hdfs.callTimeout property is the amount of time the HDFS sink will wait for HDFS operations to return a success (or failure) before giving up. If your Hadoop cluster is particularly slow (for instance, a development or virtual cluster), you might need to set this value higher in order to avoid errors. Keep in mind that your channel will overflow if you cannot sustain a write throughput higher than the input rate of your channel.
The hdfs.idleTimeout property, if set to a nonzero value, is the time Flume will wait before automatically closing an idle file. I have never used this, as hdfs.rollInterval handles the closing of files for each roll period, and if the channel is idle, it will not open a new file. This setting seems to have been created as an alternative roll mechanism to the size, time, and event count mechanisms that have already been discussed. You might want as much data written to a file as possible and only close it when there really is no more data. In this case, you can use hdfs.idleTimeout to accomplish this rotation scheme if you also set hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount to zero.
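A sketch of that idle-only rotation scheme, closing files after 10 idle minutes (an arbitrary window), might look like this:
agent.sinks.k1.hdfs.idleTimeout=600
agent.sinks.k1.hdfs.rollInterval=0
agent.sinks.k1.hdfs.rollSize=0
agent.sinks.k1.hdfs.rollCount=0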
The first property you can set to adjust the number of workers is hdfs.threadsPoolSize, which defaults to 10. This is the maximum number of files that can be written to at the same time. If you are using event headers to determine file paths and names, you might have more than 10 files open at once, but be careful when increasing this value too much, so as not to overwhelm HDFS.
The last property related to worker pools is hdfs.rollTimerPoolSize. This is the number of workers that process the timeouts set by the hdfs.idleTimeout property. The amount of work involved in closing the files is pretty small, so increasing this value from the default of one worker is unlikely to be necessary. If you do not use a rotation based on hdfs.idleTimeout, you can ignore the hdfs.rollTimerPoolSize property, as it is not used.
Sink groups
In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.
There is a default sink processor that contains a single sink, and it is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.
In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. Similar to sources, channels, and sinks, you prefix the property with the agent name:
agent.sinkgroups=sg1
Here, we have defined a sink group called sg1 for the agent named agent.
For each named sink group, you need to specify the sinks it contains using the sinks property, consisting of a space-delimited list of sink names:
agent.sinkgroups.sg1.sinks=k1 k2
This defines that the k1 and k2 sinks are part of the sg1 sink group for the agent named agent.
Often, sink groups are used in conjunction with the tiered movement of data to route around failures. However, they can also be used to write to different Hadoop clusters, as even a well-maintained cluster has periodic maintenance.
Load balancing
Continuing the preceding example, let's say you want to load balance traffic to k1 and k2 evenly. There are some additional properties you need to specify, as listed in this table:

Key                 Type                             Default
processor.type      String                           load_balance
processor.selector  String (round_robin or random)   round_robin
processor.backoff   boolean                          false
When you set processor.type to load_balance, round-robin selection will be used, unless otherwise specified by the processor.selector property, which can be set to either round_robin or random. You can also specify your own load-balancing selector mechanism, which we won't cover here. Consult the Flume documentation if you need this custom control.
The processor.backoff property specifies whether an exponential backoff should be used when retrying a sink that threw an exception. The default is false, which means that after a thrown exception, the sink will be tried again the next time its turn comes up, based on round-robin or random selection. If set to true, the wait time after each failure is doubled, starting at 1 second, up to a limit of around 18 hours (2^16 seconds).
Note
In an earlier version of Flume, the default for processor.backoff in the code was false, but the documentation stated it as true. This error has since been fixed; however, it may save you a headache to specify the property settings you want rather than relying on the defaults.
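Following that advice, here is a sketch of a complete load-balancing group that states the selection and backoff settings explicitly rather than relying on the defaults:
agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=load_balance
agent.sinkgroups.sg1.processor.selector=random
agent.sinkgroups.sg1.processor.backoff=true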
Failover
If you would rather try one sink and, should that sink fail, try another, then you want to set processor.type to failover. Next, you'll need to set additional properties to specify the order, by setting the processor.priority property followed by the sink name:

Key                      Type                Default
processor.type           String              failover
processor.priority.NAME  int
processor.maxpenalty     int (milliseconds)  30000
Let's look at the following example:
agent.sinkgroups.sg1.sinks=k1 k2 k3
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=20
agent.sinkgroups.sg1.processor.priority.k3=20
Lower priority numbers come first, and in the case of a tie, order is arbitrary. You can use any numbering system that makes sense to you (by ones, fives, tens, whatever). In this example, the k1 sink will be tried first, and if an exception is thrown, either k2 or k3 will be tried next. If k3 was selected first for trial and it failed, k2 will still be tried. If all sinks in the sink group fail, the transaction with the channel is rolled back.
Finally, processor.maxpenalty sets an upper limit on the exponential backoff for failed sinks in the group. After the first failure, it will be 1 second before the sink can be used again. Each subsequent failure doubles the wait time, until processor.maxpenalty is reached.
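For instance, to cap the failover backoff at 10 seconds instead of the default 30 (an arbitrary choice for illustration), you would add:
agent.sinkgroups.sg1.processor.maxpenalty=10000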
MorphlineSolrSink
HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data so that full-text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.
Like most streaming data flows, you not only transport the data, but you also often reformat it into a form more consumable by the target of the flow. Typically, this is done in a Flume-only workflow by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data instead of interceptors.
Internally, each Flume event is converted into a Morphline record and passed to the first command in the Morphline command chain. A record can be thought of as a set of key/value pairs with string keys and arbitrary object values. Each of the Flume headers is passed as a record field with the same header keys. A special record field key, _attachment_body, is used for the Flume event body. Keep in mind that the body is still a byte array (Java byte[]) at this point and must be specifically processed in the Morphline command chain.
Each command processes the record in turn, passing its output to the input of the next command in line, with the final command responsible for terminating the flow. In many ways, this is similar in functionality to Flume's Interceptor functionality, which we'll see in Chapter 6, Interceptors, ETL, and Routing. In the case of writing to Solr, we use the loadSolr command to convert the Morphline record into a SolrDocument and write it to the Solr cluster.
Morphline configuration files
Morphline configuration files use the HOCON format, which is similar to JSON but has a less strict syntax, making these files less error-prone for configuration than JSON.
Note
HOCON is an acronym for Human-Optimized Config Object Notation. You can read more about HOCON on this GitHub page: https://github.com/typesafehub/config/blob/master/HOCON.md
The configuration file contains a single key, morphlines. Its value is an array of Morphline configurations. Each individual entry is comprised of three keys:
id
importCommands
commands
If your configuration contains multiple Morphlines, the value of id must be provided to the Flume sink by way of the morphlineId property. The value of importCommands specifies the Java classes to import when the Morphline is evaluated. The double star indicates that all paths and classes from that point in the package hierarchy should be included. All classes that implement com.cloudera.cdk.morphline.api.CommandBuilder are interrogated for their names via the getNames() method. These names are the command names you use in the next section. Don't worry; you don't need to sift through the source code to find them, as there is a well-documented reference guide online. Finally, the commands key references a list of command dictionaries. Each command dictionary has a single key consisting of the name of the Morphline command, followed by its specific properties.
Note
For a list of Morphline commands and their associated configuration properties, see the reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
Here is what a skeleton configuration file might look like:
morphlines : [
  {
    id : transform_my_data
    importCommands : [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands : [
      {
        COMMAND_NAME1 : {
          property1 : value1
          property2 : value2
        }
      }
      {
        COMMAND_NAME2 : {
          property1 : value1
        }
      }
    ]
  }
]
Typical SolrSink configuration
Here is the preceding skeleton configuration applied to our Solr use case. This is not meant to be complete, but it is sufficient to discuss the flow just described:
morphlines : [
  {
    id : solr_flow
    importCommands : [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands : [
      {
        readLine : {
          charset : UTF-8
        }
      }
      {
        grok : {
          GROK_PROPERTIES_HERE
        }
      }
      {
        loadSolr : {
          solrLocator : {
            collection : my_collection
            zkHost : "solr.example.com:2181/solr"
          }
        }
      }
    ]
  }
]
You can see the same boilerplate configuration, where we define a single Morphline with the solr_flow identifier. The command sequence starts with the readLine command. This simply reads the event body from the _attachment_body field and converts the byte[] into a String using the configured encoding (in this case, UTF-8). The resulting String value is set on the field with the key message. The next command in the sequence, the grok command, uses regular expressions to extract additional fields to make a more interesting SolrDocument. I couldn't possibly do this command justice by trying to explain everything you can do with it. For that, please see the Kite SDK documentation.
Note
See the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
Suffice to say, grok lets me take a web server log line such as this:
10.4.240.176 - - [14/Mar/2014:12:02:17 -0500] "POST http://mysite.com/do_stuff.php HTTP/1.1" 500 834
Then, it lets me turn it into more structured data like this:
{
  ip : 10.4.240.176
  timestamp : 1413306137
  method : POST
  url : http://mysite.com/do_stuff.php
  protocol : HTTP/1.1
  status_code : 500
  length : 834
}
If you wanted to search for all the times this page threw a 500 status code, having these fields broken out makes the task easy for Solr.
Finally, we call the loadSolr command to insert the record into our Solr cluster. The solrLocator property indicates the target Solr cluster (by way of its ZooKeeper server or servers) and the data collection to write these documents into.
Sink configuration
Now that you have a basic idea of how to create a Morphline configuration file, let's apply this to the actual sink configuration.
The following table details the sink's parameters and default values:

Key                  Required  Type    Default
type                 Yes       String  org.apache.flume.sink.solr.morphline.MorphlineSolrSink
channel              Yes       String
morphlineFile        Yes       String
morphlineId          No        String  Required if the Morphline configuration file contains more than one Morphline
batchSize            No        int     1000
batchDurationMillis  No        long    1000 (milliseconds)
handlerClass         No        String  org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl
The MorphlineSolrSink does not have a short type alias, so set the type parameter on your named sink to org.apache.flume.sink.solr.morphline.MorphlineSolrSink:
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
This defines a MorphlineSolrSink named k1 for the agent named agent.
The next required parameter is the channel property. This specifies which channel to read events from for processing:
agent.sinks.k1.channel=c1
This tells the k1 sink to read events from the c1 channel.
The only other required parameter is the relative or absolute path to the Morphline configuration file. This cannot be a path in HDFS; it must be accessible on the server the Flume agent is running on (local disk, NFS disk, and so on).
To specify the configuration file path, set the morphlineFile property:
agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf
As a Morphline configuration file can contain multiple Morphlines, you must specify the identifier, if more than one exists, using the morphlineId property:
agent.sinks.k1.morphlineId=transform_my_data
The next two properties are fairly common among sinks. They specify how many events to remove at a time for processing, also known as a batch. The batchSize property defaults to 1,000 events, but you might need to set this higher if you aren't consuming events from the channel as fast as they are being inserted. Clearly, you can only increase this so much, as the thing you are writing to, in this case Solr, will have some record consumption limit. Only through testing will you be able to stress your systems and see where the limits are.
The related batchDurationMillis property specifies the maximum time to wait, when fewer than batchSize events have been read, before the sink proceeds with processing. The default value is 1 second and is specified in milliseconds in the configuration properties. In a situation with a light data flow (using the defaults, fewer than 1,000 records per second), setting batchDurationMillis higher can make things worse. For instance, if you are using a memory channel with this sink, your Flume agent could be sitting there with data to write to the sink's target but waiting for more, only for a crash to happen, resulting in lost data. That said, your downstream entity might perform better on larger batches, which might push both of these configuration values higher, so there is no universally correct answer. Start with the defaults if you are unsure, and use the hard data that you'll collect using the techniques in Chapter 8, Monitoring Flume, to adjust based on facts and not guesses.
Finally, you should never need to touch the handlerClass property unless you plan to write an alternate implementation of the Morphline processing class. As there is only one Morphline engine implementation to date, I'm not really sure why this is a documented property in Flume. I mention it only for completeness.
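Putting these pieces together, a complete sink definition for the solr_flow Morphline sketched earlier might look like this; the file path and batch values are illustrative:
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k1.channel=c1
agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf
agent.sinks.k1.morphlineId=solr_flow
agent.sinks.k1.batchSize=1000
agent.sinks.k1.batchDurationMillis=1000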
ElasticSearchSink
Another common target to stream data to for NRT searching is Elasticsearch. Elasticsearch is also a clustered search platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).
Note
Here are the project home pages for the ELK stack, which can give you a much better overview than I can in a few short pages:
Elasticsearch: http://elasticsearch.org/
Logstash: http://logstash.net/
Kibana: http://www.elasticsearch.org/overview/kibana/
In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database, so each Flume event will become a single document in Elasticsearch. Documents have one or more fields (just like columns in a database).
This is by no means a complete introduction to Elasticsearch, but it should be enough to get you started, assuming you already have an Elasticsearch cluster at your disposal. As events get mapped to documents by the sink's serializer, the actual sink configuration needs only a few configuration items: where the cluster is located, which index to write to, and what type the record is.
This table summarizes the settings for the ElasticSearchSink:

Key          Required  Type    Default
type         Yes       String  org.apache.flume.sink.elasticsearch.ElasticSearchSink
hostNames    Yes       String  A comma-separated list of Elasticsearch nodes to connect to. If the port is specified, use a colon after the name. The default port is 9300.
clusterName  No        String  elasticsearch
indexName    No        String  flume
indexType    No        String  log
ttl          No        String  Defaults to never expire. Specify the number and unit (5m = 5 minutes).
batchSize    No        int     100
With this information in mind, let's start by setting the sink's type property:
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
Next, we need to set the list of servers and ports to establish connectivity, using the hostNames property. This is a comma-separated list of hostname:port pairs. If you are using the default port of 9300, you can just specify the server name or IP, for example:
agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345
Now that we can communicate with the Elasticsearch servers, we need to tell them which cluster, index, and type to write our documents to. The cluster is specified using the clusterName property. This corresponds with the cluster.name property in Elasticsearch's elasticsearch.yml configuration file. It needs to be specified, as an Elasticsearch node can participate in more than one cluster. Here is how I would specify a nondefault cluster name called production:
agent.sinks.k1.clusterName=production
The indexName property is really a prefix used to create a daily index. This keeps any single index from becoming too large over time. If you use the default index name, the index for October 30, 2014 will be named flume-2014-10-30.
Lastly, the indexType property specifies the Elasticsearch type. If unspecified, the default value of log will be used.
By default, data written into Elasticsearch will never expire. If you want the data to expire automatically, you can specify a time-to-live value on the records with the ttl property. Values are either a number in milliseconds or a number with units. The units are given in this table:
Unit string    Definition    Example
ms             Milliseconds  5ms = 5 milliseconds
not specified  Milliseconds  10000 = 10 seconds
m              Minutes       10m = 10 minutes
h              Hours         1h = 1 hour
d              Days          7d = 7 days
w              Weeks         4w = 4 weeks
Keep in mind that you also need to enable the TTL feature on the Elasticsearch cluster, as it is disabled by default. See the Elasticsearch documentation for how to do this.
Finally, like the HDFS sink, the batchSize property is the number of events per transaction that the sink will read from the channel. If you have a large volume of data in your channel, you should see a performance increase by setting this higher than the default of 100, due to the reduced overhead per transaction.
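Collected in one place, a sketch of a complete Elasticsearch sink might look like the following; the hostnames, cluster name, TTL, and batch size are placeholders for your environment:
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.k1.channel=c1
agent.sinks.k1.hostNames=es1.example.com,es2.example.com
agent.sinks.k1.clusterName=production
agent.sinks.k1.indexName=flume
agent.sinks.k1.indexType=log
agent.sinks.k1.ttl=4w
agent.sinks.k1.batchSize=500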
The sink's serializer does the work of transforming the Flume event into the Elasticsearch document. Two Elasticsearch serializers come packaged with Flume; neither has additional configuration properties, as they mostly use existing headers to dictate field mappings.
We'll see more of this sink in action in Chapter 7, Putting It All Together.
LogStash Serializer
The default serializer, if not specified, is the ElasticSearchLogStashEventSerializer:
agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
It writes data in the same format that Logstash uses in conjunction with Kibana. Here is a table of the commonly used fields and their associated mappings from Flume events:
Elasticsearch field  Taken from the Flume header  Notes
@timestamp           timestamp                    From the header, if present
@source              source                       From the header, if present
@source_host         source_host                  From the header, if present
@source_path         source_path                  From the header, if present
@type                type                         From the header, if present
@host                host                         From the header, if present
@fields              all headers                  A dictionary of all the Flume headers, including those that might have been mapped to the other fields, such as host
@message             Flume body
While you might think the document's @type field will be automatically set to the sink's indexType configuration property, you'd be incorrect. If you had only one type of log, it would be wasteful to write this over and over again for every document. However, if you have more than one log type going through your Flume channel, you can designate its type in Elasticsearch using the Static interceptor (which we'll see in Chapter 6, Interceptors, ETL, and Routing) to set the type (or @type) Flume header on the event.
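As a preview of Chapter 6, the header could be set at the source with a Static interceptor along these lines; the source name s1 and the apache_access value are purely illustrative:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=type
agent.sources.s1.interceptors.i1.value=apache_access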
Dynamic Serializer
Another serializer is the ElasticSearchDynamicSerializer. If you use this serializer, the event's body is written to a field called body. All other Flume header keys are used as field names. Clearly, you want to avoid having a Flume header key called body, as it will conflict with the actual event's body when the event is transformed into an Elasticsearch document. To use this serializer, specify the fully qualified class name, as shown in this example:
agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
For completeness, here is a table that shows you how the Flume headers and body get mapped to Elasticsearch fields:

Flume entity  Elasticsearch field
All headers   Same as the Flume header keys
Body          body
As the version of Elasticsearch can be different for each user, Flume doesn't package the Elasticsearch client and corresponding Lucene libraries. Find out from your administrator which versions should be included on the Flume classpath, or check the Maven pom.xml file on GitHub for the corresponding version tag or branch at https://github.com/elasticsearch/elasticsearch/blob/master/pom.xml. Make sure the library versions used by Flume match those of Elasticsearch, or you might see serialization errors.
Note
As Solr and Elasticsearch have similar capabilities, check out Kelvin Tan's appropriately named side-by-side detailed feature breakdown web page. It should help get you started with what is most appropriate for your specific use case:
http://solr-vs-elasticsearch.com/
Summary
In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or on the contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.
Compression was discussed as a means to reduce storage requirements in HDFS, and it should be used when possible. Besides the storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.
Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization (with optional compression).
Next, various file formats, including sequence files (Hadoop key/value files), Data Streams (uncompressed data files, like Avro containers), and Compressed Data Streams, were discussed.
Next, we covered sink groups as a means to route events to different sinks using load balancing or failover paths, which can be used to eliminate single points of failure in routing data to its destination.
Finally, we covered two new sinks added in Flume 1.4 to write data to Apache Solr and Elasticsearch in a Near Real Time (NRT) way. For years, MapReduce jobs have served us well, and they will continue to do so, but sometimes that still isn't fast enough to search large datasets quickly and look at things from different angles without reprocessing data. Kite SDK Morphlines were also introduced as a way to prepare data for writing to Solr. We will revisit Morphlines in Chapter 6, Interceptors, ETL, and Routing, when we look at a Morphline-powered interceptor.
In the next chapter, we will discuss various input mechanisms (sources) that will feed your configured channels, which were covered back in Chapter 3, Channels.
Chapter 5. Sources and Channel Selectors
Now that we have covered channels and sinks, we will cover some of the more common ways to get data into your Flume agents. As discussed in Chapter 1, Overview and Architecture, the source is the input point for the Flume agent. There are many sources available with the Flume distribution, as well as many open source options available. Like most open source software, if you can't find what you need, you can always write your own by extending the org.apache.flume.source.AbstractSource class. Since the primary focus of this book is ingesting files of logs into Hadoop, we'll cover a few of the more appropriate sources to accomplish this.
The problem with using tail
If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).
As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back", as rollback semantics dictate.
Furthermore, if the rate of data written to a file exceeds the rate at which Flume can read the data, it is possible to lose one or more logfiles of input outright. For example, say you were tailing /var/log/app.log. When that file reaches a certain size, it is rotated, or renamed, to /var/log/app.log.1, and a new file called /var/log/app.log is created. Let's say you had a favorable review in the press, and your application logs are much higher than usual. Flume may still be reading from the rotated file (/var/log/app.log.1) when another rotation occurs, moving /var/log/app.log to /var/log/app.log.1. The file Flume is reading is now renamed to /var/log/app.log.2. When Flume finishes with this file, it will move to what it thinks is the next file (/var/log/app.log), thus skipping the file that now resides at /var/log/app.log.1. This kind of data loss would go completely unnoticed and is something we want to avoid if possible.
For these reasons, it was decided to remove the tail functionality from Flume when it was refactored. There are some workarounds for the removed TailSource, but it should be noted that no workaround can eliminate the possibility of data loss under load in these conditions.
The Exec source
The Exec source provides a mechanism to run a command outside Flume and then turn the output into Flume events. To use the Exec source, set the type property to exec:
agent.sources.s1.type=exec
All sources in Flume are required to specify the list of channels to write events to using the channels (plural) property. This is a space-separated list of one or more channel names:
agent.sources.s1.channels=c1
The only other required parameter is the command property, which tells Flume what command to pass to the operating system. Here is an example of the use of this property:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/app.log
Here, I have configured a single source, s1, for an agent named agent. The source, an Exec source, will tail the /var/log/app.log file and follow any rotations that outside applications may perform on that logfile. All events are written to the c1 channel. This is an example of one of the workarounds for the lack of TailSource in Flume 1.x. It is not my preferred workaround for tail, just a simple example of the exec source type. I will show my preferred method in Chapter 7, Putting It All Together.
Note
Should you use the tail -F command in conjunction with the Exec source, it is probable that the forked process will not shut down 100 percent of the time when the Flume agent shuts down or restarts. This will leave orphaned tail processes that never exit. The tail -F command, by definition, has no end. Even if you delete the file being tailed (at least in Linux), the running tail process will keep the file handle open indefinitely. This keeps the file's space from actually being reclaimed until the tail process exits, which won't happen. I think you are beginning to see why Flume developers don't like tailing files.
If you go this route, be sure to periodically scan the process tables for tail -F processes whose parent PID is 1. These are effectively dead processes and need to be killed manually.
Here is a list of other properties you can use with the Exec source:

Key              Required  Type                 Default
type             Yes       String               exec
channels         Yes       String               Space-separated list of channels
command          Yes       String
shell            No        String               Shell command
restart          No        boolean              false
restartThrottle  No        long (milliseconds)  10000 (milliseconds)
logStdErr        No        boolean              false
batchSize        No        int                  20
batchTimeout     No        long                 3000 (milliseconds)
Not every command keeps running, either because it fails (for example, when the channel it is writing to is full) or because it is designed to exit immediately. In this example, we want to record the system load via the Linux uptime command, which prints some system information to stdout and exits:
agent.sources.s1.command=uptime
This command will immediately exit, so you can use the restart and restartThrottle properties to run it periodically:
agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000
This will produce one event per minute. In the tail example, should the channel fill, causing the Exec source to fail, you can use these properties to restart the Exec source. In this case, setting the restart property will start tailing from the beginning of the current file, thus producing duplicates. Depending on how long the restartThrottle period is, you may have missed some data due to a file rotation outside Flume. Furthermore, the channel may still be unable to accept data, in which case the source will fail again. Setting this value too low means giving the channel less time to drain, and unlike some of the sinks we saw, there is no option for exponential backoff.
If you need to use shell-specific features, such as wildcard expansion, you can set the shell property, as in this example:
agent.sources.s1.command=grep -i apache lib/*.jar | wc -l
agent.sources.s1.shell=/bin/bash -c
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000
This example will count the number of times the case-insensitive string apache is found in all the JAR files in the lib directory. Once per minute, that count will be sent as a Flume event payload.
While the command output written to stdout becomes the Flume event body, errors are sometimes written to stderr. If you want these lines included in the Flume agent's system logs, set the logStdErr property to true. Otherwise, they will be silently ignored, which is the default behavior.
Finally, you can specify the number of events to write per transaction by changing the batchSize property. You may need to set this value higher than the default of 20 if your input data is large and you find that you cannot write to your channel fast enough. Using a higher batch size reduces the overall average transaction overhead per event. Testing with different values and monitoring the channel's put rate is the only way to know for sure. The related batchTimeout property sets the maximum time to wait, when fewer records than the batch size have been seen, before flushing a partial batch to the channel. The default setting is 3 seconds (specified in milliseconds).
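For example, to batch up to 100 events per transaction while flushing partial batches after 1 second (illustrative values; test against your own flow), you would set:
agent.sources.s1.batchSize=100
agent.sources.s1.batchTimeout=1000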
Spooling Directory Source
In an effort to avoid all the assumptions inherent in tailing a file, a new source was devised to keep track of which files have been converted into Flume events and which still need to be processed. The SpoolingDirectorySource is given a directory to watch for new files appearing. It is assumed that files copied to this directory are complete. Otherwise, the source might try to send a partial file. It also assumes that filenames never change. Otherwise, on restarting, the source would forget which files have been sent and which have not. The filename condition can be met in log4j by using DailyRollingFileAppender rather than RollingFileAppender. However, the currently open file would need to be written to one directory and copied to the spool directory after being closed. None of the log4j appenders shipped have this capability.
That said, if you are using the Linux logrotate program in your environment, this might be of interest. You can move completed files to a separate directory using a postrotate script.
To create a Spooling Directory Source, set the type property to spooldir. You must specify the directory to watch by setting the spoolDir property:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=spooldir
agent.sources.s1.spoolDir=/path/to/files
Here is a summary of the properties for the Spooling Directory Source:

Key                  Required  Type                                  Default
type                 Yes       String                                spooldir
channels             Yes       String                                Space-separated list of channels
spoolDir             Yes       String                                Path to the directory to spool
fileSuffix           No        String                                .COMPLETED
deletePolicy         No        String (never or immediate)           never
fileHeader           No        boolean                               false
fileHeaderKey        No        String                                file
basenameHeader       No        boolean                               false
basenameHeaderKey    No        String                                basename
ignorePattern        No        String                                ^$
trackerDir           No        String                                ${spoolDir}/.flumespool
consumeOrder         No        String (oldest, youngest, or random)  oldest
batchSize            No        int                                   10
bufferMaxLines       No        int                                   100
maxBufferLineLength  No        int                                   5000
maxBackoff           No        int                                   4000 (milliseconds)
When a file has been transmitted completely, it will be renamed with a .COMPLETED extension, unless this is overridden by setting the fileSuffix property, like this:
agent.sources.s1.fileSuffix=.DONE
StartingwithFlume1.4,anewproperty,deletePolicy,wascreatedtoremovecompletedfilesfromthefilesystemratherthanjustmarkingthemasdone.Inaproductionenvironment,thisiscriticalbecauseyourspooldiskwillfillupovertime.Currently,youcanonlysetthisforimmediatedeletionortoleavethefilesforever.Ifyouwantdelayeddeletion,you’llneedtoimplementyourownperiodic(cron)job,perhapsusingthefindcommandtofindfilesinthespooldirectorywiththeCOMPLETEDfilesuffixandamodificationtimelongerthansomeregularvalue,forexample:
find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;
This will find all files completed more than seven days ago and delete them.
If you want the absolute file path attached to each event, set the fileHeader property to true. This will create a header with the file key, unless it is set to something else using the fileHeaderKey property. For example, this configuration would add the {sourceFile=/path/to/files/foo.1234.log} header if the event was read from the /path/to/files/foo.1234.log file:
agent.sources.s1.fileHeader=true
agent.sources.s1.fileHeaderKey=sourceFile
The related property, basenameHeader, if set to true, will add a header with the basename key, which contains just the filename. The basenameHeaderKey property allows you to change this key, as shown here:
agent.sources.s1.basenameHeader=true
agent.sources.s1.basenameHeaderKey=justTheName
This configuration would add the {justTheName=foo.1234.log} header if the event was read from the same file located at /path/to/files/foo.1234.log.
If there are certain file patterns that you do not want this source to read as input, you can pass a regular expression using the ignorePattern property. Personally, I simply won't copy files I don't want transferred to Flume into the spoolDir in the first place. If this situation cannot be avoided, use the ignorePattern property to pass a regular expression matching the filenames that should not be transferred as data. Furthermore, subdirectories and files that start with a "." (period) character are ignored, so you can avoid costly regular expression processing by using this convention instead.
While Flume is sending data, it keeps track of how far it has gotten in each file by keeping a metadata file in the directory specified by the trackerDir property. By default, this file will be .flumespool under spoolDir. Should you want a location other than inside spoolDir, you can specify an absolute file path.
Files in the directory are processed in the "oldest first" manner, as calculated by looking at the modification times of the files. You can change this behavior by setting the consumeOrder property. If you set this property to youngest, the newest files will be processed first. This may be desired if the data is time sensitive. If you'd rather give equal precedence to all files, you can set this property to a value of random.
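For example, to process the newest files first, you would add this line to the preceding configuration:

agent.sources.s1.consumeOrder=youngest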
The batchSize property allows you to tune the number of events per transaction for writes to the channel. Increasing this may provide better throughput at the cost of larger transactions (and possibly larger rollbacks). The bufferMaxLines property is used to set the size of the memory buffer used in reading files, by multiplying it with maxBufferLineLength. If your data is very short, you might consider increasing bufferMaxLines while reducing maxBufferLineLength. In this case, it will result in better throughput without increasing your memory overhead. That said, if you have events longer than 5000 characters, you'll want to set maxBufferLineLength higher.
If there is a problem writing data to the channel, a ChannelException is thrown back to the source, where it'll retry after an initial wait time of 250 ms. Each failed attempt will double this time, up to a maximum of 4 seconds. To set this maximum time higher, set the maxBackoff property. For instance, if I wanted a maximum of 5 minutes, I could set it in milliseconds like this:
agent.sources.s1.maxBackoff=300000
Finally, you'll want to ensure that whatever mechanism is writing new files to your spooling directory creates unique filenames, such as adding a timestamp (and possibly more). Reusing a filename will confuse the source, and your data may not be processed.
As always, remember that restarts and errors will create duplicates, due to the retransmission of files partially sent but not marked as complete, or because the metadata is incomplete.
Syslog sources
Syslog has been around for decades and is often used as an operating-system-level mechanism to capture and move logs around systems. In many ways, there are overlaps with some of the functionality Flume provides. There is even a Hadoop module for rsyslog, one of the more modern variants of syslog (http://www.rsyslog.com/doc/rsyslog_conf_modules.html/omhdfs.html). Generally, I don't like solutions that couple technologies that may version independently. If you use this rsyslog/Hadoop integration, you would be required to update the version of Hadoop you compiled into rsyslog at the same time you upgraded your Hadoop cluster to a new major version. This may be logistically difficult if you have a large number of servers and/or environments. Backward compatibility in Hadoop wire protocols is something that is being actively worked on in the Hadoop community, but currently, it isn't the norm. We'll talk more about this in Chapter 6, Interceptors, ETL, and Routing, when we discuss tiering data flows.
Syslog has an older UDP transport, as well as a newer TCP protocol that can handle data larger than a single UDP packet can transmit (about 64 KB) and deal with network-related congestion events that might require the data to be retransmitted.
Finally, there are some undocumented properties of syslog sources that allow us to add more regular-expression pattern-matching for messages that do not conform to the RFC standards. I won't be discussing these additional settings, but you should be aware of them if you run into frequent parsing errors. In this case, take a look at the source for org.apache.flume.source.SyslogUtils for implementation details to find the cause.
More details on syslog terms (such as a facility) and standard formats can be found in RFC 3164 at http://tools.ietf.org/html/rfc3164.
The syslog UDP source
The UDP version of syslog is usually safe to use when you are receiving data from the server's local syslog process, provided the data is small enough (less than about 64 KB).
Note
The implementation for this source has chosen 2,500 bytes as the maximum payload size, regardless of what your network can actually handle. So, if your payload will be larger than this, use one of the TCP sources instead.
To create a Syslog UDP source, set the type property to syslogudp. You must set the port to listen on using the port property. The optional host property specifies the bind address. If no host is specified, all IPs for the server will be used, which is the same as specifying 0.0.0.0. In this example, we will only listen for local UDP connections on port 5140:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=syslogudp
agent.sources.s1.host=localhost
agent.sources.s1.port=5140
If you want syslog to forward a tailed file, you can add a line like this to your syslog configuration file:
*.err;*.alert;*.crit;*.emerg;kern.* @localhost:5140
This will send all error priority, critical priority, emergency priority, and kernel messages of any priority into your Flume source. The single @ symbol designates that the UDP protocol should be used.
Here is a summary of the properties of the Syslog UDP source:
Key         Required  Type     Default
type        Yes       String   syslogudp
channels    Yes       String   Space-separated list of channels
port        Yes       int
host        No        String   0.0.0.0
keepFields  No        boolean  false
The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
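For example, to retain those fields in the event body as well, you would add:

agent.sources.s1.keepFields=true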
The Flume headers created by the Syslog UDP source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if it is not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if it is not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This value is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the eventSize value (for UDP, this is set internally to 2,500 bytes). It is omitted if everything is fine.
The syslog TCP source
As previously mentioned, the Syslog TCP source provides an endpoint for messages over TCP, allowing for a larger payload size and TCP retry semantics that should be used for any reliable inter-server communications.
To create a Syslog TCP source, set the type property to syslogtcp. You must still set the bind address and port to listen on:
agent.sources=s1
agent.sources.s1.type=syslogtcp
agent.sources.s1.host=0.0.0.0
agent.sources.s1.port=12345
If your syslog implementation supports syslog over TCP, the configuration is usually the same, except that a double @ symbol is used to indicate TCP transport. Here is the same example using TCP, where I am forwarding the values to a Flume agent that is running on a different server named flume-1:
*.err;*.alert;*.crit;*.emerg;kern.* @@flume-1:12345
There are some optional properties for the Syslog TCP source, as listed here:
Key         Required  Type         Default
type        Yes       String       syslogtcp
channels    Yes       String       Space-separated list of channels
port        Yes       int
host        No        String       0.0.0.0
keepFields  No        boolean      false
eventSize   No        int (bytes)  2500
The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
The Flume headers created by the Syslog TCP source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the configured eventSize. It is omitted if everything is fine.
The multiport syslog TCP source
The Multiport Syslog TCP source is nearly identical in functionality to the Syslog TCP source, except that it can listen on multiple ports for input. You may need to use this capability if you are unable to change which port syslog will use in its forwarding rules (it may not be your server at all). It is more likely that you will use this to read multiple formats using one source to write to different channels. We'll cover that in a moment in the Channel selectors section. Under the hood, a high-performance asynchronous TCP library called Mina (https://mina.apache.org/) is used, which often provides better throughput on multicore servers, even when consuming only a single TCP port.
To configure this source, set the type property to multiport_syslogtcp:
agent.sources.s1.type=multiport_syslogtcp
Like the other syslog sources, you need to specify the port, but in this case it is a space-separated list of ports. You can use this even if you have only one port. The property for this is ports (plural):
agent.sources.s1.type=multiport_syslogtcp
agent.sources.s1.channels=c1
agent.sources.s1.ports=33333 44444
agent.sources.s1.host=0.0.0.0
This code configures the Multiport Syslog TCP source named s1 to listen for any incoming connections on ports 33333 and 44444 and send them to channel c1.
In order to tell which event came from which port, you can set the optional portHeader property to the name of the key whose value will be the port number. Let's add this property to the configuration:
agent.sources.s1.portHeader=port
Then, any events received from port 33333 would have a header key/value of {"port"="33333"}. As you saw in Chapter 4, Sinks and Sink Processors, you can now use this value (or any header) as a part of your HDFS Sink file path convention, like this:
agent.sinks.k1.hdfs.path=/logs/%{hostname}/%{port}/%Y/%m/%d/%H
Here is a complete table of the properties:
Key                 Required  Type         Default
type                Yes       String       multiport_syslogtcp
channels            Yes       String       Space-separated list of channels
ports               Yes       int          Space-separated list of port numbers
host                No        String       0.0.0.0
keepFields          No        boolean      false
eventSize           No        int          2500 (bytes)
portHeader          No        String
batchSize           No        int          100
readBufferSize      No        int (bytes)  1024
numProcessors       No        int          Automatically detected
charset.default     No        String       UTF-8
charset.port.PORT#  No        String
This TCP source has some additional tunable options over the standard TCP syslog source. The first is the batchSize property. This is the number of events processed per transaction with the channel. There is also the readBufferSize property, which specifies the internal buffer size used by the Mina library. Finally, the numProcessors property is used to size the worker thread pool in Mina. Before you tune these parameters, you may want to familiarize yourself with Mina and look at the source code before deviating from the defaults.
Finally, you can specify the default and per-port character encoding to use when converting between Strings and bytes:
agent.sources.s1.charset.default=UTF-16
agent.sources.s1.charset.port.33333=UTF-8
This sample configuration shows that all ports will be interpreted using UTF-16 encoding, except for port 33333 traffic, which will use UTF-8.
As you've already seen with the other syslog sources, the keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
The Flume headers created by this source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, set to Incomplete if the message was longer than the configured eventSize, and omitted if everything is fine.
JMS source
Sometimes, data can originate from asynchronous message queues. For these cases, you can use Flume's JMS source to create events read from a JMS Queue or Topic. While it is theoretically possible to use any Java Message Service (JMS) implementation, Flume has only been tested with ActiveMQ, so be sure to test thoroughly if you use a different provider.
Like the previously covered ElasticSearch sink in Chapter 4, Sinks and Sink Processors, Flume does not come packaged with the JMS implementation you'll be using, as the versions need to match up, so you'll need to include the necessary implementation JAR files on the Flume agent's classpath. The preferred method is to use the --plugins-dir parameter mentioned in Chapter 2, A Quick Start Guide to Flume, which we'll cover in more detail in the next chapter.
Note
ActiveMQ is just one provider of the Java Message Service API. For more information, see the project home page at http://activemq.apache.org.
To configure this source, set the type property to jms:
agent.sources.s1.type=jms
The first three properties are used to establish a connection with the JMS server. They are initialContextFactory, connectionFactory, and providerURL. The initialContextFactory property for ActiveMQ will be org.apache.activemq.jndi.ActiveMQInitialContextFactory. The connectionFactory property will be the registered JNDI name, which defaults to ConnectionFactory if unspecified. Finally, providerURL is the connection String passed to the connection factory to actually establish a network connection. It is usually a URL-like String consisting of the server name and port information. If you aren't familiar with your JMS configuration, ask somebody who is what values to use in your environment.
This table summarizes these settings and others we'll discuss in a moment:
Key                    Required  Type    Default
type                   Yes       String  jms
channels               Yes       String  Space-separated list of channels
initialContextFactory  Yes       String
connectionFactory      No        String  ConnectionFactory
providerURL            Yes       String
userName               No        String  Username for authentication
passwordFile           No        String  Path to a file containing the password for authentication
destinationName        Yes       String
destinationType        Yes       String
messageSelector        No        String
errorThreshold         No        int     10
pollTimeout            No        long    1000 (milliseconds)
batchSize              No        int     100
If your JMS server requires authentication, pass the userName and passwordFile properties:
agent.sources.s1.userName=jms_bot_user
agent.sources.s1.passwordFile=/path/to/password.txt
Putting the password in a file you reference, rather than directly in the Flume configuration, allows you to keep the permissions on the configuration open for inspection, while storing the more sensitive password data in a separate file that is accessible only to the Flume agent (restrictions are commonly provided by the operating system's permissions system).
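As a sketch, assuming the agent runs as a dedicated flume user (the username and path are illustrative), you might lock the file down like this:

$ chown flume:flume /path/to/password.txt
$ chmod 600 /path/to/password.txt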
The destinationName property names the destination to read from our connection. Since JMS supports both queues and topics, you need to set the destinationType property to queue or topic, respectively. Here's what the properties might look like for a queue:
agent.sources.s1.destinationName=my_cool_data
agent.sources.s1.destinationType=queue
For a topic, it will look as follows:
agent.sources.s1.destinationName=restart_events
agent.sources.s1.destinationType=topic
The difference between a queue and a topic is basically the number of entities that will read from the named destination. If you are reading from a queue, the message will be removed from the queue once it is written to Flume as an event. A message on a topic, once read, is still available for other entities to read.
Should you need only a subset of the messages published to a topic or queue, the optional messageSelector String property provides this capability. For example, to create Flume events only from messages with a field called Age whose value is larger than 10, I can specify a messageSelector filter:
agent.sources.s1.messageSelector="Age > 10"
Note
Describing the selector capabilities and syntax in detail is far beyond the scope of this book. See http://docs.oracle.com/javaee/1.4/api/javax/jms/Message.html or pick up a book on JMS to become more familiar with JMS message selectors.
Should there be a problem in communicating with the JMS server, the connection will reset itself after a number of failures specified by the errorThreshold property. The default value of 10 is reasonable and will most likely not need to be changed.
Next, the JMS source has a property used to adjust the value of batchSize, which defaults to 100. By now, you should be fairly familiar with adjusting batch sizes on other sources and sinks. For high-volume flows, you'll want to set this higher to consume data in larger chunks from your JMS server and get higher throughput. Setting this too high for low-volume flows could delay processing. As always, testing is the only sure way to adjust this properly. The related pollTimeout property specifies how long to wait for new messages to appear before attempting to read a batch. The default, specified in milliseconds, is 1 second. If no messages are read before the poll timeout, the source will go into an exponential backoff mode before attempting another read. This could delay the processing of messages until it wakes up again. Chances are that you won't need to change this value from the default.
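As an illustrative sketch, a higher-volume tuning of the preceding JMS source might look like this (the values are assumptions that only testing can validate):

agent.sources.s1.batchSize=500
agent.sources.s1.pollTimeout=2000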
When the message gets converted into a Flume event, the message properties become Flume headers. The message payload becomes the Flume body, but since JMS uses serialized Java objects, we need to tell the JMS source how to interpret the payload. We do this by setting the converter.type property, which defaults to the only implementation that is packaged with Flume, using the DEFAULT String (implemented by the DefaultJMSMessageConverter class). It can deserialize JMS BytesMessages, TextMessages, and ObjectMessages (Java objects that implement the java.io.DataOutput interface). It does not handle StreamMessages or MapMessages, so if you need to process them, you'll need to implement your own converter type. To do this, you'll implement the org.apache.flume.source.jms.JMSMessageConverter interface, and use its fully qualified class name as the converter.type property value.
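As a hedged sketch of what such a converter might look like, here is one that flattens MapMessages into newline-separated key=value text. The class name com.example.MapMessageConverter is hypothetical, and the single convert(Message) method shown should be verified against the JMSMessageConverter interface in your Flume version:

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

import javax.jms.JMSException;
import javax.jms.MapMessage;
import javax.jms.Message;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.jms.JMSMessageConverter;

public class MapMessageConverter implements JMSMessageConverter {
  public List<Event> convert(Message message) throws JMSException {
    List<Event> events = new ArrayList<Event>();
    if (message instanceof MapMessage) {
      MapMessage map = (MapMessage) message;
      StringBuilder body = new StringBuilder();
      Enumeration<?> names = map.getMapNames();
      while (names.hasMoreElements()) {
        String name = (String) names.nextElement();
        // Flatten each map entry into a key=value line
        body.append(name).append('=').append(map.getString(name)).append('\n');
      }
      events.add(EventBuilder.withBody(body.toString(), Charset.forName("UTF-8")));
    }
    return events;
  }
}

It would then be wired in with agent.sources.s1.converter.type=com.example.MapMessageConverter.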
For completeness, here are these properties in tabular form:
Key                Required  Type    Default
converter.type     No        String  DEFAULT
converter.charset  No        String  UTF-8
Channel selectors
As we discussed in Chapter 1, Overview and Architecture, a source can write to one or more channels. This is why the property is plural (channels instead of channel). There are two ways multiple channels can be handled. The event can be written to all the channels or to just one channel, based on some Flume header value. The internal mechanism for this in Flume is called a channel selector.
The selector for any source can be specified using the selector.type property. All selector-specific properties begin with the usual source prefix: the agent name, the keyword sources, and the source name:
agent.sources.s1.selector.type=replicating
Replicating
If you do not specify a selector for a source, replicating is the default. The replicating selector writes the same event to all the channels in the source's channels list:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
In this example, every event will be written to all three channels: c1, c2, and c3.
There is an optional property on this selector, called optional. It is a space-separated list of channels that are optional. Consider this modified example:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3
Now, any failure to write to channel c2 or c3 will not cause the transaction to fail, and any data written to c1 will be committed. In the earlier example with no optional channels, any single channel failure would roll back the transaction for all channels.
Multiplexing
If you want to send different events to different channels, you should use a multiplexing channel selector by setting the value of selector.type to multiplexing. You also need to tell the channel selector which header to use by setting the selector.header property:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
Let's assume we used the Multiport Syslog TCP source to listen on four ports (11111, 22222, 33333, and 44444) with a portHeader setting of port:
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3
This configuration will result in the traffic of port 22222 and port 33333 going to the c2 channel only. The traffic of port 11111 will go to the c1 and c2 channels. A failure on either channel would result in nothing being added to either channel. The traffic of port 44444 will go to channels c2 and c3. However, a failure to write to c3 will still commit the transaction to c2, and c3 will not be attempted again with that event.
Summary
In this chapter, we covered in depth the various sources that we can use to insert log data into Flume, including the Exec source, the Spooling Directory Source, the Syslog sources (UDP, TCP, and multiport TCP), and the JMS source.
We discussed replicating the old TailSource functionality from Flume 0.9 and the problems with using tail semantics in general.
We also covered channel selectors and sending events to one or more channels, specifically the replicating and multiplexing channel selectors.
Optional channels were also discussed as a way to fail a put transaction for only some of the channels when more than one channel is used.
In the next chapter, we'll introduce interceptors, which allow in-flight inspection and transformation of events. Used in conjunction with channel selectors, interceptors provide the final piece needed to create complex dataflows with Flume. Additionally, we will cover RPC mechanisms (source/sink pairs) between Flume agents using both Avro and Thrift, which can be used to tie those dataflows together.
Chapter 6. Interceptors, ETL, and Routing
The final piece of functionality required in your data processing pipeline is the ability to inspect and transform events in flight. This can be accomplished using interceptors. Interceptors, as we discussed in Chapter 1, Overview and Architecture, can be inserted after a source creates an event, but before writing to the channel occurs.
Interceptors
An interceptor's functionality can be summed up with this method:
public Event intercept(Event event);
A Flume event is passed to it, and it returns a Flume event. It may do nothing, in which case the same unaltered event is returned. Often, it alters the event in some useful way. If null is returned, the event is dropped.
To add interceptors to a source, simply add the interceptors property to the named source, for example:
agent.sources.s1.interceptors=i1 i2 i3
This defines three interceptors, i1, i2, and i3, on the s1 source for the agent named agent.
Note
Interceptors are run in the order in which they are listed. In the preceding example, i2 will receive the output from i1. Then, i3 will receive the output from i2. Finally, the channel selector receives the output from i3.
Now that we have defined the interceptors by name, we need to specify their types as follows:
agent.sources.s1.interceptors.i1.type=TYPE1
agent.sources.s1.interceptors.i1.additionalProperty1=VALUE
agent.sources.s1.interceptors.i2.type=TYPE2
agent.sources.s1.interceptors.i3.type=TYPE3
Let's look at some of the interceptors that come bundled with Flume to get a better idea of how to configure them.
Timestamp
The Timestamp interceptor, as its name suggests, adds a header with the timestamp key to the Flume event if one doesn't already exist. To use it, set the type property to timestamp.
If the event already contains a timestamp header, it will be overwritten with the current time, unless configured to preserve the original value by setting the preserveExisting property to true.
Here is a table summarizing the properties of the Timestamp interceptor:
Key               Required  Type     Default
type              Yes       String   timestamp
preserveExisting  No        boolean  false
Here is what a complete configuration for a source might look like if we only want it to add a timestamp header if none exists:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.i1.preserveExisting=true
Recall this HDFS Sink path from Chapter 4, Sinks and Sink Processors, utilizing the event date:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
The timestamp header is what determines this path. If it is missing, you can be sure Flume will not know where to create the files, and you will not get the result you are looking for.
Host
Similar in simplicity to the Timestamp interceptor, the Host interceptor will add a header to the event containing the IP address of the current Flume agent. To use it, set the type property to host:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=host
The key for this header will be host, unless you specify something else using the hostHeader property. Like before, an existing header will be overwritten, unless you set the preserveExisting property to true. Finally, if you want a reverse DNS lookup of the hostname to be used as the value instead of the IP, set the useIP property to false. Remember that reverse lookups will add processing time to your data flow.
Here is a table summarizing the properties of the Host interceptor:
Key               Required  Type     Default
type              Yes       String   host
hostHeader        No        String   host
preserveExisting  No        boolean  false
useIP             No        boolean  true
Here is what a complete configuration for a source might look like if we want it to add a relayHost header containing the DNS hostname of this agent to every event:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=host
agent.sources.s1.interceptors.i1.hostHeader=relayHost
agent.sources.s1.interceptors.i1.useIP=false
This interceptor might be useful if you wanted to record the path your events took through your dataflow, for instance. Chances are you are more interested in the origin of the event rather than the path it took, which is why I have yet to use this.
Static
The Static interceptor is used to insert a single key/value header into each Flume event processed. If more than one key/value is desired, you simply add additional Static interceptors. Unlike the interceptors we've looked at so far, the default behavior is to preserve existing headers with the same key. As always, my recommendation is to always specify what you want and not rely on the defaults.
I do not know why the key and value properties are not required, as the defaults are not terribly useful.
Here is a table summarizing the properties of the Static interceptor:
Key               Required  Type     Default
type              Yes       String   static
key               No        String   key
value             No        String   value
preserveExisting  No        boolean  true
Finally, let's look at an example configuration that inserts two new headers, provided they don't already exist in the event:
agent.sources.s1.interceptors=pos env
agent.sources.s1.interceptors.pos.type=static
agent.sources.s1.interceptors.pos.key=pointOfSale
agent.sources.s1.interceptors.pos.value=US
agent.sources.s1.interceptors.env.type=static
agent.sources.s1.interceptors.env.key=environment
agent.sources.s1.interceptors.env.value=staging
Regular expression filtering
If you want to filter events based on the content of the body, the regular expression filtering interceptor is your friend. Based on a regular expression you provide, it will either filter out the matching events or keep only the matching events. Start by setting the interceptor's type property to regex_filter. The pattern you want to match is specified using Java-style regular expression syntax. See the javadocs for usage details at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. The pattern string is set in the regex property. Be sure to escape backslashes in Java Strings. For instance, the \d+ pattern would need to be written with two backslashes: \\d+. If you wanted to match a backslash, the documentation says to type two backslashes, but each needs to be escaped, resulting in four, that is, \\\\. You will see the use of escaped backslashes throughout this chapter. Finally, you need to tell the interceptor if you want to exclude matching records by setting the excludeEvents property to true. The default (false) indicates that you want to keep only the events that match the pattern.
Here is a table summarizing the properties of the regular expression filtering interceptor:
Key            Required  Type     Default
type           Yes       String   regex_filter
regex          No        String   .*
excludeEvents  No        boolean  false
In this example, any events containing the NullPointerException string will be dropped:
agent.sources.s1.interceptors=npe
agent.sources.s1.interceptors.npe.type=regex_filter
agent.sources.s1.interceptors.npe.regex=NullPointerException
agent.sources.s1.interceptors.npe.excludeEvents=true
Regular expression extractor
Sometimes, you'll want to extract bits of your event body into Flume headers so that you can perform routing via channel selectors. You can use the regular expression extractor interceptor to perform this function. Start by setting the interceptor's type property to regex_extractor:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
Like the regular expression filtering interceptor, the regular expression extractor interceptor uses Java-style regular expression syntax. In order to extract one or more fields, you start by specifying the regex property with group matching parentheses. Let's assume we are looking for error numbers in our events in the Error: N form, where N is a number:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
As you can see, I put capture parentheses around the number, which may be one or more digits. Now that I've matched my desired pattern, I need to tell Flume what to do with my match. Here, we need to introduce serializers, which provide a pluggable mechanism for how to interpret each match. In this example, I've only got one match, so my space-separated list of serializer names has only one entry:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
agent.sources.s1.interceptors.e1.serializers=ser1
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
The name property specifies the header key to use, where the value is the matching text from the regular expression. The default type (which is also used if not specified) is a simple pass-through serializer. For this event body, look at the following:
NullPointerException: A problem occurred. Error: 123. TxnID: 5X2T9E.
The following header would be added to the event:
{"error_no":"123"}
If I wanted to add the TxnID value as a header, I'd simply add another matching pattern group and serializer:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+).*TxnID:\\s(\\w+)
agent.sources.s1.interceptors.e1.serializers=ser1 ser2
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
agent.sources.s1.interceptors.e1.serializers.ser2.type=default
agent.sources.s1.interceptors.e1.serializers.ser2.name=txnid
Then, I would create these headers for the preceding input:
{"error_no":"123","txnid":"5x2T9E"}
However, take a look at what would happen if the fields were reversed, as follows:
NullPointerException: A problem occurred. TxnID: 5X2T9E. Error: 123.
I would wind up with only a header for txnid. A better way to handle this kind of ordering would be to use multiple interceptors so that the order doesn't matter:
agent.sources.s1.interceptors=e1 e2
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
agent.sources.s1.interceptors.e1.serializers=ser1
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
agent.sources.s1.interceptors.e2.type=regex_extractor
agent.sources.s1.interceptors.e2.regex=TxnID:\\s(\\w+)
agent.sources.s1.interceptors.e2.serializers=ser1
agent.sources.s1.interceptors.e2.serializers.ser1.type=default
agent.sources.s1.interceptors.e2.serializers.ser1.name=txnid
The only other serializer implementation that ships with Flume, other than the pass-through, is used by specifying the fully qualified class name of org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer. This serializer is used to convert times into milliseconds. You need to specify a pattern property based on org.joda.time.format.DateTimeFormat patterns.
For instance, let's say you were ingesting Apache Web Server access logs, for example:
192.168.1.42 - - [29/Mar/2013:15:27:09 -0600] "GET /index.html HTTP/1.1" 200 1037
The complete regular expression for this might look like this (in the form of a Java String, with backslashes and quotes escaped with an extra backslash):
^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
The time pattern matched corresponds to this org.joda.time.format.DateTimeFormat pattern:
dd/MMM/yyyy:HH:mm:ss Z
Take a look at what would happen if we make our configuration something like this:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
agent.sources.s1.interceptors.e1.serializers=ip dt url sc bc
agent.sources.s1.interceptors.e1.serializers.ip.name=ip_address
agent.sources.s1.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.s1.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
agent.sources.s1.interceptors.e1.serializers.dt.name=timestamp
agent.sources.s1.interceptors.e1.serializers.url.name=http_request
agent.sources.s1.interceptors.e1.serializers.sc.name=status_code
agent.sources.s1.interceptors.e1.serializers.bc.name=bytes_xfered
This would create the following headers for the preceding sample:
{"ip_address":"192.168.1.42","timestamp":"1364588829",
"http_request":"GET/index.htmlHTTP/1.1","status_code":"200",
"bytes_xfered":"1037"}
The body content is unaffected. You'll also notice that I didn't specify the default type for the other serializers, as that is the default.
Note
There is no overwrite checking in this interceptor type. For instance, using the timestamp key will overwrite the event's previous time value if there was one.
You can implement your own serializers for this interceptor by implementing the org.apache.flume.interceptor.RegexExtractorInterceptorSerializer interface. However, if your goal is to move data from the body of an event to the header, you'll probably want to implement a custom interceptor so that you can alter the body contents in addition to setting the header value; otherwise, the data will be effectively duplicated.
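As a rough sketch only, a custom serializer that upper-cases the matched text might look like this. The method set shown here (serialize plus the two configure methods) mirrors the bundled pass-through serializer as I recall it, but you should verify the exact interface against the Flume source for your version before relying on this:

import org.apache.flume.Context;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;

public class UpperCaseSerializer implements RegexExtractorInterceptorSerializer {
  public String serialize(String value) {
    // Transform the matched group before it is stored as a header value
    return value.toUpperCase();
  }

  public void configure(Context context) {
    // No properties needed for this sketch
  }

  public void configure(ComponentConfiguration conf) {
    // No properties needed for this sketch
  }
}

You would then reference it with agent.sources.s1.interceptors.e1.serializers.ser1.type=com.example.UpperCaseSerializer (a hypothetical class name).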
To summarize, let's review the properties for this interceptor:
Key                    Required  Type                                       Default
type                   Yes       String                                     regex_extractor
regex                  Yes       String
serializers            Yes       Space-separated list of serializer names
serializers.NAME.name  Yes       String
serializers.NAME.type  No        default or the FQCN of an implementation  default
serializers.NAME.PROP  No        Serializer-specific properties
Morphline interceptor
As we saw in Chapter 4, Sinks and Sink Processors, a powerful library of transformations from the Kite SDK project backs the MorphlineSolrSink. It should come as no surprise that you can also use these libraries in many places where you'd otherwise be forced to write a custom interceptor. Similar in configuration to its sink counterpart, you only need to specify the Morphline configuration file and, optionally, the Morphline unique identifier (if the configuration specifies more than one Morphline). Here is a summary table of the Flume interceptor configuration:
Key            Required  Type    Default
type           Yes       String  org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFile  Yes       String
morphlineId    No        String  Picks the first Morphline if not specified; it is an error if none is specified and more than one exists
Your Flume configuration might look something like this:
agent.sources.s1.interceptors=i1 m1
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.m1.type=org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.s1.interceptors.m1.morphlineFile=/path/to/morph.conf
agent.sources.s1.interceptors.m1.morphlineId=goMorphy
In this example, we have specified the s1 source on the agent named agent, which contains two interceptors, i1 and m1, processed in that order. The first interceptor is a standard Timestamp interceptor that inserts a timestamp header if none exists. The second will send the event through the Morphline processor specified by the Morphline configuration file, corresponding to the goMorphy ID.
Events processed by this interceptor must produce exactly one output event for every input event. If you need to output multiple records from a single Flume event, you must do this in the MorphlineSolrSink. With interceptors, it is strictly one event in and one event out.
The first command will most likely be the readLine command (or readBlob, readCSV, and so on) to convert the event into a Morphline Record. From there, you run any other Morphline commands you like, with the final Record at the end of the Morphline chain being converted back into a Flume event. We know from Chapter 4, Sinks and Sink Processors, that the _attachment_body special Record key should be byte[] (the same as the event's body). This means that the last command needs to do the conversion (such as toByteArray or writeAvroToByteArray). All other Record keys are converted to String values and set as headers with the same keys.
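As a minimal sketch of what the morph.conf for the goMorphy Morphline above might contain (the command names and their parameters should be checked against the Kite Morphlines reference guide before use):

morphlines : [
  {
    # Must match the morphlineId in the Flume configuration
    id : goMorphy
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Read the event body as a single line into the "message" field
      { readLine { charset : UTF-8 } }
      # ... any other transformation commands go here ...
      # Convert the "message" field back into the byte[] event body
      { toByteArray { field : message } }
    ]
  }
]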
Note
Take a look at the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
The Morphline configuration file syntax has already been discussed in detail in the Morphline configuration files section of Chapter 4, Sinks and Sink Processors.
Custom interceptors
If there is one piece of custom code you will add to your Flume implementation, it will most likely be a custom interceptor. As mentioned earlier, you implement the org.apache.flume.interceptor.Interceptor interface and the associated org.apache.flume.interceptor.Interceptor.Builder interface.
Let's say I needed to URL-decode my event body. The code would look something like this:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class URLDecode implements Interceptor {
  public void initialize() {}

  public Event intercept(Event event) {
    try {
      // Decode the URL-encoded body and store it back on the event
      byte[] decoded = URLDecoder.decode(new String(event.getBody(), "UTF-8"), "UTF-8").getBytes("UTF-8");
      event.setBody(decoded);
    } catch (UnsupportedEncodingException e) {
      // Shouldn't happen. Fall through to the unaltered event.
    }
    return event;
  }

  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public void close() {}

  public static class Builder implements Interceptor.Builder {
    public Interceptor build() {
      return new URLDecode();
    }

    public void configure(Context context) {}
  }
}
Then, to configure my new interceptor, use the fully qualified class name of the Builder class as the type:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=com.example.URLDecode$Builder
For more examples of how to pass and validate properties, look at any of the existing interceptor implementations in the Flume source code.
Keep in mind that any heavy processing in your custom interceptor can affect the overall throughput, so be mindful of object churn or computationally intensive processing in your implementations.
The plugins directory
Custom code (sources, interceptors, and so on) can always be installed alongside the Flume core classes in the $FLUME_HOME/lib directory. You could also specify additional paths for the CLASSPATH on startup by way of the flume-env.sh shell script (which is often sourced at startup time when using a packaged distribution of Flume). Starting with Flume 1.4, there is now a command-line option to specify a directory that contains the custom code. By default, if not specified, the $FLUME_HOME/plugins.d directory is used, but you can override this using the --plugins-path command-line parameter.
Within this directory, each piece of custom code is separated into a subdirectory whose name is of your choosing; pick something that makes it easy for you to keep track of things. In this directory, you can include up to three subdirectories: lib, libext, and native.
The lib directory should contain the JAR file for your custom component. Any dependencies it uses should be added to the libext subdirectory. Both of these paths are added to the Java CLASSPATH variable at startup.
Tip
The Flume documentation implies that this directory separation by component allows for conflicting Java libraries to coexist. In truth, this is not possible unless the underlying implementation makes use of different class loaders within the JVM. In this case, the Flume startup code simply appends all of these paths to the startup CLASSPATH variable, so the order in which the subdirectories are processed will determine the precedence. You cannot even be guaranteed that the subdirectories will be processed in lexicographic order, as the underlying bash shell for loop can give no such guarantees. In practice, you should always try to avoid conflicting dependencies. Code that depends too much on ordering tends to be buggy, especially if your classpath reorders itself from server installation to server installation.
The third directory, native, is where you put any native libraries associated with your custom component. This path, if it exists, gets added to LD_LIBRARY_PATH at startup so that the JVM can locate these native components.
So, if I had three custom components, my directory structure might look something like this:
$FLUME_HOME/plugins.d/base64-enc/lib/base64interceptor.jar
$FLUME_HOME/plugins.d/base64-enc/libext/base64-2.0.0.jar
$FLUME_HOME/plugins.d/base64-enc/native/libFoo.so
$FLUME_HOME/plugins.d/uuencode/lib/uuEncodingInterceptor.jar
$FLUME_HOME/plugins.d/my-avro-serializer/myAvroSerializer.jar
$FLUME_HOME/plugins.d/my-avro-serializer/native/libAvro.so
Keep in mind that all these paths get mashed together at startup, so conflicting versions of libraries should be avoided. This structure is an organizational mechanism for ease of deployment by humans. Personally, I have not used it yet, as I use Chef (or Puppet) to install custom components on my Flume agents directly into $FLUME_HOME/lib, but if you prefer to use this mechanism, it is available.
Tiering flows
In Chapter 1, Overview and Architecture, we talked about tiering your data flows. There are several reasons for you to want to do this. You may want to limit the number of Flume agents that directly connect to your Hadoop cluster, to limit the number of parallel requests. You may also lack sufficient disk space on your application servers to store a significant amount of data while you are performing maintenance on your Hadoop cluster. Whatever your reason or use case, the most common mechanism to chain Flume agents is to use the Avro source/sink pair.
The Avro source/sink
We covered Avro a bit in Chapter 4, Sinks and Sink Processors, when we discussed how to use it as an on-disk serialization format for files stored in HDFS. Here, we'll put it to use in communication between Flume agents. In a typical tiered configuration, each client agent runs an Avro sink that points at the Avro source of a collector agent.
To use the Avro source, you specify the type property with a value of avro. You need to provide a bind address and port number to listen on:
collector.sources=av1
collector.sources.av1.type=avro
collector.sources.av1.bind=0.0.0.0
collector.sources.av1.port=42424
collector.sources.av1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/path/in/hdfs
Here, we have configured the collector agent, which listens on port 42424, uses a memory channel, and writes to HDFS. I've used the memory channel for brevity in this example configuration. Also, note that I've given this agent a different name, collector, just to avoid confusion.
The client agents feeding the collector tier might have a configuration similar to this. I have left the sources off this configuration for brevity:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=42424
The hostname, collector.example.com, has nothing to do with the agent name on that machine; it is the hostname (or you can use an IP) of the target machine with the receiving Avro source. This configuration, named client, would be applied to each of the client agents, assuming all of them had similar source configurations.
As I don't like single points of failure, I would configure two collector agents with the preceding configuration and instead set each client agent to round robin between the two, using a sink group. Again, I've left off the sources for brevity:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1 k2
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collectorA.example.com
client.sinks.k1.port=42424
client.sinks.k2.type=avro
client.sinks.k2.channel=ch1
client.sinks.k2.hostname=collectorB.example.com
client.sinks.k2.port=42424
client.sinkgroups=g1
client.sinkgroups.g1.sinks=k1 k2
client.sinkgroups.g1.processor.type=load_balance
client.sinkgroups.g1.processor.selector=round_robin
client.sinkgroups.g1.processor.backoff=true
There are four additional properties associated with the Avro sink that you may need to adjust from their sensible defaults.
The first is the batch-size property, which defaults to 100. Under heavy loads, you may see better throughput by setting this higher, for example:
client.sinks.k1.batch-size=1024
The next two properties control network connection timeouts. The connect-timeout property, which defaults to 20 seconds (specified in milliseconds), is the amount of time allowed to establish a connection with an Avro source (receiver). The related request-timeout property, which also defaults to 20 seconds (specified in milliseconds), is the amount of time allowed for a sent message to be acknowledged. If I wanted to increase these values to 1 minute, I could add these additional properties:
client.sinks.k1.connect-timeout=60000
client.sinks.k1.request-timeout=60000
Finally, the reset-connection-interval property can be set to force connections to reestablish themselves after some time period. This can be useful when your sink is connecting through a VIP (Virtual IP address) or hardware load balancer to keep things balanced, as services offered behind the VIP may change over time due to failures or changes in capacity. By default, the connections will not be reset except in cases of failure. If you wanted to change this so that the connections reset themselves every hour, for example, you can specify this by setting this property with the number of seconds, as follows:
client.sinks.k1.reset-connection-interval=3600
Compressing Avro
Communication between the Avro source and sink can be compressed by setting the compression-type property to deflate. On the Avro sink, you can additionally set the compression-level property to a number between 1 and 9, with the default being 6. Typically, there are diminishing returns at higher compression levels, but testing may prove that a nondefault value works better for you. Clearly, you need to weigh the additional CPU costs against the overhead of performing a higher level of compression. Typically, the default is fine.
Compressed communications are especially important when you are dealing with high-latency networks, such as sending data between two data centers. Another common use case for compression is where you are charged for the bandwidth consumed, such as with most public cloud services. In these cases, you will probably choose to spend CPU cycles compressing and decompressing your flow rather than sending highly compressible data uncompressed.
Tip
It is very important that if you set the compression-type property on one end of a source/sink pair, you set this property at both ends. Otherwise, a sink could be sending data the source can't consume.
Continuing the preceding example, to add compression, you would add these additional property fields on both agents:
collector.sources.av1.compression-type=deflate
client.sinks.k1.compression-type=deflate
client.sinks.k1.compression-level=7
client.sinks.k2.compression-type=deflate
client.sinks.k2.compression-level=7
SSL Avro flows
Sometimes, communication is of a sensitive nature or may traverse untrusted network paths. You can encrypt your communications between an Avro sink and an Avro source by setting the ssl property to true. Additionally, you will need to pass SSL certificate information to the source (the receiver) and, optionally, the sink (the sender).
I won't claim to be an expert in SSL and SSL certificates, but the general idea is that a certificate can either be self-signed or signed by a trusted third party, such as VeriSign (called a certificate authority, or CA). Generally, because of the cost associated with getting a verified certificate, people only do this for certificates a customer might see in a web browser. Some organizations will have an internal CA who can sign certificates. For the purpose of this example, we'll be using self-signed certificates. This basically means that we'll generate a certificate but not have it signed by any authority. For this, we'll use the keytool utility that comes with all Java installations (the reference can be found at http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html). Here, I create a Java KeyStore (JKS) file that contains a 2048-bit key. When prompted for a password, I'll use the password string. Make sure you use a real password in your nontest environments:
% keytool -genkey -alias flumey -keyalg RSA -keystore keystore.jks -keysize 2048
Enter keystore password: password
Re-enter new password: password
What is your first and last name?
  [Unknown]: Steve Hoffman
What is the name of your organizational unit?
  [Unknown]: Operations
What is the name of your organization?
  [Unknown]: Me, Myself and I
What is the name of your City or Locality?
  [Unknown]: Chicago
What is the name of your State or Province?
  [Unknown]: Illinois
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Steve Hoffman, OU=Operations, O="Me, Myself and I", L=Chicago, ST=Illinois, C=US correct?
  [no]: yes
Enter key password for <flumey>
  (RETURN if same as keystore password):
I can verify the contents of the JKS by running the list subcommand:
% keytool -list -keystore keystore.jks
Enter keystore password: password
Keystore type: JKS
Keystore provider: SUN
Your keystore contains 1 entry
flumey, Nov 11, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1): 5C:BC:3C:7F:7A:E7:77:EB:B5:54:FA:E2:8B:DD:D3:66:36:86:DE:E4
Now that I have a key in my keystore file, I can set the additional properties on the receiving source:
collector.sources.av1.ssl=true
collector.sources.av1.keystore=/path/to/keystore.jks
collector.sources.av1.keystore-password=password
As I am using a self-signed certificate, the sink won't be sure that communications can be trusted, so I have two options. The first is to tell it to just trust all certificates. Clearly, you would only do this on a private network that has some reasonable assurances about its security. To ignore the dubious origin of my certificate, I can set the trust-all-certs property to true, as follows:
client.sinks.k1.ssl=true
client.sinks.k1.trust-all-certs=true
If you didn't set this, you'd see something like this in the logs:
org.apache.flume.EventDeliveryException: Failed to send events…
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem…
Caused by: sun.security.validator.ValidatorException: No trusted certificate found…
For communications over the public Internet or in multitenant cloud environments, more care should be taken. In this case, you can provide a truststore to the sink. A truststore is similar to a keystore, except that it contains certificate authorities you are telling Flume can be trusted. For well-known certificate authorities, such as VeriSign and others, you don't need to specify a truststore, as their identities are already included in your Java distribution (at least for the major ones). You will need to have your certificate signed by a certificate authority and then add the signed certificate file to the source's keystore file. You will also need to include the signing CA's certificate in the keystore so that the certificate chain can be fully resolved.
Once you have added the key, the signed certificate, and the signing CA's certificate to the source, you need to configure the sink (the sender) to trust these authorities so that it will pass the validation step during the SSL handshake. To specify the truststore, set the truststore and truststore-password properties as follows:
client.sinks.k1.ssl=true
client.sinks.k1.truststore=/path/to/truststore.jks
client.sinks.k1.truststore-password=password
Chances are there is somebody in your organization responsible for obtaining the third-party certificates, or someone who can issue an organizational CA-signed certificate to you, so I'm not going to go into details about how to create your own certificate authority. There is plenty of information on the Internet if you choose to take this route. Remember that certificates have an expiration date (usually, a year), so you'll need to repeat this process every year, or communications will abruptly stop when certificates or certificate authorities expire.
The Thrift source/sink
Another source/sink pair you can use to tier your data flows is based on Thrift (http://thrift.apache.org/). Unlike Avro data, which is self-documenting, Thrift uses an external schema, which can be compiled into just about every programming language on the planet. The configuration is almost identical to the Avro example already covered. The preceding collector configuration, rewritten to use Thrift, would look something like this:
collector.sources=th1
collector.sources.th1.type=thrift
collector.sources.th1.bind=0.0.0.0
collector.sources.th1.port=42324
collector.sources.th1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/path/in/hdfs
There is one additional property that sets the maximum number of worker threads in the underlying thread pool. If unset, a value of zero is assumed, which makes the thread pool unbounded, so you should probably set this in your production configurations. Testing should provide you with a reasonable upper limit for your environment. To set the thread pool size to 10, for example, you would add the threads property:
collector.sources.th1.threads=10
The "client" agents would use the corresponding Thrift sink as follows:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1
client.sinks.k1.type=thrift
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=42324
Like its Avro counterpart, there are additional settings for the batch size, connection timeouts, and connection reset intervals that you may want to adjust based on your testing results. Refer to The Avro source/sink section earlier in this chapter for details.
Using command-line Avro
The Avro source can also be used in conjunction with one of the command-line options you may have noticed back in Chapter 2, A Quick Start Guide to Flume. Rather than running flume-ng with the agent parameter, you can pass the avro-client parameter to send one or more files to an Avro source. These are the options specific to avro-client from the help text:
avro-client options:
  --dirname <dir>         directory to stream to avro source
  --host,-H <host>        hostname to which events will be sent (required)
  --port,-p <port>        port of the avro source (required)
  --filename,-F <file>    text file to stream to avro source [default: std input]
  --headerFile,-R <file>  headerFile containing headers as key/value pairs on each new line
  --help,-h               display help text
This variation is very useful for testing, resending data manually due to errors, or importing older data stored elsewhere.
Just like an Avro sink, you have to specify the hostname and port you will be sending data to. You can send a single file with the --filename option or all the files in a directory with the --dirname option. If you specify neither of these, stdin will be used. Here is how you might send a file named foo.log to the Flume agent we previously configured:
$ ./flume-ng avro-client --filename foo.log --host collector.example.com --port 42424
Each line of the input will be converted into a single Flume event.
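Since stdin is the default when no file or directory is given, you can also pipe data in; for instance, this illustrative one-liner (the log path is an assumption) would replay the last 100 lines of a log:

$ tail -n 100 /var/log/app.log | ./flume-ng avro-client --host collector.example.com --port 42424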
Optionally, you can specify a file containing key/value pairs to set Flume header values. The file uses Java property file syntax. Suppose I had a file named headers.properties containing:
pointOfSale=US
environment=staging
Then, including the --headerFile option would set these two headers on every event created:
$ ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424
The Log4J appender
As we discussed in Chapter 5, Sources and Channel Selectors, there are issues that may arise from using a filesystem file as a source. One way to avoid this problem is to use the Flume Log4J Appender in your Java application(s). Under the hood, it uses the same Avro communication that the Avro sink uses, so you need only configure it to send data to an Avro source.
The Appender has two properties, which are shown here in XML:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.Log4jAppender">
<paramname="Hostname"value="collector.example.com"/>
<paramname="Port"value="42424"/>
</appender>
The format of the body will be dictated by the Appender's configured layout (not shown). The log4j fields that get mapped to Flume headers are summarized in this table:
Flume header key                     Log4J logging event field
flume.client.log4j.logger.name       event.getLoggerName()
flume.client.log4j.log.level         event.getLevel() as a number. See org.apache.log4j.Level for the mappings.
flume.client.log4j.timestamp         event.getTimeStamp()
flume.client.log4j.message.encoding  N/A (always UTF8)
flume.client.log4j.logger.other      You will only see this if there was a problem mapping one of the above fields; normally, this won't be present.
Refer to http://logging.apache.org/log4j/1.2/ for more details on using Log4J.
You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender.
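If your application configures Log4J with a properties file rather than XML, a hedged equivalent of the preceding appender might look like this (the root logger level and layout pattern are illustrative choices):

log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=collector.example.com
log4j.appender.flume.Port=42424
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m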
Keep in mind that if there is a problem sending data to the Avro source, the appender will throw an exception and the log message will be dropped, as there is no place to put it. Keeping it in memory could quickly overload your JVM heap, which is usually considered worse than dropping the data record.
The Log4J load-balancing appender
I'm sure you noticed that the preceding Log4jAppender only has a single hostname/port in its configuration. If you wanted to spread the load across multiple collector agents, either for additional capacity or for fault tolerance, you can use the LoadBalancingLog4jAppender. This appender has a single required property named Hosts, which is a space-separated list of hostname and port number pairs separated by a colon, as follows:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
<paramname="Hosts"value="server1:42424server2:42424"/>
</appender>
There is an optional property, Selector, which specifies the method by which you want to load balance. Valid values are RANDOM and ROUND_ROBIN. If not specified, the default is ROUND_ROBIN. You can implement your own selector, but that is outside the scope of this book. If you are interested, go have a look at the well-documented source code for the LoadBalancingLog4jAppender class.
Finally, there is another optional property to override the maximum time for exponential backoff when a server cannot be contacted. Initially, if a server cannot be contacted, 1 second will need to pass before that server is tried again. Each time the server is unavailable, the retry time doubles, up to a default maximum of 30 seconds. If we wanted to increase this maximum to 2 minutes, we could specify a MaxBackoff property in milliseconds, as follows:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
<paramname="Hosts"value="server1:42424server2:42424"/>
<paramname="Selector"value="RANDOM"/>
<paramname="MaxBackoff"value="120000"/>
</appender>
In this example, we have also overridden the default ROUND_ROBIN selector to use random selection.
The embedded agent
If you are writing a Java program that creates data, you may choose to send the data directly as structured data using a special mode of Flume called the Embedded Agent. It is basically a simple single source/single channel Flume agent that you run inside your JVM.
There are benefits and drawbacks to this approach. On the positive side, you don't need to monitor an additional process on your servers to relay data. The embedded channel also allows the data producer to continue executing its code immediately after queuing the event to the channel. The SinkRunner thread handles taking events from the channel and sending them to the configured sinks. Even if you didn't use embedded Flume to perform this handoff from the calling thread, you would most likely use some kind of synchronized queue (such as BlockingQueue) to isolate the sending of the data from the main execution thread. Using embedded Flume provides the same functionality without you having to worry about whether you've written your multithreaded code correctly.
The major drawback to embedding Flume in your application is the added memory pressure on your JVM's garbage collector. If you are using an in-memory channel, any unsent events are held in the heap and get in the way of cheap garbage collection of short-lived objects. However, this is why you set a channel size: to keep the maximum memory footprint at a known quantity. Furthermore, any configuration changes will require an application restart, which can be problematic if your overall system doesn't have sufficient redundancy built in to tolerate restarts.
Configuration and startup
Assuming you are not dissuaded (and you shouldn't be), you will first need to include the flume-ng-embedded-agent library (and dependencies) in your Java project. Depending on which build system you are using (Maven, Ivy, Gradle, and so on), the exact format will differ, so I'll just show you the Maven configuration here. You can look up the alternative formats at http://mvnrepository.com/:
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-embedded-agent</artifactId>
<version>1.5.2</version>
</dependency>
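For reference, the equivalent dependency in a Gradle build script of this era would look something like this (assuming you have already applied the java plugin and configured a repository; this is my illustration, not from the book):
dependencies {
    compile 'org.apache.flume:flume-ng-embedded-agent:1.5.2'
}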
Start by creating an EmbeddedAgent object in your Java code by calling the constructor (and passing a string name, which is used only in error messages):
EmbeddedAgent toHadoop = new EmbeddedAgent("myData");
Next, you have to set properties for the channel, sinks, and sink processor via the configure() method, as shown in this example. Most of this configuration should look very familiar to you at this point:
Map<String, String> config = new HashMap<String, String>();
config.put("channel.type", "memory");
config.put("channel.capacity", "75");
config.put("sinks", "s1 s2");
config.put("sink.s1.type", "avro");
config.put("sink.s1.hostname", "foo.example.com");
config.put("sink.s1.port", "12345");
config.put("sink.s1.compression-type", "deflate");
config.put("sink.s2.type", "avro");
config.put("sink.s2.hostname", "bar.example.com");
config.put("sink.s2.port", "12345");
config.put("sink.s2.compression-type", "deflate");
config.put("processor.type", "failover");
config.put("processor.priority.s1", "10");
config.put("processor.priority.s2", "20");
toHadoop.configure(config);
Here, we define a memory channel with a capacity of 75 events, along with two Avro sinks in an active/standby configuration (first to foo.example.com, and if that fails, to bar.example.com) using Avro serialization with compression. Refer to Chapter 3, Channels, for specific settings for memory- or file-backed channel properties.
The sink processor only comes into play if you have more than one sink defined in your sinks property. Unlike the optional sink groups covered in Chapter 4, Sinks and Sink Processors, you need to specify a sink list even if it contains just one sink. Clearly, there is no behavioral difference between failover and load balancing when there is only one sink. Refer to each specific sink's properties in Chapter 4, Sinks and Sink Processors, for specific configuration parameters.
Finally, before you start using this agent, you need to call the start() method to instantiate everything based on your properties and start all the background processing threads, as follows:
toHadoop.start();
The class is now ready to start receiving and forwarding data.
You'll want to keep a reference to this object around for cleanup later, as well as to pass it to other objects that will be sending data, as the underlying channel provides thread safety. Personally, I use the Spring Framework's dependency injection to configure and pass references at startup rather than doing it programmatically, as I've shown in this example. Refer to the Spring website for more information (http://spring.io/), as proper use of Spring is a whole book unto itself, and it is not the only dependency injection framework available to you.
Sending data
Data can be sent either as single events or in batches. Batches can sometimes be more efficient in high-volume data streams.
To send a single event, just call the put() method, as shown in this example:
Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF-8"));
toHadoop.put(e);
Here,I’musingoneofmanymethodsavailableontheorg.apache.flume.event.EventBuilderclass.Hereisacompletelistofmethodsyoucanusetoconstructeventswiththishelperclass:
EventBuilder.withBody(byte[]body,Map<String,String>headers);
EventBuilder.withBody(byte[]body);
EventBuilder.withBody(Stringbody,Charsetcharset,Map<String,String>
headers);
EventBuilder.withBody(Stringbody,Charsetcharset);
In order to send several events in one batch, you can call the putAll() method with a list of events, as shown in this example, which also includes a time header:
Map<String, String> headers = new HashMap<String, String>();
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("First".getBytes("UTF-8"), headers));
events.add(EventBuilder.withBody("Second".getBytes("UTF-8"), headers));
toHadoop.putAll(events);
As interceptors are not supported, you will need to add any headers you want on the events programmatically in Java before you call put() or putAll().
Shutdown
Finally, as there are background threads waiting for data to arrive on your embedded channel, most likely holding persistent connections to the configured destinations, you'll want Flume to do its cleanup when it is time to shut down your application. Simply call the stop() method on the configured EmbeddedAgent as follows:
toHadoop.stop();
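Putting the whole lifecycle together, here is a minimal, self-contained sketch of an embedded agent. The configuration values are the ones from this section; the class name and the simplified single-sink setup are mine, for brevity:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
  public static void main(String[] args) throws EventDeliveryException {
    // The name is used only in error messages.
    EmbeddedAgent toHadoop = new EmbeddedAgent("myData");

    // Memory channel plus a single Avro sink; a sink list is
    // required even when there is only one sink.
    Map<String, String> config = new HashMap<String, String>();
    config.put("channel.type", "memory");
    config.put("channel.capacity", "75");
    config.put("sinks", "s1");
    config.put("sink.s1.type", "avro");
    config.put("sink.s1.hostname", "foo.example.com");
    config.put("sink.s1.port", "12345");

    toHadoop.configure(config);
    toHadoop.start(); // instantiate everything, start background threads

    try {
      Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF-8"));
      toHadoop.put(e); // returns once the event is queued on the channel
    } finally {
      toHadoop.stop(); // clean up connections and threads on shutdown
    }
  }
}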
Routing
The routing of data to different destinations based on content should be fairly straightforward now that you've been introduced to all the various mechanisms in Flume.
The first step is to get the data you want to switch on into a Flume header by means of a source-side interceptor, if the header isn't already available. The second step is to use a Multiplexing Channel Selector on that header value to switch the data to an alternate channel.
For instance, let's say you wanted to capture all exceptions to HDFS. In this configuration, you can see events coming in on the s1 source via Avro on port 42424. The event is tested to see whether the body contains the text Exception. If it does, it creates an exception header key (with the value Exception). This header is used to switch these events to channel c1, and ultimately, HDFS. If the event didn't match the pattern, it wouldn't have the exception header and would get passed to the c2 channel via the default selector, where it would be forwarded via Avro serialization to port 12345 on the server foo.example.com:
agent.sources=s1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=42424
agent.sources.s1.channels=c1 c2
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(Exception)
agent.sources.s1.interceptors.i1.serializers=ex
agent.sources.s1.interceptors.i1.serializers.ex.name=exception
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=exception
agent.sources.s1.selector.mapping.Exception=c1
agent.sources.s1.selector.default=c2
agent.channels=c1 c2
agent.channels.c1.type=memory
agent.channels.c2.type=memory
agent.sinks=k1 k2
agent.sinks.k1.type=hdfs
agent.sinks.k1.channel=c1
agent.sinks.k1.hdfs.path=/logs/exceptions/%Y/%m/%d/%H
agent.sinks.k2.type=avro
agent.sinks.k2.channel=c2
agent.sinks.k2.hostname=foo.example.com
agent.sinks.k2.port=12345
Summary
In this chapter, we covered various interceptors shipped with Flume, including:
Timestamp: This is used to add a timestamp header, possibly overwriting an existing one.
Host: This is used to add the Flume agent hostname or IP as a header in the event.
Static: This is used to add static String headers.
Regular expression filtering: This is used to include or exclude events based on a matched regular expression.
Regular expression extractor: This is used to create headers from matched regular expressions. It's useful for routing with Channel Selectors.
Morphline: This is used to delegate transformation to a Morphline command chain.
Custom: This is used to create any custom transformations you need that you can't find elsewhere.
We also covered tiering data flows using the Avro source and sink. Optional compression and SSL with Avro flows were covered as well. Finally, Thrift sources and sinks were briefly covered, as some environments may already have Thrift data flows to integrate with.
Next, we introduced two Log4J Appenders, a single-path version and a load-balancing version, for direct integration with Java applications.
The Embedded Flume Agent was covered for those wishing to directly integrate basic Flume functionality into their Java applications.
Finally, we gave you an example of using interceptors in conjunction with a Channel Selector to provide routing decision logic.
In the next chapter, we will dive into an example that stitches together everything we have covered so far.
Chapter 7. Putting It All Together
Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, nor does it cover every possible scenario you might need, but I think it should cover a couple of common use cases I've seen over and over:
Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing
In the first situation, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single place that you can search means getting your systems restored quickly. In the second scenario, you are interested in capturing data in the long term for analytics and machine learning.
Weblogs to searchable UI
Let's simulate a web application by setting up a simple web server whose logs we want to stream into some searchable application. In this case, we'll be using a Kibana UI to perform ad hoc queries against Elasticsearch.
For this example, I'll start three servers in Amazon's Elastic Compute Cloud (EC2), as shown in this diagram:
Each server has a public IP (starting with 54) and a private IP (starting with 172). For interserver communication, I'll be using the private IPs in my configurations. My personal interaction with the web server (to simulate traffic) and with Kibana and Elasticsearch will require the public IPs, since I'm sitting in my house and not in Amazon's data centers.
Note: Pay careful attention to the IP addresses in the shell prompts, as we will be jumping from machine to machine, and I don't want you to get lost. For instance, on the Collector box, the prompt will contain its private IP: [ec2-user@ip-172-31-26-205 ~]$
If you try this out yourself in EC2, you will get different IP assignments, so adjust the configuration files and URLs referenced as needed. For this example, I'll be using Amazon's Linux AMI for the operating system and the t1.micro server size. Also, be sure to adjust the security groups to allow network traffic between these servers (and yourself). When in doubt about connectivity, use utilities such as telnet or nc to test connections between servers and ports.
Tip: While going through this exercise, I made mistakes in my security groups more than once, so perform these connectivity tests whenever you see connection exceptions.
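For example, a quick TCP reachability check from the web server to the collector's Avro port (using this example's private IP and port; the exact flags depend on which netcat variant your distribution ships) might look like this:
[ec2-user@ip-172-31-18-146 ~]$ nc -zv 172.31.26.205 12345
nc will report whether the connection succeeded, which tells you quickly whether a security group or a stopped agent is the problem.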
Setting up the web server
Let's start by setting up the web server. For this, we'll install the popular Nginx web server (http://nginx.org). Since web server logs are pretty much standardized nowadays, I could have chosen any web server, but the point here is that this application writes its logs (by default) to a file on the disk. Since I've warned you away from trying to use the tail program to stream them, we'll be using a combination of logrotate and the Spooling Directory Source to ingest the data. First, let's log in to the server. The default account with sudo for an Amazon Linux server is ec2-user. You'll need to pass this in your ssh command:
% ssh 54.69.112.61 -l ec2-user
The authenticity of host '54.69.112.61 (54.69.112.61)' can't be established.
RSA key fingerprint is dc:ee:7a:1a:05:e1:36:cd:a4:81:03:97:48:8d:b3:cc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.69.112.61' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2014.09-release-notes/
18 package(s) needed for security, out of 41 available
Run "sudo yum update" to apply all updates.
Now that you are logged in to the server, let's install Nginx (I'm not going to show all of the output to save paper):
[ec2-user@ip-172-31-18-146 ~]$ sudo yum -y install nginx
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 1:1.6.2-1.22.amzn1 will be installed
[SNIP]
Installed:
  nginx.x86_64 1:1.6.2-1.22.amzn1
Dependency Installed:
  GeoIP.x86_64 0:1.4.8-1.5.amzn1           gd.x86_64 0:2.0.35-11.10.amzn1
  gperftools-libs.x86_64 0:2.0-11.5.amzn1  libXpm.x86_64 0:3.5.10-2.9.amzn1
  libunwind.x86_64 0:1.1-2.1.amzn1
Complete!
Next, let's start the Nginx web server:
[ec2-user@ip-172-31-18-146 ~]$ sudo /etc/init.d/nginx start
Starting nginx: [ OK ]
At this point, the web server should be running, so let's use our computer's web browser to go to the public IP, http://54.69.112.61/ in this case. If it is working, you should see the default welcome page, like this:
To help generate sample input, you can use a load generator program such as Apache benchmark (http://httpd.apache.org/docs/2.4/programs/ab.html) or wrk (https://github.com/wg/wrk). Here is the command I used to generate data for 20 minutes at a time:
% wrk -c 2 -d 20m http://54.69.112.61
Running 20m test @ http://54.69.112.61
  2 threads and 2 connections
Configuring log rotation to the spool directory
By default, the server's access logs are written to /var/log/nginx/access.log. The problem we need to overcome here is moving the log data to a separate directory while the file is still held open by the nginx process. This is where logrotate comes into the picture. Let's first create a spool directory that will become the input path for the Spooling Directory Source later on. For the purpose of this example, I'll just create the spool directory in the home directory of ec2-user. In a production environment, you would place the spool directory somewhere else:
[ec2-user@ip-172-31-18-146 ~]$ pwd
/home/ec2-user
[ec2-user@ip-172-31-18-146 ~]$ mkdir spool
Let's also create a logrotate configuration in the home directory of ec2-user. Open an editor and paste these contents (shown after the prompt) into a file called accessRotate.conf:
[ec2-user@ip-172-31-18-146 ~]$ cat accessRotate.conf
/var/log/nginx/access.log {
  missingok
  notifempty
  rotate 0
  copytruncate
  sharedscripts
  olddir /home/ec2-user/spool
  postrotate
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
  endscript
}
Let’sgooverthissothatyouunderstandwhatitisdoing.Thisisbynomeansacompleteintroductiontothelogrotateutility.Readtheonlinedocumentationathttp://linuxconfig.org/logrotateformoredetails.
Whenlogrotateruns,itwillcopy/var/log/nginx/access.logtothe/home/ec2-user/spooldirectory,butonlyifithasanonzerolength(meaning,thereisdatatosend).Therotate0commandtellslogrotatenottoremoveanyfilesfromthetargetdirectory(sincetheFlumesourcewilldothatafterithassuccessfullytransmittedeachfile).We’llmakeuseofthecopytruncatefeaturebecauseitkeepsthefileopenasfarasthenginxprocessisconcerned,whileresettingittozerolength.Thisavoidstheneedtosignalnginxthatarotationhasjustoccurred.Thedestinationoftherotatedfileinthespooldirectoryis/home/ec2-user/spool/access.log.1.Next,inthepostrotatescriptsection,wewillchangetheownershipofthefilefromtherootusertoec2-usersothattheFlumeagentcanreadit,andlater,deleteit.Finally,werenamethefilewithauniquetimestampsothatwhenthenextrotationoccurs,theaccess.log.1filewillnotexist.Thisalsomeansthatwe’llneedtoaddanexclusionruletotheFlumesourcelatertokeepitfromreadingaccess.log.1fileuntilthecopyingprocesshascompleted.Oncerenamed,itisreadytobeconsumedbyFlume.
Nowlet’stryrunningitindebugmode,whichwillbejustadryrunsothatwecanseewhatitcando.Normally,logrotatehasadailyrotationsettingthatpreventsitfromdoinganythingunlessenoughtimehaspassed:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -d /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log 1048576 bytes (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
log does not need rotating
not running postrotate script, since no logs were rotated
Therefore, we will also include the -f (force) flag to override the preceding behavior:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 0
dateext suffix '-20141207'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
renaming /home/ec2-user/spool/access.log.1 to /home/ec2-user/spool/access.log.2 (rotatecount 1, logstart 1, i 1),
renaming /home/ec2-user/spool/access.log.0 to /home/ec2-user/spool/access.log.1 (rotatecount 1, logstart 1, i 0),
copying /var/log/nginx/access.log to /home/ec2-user/spool/access.log.1
truncating /var/log/nginx/access.log
running postrotate script
running script with arg /var/log/nginx/access.log: "
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
"
removing old log /home/ec2-user/spool/access.log.2
As you can see, logrotate will copy the log to the spool directory with the .1 extension as expected, followed by our script block at the end to change permissions and rename it with a unique timestamp.
Now let's run it again, but this time for real, without the debug flag, and then we will list the source and target directories to see what happened:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf
[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 188
-rw-r--r-- 1 ec2-user ec2-user 189344 Dec  7 17:27 access.log.1417973241
[ec2-user@ip-172-31-18-146 ~]$ ls -l /var/log/nginx/
total 4
-rw-r--r-- 1 root root   0 Dec  7 17:27 access.log
-rw-r--r-- 1 root root 520 Dec  7 16:44 error.log
[ec2-user@ip-172-31-18-146 ~]$
You can see the access log is empty and the old contents have been copied to the spool directory, with correct permissions and the filename ending in a unique timestamp.
Let’salsoverifythattheemptyaccesslogwillresultinnoactionifrunwithoutanydata.You’llneedtoincludethedebugflagtoseewhattheprocessisthinking:
[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-df/home/ec2-
user/accessRotate.conf
readingconfigfileaccessRotate.conf
readingconfiginfofor/var/log/nginx/access.log
olddirisnow/home/ec2-user/spool
Handling1logs
rotatingpattern:/var/log/nginx/access.logforcedfromcommandline(1
rotations)
olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs
areremoved
consideringlog/var/log/nginx/access.log
logdoesnotneedrotating
Now that we have a working rotation script, we need something to run it periodically (not in debug mode) at some chosen interval. For this, we will use the cron daemon (http://en.wikipedia.org/wiki/Cron) by creating a file in the /etc/cron.d directory:
[ec2-user@ip-172-31-18-146 ~]$ cat /etc/cron.d/rotateLogsToSpool
# Move files to spool directory every 5 minutes
*/5 * * * * root /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf
Here, I've indicated a five-minute interval and specified that the job run as the root user. Since the root user owns the original log file, we need elevated access to reassign ownership to ec2-user so that the file can be removed later by Flume.
Once you've saved this cron configuration file, you should see our script run every 5 minutes (along with other processes) by inspecting the cron daemon's log file:
[ec2-user@ip-172-31-18-146 ~]$ sudo tail /var/log/cron
Dec  7 17:45:01 ip-172-31-18-146 CROND[22904]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:50:01 ip-172-31-18-146 CROND[22929]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' started
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' terminated
Dec  7 17:55:01 ip-172-31-18-146 CROND[22955]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:00:01 ip-172-31-18-146 CROND[23046]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:01:01 ip-172-31-18-146 CROND[23060]: (root) CMD (run-parts /etc/cron.hourly)
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23060]: starting 0anacron
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23069]: finished 0anacron
Dec  7 18:05:01 ip-172-31-18-146 CROND[23117]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
An inspection of the spool directory also shows new files created at our arbitrarily chosen 5-minute interval:
[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 1072
-rw-r--r-- 1 ec2-user ec2-user 145029 Dec  7 18:02 access.log.1417975373
-rw-r--r-- 1 ec2-user ec2-user 164169 Dec  7 18:05 access.log.1417975501
-rw-r--r-- 1 ec2-user ec2-user 166779 Dec  7 18:10 access.log.1417975801
-rw-r--r-- 1 ec2-user ec2-user 613785 Dec  7 18:15 access.log.1417976101
Keep in mind that the Nginx RPM package also installed a rotation configuration located at /etc/logrotate.d/nginx to perform daily rotations. If you are going to use this example in production, you'll want to remove that configuration, since we don't want it clashing with our more frequently running cron script. I'll leave handling error.log as an exercise for you. You'll either want to send it someplace (and have Flume remove it) or rotate it periodically so that your disk doesn't fill up over time.
Setting up the target – Elasticsearch
Now let's move to the other server in our diagram and set up Elasticsearch, our destination for searchable data. I'm going to be using the instructions found at http://www.elasticsearch.org/overview/elkdownloads/. First, let's download and install the RPM package:
[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 18:25:35--  https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.92MB/s   in 2.5s
2014-12-07 18:25:38 (9.92 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]
[ec2-user@ip-172-31-26-120 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1    ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start
The installation is kind enough to tell me how to configure the service to start automatically on system boot-up, and it also tells me how to launch the service now, so let's do that:
[ec2-user@ip-172-31-26-120 ~]$ sudo /sbin/chkconfig --add elasticsearch
[ec2-user@ip-172-31-26-120 ~]$ sudo service elasticsearch start
Starting elasticsearch: [ OK ]
Let’sperformaquicktestwithcurltoverifythatitisrunningonthedefaultport,whichis9200:
[ec2-user@ip-172-31-26-120~]$curlhttp://localhost:9200/
{
"status":200,
"name":"Siege",
"cluster_name":"elasticsearch",
"version":{
"number":"1.4.1",
"build_hash":"89d3241d670db65f994242c8e8383b169779e2d4",
"build_timestamp":"2014-11-26T15:49:29Z",
"build_snapshot":false,
"lucene_version":"4.10.2"
},
"tagline":"YouKnow,forSearch"
}
Since the indexes are created as data comes in, and we haven't sent any data, we expect to see no indexes created yet:
[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
All we see is the header with no indexes listed, so let's head over to the collector server and set up the Flume relay that will write data to Elasticsearch.
Setting up Flume on collector/relay
The Flume configuration on the Flume collector will be compressed Avro coming in and Elasticsearch going out. Go ahead and log in to the collector server (172.31.26.205).
Let's start by downloading the Flume binary from the Apache website. Follow the download link and select a mirror:
[ec2-user@ip-172-31-26-205 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 19:50:30--  http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'
100%[==========================>] 25,323,459  8.38MB/s   in 2.9s
2014-12-07 19:50:33 (8.38 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]
Next, expand the archive and change directories:
[ec2-user@ip-172-31-26-205 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
Create the configuration file, called collector.conf, in Flume's configuration directory using your favorite editor:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es
collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.channels.m1.type=memory
collector.channels.m1.capacity=10000
collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120
Here, you can see that we are using a simple memory channel configured with a capacity of 10,000 events.
The source is configured to accept compressed Avro on port 12345 and pass it to our memory channel.
Finally, the sink is configured to write to Elasticsearch on the server we just set up at the 172.31.26.120 private IP. We are using the default settings, which means it'll write to the index named flume-YYYY-MM-DD with the log type.
Let's try running the Flume agent:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console
You'll see an exception in the log, including something like this:
2014-12-07 20:00:13,184 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/elasticsearch/common/io/BytesStream
The Flume agent can't find the Elasticsearch classes. Remember that these are not packaged with Flume, as the libraries need to be compatible with the version of Elasticsearch you are running. Looking back at the Elasticsearch server, we can get an idea of what we need. Remember that this list includes many runtime server dependencies, so it is probably more than what you'll need for a functional Elasticsearch client:
[ec2-user@ip-172-31-26-120 ~]$ rpm -qil elasticsearch | grep jar
/usr/share/elasticsearch/lib/elasticsearch-1.4.1.jar
/usr/share/elasticsearch/lib/groovy-all-2.3.2.jar
/usr/share/elasticsearch/lib/jna-4.1.0.jar
/usr/share/elasticsearch/lib/jts-1.13.jar
/usr/share/elasticsearch/lib/log4j-1.2.17.jar
/usr/share/elasticsearch/lib/lucene-analyzers-common-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-expressions-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-grouping-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-highlighter-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-join-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-memory-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-misc-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queries-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queryparser-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-sandbox-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-spatial-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-suggest-4.10.2.jar
/usr/share/elasticsearch/lib/sigar/sigar-1.6.4.jar
/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar
Really, you only need the elasticsearch.jar file and its dependencies, but we are going to be lazy and just download the RPM again on the collector machine, and copy the JAR files to Flume:
[ec2-user@ip-172-31-26-205 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 20:03:38--  https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.70MB/s   in 2.6s
2014-12-07 20:03:41 (9.70 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1    ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start
This time we will not configure the service to start. Instead, we'll copy the JAR files we need into Flume's plugins directory architecture, which we learned about in the previous chapter:
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ mkdir -p plugins.d/elasticsearch/libext
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cp /usr/share/elasticsearch/lib/*.jar plugins.d/elasticsearch/libext/
Now try running the Flume agent again:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console
No exceptions this time around, but still no data. Let's go back to the web server machine and set up the final Flume agent.
Setting up Flume on the client
On the first server, the web server, let's download the Flume binaries again and expand the package:
[ec2-user@ip-172-31-18-146 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 20:12:53--  http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'
100%[=========================>] 25,323,459  8.13MB/s   in 3.0s
2014-12-07 20:12:56 (8.13 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]
[ec2-user@ip-172-31-18-146 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-18-146 ~]$ cd apache-flume-1.5.2-bin
This time, our Flume configuration takes the spool directory we set up before, fed by logrotate, as input, and it needs to write compressed Avro to the collector server. Create the client.conf file using your favorite editor:
[ec2-user@ip-172-31-18-146 apache-flume-1.5.2-bin]$ cat conf/client.conf
client.sources=sd
client.channels=m1
client.sinks=av
client.sources.sd.type=spooldir
client.sources.sd.spoolDir=/home/ec2-user/spool
client.sources.sd.deletePolicy=immediate
client.sources.sd.ignorePattern=access.log.1$
client.sources.sd.channels=m1
client.channels.m1.type=memory
client.channels.m1.capacity=10000
client.sinks.av.type=avro
client.sinks.av.hostname=172.31.26.205
client.sinks.av.port=12345
client.sinks.av.compression-type=deflate
client.sinks.av.channel=m1
Again, for simplicity, we are using a memory channel with a 10,000-record capacity.
For the source, we configure the Spooling Directory Source with /home/ec2-user/spool as the input. Additionally, we configure the deletion policy to remove the files after sending is complete. This is also where we set up the exclusion rule for the access.log.1 filename pattern mentioned earlier. Note the dollar sign at the end of the pattern, denoting the end of the line. Without this, the exclusion pattern would also exclude valid files, such as access.log.1417975373.
Finally, an Avro sink is configured to point at the collector's private IP and port 12345. Additionally, we set the compression so that it matches the receiving Avro source's settings.
Nowlet’stryrunningtheagent:
[[email protected]]$./bin/flume-ngagent-n
client-cconf-fconf/client.conf-Dflume.root.logger=INFO,console
Noexceptions!Butmoreimportantly,Iseethelogfilesinthespooldirectorybeingprocessedanddeleted:
2014-12-07 20:59:04,041 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976401
2014-12-07 20:59:05,319 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976701
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417977001
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
2014-12-07 20:59:06,746 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
Once all the files have been processed, the last lines are repeated every 500 milliseconds. This is a known bug in Flume (https://issues.apache.org/jira/browse/FLUME-2385). It has already been fixed and is slated for the 1.6.0 release, so be sure to set up log rotation on your Flume agent's own log and clean up this mess before you run out of disk space.
On the collector box, we see writes occurring to Elasticsearch:
2014-12-07 22:10:03,694 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.execute(ElasticSearchTransportClient.java:181)] Sending bulk to elasticsearch cluster
Querying the Elasticsearch REST API, we can see an index with records:
[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index            pri rep docs.count docs.deleted store.size pri.store.size
yellow open   flume-2014-12-07   5   1      32102            0      1.3mb          1.3mb
Let’sreadthefirst5records:
[ec2-user@ip-172-31-26-120elasticsearch]$curl-XGET
'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
"took":26,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":32102,
"max_score":1.0,
"hits":[{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVVbObD75ecNqJg",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJj",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJo",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJt",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJy",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
}]
}
}
As you can see, the log lines are in there under the @message field, and they can now be searched. However, we can do better. Let's break that message down into searchable fields.
Creating more search fields with an interceptor
Let's borrow some code from what we covered earlier in this book to extract some Flume headers from this common log format, knowing that all Flume headers will become fields in Elasticsearch. Since we are creating fields to be searched in Elasticsearch, I'm going to add them to the collector's configuration rather than the web server's Flume agent.
Change the agent configuration on the collector to include a Regular Expression Extractor interceptor:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es
collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.sources.av.interceptors=e1
collector.sources.av.interceptors.e1.type=regex_extractor
collector.sources.av.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
collector.sources.av.interceptors.e1.serializers=ip dt url sc bc
collector.sources.av.interceptors.e1.serializers.ip.name=source
collector.sources.av.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
collector.sources.av.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
collector.sources.av.interceptors.e1.serializers.dt.name=timestamp
collector.sources.av.interceptors.e1.serializers.url.name=http_request
collector.sources.av.interceptors.e1.serializers.sc.name=status_code
collector.sources.av.interceptors.e1.serializers.bc.name=bytes_xfered
collector.channels.m1.type=memory
collector.channels.m1.capacity=10000
collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120
To follow the Logstash convention, I've renamed the hostname header to source. Now, to make the new format easy to find, I'm going to delete the existing index:
[ec2-user@ip-172-31-26-120 ~]$ curl -XDELETE 'http://localhost:9200/flume-2014-12-07/'
{"acknowledged":true}
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "error" : "IndexMissingException[[flume-2014-12-07] missing]",
  "status" : 404
}
Next, I create more traffic on my web server and wait for it to appear at the other end. Then I query some records to see what they look like:
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12083,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_H",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:15 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:15.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975995000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_M",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_R",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_W",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_a",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    } ]
  }
}
As you can see, we now have additional fields that we can search by. Additionally, you can see that Elasticsearch has taken our millisecond-based timestamp field and created its own @timestamp field in ISO-8601 format.
Now we can run some more interesting queries, such as finding out how many successful (status 200) pages we saw:
[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_count' -d '{"query":{"term":{"status_code":"200"}}}'
{"count":46018,"_shards":{"total":5,"successful":5,"failed":0}}
Setting up a better user interface – Kibana
While it appears as if we are done, there is one more thing we should do. The data is all there, but you need to be an expert in querying Elasticsearch to make good use of the information. After all, if the data is difficult to consume and gets ignored, then why bother collecting it at all? What you really need is a nice, searchable web interface that humans with nontechnical backgrounds can use. For this, we are going to set up Kibana. In a nutshell, Kibana is a web application that runs as a dynamic HTML page in your browser, making calls for data to Elasticsearch when necessary. The result is an interactive web interface that doesn't require you to learn the details of the Elasticsearch query API. This is not the only option available to you; it is just what I'm using in this example. Let's download Kibana 3 from the Elasticsearch website and install it on the same server (although you could easily serve this from another HTTP server):
[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
--2014-12-07 18:31:16--  https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074306 (1.0M) [application/octet-stream]
Saving to: 'kibana-3.1.2.tar.gz'
100%[======================================>] 1,074,306  1.33MB/s   in 0.8s
2014-12-07 18:31:17 (1.33 MB/s) - 'kibana-3.1.2.tar.gz' saved [1074306/1074306]
[ec2-user@ip-172-31-26-120 ~]$ tar -zxf kibana-3.1.2.tar.gz
[ec2-user@ip-172-31-26-120 ~]$ cd kibana-3.1.2
Note: At the time of writing this book, a newer version of Kibana (Kibana 4) is in beta. These instructions may be outdated by the time this book is released, but rest assured that somewhere in the Kibana setup, you can edit a configuration file to point to your Elasticsearch server. I have not tried out the newer version yet, but there is a good overview of it on the Elasticsearch blog at http://www.elasticsearch.org/blog/kibana-4-beta-3-now-more-filtery.
Open the config.js file to edit the line with the elasticsearch key. This needs to be the URL of the public name or IP of the Elasticsearch API. For our example, this line should look as shown here:
elasticsearch: 'http://ec2-54-148-230-252.us-west-2.compute.amazonaws.com:9200',
Now we need to serve this directory with a web server. Let's download and install Nginx again:
[ec2-user@ip-172-31-26-120 ~]$ sudo yum install nginx
By default, the root directory is /usr/share/nginx/html. We could change this configuration in Nginx, but to make things easy, let's just create a symbolic link to point to the right location. First, move the original path out of the way by renaming it:
[ec2-user@ip-172-31-26-120 ~]$ sudo mv /usr/share/nginx/html /usr/share/nginx/html.dist
Next, link the configured Kibana directory as the new web root:
[ec2-user@ip-172-31-26-120 ~]$ sudo ln -s ~/kibana-3.1.2 /usr/share/nginx/html
Finally, start the web server:
[ec2-user@ip-172-31-26-120 ~]$ sudo /etc/init.d/nginx start
Starting nginx: [ OK ]
From your computer, go to http://54.148.230.252/. If you see this page, it means you may have made a mistake in your Kibana configuration:
This error page means your web browser can't connect to Elasticsearch. You may need to clear your browser cache after you fix the configuration, as web pages are typically cached locally for some period of time. If you got it right, the screen should look like this:
Go ahead and select Sample Dashboard, the first option. You should see something like what is shown in the next screenshot. This includes some of the data we ingested, record counts in the center, and the filtering fields in the left-hand margin.
I'm not going to claim to be a Kibana expert (I'm far from that), so I'll leave further customization of this to you. Use this as a base to go back and make additional modifications to the data to make it easier to consume, search, or filter. To do that, you'll probably need to get more familiar with how Elasticsearch works, but that's okay because knowledge isn't a bad thing. A copious amount of documentation is waiting for you at http://www.elasticsearch.org/guide/en/kibana/current/index.html.
At this point, we have completed an end-to-end implementation of data from a web server streamed in a near real-time fashion to a web-based tool for searching. Since both the source and target formats were dictated by others, we used an interceptor to transform the data en route. This use case is very good for short-term troubleshooting, but it's clearly not very "Hadoopy" since we have yet to use core Hadoop.
Archiving to HDFS
When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so that more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.
So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:
I used Cloudera's one-line installation instructions to speed up the setup. The instructions can be found at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html.
Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM repository and imported the corresponding GPG key:
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Retrieving http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:cloudera-cdh-5-0         ################################# [100%]
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
Next, I installed the pseudo-distributed configuration to run a single-node Hadoop cluster:
[ec2-user@ip-172-31-7-175 ~]$ sudo yum install -y hadoop-0.20-conf-pseudo
This might take a while, as it downloads all the Cloudera Hadoop distribution dependencies.
Since this configuration is for single-node use, we need to adjust the fs.defaultFS property in /etc/hadoop/conf/core-site.xml to advertise our private IP instead of localhost. If we don't do this, the namenode process will bind to 127.0.0.1, and other servers, such as our collector's Flume agent, will not be able to contact it:
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.31.7.175:8020</value>
</property>
Next, we format the new HDFS volume and start the HDFS daemons (since that is all we need for this example):
[ec2-user@ip-172-31-7-175 ~]$ sudo -u hdfs hdfs namenode -format
[ec2-user@ip-172-31-7-175 ~]$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-7-175.out
Started Hadoop datanode (hadoop-hdfs-datanode): [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-7-175.out
Started Hadoop namenode: [ OK ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-172-31-7-175.out
Started Hadoop secondarynamenode: [ OK ]
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -df
Filesystem                Size        Used   Available   Use%
hdfs://172.31.7.175:8020  8318783488  24576  6661824512  0%
Now that HDFS is running, let's go back to the collector box configuration to create a second channel and an HDFS Sink by adding these lines:
collector.channels.h1.type=memory
collector.channels.h1.capacity=10000
collector.sinks.hadoop.type=hdfs
collector.sinks.hadoop.channel=h1
collector.sinks.hadoop.hdfs.path=hdfs://172.31.7.175/access_logs/%Y/%m/%d/%H
collector.sinks.hadoop.hdfs.filePrefix=access
collector.sinks.hadoop.hdfs.rollInterval=60
collector.sinks.hadoop.hdfs.rollSize=0
collector.sinks.hadoop.hdfs.rollCount=0
Then we modify the top-level channels and sinks keys:
collector.channels=m1 h1
collector.sinks=es hadoop
As you can see, I've gone with a simple memory channel again, but feel free to use a durable file channel if you need it. For the HDFS configuration, I'll be using a dated file path under the /access_logs root directory with a 60-second rotation regardless of size. We are not altering the source just yet, so don't worry.
If we attempt to start the collector now, we see this exception:
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
Remember that for Apache Flume to speak to HDFS, we need compatible HDFS classes and dependencies for the version of Hadoop we are speaking to. Let's get some help from our friends at Cloudera and install the Hadoop client RPM (output removed to save paper):
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
[ec2-user@ip-172-31-26-205 ~]$ sudo yum install hadoop-client
Test that it works by creating the destination directory for our data. We'll also set permissions to match the account we'll be running the Flume agent under (ec2-user in this case):
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -mkdir hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -chown ec2-user:ec2-user hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ hadoop fs -ls hdfs://172.31.7.175/
Found 1 items
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 04:35 hdfs://172.31.7.175/access_logs
Now, if we run the collector Flume agent, we see no exceptions due to missing HDFS classes. The Flume startup script detects that we have a local Hadoop installation and appends its classpath to its own.
Finally, we can go back to the Flume configuration file and split the incoming data at the source by listing both channels on the source as destination channels:
collector.sources.av.channels=m1 h1
By default, a replicating channel selector is used. This is what we want, so no further configuration is needed. Save the configuration file and restart the Flume agent.
When data is flowing, you should see the expected HDFS activity in the Flume collector agent's logs:
2014-12-11 05:05:04,141 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:261)] Creating hdfs://172.31.7.175/access_logs/2014/12/11/05/access.1418274302083.tmp
You should also see data appearing in HDFS:
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -ls -R /access_logs
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12/11
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:06 /access_logs/2014/12/11/05
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302082
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302083
-rw-r--r--   3 ec2-user ec2-user   6429 2014-12-11 05:06 /access_logs/2014/12/11/05/access.1418274302084
If you run this command from a server other than the Hadoop node, you'll need to specify the full HDFS URI (hdfs://172.31.7.175/access_logs) or set the fs.defaultFS property in Hadoop's core-site.xml configuration file.
Note: In retrospect, I probably should have configured the fs.defaultFS property on any installed Hadoop client that will use HDFS to save myself a lot of typing. Live and learn!
Like the preceding Elasticsearch implementation, you can now use this example as a base end-to-end configuration for HDFS. The next step will be to go back and modify the format, compression, and so on, on the HDFS Sink to better match what you'll do with the data later.
Summary
In this chapter, we iteratively assembled an end-to-end data flow. We started by setting up an Nginx web server to create access logs. We also configured cron to execute a logrotate configuration periodically to safely rotate old logs to a spooling directory.
Next, we installed and configured a single-node Elasticsearch server and tested some insertions and deletions. Then we configured a Flume client to read input from our spooling directory filled with weblogs and relay them to a Flume collector using compressed Avro serialization. The collector then relayed the incoming data to our Elasticsearch server.
Once we saw data flowing from one end to the other, we set up a single-node HDFS server and modified our collector configuration to split the input data feed and relay a copy of the messages to HDFS, simulating archival storage. Finally, we set up a Kibana UI in front of our Elasticsearch instance to provide an easy search function for nontechnical consumers.
In the next chapter, we will cover monitoring Flume data flows using Ganglia.
Chapter 8. Monitoring Flume
The user guide for Flume states:
"Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole."
While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try to cover several of the items that are important as we cover monitoring options in this chapter.
Monitoring the agent process
The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.
Monit
If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit have a paid version that provides more bells and whistles you may want to consider. Even in the free form, it can provide you with a way to check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.
Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will add checks on the disk, CPU, and memory usage as a minimum, in addition to what we cover in this chapter.
Nagios
Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins available for other applications.
My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it does provide more information to the overall monitoring of our Hadoop ecosystem.
Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.
Monitoring performance metrics
Now that we have covered some options for process monitoring, how do you know whether your application is actually doing the work you think it is? On many occasions, I've seen a stuck syslog-ng process that appeared to be running, but it just wasn't sending any data. I'm not picking on syslog-ng specifically; all software does this when conditions it was not designed for occur.
When talking about Flume data flows, you need to monitor the following:
Data entering sources is within expected rates
Data isn't overflowing your channels
Data is exiting sinks at expected rates
Flume has a pluggable monitoring framework, but as mentioned at the beginning of the chapter, it is still very much a work in progress. This does not mean you shouldn't use it, as that would be foolish. It means you'll want to prepare extra testing and integration time any time you upgrade.
Note: While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
Ganglia
One of the available monitoring options to watch Flume internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

Java property                   Value                     Description
flume.monitoring.type           ganglia                   Set to ganglia
flume.monitoring.hosts          host1:port1,host2:port2   A comma-separated list of host:port pairs for your gmond process(es)
flume.monitoring.pollFrequency  60                        The number of seconds between sending of data (the default is 60 seconds)
flume.monitoring.isGanglia3     false                     Set to true if using the older Ganglia 3 protocol. The default is to send data using the v3.1 protocol.
Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:
udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}
In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649
Here, I am using defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.
Note: While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large or complicated network setup, you'll want to get educated by your network engineers if things don't work as you expect.
Internal HTTP server
You can configure the Flume agent to start an HTTP server that will output JSON that can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call into the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components and want to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at the startup of the Flume agent:

Java property          Value  Description
flume.monitoring.type  http   Set to http
flume.monitoring.port  PORT   The port number to bind the HTTP server to

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.
Let’slookatthefollowingFlumeconfiguration:
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.sources.s1.channels=c1
agent.channels.c1.type=memory
agent.sinks.k1.type=avro
agent.sinks.k1.hostname=192.168.33.33
agent.sinks.k1.port=9999
agent.sinks.k1.channel=c1
Start the Flume agent with these properties:
-Dflume.monitoring.type=http
-Dflume.monitoring.port=44444
Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:
{
  "SOURCE.s1": {
    "OpenConnectionCount": "0",
    "AppendBatchAcceptedCount": "0",
    "AppendBatchReceivedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "0",
    "AppendReceivedCount": "0",
    "StopTime": "0",
    "EventReceivedCount": "0",
    "StartTime": "1365128622891",
    "AppendAcceptedCount": "0"
  },
  "CHANNEL.c1": {
    "EventPutSuccessCount": "0",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "StopTime": "0",
    "EventPutAttemptCount": "0",
    "ChannelSize": "0",
    "StartTime": "1365128621890",
    "EventTakeSuccessCount": "0",
    "ChannelCapacity": "100",
    "EventTakeAttemptCount": "0"
  },
  "SINK.k1": {
    "BatchCompleteCount": "0",
    "ConnectionFailedCount": "4",
    "EventDrainAttemptCount": "0",
    "ConnectionCreatedCount": "0",
    "BatchEmptyCount": "0",
    "Type": "SINK",
    "ConnectionClosedCount": "0",
    "EventDrainSuccessCount": "0",
    "StopTime": "0",
    "StartTime": "1365128622325",
    "BatchUnderflowCount": "0"
  }
}
As you can see, each source, sink, and channel is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, that is, the number of connected clients (who are most likely sending data in). This may help you decide whether you have the expected number of clients relying on that data or, perhaps, too many clients, in which case you need to start tiering your agents.
Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea whether data is coming in faster than it is going out. They will also tell you whether the channel is sized large enough to ride out maintenance windows or outages given your data volume.
Looking at the sink, EventDrainSuccessCount versus EventDrainAttemptCount will tell you how often output is successful compared to the number of attempts. In this example, I configured an Avro sink pointing at a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.
Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
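If you do want to script a check against this endpoint, here is a minimal, hypothetical Java sketch that fetches the JSON and flags a filling channel. The localhost address, the 90.0 threshold, and the naive string search are assumptions for illustration; a real check would use a proper JSON parser:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ChannelFillCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical agent location; matches the -Dflume.monitoring.port example above
    URL url = new URL("http://localhost:44444/metrics");
    StringBuilder json = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line);
      }
    } finally {
      in.close();
    }
    // Naive extraction of ChannelFillPercentage; use a JSON library in real code
    String key = "\"ChannelFillPercentage\":\"";
    int idx = json.indexOf(key);
    if (idx >= 0) {
      int start = idx + key.length();
      int end = json.indexOf("\"", start);
      double fill = Double.parseDouble(json.substring(start, end));
      System.out.println(fill > 90.0
          ? "WARNING: channel is " + fill + "% full"
          : "OK: channel is " + fill + "% full");
    }
  }
}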
Custom monitoring hooks

If you already have a monitoring system, you may want to make the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do this, but if you look at the interface, you will see only start() and stop() methods. Unlike the more obvious interceptor paradigm, the agent expects that your MonitorService implementation will start and stop a thread to send data at the expected or configured interval if it is the type that sends data to a receiving service. If you are going to operate a service, such as the HTTP service, then start/stop would be used to start and stop your listening service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors, using object names that start with org.apache.flume. Your implementation will need to read these from MBeanServer.
Note: The best advice I can give you, should you decide to implement your own, is to look at the source of the two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.
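To make the shape of such a hook concrete, here is a minimal sketch. The class name, package, and 60-second interval are assumptions, and it simply prints to standard output; you would substitute calls to your own monitoring system. The JMX query pattern follows from the org.apache.flume object name prefix mentioned earlier:

package com.example.flume;

import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import org.apache.flume.instrumentation.MonitorService;

public class LoggingMonitorService implements MonitorService {
  private ScheduledExecutorService scheduler;

  @Override
  public void start() {
    // The agent only calls start/stop; we own the reporting thread
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { report(); }
    }, 60, 60, TimeUnit.SECONDS); // assumed interval, not a Flume default
  }

  @Override
  public void stop() {
    if (scheduler != null) {
      scheduler.shutdown();
    }
  }

  private void report() {
    try {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Flume publishes its metrics under object names starting with org.apache.flume
      for (ObjectName name : server.queryNames(new ObjectName("org.apache.flume.*:*"), null)) {
        for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
          // Replace this println with a push to your monitoring system
          System.out.println(name + " " + attr.getName() + "="
              + server.getAttribute(name, attr.getName()));
        }
      }
    } catch (Exception e) {
      e.printStackTrace(); // keep the scheduled task alive on transient errors
    }
  }
}

You would then start the agent with -Dflume.monitoring.type=com.example.flume.LoggingMonitorService.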
Summary

In this chapter, we covered monitoring Flume agents both at the process level and at the level of internal metrics (whether the agent is actually doing work).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent's internal monitoring metrics with the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to directly integrate with some other tool that isn't supported by Flume by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.
Chapter 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

In this last chapter, I thought we should cover some of the less concrete, random thoughts I have around data collection into Hadoop. There's no hard science behind some of this, and you should feel perfectly alright disagreeing with me.
While Hadoop is a great tool to consume vast quantities of data, I often think of a picture of the logjam that occurred in 1886 in the St. Croix River in Minnesota (http://www.nps.gov/sacn/historyculture/stories.htm). When dealing with too much data, you want to make sure you don't jam your river. Be sure to take the previous chapter on monitoring seriously and not just as nice-to-have information.
Transport time versus log time

I had a situation where data was being placed into HDFS using date patterns in the filename and/or path, and the dates didn't match the contents of the directories. The expectation was that the data in the 2014/12/29 directory path contained all the data for December 29, 2014. However, the reality was that the date was being pulled from the transport. It turns out that the version of syslog we were using was rewriting the header, including the date portion, causing the data to take on the transport time rather than the original time of the record. Usually, the offsets were tiny, just a second or two, so nobody really took notice. However, one day, one of the relay servers died, and when the data that had got stuck on upstream servers was finally sent, it carried the current time. In this case, it was shifted by a couple of days, causing a significant data cleanup effort.
Besurethisisn’thappeningtoyouifyouareplacingdatabydate.Checkthedateedgecasestoseethattheyarewhatyouexpect,andmakesureyoutestyouroutagescenariosbeforetheyhappenforrealinproduction.
As I mentioned previously, these retransmits due to planned or unplanned maintenance (or even a tiny network hiccup) will most likely cause duplicate and out-of-order events to arrive, so be sure to account for this when processing raw data. Flume offers no single-delivery or ordering guarantees. If you need those, use a transactional database or a distributed transaction log such as Apache Kafka (http://kafka.apache.org/) instead. Of course, if you are going to use Kafka, you would probably only use Flume for the final leg of your data path, with your source consuming events from Kafka (https://github.com/baniuyao/flume-ng-kafka-source).
Note: Remember that you can always work around duplicates in your data at query time as long as you can uniquely identify your events from one another. If you cannot distinguish events easily, you can add a Universally Unique Identifier (UUID) (http://en.wikipedia.org/wiki/Universally_unique_identifier) header using the bundled interceptor, UUIDInterceptor (configuration details are in the Flume User Guide).
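For orientation, a minimal configuration might look like the following sketch, which attaches the interceptor to the source from the earlier example; verify the type string and property names against the Flume User Guide before relying on this:

agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.s1.interceptors.i1.headerName=id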
Time zones are evil

In case you missed my bias against using local time in Chapter 4, Sinks and Sink Processors, I'll repeat it here a little stronger: time zones are evil. Evil like Dr. Evil (http://en.wikipedia.org/wiki/Dr._Evil), and let's not forget his Mini-Me counterpart (http://en.wikipedia.org/wiki/Mini-Me): Daylight Savings Time.
We live in a global world now. You are pulling data from all over the place into your Hadoop cluster. You may even have multiple data centers in different parts of the country (or the world). The last thing you want to be doing while trying to analyze your data is to deal with skewed data. Daylight Savings Time changes somewhere on Earth at least a dozen times a year. Just look at the history: ftp://ftp.iana.org/tz/releases/. Save yourself the headache and just normalize everything to UTC. If you want to convert it to "local time" on its way to human eyeballs, feel free. However, while it lives in your cluster, keep it normalized to UTC.
Note: Consider adopting UTC everywhere via this Java startup parameter (if you can't set it system-wide): -Duser.timezone=UTC
Also, use the ISO 8601 (http://en.wikipedia.org/wiki/ISO_8601) time standard where possible, and be sure to include time zone information (even if it is UTC). Every modern tool on the planet supports this format and will save you pain down the road.
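As a quick sketch of what this looks like in practice, here is one way to emit an ISO 8601 timestamp in UTC from Java (the class name is illustrative; the pattern is standard SimpleDateFormat syntax):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class Iso8601Example {
  public static void main(String[] args) {
    // ISO 8601 with an explicit offset, for example 2014-12-29T14:30:05+0000
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    System.out.println(fmt.format(new Date()));
  }
}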
I live in Chicago, and our computers at work use Central Time, which adjusts for daylight savings. In our Hadoop cluster, we like to keep data in a YYYY/MM/DD/HH directory layout. Twice a year, things break slightly. In the fall, we have twice as much data in our 2 a.m. directory. In the spring, there is no 2 a.m. directory. Madness!
Capacity planning

Regardless of how much data you think you have, things will change over time. New projects will pop up, and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change over time.
There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster (we use the totally unscientific value of 20 percent, which means that we usually plan for 80 percent full when ordering additional hardware, but don't start to panic until we hit the 85-90 percent utilization number). Generally, you want to keep enough extra space so that the failure and/or maintenance of a server or two won't cause the HDFS block replication to consume all the remaining space.
You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is a limit to what tuning batch sizes can accomplish under heavy data volumes. Be very careful in situations where you split your data flow at the source, using a replicating channel selector to write to multiple channels/sinks (as in the sketch that follows). If one of the paths' channels fills up and that channel is not marked as optional, an exception is thrown back to the source and the source will stop consuming new data. This effectively jams the agent for all the other channels attached to that source. If the full channel is marked as optional, the data destined for it is simply dropped, which you may not want because the data is important. Unfortunately, this is the only fan-out mechanism provided in Flume to send to multiple destinations, so make sure you catch issues quickly so that all your data flows are not impaired due to a cascading backup of events.
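Here is a minimal sketch of such a fan-out, reusing the source name from the earlier example; the channel names and the choice of which channel to sacrifice are assumptions for illustration:

agent.sources.s1.channels=c1 c2
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2

With c2 marked optional, a full c2 silently drops events for that path while c1 continues; remove the optional line and a full c2 stalls the source for both paths.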
For the many Flume agents feeding Hadoop, channel capacity, too, should be adjusted based on real numbers. Watch the channel size to see how well the writes are keeping up under normal loads. Adjust the maximum channel capacity to handle whatever amount of overhead makes you feel comfortable. You can always purchase way more hardware than you need, but even a prolonged outage may overflow the most conservative estimates. This is when you have to pick and choose which data is more important to you and adjust your channel capacities to reflect that. This way, if you exceed your limits, the least important data will be the first to be dropped.
Chances are your company doesn't have an infinite amount of money, and at some point, the value of the data versus the cost of continuing to expand your cluster will start to be questioned. This is why setting limits on the volume of data collected is very important. This is just one aspect of your data retention policy, where cost is the driving factor. In a moment, we'll discuss some of the compliance aspects of this policy. Suffice it to say, any project sending data into Hadoop should be able to say what the value of that data is and what the loss would be if the older data were deleted. This is the only way the people writing the checks can make an informed decision.
Considerations for multiple data centers

If you run your business out of multiple data centers and have a large volume of data collected, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. There may be regulatory implications regarding data crossing certain geographic boundaries. Chances are there is somebody in your company who knows much more about compliance than you or I, so seek them out before you start copying data across borders. Of course, not collating your data will make it more difficult to analyze, as you can't just run one MapReduce job against all the data. Instead, you would have to run parallel jobs and then combine the results in a second pass. Adjusting your data processing procedures is better than potentially breaking the law. Be sure to do your homework.
Pulling all your data into a single cluster may also be more than your networking can handle. Depending on how your data centers are connected to each other, you simply may not be able to transmit the desired volume of data. If you use public cloud services, there are surely data transfer costs between data centers. Finally, consider that a complete cluster failure or corruption may wipe out everything, as most clusters are usually too big to back up anything except high-value data. Having some of the old data in this case is sometimes better than having nothing. With multiple Hadoop clusters, you have the ability to use a Failover Sink Processor to forward data to a different cluster if you don't want to wait to send to the local one (see the sketch that follows).
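As a reminder of the sink group syntax from Chapter 4, a failover setup across two clusters might look like this sketch, where k1 writes to the local cluster and k2 to the remote one; the group name, sink names, and priorities are assumptions for illustration:

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=5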
If you do choose to send all your data to a single destination, consider adding a machine with large disk capacity as a relay server for the data center. This way, if there is a communication issue or extended cluster maintenance, you can let data pile up on a machine that's different from the ones trying to service your customers. This is sound advice even in a single data center situation.
Compliance and data expiry

Remember that the data your company is collecting from your customers should be considered sensitive information. You may be bound by additional regulatory limitations on accessing data, such as:

- Personally identifiable information (PII): how you handle and safeguard your customers' identities (http://en.wikipedia.org/wiki/Personally_identifiable_information)
- Payment Card Industry Data Security Standard (PCI DSS): how you safeguard credit card information (http://en.wikipedia.org/wiki/PCI_DSS)
- Service Organization Control (SOC-2): how you control access to information/systems (http://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/Pages/AICPASOC2Report.aspx)
- Statements on Standards for Attestation Engagements (SSAE-16): how you manage changes (http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AT-00801.pdf)
- Sarbanes-Oxley (SOX): http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act
This is by no means a definitive list, so be sure to seek out your company's compliance experts for what does and doesn't apply to your situation. If you aren't properly handling access to this data in your cluster, the government will lean on you, or worse, you won't have customers anymore if they feel you aren't protecting their personal information. Consider scrambling, trimming, or obfuscating your data of personal information. Chances are the business insight you are looking for falls more into the category of "how many people who search for 'hammer' actually buy one?" rather than "how many customers are named Bob?" As you saw in Chapter 6, Interceptors, ETL, and Routing, it would be very easy to write an interceptor to obfuscate PII as you move it around; a sketch follows.
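For instance, here is a minimal, hypothetical interceptor that replaces a configurable header's value with its SHA-256 hash, so events stay distinguishable without carrying the raw value. The class, package, and default header name are illustrative assumptions, not a bundled Flume component:

package com.example.flume;

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class HashHeaderInterceptor implements Interceptor {
  private final String header;

  private HashHeaderInterceptor(String header) {
    this.header = header;
  }

  @Override public void initialize() {}
  @Override public void close() {}

  @Override
  public Event intercept(Event event) {
    String value = event.getHeaders().get(header);
    if (value != null) {
      try {
        // One-way hash keeps events distinguishable without exposing the PII itself
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(value.getBytes(Charset.forName("UTF-8")));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        event.getHeaders().put(header, hex.toString());
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public static class Builder implements Interceptor.Builder {
    private String header = "username"; // assumed default header to obfuscate

    @Override
    public void configure(Context context) {
      header = context.getString("header", header);
    }

    @Override
    public Interceptor build() {
      return new HashHeaderInterceptor(header);
    }
  }
}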
Your company probably has a document retention policy that includes the data you are putting into Hadoop. Make sure you remove data that your policy says you aren't supposed to be keeping around anymore. The last thing you want is a visit from the lawyers.
Summary

In this chapter, we covered several real-world considerations you need to think about when planning your Flume implementation, including:

- Transport time does not always match event time
- The mayhem introduced by Daylight Savings Time into time-based logic
- Capacity planning considerations
- Items to consider when you have more than one data center
- Data compliance
- Data retention and expiration
I hope you enjoyed this book. Hopefully, you will be able to apply much of this information directly in your application/Hadoop integration efforts.
Thanks, this was fun!
Index

A
ActiveMQ: URL / JMS source
agent / Sources, channels, and sinks
agent command / Starting up with "Hello, World!"
agent identifier (name) / An overview of the Flume configuration file
agent process: monitoring / Monitoring the agent process
agent process, monitoring: Monit / Monit; Nagios / Nagios
AOP Spring Framework / Interceptors, channel selectors, and sink processors
Apache Avro: about / Apache Avro; URL / Apache Avro
Apache benchmark: URL / Setting up the web server
Apache BigTop project: URL / Flume in Hadoop distributions
Apache Flume: URL / The memory channel
Apache Kafka: URL / Transport time versus log time
Avro source/sink: used, for tiering data flows / The Avro source/sink, The Thrift source/sink; compressed Avro / Compressing Avro
avro_event serializer / Apache Avro

B
basenameHeaderKey property / Spooling Directory Source
batchDurationMillis property / Sink configuration
batchSize property / Sink configuration
batchTimeout property / The Exec source
Best effort (BE) / Flume 0.9
byteCapacity: URL / The memory channel
byteCapacityBufferPercentage: URL / The memory channel

C
capacity planning / Capacity planning
CDH3 distribution / Flume 0.9
CDH5: URL / Archiving to HDFS
centralized master (masters) / Flume 0.9
cf-engine tool / Flume 1.X (Flume-NG)
ChannelProcessor function / The memory channel
channels / Sources, channels, and sinks
channel selector: about / Channel selectors; replicating / Replicating; multiplexing / Multiplexing
channel selectors / Interceptors, channel selectors, and sink processors
checkpointInterval property / The file channel
Chef tool / Flume 1.X (Flume-NG)
Cloudera: URL / Flume in Hadoop distributions
Cloudera distribution: issue solution, URL / An overview of the Flume configuration file
codecs: about / Compression codecs
command-line Avro: about / Using command-line Avro
compliance / Compliance and data expiry
CompressedStream / CompressedStream
consumeOrder property / Spooling Directory Source
cron daemon: URL / Configuring log rotation to the spool directory
custom interceptors: about / Custom interceptors; plugins directory / The plugins directory
custom monitoring reporting mechanism: developing / Custom monitoring hooks

D
-Dflume.root.logger property / Starting up with "Hello, World!"
data/logs: streaming / The problem with HDFS and streaming data/logs
data expiry / Compliance and data expiry
data flows: tiering / Tiering flows
data flows, tiering: Avro source/sink, using / The Avro source/sink; SSL Avro / SSL Avro flows; Thrift source/sink, using / The Thrift source/sink; command-line Avro / Using command-line Avro; Log4J appender / The Log4J appender; Log4J load-balancing appender / The Log4J load-balancing appender
DataStream / DataStream
destinationName property / JMS source
Disk Failover (DFO) / Flume 0.9

E
Elastic Compute Cluster (EC2) / Weblogs to searchable UI
ElasticSearch: about / ElasticSearch Sink; URL / ElasticSearch Sink, Setting up the target – Elasticsearch, Setting up a better user interface – Kibana; versus Apache Solr / Dynamic Serializer; setting up / Setting up the target – Elasticsearch
ElasticSearchDynamicSerializer serializer / Dynamic Serializer
ElasticSearchLogStashEventSerializer / LogStash Serializer
ElasticSearch Sink: about / ElasticSearch Sink; settings / ElasticSearch Sink; ElasticSearchLogStashEventSerializer / LogStash Serializer; ElasticSearchDynamicSerializer serializer / Dynamic Serializer
Embedded Agent: about / The embedded agent; configuration / Configuration and startup; startup / Configuration and startup; alternative formats, URL / Configuration and startup; data, sending / Sending data; shutdown / Shutdown
End-to-End (E2E) / Flume 0.9
Event Serializer: about / Event Serializers; text output / Text output; text_with_headers serializer / Text with headers; Apache Avro / Apache Avro; user-provided Avro schema / User-provided Avro schema; file type / File type; timeouts / Timeouts and workers; workers / Timeouts and workers
Exec source: about / The Exec source; properties / The Exec source

F
file channel: about / The file channel; configuration parameters / The file channel
file type: about / File type; SequenceFile / SequenceFile; DataStream / DataStream; CompressedStream / CompressedStream
Flume: about / Flume 0.9; events / Flume events; URL / Downloading Flume; downloading / Downloading Flume; in Hadoop distributions / Flume in Hadoop distributions; configuration file / An overview of the Flume configuration file; setting up, on collector/relay / Setting up Flume on collector/relay; setting up, on client / Setting up Flume on the client; agent process, monitoring / Monitoring the agent process
Flume-NG (Flume the Next Generation) / Flume 0.9
flume-ng-kafka-source: URL / Transport time versus log time
Flume 0.9 / Flume 0.9
Flume 1.X (Flume-NG) / Flume 1.X (Flume-NG); URL / Flume 1.X (Flume-NG)
Flume configuration file: overview / An overview of the Flume configuration file
Flume events: about / Flume events; interceptors / Interceptors, channel selectors, and sink processors; channel selectors / Interceptors, channel selectors, and sink processors; sink processors / Interceptors, channel selectors, and sink processors
Flume JVM: URL / Monitoring performance metrics
Flume User Guide: URL / The file channel, HDFS sink

G
Ganglia: about / Ganglia; URL / Ganglia
grok command: URL / The Kite SDK

H
Hadoop distributions: Flume / Flume in Hadoop distributions; benefits / Flume in Hadoop distributions; limitations / Flume in Hadoop distributions
HDFS: issue / The problem with HDFS and streaming data/logs; archiving to / Archiving to HDFS
hdfs.batchSize parameter / File rotation
hdfs.maxOpenFiles property / Path and filename
hdfs.timeZone property / Path and filename
hdfs.useLocalTimeStamp boolean property / Path and filename
HDFS sink: about / HDFS sink; configuration parameters / HDFS sink; path / Path and filename; filename / Path and filename; file rotation / File rotation; compression codecs / Compression codecs
Hello, World! example: file configuration / Starting up with "Hello, World!"
help command / Starting up with "Hello, World!"
Hortonworks: URL / Flume in Hadoop distributions
Host interceptor: about / Host; properties / Host
Human Optimized Configuration Object Notation (HOCON): URL / Morphline configuration files

I
indexName property / ElasticSearch Sink
interceptor / Interceptors, channel selectors, and sink processors; used, for creating search fields / Creating more search fields with an interceptor
interceptors: about / Interceptors; adding / Interceptors; Timestamp / Timestamp; Host / Host; Static / Static; regular expression filtering / Regular expression filtering; regular expression / Regular expression extractor; Morphline / Morphline interceptor; custom / Custom interceptors
internal HTTP server: using / Internal HTTP server
ISO 8601: URL / Time zones are evil

J
Java KeyStore (JKS) / SSL Avro flows
Java properties: flume.monitoring.type / Ganglia; flume.monitoring.hosts / Ganglia; flume.monitoring.isGanglia3 / Ganglia
JMS message selectors: URL / JMS source
JMS source: about / JMS source; configuring / JMS source; settings / JMS source

K
keep-alive parameter / The file channel
keepFields property / The syslog UDP source, The syslog TCP source
Kibana: URL / ElasticSearch Sink, Setting up a better user interface – Kibana; setting up / Setting up a better user interface – Kibana
Kite SDK: about / The Kite SDK; URL / The Kite SDK

L
Log4J appender: about / The Log4J appender; URL / The Log4J appender
Log4J load-balancing appender: about / The Log4J load-balancing appender
logrotate utility: URL / Configuring log rotation to the spool directory
Logstash: URL / ElasticSearch Sink
log time: versus transport time / Transport time versus log time

M
MapR: URL / Flume in Hadoop distributions
memoryCapacity property / Spillable Memory Channel
memory channel: about / The memory channel; configuration parameters / The memory channel
metrics: URL / Internal HTTP server
Mini-Me counterpart / Time zones are evil
minimumRequiredSpace property / The file channel
Monit: URL / Monit; about / Monit
Morphline: configuration file / Morphline configuration files; URL / Morphline configuration files, Typical SolrSink configuration, Morphline interceptor
Morphline commands: URL / The Kite SDK
morphlineId property / Sink configuration
Morphline interceptor: about / Morphline interceptor
MorphlineSolrSink: about / MorphlineSolrSink; Morphline configuration file / Morphline configuration files; SolrSink configuration / Typical SolrSink configuration; sink configuration / Sink configuration
multiple data centers: considerations / Considerations for multiple data centers
Multiport Syslog TCP source: about / The multiport syslog TCP source; properties / The multiport syslog TCP source; Flume headers / The multiport syslog TCP source

N
Nagios: URL / Nagios; about / Nagios
Nagios JMX plugin: URL / Monitoring performance metrics
nc command / Starting up with "Hello, World!"
Nginx web server: URL / Setting up the web server

O
overflowCapacity property / Spillable Memory Channel
overflowDeactivationThreshold property / Spillable Memory Channel
overflowTimeout property / Spillable Memory Channel

P
Payment Card Industry Data Security Standard (PCI DSS): URL / Compliance and data expiry
performance metrics: monitoring / Monitoring performance metrics
performance metrics, monitoring: Ganglia / Ganglia; internal HTTP server / Internal HTTP server; custom monitoring hooks / Custom monitoring hooks
Personally identifiable information (PII): URL / Compliance and data expiry
plugins directory: about / The plugins directory; $FLUME_HOME/plugins.d directory / The plugins directory; lib directory / The plugins directory; native directory / The plugins directory
pollTimeout property / JMS source
pom.xml file: URL / Dynamic Serializer
POSIX-style filesystem / The problem with HDFS and streaming data/logs
processor.backoff property / Load balancing
Puppet tool / Flume 1.X (Flume-NG)

R
Red Hat Enterprise Linux (RHEL) / Flume in Hadoop distributions
regular expression extractor interceptor: about / Regular expression extractor; properties / Regular expression extractor
regular expression filtering interceptor: about / Regular expression filtering; URL / Regular expression filtering
routing / Routing
Runners / Flume 1.X (Flume-NG)

S
Sarbanes-Oxley (SOX): URL / Compliance and data expiry
SequenceFile file type / SequenceFile
serializer.syncIntervalBytes property / Apache Avro
serializers / Regular expression extractor
sink group: about / Sink groups; load balancing / Load balancing; failover / Failover
Sink Processor: about / Sink groups
sink processors / Interceptors, channel selectors, and sink processors
sinks / Sources, channels, and sinks
Solr / MorphlineSolrSink
SolrCloud / MorphlineSolrSink
SolrSink: configuration / Typical SolrSink configuration
sources / Sources, channels, and sinks
Spillable Memory Channel: about / Spillable Memory Channel; configuration parameters / Spillable Memory Channel
spool directory: log rotation, configuring / Configuring log rotation to the spool directory
Spooling Directory Source: about / Spooling Directory Source; creating / Spooling Directory Source; properties / Spooling Directory Source
Spring: URL / Configuration and startup
start() method / Custom monitoring hooks
Statements on Standards for Attestation Engagements (SSAE-16): URL / Compliance and data expiry
Static interceptor: about / Static; properties / Static
syslog sources: about / Syslog sources; URL / Syslog sources; UDP source / The syslog UDP source; TCP source / The syslog TCP source; Multiport Syslog TCP source / The multiport syslog TCP source
Syslog TCP source: about / The syslog TCP source; creating / The syslog TCP source; Flume headers / The syslog TCP source
syslog terms: URL / Syslog sources
Syslog UDP source: about / The syslog UDP source; properties / The syslog UDP source; Flume headers / The syslog UDP source

T
tail: URL / The problem with using tail; about / The problem with using tail; issues / The problem with using tail
tail -F command / The Exec source
take / The memory channel
text_with_headers serializer / Text with headers
Thrift: URL / The Thrift source/sink
Thrift source/sink: used, for tiering data flows / The Thrift source/sink
tiered data collection / Tiered data collection (multiple flows and/or agents)
Timestamp interceptor: about / Timestamp; properties / Timestamp
timestamp key / Regular expression extractor
time zones / Time zones are evil
transactionCapacity property / The file channel, Spillable Memory Channel
transport time: versus log time / Transport time versus log time

U
Universally Unique Identifier (UUID): URL / Transport time versus log time
user-provided Avro schema / User-provided Avro schema

V
VeriSign / SSL Avro flows

W
web application: simulating / Weblogs to searchable UI
web server: setting up / Setting up the web server; log rotation, configuring to spool directory / Configuring log rotation to the spool directory
worker daemons (agents) / Flume 0.9
Write Ahead Log (WAL) / The file channel
wrk: URL / Setting up the web server

Z
Zookeeper / Flume 0.9