
Apache Flume: Distributed Log Collection for Hadoop Second Edition

Table of Contents

Apache Flume: Distributed Log Collection for Hadoop Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview and Architecture
Flume 0.9
Flume 1.X (Flume-NG)
The problem with HDFS and streaming data/logs
Sources, channels, and sinks
Flume events
Interceptors, channel selectors, and sink processors
Tiered data collection (multiple flows and/or agents)
The Kite SDK
Summary
2. A Quick Start Guide to Flume
Downloading Flume
Flume in Hadoop distributions
An overview of the Flume configuration file
Starting up with "Hello, World!"
Summary
3. Channels
The memory channel
The file channel
Spillable Memory Channel
Summary
4. Sinks and Sink Processors
HDFS sink
Path and filename
File rotation
Compression codecs
Event Serializers
Text output
Text with headers
Apache Avro
User-provided Avro schema
File type
SequenceFile
DataStream
CompressedStream
Timeouts and workers
Sink groups
Load balancing
Failover
MorphlineSolrSink
Morphline configuration files
Typical SolrSink configuration
Sink configuration
ElasticSearchSink
LogStash Serializer
Dynamic Serializer
Summary
5. Sources and Channel Selectors
The problem with using tail
The Exec source
Spooling Directory Source
Syslog sources
The syslog UDP source
The syslog TCP source
The multiport syslog TCP source
JMS source
Channel selectors
Replicating
Multiplexing
Summary
6. Interceptors, ETL, and Routing
Interceptors
Timestamp
Host
Static
Regular expression filtering
Regular expression extractor
Morphline interceptor
Custom interceptors
The plugins directory
Tiering flows
The Avro source/sink
Compressing Avro
SSL Avro flows
The Thrift source/sink
Using command-line Avro
The Log4J appender
The Log4J load-balancing appender
The embedded agent
Configuration and startup
Sending data
Shutdown
Routing
Summary
7. Putting It All Together
Web logs to searchable UI
Setting up the web server
Configuring log rotation to the spool directory
Setting up the target – Elasticsearch
Setting up Flume on collector/relay
Setting up Flume on the client
Creating more search fields with an interceptor
Setting up a better user interface – Kibana
Archiving to HDFS
Summary
8. Monitoring Flume
Monitoring the agent process
Monit
Nagios
Monitoring performance metrics
Ganglia
Internal HTTP server
Custom monitoring hooks
Summary
9. There Is No Spoon – the Realities of Real-time Distributed Data Collection
Transport time versus log time
Time zones are evil
Capacity planning
Considerations for multiple data centers
Compliance and data expiry
Summary
Index

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Apache Flume: Distributed Log Collection for Hadoop Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-217-8

www.packtpub.com

Credits

Author
Steve Hoffman

Reviewers
Sachin Handiekar
Michael Keane
Stefan Will

Commissioning Editor
Dipika Gaonkar

Acquisition Editor
Reshma Raman

Content Development Editor
Neetu Ann Mathew

Technical Editor
Menza Mathew

Copy Editors
Vikrant Phadke
Stuti Srivastava

Project Coordinator
Mary Alex

Proofreader
Simran Bhogal
Safis Editing

Indexer
Rekha Nair

Graphics
Sheetal Aute
Abhinash Sahu

Production Coordinator
Komal Ramchandani

Cover Work
Komal Ramchandani

About the Author

Steve Hoffman has 32 years of experience in software development, ranging from embedded software development to the design and implementation of large-scale, service-oriented, object-oriented systems. For the last 5 years, he has focused on infrastructure as code, including automated Hadoop and HBase implementations and data ingestion using Apache Flume. Steve holds a BS in computer engineering from the University of Illinois at Urbana-Champaign and an MS in computer science from DePaul University. He is currently a senior principal engineer at Orbitz Worldwide (http://orbitz.com/).

More information on Steve can be found at http://bit.ly/bacoboy and on Twitter at @bacoboy.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing.

I'd again like to dedicate this updated book to my loving and supportive wife, Tracy. She puts up with a lot, and that is very much appreciated. I couldn't ask for a better friend daily by my side.

My terrific children, Rachel and Noah, are a constant reminder that hard work does pay off and that great things can come from chaos.

I also want to give a big thanks to my parents, Alan and Karen, for molding me into the somewhat satisfactory human I've become. Their dedication to family and education above all else guides me daily as I attempt to help my own children find their happiness in the world.

About the Reviewers

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.

Sachin has a lot of interest in open source projects. He has contributed code to Apache Camel and developed plugins for Spring Social, which can be found at GitHub (https://github.com/sachin-handiekar).

He also actively writes about enterprise application development on his blog (http://sachinhandiekar.com).

Michael Keane has a BS in computer science from the University of Illinois at Urbana-Champaign. He has worked as a software engineer, coding almost exclusively in Java since JDK 1.1. He has also worked in the mission-critical medical device software, e-commerce, transportation, navigation, and advertising domains. He is currently a development leader for Conversant, where he maintains Flume flows of nearly 100 billion log lines per day.

Michael is a father of three, and besides work, he spends most of his time with his family and coaching youth softball.

Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn, Germany. For over a decade, he has worked for several start-ups in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and real-time analytics platform at Zendesk, a provider of customer service software.

www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <service@packtpub.com> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.

What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.

What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."

A block of code is set as follows:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <feedback@packtpub.com>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <questions@packtpub.com> if you are having a problem with any aspect of the book, and we will do our best to address it.

Chapter 1. Overview and Architecture

If you are reading this book, chances are you are swimming in oceans of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of being connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, this same data can provide valuable insight into your business and customers if you know how to pan gold out of your river of data.

Furthermore, as you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.

Getting data into and out of Hadoop (in this case, the Hadoop File System, or HDFS) isn't hard; it is just a simple command, such as:

% hadoop fs -put data.csv .

This works great when you have all your data neatly packaged and ready to upload.

However, your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks, "Can you get me the data sooner?" What you really need is a solution that can deal with streaming logs/data.

Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually only had one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stores undeliverable data to the local disk (or sometimes, a local database) and would keep retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June 2011, Cloudera moved control of the Flume project to the Apache Foundation. It came out of the incubator status a year later in 2012. During the incubation year, work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG (Flume the Next Generation).

Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the mainline of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.

The first edition of this book covered all the versions of Flume up to Version 1.3.1. This second edition will cover up to Version 1.5.2 (the current version at the time of writing this).

The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job runtimes.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.

Sources, channels, and sinks

The Flume agent's architecture can be viewed in this simple diagram. Inputs are called sources and outputs are called sinks. Channels provide the glue between sources and sinks. All of these run inside a daemon called an agent.

Note

Keep in mind:

A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.

Flume events

The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or the hostname of the server from which the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input is comprised of tailed log files, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (like when a source adds the hostname where the data is sourced or creating an event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.

Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event. If you are familiar with the AOP Spring Framework, think MethodInterceptor. In Java Servlets, it's similar to ServletFilter. Here's an example of what using four chained interceptors on a source might look like:

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases you might have, although you can write your own if need be. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on some header information. Combined with some interceptor logic, this duo forms the foundation for routing input to different channels, as shown in the sketch that follows.
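To make that concrete, here is a rough sketch of a multiplexing selector configuration (the source, channel, and header names are made up purely for illustration; the properties follow the multiplexing selector just described):

agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=datatype
agent.sources.s1.selector.mapping.square=c1
agent.sources.s1.selector.mapping.triangle=c2
agent.sources.s1.selector.default=c1

With this, events whose datatype header is square would be written to channel c1, triangle events to c2, and anything else to the default channel.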

Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.

Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

In the following diagram, you can see that there are two places where data is created (on the left-hand side) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right-hand side). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent, we use a multiplexing channel selector to split the two kinds of data into different channels. The rectangle channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data transformations can occur after any source. How all of these components can be used to build complicated data workflows will become clear as we proceed.

The Kite SDK

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note

To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, MorphlineSolrSink.

Morphlines are just one component of the Kite SDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note

Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the Kite SDK at http://kitesdk.org.

Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for data processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load) library, starting with Version 1.4 of Flume.

The next chapter will cover these in more detail, specifically, the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.

Chapter 2. A Quick Start Guide to Flume

As we covered some of the basics in the previous chapter, this chapter will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume.

Downloading Flume

Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell.

The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution.

Go ahead, download and verify the binary distribution to save us some time in getting started.

Flume in Hadoop distributions

Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache BigTop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation as well as provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition.

If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework.

Note

Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links:

Cloudera: http://cloudera.com/
Hortonworks: http://hortonworks.com/
MapR: http://mapr.com/

An overview of the Flume configuration file

Now that we've downloaded Flume, let's spend some time going over how to configure an agent.

A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent.

Note

By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps).

If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.
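For example, here is what passing that flag might look like when starting the "Hello, World!" agent we'll build later in this chapter (the agent name and configuration path are just the ones used in that example):

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf --no-reload-conf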

Each agent is configured, starting with three parameters:

agent.sources=<list of sources>
agent.channels=<list of channels>
agent.sinks=<list of sinks>

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type=memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type=memory
agent.channels.access.capacity=200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels=access

Let's look at a complete example using the canonical Hello, World! example.

Starting up with "Hello, World!"

No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:

agent.sources=s1
agent.channels=c1
agent.sinks=k1

agent.sources.s1.type=netcat
agent.sources.s1.channels=c1
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345

agent.channels.c1.type=memory

agent.sinks.k1.type=logger
agent.sinks.k1.channel=c1

Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1.

The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event.

First, explode the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help
Usage: ./bin/flume-ng <command> [options]...

commands:
  help                    display this help text
  agent                   run a Flume agent
  avro-client             run an avro Flume client
  version                 show Flume version info

global options:
  --conf,-c <conf>        use configs in <conf> directory
  --classpath,-C <cp>     append to the classpath
  --dryrun,-d             do not actually start Flume, just print the command
  --plugins-path <dirs>   colon-separated list of plugins.d directories. See the
                          plugins.d section in the user guide for more details.
                          Default: $FLUME_HOME/plugins.d
  -Dproperty=value        sets a Java system property value
  -Xproperty=value        sets a Java -X option

agent options:
  --conf-file,-f <file>   specify a config file (required)
  --name,-n <name>        the name of this agent (required)
  --help,-h               display help text

avro-client options:
  --rpcProps,-P <file>    RPC client properties file with server connection params
  --host,-H <host>        hostname to which events will be sent
  --port,-p <port>        port of the avro source
  --dirname <dir>         directory to stream to avro source
  --filename,-F <file>    text file to stream to avro source (default: std input)
  --headerFile,-R <file>  File containing event headers as key/value pairs on each new line
  --help,-h               display help text

Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first in the classpath.

As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents).

Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell.

Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender.

If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the classpath.

If you left the -c parameter off the command, you'd see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Butyoudidn’tdothatsoyoushouldseethesekeyloglines:

2014-10-0515:39:06,109(conf-file-poller-0)[INFO-

org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfigu

ration.java:140)]Post-validationflumeconfigurationcontains

configurationforagents:[agent]

Thislinetellsyouthatyouragentstartswiththenameagent.

Usuallyyou’dlookforthislineonlytobesureyoustartedtherightconfigurationwhenyouhavemultipleconfigurationsdefinedinyourconfigurationfile.

2014-10-0515:39:06,076(conf-file-poller-0)[INFO-

org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatche

rRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)]

Reloadingconfigurationfile:conf/hw.conf

Thisisanothersanitychecktomakesureyouareloadingthecorrectfile,inthiscaseourhw.conffile.

2014-10-0515:39:06,221(conf-file-poller-0)[INFO-

org.apache.flume.node.Application.startAllComponents(Application.java:138)]

Startingnewconfiguration:{sourceRunners:{s1=EventDrivenSourceRunner:{

source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE}}}

sinkRunners:{k1=SinkRunner:{

policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47counterGroup:{

name:nullcounters:{}}}}channels:

{c1=org.apache.flume.channel.MemoryChannel{name:c1}}}

Oncealltheconfigurationshavebeenparsed,youwillseethismessage,whichshowsyoueverythingthatwasconfigured.Youcansees1,c1,andk1,andwhichJavaclassesareactuallydoingthework.Asyouprobablyguessed,netcatisaconveniencefororg.apache.flume.source.NetcatSource.Wecouldhaveusedtheclassnameifwewanted.Infact,ifIhadmyowncustomsourcewritten,Iwoulduseitsclassnameforthesource’stypeparameter.YoucannotdefineyourownshortnameswithoutpatchingtheFlumedistribution.

2014-10-0515:39:06,427(lifecycleSupervisor-1-0)[INFO-

org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)]Created

serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here,weseethatoursourceisnowlisteningonport12345fortheinput.So,let’ssendsomedatatoit.

Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World
OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64    Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).

If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20    The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.
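As a rough sketch of that alternative (the sink name and directory are placeholders, not part of the Hello, World! example), a file_roll sink might be configured like this:

agent.sinks.k2.type=file_roll
agent.sinks.k2.channel=c1
agent.sinks.k2.sink.directory=/tmp/flume-debug

Each event body would then be written in full to files under /tmp/flume-debug rather than being truncated in the agent's log.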

Summary

In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.

We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.

For each type, we'll discuss all the configuration knobs available to you, when and why you might want to deviate from the defaults, and most importantly, why to use one over the other.

Chapter 3. Channels

In Flume, a channel is the construct used between sources and sinks. It provides a buffer for your in-flight events after they are read from sources until they can be written to sinks in your data processing pipelines.

The primary types we'll cover here are a memory-backed/nondurable channel and a local-filesystem-backed/durable channel. Starting with Flume 1.5, an experimental hybrid memory and file channel called the Spillable Memory Channel is introduced. The durable file channel flushes all changes to disk before acknowledging the receipt of the event to the sender. This is considerably slower than using the nondurable memory channel, but it provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but failure results in data loss and it has much lower storage capacity when compared to the multiterabyte disks backing the file channel. This is why the Spillable Memory Channel was created. In theory, you get the benefits of memory speed until the memory fills up due to flow backpressure. At this point, the disk will be used to store the events—and with that comes much larger capacity. There are trade-offs here as well, as performance is now variable depending on how the entire flow is performing. Ultimately, the channel you choose depends on your specific use cases, failure scenarios, and risk tolerance.

That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel and throw a ChannelException. What your source does or doesn't do with that ChannelException is source-specific, but in some cases, data loss is possible, so you'll want to avoid filling channels by sizing things properly. In fact, you always want your sink to be able to write faster than your source input. Otherwise, you might get into a situation where once your sink falls behind, you can never catch up. If your data volume tracks with the site usage, you can have higher volumes during the day and lower volumes at night, giving your channels time to drain. In practice, you'll want to try and keep the channel depth (the number of events currently in the channel) as low as possible because time spent in the channel translates to a time delay before reaching the final destination.

The memory channel

A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than the disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points isn't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory.

agent.channels.c1.type=memory

This defines a memory channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key                            Required   Type            Default
type                           Yes        String          memory
capacity                       No         int             100
transactionCapacity            No         int             100
byteCapacityBufferPercentage   No         int (percent)   20%
byteCapacity                   No         long (bytes)    80% of JVM heap
keep-alive                     No         int             3 (seconds)

The default capacity of this channel is 100 events. This can be adjusted by setting the capacity property as follows:

agent.channels.c1.capacity=200

Remember that if you increase this value, you will most likely have to increase your Java heap space using the -Xmx, and optionally -Xms, parameters.
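One way to do this (a minimal sketch, assuming you use the conf/flume-env.sh file that ships as a template with the binary distribution; the heap sizes are placeholders) is to export larger heap settings there:

JAVA_OPTS="-Xms512m -Xmx1g"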

Another capacity-related setting you can set is transactionCapacity. This is the maximum number of events that can be written, also called a put, by a source's ChannelProcessor, the component responsible for moving data from the source to the channel, in a single transaction. This is also the number of events that can be read, also called a take, in a single transaction by the SinkProcessor, which is the component responsible for moving data from the channel to the sink. You might want to set this higher in order to decrease the overhead of the transaction wrapper, which might speed things up. The downside to increasing this is that a source would have to roll back more data in the event of a failure.
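For example (the value is purely illustrative), raising the batch size on our c1 channel would look like this:

agent.channels.c1.transactionCapacity=1000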

Tip

Flume only provides transactional guarantees for each channel in each individual agent. In a multiagent, multichannel configuration, duplicates and out-of-order delivery are likely but should not be considered the norm. If you are getting duplicates in nonfailure conditions, it means that you need to continue tuning your Flume configurations.

If you are using a sink that writes someplace that benefits from larger batches of work (such as HDFS), you might want to set this higher. Like many things, the only way to be sure is to run performance tests with different values. This blog post from Flume committer Mike Percy should give you some good starting points: http://bit.ly/flumePerfPt1.

The byteCapacityBufferPercentage and byteCapacity parameters were introduced in https://issues.apache.org/jira/browse/FLUME-1535 as a means to size the memory channel capacity using the number of bytes used rather than the number of events as well as trying to avoid OutOfMemoryErrors. If your events have a large variance in size, you might be tempted to use these settings to adjust the capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values.
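As an illustrative sketch (the byte count is arbitrary), capping the channel at roughly 100 MB of estimated event body data instead of counting events might look like this:

agent.channels.c1.byteCapacity=104857600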

Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full, before giving up. As data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source. You might be tempted to set this value very high, but remember that waiting for a write to a channel will block the data flowing into your source, which might cause data to back up in an upstream agent. Eventually, this might result in events being dropped. You need to size for periodic spikes in traffic as well as temporary planned (and unplanned) maintenance.
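For instance (an arbitrary value for illustration only), letting a blocked source wait up to 10 seconds for space before giving up would be:

agent.channels.c1.keep-alive=10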

The file channel

A file channel is a channel that stores events to the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues and should be used in use cases where a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache.org/FlumeUserGuide.html). Keep in mind that using encryption will reduce the throughput of your file channel.

To use the file channel, set the type parameter on your named channel to file.

agent.channels.c1.type=file

This defines a file channel named c1 for the agent named agent.

Here is a table of configuration parameters you can adjust from the default values:

Key                     Required   Type                           Default
type                    Yes        String                         file
checkpointDir           No         String                         ~/.flume/file-channel/checkpoint
useDualCheckpoints      No         boolean                        false
backupCheckpointDir     No         String                         No default, but must be different from checkpointDir
dataDirs                No         String (comma-separated list)  ~/.flume/file-channel/data
capacity                No         int                            1000000
keep-alive              No         int                            3 (seconds)
transactionCapacity     No         int                            10000
checkpointInterval      No         long                           30000 (milliseconds)
maxFileSize             No         long                           2146435071 (bytes)
minimumRequiredSpace    No         long                           524288000 (bytes)

To specify the location where the Flume agent should hold data, set the checkpointDir and dataDirs properties:

agent.channels.c1.checkpointDir=/flume/c1/checkpoint
agent.channels.c1.dataDirs=/flume/c1/data

Technically, these properties are not required and have sensible default values for development. However, if you have more than one file channel configured in your agent, only the first channel will start. For production deployments and development work with multiple file channels, you should use distinct directory paths for each file channel storage area and consider placing different channels on different disks to avoid IO contention. Additionally, if you are sizing a large machine, consider using some form of RAID that contains striping (RAID 10, 50, or 60) to achieve higher disk performance rather than buying more expensive 10K or 15K drives or SSDs. If you don't have RAID striping but have multiple disks, set dataDirs to a comma-separated list of each storage location. Using multiple disks will spread the disk traffic almost as well as striped RAID but without the computational overhead associated with RAID 50/60 as well as the 50 percent space waste associated with RAID 10. You'll want to test your system to see whether the RAID overhead is worth the speed difference. As hard drive failures are a reality, you might prefer certain RAID configurations to single disks in order to protect yourself from the data loss associated with single drive failures. RAID 6 across the maximum number of disks can provide the highest performance for the minimal amount of data protection when combined with redundant and reliable power sources (such as an uninterruptable power supply or UPS).
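For instance (the mount points are made up for illustration), spreading one file channel's data across two physical disks might look like this:

agent.channels.c1.dataDirs=/disk1/flume/c1/data,/disk2/flume/c1/data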

Using the JDBC channel is a bad idea as it would introduce a bottleneck and single point of failure in what should be designed as a highly distributed system. NFS storage should be avoided for the same reason.

Tip

Be sure to set the HADOOP_PREFIX and JAVA_HOME environment variables when using the file channel. While we seemingly haven't used anything Hadoop-specific (such as writing to HDFS), the file channel uses Hadoop Writables as an on-disk serialization format. If Flume can't find the Hadoop libraries, you might see this in your startup, so check your environment variables:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

Starting with Flume 1.4, the file channel supports a secondary checkpoint directory. In situations where a failure occurs while writing the checkpoint data, that information could become unusable and a full replay of the logs is necessary in order to recover the state of the channel. As only one checkpoint is updated at a time before flipping to the other, one should always be in a consistent state, thus shortening restart times. Without valid checkpoint information, the Flume agent can't know what has been sent and what has not been sent in dataDirs. As files in the data directories might contain large amounts of data already sent but not yet deleted, a lack of checkpoint information would result in a large number of records being resent as duplicates.

Incoming data from sources is written and acknowledged at the end of the most current file in the data directory. The files in the checkpoint directory keep track of this data when it's taken by a sink.

If you want to use this feature, set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property. For performance reasons, it is always preferred that this be on a different disk from the other directories used by the file channel:

agent.channels.c1.useDualCheckpoints=true
agent.channels.c1.backupCheckpointDir=/flume/c1/checkpoint2

The default file channel capacity is one million events regardless of the size of the event contents. If the channel capacity is reached, a source will no longer be able to ingest the data. This default should be fine for low volume cases. You'll want to size this higher if your ingestion is so heavy that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher still to handle even larger amounts of downtime. At some point, you'll run into the fact that your disk space is a finite resource, so you will have to pick some upper limit (or buy bigger disks).

The keep-alive parameter is similar to memory channels. It is the maximum time the source will wait when trying to write into a full channel before giving up. If space becomes available before the timeout, the write is successful; otherwise, a ChannelException is thrown back to the source.

The transactionCapacity property is the maximum number of events allowed in a single transaction. This might become important for certain sources that batch together events and pass them to the channel in a single call. Most likely, you won't need to change this from the default. Setting this higher allocates additional resources internally, so you shouldn't increase it unless you run into performance issues.

The checkpointInterval property is the number of milliseconds between performing a checkpoint (which also rolls the log files written to logDirs). If you do not set this, 30 seconds will be used.
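For example (an arbitrary value for illustration), checkpointing every minute instead of every 30 seconds would look like this:

agent.channels.c1.checkpointInterval=60000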

Checkpoint files also roll based on the volume of data written to them using the maxFileSize property. You can lower this value for low traffic channels if you want to try and save some disk space. Let's say your maximum file size is 50,000 bytes but your channel only writes 500 bytes a day; it would take 100 days to fill a single log. Let's say that you were on day 100 and 2,000 bytes came in all at once. Some data would be written to the old file and a new file would be started with the overflow. After the roll, Flume tries to remove any log files that aren't needed anymore. As the full log has unprocessed records, it cannot be removed yet. The next chance to clean up that old log file might not come for another 100 days. It probably doesn't matter if that old 50,000 byte file sticks around longer, but as the default is around 2 GB, you could have twice that (4 GB) disk space used per channel. Depending on how much disk you have available and the number of channels configured in your agent, this might or might not be a problem. If your machines have plenty of storage space, the default should be fine.
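As a sketch (the size is arbitrary), capping each data file at roughly 100 MB for a low-traffic channel might look like this:

agent.channels.c1.maxFileSize=104857600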

Finally, the minimumRequiredSpace property is the amount of space you do not want to use for writing logs. The default configuration will throw an exception if you attempt to use the last 500 MB of the disk associated with the dataDir path. This limit applies across all channels, so if you have three file channels configured, the upper limit is still 500 MB and not 1.5 GB. You can set this value as low as 1 MB, but generally speaking, bad things tend to happen when you push disk utilization towards 100 percent.

Spillable Memory Channel

Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart but runs at the speed of your disks (which means orders of magnitude slower).

Note

The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a variable channel speed that changes depending on how downstream entities in your data pipe behave makes for difficult capacity planning. As a memory channel is used under good conditions, this implies that the data contained in it can be lost. So why would I go through extra trouble to save some of it to the disk? The data is either very important for me to spool it to disk with a file-backed channel, or it's less important and can be lost, so I can get away with less hardware and use a faster memory-backed channel. If I really need memory speed but with the capacity of a hard drive, Solid State Drive (SSD) prices have come down enough in recent years for a file channel on SSD to now be a viable option for you rather than using this hybrid channel type. I do not use this channel myself for these reasons.

To use this channel configuration, set the type parameter on your named channel to spillablememory:

agent.channels.c1.type=spillablememory

This defines a Spillable Memory Channel named c1 for the agent named agent.

Hereisatableofconfigurationparametersyoucanadjustfromthedefaultvalues:

Key | Required | Type | Default
type | Yes | String | spillablememory
memoryCapacity | No | int | 10000
overflowCapacity | No | int | 100000000
overflowTimeout | No | int | 3 (seconds)
overflowDeactivationThreshold | No | int | 5 (percent)
byteCapacityBufferPercentage | No | int | 20 (percent)
byteCapacity | No | long (bytes) | 80% of JVM heap
checkpointDir | No | String | ~/.flume/file-channel/checkpoint
dataDirs | No | String (comma-separated list) | ~/.flume/file-channel/data
useDualCheckpoints | No | boolean | false
backupCheckpointDir | No | String | No default, but must be different than checkpointDir
transactionCapacity | No | int | 10000
checkpointInterval | No | long | 30000 (milliseconds)
maxFileSize | No | long | 2146435071 (bytes)
minimumRequiredSpace | No | long | 524288000 (bytes)

As you can see, many of the fields match against the memory channel and file channel's properties, so there should be no surprises here. Let's start with the memory side.

The memoryCapacity property determines the maximum number of events held in memory (this was just called capacity for the memory channel but was renamed here to avoid ambiguity). Also, the default value, if unspecified, is 10,000 records instead of 100. If you wanted to double the default capacity, the configuration might look something like this:

agent.channels.c1.memoryCapacity=20000

As mentioned previously, you will most likely need to increase the Java heap space allocated using the -Xmx and -Xms parameters. Like most Java programs, more memory usually helps the garbage collector run more efficiently, especially as an application such as Flume generates a lot of short-lived objects. Be sure to do your research and pick an appropriate JVM garbage collector based on your available hardware.

Tip: You can set memoryCapacity to zero, which effectively turns it into a file channel. Don't do this. Just use the file channel and be done with it.

The overflowCapacity property determines the maximum number of events that can be written to the disk before an error is thrown back to the source feeding it. The default value is a generous 100,000,000 events (far larger than the file channel's default of 100,000). Most servers nowadays have multiterabyte disks, so space should not be a problem, but do the math against your average event size to be sure you don't fill your disks by accident. If your channels are filling this much, you are probably dealing with another issue downstream, and the last thing you need is your data buffer layer filling up completely. For example, if you had a large 1 megabyte event payload, 100 million of these add up to 100 terabytes, which is probably bigger than the disk space on an average server. A 1 kilobyte payload would only take 100 gigabytes, which is probably fine. Just do the math ahead of time so you are not surprised.

Tip: You can set overflowCapacity to zero, which effectively turns it into a memory channel. Don't do this. Just use the memory channel and be done with it.

The transactionCapacity property adjusts the batch size of events written to a channel in a single transaction. If this is set too low, it will lower the throughput of the agent on high volume flows because of the transaction overhead. For a high volume channel, you will probably need to set this higher, but the only way to be sure is to test your particular workflow. See Chapter 8, Monitoring Flume, to learn how to accomplish this.

Finally, the byteCapacityBufferPercentage and byteCapacity parameters are identical in functionality and defaults to those of the memory channel, so I won't waste your time repeating them here.

What is important is the overflowTimeout property. This is the number of seconds to wait after the memory part of the channel fills before data starts getting written to the disk-backed portion of the channel. If you want writes to start occurring immediately, you can set this to zero. You might wonder why you need to wait before starting to write to the disk portion of the channel. This is where the undocumented overflowDeactivationThreshold property comes into play. This is the amount of time that space has to be available in the memory path before it can switch back from disk writing. I believe this is an attempt to prevent flapping back and forth between the two. Of course, there really are no ordering guarantees in Flume, so I don't know why you would choose to append to the disk buffer if a spot is available in faster memory. Perhaps they are trying to avoid some kind of starvation condition, although the code appears to attempt to remove events in the order of arrival even when using both memory and disk queues. Perhaps it will be explained to us should it ever come out of experimental status.
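
To make the behavior concrete, here is a hedged example that enlarges the memory portion, caps the disk overflow, and starts spilling after one second; the numbers are only for illustration:

agent.channels.c1.type=spillablememory
agent.channels.c1.memoryCapacity=20000
agent.channels.c1.overflowCapacity=5000000
agent.channels.c1.overflowTimeout=1

With this configuration, the channel holds up to 20,000 events in memory; once that fills for a full second, additional events spill to disk, up to five million of them, before the source starts seeing errors.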

Note: The overflowDeactivationThreshold property is stated to be for internal use only, so adjust it at your own peril. If you are considering it, be sure to get familiar with the source code so that you understand the implications of altering the default.

The rest of the properties on this channel are identical in name and functionality to their file channel counterparts, so please refer to the previous section.

Summary

In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.

The memory channel offers speed at the cost of data loss in the event of failure. Alternatively, the file channel provides a more reliable transport in that it can tolerate agent failures and restarts, at a performance cost.

You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel because if you encounter a problem, you can always rerun the import.

Finally, we covered the experimental Spillable Memory Channel. Personally, I think its creation is a bad idea, but like most things in computer science, everybody has an opinion on what is good or bad. I feel that the added complexity and nondeterministic performance make for difficult capacity planning, as you should always size things for the worst-case scenario. If your data is critical enough for you to overflow to the disk rather than discard the events, then you aren't going to be okay with losing even the small amount held in memory.

In the next chapter, we'll look at sinks, specifically, the HDFS sink to write events to HDFS, the ElasticSearch sink to write events to Elasticsearch, and the MorphlineSolr sink to write events to Solr. We will also cover Event Serializers, which specify how Flume events are translated into output that's more suitable for the sink. Finally, we will cover sink processors and how to set up load balancing and failure paths in a tiered configuration for more robust data transport.

Chapter 4. Sinks and Sink Processors

By now, you should have a pretty good idea where the sink fits into the Flume architecture. In this chapter, we will first learn about the most-used sink with Hadoop, the HDFS sink. We will then cover two of the newer sinks that support common Near Real Time (NRT) log processing: the ElasticSearchSink and the MorphlineSolrSink. As you'd expect, the first writes data into Elasticsearch and the latter to Solr. The general architecture of Flume supports many other sinks we won't have space to cover in this book. Some come bundled with Flume and can write to HBase, IRC, and, as we saw in Chapter 2, A Quick Start Guide to Flume, a log4j and file sink. Other sinks are available on the Internet and can be used to write data to MongoDB, Cassandra, RabbitMQ, Redis, and just about any other data store you can think of. If you can't find a sink that suits your needs, you can write one easily by extending the org.apache.flume.sink.AbstractSink class.

HDFS sink

The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.

To use the HDFS sink, set the type parameter on your named sink to hdfs:

agent.sinks.k1.type=hdfs

This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:

agent.sinks.k1.hdfs.path=/path/in/hdfs

This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

absolute | /Users/flume/mydata
absolute with server | hdfs://namenode/Users/flume/mydata
relative | mydata

I prefer to configure any server I'm installing Flume on with a working hadoop command line by setting the fs.default.name property in Hadoop's core-site.xml file. I don't keep persistent data in HDFS user directories but prefer to use absolute paths with some meaningful path name (for example, /logs/apache/access). The only time I would specify a NameNode specifically is if the target was a different Hadoop cluster entirely. This allows you to move configurations you've already tested in one environment into another without unintended consequences, such as your production server writing data to your staging Hadoop cluster because somebody forgot to edit the target in the configuration. I consider externalizing environment specifics a good best practice to avoid situations such as these.

One final required parameter for the HDFS sink, actually any sink, is the channel that it will be doing take operations from. For this, set the channel parameter with the channel name to read from:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

Here is a mostly complete table of configuration parameters you can adjust from the default values:

Key | Required | Type | Default
type | Yes | String | hdfs
channel | Yes | String |
hdfs.path | Yes | String |
hdfs.filePrefix | No | String | FlumeData
hdfs.fileSuffix | No | String |
hdfs.minBlockReplicas | No | int | See the dfs.replication property in your inherited Hadoop configuration; usually 3
hdfs.maxOpenFiles | No | long | 5000
hdfs.closeTries | No | int | 0 (0 = try forever, otherwise a count)
hdfs.retryInterval | No | int | 180 seconds (0 = don't retry)
hdfs.round | No | boolean | false
hdfs.roundValue | No | int | 1
hdfs.roundUnit | No | String (second, minute, or hour) | second
hdfs.timeZone | No | String | Local time
hdfs.useLocalTimeStamp | No | boolean | false
hdfs.inUsePrefix | No | String | Blank
hdfs.inUseSuffix | No | String | .tmp
hdfs.rollInterval | No | long (seconds) | 30 seconds (0 = disable)
hdfs.rollSize | No | long (bytes) | 1024 bytes (0 = disable)
hdfs.rollCount | No | long | 10 (0 = disable)
hdfs.batchSize | No | long | 100
hdfs.codeC | No | String |

Remember to always check the Flume User Guide for the version you are using at http://flume.apache.org/, as things might change between the release of this book and the version you are actually using.

Path and filename

Each time Flume starts a new file at hdfs.path in HDFS to write data into, the filename is composed of the hdfs.filePrefix, a period character, the epoch timestamp at which the file was started, and optionally, a file suffix specified by the hdfs.fileSuffix property (if set), for example:

agent.sinks.k1.hdfs.path=/logs/apache/access

The preceding command would result in a file such as /logs/apache/access/FlumeData.1362945258.

However, in the following configuration, your filenames would be more like /logs/apache/access/access.1362945258.log:

agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log

Over time, the hdfs.path directory will get very full, so you will want to add some kind of time element into the path to partition the files into subdirectories. Flume supports various time-based escape sequences, such as %Y to specify a four-digit year. I like to use sequences in the year/month/day/hour form (so that they are sorted oldest to newest), so I often use this for a path:

agent.sinks.k1.hdfs.path=/logs/apache/access/%Y/%m/%d/%H

This says I want a path like /logs/apache/access/2013/03/10/18/.

Note: For a complete list of time-based escape sequences, see the Flume User Guide.

Another handy escape sequence mechanism is the ability to use Flume header values in your path. For instance, if there was a header with a key of logType, I could split Apache access and error logs into different directories while using the same channel, by escaping the header's key as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%{logType}/%Y/%m/%d/%H

The preceding line of code would result in access logs going to /logs/apache/access/2013/03/10/18/, and error logs going to /logs/apache/error/2013/03/10/18/. However, if I preferred both log types in the same directory path, I could have used logType in my hdfs.filePrefix instead, as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
agent.sinks.k1.hdfs.filePrefix=%{logType}

Obviously, it is possible for Flume to write to multiple files at once. The hdfs.maxOpenFiles property sets the upper limit for how many can be open at once, with a default of 5000. If you should exceed this limit, the oldest file that's still open is closed. Remember that every open file incurs overhead both at the OS level and in HDFS (NameNode and DataNode connections).
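
If you know your header-based paths will only ever fan out to a handful of directories, you might lower this limit so that a runaway header value can't quietly exhaust resources. This is only a hedged illustration of the property; the right value depends entirely on how many distinct paths your configuration can generate:

agent.sinks.k1.hdfs.maxOpenFiles=500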

Another set of properties you might find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining these elements in file paths. Let's say you had a path specification as follows:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H%M

However, if you wanted only four subdirectories per day (at 00, 15, 30, and 45 past the hour, each containing 15 minutes of data), you could accomplish this by setting the following:

agent.sinks.k1.hdfs.round=true
agent.sinks.k1.hdfs.roundValue=15
agent.sinks.k1.hdfs.roundUnit=minute

This would result in logs between 01:15:00 and 01:29:59 on March 10, 2013 being written to files contained in /logs/apache/2013/03/10/0115/. Logs from 01:30:00 to 01:44:59 would be written in files contained in /logs/apache/2013/03/10/0130/.

The hdfs.timeZone property is used to specify the time zone that you want time interpreted for your escape sequences. The default is your computer's local time. If your local time is affected by daylight savings time adjustments, you will have twice as much data when %H == 02 (in the fall) and no data when %H == 02 (in the spring). I think it is a bad idea to introduce time zones into things that are meant for computers to read. I believe time zones are a concern for humans alone and computers should only converse in universal time. For this reason, I set this property on my Flume agents to make the time zone issue just go away:

-Duser.timezone=UTC

If you don't agree, you are free to use the default (local time) or set hdfs.timeZone to whatever you like. The value you pass is used in a call to java.util.TimeZone.getTimeZone(...), so check the Javadocs for acceptable values to be used here.

The other time-related property is the hdfs.useLocalTimeStamp boolean property. By default, its value is false, which tells the sink to use the event's timestamp header when calculating date-based escape sequences in file paths, as shown previously. If you set the property to true, the current system time will be used instead, effectively telling Flume to use the transport arrival time rather than the original event time. You would not set this in cases where HDFS was the final target for the streamed events. This way, delayed events will still be placed correctly (where users would normally look for them) regardless of their arrival time. However, there may be a use case where events are temporarily written to Hadoop and processed in batches, on some interval (perhaps daily). In this case, the transport time would be preferred, so your postprocessing job doesn't need to scan older folders for delayed data.
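
For that batch-oriented case, where arrival time is what you want, the setting is simply (as a hedged example):

agent.sinks.k1.hdfs.useLocalTimeStamp=true

This also means the date-based escape sequences can be resolved even if an event happens to lack a timestamp header.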

Remember that files in HDFS are broken into file blocks that are replicated across the DataNodes. The default number of replicas is usually three (as set in the Hadoop base configuration). You can override this value up or down for this sink with the hdfs.minBlockReplicas property. For example, if I have a data stream that I feel only needs two replicas instead of three, I can override this as follows:

agent.sinks.k1.hdfs.minBlockReplicas=2

Note: Don't set the minimum replica count higher than the number of data nodes you have, otherwise you'll create a degraded-state HDFS. You also don't want to set it so high that a downed box for maintenance would trigger this situation. Personally, I've never set this higher than the default of three, but I have set it lower on less important data in order to save space.

Finally, while files are being written to HDFS, a .tmp extension is added. When the file is closed, the extension is removed. You can change the extension used by setting the hdfs.inUseSuffix property, but I've never had a reason to do so:

agent.sinks.k1.hdfs.inUseSuffix=flumeiswriting

This allows you to see which files are being written to simply by looking at a directory listing in HDFS. As you typically specify a directory for input in your MapReduce job (or because you are using Hive), the temporary files will often be picked up as empty or garbled input by mistake. To avoid having your temporary files picked up before being closed, set the prefix to either a dot or an underscore character as follows:

agent.sinks.k1.hdfs.inUsePrefix=_

That said, there are occasions where files were not closed properly due to some HDFS glitch, so you might see files with the in-use prefix/suffix that haven't been used in some time. A few new properties were added in Version 1.5 to change the default behavior of closing files. The first is the hdfs.closeTries property. The default of zero actually means "try forever", so it is a little confusing. Setting it to 4 means try 4 times before giving up. You can adjust the interval between retries by setting the hdfs.retryInterval property. Setting it too low could swamp your NameNode with too many requests, so be careful if you lower this from the default of 3 minutes. Of course, if you are opening files too quickly, you might need to lower this just to keep from going over the hdfs.maxOpenFiles setting, which was covered previously. If you actually didn't want any retries, you can set hdfs.retryInterval to zero seconds (again, not to be confused with closeTries=0, which means try forever). Hopefully, in a future version, they will use the more commonly used convention of a negative number (usually, -1) when infinite is desired.
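
A hedged example of bounding the close retries might look like this; the exact numbers depend on how busy your NameNode is:

agent.sinks.k1.hdfs.closeTries=4
agent.sinks.k1.hdfs.retryInterval=60

This attempts to close each stuck file four times, waiting 60 seconds between attempts, instead of retrying forever on the default 3-minute interval.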

File rotation

By default, Flume will rotate actively written-to files every 30 seconds, 10 events, or 1024 bytes. This is done by setting the hdfs.rollInterval, hdfs.rollCount, and hdfs.rollSize properties, respectively. One or more of these can be set to zero to disable that particular rolling mechanism. For instance, if you only wanted a time-based roll of 1 minute, you would set the following:

agent.sinks.k1.hdfs.rollInterval=60
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.rollSize=0

If your output contains any amount of header information, the HDFS size per file can be larger than what you expect, because the hdfs.rollSize rotation scheme only counts the event body length. Clearly, you might not want to disable all three mechanisms for rotation at the same time, or you will have one directory in HDFS overflowing with files.

Finally, a related parameter is hdfs.batchSize. This is the number of events that the sink will read per transaction from the channel. If you have a large volume of data in your channel, you might see a performance increase by setting this higher than the default of 100, which decreases the transaction overhead per event.
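
For instance, a busy channel might benefit from pulling a thousand events per transaction; treat the number as a starting point to test rather than a recommendation:

agent.sinks.k1.hdfs.batchSize=1000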

Now that we've discussed the way files are managed and rolled in HDFS, let's look into how the event contents get written.

Compression codecs

Codecs (Coder/Decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.

If you want the HDFS sink to write compressed files, specify the compression by setting the hdfs.codeC property. The property is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:

agent.sinks.k1.hdfs.codeC=gzip

The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand, using snappy or lzo results in faster compression performance but a lower compression ratio. Keep in mind that the splittability of the file, especially if you are using plain text files, will greatly affect the performance of your MapReduce jobs. Go pick up a copy of Hadoop Beginner's Guide, Garry Turkington, Packt Publishing (http://amzn.to/14Dh6TA) or Hadoop: The Definitive Guide, Tom White, O'Reilly (http://amzn.to/16OsfIf) if you aren't sure what I'm talking about.

Event Serializers

An Event Serializer is the mechanism by which a Flume event is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.

Text output

As mentioned previously, the default serializer is the text serializer. This will output only the Flume event body, with the headers discarded. Each event has a newline character appended unless you override this default behavior by setting the serializer.appendNewLine property to false.

Key | Required | Type | Default
serializer | No | String | text
serializer.appendNewLine | No | boolean | true
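
If, say, your events already end in a newline and you don't want a blank line between records, a hedged example of keeping the text serializer while disabling the extra newline would be:

agent.sinks.k1.serializer=text
agent.sinks.k1.serializer.appendNewLine=false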

Text with headers

The text_with_headers serializer allows you to save the Flume event headers rather than discard them. The output format consists of the headers, followed by a space, then the body payload, and finally, terminated by an optionally disabled newline character, for instance:

{key1=value1, key2=value2} body text here

Key | Required | Type | Default
serializer | No | String | text_with_headers
serializer.appendNewLine | No | boolean | true

Apache Avro

The Apache Avro project (http://avro.apache.org/) provides a serialization format that is similar in functionality to Google Protocol Buffers but is more Hadoop friendly, as the container is based on Hadoop's SequenceFile and has some MapReduce integration. The format is also self-describing using JSON, making for a good long-term data storage format, as your data format might evolve over time. If your data has a lot of structure and you want to avoid turning it into Strings, only to then parse them in your MapReduce job, you should read more about Avro to see whether you want to use it as a storage format in HDFS.

The avro_event serializer creates Avro data based on the Flume event schema. It has no formatting parameters, as Avro dictates the format of the data, and the structure of the Flume event dictates the schema used:

Key | Required | Type | Default
serializer | No | String | avro_event
serializer.compressionCodec | No | String (gzip, bzip2, lzo, or snappy) |
serializer.syncIntervalBytes | No | int (bytes) | 2048000 (bytes)

If you want your data compressed before being written to the Avro container, you should set the serializer.compressionCodec property to the file extension of an installed codec. The serializer.syncIntervalBytes property determines the size of the data buffer used before flushing the data to HDFS, and therefore, this setting can affect your compression ratio when using a codec. Here is an example using snappy compression on Avro data with a 4 MB buffer:

agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=snappy
agent.sinks.k1.serializer.syncIntervalBytes=4194304
agent.sinks.k1.hdfs.fileSuffix=.avro

For Avro files to work in an Avro MapReduce job, they must end in .avro or they will be ignored as input. For this reason, you need to explicitly set the hdfs.fileSuffix property. Furthermore, you would not set the hdfs.codeC property on an Avro file.

User-provided Avro schema

If you want to use a different schema from the Flume event schema used with the avro_event type, starting in Version 1.4, the closely named AvroEventSerializer will let you do this. Keep in mind that with this implementation, only the event's body is serialized and headers are not passed on.

Set the serializer type to the fully qualified org.apache.flume.sink.hdfs.AvroEventSerializer class name:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer

Unlike the other serializers that take additional parameters in the Flume configuration file, this one requires that you pass the schema information via a Flume header. This is a byproduct of one of the Avro-aware sources we'll see in Chapter 6, Interceptors, ETL, and Routing, where schema information is sent from the source to the final destination via the event header. You can fake this if you are using a source that doesn't set these headers by using a static header interceptor. We'll talk more about interceptors in Chapter 6, Interceptors, ETL, and Routing, so flip back to this part later on.

To specify the schema directly in the Flume configuration file, use the flume.avro.schema.literal header as shown in this example (using a map-of-strings schema):

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sinks.k1.interceptors=i1
agent.sinks.k1.interceptors.i1.type=static
agent.sinks.k1.interceptors.i1.key=flume.avro.schema.literal
agent.sinks.k1.interceptors.i1.value="{\"type\":\"map\",\"values\":\"string\"}"

If you prefer to put the schema file in HDFS, use the flume.avro.schema.url header instead, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sinks.k1.interceptors=i1
agent.sinks.k1.interceptors.i1.type=static
agent.sinks.k1.interceptors.i1.key=flume.avro.schema.url
agent.sinks.k1.interceptors.i1.value=hdfs://path/to/schema.avsc

Actually, in this second form, you can pass any URL, including a file:// URL, but this would indicate a file local to where you are running the Flume agent, which might create additional setup work for your administrators. This is also true of configuration served up by an HTTP web server or farm. Rather than creating additional setup dependencies, just use the dependency you cannot remove, which is HDFS, using an hdfs:// URL.

Be sure to set either the flume.avro.schema.literal header or the flume.avro.schema.url header, but not both.

File type

By default, the HDFS sink writes data to HDFS as Hadoop's SequenceFile. This is a common Hadoop wrapper that consists of a key and a value field separated by binary field and record delimiters. Usually, text files on a computer make assumptions such as a newline character terminating each record. So, what do you do if your data contains a newline character, such as some XML? Using a sequence file can solve this problem because it uses nonprintable characters for delimiters. Sequence files are also splittable, which makes for better locality and parallelism when running MapReduce jobs on your data, especially on large files.

SequenceFile

When using a SequenceFile file type, you need to specify how you want the key and value to be written on the record in the SequenceFile. The key on each record will always be a LongWritable type and will contain the current timestamp, or if the timestamp event header is set, it will be used instead. By default, the format of the value is an org.apache.hadoop.io.BytesWritable type, which corresponds to the byte[] Flume body:

Key | Required | Type | Default
hdfs.fileType | No | String | SequenceFile
hdfs.writeFormat | No | String | writable

However, if you want the payload interpreted as a String, you can override the hdfs.writeFormat property, so org.apache.hadoop.io.Text will be used as the value field:

Key | Required | Type | Default
hdfs.fileType | No | String | SequenceFile
hdfs.writeFormat | No | String | text
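
In configuration form, the Text variant of the preceding table would look like this (a hedged sketch using the values shown in the table):

agent.sinks.k1.hdfs.fileType=SequenceFile
agent.sinks.k1.hdfs.writeFormat=text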

DataStream

If you do not want to output a SequenceFile file because your data doesn't have a natural key, you can use a DataStream to output only the uncompressed value. Simply override the hdfs.fileType property:

agent.sinks.k1.hdfs.fileType=DataStream

This is the file type you would use with Avro serialization, as any compression should have been done in the Event Serializer. To serialize gzip-compressed Avro files, you would set these properties:

agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=gzip
agent.sinks.k1.hdfs.fileType=DataStream
agent.sinks.k1.hdfs.fileSuffix=.avro

CompressedStream

A CompressedStream is similar to a DataStream, except that the data is compressed when it's written. You can think of this as running the gzip utility on an uncompressed file, but all in one step. This differs from a compressed Avro file, whose contents are compressed and then written into an uncompressed Avro wrapper:

agent.sinks.k1.hdfs.fileType=CompressedStream

Remember that only certain compressed formats are splittable in MapReduce, should you decide to use CompressedStream. The compression algorithm selection doesn't have a Flume configuration but is dictated by the zlib.compress.strategy and zlib.compress.level properties in core Hadoop instead.

Timeouts and workers

Finally, there are two miscellaneous properties related to timeouts and two for worker pools that you can change:

Key | Required | Type | Default
hdfs.callTimeout | No | long (milliseconds) | 10000
hdfs.idleTimeout | No | int (seconds) | 0 (0 = disable)
hdfs.threadsPoolSize | No | int | 10
hdfs.rollTimerPoolSize | No | int | 1

The hdfs.callTimeout property is the amount of time the HDFS sink will wait for HDFS operations to return a success (or failure) before giving up. If your Hadoop cluster is particularly slow (for instance, a development or virtual cluster), you might need to set this value higher in order to avoid errors. Keep in mind that your channel will overflow if you cannot sustain higher write throughput than the input rate of your channel.
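
For example, on a slow virtual cluster you might triple the default timeout; this is only an illustration of the property, not a tuning recommendation:

agent.sinks.k1.hdfs.callTimeout=30000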

The hdfs.idleTimeout property, if set to a nonzero value, is the time Flume will wait before automatically closing an idle file. I have never used this, as hdfs.rollInterval handles the closing of files for each roll period, and if the channel is idle, it will not open a new file. This setting seems to have been created as an alternative roll mechanism to the size, time, and event count mechanisms that have already been discussed. You might want as much data written to a file as possible and only close it when there really is no more data. In this case, you can use hdfs.idleTimeout to accomplish this rotation scheme if you also set hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount to zero.
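
That idle-only rotation scheme would look something like this sketch, here closing a file after five minutes with no new data:

agent.sinks.k1.hdfs.rollInterval=0
agent.sinks.k1.hdfs.rollSize=0
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.idleTimeout=300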

The first property you can set to adjust the number of workers is hdfs.threadsPoolSize, and it defaults to 10. This is the maximum number of files that can be written to at the same time. If you are using event headers to determine file paths and names, you might have more than 10 files open at once, but be careful when increasing this value too much so as not to overwhelm HDFS.

The last property related to worker pools is hdfs.rollTimerPoolSize. This is the number of workers processing timeouts set by the hdfs.idleTimeout property. The amount of work to close the files is pretty small, so increasing this value from the default of one worker is unlikely to be necessary. If you do not use a rotation based on hdfs.idleTimeout, you can ignore the hdfs.rollTimerPoolSize property, as it is not used.

Sink groups

In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.

There is a default sink processor that contains a single sink, which is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.

In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. Similar to sources, channels, and sinks, you prefix the property with the agent name:

agent.sinkgroups=sg1

Here, we have defined a sink group called sg1 for the agent named agent.

For each named sink group, you need to specify the sinks it contains using the sinks property, consisting of a space-delimited list of sink names:

agent.sinkgroups.sg1.sinks=k1 k2

This defines that the k1 and k2 sinks are part of the sg1 sink group for the agent named agent.

Often, sink groups are used in conjunction with the tiered movement of data to route around failures. However, they can also be used to write to different Hadoop clusters, as even a well-maintained cluster has periodic maintenance.

Load balancing

Continuing the preceding example, let's say you want to load balance traffic to k1 and k2 evenly. There are some additional properties you need to specify, as listed in this table:

Key | Type | Default
processor.type | String | load_balance
processor.selector | String (round_robin, random) | round_robin
processor.backoff | boolean | false

When you set processor.type to load_balance, round robin selection will be used, unless otherwise specified by the processor.selector property. This can be set to either round_robin or random. You can also specify your own load balancing selector mechanism, which we won't cover here. Consult the Flume documentation if you need this custom control.

The processor.backoff property specifies whether an exponential backoff should be used when retrying a sink that threw an exception. The default is false, which means that after a thrown exception, the sink will be tried again the next time its turn is up based on round robin or random selection. If set to true, then the wait time for each failure is doubled, starting at 1 second up to a limit of around 18 hours (2^16 seconds).

Note: In an earlier version of Flume, the default in the code for processor.backoff was stated as false, but the documentation stated it as true. This error has been fixed; however, it may save you a headache to specify what you want for property settings rather than relying on the defaults.
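
Pulling the pieces together, a hedged end-to-end example of a load-balancing group over the two sinks defined earlier might read:

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=load_balance
agent.sinkgroups.sg1.processor.selector=random
agent.sinkgroups.sg1.processor.backoff=true

Here, events are sent to k1 or k2 at random, and a sink that throws an exception is backed off exponentially rather than retried on its very next turn.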

Failover

If you would rather try one sink and, if that one fails, try another, then you want to set processor.type to failover. Next, you'll need to set additional properties to specify the order, by setting the processor.priority property followed by the sink name:

Key | Type | Default
processor.type | String | failover
processor.priority.NAME | int |
processor.maxpenalty | int (milliseconds) | 30000

Let's look at the following example:

agent.sinkgroups.sg1.sinks=k1 k2 k3
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=20
agent.sinkgroups.sg1.processor.priority.k3=20

Lower priority numbers come first, and in the case of a tie, order is arbitrary. You can use any numbering system that makes sense to you (by ones, fives, tens, whatever). In this example, the k1 sink will be tried first, and if an exception is thrown, either k2 or k3 will be tried next. If k3 was selected first for trial and it failed, k2 will still be tried. If all sinks in the sink group fail, the transaction with the channel is rolled back.

Finally, processor.maxpenalty sets an upper limit to the exponential backoff for failed sinks in the group. After the first failure, it will be 1 second before the sink can be used again. Each subsequent failure doubles the wait time until processor.maxpenalty is reached.

MorphlineSolrSink

HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data, so full text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.

Like most streaming data flows, you not only transport the data, but you also often reformat it into a form more consumable to the target of the flow. Typically, this is done in a Flume-only workflow by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data, instead of interceptors.

Internally, each Flume event is converted into a Morphline record and passed to the first command in the Morphline command chain. A record can be thought of as a set of key/value pairs with string keys and arbitrary object values. Each of the Flume headers is passed as a record field with the same header keys. A special record field key, _attachment_body, is used for the Flume event body. Keep in mind that the body is still a byte array (Java byte[]) at this point and must be specifically processed in the Morphline command chain.

Each command processes the record in turn, passing the output to the input of the next command in line, with the final command responsible for terminating the flow. In many ways, it is similar in functionality to Flume's Interceptor functionality, which we'll see in Chapter 6, Interceptors, ETL, and Routing. In the case of writing to Solr, we use the loadSolr command to convert the Morphline record into a SolrDocument and write it to the Solr cluster.

Morphline configuration files

Morphline configuration files use the HOCON format, which is similar to JSON but has a less strict syntax, making them less error-prone when used for configuration files than JSON.

Note: HOCON is an acronym for Human Optimized Configuration Object Notation. You can read more about HOCON on this GitHub page: https://github.com/typesafehub/config/blob/master/HOCON.md

The configuration file contains a single key with the morphlines value. The value is an array of Morphline configurations. Each individual entry is comprised of three keys:

id
importCommands
commands

If your configuration contains multiple Morphlines, the value of id must be provided to the Flume sink by way of the morphlineId property. The value of importCommands specifies the Java classes to import when the Morphline is evaluated. The double star indicates that all paths and classes from that point in the package hierarchy should be included. All classes that implement com.cloudera.cdk.morphline.api.CommandBuilder are interrogated for their names via the getNames() method. These names are the command names you use in the next section. Don't worry; you don't need to sift through the source code to find them, as they have a well-documented reference guide online. Finally, the commands key references a list of command dictionaries. Each command dictionary has a single key consisting of the name of the Morphline command followed by its specific properties.

Note: For a list of Morphline commands and associated configuration properties, see the reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Here is what a skeleton configuration file might look like:

morphlines: [
  {
    id: transform_my_data
    importCommands: [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands: [
      {
        COMMAND_NAME1: {
          property1: value1
          property2: value2
        }
      }
      {
        COMMAND_NAME2: {
          property1: value1
        }
      }
    ]
  }
]

Typical SolrSink configuration

Here is the preceding skeleton configuration applied to our Solr use case. This is not meant to be complete, but it is sufficient to discuss the flow described previously:

morphlines: [
  {
    id: solr_flow
    importCommands: [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands: [
      {
        readLine: {
          charset: UTF-8
        }
      }
      {
        grok: {
          GROK_PROPERTIES_HERE
        }
      }
      {
        loadSolr: {
          solrLocator: {
            collection: my_collection
            zkHost: "solr.example.com:2181/solr"
          }
        }
      }
    ]
  }
]

You can see the same boilerplate configuration where we define a single Morphline with the solr_flow identifier. The command sequence starts with the readLine command. This simply reads the event body from the _attachment_body field and converts byte[] to String using the configured encoding (in this case, UTF-8). The resulting String value is set to the field with the key message. The next command in the sequence, the grok command, uses regular expressions to extract additional fields to make a more interesting SolrDocument. I couldn't possibly do this command justice by trying to explain everything you can do with it. For that, please see the Kite SDK documentation.

Note: See the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

Suffice to say, grok lets me take a web server log line such as this:

10.4.240.176 - - [14/Mar/2014:12:02:17 -0500] "POST http://mysite.com/do_stuff.php HTTP/1.1" 500 834

Then, it lets me turn it into more structured data like this:

{
  ip: 10.4.240.176
  timestamp: 1413306137
  method: POST
  url: http://mysite.com/do_stuff.php
  protocol: HTTP/1.1
  status_code: 500
  length: 834
}

If you wanted to search for all the times this page threw a 500 status code, having these fields broken out makes the task easy for Solr.

Finally, we call the loadSolr command to insert the record into our Solr cluster. The solrLocator property indicates the target Solr cluster (by way of its Zookeeper server(s)) and the data collection to write these documents into.

Sink configuration

Now that you have a basic idea of how to create a Morphline configuration file, let's apply this to the actual sink configuration.

The following table details the sink's parameters and default values:

Key | Required | Type | Default
type | Yes | String | org.apache.flume.sink.solr.morphline.MorphlineSolrSink
channel | Yes | String |
morphlineFile | Yes | String |
morphlineId | No | String | Required if the Morphline configuration file contains more than one Morphline
batchSize | No | int | 1000
batchDurationMillis | No | long | 1000 (milliseconds)
handlerClass | No | String | org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl

The MorphlineSolrSink does not have a short type alias, so set the type parameter on your named sink to org.apache.flume.sink.solr.morphline.MorphlineSolrSink:

agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink

This defines a MorphlineSolrSink named k1 for the agent named agent.

The next required parameter is the channel property. This specifies which channel to read events from for processing:

agent.sinks.k1.channel=c1

This tells the k1 sink to read events from the c1 channel.

The only other required parameter is the relative or absolute path to the Morphline configuration file. This cannot be a path in HDFS; it must be accessible on the server the Flume agent is running on (local disk, NFS disk, and so on).

To specify the configuration file path, set the morphlineFile property:

agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf

As a Morphline configuration file can contain multiple Morphlines, you must specify the identifier if more than one exists, using the morphlineId property:

agent.sinks.k1.morphlineId=transform_my_data

The next two properties are fairly common among sinks. They specify how many events to remove at a time for processing, also known as a batch. The batchSize property defaults to 1000 events, but you might need to set this higher if you aren't consuming events from the channel faster than they are being inserted. Clearly, you can only increase this so much, as the thing you are writing to, in this case, Solr, will have some record consumption limit. Only through testing will you be able to stress your systems to see where the limits are.

The related batchDurationMillis property specifies the maximum time to wait before the sink proceeds with the processing when fewer than the batchSize number of events have been read. The default value is 1 second and is specified in milliseconds in the configuration properties. In a situation with a light data flow (using the defaults, less than 1,000 records per second), setting batchDurationMillis higher can make things worse. For instance, if you are using a memory channel with this sink, your Flume agent could be sitting there with data to write to the sink's target but waiting for more, only for a crash to happen, resulting in lost data. That said, your downstream entity might perform better on larger batches, which might push both these configuration values higher, so there is no universally correct answer. Start with the defaults if you are unsure, and use hard data that you'll collect using techniques in Chapter 8, Monitoring Flume, to adjust based on facts and not guesses.
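
If testing showed Solr comfortably keeping up, a hedged tuning sketch that doubles both batch knobs might look like this; treat the values as placeholders to be validated with your own monitoring data:

agent.sinks.k1.batchSize=2000
agent.sinks.k1.batchDurationMillis=2000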

Finally, you should never need to touch the handlerClass property unless you plan to write an alternate implementation of the Morphline processing class. As there is only one Morphline engine implementation to date, I'm not really sure why this is a documented property in Flume. I'm just mentioning it for completeness.

ElasticSearchSink

Another common target to stream data to, for searching in NRT, is Elasticsearch. Elasticsearch is also a clustered searching platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).

Note: Here are the project home pages for the ELK stack that can give you a much better overview than I can in a few short pages:
Elasticsearch: http://elasticsearch.org/
Logstash: http://logstash.net/
Kibana: http://www.elasticsearch.org/overview/kibana/

In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database; so, each Flume event will become a single document in Elasticsearch. Documents have one or more fields (just like columns in a database).

This is by no means a complete introduction to Elasticsearch, but it should be enough to get you started, assuming you already have an Elasticsearch cluster at your disposal. As events get mapped to documents by the sink's serializer, the actual sink configuration needs only a few configuration items: where the cluster is located, which index to write to, and what type the record is.

This table summarizes the settings for the ElasticSearchSink:

Key | Required | Type | Default
type | Yes | String | org.apache.flume.sink.elasticsearch.ElasticSearchSink
hostNames | Yes | String | A comma-separated list of Elasticsearch nodes to connect to. If the port is specified, use a colon after the name. The default port is 9300.
clusterName | No | String | elasticsearch
indexName | No | String | flume
indexType | No | String | log
ttl | No | String | Defaults to never expire. Specify the number and unit (5m = 5 minutes).
batchSize | No | int | 100

With this information in mind, let's start by setting the sink's type property:

agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink

Next, we need to set the list of servers and ports to establish connectivity, using the hostNames property. This is a comma-separated list of hostname:port pairs. If you are using the default port of 9300, you can just specify the server name or IP, for example:

agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345

Now that we can communicate with the Elasticsearch servers, we need to tell them which cluster, index, and type to write our documents to. The cluster is specified using the clusterName property. This corresponds with the cluster.name property in Elasticsearch's elasticsearch.yml configuration file. It needs to be specified, as an Elasticsearch node can participate in more than one cluster. Here is how I would specify a nondefault cluster name called production:

agent.sinks.k1.clusterName=production

The indexName property is really a prefix used to create a daily index. This keeps any single index from becoming too large over time. If you use the default index name, the index for September 30, 2014 will be named flume-2014-09-30.

Lastly, the indexType property specifies the Elasticsearch type. If unspecified, the log default value will be used.
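
As a hedged example, if these Apache web logs were to be indexed under their own name and type (weblogs and apache_access are hypothetical values, not defaults):

agent.sinks.k1.indexName=weblogs
agent.sinks.k1.indexType=apache_access

With the daily suffix added by the sink, documents written on September 30, 2014 would then land in an index named weblogs-2014-09-30.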

By default, data written into Elasticsearch will never expire. If you want the data to automatically expire, you can specify a time-to-live value on the records with the ttl property. Values are either a bare number, interpreted as milliseconds, or a number with units. The units are given in this table:

Unit string | Definition | Example
ms | Milliseconds | 5ms = 5 milliseconds
not specified | Milliseconds | 10000 = 10 seconds
m | Minutes | 10m = 10 minutes
h | Hours | 1h = 1 hour
d | Days | 7d = 7 days
w | Weeks | 4w = 4 weeks

Keep in mind that you also need to enable the TTL feature on the Elasticsearch cluster, as it is disabled by default. See the Elasticsearch documentation for how to do this.

Finally, like the HDFS sink, the batchSize property is the number of events per transaction that the sink will read from the channel. If you have a large volume of data in your channel, you should see a performance increase by setting this higher than the default of 100, due to the reduced overhead per transaction.

The sink's serializer does the work of transforming the Flume event into the Elasticsearch document. There are two Elasticsearch serializers that come packaged with Flume, neither of which has additional configuration properties, since they mostly use existing headers to dictate field mappings.

We'll see more of this sink in action in Chapter 7, Putting It All Together.

LogStash Serializer

The default serializer, if not specified, is the ElasticSearchLogStashEventSerializer:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

It writes data in the same format that Logstash uses in conjunction with Kibana. Here is a table of the commonly used fields and their associated mappings from Flume events:

Elasticsearch field | Taken from the Flume header | Notes
@timestamp | timestamp | From the header, if present
@source | source | From the header, if present
@source_host | source_host | From the header, if present
@source_path | source_path | From the header, if present
@type | type | From the header, if present
@host | host | From the header, if present
@fields | all headers | A dictionary of all Flume headers, including the ones that might have been mapped to the other fields, such as the host
@message | Flume body |

While you might think the document's @type field will be automatically set to the sink's indexType configuration property, you'd be incorrect. If you had only one type of log, it would be wasteful to write this over and over again for every document. However, if you had more than one log type going through your Flume channel, you can designate its type in Elasticsearch using the Static interceptor we'll see in Chapter 6, Interceptors, ETL, and Routing, to set the type (or @type) Flume header on the event.

Dynamic Serializer

Another serializer is the ElasticSearchDynamicSerializer serializer. If you use this serializer, the event's body is written to a field called body. All other Flume header keys are used as field names. Clearly, you want to avoid having a Flume header key called body, as this will conflict with the actual event's body when transformed into the Elasticsearch document. To use this serializer, specify the fully qualified class name, as shown in this example:

agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer

For completeness, here is a table that shows you the breakdown of how Flume headers and body get mapped to Elasticsearch fields:

Flume entity | Elasticsearch field
All headers | Same as the Flume headers
Body | body

As the version of Elasticsearch can be different for each user, Flume doesn't package the Elasticsearch client and corresponding Lucene libraries. Find out from your administrator which versions should be included on the Flume classpath, or check out the Maven pom.xml file on GitHub for the corresponding version tag or branch at https://github.com/elasticsearch/elasticsearch/blob/master/pom.xml. Make sure the library versions used by Flume match with Elasticsearch or you might see serialization errors.

Note: As Solr and Elasticsearch have similar capabilities, check out Kelvin Tan's appropriately named side-by-side detailed feature breakdown web page. It should help get you started with what is most appropriate for your specific use case: http://solr-vs-elasticsearch.com/

Summary

In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or the contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.

Compression was discussed as a means to reduce storage requirements in HDFS, and should be used when possible. Besides storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.

Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization (with optional compression).

Next, various file formats, including sequence files (Hadoop key/value files), Data Streams (uncompressed data files, like Avro containers), and Compressed Data Streams, were discussed.

Next, we covered sink groups as a means to route events to different sinks using load balancing or failover paths, which can be used to eliminate single points of failure in routing data to its destination.

Finally, we covered two new sinks added in Flume 1.4 to write data to Apache Solr and Elasticsearch in a Near Real Time (NRT) way. For years, MapReduce jobs have served us well, and will continue to do so, but sometimes it still isn't fast enough to search large datasets quickly and look at things from different angles without reprocessing data. Kite SDK Morphlines were also introduced as a way to prepare data for writing to Solr. We will revisit Morphlines again in Chapter 6, Interceptors, ETL, and Routing, when we look at a Morphline-powered interceptor.

In the next chapter, we will discuss various input mechanisms (sources) that will feed your configured channels, which were covered back in Chapter 3, Channels.

Chapter 5. Sources and Channel Selectors

Now that we have covered channels and sinks, we will cover some of the more common ways to get data into your Flume agents. As discussed in Chapter 1, Overview and Architecture, the source is the input point for the Flume agent. There are many sources available with the Flume distribution, as well as many open source options available. Like most open source software, if you can't find what you need, you can always write your own by extending the org.apache.flume.source.AbstractSource class. Since the primary focus of this book is ingesting files of logs into Hadoop, we'll cover a few of the more appropriate sources to accomplish this.

The problem with using tail

If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).

As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back" as rollback semantics dictate.

Furthermore, if the rate of data written to a file exceeds the rate at which Flume can read the data, it is possible to lose one or more log files of input outright. For example, say you were tailing /var/log/app.log. When that file reaches a certain size, it is rotated, or renamed, to /var/log/app.log.1, and a new file called /var/log/app.log is created. Let's say you had a favorable review in the press and your application logs are much higher than usual. Flume may still be reading from the rotated file (/var/log/app.log.1) when another rotation occurs, moving /var/log/app.log to /var/log/app.log.1. The file Flume is reading is now renamed to /var/log/app.log.2. When Flume finishes with this file, it will move to what it thinks is the next file (/var/log/app.log), thus skipping the file that now resides at /var/log/app.log.1. This kind of data loss would go completely unnoticed and is something we want to avoid if possible.

For these reasons, it was decided to remove the tail functionality from Flume when it was refactored. There are some workarounds for TailSource after its removal, but it should be noted that no workaround can eliminate the possibility of data loss under load that occurs under these conditions.

The Exec source

The Exec source provides a mechanism to run a command outside Flume and then turn the output into Flume events. To use the Exec source, set the type property to exec:

agent.sources.s1.type=exec

All sources in Flume are required to specify the list of channels to write events to using the channels (plural) property. This is a space-separated list of one or more channel names:

agent.sources.s1.channels=c1

The only other required parameter is the command property, which tells Flume what command to pass to the operating system. Here is an example of the use of this property:

agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/app.log

Here, I have configured a single source s1 for an agent named agent. The source, an Exec source, will tail the /var/log/app.log file and follow any rotations that outside applications may perform on that log file. All events are written to the c1 channel. This is an example of one of the workarounds for the lack of TailSource in Flume 1.x. This is not my preferred workaround to use tail, but just a simple example of the exec source type. I will show my preferred method in Chapter 7, Putting It All Together.

Note: Should you use the tail -F command in conjunction with the Exec source, it is probable that the forked process will not shut down 100 percent of the time when the Flume agent shuts down or restarts. This will leave orphaned tail processes that will never exit. The tail -F command, by definition, has no end. Even if you delete the file being tailed (at least in Linux), the running tail process will keep the file handle open indefinitely. This keeps the file's space from actually being reclaimed until the tail process exits, which won't happen. I think you are beginning to see why Flume developers don't like tailing files.

If you go this route, be sure to periodically scan the process tables for tail -F processes whose parent PID is 1. These are effectively dead processes and need to be killed manually.

Here is a list of other properties you can use with the Exec source:

Key | Required | Type | Default
type | Yes | String | exec
channels | Yes | String | Space-separated list of channels
command | Yes | String |
shell | No | String | Shell command
restart | No | boolean | false
restartThrottle | No | long (milliseconds) | 10000 (milliseconds)
logStdErr | No | boolean | false
batchSize | No | int | 20
batchTimeout | No | long | 3000 (milliseconds)

Not every command keeps running, either because it fails (for example, when the channel it is writing to is full) or because it is designed to exit immediately. In this example, we want to record the system load via the Linux uptime command, which prints out some system information to stdout and exits:

agent.sources.s1.command=uptime

This command will immediately exit, so you can use the restart and restartThrottle properties to run it periodically:

agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

This will produce one event per minute. In the tail example, should the channel fill, causing the Exec source to fail, you can use these properties to restart the Exec source. In this case, setting the restart property will start the tailing of the file from the beginning of the current file, thus producing duplicates. Depending on how long the restartThrottle property is, you may have missed some data due to a file rotation outside Flume. Furthermore, the channel may still be unable to accept data, in which case the source will fail again. Setting this value too low means giving less time to the channel to drain, and unlike some of the sinks we saw, there is no option for exponential backoff.

If you need to use shell-specific features such as wildcard expansion, you can set the shell property, as in this example:

agent.sources.s1.command=grep -i apache lib/*.jar | wc -l
agent.sources.s1.shell=/bin/bash -c
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000

This example will find the number of times the case-insensitive apache string is found in all the JAR files in the lib directory. Once per minute, that count will be sent as a Flume event payload.

While the command output written to stdout becomes the Flume event body, errors are sometimes written to stderr. If you want these lines included in the Flume agent's system logs, set the logStdErr property to true. Otherwise, they will be silently ignored, which is the default behavior.

Finally, you can specify the number of events to write per transaction by changing the batchSize property. You may need to set this value higher than the default of 20 if your input data is large and you realize that you cannot write to your channel fast enough. Using a higher batch size reduces the overall average transaction overhead per event. Testing with different values and monitoring the channel's put rate is the only way to know this for sure. The related batchTimeout property sets the maximum time to wait, when fewer than the batch size's number of records have been seen, before flushing a partial batch to the channel. The default setting for this is 3 seconds (specified in milliseconds).
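
A hedged illustration of both batching properties together, for a chatty command whose channel can keep up, might be:

agent.sources.s1.batchSize=100
agent.sources.s1.batchTimeout=1000

This writes up to 100 events per transaction, flushing whatever has accumulated after one second if the batch hasn't filled.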

SpoolingDirectorySourceInanefforttoavoidalltheassumptionsinherentintailingafile,anewsourcewasdevisedtokeeptrackofwhichfileshavebeenconvertedintoFlumeeventsandwhichstillneedtobeprocessed.TheSpoolingDirectorySourceisgivenadirectorytowatchfornewfilesappearing.Itisassumedthatfilescopiedtothisdirectoryarecomplete.Otherwise,thesourcemighttryandsendapartialfile.Italsoassumesthatfilenamesneverchange.Otherwise,onrestarting,thesourcewouldforgetwhichfileshavebeensentandwhichhavenot.Thefilenameconditioncanbemetinlog4jusingDailyRollingFileAppenderratherthanRollingFileAppender.However,thecurrentlyopenfilewouldneedtobewrittentoonedirectoryandcopiedtothespooldirectoryafterbeingclosed.Noneofthelog4jappendersshippinghavethiscapability.

Thatsaid,ifyouareusingtheLinuxlogrotateprograminyourenvironment,thismightbeofinterest.Youcanmovecompletedfilestoaseparatedirectoryusingapostrotatescript.Thefinalflowmightlooksomethinglikethis:

TocreateaSpoolingDirectorySource,setthetypepropertytospooldir.YoumustspecifythedirectorytowatchbysettingthespoolDirproperty:

agent.sources=s1

agent.sources.channels=c1

agent.sources.s1.type=spooldir

agent.sources.s1.spoolDir=/path/to/files

Here is a summary of the properties for the Spooling Directory Source:

Key Required Type Default
type Yes String spooldir
channels Yes String space-separated list of channels
spoolDir Yes String path to directory to spool
fileSuffix No String .COMPLETED
deletePolicy No String (never or immediate) never
fileHeader No boolean false
fileHeaderKey No String file
basenameHeader No boolean false
basenameHeaderKey No String basename
ignorePattern No String ^$
trackerDir No String ${spoolDir}/.flumespool
consumeOrder No String (oldest, youngest, or random) oldest
batchSize No int 10
bufferMaxLines No int 100
maxBufferLineLength No int 5000
maxBackoff No int 4000 (milliseconds)

When a file has been completely transmitted, it will be renamed with a .COMPLETED extension, unless overridden by setting the fileSuffix property, like this:

agent.sources.s1.fileSuffix=.DONE

Starting with Flume 1.4, a new property, deletePolicy, was created to remove completed files from the filesystem rather than just marking them as done. In a production environment, this is critical because your spool disk will fill up over time. Currently, you can only set this for immediate deletion or to leave the files forever. If you want delayed deletion, you'll need to implement your own periodic (cron) job, perhaps using the find command to locate files in the spool directory with the COMPLETED file suffix whose modification time is older than some threshold, for example:

find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;

This will find all files completed more than seven days ago and delete them.
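If you schedule this with cron, the crontab entry might look something like this hypothetical nightly job running at 2 a.m.:

0 2 * * * find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;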

If you want the absolute file path attached to each event, set the fileHeader property to true. This will create a header with the file key, unless you choose something else using the fileHeaderKey property. For example, the following would add the {sourceFile=/path/to/files/foo.1234.log} header if the event was read from the /path/to/files/foo.1234.log file:

agent.sources.s1.fileHeader=true

agent.sources.s1.fileHeaderKey=sourceFile

The related property, basenameHeader, if set to true, will add a header with the basename key, which contains just the filename. The basenameHeaderKey property allows you to change the header key, as shown here:

agent.sources.s1.basenameHeader=true

agent.sources.s1.basenameHeaderKey=justTheName

This configuration would add the {justTheName=foo.1234.log} header if the event was read from the same file located at /path/to/files/foo.1234.log.

If there are certain file patterns that you do not want this source to read as input, you can pass a regular expression using the ignorePattern property. Personally, I wouldn't copy files I don't want transferred to Flume into the spoolDir in the first place, but if that situation cannot be avoided, use ignorePattern to match the filenames that should not be transferred as data. Furthermore, subdirectories and files that start with a "." (period) character are ignored, so you can avoid costly regular expression processing by using this convention instead.

While Flume is sending data, it keeps track of how far it has gotten in each file by keeping a metadata file in the directory specified by the trackerDir property. By default, this file will be .flumespool under spoolDir. Should you want a location other than inside spoolDir, you can specify an absolute file path.

Files in the directory are processed in an "oldest first" manner, as calculated by looking at the modification times of the files. You can change this behavior by setting the consumeOrder property. If you set this property to youngest, the newest files will be processed first. This may be desired if the data is time sensitive. If you'd rather give equal precedence to all files, you can set this property to a value of random.

The batchSize property allows you to tune the number of events per transaction for writes to the channel. Increasing this may provide better throughput at the cost of larger transactions (and possibly larger rollbacks). The bufferMaxLines property is used to set the size of the memory buffer used in reading files, by multiplying it with maxBufferLineLength. If your data is very short, you might consider increasing bufferMaxLines while reducing maxBufferLineLength. In this case, it will result in better throughput without increasing your memory overhead. That said, if you have events longer than 5,000 characters, you'll want to set maxBufferLineLength higher.

If there is a problem writing data to the channel, a ChannelException is thrown back to the source, where it'll retry after an initial wait time of 250 ms. Each failed attempt will double this time up to a maximum of 4 seconds. To set this maximum time higher, set the maxBackoff property. For instance, if I wanted a maximum of 5 minutes, I can set it in milliseconds like this:

agent.sources.s1.maxBackoff=300000

Finally, you'll want to ensure that whatever mechanism is writing new files to your spooling directory creates unique filenames, such as adding a timestamp (and possibly more). Reusing a filename will confuse the source, and your data may not be processed.

As always, remember that restarts and errors will create duplicates due to retransmission of files partially sent but not marked as complete, or because the metadata is incomplete.

SyslogsourcesSysloghasbeenaroundfordecadesandisoftenusedasanoperating-system-levelmechanismtocaptureandmovelogsaroundsystems.Inmanyways,thereareoverlapswithsomeofthefunctionalityFlumeprovides.ThereisevenaHadoopmoduleforrsyslog,oneofthemoremodernvariantsofsyslog(http://www.rsyslog.com/doc/rsyslog_conf_modules.html/omhdfs.html).Generally,Idon’tlikesolutionsthatcoupletechnologiesthatmayversionindependently.Ifyouusethisrsyslog/Hadoopintegration,youwouldberequiredtoupdatetheversionofHadoopyoucompiledintorsyslogatthesametimeyouupgradedyourHadoopclustertoanewmajorversion.Thismaybelogisticallydifficultifyouhavealargenumberofserversand/orenvironments.BackwardcompatibilityinHadoopwireprotocolsissomethingthatisbeingactivelyworkedonintheHadoopcommunity,butcurrently,itisn’tthenorm.We’lltalkmoreaboutthisinChapter8,MonitoringFlume,whenwediscusstieringdataflows.

SysloghasanolderUDPtransportaswellasanewerTCPprotocolthatcanhandledatalargerthanasingleUDPpacketcantransmit(about64KB)anddealwithnetwork-relatedcongestioneventsthatmightrequirethedatatoberetransmitted.

Finally,therearesomeundocumentedpropertiesofsyslogsourcesthatallowustoaddmoreregular-expressionpattern-matchingformessagesthatdonotconformtoRFCstandards.Iwon’tbediscussingtheseadditionalsettings,butyoushouldbeawareofthemifyourunintofrequentparsingerrors.Inthiscase,takealookatthesourcefororg.apache.flume.source.SyslogUtilsforimplementationdetailstofindthecause.

Moredetailsonsyslogterms(suchasafacility)andstandardformatscanbefoundinRFC3164athttp://tools.ietf.org/html/rfc3164.

ThesyslogUDPsourceTheUDPversionofsyslogisusuallysafetousewhenyouarereceivingdatafromtheserver’slocalsyslogprocess,providedthedataissmallenough(lessthanabout64KB).

NoteTheimplementationforthissourcehaschosen2,500bytesasthemaximumpayloadsizeregardlessofwhatyournetworkcanactuallyhandle.Soifyourpayloadwillbelargerthanthis,useoneoftheTCPsourcesinstead.

TocreateaSyslogUDPsource,setthetypepropertytosyslogudp.Youmustsettheporttolistenonusingtheportproperty.Theoptionalhostpropertyspecifiesthebindaddress.Ifnohostisspecified,allIPsfortheserverwillbeused,whichisthesameasspecifying0.0.0.0.Inthisexample,wewillonlylistenforlocalUDPconnectionsonport5140:

agent.sources=s1

agent.sources.s1.channels=c1

agent.sources.s1.type=syslogudp

agent.sources.s1.host=localhost

agent.sources.s1.port=5140

Ifyouwantsyslogtoforwardatailedfile,youcanaddalinelikethistoyoursyslogconfigurationfile:

*.err;*.alert;*.crit;*.emerg;kern.* @localhost:5140

Thiswillsendallerrorpriority,criticalpriority,emergencypriority,andkernelmessagesofanypriorityintoyourFlumesource.Thesingle@symboldesignatesthatUDPprotocolshouldbeused.

HereisasummaryofthepropertiesoftheSyslogUDPsource:

Key Required Type Default

type Yes String syslogudp

channels Yes String space-separated list of channels

port Yes int

host No String 0.0.0.0

keepFields No boolean false

The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.

The Flume headers created by the Syslog UDP source are summarized here:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event, translated into an epoch timestamp. It's omitted if it's not parsed from one of the standard RFC formats.
hostname This is the parsed hostname in the syslog message. It is omitted if it is not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. This value is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the eventSize value (for UDP, this is set internally to 2,500 bytes). It's omitted if everything is fine.

ThesyslogTCPsourceAspreviouslymentioned,theSyslogTCPsourceprovidesanendpointformessagesoverTCP,allowingforalargerpayloadsizeandTCPretrysemanticsthatshouldbeusedforanyreliableinter-servercommunications.

TocreateaSyslogTCPsource,setthetypepropertytosyslogtcp.Youmuststillsetthebindaddressandporttolistenon:

agent.sources=s1

agent.sources.s1.type=syslogtcp

agent.sources.s1.host=0.0.0.0

agent.sources.s1.port=12345

IfyoursyslogimplementationsupportssyslogoverTCP,theconfigurationisusuallythesame,exceptthatadouble@symbolisusedtoindicateTCPtransport.HereisthesameexampleusingTCP,whereIamforwardingthevaluestoaFlumeagentthatisrunningonadifferentservernamedflume-1.

*.err;*.alert;*.crit;*.emerg;kern.* @@flume-1:12345

TherearesomeoptionalpropertiesfortheSyslogTCPsource,aslistedhere:

Key Required Type Default

type Yes String syslogtcp

channels Yes String space-separated list of channels

port Yes int

host No String 0.0.0.0

keepFields No boolean false

eventSize No int(bytes) 2500bytes

ThekeepFieldspropertytellsthesourcetoincludethesyslogfieldsaspartofthebody.Bydefault,thesearesimplyremoved,astheybecomeFlumeheadervalues.

TheFlumeheaderscreatedbytheSyslogTCPsourcearesummarizedhere:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event translated into an epoch timestamp. It's omitted if not parsed from one of the standard RFC formats.
hostname The parsed hostname in the syslog message. It's omitted if not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. It's set to Invalid if the payload didn't conform to the RFCs and set to Incomplete if the message was longer than the configured eventSize. It's omitted if everything is fine.

ThemultiportsyslogTCPsourceTheMultiportSyslogTCPsourceisnearlyidenticalinfunctionalitytotheSyslogTCPsource,exceptthatitcanlistentomultipleportsforinput.Youmayneedtousethiscapabilityifyouareunabletochangewhichportsyslogwilluseinitsforwardingrules(itmaynotbeyourserveratall).Itismorelikelythatyouwillusethistoreadmultipleformatsusingonesourcetowritetodifferentchannels.We’llcoverthatinamomentintheChannelSelectorssection.Underthehood,ahigh-performanceasynchronousTCPlibrarycalledMina(https://mina.apache.org/)isused,whichoftenprovidesbetterthroughputonmulticoreserversevenwhenconsumingonlyasingleTCPport.

Toconfigurethissource,setthetypepropertytomultiport_syslogtcp:

agent.sources.s1.type=multiport_syslogtcp

Like the other syslog sources, you need to specify the port, but in this case it is a space-separated list of ports. You can use this source even if you only have one port to listen on. The property for this is ports (plural):

agent.sources.s1.type=multiport_syslogtcp

agent.sources.s1.channels=c1

agent.sources.s1.ports=33333 44444

agent.sources.s1.host=0.0.0.0

ThiscodeconfigurestheMultiportSyslogTCPsourcenameds1tolistentoanyincomingconnectionsonports33333and44444andsendthemtochannelc1.

Inordertotellwhicheventcamefromwhichport,youcansettheoptionalportHeaderpropertytothenameofthekeywhosevaluewillbetheportnumber.Let’saddthispropertytotheconfiguration:

agent.sources.s1.portHeader=port

Then,anyeventsreceivedfromport33333wouldhaveaheaderkey/valueof{"port"="33333"}.AsyousawinChapter4,SinksandSinkProcessors,youcannowusethisvalue(oranyheader)asapartofyourHDFSSinkfilepathconvention,likethis:

agent.sinks.k1.hdfs.path=/logs/%{hostname}/%{port}/%Y/%m/%d/%H

Hereisacompletetableoftheproperties:

Key Required Type Default

type Yes String multiport_syslogtcp
channels Yes String Space-separated list of channels
ports Yes int Space-separated list of port numbers
host No String 0.0.0.0
keepFields No boolean false
eventSize No int 2500 (bytes)
portHeader No String
batchSize No int 100
readBufferSize No int (bytes) 1024
numProcessors No int automatically detected
charset.default No String UTF-8
charset.port.PORT# No String

ThisTCPsourcehassomeadditionaltunableoptionsoverthestandardTCPsyslogsource.ThefirstisthebatchSizeproperty.Thisisthenumberofeventsprocessedpertransactionwiththechannel.ThereisalsothereadBufferSizeproperty.ItspecifiestheinternalbuffersizeusedbyaninternalMinalibrary.Finally,thenumProcessorspropertyisusedtosizetheworkerthreadpoolinMina.Beforeyoutunetheseparameters,youmaywanttofamiliarizeyourselfwithMina(http://mina.apache.org/),andlookatthesourcecodebeforedeviatingfromthedefaults.

Finally,youcanspecifythedefaultandper-portcharacterencodingtousewhenconvertingbetweenStringsandbytes:

agent.sources.s1.charset.default=UTF-16

agent.sources.s1.charset.port.33333=UTF-8

ThissampleconfigurationshowsthatallportswillbeinterpretedusingUTF-16encoding,exceptforport33333traffic,whichwilluseUTF-8.

Asyou’vealreadyseenintheothersyslogsources,thekeepFieldspropertytellsthesourcetoincludethesyslogfieldsaspartofthebody.Bydefault,thesearesimplyremoved,astheybecomeFlumeheadervalues.

TheFlumeheaderscreatedbythissourcearesummarizedhere:

Header Key Description
Facility This is the syslog facility. See the syslog documentation.
Priority This is the syslog priority. See the syslog documentation.
timestamp This is the time of the syslog event translated into an epoch timestamp. Omitted if not parsed from one of the standard RFC formats.
hostname This is the parsed hostname in the syslog message. Omitted if not parsed.
flume.syslog.status There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, set to Incomplete if the message was longer than the configured eventSize, and omitted if everything is fine.

JMSsourceSometimes,datacanoriginatefromasynchronousmessagequeues.Forthesecases,youcanuseFlume’sJMSsourcetocreateeventsreadfromaJMSQueueorTopic.WhileitistheoreticallypossibletouseanyJavaMessageService(JMS)implementation,FlumehasonlybeentestedwithActiveMQ,sobesuretotestthoroughlyifyouuseadifferentprovider.

LikethepreviouslycoveredElasticSearchsinkinChapter4,SinksandSinkProcessors,FlumedoesnotcomepackagedwiththeJMSimplementationyou’llbeusing,astheversionsneedtomatchup,soyou’llneedtoincludethenecessaryimplementationJARfilesontheFlumeagent’sclasspath.Thepreferredmethodistousethe--plugins-dirparametermentionedinChapter2,AQuickStartGuidetoFlume,whichwe’llcoverinmoredetailinthenextchapter.

NoteActiveMQisjustoneprovideroftheJavaMessageServiceAPI.Formoreinformation,seetheprojecthomepageathttp://activemq.apache.org.

Toconfigurethissource,setthetypepropertytojms:

agent.sources.s1.type=jms

The first three properties are used to establish a connection with the JMS server. They are initialContextFactory, connectionFactory, and providerURL. The initialContextFactory property for ActiveMQ will be org.apache.activemq.jndi.ActiveMQInitialContextFactory. The connectionFactory property will be the registered JNDI name, which defaults to ConnectionFactory if unspecified. Finally, providerURL is the connection String passed to the connection factory to actually establish a network connection. It is usually a URL-like String consisting of the server name and port information. If you aren't familiar with your JMS configuration, ask somebody who is what values to use in your environment.
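As a hedged example, the connection portion of a configuration pointing at a hypothetical ActiveMQ broker on its default port might look like this (the destination properties described next are still required):

agent.sources.s1.type=jms
agent.sources.s1.channels=c1
agent.sources.s1.initialContextFactory=org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.s1.connectionFactory=ConnectionFactory
agent.sources.s1.providerURL=tcp://activemq.example.com:61616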

Thistablesummarizesthesesettingsandotherswe’lldiscussinamoment:

Key Required Type Default
type Yes String jms
channels Yes String space-separated list of channels
initialContextFactory Yes String
connectionFactory No String ConnectionFactory
providerURL Yes String
userName No String username for authentication
passwordFile No String path to file containing password for authentication
destinationName Yes String
destinationType Yes String
messageSelector No String
errorThreshold No int 10
pollTimeout No long 1000 (milliseconds)
batchSize No int 100

IfyourJMSServerrequiresauthentication,passtheuserNameandpasswordFileproperties:

agent.sources.s1.userName=jms_bot_user

agent.sources.s1.passwordFile=/path/to/password.txt

Putting the password in a file you reference, rather than directly in the Flume configuration, allows you to keep the permissions on the configuration open for inspection, while storing the more sensitive password data in a separate file that is accessible only to the Flume agent (restrictions are commonly provided by the operating system's permissions system).
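For example, you might create and lock down the password file like this (the path and the flume user are assumptions about your environment):

$ echo 'jms_bot_password' > /path/to/password.txt
$ chown flume: /path/to/password.txt
$ chmod 600 /path/to/password.txt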

ThedestinationNamepropertydecideswhichmessagetoreadfromourconnection.SinceJMSsupportsbothqueuesandtopics,youneedtosetthedestinationTypepropertytoqueueortopic,respectively.Here’swhatthepropertiesmightlooklikeforaqueue:

agent.sources.s1.destinationName=my_cool_data

agent.sources.s1.destinationType=queue

Foratopic,itwilllookasfollows:

agent.sources.s1.destinationName=restart_events

agent.sources.s1.destinationType=topic

Thedifferencebetweenaqueueandatopicisbasicallythenumberofentitiesthatwillbereadfromthenameddestination.Ifyouarereadingfromaqueue,themessagewillberemovedfromthequeueonceitiswrittentoFlumeasanevent.Atopic,onceread,isstillavailableforotherentitiestoread.

Shouldyouneedonlyasubsetofmessagespublishedtoatopicorqueue,theoptionalmessageSelectorStringpropertyprovidesthiscapability.Forexample,tocreateFlumeeventsonmessageswithafieldcalledAgethatislargerthan10,IcanspecifyamessageSelectorfilter:

agent.sources.s1.messageSelector="Age > 10"

Note: Describing the selector capabilities and syntax in detail is far beyond the scope of this book. See http://docs.oracle.com/javaee/1.4/api/javax/jms/Message.html or pick up a book on JMS to become more familiar with JMS message selectors.

ShouldtherebeaproblemincommunicatingwiththeJMSserver,theconnectionwillresetitselfafteranumberoffailuresspecifiedbytheerrorThresholdproperty.Thedefaultvalueof10isreasonableandwillmostlikelynotneedtobechanged.

Next,theJMSsourcehasapropertyusedtoadjustthevalueofbatchSize,whichdefaultsto100.Bynow,youshouldbefairlyfamiliarwithadjustingbatchsizeswithothersourcesandsinks.Forhigh-volumeflows,you’llwanttosetthishighertoconsumedatainlargerchunksfromyourJMSservertogethigherthroughput.Settingthistoohighforlowvolumeflowscoulddelayprocessing.Asalways,testingistheonlysurewaytoadjustthisproperly.TherelatedpollTimeoutpropertyspecifieshowlongtowaitfornewmessagestoappearbeforeattemptingtoreadabatch.Thedefault,specifiedinmilliseconds,is1second.Ifnomessagesarereadbeforethepolltimeout,theSourcewillgointoanexponentialbackoffmodebeforeattemptinganotherread.Thiscoulddelaytheprocessingofmessagesuntilitwakesupagain.Chancesarethatyouwon’tneedtochangethisvaluefromthedefault.

WhenthemessagegetsconvertedintoaFlumeevent,themessagepropertiesbecomeFlumeheaders.ThemessagepayloadbecomestheFlumebody,butsinceJMSusesserializedJavaobjects,weneedtotelltheJMSsourcehowtointerpretthepayload.Wedothisbysettingtheconverter.typeproperty,whichdefaultstotheonlyimplementationthatispackagedwithFlumeusingtheDEFAULTString(implementedbytheDefaultJMSMessageConvererclass).ItcandeserializeJMSBytesMessages,TextMessages,andObjectMessages(Javaobjectsthatimplementthejava.io.DataOutputinterface).ItdoesnothandleStreamMessagesorMapMessages,soifyouneedtoprocessthem,you’llneedtoimplementyourownconvertertype.Todothis,you’llimplementtheorg.apache.flume.source.jms.JMSMessageConverterinterface,anduseitsfullyqualifiedclassnameastheconverter.typepropertyvalue.

Forcompleteness,herearethesepropertiesintabularform:

Key Required Type Default

converter.type No String DEFAULT

converter.charset No String UTF-8

ChannelselectorsAswediscussedinChapter1,OverviewandArchitecture,asourcecanwritetooneormorechannels.Thisiswhythepropertyisplural(channelsinsteadofchannel).Therearetwowaysmultiplechannelscanbehandled.Theeventcanbewrittentoallthechannelsortojustonechannel,basedonsomeFlumeheadervalue.TheinternalmechanismforthisinFlumeiscalledachannelselector.

Theselectorforanychannelcanbespecifiedusingtheselector.typeproperty.Allselector-specificpropertiesbeginwiththeusualSourceprefix:theagentname,keywordsources,andsourcename:

agent.sources.s1.selector.type=replicating

ReplicatingIfyoudonotspecifyaselectorforasource,replicatingisthedefault.Thereplicatingselectorwritesthesameeventtoallchannelsinthesource’schannelslist:

agent.sources.s1.channels=c1 c2 c3

agent.sources.s1.selector.type=replicating

Inthisexample,everyeventwillbewrittentoallthreechannels:c1,c2,andc3.

Thereisanoptionalpropertyonthisselector,calledoptional.Itisaspace-separatedlistofchannelsthatareoptional.Considerthismodifiedexample:

agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3

Now,anyfailuretowritetochannelsc2orc3willnotcausethetransactiontofail,andanydatawrittentoc1willbecommitted.Intheearlierexamplewithnooptionalchannels,anysinglechannelfailurewouldrollbackthetransactionforallchannels.

MultiplexingIfyouwanttosenddifferenteventstodifferentchannels,youshoulduseamultiplexingchannelselectorbysettingthevalueofselector.typetomultiplexing.Youalsoneedtotellthechannelselectorwhichheadertousebysettingtheselector.headerproperty:

agent.sources.s1.selector.type=multiplexing

agent.sources.s1.selector.header=port

Let’sassumeweusedtheMultiportSyslogTCPsourcetolistenonfourports—11111,22222,33333,and44444—withaportHeadersettingofport:

agent.sources.s1.selector.default=c2

agent.sources.s1.selector.mapping.11111=c1 c2

agent.sources.s1.selector.mapping.44444=c2

agent.sources.s1.selector.optional.44444=c3

This configuration will result in the traffic from ports 22222 and 33333 going to the c2 channel only (the default). The traffic from port 11111 will go to the c1 and c2 channels; a failure on either channel would result in nothing being added to either. The traffic from port 44444 will go to channels c2 and c3. However, a failure to write to c3 will still commit the transaction to c2, and c3 will not be attempted again with that event.
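Pulling those pieces together, the complete selector configuration for this example would look like this:

agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3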

SummaryInthischapter,wecoveredindepththevarioussourcesthatwecanusetoinsertlogdataintoFlume,includingtheExecsource,theSpoolingDirectorySource,Syslogsources(UDP,TCP,andmultiportTCP),andtheJMSsource.

WediscussedreplicatingtheoldTailSourcefunctionalityinFlume0.9andproblemswithusingtailsemanticsingeneral.

Wealsocoveredchannelselectorsandsendingeventstooneormorechannels,specificallythereplicatingandmultiplexingchannelselectors.

Optionalchannelswerealsodiscussedasawaytoonlyfailaputtransactionforonlysomeofthechannelswhenmorethanonechannelisused.

Inthenextchapter,we’llintroduceinterceptorsthatwillallowin-flightinspectionandtransformationofevents.Usedinconjunctionwithchannelselectors,interceptorsprovidethefinalpiecetocreatecomplexdataflowswithFlume.Additionally,wewillcoverRPCmechanisms(source/sinkpairs)betweenFlumeagentsusingbothAvroandThrift,whichcanbeusedtocreatecomplexdataflows.

Chapter6.Interceptors,ETL,andRoutingThefinalpieceoffunctionalityrequiredinyourdataprocessingpipelineistheabilitytoinspectandtransformeventsinflight.Thiscanbeaccomplishedusinginterceptors.Interceptors,aswediscussedinChapter1,OverviewandArchitecture,canbeinsertedafterasourcecreatesanevent,butbeforewritingtothechanneloccurs.

InterceptorsAninterceptor’sfunctionalitycanbesummedupwiththismethod:

publicEventintercept(Eventevent);

AFlumeeventispassedtoit,anditreturnsaFlumeevent.Itmaydonothing,inwhichcase,thesameunalteredeventisreturned.Often,italterstheeventinsomeusefulway.Ifnullisreturned,theeventisdropped.

Toaddinterceptorstoasource,simplyaddtheinterceptorspropertytothenamedsource,forexample:

agent.sources.s1.interceptors=i1 i2 i3

Thisdefinesthreeinterceptors:i1,i2,andi3onthes1sourcefortheagentnamedagent.

NoteInterceptorsarerunintheorderinwhichtheyarelisted.Intheprecedingexample,i2willreceivetheoutputfromi1.Then,i3willreceivetheoutputfromi2.Finally,thechannelselectorreceivestheoutputfromi3.

Nowthatwehavedefinedtheinterceptorbyname,weneedtospecifyitstypeasfollows:

agent.sources.s1.interceptors.i1.type=TYPE1

agent.sources.s1.interceptors.i1.additionalProperty1=VALUE

agent.sources.s1.interceptors.i2.type=TYPE2

agent.sources.s1.interceptors.i3.type=TYPE3

Let’slookatsomeoftheinterceptorsthatcomebundledwithFlumetogetabetterideaofhowtoconfigurethem.

TimestampTheTimestampinterceptor,asitsnamesuggests,addsaheaderwiththetimestampkeytotheFlumeeventifonedoesn’talreadyexist.Touseit,setthetypepropertytotimestamp.

Iftheeventalreadycontainsatimestampheader,itwillbeoverwrittenwiththecurrenttimeunlessconfiguredtopreservetheoriginalvaluebysettingthepreserveExistingpropertytotrue.

HereisatablesummarizingthepropertiesoftheTimestampinterceptor:

Key Required Type Default

type Yes String timestamp

preserveExisting No boolean false

Hereiswhatatotalconfigurationforasourcemightlooklikeifweonlywantittoaddatimestampheaderifnoneexists:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.i1.preserveExisting=true

RecallthisHDFSSinkpathfromChapter4,SinksandSinkProcessors,utilizingtheeventdate:

agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H

Thetimestampheaderiswhatdeterminesthispath.Ifitismissing,youcanbesureFlumewillnotknowwheretocreatethefiles,andyouwillnotgettheresultyouarelookingfor.

HostSimilarinsimplicitytotheTimestampinterceptor,theHostinterceptorwilladdaheadertotheeventcontainingtheIPaddressofthecurrentFlumeagent.Touseit,setthetypepropertytohost:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

ThekeyforthisheaderwillbehostunlessyouspecifysomethingelseusingthehostHeaderproperty.Likebefore,anexistingheaderwillbeoverwritten,unlessyousetthepreserveExistingpropertytotrue.Finally,ifyouwantareverseDNSlookupofthehostnametobeusedinsteadoftheIPasavalue,settheuseIPpropertytofalse.Rememberthatreverselookupswilladdprocessingtimetoyourdataflow.

HereisatablesummarizingthepropertiesoftheHostinterceptor:

Key Required Type Default

type Yes String host

hostHeader No String host

preserveExisting No boolean false

useIP No boolean true

HereiswhatatotalconfigurationforasourcemightlooklikeifweonlywantittoaddarelayHostheadercontainingtheDNShostnameofthisagenttoeveryevent:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=host

agent.sources.s1.interceptors.i1.hostHeader=relayHost

agent.sources.s1.interceptors.i1.useIP=false

Thisinterceptormightbeusefulifyouwantedtorecordthepathyoureventstookthoughyourdataflow,forinstance.Chancesareyouaremoreinterestedintheoriginoftheeventratherthanthepathittook,whichiswhyIhaveyettousethis.

StaticTheStaticinterceptorisusedtoinsertasinglekey/valueheaderintoeachFlumeeventprocessed.Ifmorethanonekey/valueisdesired,yousimplyaddadditionalStaticinterceptors.Unliketheinterceptorswe’velookedatsofar,thedefaultbehavioristopreserveexistingheaderswiththesamekey.Asalways,myrecommendationistoalwaysspecifywhatyouwantandnotrelyonthedefaults.

Idonotknowwhythekeyandvaluepropertiesarenotrequired,asthedefaultsarenotterriblyuseful.

HereisatablesummarizingthepropertiesoftheStaticinterceptor:

Key Required Type Default

type Yes String static

key No String key

value No String value

preserveExisting No boolean true

Finally,let’slookatanexampleconfigurationthatinsertstwonewheaders,providedtheydon’talreadyexistintheevent:

agent.sources.s1.interceptors=pos env

agent.sources.s1.interceptors.pos.type=static

agent.sources.s1.interceptors.pos.key=pointOfSale

agent.sources.s1.interceptors.pos.value=US

agent.sources.s1.interceptors.env.type=static

agent.sources.s1.interceptors.env.key=environment

agent.sources.s1.interceptors.env.value=staging

Regular expression filtering

If you want to filter events based on the content of the body, the regular expression filtering interceptor is your friend. Based on a regular expression you provide, it will either filter out the matching events or keep only the matching events. Start by setting the interceptor type to regex_filter. The pattern you want to match is specified using Java-style regular expression syntax. See the javadocs for usage details at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. The pattern string is set in the regex property. Be sure to escape backslashes in Java Strings. For instance, the \d+ pattern would need to be written with two backslashes: \\d+. If you wanted to match a backslash, the documentation says to type two backslashes, but each needs to be escaped, resulting in four, that is, \\\\. You will see the use of escaped backslashes throughout this chapter. Finally, you need to tell the interceptor if you want to exclude matching records by setting the excludeEvents property to true. The default (false) indicates that you want to only keep events that match the pattern.
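For instance, this hedged example keeps only events whose bodies contain a numeric error code of the Error: N form, using the doubled-backslash escaping described above:

agent.sources.s1.interceptors=errOnly
agent.sources.s1.interceptors.errOnly.type=regex_filter
agent.sources.s1.interceptors.errOnly.regex=Error:\\s\\d+
agent.sources.s1.interceptors.errOnly.excludeEvents=false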

Hereisatablesummarizingthepropertiesoftheregularexpressionfilteringinterceptor:

Key Required Type Default

type Yes String regex_filter

regex No String .*

excludeEvents No boolean false

Inthisexample,anyeventscontainingtheNullPointerExceptionstringwillbedropped:

agent.sources.s1.interceptors=npe

agent.sources.s1.interceptors.npe.type=regex_filter

agent.sources.s1.interceptors.npe.regex=NullPointerException

agent.sources.s1.interceptors.npe.excludeEvents=true

RegularexpressionextractorSometimes,you’llwanttoextractbitsofyoureventbodyintoFlumeheaderssothatyoucanperformroutingviaChannelSelectors.Youcanusetheregularexpressionextractorinterceptortoperformthisfunction.Startbysettingthetypeinterceptortoregex_extractor:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

Liketheregularexpressionfilteringinterceptor,theregularexpressionextractorinterceptortoousesaJava-styleregularexpressionsyntax.Inordertoextractoneormorefields,youstartbyspecifyingtheregexpropertywithgroupmatchingparentheses.Let’sassumewearelookingforerrornumbersinoureventsintheError:Nform,whereNisanumber:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

Asyoucansee,Iputcaptureparenthesesaroundthenumber,whichmaybeoneormoredigit.NowthatI’vematchedmydesiredpattern,IneedtotellFlumewhattodowithmymatch.Here,weneedtointroduceserializers,whichprovideapluggablemechanismforhowtointerpreteachmatch.Inthisexample,I’veonlygotonematch,somyspace-separatedlistofserializernameshasonlyoneentry:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

Thenamepropertyspecifiestheeventkeytouse,wherethevalueisthematchingtextfromtheregularexpression.Thetypeofdefaultvalue(alsothedefaultifnotspecified)isasimplepass-throughserializer.Forthiseventbody,lookatthefollowing:

NullPointerException:Aproblemoccurred.Error:123.TxnID:5X2T9E.

Thefollowingheaderwouldbeaddedtotheevent:

{"error_no":"123"}

IfIwantedtoaddtheTxnIDvalueasaheader,I’dsimplyaddanothermatchingpatterngroupandserializer:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+).*TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e1.serializers=ser1 ser2

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e1.serializers.ser2.type=default

agent.sources.s1.interceptors.e1.serializers.ser2.name=txnid

Then,Iwouldcreatetheseheadersfortheprecedinginput:

{"error_no":"123", "txnid":"5X2T9E"}

However,takealookatwhatwouldhappenifthefieldswerereversedasfollows:

NullPointerException:Aproblemoccurred.TxnID:5X2T9E.Error:123.

Iwouldwindupwithonlyaheaderfortxnid.Abetterwaytohandlethiskindoforderingwouldbetousemultipleinterceptorssothattheorderdoesn’tmatter:

agent.sources.s1.interceptors=e1 e2

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)

agent.sources.s1.interceptors.e1.serializers=ser1

agent.sources.s1.interceptors.e1.serializers.ser1.type=default

agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no

agent.sources.s1.interceptors.e2.type=regex_extractor

agent.sources.s1.interceptors.e2.regex=TxnID:\\s(\\w+)

agent.sources.s1.interceptors.e2.serializers=ser1

agent.sources.s1.interceptors.e2.serializers.ser1.type=default

agent.sources.s1.interceptors.e2.serializers.ser1.name=txnid

TheonlyothertypeofserializerimplementationthatshipswithFlume,otherthanthepass-through,istospecifythefullyqualifiedclassnameoforg.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer.Thisserializerisusedtoconverttimesintomilliseconds.Youneedtospecifyapatternpropertybasedonorg.joda.time.format.DateTimeFormatpatterns.

Forinstance,let’ssayyouwereingestingApacheWebServeraccesslogs,forexample:

192.168.1.42 - - [29/Mar/2013:15:27:09 -0600] "GET /index.html HTTP/1.1" 200 1037

The complete regular expression for this might look like this (in the form of a Java String, with backslashes and quotes escaped with an extra backslash):

^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)

The time pattern matched corresponds to this org.joda.time.format.DateTimeFormat pattern:

dd/MMM/yyyy:HH:mm:ss Z

Take a look at what would happen if we make our configuration something like this:

agent.sources.s1.interceptors=e1

agent.sources.s1.interceptors.e1.type=regex_extractor

agent.sources.s1.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
agent.sources.s1.interceptors.e1.serializers=ip dt url sc bc
agent.sources.s1.interceptors.e1.serializers.ip.name=ip_address
agent.sources.s1.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.s1.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
agent.sources.s1.interceptors.e1.serializers.dt.name=timestamp
agent.sources.s1.interceptors.e1.serializers.url.name=http_request
agent.sources.s1.interceptors.e1.serializers.sc.name=status_code
agent.sources.s1.interceptors.e1.serializers.bc.name=bytes_xfered

Thiswouldcreatethefollowingheadersfortheprecedingsample:

{"ip_address":"192.168.1.42", "timestamp":"1364588829", "http_request":"GET /index.html HTTP/1.1", "status_code":"200", "bytes_xfered":"1037"}

Thebodycontentisunaffected.You’llalsonoticethatIdidn’tspecifydefaultfortheothertypeofserializers,asthatisthedefault.

NoteThereisnooverwritecheckinginthisinterceptortype.Forinstance,usingthetimestampkeywilloverwritetheevent’sprevioustimevalueiftherewasone.

Youcanimplementyourownserializersforthisinterceptorbyimplementingtheorg.apache.flume.interceptor.RegexExtractorInterceptorSerializerinterface.However,ifyourgoalistomovedatafromthebodyofaneventtotheheader,you’llprobablywanttoimplementacustominterceptorsothatyoucanalterthebodycontentsinadditiontosettingtheheadervalue,otherwisethedatawillbeeffectivelyduplicated.

Tosummarize,let’sreviewthepropertiesforthisinterceptor:

Key Required Type Default

type Yes String regex_extractor

regex Yes String

serializers Yes Space-separated list of serializer names

serializers.NAME.name Yes String

serializers.NAME.type No Default or FQCN of implementation default

serializers.NAME.PROP No Serializer-specific properties

MorphlineinterceptorAswesawinChapter4,SinksandSinkProcessors,apowerfullibraryoftransformationsbackstheMorphlineSolrSinkfromtheKiteSDKproject.Itshouldcomeasnosurprisethatyoucanalsousetheselibrariesinmanyplaceswhereyou’dbeforcedtowriteacustominterceptor.Similarinconfigurationtoitssinkcounterpart,youonlyneedtospecifytheMorphlineconfigurationfile,andoptionally,theMorphlineuniqueidentifier(iftheconfigurationspecifiesmorethanoneMorphline).HereisasummarytableoftheFlumeinterceptorconfiguration:

Key Required Type Default

type Yes String org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder

morphlineFile Yes String

morphlineId No String Picks the first morphline if not specified; set it when more than one morphline exists in the file.

YourFlumeconfigurationmightlooksomethinglikethis:

agent.sources.s1.interceptors=i1m1

agent.sources.s1.interceptors.i1.type=timestamp

agent.sources.s1.interceptors.m1.type=org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder

agent.sources.s1.interceptors.m1.morphlineFile=/path/to/morph.conf

agent.sources.s1.interceptors.m1.morphlineId=goMorphy

Inthisexample,wehavespecifiedthes1sourceontheagentnamedagent,whichcontainstwointerceptors,i1andm1,processedinthatorder.Thefirstinterceptorisastandardinterceptorthatinsertsatimestampheaderifnoneexists.ThesecondwillsendtheeventthroughtheMorphlineprocessorspecifiedbytheMorphlineconfigurationfilecorrespondingtothegoMorphyID.

Eventsprocessedasaninterceptormustonlyoutputoneeventforeveryinputevent.IfyouneedtooutputmultiplerecordsfromasingleFlumeevent,youmustdothisintheMorphlineSolrSink.Withinterceptors,itisstrictlyoneeventinandoneeventout.

The first command will most likely be the readLine command (or readBlob, readCSV, and so on) to convert the event into a Morphline Record. From there, you run any other Morphline commands you like, with the final Record at the end of the Morphline chain being converted back into a Flume event. We know from Chapter 4, Sinks and Sink Processors, that the _attachment_body special Record key should be byte[] (the same as the event's body). This means that the last command needs to do the conversion (such as toByteArray or writeAvroToByteArray). All other Record keys are converted to String values and set as headers with the same keys.
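To make this concrete, here is a rough sketch of a morphline configuration file (the id matches the goMorphy example above; command names are from the Kite reference guide linked below) that parses each event body as a line and converts the final Record back into an event body:

morphlines : [
  {
    id : goMorphy
    importCommands : ["org.kitesdk.**"]
    commands : [
      { readLine { charset : UTF-8 } }
      # other morphline commands operating on the Record would go here
      { toByteArray { field : _attachment_body } }
    ]
  }
]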

NoteTakealookatthereferenceguideforacompletelistofMorphlinecommands,theirpropertiesandusageinformationathttp://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html

TheMorphlineconfigurationfilesyntaxhasalreadybeendiscussedindetailintheMorphlineconfigurationfilesectioninChapter4,SinksandSinkProcessors.

CustominterceptorsIfthereisonepieceofcustomcodeyouwilladdtoyourFlumeimplementation,itwillmostlikelybeacustominterceptor.Asmentionedearlier,youimplementtheorg.apache.flume.interceptor.Interceptorinterfaceandtheassociatedorg.apache.flume.interceptor.Interceptor.Builderinterface.

Let's say I needed to URL-decode my event body. The code would look something like this:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class URLDecode implements Interceptor {

  public void initialize() {}

  public Event intercept(Event event) {
    try {
      byte[] decoded = URLDecoder.decode(new String(event.getBody(), "UTF-8"), "UTF-8").getBytes("UTF-8");
      event.setBody(decoded);
    } catch (UnsupportedEncodingException e) {
      // Shouldn't happen. Fall through to the unaltered event.
    }
    return event;
  }

  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public void close() {}

  public static class Builder implements Interceptor.Builder {

    public Interceptor build() {
      return new URLDecode();
    }

    public void configure(Context context) {}
  }
}

Then,toconfiguremynewinterceptor,usethefullyqualifiedclassnamefortheBuilderclassasthetype:

agent.sources.s1.interceptors=i1

agent.sources.s1.interceptors.i1.type=com.example.URLDecode$Builder

Formoreexamplesofhowtopassandvalidateproperties,lookatanyoftheexistinginterceptorimplementationsintheFlumesourcecode.

Keepinmindthatanyheavyprocessinginyourcustominterceptorcanaffecttheoverallthroughput,sobemindfulofobjectchurnorcomputationallyintensiveprocessinginyourimplementations.

ThepluginsdirectoryCustomcode(sources,interceptors,andsoon)canalwaysbeinstalledalongsidetheFlumecoreclassesinthe$FLUME_HOME/libdirectory.YoucouldalsospecifyadditionalpathsforCLASSPATHonstartupbywayoftheflume-env.shshellscript(whichisoftensourcedatstartuptimewhenusingapackageddistributionofFlume).StartingwithFlume1.4,thereisnowacommand-lineoptiontospecifyadirectorythatcontainsthecustomcode.Bydefault,ifnotspecified,the$FLUME_HOME/plugins.ddirectoryisused,butyoucanoverridethisusingthe--plugins-pathcommand-lineparameter.

Withinthisdirectory,eachpieceofcustomcodeisseparatedintoasubdirectorywhosenameisofyourchoosing—picksomethingeasyforyoutokeeptrackofthings.Inthisdirectory,youcanincludeuptothreesubdirectories:lib,libext,andnative.

ThelibdirectoryshouldcontaintheJARfileforyourcustomcomponent.Anydependenciesitusesshouldbeaddedtothelibextsubdirectory.BothofthesepathsareaddedtotheJavaCLASSPATHvariableatstartup.

TipTheFlumedocumentationimpliesthatthisdirectoryseparationbycomponentallowsforconflictingJavalibrariestocoexist.Intruth,thisisnotpossibleunlesstheunderlyingimplementationmakesuseofdifferentclassloaderswithintheJVM.Inthiscase,theFlumestartupcodesimplyappendsallofthesepathstothestartupCLASSPATHvariable,sotheorderinwhichthesubdirectoriesareprocessedwilldeterminetheprecedence.Youcannotevenbeguaranteedthatthesubdirectorieswillbeprocessedinalexicographicorder,astheunderlyingbashshellforloopcangivenosuchguarantees.Inpractice,youshouldalwaystryandavoidconflictingdependencies.Codethatdependstoomuchonorderingtendstobebuggy,especiallyifyourclasspathreordersitselffromserverinstallationtoserverinstallation.

Thethirddirectory,native,iswhereyouputanynativelibrariesassociatedwithyourcustomcomponent.Thispath,ifitexists,getsaddedtoLD_LIBRARY_PATHatstartupsothattheJVMcanlocatethesenativecomponents.

So,ifIhadthreecustomcomponents,mydirectorystructuremightlooksomethinglikethis:

$FLUME_HOME/plugins.d/base64-enc/lib/base64interceptor.jar

$FLUME_HOME/plugins.d/base64-enc/libext/base64-2.0.0.jar

$FLUME_HOME/plugins.d/base64-enc/native/libFoo.so

$FLUME_HOME/plugins.d/uuencode/lib/uuEncodingInterceptor.jar

$FLUME_HOME/plugins.d/my-avro-serializer/lib/myAvroSerializer.jar

$FLUME_HOME/plugins.d/my-avro-serializer/native/libAvro.so

Keepinmindthatallthesepathsgetmashedtogetheratstartup,soconflictingversionsoflibrariesshouldbeavoided.Thisstructureisanorganizationmechanismforeaseofdeploymentforhumans.Personally,Ihavenotusedityet,asIuseChef(orPuppet)toinstallcustomcomponentsonmyFlumeagentsdirectlyinto$FLUME_HOME/lib,butifyouprefertousethismechanism,itisavailable.

TieringflowsInChapter1,OverviewandArchitecture,wetalkedabouttieringyourdataflows.Thereareseveralreasonsforyoutowanttodothis.YoumaywanttolimitthenumberofFlumeagentsthatdirectlyconnecttoyourHadoopcluster,tolimitthenumberofparallelrequests.YoumayalsolacksufficientdiskspaceonyourapplicationserverstostoreasignificantamountofdatawhileyouareperformingmaintenanceonyourHadoopcluster.Whateveryourreasonorusecase,themostcommonmechanismtochainFlumeagentsistousetheAvrosource/sinkpair.

TheAvrosource/sinkWecoveredAvroabitinChapter4,SinksandSinkProcessors,whenwediscussedhowtouseitasanon-diskserializationformatforfilesstoredinHDFS.Here,we’llputittouseincommunicationbetweenFlumeagents.Atypicalconfigurationmightlooksomethinglikethis:

TousetheAvrosource,youspecifythetypepropertywithavalueofavro.Youneedtoprovideabindaddressandportnumbertolistenon:

collector.sources=av1

collector.sources.av1.type=avro

collector.sources.av1.bind=0.0.0.0

collector.sources.av1.port=42424

collector.sources.av1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs

Here,wehaveconfiguredtheagentinthemiddlethatlistensonport42424,usesamemorychannel,andwritestoHDFS.I’veusedthememorychannelforbrevityinthisexampleconfiguration.AlsonotethatI’vegiventhisagentadifferentname,collector,justtoavoidconfusion.

Theagentsonthetopandbottomsidesfeedingthecollectortiermighthaveaconfigurationsimilartothis.Ihaveleftthesourcesoffthisconfigurationforbrevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42424

Thehostname,collector.example.com,hasnothingtodowiththeagentnameonthismachine;itisthehostname(oryoucanuseanIP)ofthetargetmachinewiththereceivingAvrosource.Thisconfiguration,namedclient,wouldbeappliedtobothagentsonthetopandbottomsides,assumingbothhadsimilarsourceconfigurations.

AsIdon’tlikesinglepointsoffailure,Iwouldconfiguretwocollectoragentswiththeprecedingconfigurationandinstead,seteachclientagenttoroundrobinbetweenthetwo,usingasinkgroup.Again,I’veleftoffthesourcesforbrevity:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1 k2

client.sinks.k1.type=avro

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collectorA.example.com

client.sinks.k1.port=42424

client.sinks.k2.type=avro

client.sinks.k2.channel=ch1

client.sinks.k2.hostname=collectorB.example.com

client.sinks.k2.port=42424

client.sinkgroups=g1

client.sinkgroups.g1.sinks=k1 k2

client.sinkgroups.g1.processor.type=load_balance

client.sinkgroups.g1.processor.selector=round_robin

client.sinkgroups.g1.processor.backoff=true

TherearefouradditionalpropertiesassociatedwiththeAvrosinkthatyoumayneedtoadjustfromtheirsensibledefaults.

Thefirstisthebatch-sizeproperty,whichdefaultsto100.Inheavyloads,youmayseebetterthroughputbysettingthishigher,forexample:

client.sinks.k1.batch-size=1024

The next two properties control network connection timeouts. The connect-timeout property, which defaults to 20 seconds (specified in milliseconds), is the amount of time allowed to establish a connection with an Avro source (receiver). The related request-timeout property, which also defaults to 20 seconds (specified in milliseconds), is the amount of time allowed for a sent message to be acknowledged. If I wanted to increase these values to 1 minute, I could add these additional properties:

client.sinks.k1.connect-timeout=60000

client.sinks.k1.request-timeout=60000

Finally,thereset-connection-intervalpropertycanbesettoforceconnectionstoreestablishthemselvesaftersometimeperiod.ThiscanbeusefulwhenyoursinkisconnectingthroughaVIP(VirtualIPAddress)orhardwareloadbalancertokeepthingsbalanced,asservicesofferedbehindtheVIPmaychangeovertimeduetofailuresorchangesincapacity.Bydefault,theconnectionswillnotberesetexceptincasesoffailure.Ifyouwantedtochangethissothattheconnectionsresetthemselveseveryhour,forexample,youcanspecifythisbysettingthispropertywiththenumberofseconds,asfollows:

client.sinks.k1.reset-connection-interval=3600

CompressingAvroCommunicationbetweentheAvrosourceandsinkcanbecompressedbysettingthecompression-typepropertytodeflate.OntheAvrosink,youcanadditionally,setthecompression-levelpropertytoanumberbetween1and9,withthedefaultbeing6.Typically,therearediminishingreturnsathighercompressionlevels,buttestingmayproveanondefaultvaluethatworksbetterforyou.Clearly,youneedtoweightheadditionalCPUcostsagainsttheoverheadtoperformahigherlevelofcompression.Typically,thedefaultisfine.

Compressedcommunicationsareespeciallyimportantwhenyouaredealingwithhighlatencynetworks,suchassendingdatabetweentwodatacenters.Anothercommonusecaseforcompressioniswhereyouarechargedforthebandwidthconsumed,suchasmostpubliccloudservices.Inthesecases,youwillprobablychoosetospendCPUcycles,compressinganddecompressingyourflowratherthansendinghighlycompressibledatauncompressed.

TipItisveryimportantthatifyousetthecompression-typepropertyinasource/sinkpair,yousetthispropertyatbothends.Otherwise,asinkcouldbesendingdatathesourcecan’tconsume.

Continuingtheprecedingexample,toaddcompression,youwouldaddtheseadditionalpropertyfieldsonbothagents:

collector.sources.av1.compression-type=deflate

client.sinks.k1.compression-type=deflate

client.sinks.k1.compression-level=7

client.sinks.k2.compression-type=deflate

client.sinks.k2.compression-level=7

SSL Avro flows

Sometimes, communication is of a sensitive nature or may traverse untrusted network paths. You can encrypt your communications between an Avro sink and an Avro source by setting the ssl property to true. Additionally, you will need to pass SSL certificate information to the source (the receiver) and optionally, the sink (the sender).

Iwon’tclaimtobeanexpertinSSLandSSLcertificates,butthegeneralideaisthatacertificatecaneitherbeself-signedorsignedbyatrustedthirdpartysuchasVeriSign(calledacertificateauthorityorCA).Generally,becauseofthecostassociatedwithgettingaverifiedcertificate,peopleonlydothisforwebbrowsercertificatesacustomermightseeinawebbrowser.SomeorganizationswillhaveaninternalCAwhocansigncertificates.Forthepurposeofthisexample,we’llbeusingself-signedcertificates.Thisbasicallymeansthatwe’llgenerateacertificatebutnothaveitsignedbyanyauthority.Forthis,we’llusethekeytoolutilitythatcomeswithallJavainstallations(thereferencecanbefoundathttp://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html).Here,IcreateaJavaKeyStore(JKS)filethatcontainsa2048bitkey.Whenpromptedforapassword,I’llusethepasswordstring.Makesureyouusearealpasswordinyournontestenvironments:

% keytool -genkey -alias flumey -keyalg RSA -keystore keystore.jks -keysize 2048
Enter keystore password: password
Re-enter new password: password

What is your first and last name?
  [Unknown]: Steve Hoffman
What is the name of your organizational unit?
  [Unknown]: Operations
What is the name of your organization?
  [Unknown]: Me, Myself and I
What is the name of your City or Locality?
  [Unknown]: Chicago
What is the name of your State or Province?
  [Unknown]: Illinois
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Steve Hoffman, OU=Operations, O="Me, Myself and I", L=Chicago, ST=Illinois, C=US correct?
  [no]: yes
Enter key password for <flumey>
  (RETURN if same as keystore password):

I can verify the contents of the JKS by running the list subcommand:

% keytool -list -keystore keystore.jks

Enter keystore password: password
Keystore type: JKS
Keystore provider: SUN

Your keystore contains 1 entry

flumey, Nov 11, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1): 5C:BC:3C:7F:7A:E7:77:EB:B5:54:FA:E2:8B:DD:D3:66:36:86:DE:E4

NowthatIhaveakeyinmykeystorefile,Icansettheadditionalpropertiesonthereceivingsource:

collector.sources.av1.ssl=true

collector.sources.av1.keystore=/path/to/keystore.jks

collector.sources.av1.keystore-password=password

AsIamusingaself-signedcertificate,thesinkwon’tbesurethatcommunicationscanbetrusted,soIhavetwooptions.Thefirstistotellittojusttrustallcertificates.Clearly,youwouldonlydothisonaprivatenetworkthathassomereasonableassurancesaboutitssecurity.Toignorethedubiousoriginofmycertificate,Icansetthetrust-all-certspropertytotrueasfollows:

client.sinks.k1.ssl=true

client.sinks.k1.trust-all-certs=true

If you didn't set this, you'd see something like this in the logs:

org.apache.flume.EventDeliveryException: Failed to send events…
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem…
Caused by: sun.security.validator.ValidatorException: No trusted certificate found…

ForcommunicationsoverthepublicInternetorinmultitenantcloudenvironments,morecareshouldbetaken.Inthiscase,youcanprovideatruststoretothesink.Atruststoreissimilartoakeystore,exceptthatitcontainscertificateauthoritiesyouaretellingFlumecanbetrusted.Forwell-knowncertificateauthorities,suchasVeriSignandothers,youdon’tneedtospecifyatruststore,astheiridentitiesarealreadyincludedinyourJavadistribution(atleastforthemajorones).Youwillneedtohaveyourcertificatesignedbyacertificateauthorityandthenaddthesignedcertificatefiletothesource’skeystorefile.YouwillalsoneedtoincludethesigningCA’scertificateinthekeystoresothatthecertificatechaincanbefullyresolved.

Onceyouhaveaddedthekey,signedcertificate,andsigningCA’scertificatetothesource,youneedtoconfigurethesink(thesender)totrusttheseauthoritiessothatitwillpassthevalidationstepduringtheSSLhandshake.Tospecifythetruststore,setthetruststoreandtruststore-passwordpropertiesasfollows:

client.sinks.k1.ssl=true

client.sinks.k1.truststore=/path/to/truststore.jks

client.sinks.k1.truststore-password=password
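If trusting all certificates is too loose for your environment but you still want to use the self-signed certificate created earlier, one approach is to export that certificate and import it into a truststore file that you then reference with the properties above. A sketch using keytool (the file names are arbitrary) might look like this:

% keytool -export -alias flumey -keystore keystore.jks -file flumey.crt
% keytool -import -alias flumey -file flumey.crt -keystore truststore.jks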

Chancesarethereissomebodyinyourorganizationresponsibleforobtainingthethird-partycertificatesorsomeonewhocanissueanorganizationalCA-signed-certificatetoyou,soI’mnotgoingtogointodetailsabouthowtocreateyourowncertificateauthority.ThereisplentyofinformationontheInternetifyouchoosetotakethisroute.Rememberthatcertificateshaveanexpirationdate(usually,ayear),soyou’llneedtorepeatthisprocesseveryyearorcommunicationswillabruptlystopwhencertificatesorcertificateauthoritiesexpire.

TheThriftsource/sinkAnothersource/sinkpairyoucanusetotieryourdataflowsisbasedonThrift(http://thrift.apache.org/).UnlikeAvrodata,whichisself-documenting,Thriftusesanexternalschema,whichcanbecompiledintojustabouteveryprogramminglanguageontheplanet.TheconfigurationisalmostidenticaltotheAvroexamplealreadycovered.TheprecedingcollectorconfigurationthatusesThriftwouldnowlooksomethinglikethis:

collector.sources=th1

collector.sources.th1.type=thrift

collector.sources.th1.bind=0.0.0.0

collector.sources.th1.port=42324

collector.sources.th1.channels=ch1

collector.channels=ch1

collector.channels.ch1.type=memory

collector.sinks=k1

collector.sinks.k1.type=hdfs

collector.sinks.k1.channel=ch1

collector.sinks.k1.hdfs.path=/path/in/hdfs

Thereisoneadditionalpropertythatsetsthemaximumworkerthreadsintheunderlyingthreadpool.Ifunset,avalueofzeroisassumed,whichmakesthethreadpoolunbounded,soyoushouldprobablysetthisinyourproductionconfigurations.Testingshouldprovideyouwithareasonableupperlimitforyourenvironment.Tosetthethreadpoolsizeto10,forexample,youwouldaddthethreadsproperty:

collector.sources.th1.threads=10

The“client”agentswouldusethecorrespondingThriftsinkasfollows:

client.channels=ch1

client.channels.ch1.type=memory

client.sinks=k1

client.sinks.k1.type=thrift

client.sinks.k1.channel=ch1

client.sinks.k1.hostname=collector.example.com

client.sinks.k1.port=42324

LikeitsAvrocounterpart,thereareadditionalsettingsforthebatchsize,connectiontimeouts,andconnectionresetintervalsthatyoumaywanttoadjustbasedonyourtestingresults.RefertotheTheAvrosource/sinksectionearlierinthischapterfordetails.

Usingcommand-lineAvroTheAvrosourcecanalsobeusedinconjunctionwithoneofthecommand-lineoptionsyoumayhavenoticedbackinChapter2,AQuickStartGuidetoFlume.Ratherthanrunningflume-ngwiththeagentparameter,youcanpasstheavro-clientparametertosendoneormorefilestoanAvrosource.Thesearetheoptionsspecifictoavro-clientfromthehelptext:

avro-client options:
  --dirname <dir>          directory to stream to avro source
  --host,-H <host>         hostname to which events will be sent (required)
  --port,-p <port>         port of the avro source (required)
  --filename,-F <file>     text file to stream to avro source [default: std input]
  --headerFile,-R <file>   headerFile containing headers as key/value pairs on each new line
  --help,-h                display help text

Thisvariationisveryusefulfortesting,resendingdatamanuallyduetoerrors,orimportingolderdatastoredelsewhere.

JustlikeanAvrosink,youhavetospecifythehostnameandportyouwillbesendingdatato.Youcansendasinglefilewiththe--filenameoptionorallthefilesinadirectorywiththe--dirnameoption.Ifyouspecifyneitherofthese,stdinwillbeused.Hereishowyoumightsendafilenamedfoo.logtotheFlumeagentwepreviouslyconfigured:

$ ./flume-ng avro-client --filename foo.log --host collector.example.com --port 42424

EachlineoftheinputwillbeconvertedintoasingleFlumeevent.

Optionally,youcanspecifyafilecontainingkey/valuepairstosetFlumeheadervalues.ThefileusesJavapropertyfilesyntax.SupposeIhadafilenamedheaders.propertiescontaining:

pointOfSale=US

environment=staging

Then,includingthe--headerFileoptionwouldsetthesetwoheadersoneveryeventcreated:

$ ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424

TheLog4JappenderAswediscussedinChapter5,SourcesandChannelSelectors,thereareissuesthatmayarisefromusingafilesystemfileasasource.OnewaytoavoidthisproblemistousetheFlumeLog4JAppenderinyourJavaapplication(s).Underthehood,itusesthesameAvrocommunicationthattheAvrosinkuses,soyouneedonlyconfigureittosenddatatoanAvrosource.

TheAppenderhastwoproperties,whichareshownhereinXML:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.Log4jAppender">
  <param name="Hostname" value="collector.example.com"/>
  <param name="Port" value="42424"/>
</appender>

TheformatofthebodywillbedictatedbytheAppender’sconfiguredlayout(notshown).Thelog4jfieldsthatgetmappedtoFlumeheadersaresummarizedinthistable:

Flume header key Log4J logging event field
flume.client.log4j.logger.name event.getLoggerName()
flume.client.log4j.log.level event.getLevel() as a number. See org.apache.log4j.Level for mappings.
flume.client.log4j.timestamp event.getTimeStamp()
flume.client.log4j.message.encoding N/A (always UTF8)
flume.client.log4j.logger.other Will only be seen if there was a problem mapping one of the above fields, so normally, this won't be present.

Refertohttp://logging.apache.org/log4j/1.2/formoredetailsonusingLog4J.

You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender.
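If your application configures Log4J with a properties file instead of XML, the equivalent setup might look something like this sketch (the appender name and logging level are up to you):

log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=collector.example.com
log4j.appender.flume.Port=42424
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m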

KeepinmindthatifthereisaproblemsendingdatatotheAvrosource,theappenderwillthrowanexceptionandthelogmessagewillbedropped,asthereisnoplacetoputit.KeepingitinmemorycouldquicklyoverloadyourJVMheap,whichisusuallyconsideredworsethandroppingthedatarecord.

TheLog4Jload-balancingappenderI’msureyounoticedthattheprecedingLog4jAppenderonlyhasasinglehostname/portinitsconfiguration.Ifyouwantedtospreadtheloadacrossmultiplecollectoragents,eitherforadditionalcapacityorforfaulttolerance,youcanusetheLoadBalancingLog4jAppender.ThisappenderhasasinglerequiredpropertynamedHosts,whichisaspace-separatedlistofhostnamesandportnumbersseparatedbyacolon,asfollows:

<appender name="FLUME"
  class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
  <param name="Hosts" value="server1:42424 server2:42424"/>
</appender>

Thereisanoptionalproperty,Selector,whichspecifiesthemethodthatyouwanttoloadbalance.ValidvaluesareRANDOMandROUND_ROBIN.Ifnotspecified,thedefaultisROUND_ROBIN.Youcanimplementyourownselector,butthatisoutsidethescopeofthisbook.Ifyouareinterested,gohavealookatthewell-documentedsourcecodefortheLoadBalancingLog4jAppenderclass.

Finally,thereisanotheroptionalpropertytooverridethemaximumtimeforexponentialbackoffwhenaservercannotbecontacted.Initially,ifaservercannotbecontacted,1secondwillneedtopassbeforethatserveristriedagain.Eachtimetheserverisunavailable,theretrytimedoubles,uptoadefaultmaximumof30seconds.Ifwewantedtoincreasethismaximumto2minutes,wecanspecifyaMaxBackoffpropertyinmillisecondsasfollows:

<appendername="FLUME"

class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">

<paramname="Hosts"value="server1:42424server2:42424"/>

<paramname="Selector"value="RANDOM"/>

<paramname="MaxBackoff"value="120000"/>

</appender>

Inthisexample,wehavealsooverriddenthedefaultround_robinselectortousearandomselection.

The embedded agent

If you are writing a Java program that creates data, you may choose to send the data directly as structured data using a special mode of Flume called the Embedded Agent. It is basically a simple single source/single channel Flume agent that you run inside your JVM.

There are benefits and drawbacks to this approach. On the positive side, you don't need to monitor an additional process on your servers to relay data. The embedded channel also allows the data producer to continue executing its code immediately after queuing the event to the channel. The SinkRunner thread handles taking events from the channel and sending them to the configured sinks. Even if you didn't use embedded Flume to perform this handoff from the calling thread, you would most likely use some kind of synchronized queue (such as BlockingQueue) to isolate the sending of the data from the main execution thread. Using Embedded Flume provides the same functionality without having to worry whether you've written your multithreaded code correctly.
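To make that comparison concrete, here is a rough, hypothetical sketch of the hand-rolled alternative the embedded agent saves you from writing: a bounded queue decouples the producing thread from a background sender thread, which is essentially what the embedded channel and the SinkRunner thread do for you.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical do-it-yourself forwarder, shown only for comparison.
public class HandRolledForwarder implements Runnable {
    // A bounded queue plays the role of the embedded agent's channel.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>(10000);

    // Called from the application's main execution path; returns immediately,
    // much like EmbeddedAgent.put() returns once the event is queued.
    public boolean offer(String record) {
        return queue.offer(record);
    }

    // Runs on its own thread, analogous to Flume's SinkRunner thread.
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String record = queue.take();
                send(record);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void send(String record) {
        // Network I/O, batching, retries, and failover would all live here;
        // this is the code you avoid writing by using the embedded agent.
    }
}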

The major drawback to embedding Flume in your application is added memory pressure on your JVM's garbage collector. If you are using an in-memory channel, any unsent events are held in the heap and get in the way of cheap garbage collection by way of short-lived objects. However, this is why you set a channel size: to keep the maximum memory footprint to a known quantity. Furthermore, any configuration changes will require an application restart, which can be problematic if your overall system doesn't have sufficient redundancy built in to tolerate restarts.

Configuration and startup

Assuming you are not dissuaded (and you shouldn't be), you will first need to include the flume-ng-embedded-agent library (and dependencies) in your Java project. Depending on what build system you are using (Maven, Ivy, Gradle, and so on), the exact format will differ, so I'll just show you the Maven configuration here. You can look up the alternative formats at http://mvnrepository.com/:

<dependency>
  <groupId>org.apache.flume</groupId>
  <artifactId>flume-ng-embedded-agent</artifactId>
  <version>1.5.2</version>
</dependency>

Start by creating an EmbeddedAgent object in your Java code by calling the constructor (and passing a string name, used only in error messages):

EmbeddedAgent toHadoop = new EmbeddedAgent("myData");

Next, you have to set properties for the channel, sinks, and sink processor via the configure() method, as shown in this example. Most of this configuration should look very familiar to you at this point:

Map<String, String> config = new HashMap<String, String>();
config.put("channel.type", "memory");
config.put("channel.capacity", "75");
config.put("sinks", "s1 s2");
config.put("sink.s1.type", "avro");
config.put("sink.s1.hostname", "foo.example.com");
config.put("sink.s1.port", "12345");
config.put("sink.s1.compression-type", "deflate");
config.put("sink.s2.type", "avro");
config.put("sink.s2.hostname", "bar.example.com");
config.put("sink.s2.port", "12345");
config.put("sink.s2.compression-type", "deflate");
config.put("processor.type", "failover");
config.put("processor.priority.s1", "10");
config.put("processor.priority.s2", "20");
toHadoop.configure(config);

Here, we define a memory channel with a capacity of 75 events along with two Avro sinks in an active/standby configuration (first, to foo.example.com, and if that fails, to bar.example.com) using Avro serialization with compression. Refer to Chapter 3, Channels, for specific settings for memory- or file-backed channel properties.

The sink processor only comes into play if you have more than one sink defined in your sinks property. Unlike the optional sink groups options covered in Chapter 4, Sinks and Sink Processors, you need to specify a sink list, even if it is just one. Clearly, there is no behavior difference between failover and load balancing when there is only one sink. Refer to each specific sink's properties in Chapter 4, Sinks and Sink Processors, for specific configuration parameters.

Finally, before you start using this agent, you need to call the start() method to instantiate everything based on your properties and start all the background processing threads as follows:

toHadoop.start();

The class is now ready to start receiving and forwarding data.

You'll want to keep a reference to this object around for cleanup later, as well as to pass it to other objects that will be sending data, as the underlying channel provides thread safety. Personally, I use Spring Framework's dependency injection to configure and pass references at startup rather than doing it programmatically, as I've shown in this example. Refer to the Spring website for more information (http://spring.io/), as proper use of Spring is a whole book unto itself, and it is not the only dependency injection framework available to you.

Sending data

Data can be sent either as single events or in batches. Batches can sometimes be more efficient in high-volume data streams.

To send a single event, just call the put() method, as shown in this example:

Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF8"));
toHadoop.put(e);

Here, I'm using one of many methods available on the org.apache.flume.event.EventBuilder class. Here is a complete list of methods you can use to construct events with this helper class:

EventBuilder.withBody(byte[] body, Map<String, String> headers);
EventBuilder.withBody(byte[] body);
EventBuilder.withBody(String body, Charset charset, Map<String, String> headers);
EventBuilder.withBody(String body, Charset charset);

In order to send several events in one batch, you can call the putAll() method with a list of events, as shown in this example, which also includes a time header:

Map<String, String> headers = new HashMap<String, String>();
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("First".getBytes("UTF-8"), headers));
events.add(EventBuilder.withBody("Second".getBytes("UTF-8"), headers));
toHadoop.putAll(events);

As interceptors are not supported, you will need to add any headers you want to add to the events programmatically in Java before you call put() or putAll().
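For example, if you wanted the equivalent of the Timestamp and Host interceptors, a small sketch like this would do the work before the event is queued. The header names mirror those interceptors' defaults, but the lookup code itself is your own and purely illustrative:

Map<String, String> headers = new HashMap<String, String>();
// Equivalent of the Timestamp interceptor: epoch milliseconds as a string.
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
try {
    // Equivalent of the Host interceptor: this machine's hostname.
    headers.put("host", java.net.InetAddress.getLocalHost().getHostName());
} catch (java.net.UnknownHostException e) {
    headers.put("host", "unknown");
}

Event tagged = EventBuilder.withBody("some record", Charset.forName("UTF-8"), headers);
toHadoop.put(tagged);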

Shutdown

Finally, as there are background threads waiting for data to arrive on your embedded channel, which are most likely holding persistent connections to configured destinations, you'll want to have Flume do its cleanup when it is time to shut down your application. You simply call the stop() method on the configured EmbeddedAgent as follows:

toHadoop.stop();

Routing

The routing of data to different destinations based on content should be fairly straightforward now that you've been introduced to all the various mechanisms in Flume.

The first step is to get the data you want to switch on into a Flume header by means of a source-side interceptor if the header isn't already available. The second step is to use a Multiplexing Channel Selector on that header value to switch the data to an alternate channel.

For instance, let's say you wanted to capture all exceptions to HDFS. In this configuration, you can see events coming in on the s1 source via Avro on port 42424. The event is tested to see whether the body contains the text Exception. If it does, it creates an exception header key (with the value of Exception). This header is used to switch these events to channel c1, and ultimately, HDFS. If the event didn't match the pattern, it would not have the exception header and would get passed to the c2 channel via the default selector, where it would be forwarded via Avro serialization to port 12345 on the server foo.example.com:

agent.sources=s1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=42424
agent.sources.s1.channels=c1 c2
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(Exception)
agent.sources.s1.interceptors.i1.serializers=ex
agent.sources.s1.interceptors.i1.serializers.ex.name=exception
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=exception
agent.sources.s1.selector.mapping.Exception=c1
agent.sources.s1.selector.default=c2

agent.channels=c1 c2
agent.channels.c1.type=memory
agent.channels.c2.type=memory

agent.sinks=k1 k2
agent.sinks.k1.type=hdfs
agent.sinks.k1.channel=c1
agent.sinks.k1.hdfs.path=/logs/exceptions/%y/%m/%d/%H

agent.sinks.k2.type=avro
agent.sinks.k2.channel=c2
agent.sinks.k2.hostname=foo.example.com
agent.sinks.k2.port=12345

Summary

In this chapter, we covered various interceptors shipped with Flume, including:

Timestamp: These are used to add a timestamp header, possibly overwriting an existing one.
Host: This is used to add the Flume agent hostname or IP as a header in the event.
Static: This is used to add static String headers.
Regular expression filtering: This is used to include or exclude events based on a matched regular expression.
Regular expression extractor: This is used to create headers from matched regular expressions. It's useful for routing with Channel Selectors.
Morphline: This is used to delegate transformation to a Morphline command chain.
Custom: This is used to create any custom transformations you need that you can't find elsewhere.

We also covered tiering data flows using the Avro source and sink. Optional compression and SSL with Avro flows were covered as well. Finally, Thrift sources and sinks were briefly covered, as some environments may already have Thrift dataflows to integrate with.

Next, we introduced two Log4J Appenders, a single path and a load-balancing version, for direct integration with Java applications.

The Embedded Flume Agent was covered for those wishing to directly integrate basic Flume functionality into their Java applications.

Finally, we gave you an example of using interceptors in conjunction with a Channel Selector to provide routing decision logic.

In the next chapter, we will dive into an example to stitch together everything we have covered so far.

Chapter 7. Putting It All Together

Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, nor does it cover every possible scenario you might need, but I think it should cover a couple of common use cases I've seen over and over:

Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing

In the first situation, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single place that you can search means getting your systems restored quickly. In the second scenario, you are interested in capturing data in the long term for analytics and machine learning.

Web logs to searchable UI

Let's simulate a web application by setting up a simple web server whose logs we want to stream into some searchable application. In this case, we'll be using a Kibana UI to perform ad hoc queries against Elasticsearch.

For this example, I'll start three servers in Amazon's Elastic Compute Cloud (EC2), as shown in this diagram:

Each server has a public IP (starting with 54) and a private IP (starting with 172). For interserver communication, I'll be using the private IPs in my configurations. My personal interaction with the web server (to simulate traffic) and with Kibana and Elasticsearch will require the public IPs, since I'm sitting in my house and not in Amazon's data centers.

Note

Pay careful attention to the IP addresses in the shell prompts, as we will be jumping from machine to machine, and I don't want you to get lost. For instance, on the Collector box, the prompt will contain its private IP: [ec2-user@ip-172-31-26-205 ~]$

If you try this out yourself in EC2, you will get different IP assignments, so adjust the configuration files and URLs referenced as needed. For this example, I'll be using Amazon's Linux AMI for the operating system and the t1.micro size server. Also, be sure to adjust the security groups to allow network traffic between these servers (and yourself). When in doubt about connectivity, use utilities such as telnet or nc to test connections between servers and ports.

Tip

While going through this exercise, I made mistakes in my security groups more than once, so perform these tests when you run into connectivity exceptions.

Setting up the web server

Let's start by setting up the web server. For this, we'll install the popular Nginx web server (http://nginx.org). Since web server logs are pretty much standardized nowadays, I could have chosen any web server, but the point here is that this application writes its logs (by default) to a file on the disk. Since I've warned you away from trying to use the tail program to stream them, we'll be using a combination of logrotate and the Spooling Directory Source to ingest the data. First, let's log into the server. The default account with sudo for an Amazon Linux server is ec2-user. You'll need to pass this in your ssh command:

% ssh 54.69.112.61 -l ec2-user
The authenticity of host '54.69.112.61 (54.69.112.61)' can't be established.
RSA key fingerprint is dc:ee:7a:1a:05:e1:36:cd:a4:81:03:97:48:8d:b3:cc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.69.112.61' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2014.09-release-notes/
18 package(s) needed for security, out of 41 available
Run "sudo yum update" to apply all updates.

Now that you are logged into the server, let's install Nginx (I'm not going to show all of the output to save paper):

[ec2-user@ip-172-31-18-146 ~]$ sudo yum -y install nginx
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 1:1.6.2-1.22.amzn1 will be installed
[SNIP]
Installed:
  nginx.x86_64 1:1.6.2-1.22.amzn1

Dependency Installed:
  GeoIP.x86_64 0:1.4.8-1.5.amzn1             gd.x86_64 0:2.0.35-11.10.amzn1
  gperftools-libs.x86_64 0:2.0-11.5.amzn1    libXpm.x86_64 0:3.5.10-2.9.amzn1
  libunwind.x86_64 0:1.1-2.1.amzn1

Complete!

Next, let's start the Nginx web server:

[ec2-user@ip-172-31-18-146 ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]

At this point, the web server should be running, so let's use our computer's web browser to go to the public IP, http://54.69.112.61/ in this case. If it is working, you should see the default welcome page, like this:

To help generate a sample input, you can use a load generator program such as Apache benchmark (http://httpd.apache.org/docs/2.4/programs/ab.html) or wrk (https://github.com/wg/wrk). Here is the command I used to generate data for 20 minutes at a time:

% wrk -c2 -d20m http://54.69.112.61
Running 20m test @ http://54.69.112.61
  2 threads and 2 connections

Configuring log rotation to the spool directory

By default, the server's access logs are written to /var/log/nginx/access.log. The problem we need to overcome here is moving the file to a separate directory while it is still open by the nginx process. This is where logrotate comes into the picture. Let's first create a spool directory that will become the input path for the Spooling Directory Source later on. For the purpose of this example, I'll just create the spool directory in the home directory of ec2-user. In a production environment, you would place the spool directory somewhere else:

[ec2-user@ip-172-31-18-146 ~]$ pwd
/home/ec2-user
[ec2-user@ip-172-31-18-146 ~]$ mkdir spool

Let's also create a logrotate script in the home directory of ec2-user. Open an editor and paste these contents (after the prompt) into a file called accessRotate.conf:

[ec2-user@ip-172-31-18-146 ~]$ cat accessRotate.conf
/var/log/nginx/access.log {
  missingok
  notifempty
  rotate 0
  copytruncate
  sharedscripts
  olddir /home/ec2-user/spool
  postrotate
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
  endscript
}

Let's go over this so that you understand what it is doing. This is by no means a complete introduction to the logrotate utility. Read the online documentation at http://linuxconfig.org/logrotate for more details.

When logrotate runs, it will copy /var/log/nginx/access.log to the /home/ec2-user/spool directory, but only if it has a nonzero length (meaning, there is data to send). The rotate 0 command tells logrotate not to remove any files from the target directory (since the Flume source will do that after it has successfully transmitted each file). We'll make use of the copytruncate feature because it keeps the file open as far as the nginx process is concerned, while resetting it to zero length. This avoids the need to signal nginx that a rotation has just occurred. The destination of the rotated file in the spool directory is /home/ec2-user/spool/access.log.1. Next, in the postrotate script section, we will change the ownership of the file from the root user to ec2-user so that the Flume agent can read it, and later, delete it. Finally, we rename the file with a unique timestamp so that when the next rotation occurs, the access.log.1 file will not exist. This also means that we'll need to add an exclusion rule to the Flume source later to keep it from reading the access.log.1 file until the copying process has completed. Once renamed, it is ready to be consumed by Flume.

Nowlet’stryrunningitindebugmode,whichwillbejustadryrunsothatwecanseewhatitcando.Normally,logrotatehasadailyrotationsettingthatpreventsitfromdoinganythingunlessenoughtimehaspassed:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-d/home/ec2-

user/accessRotate.conf

readingconfigfile/home/ec2-user/accessRotate.conf

readingconfiginfofor/var/log/nginx/access.log

olddirisnow/home/ec2-user/spool

Handling1logs

rotatingpattern:/var/log/nginx/access.log1048576bytes(nooldlogs

willbekept)

olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs

areremoved

consideringlog/var/log/nginx/access.log

logdoesnotneedrotating

notrunningpostrotatescript,sincenologswererotated

Therefore, we will also include the -f (force) flag to override the preceding behavior:

[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
  log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 0
dateext suffix '-20141207'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
renaming /home/ec2-user/spool/access.log.1 to /home/ec2-user/spool/access.log.2 (rotatecount 1, logstart 1, i 1),
renaming /home/ec2-user/spool/access.log.0 to /home/ec2-user/spool/access.log.1 (rotatecount 1, logstart 1, i 0),
copying /var/log/nginx/access.log to /home/ec2-user/spool/access.log.1
truncating /var/log/nginx/access.log
running postrotate script
running script with arg /var/log/nginx/access.log: "
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
"
removing old log /home/ec2-user/spool/access.log.2

As you can see, logrotate will copy the log to the spool directory with the .1 extension as expected, followed by our script block at the end to change permissions and rename it with a unique timestamp.

Nowlet’srunitagain,butthistimeforreal,withoutthedebugflag,andthenwewilllistthesourceandtargetdirectoriestoseewhathappened:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-f/home/ec2-

user/accessRotate.conf

[ec2-user@ip-172-31-18-146~]$ls-lspool/

total188

-rw-r--r--1ec2-userec2-user189344Dec717:27access.log.1417973241

[ec2-user@ip-172-31-18-146~]$ls-l/var/log/nginx/

total4

-rw-r--r--1rootroot0Dec717:27access.log

-rw-r--r--1rootroot520Dec716:44error.log

[ec2-user@ip-172-31-18-146~]$

Youcanseetheaccesslogisemptyandtheoldcontentshavebeencopiedtothespooldirectory,withcorrectpermissionsandthefilenameendinginauniquetimestamp.

Let’salsoverifythattheemptyaccesslogwillresultinnoactionifrunwithoutanydata.You’llneedtoincludethedebugflagtoseewhattheprocessisthinking:

[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-df/home/ec2-

user/accessRotate.conf

readingconfigfileaccessRotate.conf

readingconfiginfofor/var/log/nginx/access.log

olddirisnow/home/ec2-user/spool

Handling1logs

rotatingpattern:/var/log/nginx/access.logforcedfromcommandline(1

rotations)

olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs

areremoved

consideringlog/var/log/nginx/access.log

logdoesnotneedrotating

Now that we have a working rotation script, we need something to run it periodically (not in debug mode) at some chosen interval. For this, we will use the cron daemon (http://en.wikipedia.org/wiki/Cron) by creating a file in the /etc/cron.d directory:

[ec2-user@ip-172-31-18-146 ~]$ cat /etc/cron.d/rotateLogsToSpool
# Move files to spool directory every 5 minutes
*/5 * * * * root /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf

Here, I've indicated a five-minute interval and to run as the root user. Since the root user owns the original log file, we need elevated access to reassign ownership to ec2-user so that it can be removed later by Flume.

Once you've saved this cron configuration file, you should see our script running every 5 minutes (as well as other processes) by inspecting the cron daemon's log file:

[ec2-user@ip-172-31-18-146 ~]$ sudo tail /var/log/cron
Dec  7 17:45:01 ip-172-31-18-146 CROND[22904]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:50:01 ip-172-31-18-146 CROND[22929]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' started
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' terminated
Dec  7 17:55:01 ip-172-31-18-146 CROND[22955]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:00:01 ip-172-31-18-146 CROND[23046]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:01:01 ip-172-31-18-146 CROND[23060]: (root) CMD (run-parts /etc/cron.hourly)
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23060]: starting 0anacron
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23069]: finished 0anacron
Dec  7 18:05:01 ip-172-31-18-146 CROND[23117]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)

An inspection of the spool directory also shows new files created on our arbitrarily chosen 5-minute interval:

[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 1072
-rw-r--r-- 1 ec2-user ec2-user 145029 Dec  7 18:02 access.log.1417975373
-rw-r--r-- 1 ec2-user ec2-user 164169 Dec  7 18:05 access.log.1417975501
-rw-r--r-- 1 ec2-user ec2-user 166779 Dec  7 18:10 access.log.1417975801
-rw-r--r-- 1 ec2-user ec2-user 613785 Dec  7 18:15 access.log.1417976101

Keep in mind that the Nginx RPM package also installed a rotation configuration located at /etc/logrotate.d/nginx to perform daily rotations. If you are going to use this example in production, you'll want to remove that configuration, since we don't want it clashing with our more frequently running cron script. I'll leave handling error.log as an exercise for you. You'll either want to send it someplace (and have Flume remove it) or rotate it periodically so that your disk doesn't fill up over time.

Setting up the target – Elasticsearch

Now let's move to the other server in our diagram and set up Elasticsearch, our destination for searchable data. I'm going to be using the instructions found at http://www.elasticsearch.org/overview/elkdownloads/. First, let's download and install the RPM package:

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 18:25:35-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'

100%[======================================>] 26,326,154  9.92MB/s   in 2.5s

2014-12-07 18:25:38 (9.92 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-120 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

The installation is kind enough to tell me how to configure the service to automatically start on system boot-up, and it also tells me how to launch the service now, so let's do that:

[ec2-user@ip-172-31-26-120 ~]$ sudo /sbin/chkconfig --add elasticsearch
[ec2-user@ip-172-31-26-120 ~]$ sudo service elasticsearch start
Starting elasticsearch:                                    [  OK  ]

Let's perform a quick test with curl to verify that it is running on the default port, which is 9200:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/
{
  "status" : 200,
  "name" : "Siege",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.1",
    "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
    "build_timestamp" : "2014-11-26T15:49:29Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}

Since the indexes are created as data comes in, and we haven't sent any data, we expect to see no indexes created yet:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size

All we see is the header with no indexes listed, so let's head over to the collector server and set up the Flume relay which will write data to Elasticsearch.

Setting up Flume on collector/relay

The Flume configuration on the Flume collector will be compressed Avro coming in and Elasticsearch going out. Go ahead and log into the collector server (172.31.26.205).

Let's start by downloading the Flume binary from the Apache website. Follow the download link and select a mirror:

[ec2-user@ip-172-31-26-205 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 19:50:30-- http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'

100%[==========================>] 25,323,459  8.38MB/s   in 2.9s

2014-12-07 19:50:33 (8.38 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]

Next, expand and change directories:

[ec2-user@ip-172-31-26-205 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$

Create the configuration file, called collector.conf, in Flume's configuration directory using your favorite editor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es

collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1

collector.channels.m1.type=memory
collector.channels.m1.capacity=10000

collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120

Here, you can see that we are using a simple memory channel configured with a capacity of 10,000 events.

The source is configured to accept compressed Avro on port 12345 and pass it to our memory channel.

Finally, the sink is configured to write to Elasticsearch on the server we just set up at the 172.31.26.120 private IP. We are using the default settings, which means it'll write to the index named flume-YYYY-MM-DD with the log type.

Let's try running the Flume agent:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console

You'll see an exception in the log, including something like this:

2014-12-07 20:00:13,184 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/elasticsearch/common/io/BytesStream

The Flume agent can't find the Elasticsearch classes. Remember that these are not packaged with Flume, as the libraries need to be compatible with the version of Elasticsearch you are running. Looking back at the Elasticsearch server, we can get an idea of what we need. Remember that this list includes many runtime server dependencies, so it is probably more than what you'll need for a functional Elasticsearch client:

[ec2-user@ip-172-31-26-120 ~]$ rpm -qil elasticsearch | grep jar

/usr/share/elasticsearch/lib/elasticsearch-1.4.1.jar

/usr/share/elasticsearch/lib/groovy-all-2.3.2.jar

/usr/share/elasticsearch/lib/jna-4.1.0.jar

/usr/share/elasticsearch/lib/jts-1.13.jar

/usr/share/elasticsearch/lib/log4j-1.2.17.jar

/usr/share/elasticsearch/lib/lucene-analyzers-common-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-expressions-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-grouping-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-highlighter-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-join-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-memory-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-misc-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-queries-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-queryparser-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-sandbox-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-spatial-4.10.2.jar

/usr/share/elasticsearch/lib/lucene-suggest-4.10.2.jar

/usr/share/elasticsearch/lib/sigar/sigar-1.6.4.jar

/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar

Really, you only need the elasticsearch.jar file and its dependencies, but we are going to be lazy and just download the RPM again on the collector machine, and copy the JAR files to Flume:

[ec2-user@ip-172-31-26-205 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 20:03:38-- https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'

100%[======================================>] 26,326,154  9.70MB/s   in 2.6s

2014-12-07 20:03:41 (9.70 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1            ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start

This time we will not configure the service to start. Instead, we'll copy the JAR files we need to Flume's plugins directory architecture, which we learned about in the previous chapter:

[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ mkdir -p plugins.d/elasticsearch/libext
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cp /usr/share/elasticsearch/lib/*.jar plugins.d/elasticsearch/libext/

Now try running the Flume agent again:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console

No exceptions this time around, but still no data. Let's go back to the web server machine and set up the final Flume agent.

Setting up Flume on the client

On the first server, the web server, let's download the Flume binaries again and expand the package:

[ec2-user@ip-172-31-18-146 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 20:12:53-- http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'

100%[=========================>] 25,323,459  8.13MB/s   in 3.0s

2014-12-07 20:12:56 (8.13 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]

[ec2-user@ip-172-31-18-146 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-18-146 ~]$ cd apache-flume-1.5.2-bin

This time, our Flume configuration takes the spool directory we set up before, filled by logrotate, as input, and it needs to write compressed Avro to the collector server. Open an editor and create the client.conf file using your favorite editor:

[ec2-user@ip-172-31-18-146 apache-flume-1.5.2-bin]$ cat conf/client.conf
client.sources=sd
client.channels=m1
client.sinks=av

client.sources.sd.type=spooldir
client.sources.sd.spoolDir=/home/ec2-user/spool
client.sources.sd.deletePolicy=immediate
client.sources.sd.ignorePattern=access.log.1$
client.sources.sd.channels=m1

client.channels.m1.type=memory
client.channels.m1.capacity=10000

client.sinks.av.type=avro
client.sinks.av.hostname=172.31.26.205
client.sinks.av.port=12345
client.sinks.av.compression-type=deflate
client.sinks.av.channel=m1

Again, for simplicity, we are using a memory channel with a 10,000-record capacity.

For the source, we configure the Spooling Directory Source with /home/ec2-user/spool as the input. Additionally, we configure the deletion policy to remove the files after sending is complete. This is also where we set up the exclusion rule for the access.log.1 filename pattern mentioned earlier. Note the dollar sign at the end of the filename, denoting the end of the line. Without this, the exclusion pattern would also exclude valid files, such as access.log.1417975373.

Finally, an Avro sink is configured to point at the collector's private IP and port 12345. Additionally, we set the compression so that it matches the receiving Avro source's settings.

Nowlet’stryrunningtheagent:

[ec2-user@ip-172-31-18-146apache-flume-1.5.2-bin]$./bin/flume-ngagent-n

client-cconf-fconf/client.conf-Dflume.root.logger=INFO,console

Noexceptions!Butmoreimportantly,Iseethelogfilesinthespooldirectorybeingprocessedanddeleted:

2014-12-0720:59:04,041(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417976401

2014-12-0720:59:05,319(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417976701

2014-12-0720:59:06,245(pool-4-thread-1)[INFO-

org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentF

ile(ReliableSpoolingFileEventReader.java:390)]Preparingtodeletefile

/home/ec2-user/spool/access.log.1417977001

2014-12-0720:59:06,245(pool-4-thread-1)[INFO-

org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(Spo

olDirectorySource.java:254)]SpoolingDirectorySourcerunnerhasshutdown.

2014-12-0720:59:06,746(pool-4-thread-1)[INFO-

org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(Spo

olDirectorySource.java:254)]SpoolingDirectorySourcerunnerhasshutdown.

Onceallthefileshavebeenprocessed,thelastlinesarerepeatedevery500milliseconds.ThisisaknownbuginFlume(https://issues.apache.org/jira/browse/FLUME-2385).Ithasalreadybeenfixedandisslatedforthe1.6.0release,sobesuretosetuplogrotationonyourFlumeagentandcleanupthismessbeforeyourunoutofdiskspace.

On the collector box, we see writes occurring to Elasticsearch:

2014-12-07 22:10:03,694 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.execute(ElasticSearchTransportClient.java:181)] Sending bulk to elasticsearch cluster

Querying the Elasticsearch REST API, we can see an index with records:

[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index            pri rep docs.count docs.deleted store.size pri.store.size
yellow open   flume-2014-12-07   5   1      32102            0      1.3mb          1.3mb

Let's read the first 5 records:

[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 32102,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVVbObD75ecNqJg",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJj",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJo",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJt",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomjGVWbObD75ecNqJy",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:01:47 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@fields":{}}
    } ]
  }
}

As you can see, the log lines are in there under the @message field, and they can now be searched. However, we can do better. Let's break that message down into searchable fields.

Creating more search fields with an interceptor

Let's borrow some code from what we covered earlier in this book to extract some Flume headers from this common log format, knowing that all Flume headers will become fields in Elasticsearch. Since we are creating fields to be searched by in Elasticsearch, I'm going to add them to the collector's configuration rather than the web server's Flume agent.

Change the agent configuration on the collector to include a Regular Expression Extractor interceptor:

[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es

collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.sources.av.interceptors=e1
collector.sources.av.interceptors.e1.type=regex_extractor
collector.sources.av.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
collector.sources.av.interceptors.e1.serializers=ip dt url sc bc
collector.sources.av.interceptors.e1.serializers.ip.name=source
collector.sources.av.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
collector.sources.av.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
collector.sources.av.interceptors.e1.serializers.dt.name=timestamp
collector.sources.av.interceptors.e1.serializers.url.name=http_request
collector.sources.av.interceptors.e1.serializers.sc.name=status_code
collector.sources.av.interceptors.e1.serializers.bc.name=bytes_xfered

collector.channels.m1.type=memory
collector.channels.m1.capacity=10000

collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120

To follow the Logstash convention, I've renamed the hostname header to source. Now, to make the new format easy to find, I'm going to delete the existing index:

[ec2-user@ip-172-31-26-120 ~]$ curl -XDELETE 'http://localhost:9200/flume-2014-12-07/'
{"acknowledged":true}
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "error" : "IndexMissingException[[flume-2014-12-07] missing]",
  "status" : 404
}

Next, I create more traffic on my web server and wait for it to appear at the other end. Then I query some records to see what it looks like:

[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12083,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_H",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:15 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:15.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975995000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_M",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_R",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_W",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_a",
      "_score" : 1.0,
      "_source":{"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    } ]
  }
}

As you can see, we now have additional fields that we can search by. Additionally, you can see that Elasticsearch has taken our millisecond-based timestamp field and created its own @timestamp field in ISO-8601 format.

Now we can do some more interesting queries, such as finding out how many successful (status 200) pages we saw:

[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_count' -d '{"query":{"term":{"status_code":"200"}}}'
{"count":46018,"_shards":{"total":5,"successful":5,"failed":0}}

Setting up a better user interface – Kibana

While it appears as if we are done, there is one more thing we should do. The data is all there, but you need to be an expert in querying Elasticsearch to make good use of the information. After all, if the data is difficult to consume and gets ignored, then why bother collecting it at all? What you really need is a nice, searchable web interface that humans with non-technical backgrounds can use. For this, we are going to set up Kibana. In a nutshell, Kibana is a web application that runs as a dynamic HTML page in your browser, making calls for data to Elasticsearch when necessary. The result is an interactive web interface that doesn't require you to learn the details of the Elasticsearch query API. This is not the only option available to you; it is just what I'm using in this example. Let's download Kibana 3 from the Elasticsearch website, and install it on the same server (although you could easily serve this from another HTTP server):

[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
--2014-12-07 18:31:16-- https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074306 (1.0M) [application/octet-stream]
Saving to: 'kibana-3.1.2.tar.gz'

100%[======================================>] 1,074,306   1.33MB/s   in 0.8s

2014-12-07 18:31:17 (1.33 MB/s) - 'kibana-3.1.2.tar.gz' saved [1074306/1074306]

[ec2-user@ip-172-31-26-120 ~]$ tar -zxf kibana-3.1.2.tar.gz
[ec2-user@ip-172-31-26-120 ~]$ cd kibana-3.1.2

Note

At the time of writing this book, a newer version of Kibana (Kibana 4) is in beta. These instructions may be outdated by the time this book is released, but rest assured that somewhere in the Kibana setup, you can edit a configuration file to point to your Elasticsearch server. I have not tried out the newer version yet, but there is a good overview of it on the Elasticsearch blog at http://www.elasticsearch.org/blog/kibana-4-beta-3-now-more-filtery.

Open the config.js file to edit the line with the elasticsearch key. This needs to be the URL of the public name or IP of the Elasticsearch API. For our example, this line should look as shown here:

elasticsearch: 'http://ec2-54-148-230-252.us-west-2.compute.amazonaws.com:9200',

Now we need to serve this directory to a web browser. Let's download and install Nginx again:

[ec2-user@ip-172-31-26-120 ~]$ sudo yum install nginx

By default, the root directory is /usr/share/nginx/html. We can change this configuration in Nginx, but to make things easy, let's just create a symbolic link to point to the right location. First, move the original path out of the way by renaming it:

[ec2-user@ip-172-31-26-120 ~]$ sudo mv /usr/share/nginx/html /usr/share/nginx/html.dist

Next, link the configured Kibana directory as the new web root:

[ec2-user@ip-172-31-26-120 ~]$ sudo ln -s ~/kibana-3.1.2 /usr/share/nginx/html

Finally, start the web server:

[ec2-user@ip-172-31-26-120 ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]

From your computer, go to http://54.148.230.252/. If you see this page, it means you may have made a mistake in your Kibana configuration:

This error page means your web browser can't connect to Elasticsearch. You may need to clear your browser cache if you fixed the configuration, as web pages are typically cached locally for some period of time. If you got it right, the screen should look like this:

Go ahead and select Sample Dashboard, the first option. You should see something like what is shown in the next screenshot. This includes some of the data we ingested, record counts in the center, and the filtering fields in the left-hand margin.

I'm not going to claim to be a Kibana expert (I'm far from that), so I'll leave further customization of this to you. Use this as a base to go back and make additional modifications to the data to make it easier to consume, search, or filter. To do that, you'll probably need to get more familiar with how Elasticsearch works, but that's okay because knowledge isn't a bad thing. A copious amount of documentation is waiting for you at http://www.elasticsearch.org/guide/en/kibana/current/index.html.

At this point, we have completed an end-to-end implementation of data from a web server streamed in a near real-time fashion to a web-based tool for searching. Since both the source and target formats were dictated by others, we used an interceptor to transform the data en route. This use case is very good for short-term troubleshooting, but it's clearly not very "Hadoopy" since we have yet to use core Hadoop.

Archiving to HDFS

When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.

So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:

I used Cloudera's one-line installation instructions to speed up the setup. The instructions can be found at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html

Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM repository and imported the corresponding GPG key:

[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Retrieving http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:cloudera-cdh-5-0                 ################################# [100%]
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Next, I installed the pseudo-distributed configuration to run in a single-node Hadoop cluster:

[ec2-user@ip-172-31-7-175 ~]$ sudo yum install -y hadoop-0.20-conf-pseudo

This might take a while as it downloads all the Cloudera Hadoop distribution dependencies.

Since this configuration is for single-node use, we need to adjust the fs.defaultFS property in /etc/hadoop/conf/core-site.xml to advertise our private IP instead of localhost. If we don't do this, the namenode process will bind to 127.0.0.1, and other servers, such as our collector's Flume agent, will not be able to contact it:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://172.31.7.175:8020</value>
</property>

Next, we format the new HDFS volume and start the HDFS daemons (since that is all we need for this example):

[ec2-user@ip-172-31-7-175 ~]$ sudo -u hdfs hdfs namenode -format
[ec2-user@ip-172-31-7-175 ~]$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-7-175.out
Started Hadoop datanode (hadoop-hdfs-datanode):            [  OK  ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-7-175.out
Started Hadoop namenode:                                   [  OK  ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-172-31-7-175.out
Started Hadoop secondarynamenode:                          [  OK  ]
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -df
Filesystem                      Size   Used   Available  Use%
hdfs://172.31.7.175:8020  8318783488  24576  6661824512    0%

Now that HDFS is running, let's go back to the collector box configuration to create a second channel and an HDFS sink by adding these lines:

collector.channels.h1.type=memory
collector.channels.h1.capacity=10000

collector.sinks.hadoop.type=hdfs
collector.sinks.hadoop.channel=h1
collector.sinks.hadoop.hdfs.path=hdfs://172.31.7.175/access_logs/%Y/%m/%d/%H
collector.sinks.hadoop.hdfs.filePrefix=access
collector.sinks.hadoop.hdfs.rollInterval=60
collector.sinks.hadoop.hdfs.rollSize=0
collector.sinks.hadoop.hdfs.rollCount=0

Then we modify the top-level channels and sinks keys:

collector.channels=m1 h1
collector.sinks=es hadoop

As you can see, I've gone with a simple memory channel again, but feel free to use a durable file channel if you need it. For the HDFS configuration, I'll be using a dated file path under the /access_logs root directory with a 60-second rotation regardless of size. We are not altering the source just yet, so don't worry.

If we attempt to start the collector now, we see this exception:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType

Remember that for Apache Flume to speak to HDFS, we need compatible HDFS classes and dependencies for the version of Hadoop we are speaking to. Let's get some help from our friends at Cloudera and install the Hadoop client RPM (output removed to save paper):

[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
[ec2-user@ip-172-31-26-205 ~]$ sudo yum install hadoop-client

Test whether the RPM works by creating the destination directory for our data. We'll also set permissions to match the account we'll be running the Flume agent under (ec2-user in this case):

[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -mkdir hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -chown ec2-user:ec2-user hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ hadoop fs -ls hdfs://172.31.7.175/
Found 1 items
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 04:35 hdfs://172.31.7.175/access_logs

Now, if we run the collector Flume agent, we see no exceptions due to missing HDFS classes. The Flume startup script detects that we have a local Hadoop installation, and appends its classpath to its own.

Finally, we can go back to the Flume configuration file and split the incoming data at the source by listing both channels on the source as destination channels:

collector.sources.av.channels=m1 h1

By default, a replicating channel selector is used. This is what we want, so no further configuration is needed. Save the configuration file and restart the Flume agent.

When data is flowing, you should see the expected HDFS activity in the Flume collector agent's logs:

2014-12-11 05:05:04,141 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:261)] Creating hdfs://172.31.7.175/access_logs/2014/12/11/05/access.1418274302083.tmp

You should also see data appearing in HDFS:

[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -ls -R /access_logs
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12/11
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:06 /access_logs/2014/12/11/05
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302082
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302083
-rw-r--r--   3 ec2-user ec2-user   6429 2014-12-11 05:06 /access_logs/2014/12/11/05/access.1418274302084

If you run this command from a server other than the Hadoop node, you'll need to specify the full HDFS URI (hdfs://172.31.7.175/access_logs) or set the default.FS property in the core-site.xml configuration file in Hadoop.

Note

In retrospect, I probably should have configured the default.FS property on any installed Hadoop client that will use HDFS to save myself a lot of typing. Live and learn!

Like the preceding Elasticsearch implementation, you can now use this example as a base end-to-end configuration for HDFS. The next step will be to go back and modify the format, compression, and so on, on the HDFS Sink to better match what you'll do with the data later.

Summary

In this chapter, we iteratively assembled an end-to-end data flow. We started by setting up an Nginx web server to create access logs. We also configured cron to execute a logrotate configuration periodically to safely rotate old logs to a spooling directory.

Next, we installed and configured a single-node Elasticsearch server and tested some insertions and deletions. Then we configured a Flume client to read input from our spooling directory filled with web logs, and relay them to a Flume collector using compressed Avro serialization. The collector then relayed the incoming data to our Elasticsearch server.

Once we saw data flowing from one end to another, we set up a single-node HDFS server and modified our collector configuration to split the input data feed and relay a copy of the messages to HDFS, simulating archival storage. Finally, we set up a Kibana UI in front of our Elasticsearch instance to provide an easy search function for nontechnical consumers.

In the next chapter, we will cover monitoring Flume data flows using Ganglia.

Chapter 8. Monitoring Flume

The user guide for Flume states:

"Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole."

While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try and cover several of the items that are important, as we cover monitoring options in this chapter.

Monitoring the agent process

The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.

Monit

If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit have a paid version that provides more bells and whistles you may want to consider. Even in the free form, it can provide you with a way to check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.

Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will add checks on the disk, CPU, and memory usage as a minimum, in addition to what we cover in this chapter.

Nagios

Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins for other available applications.

My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it does provide more information to the overall monitoring of our Hadoop ecosystem.

Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.

MonitoringperformancemetricsNowthatwehavecoveredsomeoptionsforprocessmonitoring,howdoyouknowwhetheryourapplicationisactuallydoingtheworkyouthinkitis?Onmanyoccasions,I’veseenastucksyslog-ngprocessthatappearstoberunning,butitjustwasn’tsendinganydata.I’mnotpickingonsyslog-ngspecifically;allsoftwaredoesthiswhenconditionsthatarenotdesignedforoccur.

WhentalkingaboutFlumedataflows,youneedtomonitorthefollowing:

DataenteringsourcesiswithinexpectedratesDataisn’toverflowingyourchannelsDataisexitingsinksatexpectedrates

Flumehasapluggablemonitoringframework,butasmentionedatthebeginningofthechapter,itisstillverymuchaworkinprogress.Thisdoesnotmeanyoushouldn’tuseit,asthatwouldbefoolish.Itmeansyou’llwanttoprepareextratestingandintegrationtimeanytimeyouupgrade.

Note

While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
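Enabling remote JMX is just a matter of standard JVM flags passed to the agent, for example via JAVA_OPTS in conf/flume-env.sh. The port here is an arbitrary example, and you will want real authentication/SSL outside a lab:

export JAVA_OPTS="$JAVA_OPTS \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=5445 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"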

Ganglia

One of the available monitoring options to watch Flume internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

Java property                    Value                     Description
flume.monitoring.type            ganglia                   Set to ganglia
flume.monitoring.hosts           host1:port1,host2:port2   A comma-separated list of host:port pairs for your gmond process(es)
flume.monitoring.pollFrequency   60                        The number of seconds between sending of data (the default is 60 seconds)
flume.monitoring.isGanglia3      false                     Set to true if using the older Ganglia 3 protocol. The default is to send data using the v3.1 protocol.

Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:

udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}

In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:

-Dflume.monitoring.type=ganglia

-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649

Here, I am using defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.
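Putting it together, a full agent startup might look something like the following sketch. The agent name and the configuration file path are assumptions taken from the examples in this book; adjust them to your own layout:

flume-ng agent -n agent -c conf -f conf/flume.conf \
  -Dflume.monitoring.type=ganglia \
  -Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649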

Note

While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large/complicated network setup, you'll want to get educated by your network engineers if things don't work as you expect.

Internal HTTP server

You can configure the Flume agent to start an HTTP server that will output JSON that can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components, to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at the startup of the Flume agent:

Java property            Value   Description
flume.monitoring.type    http    The value is set to http
flume.monitoring.port    PORT    The port number to bind the HTTP server to

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.

Let's look at the following Flume configuration:

agent.sources=s1

agent.channels=c1

agent.sinks=k1

agent.sources.s1.type=avro

agent.sources.s1.bind=0.0.0.0

agent.sources.s1.port=12345

agent.sources.s1.channels=c1

agent.channels.c1.type=memory

agent.sinks.k1.type=avro

agent.sinks.k1.hostname=192.168.33.33

agent.sinks.k1.port=9999

agent.sinks.k1.channel=c1

Start the Flume agent with these properties:

-Dflume.monitoring.type=http

-Dflume.monitoring.port=44444

Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:

{

"SOURCE.s1":{

"OpenConnectionCount":"0",

"AppendBatchAcceptedCount":"0",

"AppendBatchReceivedCount":"0",

"Type":"SOURCE",

"EventAcceptedCount":"0",

"AppendReceivedCount":"0",

"StopTime":"0",

"EventReceivedCount":"0",

"StartTime":"1365128622891",

"AppendAcceptedCount":"0"},

"CHANNEL.c1":{

"EventPutSuccessCount":"0",

"ChannelFillPercentage":"0.0",

"Type":"CHANNEL",

"StopTime":"0",

"EventPutAttemptCount":"0",

"ChannelSize":"0",

"StartTime":"1365128621890",

"EventTakeSuccessCount":"0",

"ChannelCapacity":"100",

"EventTakeAttemptCount":"0"},

"SINK.k1":{

"BatchCompleteCount":"0",

"ConnectionFailedCount":"4",

"EventDrainAttemptCount":"0",

"ConnectionCreatedCount":"0",

"BatchEmptyCount":"0",

"Type":"SINK",

"ConnectionClosedCount":"0",

"EventDrainSuccessCount":"0",

"StopTime":"0",

"StartTime":"1365128622325",

"BatchUnderflowCount":"0"}

}

As you can see, each source, sink, and channel is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, that is, the number of connected clients (who are most likely sending data in). This may help you decide whether you have the expected number of clients relying on that data or, perhaps, too many clients, in which case you need to start tiering your agents.

Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea of whether the data is coming in faster than it is going out. It will also tell you whether you have it set large enough for maintenance/outages of your data volume.

Looking at the sink, EventDrainSuccessCount versus EventDrainAttemptCount will tell you how often output is successful when compared to the times tried. In this example, I am configuring an Avro sink to a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.

Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
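If you want to keep an eye on that one number from a shell script or a cron job, something like this works against the HTTP metrics endpoint shown earlier, assuming the jq utility is installed and the agent was started with monitoring on port 44444:

curl -s http://localhost:44444/metrics | jq -r '."CHANNEL.c1".ChannelFillPercentage'

Anything you can script against this endpoint, you can alert on.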

Custom monitoring hooks

If you already have a monitoring system, you may want to take the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do this, but looking at the interface, you will only see a start() and stop() method. Unlike the more obvious interceptor paradigm, the agent expects that your MonitorService implementation will start/stop a thread to send data on the expected or configured interval if it is the type to send data to a receiving service. If you are going to operate a service, such as the HTTP service, then start/stop would be used to start and stop your listening service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors using object names that start with org.apache.flume. Your implementation will need to read these from MBeanServer.

Note

The best advice I can give you, should you decide to implement your own, is to look at the source of the two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.
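For illustration only, here is a minimal sketch of what such an implementation might look like. It polls the Flume MBeans on a fixed interval and just prints them; you would replace the print with a push to your own system. The JMXPollUtil helper is what the bundled reporters use to read the MBeans, but verify that class and its signature against your Flume version, and treat the ConsoleMetricsServer name as a placeholder:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flume.instrumentation.MonitorService;
import org.apache.flume.instrumentation.util.JMXPollUtil;

public class ConsoleMetricsServer implements MonitorService {

  private ScheduledExecutorService scheduler;

  @Override
  public void start() {
    // The agent only calls start()/stop(), so we schedule our own polling thread.
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        // Returns a map keyed by component name, for example
        // "CHANNEL.c1" -> {"ChannelSize":"0", "ChannelFillPercentage":"0.0", ...}
        Map<String, Map<String, String>> metrics = JMXPollUtil.getAllMBeans();
        for (Map.Entry<String, Map<String, String>> entry : metrics.entrySet()) {
          // Replace this print with a call to your own metrics shipper.
          System.out.println(entry.getKey() + " => " + entry.getValue());
        }
      }
    }, 60, 60, TimeUnit.SECONDS);
  }

  @Override
  public void stop() {
    if (scheduler != null) {
      scheduler.shutdown();
    }
  }
}

The compiled class needs to be on the agent's classpath, for instance via the plugins.d directory covered in Chapter 6, with flume.monitoring.type set to its fully qualified name as described in the preceding note.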

Summary

In this chapter, we covered monitoring Flume agents both at the process level and through their internal metrics (whether they are actually doing work).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent internal monitoring metrics with the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to directly integrate with some other tool that's not supported by Flume by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.

Chapter 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

In this last chapter, I thought we should cover some of the less concrete, random thoughts I have around data collection into Hadoop. There's no hard science behind some of this, and you should feel perfectly alright to disagree with me.

While Hadoop is a great tool to consume vast quantities of data, I often think of a picture of the logjam that occurred in 1886 in the St. Croix River in Minnesota (http://www.nps.gov/sacn/historyculture/stories.htm). When dealing with too much data, you want to make sure you don't jam your river. Be sure to take the previous chapter on monitoring seriously and not just as nice-to-have information.

Transport time versus log time

I had a situation where data was being placed using date patterns in the filename and/or the path in HDFS, and it didn't match the contents of the directories. The expectation was that the data in the 2014/12/29 directory path contained all the data for December 29, 2014. However, the reality was that the date was being pulled from the transport. It turns out that the version of syslog we were using was rewriting the header, including the date portion, causing the data to take on the transport time and not reflect the original time of the record. Usually, the offsets were tiny, just a second or two, so nobody really took notice. However, one day, one of the relay servers died and when the data that had got stuck on upstream servers was finally sent, it had the current time. In this case, it was shifted by a couple of days, causing a significant data cleanup effort.

Be sure this isn't happening to you if you are placing data by date. Check the date edge cases to see that they are what you expect, and make sure you test your outage scenarios before they happen for real in production.

As I mentioned previously, these retransmits due to planned or unplanned maintenance (or even a tiny network hiccup) will most likely cause duplicate and out-of-order events to arrive, so be sure to account for this when processing raw data. There are no single-delivery or ordering guarantees in Flume. If you need that, use a transactional database or a distributed transaction log such as Apache Kafka (http://kafka.apache.org/) instead. Of course, if you are going to use Kafka, you would probably only use Flume for the final leg of your data path, with your source consuming events from Kafka (https://github.com/baniuyao/flume-ng-kafka-source).

Note

Remember that you can always work around duplicates in your data at query time as long as you can uniquely identify your events from one another. If you cannot distinguish events easily, you can add a Universally Unique Identifier (UUID) (http://en.wikipedia.org/wiki/Universally_unique_identifier) header using the bundled interceptor, UUIDInterceptor (configuration details are in the Flume User Guide).
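As a sketch only, the configuration looks something like this on the source where you want the header added. Verify the interceptor's class name and properties against the Flume User Guide for your version; the headerName value here is just an example:

agent.sources.s1.interceptors = uuid
agent.sources.s1.interceptors.uuid.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.s1.interceptors.uuid.headerName = eventId
agent.sources.s1.interceptors.uuid.preserveExisting = true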

Time zones are evil

In case you missed my bias against using local time in Chapter 4, Sinks and Sink Processors, I'll repeat it here a little stronger: time zones are evil, evil like Dr. Evil (http://en.wikipedia.org/wiki/Dr._Evil), and let's not forget about his Mini-Me counterpart (http://en.wikipedia.org/wiki/Mini-Me), Daylight Savings Time.

We live in a global world now. You are pulling data from all over the place into your Hadoop cluster. You may even have multiple data centers in different parts of the country (or the world). The last thing you want to be doing while trying to analyze your data is to deal with skewed data. Daylight Savings Time changes at least somewhere on Earth a dozen times in a year. Just look at the history: ftp://ftp.iana.org/tz/releases/. Save yourself the headache and just normalize the data to UTC. If you want to convert it to "local time" on its way to human eyeballs, feel free. However, while it lives in your cluster, keep it normalized to UTC.

Note

Consider adopting UTC everywhere via this Java startup parameter (if you can't set it system-wide): -Duser.timezone=UTC
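For a Flume agent specifically, one convenient place to set this is the agent's environment script, assuming you use the standard conf/flume-env.sh:

# conf/flume-env.sh
export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"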

Also, use the ISO 8601 (http://en.wikipedia.org/wiki/ISO_8601) time standard where possible, and be sure to include time zone information (even if it is UTC), for example, 2014-12-29T18:10:00Z. Every modern tool on the planet supports this format and it will save you pain down the road.

I live in Chicago, and our computers at work use Central Time, which adjusts for daylight savings. In our Hadoop cluster, we like to keep data in a YYYY/MM/DD/HH directory layout. Twice a year, some things break slightly. In the fall, we have twice as much data in our 2 a.m. directory. In the spring, there is no 2 a.m. directory. Madness!

Capacity planning

Regardless of how much data you think you have, things will change over time. New projects will pop up and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change over time.

There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster (we use the totally unscientific value of 20 percent, which means that we usually plan for 80 percent full when ordering additional hardware, but don't start to panic until we hit the 85-90 percent utilization number). Generally, you want to keep enough extra space so that the failure and/or maintenance of a server or two won't cause the HDFS block replication to consume all the remaining space.

You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is some limit to what tuning batch sizes can accomplish when under heavy data volumes. Be very careful in situations where you split your data flow at the source, using a replicating channel selector, to multiple channels/sinks. If one of the path's channels fills up, an exception is thrown back to the source. If that full channel is not marked as optional (in which case the data to it would simply be dropped), the source will stop consuming new data. This effectively jams the agent for all other channels attached to that source. You may not want to drop the data (by marking the channel as optional) because the data is important. Unfortunately, this is the only fan-out mechanism provided in Flume to send to multiple destinations, so make sure you catch issues quickly so that all your data flows are not impaired due to a cascading backup of events.
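For reference, marking a channel as optional on a replicating selector is a one-line addition. In this sketch, c1 might feed your critical HDFS path and c2 a best-effort copy that is allowed to drop when full:

agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = replicating
agent.sources.s1.selector.optional = c2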

As for the number of Flume agents feeding Hadoop, this too should be adjusted based on real numbers. Watch the channel size to see how well the writes are keeping up under normal loads. Adjust the maximum channel capacity to handle whatever amount of overhead makes you feel good. You can always purchase way more hardware than you need, but even a prolonged outage may overflow even the most conservative estimates. This is when you have to pick and choose which data is more important to you and adjust your channel capacities to reflect that. This way, if you exceed your limits, the least important data will be the first to be dropped.
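Capacity is just another channel property; for example, a file channel sized to absorb a chunk of your peak input might look like this sketch (the numbers are purely illustrative):

agent.channels.c1.type = file
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 10000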

Chances are your company doesn't have an infinite amount of money and at some point, the value of the data versus the cost of continuing to expand your cluster will start to be questioned. This is why setting limits on the volume of data collected is very important. This is just one aspect of your data retention policy, where cost is the driving factor. In a moment, we'll discuss some of the compliance aspects of this policy. Suffice to say, any project sending data into Hadoop should be able to say what the value of that data is and what the loss is if we delete the older stuff. This is the only way the people writing the checks can make an informed decision.

Considerations for multiple data centers

If you run your business out of multiple data centers and have a large volume of data collected, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. There may be regulatory implications regarding data crossing certain geographic boundaries. Chances are there is somebody in your company who knows much more about compliance than you or I, so seek them out before you start copying data across borders. Of course, not collating your data will make it more difficult to analyze it, as you can't just run one MapReduce job against all the data. Instead, you would have to run parallel jobs and then combine the results in a second pass. Adjusting your data processing procedures is better than potentially breaking the law. Be sure to do your homework.

Pulling all your data into a single cluster may also be more than your networking can handle. Depending on how your data centers are connected to each other, you simply may not be able to transmit the desired volume of data. If you use public cloud services, there are surely data transfer costs between data centers. Finally, consider that a complete cluster failure or corruption may wipe out everything, as most clusters are usually too big to back up everything except high value data. Having some of the old data in this case is sometimes better than having nothing. With multiple Hadoop clusters, you have the ability to use a Failover Sink Processor to forward data to a different cluster if you don't want to wait to send it to the local one.
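A sink group with the failover processor, as covered in Chapter 4, might look like this sketch, where the sink names are placeholders for Avro sinks pointing at each destination:

agent.sinkgroups = sg1
agent.sinkgroups.sg1.sinks = localDC remoteDC
agent.sinkgroups.sg1.processor.type = failover
agent.sinkgroups.sg1.processor.priority.localDC = 10
agent.sinkgroups.sg1.processor.priority.remoteDC = 5
agent.sinkgroups.sg1.processor.maxpenalty = 10000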

If you do choose to send all your data to a single destination, consider adding a large disk capacity machine as a relay server for the data center. This way, if there is a communication issue or extended cluster maintenance, you can let data pile up on a machine that's different from the ones trying to service your customers. This is sound advice even in a single data center situation.

Compliance and data expiry

Remember that the data your company is collecting from your customers should be considered sensitive information. You may be bound by additional regulatory limitations on accessing data such as:

Personally identifiable information (PII): How you handle and safeguard customers' identities. http://en.wikipedia.org/wiki/Personally_identifiable_information
Payment Card Industry Data Security Standard (PCI DSS): How you safeguard credit card information. http://en.wikipedia.org/wiki/PCI_DSS
Service Organization Control (SOC-2): How you control access to information/systems. http://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/Pages/AICPASOC2Report.aspx
Statements on Standards for Attestation Engagements (SSAE-16): How you manage changes. http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AT-00801.pdf
Sarbanes-Oxley (SOX): http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act

This is by no means a definitive list, so be sure to seek out your company's compliance experts for what does and doesn't apply to your situation. If you aren't properly handling access to this data in your cluster, the government will lean on you, or worse, you won't have customers anymore if they feel you aren't protecting their personal information. Consider scrambling, trimming, or obfuscating your data of personal information. Chances are the business insight you are looking for falls more into the category of "how many people who search for 'hammer' actually buy one?" rather than "how many customers are named Bob?" As you saw in Chapter 6, Interceptors, ETL, and Routing, it would be very easy to write an interceptor to obfuscate PII as you move it around.

Your company probably has a document retention policy that includes the data you are putting into Hadoop. Make sure you remove data that your policy says you aren't supposed to be keeping around anymore. The last thing you want is a visit from the lawyers.

Summary

In this chapter, we covered several real-world considerations you need to think about when planning your Flume implementation, including:

Transport time does not always match event time
The mayhem introduced with Daylight Savings Time to certain time-based logic
Capacity planning considerations
Items to consider when you have more than one data center
Data compliance
Data retention and expiration

I hope you enjoyed this book. Hopefully, you will be able to apply much of this information directly in your application/Hadoop integration efforts.

Thanks, this was fun!
