Apache Flume: Distributed Log Collection for Hadoop Second Edition

Table of Contents

Apache Flume: Distributed Log Collection for Hadoop Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview and Architecture
Flume 0.9
Flume 1.X (Flume-NG)
The problem with HDFS and streaming data/logs
Sources, channels, and sinks
Flume events
Interceptors, channel selectors, and sink processors
Tiered data collection (multiple flows and/or agents)
The Kite SDK
Summary
2. A Quick Start Guide to Flume
Downloading Flume
Flume in Hadoop distributions
An overview of the Flume configuration file
Starting up with “Hello, World!”
Summary
3. Channels
The memory channel
The file channel
Spillable Memory Channel
Summary
4. Sinks and Sink Processors
HDFS sink
Path and filename
File rotation
Compression codecs
Event Serializers
Text output
Text with headers
Apache Avro
User-provided Avro schema
File type
SequenceFile
DataStream
CompressedStream
Timeouts and workers
Sink groups
Load balancing
Failover
MorphlineSolrSink
Morphline configuration files
Typical SolrSink configuration
Sink configuration
ElasticSearchSink
LogStash Serializer
Dynamic Serializer
Summary
5. Sources and Channel Selectors
The problem with using tail
The Exec source
Spooling Directory Source
Syslog sources
The syslog UDP source
The syslog TCP source
The multiport syslog TCP source
JMS source
Channel selectors
Replicating
Multiplexing
Summary
6. Interceptors, ETL, and Routing
Interceptors
Timestamp
Host
Static
Regular expression filtering
Regular expression extractor
Morphline interceptor
Custom interceptors
The plugins directory
Tiering flows
The Avro source/sink
Compressing Avro
SSL Avro flows
The Thrift source/sink
Using command-line Avro
The Log4J appender
The Log4J load-balancing appender
The embedded agent
Configuration and startup
Sending data
Shutdown
Routing
Summary
7. Putting It All Together
Web logs to searchable UI
Setting up the web server
Configuring log rotation to the spool directory
Setting up the target – Elasticsearch
Setting up Flume on collector/relay
Setting up Flume on the client
Creating more search fields with an interceptor
Setting up a better user interface – Kibana
Archiving to HDFS
Summary
8. Monitoring Flume
Monitoring the agent process
Monit
Nagios
Monitoring performance metrics
Ganglia
Internal HTTP server
Custom monitoring hooks
Summary
9. There Is No Spoon – the Realities of Real-time Distributed Data Collection
Transport time versus log time
Time zones are evil
Capacity planning
Considerations for multiple data centers
Compliance and data expiry
Summary
Index
Apache Flume: Distributed Log Collection for Hadoop Second Edition

Copyright © 2015 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: July 2013

Second edition: February 2015

Production reference: 1190215

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78439-217-8

www.packtpub.com
Credits

Author
Steve Hoffman

Reviewers
Sachin Handiekar
Michael Keane
Stefan Will

Commissioning Editor
Dipika Gaonkar

Acquisition Editor
Reshma Raman

Content Development Editor
Neetu Ann Mathew

Technical Editor
Menza Mathew

Copy Editors
Vikrant Phadke
Stuti Srivastava

Project Coordinator
Mary Alex

Proofreader
Simran Bhogal
Safis Editing

Indexer
Rekha Nair

Graphics
Sheetal Aute
Abhinash Sahu

Production Coordinator
Komal Ramchandani

Cover Work
Komal Ramchandani
About the Author

Steve Hoffman has 32 years of experience in software development, ranging from embedded software development to the design and implementation of large-scale, service-oriented, object-oriented systems. For the last 5 years, he has focused on infrastructure as code, including automated Hadoop and HBase implementations and data ingestion using Apache Flume. Steve holds a BS in computer engineering from the University of Illinois at Urbana-Champaign and an MS in computer science from DePaul University. He is currently a senior principal engineer at Orbitz Worldwide (http://orbitz.com/).

More information on Steve can be found at http://bit.ly/bacoboy and on Twitter at @bacoboy.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing.

I'd again like to dedicate this updated book to my loving and supportive wife, Tracy. She puts up with a lot, and that is very much appreciated. I couldn't ask for a better friend daily by my side.

My terrific children, Rachel and Noah, are a constant reminder that hard work does pay off and that great things can come from chaos.

I also want to give a big thanks to my parents, Alan and Karen, for molding me into the somewhat satisfactory human I've become. Their dedication to family and education above all else guides me daily as I attempt to help my own children find their happiness in the world.
About the Reviewers

Sachin Handiekar is a senior software developer with over 5 years of experience in Java EE development. He graduated in computer science from the University of Greenwich, London, and currently works for a global consulting company, developing enterprise applications using various open source technologies, such as Apache Camel, ServiceMix, ActiveMQ, and ZooKeeper.

Sachin has a lot of interest in open source projects. He has contributed code to Apache Camel and developed plugins for Spring Social, which can be found at GitHub (https://github.com/sachin-handiekar).

He also actively writes about enterprise application development on his blog (http://sachinhandiekar.com).

Michael Keane has a BS in computer science from the University of Illinois at Urbana-Champaign. He has worked as a software engineer, coding almost exclusively in Java since JDK 1.1. He has also worked in the mission-critical medical device software, e-commerce, transportation, navigation, and advertising domains. He is currently a development leader for Conversant, where he maintains Flume flows of nearly 100 billion log lines per day.

Michael is a father of three, and besides work, he spends most of his time with his family and coaching youth softball.

Stefan Will is a computer scientist with a degree in machine learning and pattern recognition from the University of Bonn, Germany. For over a decade, he has worked for several start-ups in Silicon Valley and Raleigh, North Carolina, in the area of search and analytics. Presently, he leads the development of the search backend and real-time analytics platform at Zendesk, a provider of customer service software.
www.PacktPub.com

Support files, eBooks, discount offers, and more

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at <[email protected]> for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

https://www2.packtpub.com/books/subscription/packtlib

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

Why subscribe?

Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser

Free access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.
Preface

Hadoop is a great open source tool for shifting tons of unstructured data into something manageable so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your data center, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data. Otherwise, you just have a very expensive heat generator! You will quickly realize (once you get past the "playing around" phase with Hadoop) that you will need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution for this problem, but no more! Flume was started as a project out of Cloudera, when its integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives with the Apache Foundation, is under active development, and boasts of users who have been using it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS Sink for buffering and writing data into HDFS (the Hadoop File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.
What this book covers

Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's trying to address (specifically with regards to Hadoop). An architectural overview of the various components to be covered in later chapters is given.

Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It includes downloading Flume, creating a "Hello, World!" configuration, and running it.

Chapter 3, Channels, covers the two major channels most people will use and the configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume output, including compression options and options for formatting the data. Failover options are also covered so that you can create a more robust data pipeline.

Chapter 5, Sources and Channel Selectors, introduces several of the Flume input mechanisms and their configuration options. Also covered is switching between different channels based on data content, which allows the creation of complex data flows.

Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as well as extract information from the payload to use with Channel Selectors to make routing decisions. Then this chapter covers tiering Flume agents using Avro serialization, as well as using the Flume command line as a standalone Avro client for testing and importing data manually.

Chapter 7, Putting It All Together, walks you through the details of an end-to-end use case from the web server logs to a searchable UI, backed by Elasticsearch as well as archival storage in HDFS.

Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.

Chapter 9, There Is No Spoon – the Realities of Real-time Distributed Data Collection, is a collection of miscellaneous things to consider that are outside the scope of just configuring and using Flume.
What you need for this book

You'll need a computer with a Java Virtual Machine installed, since Flume is written in Java. If you don't have Java on your computer, you can download it from http://java.com/.

You will also need an Internet connection so that you can download Flume to run the Quick Start example.

This book covers Apache Flume 1.5.2.

Who this book is for

This book is for people responsible for implementing the automatic movement of data from various systems to a Hadoop cluster. If it is your job to load data into Hadoop on a regular basis, this book should help you to code yourself out of manual monkey work or from writing a custom tool you'll be supporting for as long as you work at your company.

Only basic knowledge of Hadoop and HDFS is required. Some custom implementations are covered, should your needs necessitate them. For this level of implementation, you will need to know how to program in Java.

Finally, you'll need your favorite text editor, since most of this book covers how to configure various Flume components via an agent's text configuration file.
Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and explanations of their meanings.

Code words in text are shown as follows: "If you want to use this feature, you set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property."

A block of code is set as follows:

agent.sinks.k1.hdfs.path = /logs/apache/access
agent.sinks.k1.hdfs.filePrefix = access
agent.sinks.k1.hdfs.fileSuffix = .log

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

agent.sources.s1.command = uptime
agent.sources.s1.restart = true
agent.sources.s1.restartThrottle = 60000

Any command-line input or output is written as follows:

$ tar -zxf apache-flume-1.5.2.tar.gz
$ cd apache-flume-1.5.2

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Flume was first introduced in Cloudera's CDH3 distribution in 2011."

Note
Warnings or important notes appear in a box like this.

Tip
Tips and tricks appear like this.
Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
Chapter 1. Overview and Architecture

If you are reading this book, chances are you are swimming in oceans of data. Creating mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital cameras and camera phones, YouTube, Google, and just about anything else you can think of being connected to the Internet. As a provider of a website, 10 years ago, your application logs were only used to help you troubleshoot your website. Today, this same data can provide a valuable insight into your business and customers if you know how to pan gold out of your river of data.

Furthermore, as you are reading this book, you are also aware that Hadoop was created to solve (partially) the problem of sifting through mountains of data. Of course, this only works if you can reliably load your Hadoop cluster with data for your data scientists to pick apart.

Getting data into and out of Hadoop (in this case, the Hadoop File System, or HDFS) isn't hard; it is just a simple command, such as:

% hadoop fs -put data.csv .

This works great when you have all your data neatly packaged and ready to upload.

However, your website is creating data all the time. How often should you batch load data to HDFS? Daily? Hourly? Whatever processing period you choose, eventually somebody always asks, "Can you get me the data sooner?" What you really need is a solution that can deal with streaming logs/data.

Turns out you aren't alone in this need. Cloudera, a provider of professional services for Hadoop as well as their own distribution of Hadoop, saw this need over and over when working with their customers. Flume was created to fill this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.
Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually had only one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stores undeliverable data to the local disk (or sometimes, a local database) and would keep retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June 2011, Cloudera moved control of the Flume project to the Apache Foundation. It came out of incubator status a year later in 2012. During the incubation year, work had already begun to refactor Flume under the Star Trek-themed tag, Flume-NG (Flume the Next Generation).
Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.

The first edition of this book covered all the versions of Flume up to Version 1.3.1. This second edition covers up to Version 1.5.2 (the current version at the time of writing).
The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionally high compared to the data it is processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job runtimes.

These factors need to be weighed when determining the rotation period to use when writing to HDFS. If the plan is to keep the data around for a short time, then you can lean toward the smaller file size. However, if you plan on keeping the data for a very long time, you can either target larger files or do some periodic cleanup to compact smaller files into fewer, larger files to make them more MapReduce friendly. After all, you only ingest the data once, but you might run a MapReduce job on that data hundreds or thousands of times.
Sources, channels, and sinks

The Flume agent's architecture can be viewed in this simple diagram. Inputs are called sources and outputs are called sinks. Channels provide the glue between sources and sinks. All of these run inside a daemon called an agent.

[Diagram: sources feeding channels, and channels feeding sinks, all running inside a single agent daemon]

Note
Keep in mind:
A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
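To make these rules concrete, here is a minimal sketch of that wiring in the properties-file configuration format used throughout this book (the component names s1, c1, c2, k1, and k2 are placeholders):

agent.sources = s1
agent.channels = c1 c2
agent.sinks = k1 k2

# A source may write to more than one channel (note the plural property name)
agent.sources.s1.channels = c1 c2

# Each sink reads from exactly one channel (singular property name)
agent.sinks.k1.channel = c1
agent.sinks.k2.channel = c2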
Flume events

The basic payload of data transported by Flume is called an event. An event is composed of zero or more headers and a body.

The headers are key/value pairs that can be used to make routing decisions or carry other structured information (such as the timestamp of the event or the hostname of the server from which the event originated). You can think of it as serving the same function as HTTP headers—a way to pass additional information that is distinct from the body.

The body is an array of bytes that contains the actual payload. If your input is composed of tailed log files, the array is most likely a UTF-8-encoded string containing a line of text.

Flume may add additional headers automatically (for example, when a source adds the hostname where the data is sourced or creates an event's timestamp), but the body is mostly untouched unless you edit it en route using interceptors.
Interceptors, channel selectors, and sink processors

An interceptor is a point in your data flow where you can inspect and alter Flume events. You can chain zero or more interceptors after a source creates an event. If you are familiar with the AOP Spring Framework, think MethodInterceptor. In Java Servlets, it's similar to ServletFilter. Here's an example of what using four chained interceptors on a source might look like:

[Diagram: a source followed by four chained interceptors feeding a channel]

Channel selectors are responsible for how data moves from a source to one or more channels. Flume comes packaged with two channel selectors that cover most use cases you might have, although you can write your own if need be. A replicating channel selector (the default) simply puts a copy of the event into each channel, assuming you have configured more than one. In contrast, a multiplexing channel selector can write to different channels depending on some header information. Combined with some interceptor logic, this duo forms the foundation for routing input to different channels.
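As an illustrative sketch (the shape header and the channel names are invented for this example), a multiplexing channel selector might be configured like this:

agent.sources.s1.channels = c1 c2
agent.sources.s1.selector.type = multiplexing
# Route based on the value of the "shape" header set by an interceptor
agent.sources.s1.selector.header = shape
agent.sources.s1.selector.mapping.square = c1
agent.sources.s1.selector.mapping.triangle = c2
# Events with no matching header value fall through to c1
agent.sources.s1.selector.default = c1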
Finally, a sink processor is the mechanism by which you can create failover paths for your sinks or load balance events across multiple sinks from a channel.
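For example, a failover path is expressed through a sink group; the following sketch is illustrative only (the group name, sink names, and priority values are made up), with higher-priority sinks tried first:

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = k1 k2
agent.sinkgroups.g1.processor.type = failover
# k1 is preferred; k2 takes over if k1 fails
agent.sinkgroups.g1.processor.priority.k1 = 10
agent.sinkgroups.g1.processor.priority.k2 = 5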
Tiered data collection (multiple flows and/or agents)

You can chain your Flume agents depending on your particular use case. For example, you may want to insert an agent in a tiered fashion to limit the number of clients trying to connect directly to your Hadoop cluster. More likely, your source machines don't have sufficient disk space to deal with a prolonged outage or maintenance window, so you create a tier with lots of disk space between your sources and your Hadoop cluster.

[Diagram: tiered data collection, with two source agents on the left, HDFS in Datacenter 1 and ElasticSearch in Datacenter 2 on the right]

In the preceding diagram, you can see that there are two places where data is created (on the left-hand side) and two final destinations for the data (the HDFS and ElasticSearch cloud bubbles on the right-hand side). To make things more interesting, let's say one of the machines generates two kinds of data (let's call them square and triangle data). You can see that in the lower-left agent, we use a multiplexing channel selector to split the two kinds of data into different channels. The square channel is then routed to the agent in the upper-right corner (along with the data coming from the upper-left agent). The combined volume of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data transformations can occur after any source. How all of these components can be used to build complicated data workflows will become clear as we proceed.
The Kite SDK

One of the new technologies incorporated in Flume, starting with Version 1.4, is something called a Morphline. You can think of a Morphline as a series of commands chained together to form a data transformation pipe.

If you are a fan of pipelining Unix commands, this will be very familiar to you. The commands themselves are intended to be small, single-purpose functions that when chained together create powerful logic. In many ways, using a Morphline command chain can be identical in functionality to the interceptor paradigm just mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors, ETL, and Routing, which you can use instead of, or in addition to, the included Java-based interceptors.

Note
To get an idea of how useful these commands can be, take a look at the handy grok command and its included extensible regular expression library at https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns.

Many of the custom Java interceptors that I've written in the past were to modify the body (data) and can easily be replaced with an out-of-the-box Morphline command chain. You can get familiar with the Morphline commands by checking out their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.

Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, in the MorphlineSolrSink section.

Morphlines are just one component of the Kite SDK included in Flume. Starting with Version 1.5, Flume has added experimental support for KiteData, which is an effort to create a standard library for datasets in Hadoop. It looks very promising, but it is outside the scope of this book.

Note
Please see the project home page for more information, as it will certainly become more prominent in the Hadoop ecosystem as the technology matures. You can read all about the Kite SDK at http://kitesdk.org.
Summary

In this chapter, we discussed the problem that Flume is attempting to solve: getting data into your Hadoop cluster for data processing in an easily configured, reliable way. We also discussed the Flume agent and its logical components, including events, sources, channel selectors, channels, sink processors, and sinks. Finally, we briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load) library, starting with Version 1.4 of Flume.

The next chapter will cover these in more detail, specifically, the most commonly used implementations of each. Like all good open source projects, almost all of these components are extensible if the bundled ones don't do what you need them to do.
Chapter 2. A Quick Start Guide to Flume

As we covered some of the basics in the previous chapter, this chapter will help you get started with Flume. So, let's start with the first step: downloading and configuring Flume.
Downloading Flume

Let's download Flume from http://flume.apache.org/. Look for the download link in the side navigation. You'll see two compressed .tar archives available along with the checksum and GPG signature files used to verify the archives. Instructions to verify the download are on the website, so I won't cover them here. Checking the checksum file contents against the actual checksum verifies that the download was not corrupted. Checking the signature file validates that all the files you are downloading (including the checksum and signature) came from Apache and not some nefarious location. Do you really need to verify your downloads? In general, it is a good idea and it is recommended by Apache that you do so. If you choose not to, I won't tell.

The binary distribution archive has bin in the name, and the source archive is marked with src. The source archive contains just the Flume source code. The binary distribution is much larger because it contains not only the Flume source and the compiled Flume components (jars, javadocs, and so on), but also all the dependent Java libraries. The binary package contains the same Maven POM file as the source archive, so you can always recompile the code even if you start with the binary distribution.

Go ahead, download and verify the binary distribution to save us some time in getting started.
Flume in Hadoop distributions

Flume is available with some Hadoop distributions. The distributions supposedly provide bundles of Hadoop's core components and satellite projects (such as Flume) in a way that ensures things such as version compatibility and additional bug fixes are taken into account. These distributions aren't better or worse; they're just different.

There are benefits to using a distribution. Someone else has already done the work of pulling together all the version-compatible components. Today, this is less of an issue since the Apache Bigtop project started (http://bigtop.apache.org/). Nevertheless, having prebuilt standard OS packages, such as RPMs and DEBs, eases installation as well as provides startup/shutdown scripts. Each distribution has different levels of free and paid options, including paid professional services if you really get into a situation you just can't handle.

There are downsides, of course. The version of Flume bundled in a distribution will often lag quite a bit behind the Apache releases. If there is a new or bleeding-edge feature you are interested in using, you'll either be waiting for your distribution's provider to backport it for you, or you'll be stuck patching it yourself. Furthermore, while the distribution providers do a fair amount of testing, as with any general-purpose platform, you will most likely encounter something that their testing didn't cover, in which case, you are still on the hook to come up with a workaround or dive into the code, fix it, and hopefully, submit that patch back to the open source community (where, at a future point, it'll make it into an update of your distribution or the next version).

So, things move slower in a Hadoop distribution world. You can see that as good or bad. Usually, large companies don't like the instability of bleeding-edge technology or making changes often, as change can be the most common cause of unplanned outages. You'd be hard pressed to find such a company using the bleeding-edge Linux kernel rather than something like Red Hat Enterprise Linux (RHEL), CentOS, Ubuntu LTS, or any of the other distributions whose target is stability and compatibility. If you are a startup building the next Internet fad, you might need that bleeding-edge feature to get a leg up on the established competition.

If you are considering a distribution, do the research and see what you are getting (or not getting) with each. Remember that each of these offerings is hoping that you'll eventually want and/or need their Enterprise offering, which usually doesn't come cheap. Do your homework.

Note
Here's a short, nondefinitive list of some of the more established players. For more information, refer to the following links:
Cloudera: http://cloudera.com/
Hortonworks: http://hortonworks.com/
MapR: http://mapr.com/
An overview of the Flume configuration file

Now that we've downloaded Flume, let's spend some time going over how to configure an agent.

A Flume agent's default configuration provider uses a simple Java property file of key/value pairs that you pass as an argument to the agent upon startup. As you can configure more than one agent in a single file, you will need to additionally pass an agent identifier (called a name) so that it knows which configurations to use. In my examples where I'm only specifying one agent, I'm going to use the name agent.

Note
By default, the configuration property file is monitored for changes every 30 seconds. If a change is detected, Flume will attempt to reconfigure itself. In practice, many of the configuration settings cannot be changed after the agent has started. Save yourself some trouble and pass the undocumented --no-reload-conf argument when starting the agent (except in development situations perhaps).

If you use the Cloudera distribution, the passing of this flag is currently not possible. I've opened a ticket to fix that at https://issues.cloudera.org/browse/DISTRO-648. If this is important to you, please vote it up.
Each agent is configured, starting with three parameters:

agent.sources = <list of sources>
agent.channels = <list of channels>
agent.sinks = <list of sinks>

Tip
Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Each source, channel, and sink also has a unique name within the context of that agent. For example, if I'm going to transport my Apache access logs, I might define a channel named access. The configurations for this channel would all start with the agent.channels.access prefix. Each configuration item has a type property that tells Flume what kind of source, channel, or sink it is. In this case, we are going to use an in-memory channel whose type is memory. The complete configuration for the channel named access in the agent named agent would be:

agent.channels.access.type = memory

Any arguments to a source, channel, or sink are added as additional properties using the same prefix. The memory channel has a capacity parameter to indicate the maximum number of Flume events it can hold. Let's say we didn't want to use the default value of 100; our configuration would now look like this:

agent.channels.access.type = memory
agent.channels.access.capacity = 200

Finally, we need to add the access channel name to the agent.channels property so that the agent knows to load it:

agent.channels = access

Let's look at a complete example using the canonical Hello, World! example.
Starting up with “Hello, World!”

No technical book would be complete without a Hello, World! example. Here is the configuration file we'll be using:

agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = netcat
agent.sources.s1.channels = c1
agent.sources.s1.bind = 0.0.0.0
agent.sources.s1.port = 12345

agent.channels.c1.type = memory

agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

Here, I've defined one agent (called agent) who has a source named s1, a channel named c1, and a sink named k1.

The s1 source's type is netcat, which simply opens a socket listening for events (one line of text per event). It requires two parameters: a bind IP and a port number. In this example, we are using 0.0.0.0 for a bind address (the Java convention to specify listen on any address) and port 12345. The source configuration also has a parameter called channels (plural), which is the name of the channel(s) the source will append events to, in this case, c1. It is plural, because you can configure a source to write to more than one channel; we just aren't doing that in this simple example.

The channel named c1 is a memory channel with a default configuration.

The sink named k1 is of the logger type. This is a sink that is mostly used for debugging and testing. It will log all events at the INFO level using Log4j, which it receives from the configured channel, in this case, c1. Here, the channel keyword is singular because a sink can only be fed data from one channel.

Using this configuration, let's run the agent and connect to it using the Linux netcat utility to send an event.
First, explode the .tar archive of the binary distribution we downloaded earlier:

$ tar -zxf apache-flume-1.5.2-bin.tar.gz
$ cd apache-flume-1.5.2-bin

Next, let's briefly look at the help. Run the flume-ng command with the help command:

$ ./bin/flume-ng help
Usage: ./bin/flume-ng <command> [options]...

commands:
  help                      display this help text
  agent                     run a Flume agent
  avro-client               run an avro Flume client
  version                   show Flume version info

global options:
  --conf,-c <conf>          use configs in <conf> directory
  --classpath,-C <cp>       append to the classpath
  --dryrun,-d               do not actually start Flume, just print the command
  --plugins-path <dirs>     colon-separated list of plugins.d directories. See the
                            plugins.d section in the user guide for more details.
                            Default: $FLUME_HOME/plugins.d
  -Dproperty=value          sets a Java system property value
  -Xproperty=value          sets a Java -X option

agent options:
  --conf-file,-f <file>     specify a config file (required)
  --name,-n <name>          the name of this agent (required)
  --help,-h                 display help text

avro-client options:
  --rpcProps,-P <file>      RPC client properties file with server connection params
  --host,-H <host>          hostname to which events will be sent
  --port,-p <port>          port of the avro source
  --dirname <dir>           directory to stream to avro source
  --filename,-F <file>      text file to stream to avro source (default: std input)
  --headerFile,-R <file>    File containing event headers as key/value pairs on each new line
  --help,-h                 display help text

Either --rpcProps or both --host and --port must be specified.

Note that if <conf> directory is specified, then it is always included first in the classpath.
As you can see, there are two ways with which you can invoke the command (other than the simple help and version commands). We will be using the agent command. The use of avro-client will be covered later.

The agent command has two required parameters: a configuration file to use and the agent name (in case your configuration contains multiple agents).

Let's take our sample configuration and open an editor (vi in my case, but use whatever you like):

$ vi conf/hw.conf

Next, place the contents of the preceding configuration into the editor, save, and exit back to the shell.

Now you can start the agent:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console

The -Dflume.root.logger property overrides the root logger in conf/log4j.properties to use the console appender.

If we didn't override the root logger, everything would still work, but the output would go to the log/flume.log file instead of being based on the contents of the default configuration file. Of course, you can edit the conf/log4j.properties file and change the flume.root.logger property (or anything else you like). To change just the path or filename, you can set the flume.log.dir and flume.log.file properties in the configuration file or pass additional flags on the command line as follows:

$ ./bin/flume-ng agent -n agent -c conf -f conf/hw.conf -Dflume.root.logger=INFO,console -Dflume.log.dir=/tmp -Dflume.log.file=flume-agent.log

You might ask why you need to specify the -c parameter, as the -f parameter contains the complete relative path to the configuration. The reason for this is that the Log4j configuration file should be included on the classpath.
If you left the -c parameter off the command, you'll see this error:

Warning: No configuration directory set! Use --conf <dir> to override.
log4j:WARN No appenders could be found for logger (org.apache.flume.lifecycle.LifecycleSupervisor).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
But you didn't do that, so you should see these key log lines:

2014-10-05 15:39:06,109 (conf-file-poller-0) [INFO - org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:140)] Post-validation flume configuration contains configuration for agents: [agent]

This line tells you that your agent starts with the name agent. Usually you'd look for this line only to be sure you started the right configuration when you have multiple configurations defined in your configuration file.

2014-10-05 15:39:06,076 (conf-file-poller-0) [INFO - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:133)] Reloading configuration file: conf/hw.conf

This is another sanity check to make sure you are loading the correct file, in this case our hw.conf file.

2014-10-05 15:39:06,221 (conf-file-poller-0) [INFO - org.apache.flume.node.Application.startAllComponents(Application.java:138)] Starting new configuration: { sourceRunners:{s1=EventDrivenSourceRunner: { source:org.apache.flume.source.NetcatSource{name:s1,state:IDLE} }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@442fbe47 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }

Once all the configurations have been parsed, you will see this message, which shows you everything that was configured. You can see s1, c1, and k1, and which Java classes are actually doing the work. As you probably guessed, netcat is a convenience for org.apache.flume.source.NetcatSource. We could have used the class name if we wanted. In fact, if I had my own custom source written, I would use its class name for the source's type parameter. You cannot define your own short names without patching the Flume distribution.
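For instance, a custom source is referenced by its fully qualified class name; the class in this sketch is hypothetical:

# com.example.flume.MyCustomSource is a made-up class name for illustration
agent.sources.s1.type = com.example.flume.MyCustomSource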
2014-10-05 15:39:06,427 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:164)] Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/0.0.0.0:12345]

Here, we see that our source is now listening on port 12345 for the input. So, let's send some data to it.
Finally, open a second terminal. We'll use the nc command (you can use Telnet or anything else similar) to send the Hello World string and press the Return (Enter) key to mark the end of the event:

% nc localhost 12345
Hello World
OK

The OK message came from the agent after we pressed the Return key, signifying that it accepted the line of text as a single Flume event. If you look at the agent log, you will see the following:

2014-10-05 15:44:11,215 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64    Hello World }

This log message shows you that the Flume event contains no headers (NetcatSource doesn't add any itself). The body is shown in hexadecimal along with a string representation (for us humans to read, in this case, our Hello World message).
If I send the following line and then press the Enter key, you'll get an OK message:

The quick brown fox jumped over the lazy dog.

You'll see this in the agent's log:

2014-10-05 15:44:57,232 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:70)] Event: { headers:{} body: 54 68 65 20 71 75 69 63 6B 20 62 72 6F 77 6E 20    The quick brown }

The event appears to have been truncated. The logger sink, by design, limits the body content to 16 bytes to keep your screen from being filled with more than what you'd need in a debugging context. If you need to see the full contents for debugging, you should use a different sink, perhaps the file_roll sink, which would write to the local filesystem.
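A minimal file_roll sketch, assuming the target directory already exists (the path and roll interval here are illustrative):

agent.sinks.k1.type = file_roll
agent.sinks.k1.channel = c1
# Write events to local files under this directory
agent.sinks.k1.sink.directory = /tmp/flume-debug
# Roll to a new output file every 60 seconds (0 disables time-based rolling)
agent.sinks.k1.sink.rollInterval = 60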
Summary

In this chapter, we covered how to download the Flume binary distribution. We created a simple configuration file that included one source writing to one channel, feeding one sink. The source listened on a socket for network clients to connect to and to send it event data. These events were written to an in-memory channel and then fed to a Log4j sink to become the output. We then connected to our listening agent using the Linux netcat utility and sent some string events to our Flume agent's source. Finally, we verified that our Log4j-based sink wrote the events out.

In the next chapter, we'll take a detailed look at the two major channel types you'll most likely use in your data processing workflows: the memory channel and the file channel.

We will also take a look at a new experimental channel, introduced in Version 1.5 of Flume, called the Spillable Memory Channel, which attempts to be a hybrid of the other two.

For each type, we'll discuss all the configuration knobs available to you, when and why you might want to deviate from the defaults, and most importantly, why to use one over the other.
Chapter 3. Channels

In Flume, a channel is the construct used between sources and sinks. It provides a buffer for your in-flight events after they are read from sources until they can be written to sinks in your data processing pipelines.

The primary types we'll cover here are a memory-backed/nondurable channel and a local-filesystem-backed/durable channel. Starting with Flume 1.5, an experimental hybrid memory and file channel called the Spillable Memory Channel is introduced. The durable file channel flushes all changes to disk before acknowledging the receipt of the event to the sender. This is considerably slower than using the nondurable memory channel, but it provides recoverability in the event of system or Flume agent restarts. Conversely, the memory channel is much faster, but failure results in data loss, and it has much lower storage capacity when compared to the multiterabyte disks backing the file channel. This is why the Spillable Memory Channel was created. In theory, you get the benefits of memory speed until the memory fills up due to flow backpressure. At this point, the disk will be used to store the events—and with that comes much larger capacity. There are trade-offs here as well, as performance is now variable depending on how the entire flow is performing. Ultimately, the channel you choose depends on your specific use cases, failure scenarios, and risk tolerance.

That said, regardless of what channel you choose, if your rate of ingest from the sources into the channel is greater than the rate at which the sink can write data, you will exceed the capacity of the channel, and a ChannelException will be thrown. What your source does or doesn't do with that ChannelException is source-specific, but in some cases, data loss is possible, so you'll want to avoid filling channels by sizing things properly. In fact, you always want your sink to be able to write faster than your source input. Otherwise, you might get into a situation where once your sink falls behind, you can never catch up. If your data volume tracks with the site usage, you can have higher volumes during the day and lower volumes at night, giving your channels time to drain. In practice, you'll want to try and keep the channel depth (the number of events currently in the channel) as low as possible, because time spent in the channel translates to a time delay before reaching the final destination.
The memory channel

A memory channel, as expected, is a channel where in-flight events are stored in memory. As memory is (usually) orders of magnitude faster than the disk, events can be ingested much more quickly, resulting in reduced hardware needs. The downside of using this channel is that an agent failure (hardware problem, power outage, JVM crash, Flume restart, and so on) results in the loss of data. Depending on your use case, this might be perfectly fine. System metrics usually fall into this category, as a few lost data points isn't the end of the world. However, if your events represent purchases on your website, then a memory channel would be a poor choice.

To use the memory channel, set the type parameter on your named channel to memory:

agent.channels.c1.type = memory

This defines a memory channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                          | Required | Type          | Default
type                         | Yes      | String        | memory
capacity                     | No       | int           | 100
transactionCapacity          | No       | int           | 100
byteCapacityBufferPercentage | No       | int (percent) | 20%
byteCapacity                 | No       | long (bytes)  | 80% of JVM heap
keep-alive                   | No       | int           | 3 (seconds)
The default capacity of this channel is 100 events. This can be adjusted by setting the capacity property as follows:

agent.channels.c1.capacity = 200

Remember that if you increase this value, you will most likely have to increase your Java heap space using the -Xmx, and optionally -Xms, parameters.

Another capacity-related setting you can set is transactionCapacity. This is the maximum number of events that can be written, also called a put, by a source's ChannelProcessor, the component responsible for moving data from the source to the channel, in a single transaction. This is also the number of events that can be read, also called a take, in a single transaction by the SinkProcessor, which is the component responsible for moving data from the channel to the sink. You might want to set this higher in order to decrease the overhead of the transaction wrapper, which might speed things up. The downside to increasing this is that a source would have to roll back more data in the event of a failure.

Tip
Flume only provides transactional guarantees for each channel in each individual agent. In a multiagent, multichannel configuration, duplicates and out-of-order delivery are likely but should not be considered the norm. If you are getting duplicates in nonfailure conditions, it means that you need to continue tuning your Flume configurations.

If you are using a sink that writes some place that benefits from larger batches of work (such as HDFS), you might want to set this higher. Like many things, the only way to be sure is to run performance tests with different values. This blog post from Flume committer Mike Percy should give you some good starting points: http://bit.ly/flumePerfPt1.

The byteCapacityBufferPercentage and byteCapacity parameters were introduced in https://issues.apache.org/jira/browse/FLUME-1535 as a means to size the memory channel capacity using the number of bytes used rather than the number of events, as well as trying to avoid OutOfMemoryErrors. If your events have a large variance in size, you might be tempted to use these settings to adjust the capacity, but be warned that calculations are estimated from the event's body only. If you have any headers, which you will, your actual memory usage will be higher than the configured values.

Finally, the keep-alive parameter is the time the thread writing data into the channel will wait when the channel is full, before giving up. As data is being drained from the channel at the same time, if space opens up before the timeout expires, the data will be written to the channel rather than throwing an exception back to the source. You might be tempted to set this value very high, but remember that waiting for a write to a channel will block the data flowing into your source, which might cause data to back up in an upstream agent. Eventually, this might result in events being dropped. You need to size for periodic spikes in traffic as well as temporary planned (and unplanned) maintenance.
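Pulling these knobs together, here is a hedged sketch of a more production-oriented memory channel (the values are illustrative starting points, not recommendations; test against your own workload):

agent.channels.c1.type = memory
# Hold up to 100,000 events in memory (requires a larger -Xmx heap setting)
agent.channels.c1.capacity = 100000
# Allow puts/takes of up to 1,000 events per transaction for batching sinks
agent.channels.c1.transactionCapacity = 1000
# Wait up to 5 seconds for space to open up before throwing back to the source
agent.channels.c1.keep-alive = 5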
The file channel

A file channel is a channel that stores events to the local filesystem of the agent. Though it's slower than the memory channel, it provides a durable storage path that can survive most issues and should be used in use cases where a gap in your data flow is undesirable.

This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. The WAL is used to track all input and output from the channel in an atomically safe way. This way, if the agent is restarted, the WAL can be replayed to make sure all the events that came into the channel (puts) have been written out (takes) before the stored data can be purged from the local filesystem.

Additionally, the file channel supports the encryption of data written to the filesystem if your data handling policy requires that all data on the disk (even temporarily) be encrypted. I won't cover this here, but should you need it, there is an example in the Flume User Guide (http://flume.apache.org/FlumeUserGuide.html). Keep in mind that using encryption will reduce the throughput of your file channel.

To use the file channel, set the type parameter on your named channel to file:

agent.channels.c1.type = file

This defines a file channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                  | Required | Type                          | Default
type                 | Yes      | String                        | file
checkpointDir        | No       | String                        | ~/.flume/file-channel/checkpoint
useDualCheckpoints   | No       | boolean                       | false
backupCheckpointDir  | No       | String                        | No default, but must be different from checkpointDir
dataDirs             | No       | String (comma-separated list) | ~/.flume/file-channel/data
capacity             | No       | int                           | 1000000
keep-alive           | No       | int                           | 3 (seconds)
transactionCapacity  | No       | int                           | 10000
checkpointInterval   | No       | long                          | 30000 (milliseconds)
maxFileSize          | No       | long                          | 2146435071 (bytes)
minimumRequiredSpace | No       | long                          | 524288000 (bytes)
To specify the location where the Flume agent should hold data, set the checkpointDir and dataDirs properties:

agent.channels.c1.checkpointDir = /flume/c1/checkpoint
agent.channels.c1.dataDirs = /flume/c1/data

Technically, these properties are not required and have sensible default values for development. However, if you have more than one file channel configured in your agent, only the first channel will start. For production deployments and development work with multiple file channels, you should use distinct directory paths for each file channel storage area and consider placing different channels on different disks to avoid IO contention. Additionally, if you are sizing a large machine, consider using some form of RAID that contains striping (RAID 10, 50, or 60) to achieve higher disk performance rather than buying more expensive 10K or 15K drives or SSDs. If you don't have RAID striping but have multiple disks, set dataDirs to a comma-separated list of each storage location, as shown in the sketch after this paragraph. Using multiple disks will spread the disk traffic almost as well as striped RAID, but without the computational overhead associated with RAID 50/60 as well as the 50 percent space waste associated with RAID 10. You'll want to test your system to see whether the RAID overhead is worth the speed difference. As hard drive failures are a reality, you might prefer certain RAID configurations to single disks in order to protect yourself from the data loss associated with single drive failures. RAID 6 across the maximum number of disks can provide the highest performance for the minimal amount of data protection when combined with redundant and reliable power sources (such as an uninterruptable power supply, or UPS).
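For instance, spreading one file channel across three data disks (the /disk1 through /disk3 mount points are illustrative) might look like this:

agent.channels.c1.dataDirs = /disk1/flume/c1/data,/disk2/flume/c1/data,/disk3/flume/c1/data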
Using the JDBC channel is a bad idea, as it would introduce a bottleneck and single point of failure in what should be designed as a highly distributed system. NFS storage should be avoided for the same reason.

Tip
Be sure to set the HADOOP_PREFIX and JAVA_HOME environment variables when using the file channel. While we seemingly haven't used anything Hadoop-specific (such as writing to HDFS), the file channel uses Hadoop Writables as an on-disk serialization format. If Flume can't find the Hadoop libraries, you might see this in your startup, so check your environment variables:

java.lang.NoClassDefFoundError: org/apache/hadoop/io/Writable

Starting with Flume 1.4, the file channel supports a secondary checkpoint directory. In situations where a failure occurs while writing the checkpoint data, that information could become unusable, and a full replay of the logs would be necessary in order to recover the state of the channel. As only one checkpoint is updated at a time before flipping to the other, one should always be in a consistent state, thus shortening restart times. Without valid checkpoint information, the Flume agent can't know what has been sent and what has not been sent in dataDirs. As files in the data directories might contain large amounts of data already sent but not yet deleted, a lack of checkpoint information would result in a large number of records being resent as duplicates.

Incoming data from sources is written and acknowledged at the end of the most current file in the data directory. The files in the checkpoint directory keep track of this data when it's taken by a sink.
If you want to use this feature, set the useDualCheckpoints property to true and specify a location for that second checkpoint directory with the backupCheckpointDir property. For performance reasons, it is always preferred that this be on a different disk from the other directories used by the file channel:

agent.channels.c1.useDualCheckpoints = true
agent.channels.c1.backupCheckpointDir = /flume/c1/checkpoint2
The default file channel capacity is one million events, regardless of the size of the event contents. If the channel capacity is reached, a source will no longer be able to ingest the data. This default should be fine for low volume cases. You'll want to size this higher if your ingestion is so heavy that you can't tolerate normal planned or unplanned outages. For instance, there are many configuration changes you can make in Hadoop that require a cluster restart. If you have Flume writing important data into Hadoop, the file channel should be sized to tolerate the time it takes to restart Hadoop (and maybe add a comfort buffer for the unexpected). If your cluster or other systems are unreliable, you can set this higher still to handle even larger amounts of downtime. At some point, you'll run into the fact that your disk space is a finite resource, so you will have to pick some upper limit (or buy bigger disks).

The keep-alive parameter is similar to memory channels. It is the maximum time the source will wait when trying to write into a full channel before giving up. If space becomes available before the timeout, the write is successful; otherwise, ChannelException is thrown back to the source.

The transactionCapacity property is the maximum number of events allowed in a single transaction. This might become important for certain sources that batch together events and pass them to the channel in a single call. Most likely, you won't need to change this from the default. Setting this higher allocates additional resources internally, so you shouldn't increase it unless you run into performance issues.

The checkpointInterval property is the number of milliseconds between performing a checkpoint (which also rolls the log files written to dataDirs). If you do not set this, 30 seconds will be used.

Log files also roll based on the volume of data written to them using the maxFileSize property. You can lower this value for low traffic channels if you want to try and save some disk space. Let's say your maximum file size is 50,000 bytes but your channel only writes 500 bytes a day; it would take 100 days to fill a single log. Let's say that you were on day 100 and 2,000 bytes came in all at once. Some data would be written to the old file and a new file would be started with the overflow. After the roll, Flume tries to remove any log files that aren't needed anymore. As the full log has unprocessed records, it cannot be removed yet. The next chance to clean up that old log file might not come for another 100 days. It probably doesn't matter if that old 50,000 byte file sticks around longer, but as the default is around 2 GB, you could have twice that (4 GB) of disk space used per channel. Depending on how much disk you have available and the number of channels configured in your agent, this might or might not be a problem. If your machines have plenty of storage space, the default should be fine.

Finally, the minimumRequiredSpace property is the amount of space you do not want to use for writing logs. The default configuration will throw an exception if you attempt to use the last 500 MB of the disk associated with the dataDir path. This limit applies across all channels, so if you have three file channels configured, the upper limit is still 500 MB and not 1.5 GB. You can set this value as low as 1 MB, but generally speaking, bad things tend to happen when you push disk utilization towards 100 percent.
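Putting these pieces together, here is a hedged file channel sketch sized to ride out a cluster restart (the paths and values are illustrative only):

agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /flume/c1/checkpoint
agent.channels.c1.dataDirs = /flume/c1/data
# Buffer up to two million events to tolerate a longer downstream outage
agent.channels.c1.capacity = 2000000
# Checkpoint every 60 seconds instead of the default 30
agent.channels.c1.checkpointInterval = 60000
# Refuse writes when less than 1 GB of disk remains
agent.channels.c1.minimumRequiredSpace = 1073741824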
Spillable Memory Channel

Introduced in Flume 1.5, the Spillable Memory Channel is a channel that acts like a memory channel until it is full. At that point, it acts like a file channel that is configured with a much larger capacity than its memory counterpart, but runs at the speed of your disks (which means orders of magnitude slower).

Note
The Spillable Memory Channel is still considered experimental. Use it at your own risk!

I have mixed feelings about this new channel type. On the surface, it seems like a good idea, but in practice, I can see problems. Specifically, having a variable channel speed that changes depending on how downstream entities in your data pipe behave makes for difficult capacity planning. As a memory channel is used under good conditions, this implies that the data contained in it can be lost. So why would I go through extra trouble to save some of it to the disk? The data is either very important for me to spool it to disk with a file-backed channel, or it's less important and can be lost, so I can get away with less hardware and use a faster memory-backed channel. If I really need memory speed but with the capacity of a hard drive, Solid State Drive (SSD) prices have come down enough in recent years for a file channel on SSD to now be a viable option for you rather than using this hybrid channel type. I do not use this channel myself for these reasons.
To use this channel configuration, set the type parameter on your named channel to spillablememory:

agent.channels.c1.type = spillablememory

This defines a Spillable Memory Channel named c1 for the agent named agent.
Here is a table of configuration parameters you can adjust from the default values:

Key                           | Required | Type                          | Default
type                          | Yes      | String                        | spillablememory
memoryCapacity                | No       | int                           | 10000
overflowCapacity              | No       | int                           | 100000000
overflowTimeout               | No       | int                           | 3 (seconds)
overflowDeactivationThreshold | No       | int                           | 5 (percent)
byteCapacityBufferPercentage  | No       | int                           | 20 (percent)
byteCapacity                  | No       | long (bytes)                  | 80% of JVM heap
checkpointDir                 | No       | String                        | ~/.flume/file-channel/checkpoint
dataDirs                      | No       | String (comma-separated list) | ~/.flume/file-channel/data
useDualCheckpoints            | No       | boolean                       | false
backupCheckpointDir           | No       | String                        | No default, but must be different from checkpointDir
transactionCapacity           | No       | int                           | 10000
checkpointInterval            | No       | long                          | 30000 (milliseconds)
maxFileSize                   | No       | long                          | 2146435071 (bytes)
minimumRequiredSpace          | No       | long                          | 524288000 (bytes)
As you can see, many of the fields match the memory channel's and file channel's properties, so there should be no surprises here. Let's start with the memory side.
The memoryCapacity property determines the maximum number of events held in memory (this was just called capacity for the memory channel, but was renamed here to avoid ambiguity). Also, the default value, if unspecified, is 10,000 records instead of 100. If you wanted to double the default capacity, the configuration might look something like this:
agent.channels.c1.memoryCapacity=20000
As mentioned previously, you will most likely need to increase the Java heap space allocated using the -Xmx and -Xms parameters. Like most Java programs, more memory usually helps the garbage collector run more efficiently, especially as an application such as Flume generates a lot of short-lived objects. Be sure to do your research and pick an appropriate JVM garbage collector based on your available hardware.
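For instance, heap settings can be passed to the agent's JVM through the JAVA_OPTS variable in conf/flume-env.sh; the sizes here are only a sketch and should be tuned against your actual channel capacities:
export JAVA_OPTS="-Xms512m -Xmx1g"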
Tip
You can set memoryCapacity to zero, which effectively turns it into a file channel. Don't do this. Just use the file channel and be done with it.
The overflowCapacity property determines the maximum number of events that can be written to the disk before an error is thrown back to the source feeding it. The default value is a generous 100,000,000 events (far larger than the file channel's default of 100,000). Most servers nowadays have multiterabyte disks, so space should not be a problem, but do the math against your average event size to be sure you don't fill your disks by accident. If your channels are filling this much, you are probably dealing with another issue downstream, and the last thing you need is your data buffer layer filling up completely. For example, if you had a large 1 megabyte event payload, 100 million of these add up to 100 terabytes, which is probably bigger than the disk space of an average server. A 1 kilobyte payload would only take 100 gigabytes, which is probably fine. Just do the math ahead of time so you are not surprised.
Tip
You can set overflowCapacity to zero, which effectively turns it into a memory channel. Don't do this. Just use the memory channel and be done with it.
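If you do use this channel, a configuration holding roughly 20,000 events in memory with a one-million-event disk overflow (arbitrary numbers, purely for illustration) might look like this:
agent.channels.c1.type=spillablememory
agent.channels.c1.memoryCapacity=20000
agent.channels.c1.overflowCapacity=1000000
agent.channels.c1.checkpointDir=/flume/spill/checkpoint
agent.channels.c1.dataDirs=/flume/spill/data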
The transactionCapacity property adjusts the batch size of events written to a channel in a single transaction. If this is set too low, it will lower the throughput of the agent on high-volume flows because of the transaction overhead. For a high-volume channel, you will probably need to set this higher, but the only way to be sure is to test your particular workflow. See Chapter 8, Monitoring Flume, to learn how to accomplish this.
Finally, the byteCapacityBufferPercentage and byteCapacity parameters are identical in functionality and defaults to those of the memory channel, so I won't waste your time repeating them here.
What is important is the overflowTimeout property. This is the number of seconds to wait, once the memory part of the channel fills, before data starts getting written to the disk-backed portion of the channel. If you want writes to start occurring immediately, you can set this to zero. You might wonder why you need to wait before starting to write to the disk portion of the channel. This is where the undocumented overflowDeactivationThreshold property comes into play. This is the amount of time that space has to be available in the memory path before the channel can switch back from disk writing. I believe this is an attempt to prevent flapping back and forth between the two. Of course, there really are no ordering guarantees in Flume, so I don't know why you would choose to append to the disk buffer if a spot is available in faster memory. Perhaps they are trying to avoid some kind of starvation condition, although the code appears to attempt to remove events in order of arrival, even when using both the memory and disk queues. Perhaps it will be explained to us should it ever come out of experimental status.
Note
The overflowDeactivationThreshold property is stated to be for internal use only, so adjust it at your own peril. If you are considering it, be sure to get familiar with the source code so that you understand the implications of altering the default.
The rest of the properties on this channel are identical in name and functionality to their file channel counterparts, so please refer to the previous section.
Summary
In this chapter, we covered the two channel types you are most likely to use in your data processing pipelines.
The memory channel offers speed at the cost of data loss in the event of failure. Alternatively, the file channel provides a more reliable transport, in that it can tolerate agent failures and restarts, at a performance cost.
You will need to decide which channel is appropriate for your use cases. When trying to decide whether a memory channel is appropriate, ask yourself what the monetary cost is if you lose some data. Weigh that against the additional costs of more hardware to cover the difference in performance when deciding if you need a durable channel after all. Another consideration is whether or not the data can be resent. Not all data you might ingest into Hadoop will come from streaming application logs. If you receive "daily downloads" of data, you can get away with using a memory channel, because if you encounter a problem, you can always rerun the import.
Finally, we covered the experimental Spillable Memory Channel. Personally, I think its creation is a bad idea, but like most things in computer science, everybody has an opinion on what is good or bad. I feel that the added complexity and nondeterministic performance make for difficult capacity planning, as you should always size things for the worst-case scenario. If your data is critical enough for you to overflow to the disk rather than discard the events, then you aren't going to be okay with losing even the small amount held in memory.
In the next chapter, we'll look at sinks: specifically, the HDFS sink to write events to HDFS, the Elasticsearch sink to write events to Elasticsearch, and the MorphlineSolrSink to write events to Solr. We will also cover Event Serializers, which specify how Flume events are translated into output that's more suitable for the sink. Finally, we will cover sink processors and how to set up load balancing and failure paths in a tiered configuration for more robust data transport.
Chapter 4. Sinks and Sink Processors
By now, you should have a pretty good idea where the sink fits into the Flume architecture. In this chapter, we will first learn about the most-used sink with Hadoop, the HDFS sink. We will then cover two of the newer sinks that support common Near Real Time (NRT) log processing: the ElasticSearchSink and the MorphlineSolrSink. As you'd expect, the first writes data into Elasticsearch and the latter into Solr. The general architecture of Flume supports many other sinks we won't have space to cover in this book. Some come bundled with Flume and can write to HBase, IRC, and, as we saw in Chapter 2, A Quick Start Guide to Flume, log4j and file sinks. Other sinks are available on the Internet and can be used to write data to MongoDB, Cassandra, RabbitMQ, Redis, and just about any other data store you can think of. If you can't find a sink that suits your needs, you can write one easily by extending the org.apache.flume.sink.AbstractSink class.
HDFS sink
The job of the HDFS sink is to continuously open a file in HDFS, stream data into it, and at some point, close that file and start a new one. As we discussed in Chapter 1, Overview and Architecture, the time between file rotations must be balanced with how quickly files are closed in HDFS, thus making the data visible for processing. As we've discussed, having lots of tiny files for input will make your MapReduce jobs inefficient.
To use the HDFS sink, set the type parameter on your named sink to hdfs:
agent.sinks.k1.type=hdfs
This defines an HDFS sink named k1 for the agent named agent. There are some additional parameters you must specify, starting with the path in HDFS you want to write the data to:
agent.sinks.k1.hdfs.path=/path/in/hdfs
This HDFS path, like most file paths in Hadoop, can be specified in three different ways: absolute, absolute with server name, and relative. These are all equivalent (assuming your Flume agent is run as the flume user):

Absolute              /Users/flume/mydata
Absolute with server  hdfs://namenode/Users/flume/mydata
Relative              mydata
I prefer to configure any server I'm installing Flume on with a working hadoop command line by setting the fs.default.name property in Hadoop's core-site.xml file. I don't keep persistent data in HDFS user directories, but prefer to use absolute paths with some meaningful path name (for example, /logs/apache/access). The only time I would specify a NameNode explicitly is if the target was a different Hadoop cluster entirely. This allows you to move configurations you've already tested in one environment into another without unintended consequences, such as your production server writing data to your staging Hadoop cluster because somebody forgot to edit the target in the configuration. I consider externalizing environment specifics a good best practice to avoid situations such as these.
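For reference, pointing the hadoop command line at your cluster by way of core-site.xml looks something like the following; the NameNode hostname and port here are placeholders for your own environment:
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>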
One final required parameter for the HDFS sink, actually for any sink, is the channel that it will be doing take operations from. For this, set the channel parameter with the name of the channel to read from:
agent.sinks.k1.channel=c1
This tells the k1 sink to read events from the c1 channel.
Here is a mostly complete table of configuration parameters you can adjust from the default values:
Key                     Required  Type                              Default
type                    Yes       String                            hdfs
channel                 Yes       String
hdfs.path               Yes       String
hdfs.filePrefix         No        String                            FlumeData
hdfs.fileSuffix         No        String
hdfs.minBlockReplicas   No        int                               See the dfs.replication property in your inherited Hadoop configuration; usually 3
hdfs.maxOpenFiles       No        long                              5000
hdfs.closeTries         No        int                               0 (0 = try forever, otherwise a count)
hdfs.retryInterval      No        int                               180 seconds (0 = don't retry)
hdfs.round              No        boolean                           false
hdfs.roundValue         No        int                               1
hdfs.roundUnit          No        String (second, minute, or hour)  second
hdfs.timeZone           No        String                            Local time
hdfs.useLocalTimeStamp  No        boolean                           false
hdfs.inUsePrefix        No        String                            Blank
hdfs.inUseSuffix        No        String                            .tmp
hdfs.rollInterval       No        long (seconds)                    30 seconds (0 = disable)
hdfs.rollSize           No        long (bytes)                      1024 bytes (0 = disable)
hdfs.rollCount          No        long                              10 (0 = disable)
hdfs.batchSize          No        long                              100
hdfs.codeC              No        String
Remember to always check the Flume User Guide for the version you are using at http://flume.apache.org/, as things might change between the release of this book and the version of Flume you are actually using.
Path and filename
Each time Flume starts a new file at hdfs.path in HDFS to write data into, the filename is composed of the hdfs.filePrefix, a period character, the epoch timestamp at which the file was started, and optionally, a file suffix specified by the hdfs.fileSuffix property (if set), for example:
agent.sinks.k1.hdfs.path=/logs/apache/access
The preceding configuration would result in a file such as /logs/apache/access/FlumeData.1362945258.
However, with the following configuration, your filenames would be more like /logs/apache/access/access.1362945258.log:
agent.sinks.k1.hdfs.path=/logs/apache/access
agent.sinks.k1.hdfs.filePrefix=access
agent.sinks.k1.hdfs.fileSuffix=.log
Over time, the hdfs.path directory will get very full, so you will want to add some kind of time element to the path to partition the files into subdirectories. Flume supports various time-based escape sequences, such as %Y to specify a four-digit year. I like to use sequences in the year/month/day/hour form (so that they sort from oldest to newest), so I often use this for a path:
agent.sinks.k1.hdfs.path=/logs/apache/access/%Y/%m/%d/%H
This says that I want a path like /logs/apache/access/2013/03/10/18/.
Note
For a complete list of time-based escape sequences, see the Flume User Guide.
Another handy escape sequence mechanism is the ability to use Flume header values in your path. For instance, if there was a header with a key of logType, I could split Apache access and error logs into different directories while using the same channel, by escaping the header's key as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%{logType}/%Y/%m/%d/%H
The preceding line would result in access logs going to /logs/apache/access/2013/03/10/18/ and error logs going to /logs/apache/error/2013/03/10/18/. However, if I preferred both log types in the same directory path, I could have used logType in my hdfs.filePrefix instead, as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
agent.sinks.k1.hdfs.filePrefix=%{logType}
Obviously, it is possible for Flume to write to multiple files at once. The hdfs.maxOpenFiles property sets the upper limit for how many can be open at once, with a default of 5000. If you exceed this limit, the oldest file that's still open is closed. Remember that every open file incurs overhead both at the OS level and in HDFS (NameNode and DataNode connections).
Another set of properties you might find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining those elements in file paths. Let's say you had a path specification as follows:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H%M
However, if you wanted only four subdirectories per day (at 00, 15, 30, and 45 past the hour, each containing 15 minutes of data), you could accomplish this by setting the following:
agent.sinks.k1.hdfs.round=true
agent.sinks.k1.hdfs.roundValue=15
agent.sinks.k1.hdfs.roundUnit=minute
This would result in logs between 01:15:00 and 01:29:59 on March 10, 2013 being written to files contained in /logs/apache/2013/03/10/0115/. Logs from 01:30:00 to 01:44:59 would be written to files contained in /logs/apache/2013/03/10/0130/.
The hdfs.timeZone property is used to specify the time zone in which you want time interpreted for your escape sequences. The default is your computer's local time. If your local time is affected by daylight saving time adjustments, you will have twice as much data when %H == 02 (in the fall) and no data when %H == 02 (in the spring). I think it is a bad idea to introduce time zones into things that are meant for computers to read. I believe time zones are a concern for humans alone, and computers should converse only in universal time. For this reason, I set this property on my Flume agents to make the time zone issue just go away:
-Duser.timezone=UTC
If you don't agree, you are free to use the default (local time), or to set hdfs.timeZone to whatever you like. The value you pass is used in a call to java.util.TimeZone.getTimeZone(...), so check the Javadocs for acceptable values.
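One convenient place to set this system property is the JAVA_OPTS variable in conf/flume-env.sh, so it is applied every time the agent starts:
export JAVA_OPTS="-Duser.timezone=UTC"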
The other time-related property is the hdfs.useLocalTimeStamp boolean property. By default, its value is false, which tells the sink to use the event's timestamp header when calculating date-based escape sequences in file paths, as shown previously. If you set the property to true, the current system time will be used instead, effectively telling Flume to use the transport arrival time rather than the original event time. You would not set this in cases where HDFS was the final target for the streamed events. This way, delayed events will still be placed correctly (where users would normally look for them) regardless of their arrival time. However, there may be a use case where events are temporarily written to Hadoop and processed in batches on some interval (perhaps daily). In this case, the transport time would be preferred, so your postprocessing job doesn't need to scan older folders for delayed data.
Remember that files in HDFS are broken into file blocks that are replicated across the DataNodes. The default number of replicas is usually three (as set in the Hadoop base configuration). You can override this value up or down for this sink with the hdfs.minBlockReplicas property. For example, if I have a data stream that I feel only needs two replicas instead of three, I can override this as follows:
agent.sinks.k1.hdfs.minBlockReplicas=2
Note
Don't set the minimum replica count higher than the number of DataNodes you have; otherwise, you'll create a degraded-state HDFS. You also don't want to set it so high that a single box down for maintenance would trigger this situation. Personally, I've never set this higher than the default of three, but I have set it lower on less important data in order to save space.
Finally, while files are being written to HDFS, a .tmp extension is added. When the file is closed, the extension is removed. You can change the extension used by setting the hdfs.inUseSuffix property, but I've never had a reason to do so:
agent.sinks.k1.hdfs.inUseSuffix=flumeiswriting
This allows you to see which files are being written to simply by looking at a directory listing in HDFS. As you typically specify a directory for input in your MapReduce job (or because you are using Hive), the temporary files will often be picked up as empty or garbled input by mistake. To avoid having your temporary files picked up before being closed, set the prefix to either a dot or an underscore character, as follows:
agent.sinks.k1.hdfs.inUsePrefix=_
That said, there are occasions where files were not closed properly due to some HDFS glitch, so you might see files with the in-use prefix or suffix that haven't been touched in some time. A few new properties were added in Version 1.5 to change the default behavior of closing files. The first is the hdfs.closeTries property. The default of zero actually means "try forever", so it is a little confusing. Setting it to 4 means try four times before giving up. You can adjust the interval between retries by setting the hdfs.retryInterval property. Setting it too low could swamp your NameNode with too many requests, so be careful if you lower this from the default of 3 minutes. Of course, if you are opening files too quickly, you might need to lower it just to keep from going over the hdfs.maxOpenFiles setting, which was covered previously. If you actually didn't want any retries, you can set hdfs.retryInterval to zero seconds (again, not to be confused with closeTries=0, which means try forever). Hopefully, a future version will use the more common convention of a negative number (usually -1) when infinite is desired.
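For example, to give up after four close attempts spaced 5 minutes apart (values chosen purely for illustration), you might set:
agent.sinks.k1.hdfs.closeTries=4
agent.sinks.k1.hdfs.retryInterval=300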
File rotation
By default, Flume will rotate actively written-to files every 30 seconds, 10 events, or 1024 bytes. This is done by setting the hdfs.rollInterval, hdfs.rollCount, and hdfs.rollSize properties, respectively. One or more of these can be set to zero to disable that particular rolling mechanism. For instance, if you only wanted a time-based roll of 1 minute, you would set the following:
agent.sinks.k1.hdfs.rollInterval=60
agent.sinks.k1.hdfs.rollCount=0
agent.sinks.k1.hdfs.rollSize=0
If your output contains any amount of header information, the size per file in HDFS can be larger than what you expect, because the hdfs.rollSize rotation scheme only counts the event body length. Clearly, you might not want to disable all three mechanisms for rotation at the same time, or you will have one directory in HDFS overflowing with files.
Finally, a related parameter is hdfs.batchSize. This is the number of events that the sink will read per transaction from the channel. If you have a large volume of data in your channel, you might see a performance increase by setting this higher than the default of 100, which decreases the transaction overhead per event.
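For instance, a high-volume flow might raise the batch to 1,000 events per transaction; the number is illustrative, and only testing against your own flow will reveal the right value:
agent.sinks.k1.hdfs.batchSize=1000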
Now that we've discussed the way files are managed and rolled in HDFS, let's look at how the event contents get written.
Compression codecs
Codecs (coder/decoders) are used to compress and decompress data using various compression algorithms. Flume supports gzip, bzip2, lzo, and snappy, although you might have to install lzo yourself, especially if you are using a distribution such as CDH, due to licensing issues.
If you want compression for your data, set the hdfs.codeC property so that the HDFS sink writes compressed files. The property value is also used as the file suffix for the files written to HDFS. For example, if you specify the following, all files that are written will have a .gzip extension, so you don't need to specify the hdfs.fileSuffix property in this case:
agent.sinks.k1.hdfs.codeC=gzip
The codec you choose to use will require some research on your part. There are arguments for using gzip or bzip2 for their higher compression ratios at the cost of longer compression times, especially if your data is written once but will be read hundreds or thousands of times. On the other hand, using snappy or lzo results in faster compression performance, but a lower compression ratio. Keep in mind that the splittability of the file, especially if you are using plain text files, will greatly affect the performance of your MapReduce jobs. Go pick up a copy of Hadoop Beginner's Guide, Garry Turkington, Packt Publishing (http://amzn.to/14Dh6TA) or Hadoop: The Definitive Guide, Tom White, O'Reilly (http://amzn.to/16OsfIf) if you aren't sure what I'm talking about.
Event Serializers
An Event Serializer is the mechanism by which a Flume event is converted into another format for output. It is similar in function to the Layout class in log4j. By default, the text serializer, which outputs just the Flume event body, is used. There is another serializer, header_and_text, which outputs both the headers and the body. Finally, there is an avro_event serializer that can be used to create an Avro representation of the event. If you write your own, you'd use the implementation's fully qualified class name as the serializer property value.
Text output
As mentioned previously, the default serializer is the text serializer. This will output only the Flume event body, with the headers discarded. Each event has a newline character appended, unless you override this default behavior by setting the serializer.appendNewLine property to false.
Key                       Required  Type     Default
serializer                No        String   text
serializer.appendNewLine  No        boolean  true
Text with headers
The text_with_headers serializer allows you to save the Flume event headers rather than discard them. The output format consists of the headers, followed by a space, then the body payload, and finally, an (optionally disabled) newline character, for instance:
{key1=value1, key2=value2} body text here

Key                       Required  Type     Default
serializer                No        String   text_with_headers
serializer.appendNewLine  No        boolean  true
Apache Avro
The Apache Avro project (http://avro.apache.org/) provides a serialization format that is similar in functionality to Google Protocol Buffers, but is more Hadoop friendly, as the container is based on Hadoop's SequenceFile and has some MapReduce integration. The format is also self-describing using JSON, making for a good long-term data storage format, as your data format might evolve over time. If your data has a lot of structure and you want to avoid turning it into Strings, only to then parse them in your MapReduce job, you should read more about Avro to see whether you want to use it as a storage format in HDFS.
The avro_event serializer creates Avro data based on the Flume event schema. It has no formatting parameters, as Avro dictates the format of the data, and the structure of the Flume event dictates the schema used:
Key                           Required  Type                                  Default
serializer                    No        String                                avro_event
serializer.compressionCodec   No        String (gzip, bzip2, lzo, or snappy)
serializer.syncIntervalBytes  No        int (bytes)                           2048000 (bytes)
If you want your data compressed before being written to the Avro container, you should set the serializer.compressionCodec property to the file extension of an installed codec. The serializer.syncIntervalBytes property determines the size of the data buffer used before flushing the data to HDFS; therefore, this setting can affect your compression ratio when using a codec. Here is an example using Snappy compression on Avro data with a 4 MB buffer:
agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=snappy
agent.sinks.k1.serializer.syncIntervalBytes=4194304
agent.sinks.k1.hdfs.fileSuffix=.avro
For Avro files to work in an Avro MapReduce job, they must end in .avro or they will be ignored as input. For this reason, you need to explicitly set the hdfs.fileSuffix property. Furthermore, you would not set the hdfs.codeC property on an Avro file.
User-provided Avro schema
If you want to use a schema different from the Flume event schema used by the avro_event type, starting in Version 1.4, the closely named AvroEventSerializer will let you do this. Keep in mind that with this implementation, only the event's body is serialized; headers are not passed on.
Set the serializer type to the fully qualified org.apache.flume.sink.hdfs.AvroEventSerializer class name:
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
Unlike the other serializers that take additional parameters in the Flume configuration file, this one requires that you pass the schema information via a Flume header. This is a byproduct of one of the Avro-aware sources we'll see in Chapter 6, Interceptors, ETL, and Routing, where schema information is sent from the source to the final destination via the event header. If you are using a source that doesn't set these headers, you can fake it by using a static header interceptor. We'll talk more about interceptors in Chapter 6, Interceptors, ETL, and Routing, so flip back to this part later on.
To specify the schema directly in the Flume configuration file, use the flume.avro.schema.literal header, as shown in this example (using a map-of-strings schema):
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=flume.avro.schema.literal
agent.sources.s1.interceptors.i1.value="{\"type\":\"map\",\"values\":\"string\"}"
If you prefer to put the schema file in HDFS, use the flume.avro.schema.url header instead, as shown in this example:
agent.sinks.k1.serializer=org.apache.flume.sink.hdfs.AvroEventSerializer
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=flume.avro.schema.url
agent.sources.s1.interceptors.i1.value=hdfs://path/to/schema.avsc
Actually, in this second form, you can pass any URL, including a file:// URL, but this would indicate a file local to where you are running the Flume agent, which might create additional setup work for your administrators. This is also true of configuration served up by an HTTP web server or farm. Rather than creating additional setup dependencies, just use the dependency you cannot remove, which is HDFS, by way of an hdfs:// URL.
Be sure to set either the flume.avro.schema.literal header or the flume.avro.schema.url header, but not both.
File type
By default, the HDFS sink writes data to HDFS as Hadoop SequenceFiles. This is a common Hadoop wrapper that consists of a key and a value field separated by binary field and record delimiters. Usually, text files on a computer make assumptions such as a newline character terminating each record. So, what do you do if your data contains a newline character, such as some XML? Using a sequence file can solve this problem, because it uses nonprintable characters for delimiters. Sequence files are also splittable, which makes for better locality and parallelism when running MapReduce jobs on your data, especially on large files.
SequenceFile
When using the SequenceFile file type, you need to specify how you want the key and value written for each record in the SequenceFile. The key on each record will always be a LongWritable type and will contain the current timestamp or, if the timestamp event header is set, that value will be used instead. By default, the format of the value is an org.apache.hadoop.io.BytesWritable type, which corresponds to the byte[] Flume body:
Key               Required  Type    Default
hdfs.fileType     No        String  SequenceFile
hdfs.writeFormat  No        String  writable
However, if you want the payload interpreted as a String, you can override the hdfs.writeFormat property, so org.apache.hadoop.io.Text will be used as the value field:
Key               Required  Type    Default
hdfs.fileType     No        String  SequenceFile
hdfs.writeFormat  No        String  text
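Expressed as agent configuration, reusing the k1 sink from the earlier examples, that override looks like this:
agent.sinks.k1.hdfs.fileType=SequenceFile
agent.sinks.k1.hdfs.writeFormat=text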
DataStream
If you do not want to output a SequenceFile, because your data doesn't have a natural key, you can use a DataStream to output only the uncompressed value. Simply override the hdfs.fileType property:
agent.sinks.k1.hdfs.fileType=DataStream
This is the file type you would use with Avro serialization, as any compression should have been done in the Event Serializer. To serialize gzip-compressed Avro files, you would set these properties:
agent.sinks.k1.serializer=avro_event
agent.sinks.k1.serializer.compressionCodec=gzip
agent.sinks.k1.hdfs.fileType=DataStream
agent.sinks.k1.hdfs.fileSuffix=.avro
CompressedStream
A CompressedStream is similar to a DataStream, except that the data is compressed when it's written. You can think of this as running the gzip utility on an uncompressed file, but all in one step. This differs from a compressed Avro file, whose contents are compressed and then written into an uncompressed Avro wrapper:
agent.sinks.k1.hdfs.fileType=CompressedStream
Remember that only certain compressed formats are splittable in MapReduce, should you decide to use a CompressedStream. The compression algorithm selection doesn't have a Flume configuration; it is instead dictated by the zlib.compress.strategy and zlib.compress.level properties in core Hadoop.
Timeouts and workers
Finally, there are two miscellaneous properties related to timeouts and two related to worker pools that you can change:

Key                     Required  Type                 Default
hdfs.callTimeout        No        long (milliseconds)  10000
hdfs.idleTimeout        No        int (seconds)        0 (0 = disable)
hdfs.threadsPoolSize    No        int                  10
hdfs.rollTimerPoolSize  No        int                  1
The hdfs.callTimeout property is the amount of time the HDFS sink will wait for HDFS operations to return a success (or failure) before giving up. If your Hadoop cluster is particularly slow (for instance, a development or virtual cluster), you might need to set this value higher in order to avoid errors. Keep in mind that your channel will overflow if you cannot sustain a write throughput higher than the input rate of your channel.
The hdfs.idleTimeout property, if set to a nonzero value, is the time Flume will wait before automatically closing an idle file. I have never used this, as hdfs.rollInterval handles the closing of files for each roll period, and if the channel is idle, it will not open a new file. This setting seems to have been created as an alternative roll mechanism to the size, time, and event count mechanisms that have already been discussed. You might want as much data written to a file as possible and only close it when there really is no more data. In this case, you can use hdfs.idleTimeout to accomplish this rotation scheme if you also set hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount to zero.
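A sketch of that idle-only rotation scheme, closing files after 10 idle minutes (an arbitrary window), might look like this:
agent.sinks.k1.hdfs.idleTimeout=600
agent.sinks.k1.hdfs.rollInterval=0
agent.sinks.k1.hdfs.rollSize=0
agent.sinks.k1.hdfs.rollCount=0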
The first property you can set to adjust the number of workers is hdfs.threadsPoolSize, which defaults to 10. This is the maximum number of files that can be written to at the same time. If you are using event headers to determine file paths and names, you might have more than 10 files open at once, but be careful when increasing this value too much, so as not to overwhelm HDFS.
The last property related to worker pools is hdfs.rollTimerPoolSize. This is the number of workers that process the timeouts set by the hdfs.idleTimeout property. The amount of work involved in closing the files is pretty small, so increasing this value from the default of one worker is unlikely to be necessary. If you do not use a rotation based on hdfs.idleTimeout, you can ignore the hdfs.rollTimerPoolSize property, as it is not used.
Sink groups
In order to remove single points of failure in your data processing pipeline, Flume has the ability to send events to different sinks using either load balancing or failover. In order to do this, we need to introduce a new concept called a sink group. A sink group is used to create a logical grouping of sinks. The behavior of this grouping is dictated by something called the sink processor, which determines how events are routed.
There is a default sink processor that contains a single sink, and it is used whenever you have a sink that isn't part of any sink group. Our Hello, World! example in Chapter 2, A Quick Start Guide to Flume, used the default sink processor. No special configuration is required for single sinks.
In order for Flume to know about the sink groups, there is a new top-level agent property called sinkgroups. Similar to sources, channels, and sinks, you prefix the property with the agent name:
agent.sinkgroups=sg1
Here, we have defined a sink group called sg1 for the agent named agent.
For each named sink group, you need to specify the sinks it contains using the sinks property, consisting of a space-delimited list of sink names:
agent.sinkgroups.sg1.sinks=k1 k2
This defines that the k1 and k2 sinks are part of the sg1 sink group for the agent named agent.
Often, sink groups are used in conjunction with the tiered movement of data to route around failures. However, they can also be used to write to different Hadoop clusters, as even a well-maintained cluster has periodic maintenance.
Load balancing
Continuing the preceding example, let's say you want to load balance traffic to k1 and k2 evenly. There are some additional properties you need to specify, as listed in this table:

Key                 Type                             Default
processor.type      String                           load_balance
processor.selector  String (round_robin or random)   round_robin
processor.backoff   boolean                          false
When you set processor.type to load_balance, round-robin selection will be used, unless otherwise specified by the processor.selector property, which can be set to either round_robin or random. You can also specify your own load-balancing selector mechanism, which we won't cover here. Consult the Flume documentation if you need this custom control.
The processor.backoff property specifies whether an exponential backoff should be used when retrying a sink that threw an exception. The default is false, which means that after a thrown exception, the sink will be tried again the next time its turn comes up, based on round-robin or random selection. If set to true, the wait time after each failure is doubled, starting at 1 second, up to a limit of around 18 hours (2^16 seconds).
Note
In an earlier version of Flume, the default for processor.backoff in the code was false, but the documentation stated it as true. This error has since been fixed; however, it may save you a headache to specify the property settings you want rather than relying on the defaults.
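Following that advice, here is a sketch of a complete load-balancing group that states the selection and backoff settings explicitly rather than relying on the defaults:
agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=load_balance
agent.sinkgroups.sg1.processor.selector=random
agent.sinkgroups.sg1.processor.backoff=true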
Failover
If you would rather try one sink and, should that sink fail, try another, then you want to set processor.type to failover. Next, you'll need to set additional properties to specify the order, by setting the processor.priority property followed by the sink name:

Key                      Type                Default
processor.type           String              failover
processor.priority.NAME  int
processor.maxpenalty     int (milliseconds)  30000
Let's look at the following example:
agent.sinkgroups.sg1.sinks=k1 k2 k3
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=20
agent.sinkgroups.sg1.processor.priority.k3=20
Lower priority numbers come first, and in the case of a tie, order is arbitrary. You can use any numbering system that makes sense to you (by ones, fives, tens, whatever). In this example, the k1 sink will be tried first, and if an exception is thrown, either k2 or k3 will be tried next. If k3 was selected first for trial and it failed, k2 will still be tried. If all sinks in the sink group fail, the transaction with the channel is rolled back.
Finally, processor.maxpenalty sets an upper limit on the exponential backoff for failed sinks in the group. After the first failure, it will be 1 second before the sink can be used again. Each subsequent failure doubles the wait time, until processor.maxpenalty is reached.
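For instance, to cap the failover backoff at 10 seconds instead of the default 30 (an arbitrary choice for illustration), you would add:
agent.sinkgroups.sg1.processor.maxpenalty=10000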
MorphlineSolrSink
HDFS is not the only useful place to send your logs and data. Solr is a popular real-time search platform used to index large amounts of data so that full-text searching can be performed almost instantaneously. Hadoop's horizontal scalability creates an interesting problem for Solr, as there is now more data than a single instance can handle. For this reason, a horizontally scalable version of Solr was created, called SolrCloud. Cloudera's Search product is also based on SolrCloud, so it should be no surprise that Flume developers created a new sink specifically to write streaming data into Solr.
Like most streaming data flows, you not only transport the data, but you also often reformat it into a form more consumable by the target of the flow. Typically, this is done in a Flume-only workflow by applying one or more interceptors just prior to the sink writing the data to the target system. This sink uses the Morphline engine to transform the data instead of interceptors.
Internally, each Flume event is converted into a Morphline record and passed to the first command in the Morphline command chain. A record can be thought of as a set of key/value pairs with string keys and arbitrary object values. Each of the Flume headers is passed as a record field with the same header keys. A special record field key, _attachment_body, is used for the Flume event body. Keep in mind that the body is still a byte array (Java byte[]) at this point and must be specifically processed in the Morphline command chain.
Each command processes the record in turn, passing its output to the input of the next command in line, with the final command responsible for terminating the flow. In many ways, this is similar in functionality to Flume's Interceptor functionality, which we'll see in Chapter 6, Interceptors, ETL, and Routing. In the case of writing to Solr, we use the loadSolr command to convert the Morphline record into a SolrDocument and write it to the Solr cluster.
Morphline configuration files
Morphline configuration files use the HOCON format, which is similar to JSON but has a less strict syntax, making these files less error-prone for configuration than JSON.
Note
HOCON is an acronym for Human-Optimized Config Object Notation. You can read more about HOCON on this GitHub page: https://github.com/typesafehub/config/blob/master/HOCON.md
The configuration file contains a single key, morphlines. Its value is an array of Morphline configurations. Each individual entry is comprised of three keys:
id
importCommands
commands
If your configuration contains multiple Morphlines, the value of id must be provided to the Flume sink by way of the morphlineId property. The value of importCommands specifies the Java classes to import when the Morphline is evaluated. The double star indicates that all paths and classes from that point in the package hierarchy should be included. All classes that implement com.cloudera.cdk.morphline.api.CommandBuilder are interrogated for their names via the getNames() method. These names are the command names you use in the next section. Don't worry; you don't need to sift through the source code to find them, as there is a well-documented reference guide online. Finally, the commands key references a list of command dictionaries. Each command dictionary has a single key consisting of the name of the Morphline command, followed by its specific properties.
Note
For a list of Morphline commands and their associated configuration properties, see the reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
Here is what a skeleton configuration file might look like:
morphlines : [
  {
    id : transform_my_data
    importCommands : [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands : [
      {
        COMMAND_NAME1 : {
          property1 : value1
          property2 : value2
        }
      }
      {
        COMMAND_NAME2 : {
          property1 : value1
        }
      }
    ]
  }
]
Typical SolrSink configuration
Here is the preceding skeleton configuration applied to our Solr use case. This is not meant to be complete, but it is sufficient to discuss the flow just described:
morphlines : [
  {
    id : solr_flow
    importCommands : [
      "com.cloudera.**",
      "org.apache.solr.**"
    ]
    commands : [
      {
        readLine : {
          charset : UTF-8
        }
      }
      {
        grok : {
          GROK_PROPERTIES_HERE
        }
      }
      {
        loadSolr : {
          solrLocator : {
            collection : my_collection
            zkHost : "solr.example.com:2181/solr"
          }
        }
      }
    ]
  }
]
You can see the same boilerplate configuration, where we define a single Morphline with the solr_flow identifier. The command sequence starts with the readLine command. This simply reads the event body from the _attachment_body field and converts the byte[] into a String using the configured encoding (in this case, UTF-8). The resulting String value is set on the field with the key message. The next command in the sequence, the grok command, uses regular expressions to extract additional fields to make a more interesting SolrDocument. I couldn't possibly do this command justice by trying to explain everything you can do with it. For that, please see the Kite SDK documentation.
Note
See the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
Suffice to say, grok lets me take a web server log line such as this:
10.4.240.176 - - [14/Mar/2014:12:02:17 -0500] "POST http://mysite.com/do_stuff.php HTTP/1.1" 500 834
Then, it lets me turn it into more structured data like this:
{
  ip : 10.4.240.176
  timestamp : 1413306137
  method : POST
  url : http://mysite.com/do_stuff.php
  protocol : HTTP/1.1
  status_code : 500
  length : 834
}
If you wanted to search for all the times this page threw a 500 status code, having these fields broken out makes the task easy for Solr.
Finally, we call the loadSolr command to insert the record into our Solr cluster. The solrLocator property indicates the target Solr cluster (by way of its ZooKeeper server or servers) and the data collection to write these documents into.
Sink configuration
Now that you have a basic idea of how to create a Morphline configuration file, let's apply this to the actual sink configuration.
The following table details the sink's parameters and default values:

Key                  Required  Type    Default
type                 Yes       String  org.apache.flume.sink.solr.morphline.MorphlineSolrSink
channel              Yes       String
morphlineFile        Yes       String
morphlineId          No        String  Required if the Morphline configuration file contains more than one Morphline
batchSize            No        int     1000
batchDurationMillis  No        long    1000 (milliseconds)
handlerClass         No        String  org.apache.flume.sink.solr.morphline.MorphlineHandlerImpl
The MorphlineSolrSink does not have a short type alias, so set the type parameter on your named sink to org.apache.flume.sink.solr.morphline.MorphlineSolrSink:
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
This defines a MorphlineSolrSink named k1 for the agent named agent.
The next required parameter is the channel property. This specifies which channel to read events from for processing:
agent.sinks.k1.channel=c1
This tells the k1 sink to read events from the c1 channel.
The only other required parameter is the relative or absolute path to the Morphline configuration file. This cannot be a path in HDFS; it must be accessible on the server the Flume agent is running on (local disk, NFS disk, and so on).
To specify the configuration file path, set the morphlineFile property:
agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf
As a Morphline configuration file can contain multiple Morphlines, you must specify the identifier, if more than one exists, using the morphlineId property:
agent.sinks.k1.morphlineId=transform_my_data
The next two properties are fairly common among sinks. They specify how many events to remove at a time for processing, also known as a batch. The batchSize property defaults to 1,000 events, but you might need to set this higher if you aren't consuming events from the channel as fast as they are being inserted. Clearly, you can only increase this so much, as the thing you are writing to, in this case Solr, will have some record consumption limit. Only through testing will you be able to stress your systems and see where the limits are.
The related batchDurationMillis property specifies the maximum time to wait, when fewer than batchSize events have been read, before the sink proceeds with processing. The default value is 1 second and is specified in milliseconds in the configuration properties. In a situation with a light data flow (using the defaults, fewer than 1,000 records per second), setting batchDurationMillis higher can make things worse. For instance, if you are using a memory channel with this sink, your Flume agent could be sitting there with data to write to the sink's target but waiting for more, only for a crash to happen, resulting in lost data. That said, your downstream entity might perform better on larger batches, which might push both of these configuration values higher, so there is no universally correct answer. Start with the defaults if you are unsure, and use the hard data that you'll collect using the techniques in Chapter 8, Monitoring Flume, to adjust based on facts and not guesses.
Finally, you should never need to touch the handlerClass property unless you plan to write an alternate implementation of the Morphline processing class. As there is only one Morphline engine implementation to date, I'm not really sure why this is a documented property in Flume. I mention it only for completeness.
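Putting these pieces together, a complete sink definition for the solr_flow Morphline sketched earlier might look like this; the file path and batch values are illustrative:
agent.sinks.k1.type=org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k1.channel=c1
agent.sinks.k1.morphlineFile=/path/on/local/system/morphline.conf
agent.sinks.k1.morphlineId=solr_flow
agent.sinks.k1.batchSize=1000
agent.sinks.k1.batchDurationMillis=1000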
ElasticSearchSink
Another common target to stream data to for NRT searching is Elasticsearch. Elasticsearch is also a clustered search platform based on Lucene, like Solr. It is often used along with the Logstash project (to create structured logs) and the Kibana project (a web UI for searches). This trio is often referred to by the acronym ELK (Elasticsearch/Logstash/Kibana).
Note
Here are the project home pages for the ELK stack, which can give you a much better overview than I can in a few short pages:
Elasticsearch: http://elasticsearch.org/
Logstash: http://logstash.net/
Kibana: http://www.elasticsearch.org/overview/kibana/
In Elasticsearch, data is grouped into indices. You can think of these as being equivalent to databases in a single MySQL installation. The indices are composed of types (similar to tables in databases), which are made up of documents. A document is like a single row in a database, so each Flume event will become a single document in Elasticsearch. Documents have one or more fields (just like columns in a database).
This is by no means a complete introduction to Elasticsearch, but it should be enough to get you started, assuming you already have an Elasticsearch cluster at your disposal. As events get mapped to documents by the sink's serializer, the actual sink configuration needs only a few configuration items: where the cluster is located, which index to write to, and what type the record is.
This table summarizes the settings for the ElasticSearchSink:

Key          Required  Type    Default
type         Yes       String  org.apache.flume.sink.elasticsearch.ElasticSearchSink
hostNames    Yes       String  A comma-separated list of Elasticsearch nodes to connect to. If the port is specified, use a colon after the name. The default port is 9300.
clusterName  No        String  elasticsearch
indexName    No        String  flume
indexType    No        String  log
ttl          No        String  Defaults to never expire. Specify the number and unit (5m = 5 minutes).
batchSize    No        int     100
With this information in mind, let's start by setting the sink's type property:
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
Next, we need to set the list of servers and ports to establish connectivity, using the hostNames property. This is a comma-separated list of hostname:port pairs. If you are using the default port of 9300, you can just specify the server name or IP, for example:
agent.sinks.k1.hostNames=es1.example.com,es2.example.com:12345
Now that we can communicate with the Elasticsearch servers, we need to tell them which cluster, index, and type to write our documents to. The cluster is specified using the clusterName property. This corresponds with the cluster.name property in Elasticsearch's elasticsearch.yml configuration file. It needs to be specified, as an Elasticsearch node can participate in more than one cluster. Here is how I would specify a nondefault cluster name called production:
agent.sinks.k1.clusterName=production
The indexName property is really a prefix used to create a daily index. This keeps any single index from becoming too large over time. If you use the default index name, the index for October 30, 2014 will be named flume-2014-10-30.
Lastly, the indexType property specifies the Elasticsearch type. If unspecified, the default value of log will be used.
By default, data written into Elasticsearch will never expire. If you want the data to expire automatically, you can specify a time-to-live value on the records with the ttl property. Values are either a number in milliseconds or a number with units. The units are given in this table:
Unit string    Definition    Example
ms             Milliseconds  5ms = 5 milliseconds
not specified  Milliseconds  10000 = 10 seconds
m              Minutes       10m = 10 minutes
h              Hours         1h = 1 hour
d              Days          7d = 7 days
w              Weeks         4w = 4 weeks
Keep in mind that you also need to enable the TTL feature on the Elasticsearch cluster, as it is disabled by default. See the Elasticsearch documentation for how to do this.
Finally, like the HDFS sink, the batchSize property is the number of events per transaction that the sink will read from the channel. If you have a large volume of data in your channel, you should see a performance increase by setting this higher than the default of 100, due to the reduced overhead per transaction.
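Collected in one place, a sketch of a complete Elasticsearch sink might look like the following; the hostnames, cluster name, TTL, and batch size are placeholders for your environment:
agent.sinks.k1.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.k1.channel=c1
agent.sinks.k1.hostNames=es1.example.com,es2.example.com
agent.sinks.k1.clusterName=production
agent.sinks.k1.indexName=flume
agent.sinks.k1.indexType=log
agent.sinks.k1.ttl=4w
agent.sinks.k1.batchSize=500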
The sink's serializer does the work of transforming the Flume event into the Elasticsearch document. Two Elasticsearch serializers come packaged with Flume; neither has additional configuration properties, as they mostly use existing headers to dictate field mappings.
We'll see more of this sink in action in Chapter 7, Putting It All Together.
LogStash Serializer
The default serializer, if not specified, is the ElasticSearchLogStashEventSerializer:
agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
It writes data in the same format that Logstash uses in conjunction with Kibana. Here is a table of the commonly used fields and their associated mappings from Flume events:
Elasticsearch field  Taken from the Flume header  Notes
@timestamp           timestamp                    From the header, if present
@source              source                       From the header, if present
@source_host         source_host                  From the header, if present
@source_path         source_path                  From the header, if present
@type                type                         From the header, if present
@host                host                         From the header, if present
@fields              all headers                  A dictionary of all the Flume headers, including those that might have been mapped to the other fields, such as host
@message             Flume body
While you might think the document's @type field will be automatically set to the sink's indexType configuration property, you'd be incorrect. If you had only one type of log, it would be wasteful to write this over and over again for every document. However, if you have more than one log type going through your Flume channel, you can designate its type in Elasticsearch using the Static interceptor (which we'll see in Chapter 6, Interceptors, ETL, and Routing) to set the type (or @type) Flume header on the event.
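As a preview of Chapter 6, the header could be set at the source with a Static interceptor along these lines; the source name s1 and the apache_access value are purely illustrative:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=static
agent.sources.s1.interceptors.i1.key=type
agent.sources.s1.interceptors.i1.value=apache_access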
Dynamic Serializer
Another serializer is the ElasticSearchDynamicSerializer. If you use this serializer, the event's body is written to a field called body. All other Flume header keys are used as field names. Clearly, you want to avoid having a Flume header key called body, as it will conflict with the actual event's body when the event is transformed into an Elasticsearch document. To use this serializer, specify the fully qualified class name, as shown in this example:
agent.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
For completeness, here is a table that shows you how the Flume headers and body get mapped to Elasticsearch fields:

Flume entity  Elasticsearch field
All headers   Same as the Flume header keys
Body          body
As the version of Elasticsearch can be different for each user, Flume doesn't package the Elasticsearch client and corresponding Lucene libraries. Find out from your administrator which versions should be included on the Flume classpath, or check the Maven pom.xml file on GitHub for the corresponding version tag or branch at https://github.com/elasticsearch/elasticsearch/blob/master/pom.xml. Make sure the library versions used by Flume match those of Elasticsearch, or you might see serialization errors.
Note
As Solr and Elasticsearch have similar capabilities, check out Kelvin Tan's appropriately named side-by-side detailed feature breakdown web page. It should help get you started with what is most appropriate for your specific use case:
http://solr-vs-elasticsearch.com/
Summary
In this chapter, we covered the HDFS sink in depth, which writes streaming data into HDFS. We covered how Flume can separate data into different HDFS paths based on time or on the contents of Flume headers. Several file-rolling techniques were also discussed, including time rotation, event count rotation, size rotation, and rotation on idle only.
Compression was discussed as a means to reduce storage requirements in HDFS, and it should be used when possible. Besides the storage savings, it is often faster to read a compressed file and decompress it in memory than it is to read an uncompressed file. This will result in performance improvements in MapReduce jobs run on this data. The splittability of compressed data was also covered as a factor in deciding when and which compression algorithm to use.
Event Serializers were introduced as the mechanism by which Flume events are converted into an external storage format, including text (body only), text and headers (headers and body), and Avro serialization (with optional compression).
Next, various file formats, including sequence files (Hadoop key/value files), Data Streams (uncompressed data files, like Avro containers), and Compressed Data Streams, were discussed.
Next, we covered sink groups as a means to route events to different sinks using load balancing or failover paths, which can be used to eliminate single points of failure in routing data to its destination.
Finally, we covered two new sinks added in Flume 1.4 to write data to Apache Solr and Elasticsearch in a Near Real Time (NRT) way. For years, MapReduce jobs have served us well, and they will continue to do so, but sometimes that still isn't fast enough to search large datasets quickly and look at things from different angles without reprocessing data. Kite SDK Morphlines were also introduced as a way to prepare data for writing to Solr. We will revisit Morphlines in Chapter 6, Interceptors, ETL, and Routing, when we look at a Morphline-powered interceptor.
In the next chapter, we will discuss various input mechanisms (sources) that will feed your configured channels, which were covered back in Chapter 3, Channels.
Chapter 5. Sources and Channel Selectors
Now that we have covered channels and sinks, we will cover some of the more common ways to get data into your Flume agents. As discussed in Chapter 1, Overview and Architecture, the source is the input point for the Flume agent. There are many sources available with the Flume distribution, as well as many open source options available. Like most open source software, if you can't find what you need, you can always write your own by extending the org.apache.flume.source.AbstractSource class. Since the primary focus of this book is ingesting files of logs into Hadoop, we'll cover a few of the more appropriate sources to accomplish this.
The problem with using tail
If you have used any of the Flume 0.9 releases, you'll notice that the TailSource is no longer a part of Flume. TailSource provided a mechanism to "tail" (http://en.wikipedia.org/wiki/Tail_(Unix)) any file on the system and create Flume events for each line of the file. It could also handle file rotations, so many used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog).
As is the case with both channels and sinks, events are added and removed from a channel as part of a transaction. When you are tailing a file, there is no way to participate properly in a transaction. If a failure to write successfully to a channel occurred, or if the channel was simply full (a more likely event than failure), the data couldn't be "put back", as rollback semantics dictate.
Furthermore, if the rate of data written to a file exceeds the rate at which Flume can read the data, it is possible to lose one or more logfiles of input outright. For example, say you were tailing /var/log/app.log. When that file reaches a certain size, it is rotated, or renamed, to /var/log/app.log.1, and a new file called /var/log/app.log is created. Let's say you had a favorable review in the press, and your application logs are much higher than usual. Flume may still be reading from the rotated file (/var/log/app.log.1) when another rotation occurs, moving /var/log/app.log to /var/log/app.log.1. The file Flume is reading is now renamed to /var/log/app.log.2. When Flume finishes with this file, it will move to what it thinks is the next file (/var/log/app.log), thus skipping the file that now resides at /var/log/app.log.1. This kind of data loss would go completely unnoticed and is something we want to avoid if possible.
For these reasons, it was decided to remove the tail functionality from Flume when it was refactored. There are some workarounds for the removed TailSource, but it should be noted that no workaround can eliminate the possibility of data loss under load in these conditions.
The Exec source
The Exec source provides a mechanism to run a command outside Flume and then turn the output into Flume events. To use the Exec source, set the type property to exec:
agent.sources.s1.type=exec
All sources in Flume are required to specify the list of channels to write events to using the channels (plural) property. This is a space-separated list of one or more channel names:
agent.sources.s1.channels=c1
The only other required parameter is the command property, which tells Flume what command to pass to the operating system. Here is an example of the use of this property:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=exec
agent.sources.s1.command=tail -F /var/log/app.log
Here, I have configured a single source, s1, for an agent named agent. The source, an Exec source, will tail the /var/log/app.log file and follow any rotations that outside applications may perform on that logfile. All events are written to the c1 channel. This is an example of one of the workarounds for the lack of TailSource in Flume 1.x. It is not my preferred workaround for tail, just a simple example of the exec source type. I will show my preferred method in Chapter 7, Putting It All Together.
Note
Should you use the tail -F command in conjunction with the Exec source, it is probable that the forked process will not shut down 100 percent of the time when the Flume agent shuts down or restarts. This will leave orphaned tail processes that never exit. The tail -F command, by definition, has no end. Even if you delete the file being tailed (at least in Linux), the running tail process will keep the file handle open indefinitely. This keeps the file's space from actually being reclaimed until the tail process exits, which won't happen. I think you are beginning to see why Flume developers don't like tailing files.
If you go this route, be sure to periodically scan the process tables for tail -F processes whose parent PID is 1. These are effectively dead processes and need to be killed manually.
Here is a list of other properties you can use with the Exec source:

Key              Required  Type                 Default
type             Yes       String               exec
channels         Yes       String               Space-separated list of channels
command          Yes       String
shell            No        String               Shell command
restart          No        boolean              false
restartThrottle  No        long (milliseconds)  10000 (milliseconds)
logStdErr        No        boolean              false
batchSize        No        int                  20
batchTimeout     No        long                 3000 (milliseconds)
Not every command keeps running, either because it fails (for example, when the channel it is writing to is full) or because it is designed to exit immediately. In this example, we want to record the system load via the Linux uptime command, which prints some system information to stdout and exits:
agent.sources.s1.command=uptime
This command will immediately exit, so you can use the restart and restartThrottle properties to run it periodically:
agent.sources.s1.command=uptime
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000
This will produce one event per minute. In the tail example, should the channel fill, causing the Exec source to fail, you can use these properties to restart the Exec source. In this case, setting the restart property will start tailing from the beginning of the current file, thus producing duplicates. Depending on how long the restartThrottle period is, you may have missed some data due to a file rotation outside Flume. Furthermore, the channel may still be unable to accept data, in which case the source will fail again. Setting this value too low means giving the channel less time to drain, and unlike some of the sinks we saw, there is no option for exponential backoff.
If you need to use shell-specific features, such as wildcard expansion, you can set the shell property, as in this example:
agent.sources.s1.command=grep -i apache lib/*.jar | wc -l
agent.sources.s1.shell=/bin/bash -c
agent.sources.s1.restart=true
agent.sources.s1.restartThrottle=60000
This example will count the number of times the case-insensitive string apache is found in all the JAR files in the lib directory. Once per minute, that count will be sent as a Flume event payload.
While the command output written to stdout becomes the Flume event body, errors are sometimes written to stderr. If you want these lines included in the Flume agent's system logs, set the logStdErr property to true. Otherwise, they will be silently ignored, which is the default behavior.
Finally, you can specify the number of events to write per transaction by changing the batchSize property. You may need to set this value higher than the default of 20 if your input data is large and you find that you cannot write to your channel fast enough. Using a higher batch size reduces the overall average transaction overhead per event. Testing with different values and monitoring the channel's put rate is the only way to know for sure. The related batchTimeout property sets the maximum time to wait, when fewer records than the batch size have been seen, before flushing a partial batch to the channel. The default setting is 3 seconds (specified in milliseconds).
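For example, to batch up to 100 events per transaction while flushing partial batches after 1 second (illustrative values; test against your own flow), you would set:
agent.sources.s1.batchSize=100
agent.sources.s1.batchTimeout=1000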
Spooling Directory Source
In an effort to avoid all the assumptions inherent in tailing a file, a new source was devised to keep track of which files have been converted into Flume events and which still need to be processed. The SpoolingDirectorySource is given a directory to watch for new files appearing. It is assumed that files copied to this directory are complete. Otherwise, the source might try to send a partial file. It also assumes that filenames never change. Otherwise, on restarting, the source would forget which files have been sent and which have not. The filename condition can be met in log4j by using DailyRollingFileAppender rather than RollingFileAppender. However, the currently open file would need to be written to one directory and copied to the spool directory after being closed. None of the log4j appenders shipped have this capability.
That said, if you are using the Linux logrotate program in your environment, this might be of interest. You can move completed files to a separate directory using a postrotate script.
To create a Spooling Directory Source, set the type property to spooldir. You must specify the directory to watch by setting the spoolDir property:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=spooldir
agent.sources.s1.spoolDir=/path/to/files
Here is a summary of the properties for the Spooling Directory Source:

Key                  Required  Type                                  Default
type                 Yes       String                                spooldir
channels             Yes       String                                Space-separated list of channels
spoolDir             Yes       String                                Path to the directory to spool
fileSuffix           No        String                                .COMPLETED
deletePolicy         No        String (never or immediate)           never
fileHeader           No        boolean                               false
fileHeaderKey        No        String                                file
basenameHeader       No        boolean                               false
basenameHeaderKey    No        String                                basename
ignorePattern        No        String                                ^$
trackerDir           No        String                                ${spoolDir}/.flumespool
consumeOrder         No        String (oldest, youngest, or random)  oldest
batchSize            No        int                                   10
bufferMaxLines       No        int                                   100
maxBufferLineLength  No        int                                   5000
maxBackoff           No        int                                   4000 (milliseconds)
When a file has been transmitted completely, it will be renamed with a .COMPLETED extension, unless this is overridden by setting the fileSuffix property, like this:
agent.sources.s1.fileSuffix=.DONE
StartingwithFlume1.4,anewproperty,deletePolicy,wascreatedtoremovecompletedfilesfromthefilesystemratherthanjustmarkingthemasdone.Inaproductionenvironment,thisiscriticalbecauseyourspooldiskwillfillupovertime.Currently,youcanonlysetthisforimmediatedeletionortoleavethefilesforever.Ifyouwantdelayeddeletion,you’llneedtoimplementyourownperiodic(cron)job,perhapsusingthefindcommandtofindfilesinthespooldirectorywiththeCOMPLETEDfilesuffixandamodificationtimelongerthansomeregularvalue,forexample:
find /path/to/spool/dir -type f -name "*.COMPLETED" -mtime +7 -exec rm {} \;
This will find all files completed more than seven days ago and delete them.
If you want the absolute file path attached to each event, set the fileHeader property to true. This will create a header with the file key, unless it is set to something else using the fileHeaderKey property. For example, this configuration would add the {sourceFile=/path/to/files/foo.1234.log} header if the event was read from the /path/to/files/foo.1234.log file:
agent.sources.s1.fileHeader=true
agent.sources.s1.fileHeaderKey=sourceFile
The related property, basenameHeader, if set to true, will add a header with the basename key, which contains just the filename. The basenameHeaderKey property allows you to change this key, as shown here:
agent.sources.s1.basenameHeader=true
agent.sources.s1.basenameHeaderKey=justTheName
This configuration would add the {justTheName=foo.1234.log} header if the event was read from the same file located at /path/to/files/foo.1234.log.
If there are certain file patterns that you do not want this source to read as input, you can pass a regular expression using the ignorePattern property. Personally, I simply won't copy files I don't want transferred to Flume into the spoolDir in the first place. If this situation cannot be avoided, use the ignorePattern property to pass a regular expression matching the filenames that should not be transferred as data. Furthermore, subdirectories and files that start with a "." (period) character are ignored, so you can avoid costly regular expression processing by using this convention instead.
While Flume is sending data, it keeps track of how far it has gotten in each file by keeping a metadata file in the directory specified by the trackerDir property. By default, this file will be .flumespool under spoolDir. Should you want a location other than inside spoolDir, you can specify an absolute file path.
Files in the directory are processed in the "oldest first" manner, as calculated by looking at the modification times of the files. You can change this behavior by setting the consumeOrder property. If you set this property to youngest, the newest files will be processed first. This may be desired if the data is time sensitive. If you'd rather give equal precedence to all files, you can set this property to a value of random.
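For example, to process the newest files first, you would add this line to the preceding configuration:

agent.sources.s1.consumeOrder=youngest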
The batchSize property allows you to tune the number of events per transaction for writes to the channel. Increasing this may provide better throughput at the cost of larger transactions (and possibly larger rollbacks). The bufferMaxLines property is used to set the size of the memory buffer used in reading files, by multiplying it with maxBufferLineLength. If your data is very short, you might consider increasing bufferMaxLines while reducing maxBufferLineLength. In this case, it will result in better throughput without increasing your memory overhead. That said, if you have events longer than 5000 characters, you'll want to set maxBufferLineLength higher.
If there is a problem writing data to the channel, a ChannelException is thrown back to the source, where it'll retry after an initial wait time of 250 ms. Each failed attempt will double this time, up to a maximum of 4 seconds. To set this maximum time higher, set the maxBackoff property. For instance, if I wanted a maximum of 5 minutes, I could set it in milliseconds like this:
agent.sources.s1.maxBackoff=300000
Finally, you'll want to ensure that whatever mechanism is writing new files to your spooling directory creates unique filenames, such as adding a timestamp (and possibly more). Reusing a filename will confuse the source, and your data may not be processed.
As always, remember that restarts and errors will create duplicates, due to the retransmission of files partially sent but not marked as complete, or because the metadata is incomplete.
Syslog sources
Syslog has been around for decades and is often used as an operating-system-level mechanism to capture and move logs around systems. In many ways, there are overlaps with some of the functionality Flume provides. There is even a Hadoop module for rsyslog, one of the more modern variants of syslog (http://www.rsyslog.com/doc/rsyslog_conf_modules.html/omhdfs.html). Generally, I don't like solutions that couple technologies that may version independently. If you use this rsyslog/Hadoop integration, you would be required to update the version of Hadoop you compiled into rsyslog at the same time you upgraded your Hadoop cluster to a new major version. This may be logistically difficult if you have a large number of servers and/or environments. Backward compatibility in Hadoop wire protocols is something that is being actively worked on in the Hadoop community, but currently, it isn't the norm. We'll talk more about this in Chapter 6, Interceptors, ETL, and Routing, when we discuss tiering data flows.
Syslog has an older UDP transport, as well as a newer TCP protocol that can handle data larger than a single UDP packet can transmit (about 64 KB) and deal with network-related congestion events that might require the data to be retransmitted.
Finally, there are some undocumented properties of syslog sources that allow us to add more regular-expression pattern-matching for messages that do not conform to the RFC standards. I won't be discussing these additional settings, but you should be aware of them if you run into frequent parsing errors. In this case, take a look at the source for org.apache.flume.source.SyslogUtils for implementation details to find the cause.
More details on syslog terms (such as a facility) and standard formats can be found in RFC 3164 at http://tools.ietf.org/html/rfc3164.
The syslog UDP source
The UDP version of syslog is usually safe to use when you are receiving data from the server's local syslog process, provided the data is small enough (less than about 64 KB).
Note
The implementation for this source has chosen 2,500 bytes as the maximum payload size, regardless of what your network can actually handle. So, if your payload will be larger than this, use one of the TCP sources instead.
To create a Syslog UDP source, set the type property to syslogudp. You must set the port to listen on using the port property. The optional host property specifies the bind address. If no host is specified, all IPs for the server will be used, which is the same as specifying 0.0.0.0. In this example, we will only listen for local UDP connections on port 5140:
agent.sources=s1
agent.sources.s1.channels=c1
agent.sources.s1.type=syslogudp
agent.sources.s1.host=localhost
agent.sources.s1.port=5140
If you want syslog to forward a tailed file, you can add a line like this to your syslog configuration file:
*.err;*.alert;*.crit;*.emerg;kern.* @localhost:5140
This will send all error priority, critical priority, emergency priority, and kernel messages of any priority into your Flume source. The single @ symbol designates that the UDP protocol should be used.
Here is a summary of the properties of the Syslog UDP source:
Key         Required  Type     Default
type        Yes       String   syslogudp
channels    Yes       String   Space-separated list of channels
port        Yes       int
host        No        String   0.0.0.0
keepFields  No        boolean  false
The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
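For example, to retain those fields in the event body as well, you would add:

agent.sources.s1.keepFields=true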
The Flume headers created by the Syslog UDP source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if it is not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if it is not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This value is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the eventSize value (for UDP, this is set internally to 2,500 bytes). It is omitted if everything is fine.
The syslog TCP source
As previously mentioned, the Syslog TCP source provides an endpoint for messages over TCP, allowing for a larger payload size and TCP retry semantics that should be used for any reliable inter-server communications.
To create a Syslog TCP source, set the type property to syslogtcp. You must still set the bind address and port to listen on:
agent.sources=s1
agent.sources.s1.type=syslogtcp
agent.sources.s1.host=0.0.0.0
agent.sources.s1.port=12345
If your syslog implementation supports syslog over TCP, the configuration is usually the same, except that a double @ symbol is used to indicate TCP transport. Here is the same example using TCP, where I am forwarding the values to a Flume agent that is running on a different server named flume-1:
*.err;*.alert;*.crit;*.emerg;kern.* @@flume-1:12345
There are some optional properties for the Syslog TCP source, as listed here:
Key         Required  Type         Default
type        Yes       String       syslogtcp
channels    Yes       String       Space-separated list of channels
port        Yes       int
host        No        String       0.0.0.0
keepFields  No        boolean      false
eventSize   No        int (bytes)  2500
The keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
The Flume headers created by the Syslog TCP source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, and set to Incomplete if the message was longer than the configured eventSize. It is omitted if everything is fine.
The multiport syslog TCP source
The Multiport Syslog TCP source is nearly identical in functionality to the Syslog TCP source, except that it can listen on multiple ports for input. You may need to use this capability if you are unable to change which port syslog will use in its forwarding rules (it may not be your server at all). It is more likely that you will use this to read multiple formats using one source to write to different channels. We'll cover that in a moment in the Channel selectors section. Under the hood, a high-performance asynchronous TCP library called Mina (https://mina.apache.org/) is used, which often provides better throughput on multicore servers, even when consuming only a single TCP port.
To configure this source, set the type property to multiport_syslogtcp:
agent.sources.s1.type=multiport_syslogtcp
Like the other syslog sources, you need to specify the port, but in this case it is a space-separated list of ports. You can use this even if you have only one port. The property for this is ports (plural):
agent.sources.s1.type=multiport_syslogtcp
agent.sources.s1.channels=c1
agent.sources.s1.ports=33333 44444
agent.sources.s1.host=0.0.0.0
This code configures the Multiport Syslog TCP source named s1 to listen for any incoming connections on ports 33333 and 44444 and send them to channel c1.
In order to tell which event came from which port, you can set the optional portHeader property to the name of the key whose value will be the port number. Let's add this property to the configuration:
agent.sources.s1.portHeader=port
Then, any events received from port 33333 would have a header key/value of {"port"="33333"}. As you saw in Chapter 4, Sinks and Sink Processors, you can now use this value (or any header) as a part of your HDFS Sink file path convention, like this:
agent.sinks.k1.hdfs.path=/logs/%{hostname}/%{port}/%Y/%m/%d/%H
Here is a complete table of the properties:
Key                 Required  Type         Default
type                Yes       String       multiport_syslogtcp
channels            Yes       String       Space-separated list of channels
ports               Yes       int          Space-separated list of port numbers
host                No        String       0.0.0.0
keepFields          No        boolean      false
eventSize           No        int          2500 (bytes)
portHeader          No        String
batchSize           No        int          100
readBufferSize      No        int (bytes)  1024
numProcessors       No        int          Automatically detected
charset.default     No        String       UTF-8
charset.port.PORT#  No        String
This TCP source has some additional tunable options over the standard TCP syslog source. The first is the batchSize property. This is the number of events processed per transaction with the channel. There is also the readBufferSize property, which specifies the internal buffer size used by the Mina library. Finally, the numProcessors property is used to size the worker thread pool in Mina. Before you tune these parameters, you may want to familiarize yourself with Mina and look at the source code before deviating from the defaults.
Finally, you can specify the default and per-port character encoding to use when converting between Strings and bytes:
agent.sources.s1.charset.default=UTF-16
agent.sources.s1.charset.port.33333=UTF-8
This sample configuration shows that all ports will be interpreted using UTF-16 encoding, except for port 33333 traffic, which will use UTF-8.
As you've already seen with the other syslog sources, the keepFields property tells the source to include the syslog fields as part of the body. By default, these are simply removed, as they become Flume header values.
The Flume headers created by this source are summarized here:
Header key           Description
Facility             This is the syslog facility. See the syslog documentation.
Priority             This is the syslog priority. See the syslog documentation.
timestamp            This is the time of the syslog event, translated into an epoch timestamp. It is omitted if not parsed from one of the standard RFC formats.
hostname             This is the parsed hostname in the syslog message. It is omitted if not parsed.
flume.syslog.status  There was a problem parsing the syslog message's headers. This is set to Invalid if the payload didn't conform to the RFCs, set to Incomplete if the message was longer than the configured eventSize, and omitted if everything is fine.
JMS source
Sometimes, data can originate from asynchronous message queues. For these cases, you can use Flume's JMS source to create events read from a JMS Queue or Topic. While it is theoretically possible to use any Java Message Service (JMS) implementation, Flume has only been tested with ActiveMQ, so be sure to test thoroughly if you use a different provider.
Like the previously covered ElasticSearch sink in Chapter 4, Sinks and Sink Processors, Flume does not come packaged with the JMS implementation you'll be using, as the versions need to match up, so you'll need to include the necessary implementation JAR files on the Flume agent's classpath. The preferred method is to use the --plugins-dir parameter mentioned in Chapter 2, A Quick Start Guide to Flume, which we'll cover in more detail in the next chapter.
Note
ActiveMQ is just one provider of the Java Message Service API. For more information, see the project home page at http://activemq.apache.org.
To configure this source, set the type property to jms:
agent.sources.s1.type=jms
The first three properties are used to establish a connection with the JMS server. They are initialContextFactory, connectionFactory, and providerURL. The initialContextFactory property for ActiveMQ will be org.apache.activemq.jndi.ActiveMQInitialContextFactory. The connectionFactory property will be the registered JNDI name, which defaults to ConnectionFactory if unspecified. Finally, providerURL is the connection String passed to the connection factory to actually establish a network connection. It is usually a URL-like String consisting of the server name and port information. If you aren't familiar with your JMS configuration, ask somebody who is what values to use in your environment.
This table summarizes these settings and others we'll discuss in a moment:
Key                    Required  Type    Default
type                   Yes       String  jms
channels               Yes       String  Space-separated list of channels
initialContextFactory  Yes       String
connectionFactory      No        String  ConnectionFactory
providerURL            Yes       String
userName               No        String  Username for authentication
passwordFile           No        String  Path to a file containing the password for authentication
destinationName        Yes       String
destinationType        Yes       String
messageSelector        No        String
errorThreshold         No        int     10
pollTimeout            No        long    1000 (milliseconds)
batchSize              No        int     100
If your JMS server requires authentication, pass the userName and passwordFile properties:
agent.sources.s1.userName=jms_bot_user
agent.sources.s1.passwordFile=/path/to/password.txt
Putting the password in a file you reference, rather than directly in the Flume configuration, allows you to keep the permissions on the configuration open for inspection, while storing the more sensitive password data in a separate file that is accessible only to the Flume agent (restrictions are commonly provided by the operating system's permissions system).
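As a sketch, assuming the agent runs as a dedicated flume user (the username and path are illustrative), you might lock the file down like this:

$ chown flume:flume /path/to/password.txt
$ chmod 600 /path/to/password.txt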
The destinationName property names the destination to read from our connection. Since JMS supports both queues and topics, you need to set the destinationType property to queue or topic, respectively. Here's what the properties might look like for a queue:
agent.sources.s1.destinationName=my_cool_data
agent.sources.s1.destinationType=queue
For a topic, it will look as follows:
agent.sources.s1.destinationName=restart_events
agent.sources.s1.destinationType=topic
The difference between a queue and a topic is basically the number of entities that will read from the named destination. If you are reading from a queue, the message will be removed from the queue once it is written to Flume as an event. A message on a topic, once read, is still available for other entities to read.
Should you need only a subset of the messages published to a topic or queue, the optional messageSelector String property provides this capability. For example, to create Flume events only from messages with a field called Age whose value is larger than 10, I can specify a messageSelector filter:
agent.sources.s1.messageSelector="Age > 10"
Note
Describing the selector capabilities and syntax in detail is far beyond the scope of this book. See http://docs.oracle.com/javaee/1.4/api/javax/jms/Message.html or pick up a book on JMS to become more familiar with JMS message selectors.
Should there be a problem in communicating with the JMS server, the connection will reset itself after a number of failures specified by the errorThreshold property. The default value of 10 is reasonable and will most likely not need to be changed.
Next, the JMS source has a property used to adjust the value of batchSize, which defaults to 100. By now, you should be fairly familiar with adjusting batch sizes on other sources and sinks. For high-volume flows, you'll want to set this higher to consume data in larger chunks from your JMS server and get higher throughput. Setting this too high for low-volume flows could delay processing. As always, testing is the only sure way to adjust this properly. The related pollTimeout property specifies how long to wait for new messages to appear before attempting to read a batch. The default, specified in milliseconds, is 1 second. If no messages are read before the poll timeout, the source will go into an exponential backoff mode before attempting another read. This could delay the processing of messages until it wakes up again. Chances are that you won't need to change this value from the default.
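As an illustrative sketch, a higher-volume tuning of the preceding JMS source might look like this (the values are assumptions that only testing can validate):

agent.sources.s1.batchSize=500
agent.sources.s1.pollTimeout=2000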
When the message gets converted into a Flume event, the message properties become Flume headers. The message payload becomes the Flume body, but since JMS uses serialized Java objects, we need to tell the JMS source how to interpret the payload. We do this by setting the converter.type property, which defaults to the only implementation that is packaged with Flume, using the DEFAULT String (implemented by the DefaultJMSMessageConverter class). It can deserialize JMS BytesMessages, TextMessages, and ObjectMessages (Java objects that implement the java.io.DataOutput interface). It does not handle StreamMessages or MapMessages, so if you need to process them, you'll need to implement your own converter type. To do this, you'll implement the org.apache.flume.source.jms.JMSMessageConverter interface, and use its fully qualified class name as the converter.type property value.
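As a hedged sketch of what such a converter might look like, here is one that flattens MapMessages into newline-separated key=value text. The class name com.example.MapMessageConverter is hypothetical, and the single convert(Message) method shown should be verified against the JMSMessageConverter interface in your Flume version:

import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

import javax.jms.JMSException;
import javax.jms.MapMessage;
import javax.jms.Message;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.jms.JMSMessageConverter;

public class MapMessageConverter implements JMSMessageConverter {
  public List<Event> convert(Message message) throws JMSException {
    List<Event> events = new ArrayList<Event>();
    if (message instanceof MapMessage) {
      MapMessage map = (MapMessage) message;
      StringBuilder body = new StringBuilder();
      Enumeration<?> names = map.getMapNames();
      while (names.hasMoreElements()) {
        String name = (String) names.nextElement();
        // Flatten each map entry into a key=value line
        body.append(name).append('=').append(map.getString(name)).append('\n');
      }
      events.add(EventBuilder.withBody(body.toString(), Charset.forName("UTF-8")));
    }
    return events;
  }
}

It would then be wired in with agent.sources.s1.converter.type=com.example.MapMessageConverter.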
For completeness, here are these properties in tabular form:
Key                Required  Type    Default
converter.type     No        String  DEFAULT
converter.charset  No        String  UTF-8
Channel selectors
As we discussed in Chapter 1, Overview and Architecture, a source can write to one or more channels. This is why the property is plural (channels instead of channel). There are two ways multiple channels can be handled. The event can be written to all the channels or to just one channel, based on some Flume header value. The internal mechanism for this in Flume is called a channel selector.
The selector for any source can be specified using the selector.type property. All selector-specific properties begin with the usual source prefix: the agent name, the keyword sources, and the source name:
agent.sources.s1.selector.type=replicating
Replicating
If you do not specify a selector for a source, replicating is the default. The replicating selector writes the same event to all the channels in the source's channels list:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
In this example, every event will be written to all three channels: c1, c2, and c3.
There is an optional property on this selector, called optional. It is a space-separated list of channels that are optional. Consider this modified example:
agent.sources.s1.channels=c1 c2 c3
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2 c3
Now, any failure to write to channel c2 or c3 will not cause the transaction to fail, and any data written to c1 will be committed. In the earlier example with no optional channels, any single channel failure would roll back the transaction for all channels.
Multiplexing
If you want to send different events to different channels, you should use a multiplexing channel selector by setting the value of selector.type to multiplexing. You also need to tell the channel selector which header to use by setting the selector.header property:
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=port
Let's assume we used the Multiport Syslog TCP source to listen on four ports (11111, 22222, 33333, and 44444) with a portHeader setting of port:
agent.sources.s1.selector.default=c2
agent.sources.s1.selector.mapping.11111=c1 c2
agent.sources.s1.selector.mapping.44444=c2
agent.sources.s1.selector.optional.44444=c3
This configuration will result in the traffic of port 22222 and port 33333 going to the c2 channel only. The traffic of port 11111 will go to the c1 and c2 channels. A failure on either channel would result in nothing being added to either channel. The traffic of port 44444 will go to channels c2 and c3. However, a failure to write to c3 will still commit the transaction to c2, and c3 will not be attempted again with that event.
Summary
In this chapter, we covered in depth the various sources that we can use to insert log data into Flume, including the Exec source, the Spooling Directory Source, the Syslog sources (UDP, TCP, and multiport TCP), and the JMS source.
We discussed replicating the old TailSource functionality from Flume 0.9 and the problems with using tail semantics in general.
We also covered channel selectors and sending events to one or more channels, specifically the replicating and multiplexing channel selectors.
Optional channels were also discussed as a way to fail a put transaction for only some of the channels when more than one channel is used.
In the next chapter, we'll introduce interceptors, which allow in-flight inspection and transformation of events. Used in conjunction with channel selectors, interceptors provide the final piece needed to create complex dataflows with Flume. Additionally, we will cover RPC mechanisms (source/sink pairs) between Flume agents using both Avro and Thrift, which can be used to tie those dataflows together.
Chapter 6. Interceptors, ETL, and Routing
The final piece of functionality required in your data processing pipeline is the ability to inspect and transform events in flight. This can be accomplished using interceptors. Interceptors, as we discussed in Chapter 1, Overview and Architecture, can be inserted after a source creates an event, but before writing to the channel occurs.
Interceptors
An interceptor's functionality can be summed up with this method:
public Event intercept(Event event);
A Flume event is passed to it, and it returns a Flume event. It may do nothing, in which case the same unaltered event is returned. Often, it alters the event in some useful way. If null is returned, the event is dropped.
To add interceptors to a source, simply add the interceptors property to the named source, for example:
agent.sources.s1.interceptors=i1 i2 i3
This defines three interceptors, i1, i2, and i3, on the s1 source for the agent named agent.
Note
Interceptors are run in the order in which they are listed. In the preceding example, i2 will receive the output from i1. Then, i3 will receive the output from i2. Finally, the channel selector receives the output from i3.
Now that we have defined the interceptors by name, we need to specify their types as follows:
agent.sources.s1.interceptors.i1.type=TYPE1
agent.sources.s1.interceptors.i1.additionalProperty1=VALUE
agent.sources.s1.interceptors.i2.type=TYPE2
agent.sources.s1.interceptors.i3.type=TYPE3
Let's look at some of the interceptors that come bundled with Flume to get a better idea of how to configure them.
Timestamp
The Timestamp interceptor, as its name suggests, adds a header with the timestamp key to the Flume event if one doesn't already exist. To use it, set the type property to timestamp.
If the event already contains a timestamp header, it will be overwritten with the current time, unless configured to preserve the original value by setting the preserveExisting property to true.
Here is a table summarizing the properties of the Timestamp interceptor:
Key               Required  Type     Default
type              Yes       String   timestamp
preserveExisting  No        boolean  false
Here is what a complete configuration for a source might look like if we only want it to add a timestamp header if none exists:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.i1.preserveExisting=true
Recall this HDFS Sink path from Chapter 4, Sinks and Sink Processors, utilizing the event date:
agent.sinks.k1.hdfs.path=/logs/apache/%Y/%m/%d/%H
The timestamp header is what determines this path. If it is missing, you can be sure Flume will not know where to create the files, and you will not get the result you are looking for.
Host
Similar in simplicity to the Timestamp interceptor, the Host interceptor will add a header to the event containing the IP address of the current Flume agent. To use it, set the type property to host:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=host
The key for this header will be host, unless you specify something else using the hostHeader property. Like before, an existing header will be overwritten, unless you set the preserveExisting property to true. Finally, if you want a reverse DNS lookup of the hostname to be used as the value instead of the IP, set the useIP property to false. Remember that reverse lookups will add processing time to your data flow.
Here is a table summarizing the properties of the Host interceptor:
Key               Required  Type     Default
type              Yes       String   host
hostHeader        No        String   host
preserveExisting  No        boolean  false
useIP             No        boolean  true
Here is what a complete configuration for a source might look like if we want it to add a relayHost header containing the DNS hostname of this agent to every event:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=host
agent.sources.s1.interceptors.i1.hostHeader=relayHost
agent.sources.s1.interceptors.i1.useIP=false
This interceptor might be useful if you wanted to record the path your events took through your dataflow, for instance. Chances are you are more interested in the origin of the event rather than the path it took, which is why I have yet to use this.
Static
The Static interceptor is used to insert a single key/value header into each Flume event processed. If more than one key/value is desired, you simply add additional Static interceptors. Unlike the interceptors we've looked at so far, the default behavior is to preserve existing headers with the same key. As always, my recommendation is to always specify what you want and not rely on the defaults.
I do not know why the key and value properties are not required, as the defaults are not terribly useful.
Here is a table summarizing the properties of the Static interceptor:
Key               Required  Type     Default
type              Yes       String   static
key               No        String   key
value             No        String   value
preserveExisting  No        boolean  true
Finally, let's look at an example configuration that inserts two new headers, provided they don't already exist in the event:
agent.sources.s1.interceptors=pos env
agent.sources.s1.interceptors.pos.type=static
agent.sources.s1.interceptors.pos.key=pointOfSale
agent.sources.s1.interceptors.pos.value=US
agent.sources.s1.interceptors.env.type=static
agent.sources.s1.interceptors.env.key=environment
agent.sources.s1.interceptors.env.value=staging
Regular expression filtering
If you want to filter events based on the content of the body, the regular expression filtering interceptor is your friend. Based on a regular expression you provide, it will either filter out the matching events or keep only the matching events. Start by setting the interceptor's type property to regex_filter. The pattern you want to match is specified using Java-style regular expression syntax. See the javadocs for usage details at http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html. The pattern string is set in the regex property. Be sure to escape backslashes in Java Strings. For instance, the \d+ pattern would need to be written with two backslashes: \\d+. If you wanted to match a backslash, the documentation says to type two backslashes, but each needs to be escaped, resulting in four, that is, \\\\. You will see the use of escaped backslashes throughout this chapter. Finally, you need to tell the interceptor if you want to exclude matching records by setting the excludeEvents property to true. The default (false) indicates that you want to keep only the events that match the pattern.
Here is a table summarizing the properties of the regular expression filtering interceptor:
Key            Required  Type     Default
type           Yes       String   regex_filter
regex          No        String   .*
excludeEvents  No        boolean  false
In this example, any events containing the NullPointerException string will be dropped:
agent.sources.s1.interceptors=npe
agent.sources.s1.interceptors.npe.type=regex_filter
agent.sources.s1.interceptors.npe.regex=NullPointerException
agent.sources.s1.interceptors.npe.excludeEvents=true
Regular expression extractor
Sometimes, you'll want to extract bits of your event body into Flume headers so that you can perform routing via channel selectors. You can use the regular expression extractor interceptor to perform this function. Start by setting the interceptor's type property to regex_extractor:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
Like the regular expression filtering interceptor, the regular expression extractor interceptor uses Java-style regular expression syntax. In order to extract one or more fields, you start by specifying the regex property with group matching parentheses. Let's assume we are looking for error numbers in our events in the Error: N form, where N is a number:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
As you can see, I put capture parentheses around the number, which may be one or more digits. Now that I've matched my desired pattern, I need to tell Flume what to do with my match. Here, we need to introduce serializers, which provide a pluggable mechanism for how to interpret each match. In this example, I've only got one match, so my space-separated list of serializer names has only one entry:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
agent.sources.s1.interceptors.e1.serializers=ser1
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
The name property specifies the header key to use, where the value is the matching text from the regular expression. The default type (which is also used if not specified) is a simple pass-through serializer. For this event body, look at the following:
NullPointerException: A problem occurred. Error: 123. TxnID: 5X2T9E.
The following header would be added to the event:
{"error_no":"123"}
If I wanted to add the TxnID value as a header, I'd simply add another matching pattern group and serializer:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+).*TxnID:\\s(\\w+)
agent.sources.s1.interceptors.e1.serializers=ser1 ser2
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
agent.sources.s1.interceptors.e1.serializers.ser2.type=default
agent.sources.s1.interceptors.e1.serializers.ser2.name=txnid
Then, I would create these headers for the preceding input:
{"error_no":"123","txnid":"5x2T9E"}
However, take a look at what would happen if the fields were reversed, as follows:
NullPointerException: A problem occurred. TxnID: 5X2T9E. Error: 123.
I would wind up with only a header for txnid. A better way to handle this kind of ordering would be to use multiple interceptors so that the order doesn't matter:
agent.sources.s1.interceptors=e1 e2
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=Error:\\s(\\d+)
agent.sources.s1.interceptors.e1.serializers=ser1
agent.sources.s1.interceptors.e1.serializers.ser1.type=default
agent.sources.s1.interceptors.e1.serializers.ser1.name=error_no
agent.sources.s1.interceptors.e2.type=regex_extractor
agent.sources.s1.interceptors.e2.regex=TxnID:\\s(\\w+)
agent.sources.s1.interceptors.e2.serializers=ser1
agent.sources.s1.interceptors.e2.serializers.ser1.type=default
agent.sources.s1.interceptors.e2.serializers.ser1.name=txnid
The only other serializer implementation that ships with Flume, other than the pass-through, is used by specifying the fully qualified class name of org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer. This serializer is used to convert times into milliseconds. You need to specify a pattern property based on org.joda.time.format.DateTimeFormat patterns.
For instance, let's say you were ingesting Apache Web Server access logs, for example:
192.168.1.42 - - [29/Mar/2013:15:27:09 -0600] "GET /index.html HTTP/1.1" 200 1037
The complete regular expression for this might look like this (in the form of a Java String, with backslashes and quotes escaped with an extra backslash):
^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
The time pattern matched corresponds to this org.joda.time.format.DateTimeFormat pattern:
dd/MMM/yyyy:HH:mm:ss Z
Take a look at what would happen if we make our configuration something like this:
agent.sources.s1.interceptors=e1
agent.sources.s1.interceptors.e1.type=regex_extractor
agent.sources.s1.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
agent.sources.s1.interceptors.e1.serializers=ip dt url sc bc
agent.sources.s1.interceptors.e1.serializers.ip.name=ip_address
agent.sources.s1.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.s1.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
agent.sources.s1.interceptors.e1.serializers.dt.name=timestamp
agent.sources.s1.interceptors.e1.serializers.url.name=http_request
agent.sources.s1.interceptors.e1.serializers.sc.name=status_code
agent.sources.s1.interceptors.e1.serializers.bc.name=bytes_xfered
This would create the following headers for the preceding sample:
{"ip_address":"192.168.1.42","timestamp":"1364588829",
"http_request":"GET/index.htmlHTTP/1.1","status_code":"200",
"bytes_xfered":"1037"}
The body content is unaffected. You'll also notice that I didn't specify the default type for the other serializers, as that is the default.
Note
There is no overwrite checking in this interceptor type. For instance, using the timestamp key will overwrite the event's previous time value if there was one.
You can implement your own serializers for this interceptor by implementing the org.apache.flume.interceptor.RegexExtractorInterceptorSerializer interface. However, if your goal is to move data from the body of an event to the header, you'll probably want to implement a custom interceptor so that you can alter the body contents in addition to setting the header value; otherwise, the data will be effectively duplicated.
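As a rough sketch only, a custom serializer that upper-cases the matched text might look like this. The method set shown here (serialize plus the two configure methods) mirrors the bundled pass-through serializer as I recall it, but you should verify the exact interface against the Flume source for your version before relying on this:

import org.apache.flume.Context;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.interceptor.RegexExtractorInterceptorSerializer;

public class UpperCaseSerializer implements RegexExtractorInterceptorSerializer {
  public String serialize(String value) {
    // Transform the matched group before it is stored as a header value
    return value.toUpperCase();
  }

  public void configure(Context context) {
    // No properties needed for this sketch
  }

  public void configure(ComponentConfiguration conf) {
    // No properties needed for this sketch
  }
}

You would then reference it with agent.sources.s1.interceptors.e1.serializers.ser1.type=com.example.UpperCaseSerializer (a hypothetical class name).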
To summarize, let's review the properties for this interceptor:
Key                    Required  Type                                       Default
type                   Yes       String                                     regex_extractor
regex                  Yes       String
serializers            Yes       Space-separated list of serializer names
serializers.NAME.name  Yes       String
serializers.NAME.type  No        default or the FQCN of an implementation  default
serializers.NAME.PROP  No        Serializer-specific properties
Morphline interceptor
As we saw in Chapter 4, Sinks and Sink Processors, a powerful library of transformations from the Kite SDK project backs the MorphlineSolrSink. It should come as no surprise that you can also use these libraries in many places where you'd otherwise be forced to write a custom interceptor. Similar in configuration to its sink counterpart, you only need to specify the Morphline configuration file and, optionally, the Morphline unique identifier (if the configuration specifies more than one Morphline). Here is a summary table of the Flume interceptor configuration:
Key            Required  Type    Default
type           Yes       String  org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFile  Yes       String
morphlineId    No        String  Picks the first Morphline if not specified; it is an error if none is specified and more than one exists
Your Flume configuration might look something like this:
agent.sources.s1.interceptors=i1 m1
agent.sources.s1.interceptors.i1.type=timestamp
agent.sources.s1.interceptors.m1.type=org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
agent.sources.s1.interceptors.m1.morphlineFile=/path/to/morph.conf
agent.sources.s1.interceptors.m1.morphlineId=goMorphy
In this example, we have specified the s1 source on the agent named agent, which contains two interceptors, i1 and m1, processed in that order. The first interceptor is a standard Timestamp interceptor that inserts a timestamp header if none exists. The second will send the event through the Morphline processor specified by the Morphline configuration file, corresponding to the goMorphy ID.
Events processed by this interceptor must produce exactly one output event for every input event. If you need to output multiple records from a single Flume event, you must do this in the MorphlineSolrSink. With interceptors, it is strictly one event in and one event out.
The first command will most likely be the readLine command (or readBlob, readCSV, and so on) to convert the event into a Morphline Record. From there, you run any other Morphline commands you like, with the final Record at the end of the Morphline chain being converted back into a Flume event. We know from Chapter 4, Sinks and Sink Processors, that the _attachment_body special Record key should be byte[] (the same as the event's body). This means that the last command needs to do the conversion (such as toByteArray or writeAvroToByteArray). All other Record keys are converted to String values and set as headers with the same keys.
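As a minimal sketch of what the morph.conf for the goMorphy Morphline above might contain (the command names and their parameters should be checked against the Kite Morphlines reference guide before use):

morphlines : [
  {
    # Must match the morphlineId in the Flume configuration
    id : goMorphy
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Read the event body as a single line into the "message" field
      { readLine { charset : UTF-8 } }
      # ... any other transformation commands go here ...
      # Convert the "message" field back into the byte[] event body
      { toByteArray { field : message } }
    ]
  }
]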
Note
Take a look at the reference guide for a complete list of Morphline commands, their properties, and usage information at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html
The Morphline configuration file syntax has already been discussed in detail in the Morphline configuration files section of Chapter 4, Sinks and Sink Processors.
Custom interceptors
If there is one piece of custom code you will add to your Flume implementation, it will most likely be a custom interceptor. As mentioned earlier, you implement the org.apache.flume.interceptor.Interceptor interface and the associated org.apache.flume.interceptor.Interceptor.Builder interface.
Let's say I needed to URL-decode my event body. The code would look something like this:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class URLDecode implements Interceptor {
  public void initialize() {}

  public Event intercept(Event event) {
    try {
      // Decode the URL-encoded body and store it back on the event
      byte[] decoded = URLDecoder.decode(new String(event.getBody(), "UTF-8"), "UTF-8").getBytes("UTF-8");
      event.setBody(decoded);
    } catch (UnsupportedEncodingException e) {
      // Shouldn't happen. Fall through to the unaltered event.
    }
    return event;
  }

  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public void close() {}

  public static class Builder implements Interceptor.Builder {
    public Interceptor build() {
      return new URLDecode();
    }

    public void configure(Context context) {}
  }
}
Then, to configure my new interceptor, use the fully qualified class name of the Builder class as the type:
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=com.example.URLDecode$Builder
For more examples of how to pass and validate properties, look at any of the existing interceptor implementations in the Flume source code.
Keep in mind that any heavy processing in your custom interceptor can affect the overall throughput, so be mindful of object churn or computationally intensive processing in your implementations.
The plugins directory
Custom code (sources, interceptors, and so on) can always be installed alongside the Flume core classes in the $FLUME_HOME/lib directory. You could also specify additional paths for the CLASSPATH on startup by way of the flume-env.sh shell script (which is often sourced at startup time when using a packaged distribution of Flume). Starting with Flume 1.4, there is now a command-line option to specify a directory that contains the custom code. By default, if not specified, the $FLUME_HOME/plugins.d directory is used, but you can override this using the --plugins-path command-line parameter.
Within this directory, each piece of custom code is separated into a subdirectory whose name is of your choosing; pick something that makes it easy for you to keep track of things. In this directory, you can include up to three subdirectories: lib, libext, and native.
The lib directory should contain the JAR file for your custom component. Any dependencies it uses should be added to the libext subdirectory. Both of these paths are added to the Java CLASSPATH variable at startup.
Tip
The Flume documentation implies that this directory separation by component allows for conflicting Java libraries to coexist. In truth, this is not possible unless the underlying implementation makes use of different class loaders within the JVM. In this case, the Flume startup code simply appends all of these paths to the startup CLASSPATH variable, so the order in which the subdirectories are processed will determine the precedence. You cannot even be guaranteed that the subdirectories will be processed in lexicographic order, as the underlying bash shell for loop can give no such guarantees. In practice, you should always try to avoid conflicting dependencies. Code that depends too much on ordering tends to be buggy, especially if your classpath reorders itself from server installation to server installation.
The third directory, native, is where you put any native libraries associated with your custom component. This path, if it exists, gets added to LD_LIBRARY_PATH at startup so that the JVM can locate these native components.
So, if I had three custom components, my directory structure might look something like this:
$FLUME_HOME/plugins.d/base64-enc/lib/base64interceptor.jar
$FLUME_HOME/plugins.d/base64-enc/libext/base64-2.0.0.jar
$FLUME_HOME/plugins.d/base64-enc/native/libFoo.so
$FLUME_HOME/plugins.d/uuencode/lib/uuEncodingInterceptor.jar
$FLUME_HOME/plugins.d/my-avro-serializer/myAvroSerializer.jar
$FLUME_HOME/plugins.d/my-avro-serializer/native/libAvro.so
Keep in mind that all these paths get mashed together at startup, so conflicting versions of libraries should be avoided. This structure is an organizational mechanism for ease of deployment by humans. Personally, I have not used it yet, as I use Chef (or Puppet) to install custom components on my Flume agents directly into $FLUME_HOME/lib, but if you prefer to use this mechanism, it is available.
Tiering flows
In Chapter 1, Overview and Architecture, we talked about tiering your data flows. There are several reasons for you to want to do this. You may want to limit the number of Flume agents that directly connect to your Hadoop cluster, to limit the number of parallel requests. You may also lack sufficient disk space on your application servers to store a significant amount of data while you are performing maintenance on your Hadoop cluster. Whatever your reason or use case, the most common mechanism to chain Flume agents is to use the Avro source/sink pair.
The Avro source/sink
We covered Avro a bit in Chapter 4, Sinks and Sink Processors, when we discussed how to use it as an on-disk serialization format for files stored in HDFS. Here, we'll put it to use in communication between Flume agents. In a typical tiered configuration, each client agent runs an Avro sink that points at the Avro source of a collector agent.
To use the Avro source, you specify the type property with a value of avro. You need to provide a bind address and port number to listen on:
collector.sources=av1
collector.sources.av1.type=avro
collector.sources.av1.bind=0.0.0.0
collector.sources.av1.port=42424
collector.sources.av1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/path/in/hdfs
Here, we have configured the collector agent, which listens on port 42424, uses a memory channel, and writes to HDFS. I've used the memory channel for brevity in this example configuration. Also, note that I've given this agent a different name, collector, just to avoid confusion.
The client agents feeding the collector tier might have a configuration similar to this. I have left the sources off this configuration for brevity:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=42424
The hostname, collector.example.com, has nothing to do with the agent name on that machine; it is the hostname (or you can use an IP) of the target machine with the receiving Avro source. This configuration, named client, would be applied to each of the client agents, assuming all of them had similar source configurations.
As I don't like single points of failure, I would configure two collector agents with the preceding configuration and instead set each client agent to round robin between the two, using a sink group. Again, I've left off the sources for brevity:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1 k2
client.sinks.k1.type=avro
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collectorA.example.com
client.sinks.k1.port=42424
client.sinks.k2.type=avro
client.sinks.k2.channel=ch1
client.sinks.k2.hostname=collectorB.example.com
client.sinks.k2.port=42424
client.sinkgroups=g1
client.sinkgroups.g1.sinks=k1 k2
client.sinkgroups.g1.processor.type=load_balance
client.sinkgroups.g1.processor.selector=round_robin
client.sinkgroups.g1.processor.backoff=true
There are four additional properties associated with the Avro sink that you may need to adjust from their sensible defaults.
The first is the batch-size property, which defaults to 100. Under heavy loads, you may see better throughput by setting this higher, for example:
client.sinks.k1.batch-size=1024
The next two properties control network connection timeouts. The connect-timeout property, which defaults to 20 seconds (specified in milliseconds), is the amount of time allowed to establish a connection with an Avro source (receiver). The related request-timeout property, which also defaults to 20 seconds (specified in milliseconds), is the amount of time allowed for a sent message to be acknowledged. If I wanted to increase these values to 1 minute, I could add these additional properties:
client.sinks.k1.connect-timeout=60000
client.sinks.k1.request-timeout=60000
Finally, the reset-connection-interval property can be set to force connections to reestablish themselves after some time period. This can be useful when your sink is connecting through a VIP (Virtual IP address) or hardware load balancer to keep things balanced, as services offered behind the VIP may change over time due to failures or changes in capacity. By default, the connections will not be reset except in cases of failure. If you wanted to change this so that the connections reset themselves every hour, for example, you can specify this by setting this property with the number of seconds, as follows:
client.sinks.k1.reset-connection-interval=3600
Compressing Avro
Communication between the Avro source and sink can be compressed by setting the compression-type property to deflate. On the Avro sink, you can additionally set the compression-level property to a number between 1 and 9, with the default being 6. Typically, there are diminishing returns at higher compression levels, but testing may prove that a nondefault value works better for you. Clearly, you need to weigh the additional CPU costs against the overhead of performing a higher level of compression. Typically, the default is fine.
Compressed communications are especially important when you are dealing with high-latency networks, such as sending data between two data centers. Another common use case for compression is where you are charged for the bandwidth consumed, such as with most public cloud services. In these cases, you will probably choose to spend CPU cycles compressing and decompressing your flow rather than sending highly compressible data uncompressed.
Tip
It is very important that if you set the compression-type property on one end of a source/sink pair, you set this property at both ends. Otherwise, a sink could be sending data the source can't consume.
Continuing the preceding example, to add compression, you would add these additional property fields on both agents:
collector.sources.av1.compression-type=deflate
client.sinks.k1.compression-type=deflate
client.sinks.k1.compression-level=7
client.sinks.k2.compression-type=deflate
client.sinks.k2.compression-level=7
SSL Avro flows
Sometimes, communication is of a sensitive nature or may traverse untrusted network paths. You can encrypt your communications between an Avro sink and an Avro source by setting the ssl property to true. Additionally, you will need to pass SSL certificate information to the source (the receiver) and, optionally, the sink (the sender).
I won't claim to be an expert in SSL and SSL certificates, but the general idea is that a certificate can either be self-signed or signed by a trusted third party, such as VeriSign (called a certificate authority, or CA). Generally, because of the cost associated with getting a verified certificate, people only do this for certificates a customer might see in a web browser. Some organizations will have an internal CA who can sign certificates. For the purpose of this example, we'll be using self-signed certificates. This basically means that we'll generate a certificate but not have it signed by any authority. For this, we'll use the keytool utility that comes with all Java installations (the reference can be found at http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html). Here, I create a Java KeyStore (JKS) file that contains a 2048-bit key. When prompted for a password, I'll use the password string. Make sure you use a real password in your nontest environments:
% keytool -genkey -alias flumey -keyalg RSA -keystore keystore.jks -keysize 2048
Enter keystore password: password
Re-enter new password: password
What is your first and last name?
  [Unknown]: Steve Hoffman
What is the name of your organizational unit?
  [Unknown]: Operations
What is the name of your organization?
  [Unknown]: Me, Myself and I
What is the name of your City or Locality?
  [Unknown]: Chicago
What is the name of your State or Province?
  [Unknown]: Illinois
What is the two-letter country code for this unit?
  [Unknown]: US
Is CN=Steve Hoffman, OU=Operations, O="Me, Myself and I", L=Chicago, ST=Illinois, C=US correct?
  [no]: yes
Enter key password for <flumey>
  (RETURN if same as keystore password):
I can verify the contents of the JKS by running the list subcommand:
% keytool -list -keystore keystore.jks
Enter keystore password: password
Keystore type: JKS
Keystore provider: SUN
Your keystore contains 1 entry
flumey, Nov 11, 2014, PrivateKeyEntry,
Certificate fingerprint (SHA1): 5C:BC:3C:7F:7A:E7:77:EB:B5:54:FA:E2:8B:DD:D3:66:36:86:DE:E4
Now that I have a key in my keystore file, I can set the additional properties on the receiving source:
collector.sources.av1.ssl=true
collector.sources.av1.keystore=/path/to/keystore.jks
collector.sources.av1.keystore-password=password
As I am using a self-signed certificate, the sink won't be sure that communications can be trusted, so I have two options. The first is to tell it to just trust all certificates. Clearly, you would only do this on a private network that has some reasonable assurances about its security. To ignore the dubious origin of my certificate, I can set the trust-all-certs property to true, as follows:
client.sinks.k1.ssl=true
client.sinks.k1.trust-all-certs=true
If you didn't set this, you'd see something like this in the logs:
org.apache.flume.EventDeliveryException: Failed to send events…
Caused by: javax.net.ssl.SSLHandshakeException: General SSLEngine problem…
Caused by: sun.security.validator.ValidatorException: No trusted certificate found…
For communications over the public Internet or in multitenant cloud environments, more care should be taken. In this case, you can provide a truststore to the sink. A truststore is similar to a keystore, except that it contains certificate authorities you are telling Flume can be trusted. For well-known certificate authorities, such as VeriSign and others, you don't need to specify a truststore, as their identities are already included in your Java distribution (at least for the major ones). You will need to have your certificate signed by a certificate authority and then add the signed certificate file to the source's keystore file. You will also need to include the signing CA's certificate in the keystore so that the certificate chain can be fully resolved.
Once you have added the key, the signed certificate, and the signing CA's certificate to the source, you need to configure the sink (the sender) to trust these authorities so that it will pass the validation step during the SSL handshake. To specify the truststore, set the truststore and truststore-password properties as follows:
client.sinks.k1.ssl=true
client.sinks.k1.truststore=/path/to/truststore.jks
client.sinks.k1.truststore-password=password
Chances are there is somebody in your organization responsible for obtaining the third-party certificates, or someone who can issue an organizational CA-signed certificate to you, so I'm not going to go into details about how to create your own certificate authority. There is plenty of information on the Internet if you choose to take this route. Remember that certificates have an expiration date (usually, a year), so you'll need to repeat this process every year, or communications will abruptly stop when certificates or certificate authorities expire.
The Thrift source/sink
Another source/sink pair you can use to tier your data flows is based on Thrift (http://thrift.apache.org/). Unlike Avro data, which is self-documenting, Thrift uses an external schema, which can be compiled into just about every programming language on the planet. The configuration is almost identical to the Avro example already covered. The preceding collector configuration, rewritten to use Thrift, would look something like this:
collector.sources=th1
collector.sources.th1.type=thrift
collector.sources.th1.bind=0.0.0.0
collector.sources.th1.port=42324
collector.sources.th1.channels=ch1
collector.channels=ch1
collector.channels.ch1.type=memory
collector.sinks=k1
collector.sinks.k1.type=hdfs
collector.sinks.k1.channel=ch1
collector.sinks.k1.hdfs.path=/path/in/hdfs
There is one additional property that sets the maximum number of worker threads in the underlying thread pool. If unset, a value of zero is assumed, which makes the thread pool unbounded, so you should probably set this in your production configurations. Testing should provide you with a reasonable upper limit for your environment. To set the thread pool size to 10, for example, you would add the threads property:
collector.sources.th1.threads=10
The "client" agents would use the corresponding Thrift sink as follows:
client.channels=ch1
client.channels.ch1.type=memory
client.sinks=k1
client.sinks.k1.type=thrift
client.sinks.k1.channel=ch1
client.sinks.k1.hostname=collector.example.com
client.sinks.k1.port=42324
Like its Avro counterpart, there are additional settings for the batch size, connection timeouts, and connection reset intervals that you may want to adjust based on your testing results. Refer to The Avro source/sink section earlier in this chapter for details.
Using command-line Avro
The Avro source can also be used in conjunction with one of the command-line options you may have noticed back in Chapter 2, A Quick Start Guide to Flume. Rather than running flume-ng with the agent parameter, you can pass the avro-client parameter to send one or more files to an Avro source. These are the options specific to avro-client from the help text:
avro-client options:
  --dirname <dir>         directory to stream to avro source
  --host,-H <host>        hostname to which events will be sent (required)
  --port,-p <port>        port of the avro source (required)
  --filename,-F <file>    text file to stream to avro source [default: std input]
  --headerFile,-R <file>  headerFile containing headers as key/value pairs on each new line
  --help,-h               display help text
This variation is very useful for testing, resending data manually due to errors, or importing older data stored elsewhere.
Just like an Avro sink, you have to specify the hostname and port you will be sending data to. You can send a single file with the --filename option or all the files in a directory with the --dirname option. If you specify neither of these, stdin will be used. Here is how you might send a file named foo.log to the Flume agent we previously configured:
$ ./flume-ng avro-client --filename foo.log --host collector.example.com --port 42424
Each line of the input will be converted into a single Flume event.
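Since stdin is the default when no file or directory is given, you can also pipe data in; for instance, this illustrative one-liner (the log path is an assumption) would replay the last 100 lines of a log:

$ tail -n 100 /var/log/app.log | ./flume-ng avro-client --host collector.example.com --port 42424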
Optionally, you can specify a file containing key/value pairs to set Flume header values. The file uses Java property file syntax. Suppose I had a file named headers.properties containing:
pointOfSale=US
environment=staging
Then, including the --headerFile option would set these two headers on every event created:
$ ./flume-ng avro-client --filename foo.log --headerFile headers.properties --host collector.example.com --port 42424
The Log4J appender
As we discussed in Chapter 5, Sources and Channel Selectors, there are issues that may arise from using a filesystem file as a source. One way to avoid this problem is to use the Flume Log4J Appender in your Java application(s). Under the hood, it uses the same Avro communication that the Avro sink uses, so you need only configure it to send data to an Avro source.
The Appender has two properties, which are shown here in XML:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.Log4jAppender">
<paramname="Hostname"value="collector.example.com"/>
<paramname="Port"value="42424"/>
</appender>
The format of the body will be dictated by the Appender's configured layout (not shown). The log4j fields that get mapped to Flume headers are summarized in this table:
Flume header key                     Log4J logging event field
flume.client.log4j.logger.name       event.getLoggerName()
flume.client.log4j.log.level         event.getLevel() as a number. See org.apache.log4j.Level for the mappings.
flume.client.log4j.timestamp         event.getTimeStamp()
flume.client.log4j.message.encoding  N/A (always UTF8)
flume.client.log4j.logger.other      You will only see this if there was a problem mapping one of the above fields; normally, this won't be present.
Refer to http://logging.apache.org/log4j/1.2/ for more details on using Log4J.
You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender.
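If your application configures Log4J with a properties file rather than XML, a hedged equivalent of the preceding appender might look like this (the root logger level and layout pattern are illustrative choices):

log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=collector.example.com
log4j.appender.flume.Port=42424
log4j.appender.flume.layout=org.apache.log4j.PatternLayout
log4j.appender.flume.layout.ConversionPattern=%m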
Keep in mind that if there is a problem sending data to the Avro source, the appender will throw an exception and the log message will be dropped, as there is no place to put it. Keeping it in memory could quickly overload your JVM heap, which is usually considered worse than dropping the data record.
The Log4J load-balancing appender
I'm sure you noticed that the preceding Log4jAppender only has a single hostname/port in its configuration. If you wanted to spread the load across multiple collector agents, either for additional capacity or for fault tolerance, you can use the LoadBalancingLog4jAppender. This appender has a single required property named Hosts, which is a space-separated list of hostname and port number pairs separated by a colon, as follows:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
<paramname="Hosts"value="server1:42424server2:42424"/>
</appender>
There is an optional property, Selector, which specifies the method by which you want to load balance. Valid values are RANDOM and ROUND_ROBIN. If not specified, the default is ROUND_ROBIN. You can implement your own selector, but that is outside the scope of this book. If you are interested, go have a look at the well-documented source code for the LoadBalancingLog4jAppender class.
Finally, there is another optional property to override the maximum time for exponential backoff when a server cannot be contacted. Initially, if a server cannot be contacted, 1 second will need to pass before that server is tried again. Each time the server is unavailable, the retry time doubles, up to a default maximum of 30 seconds. If we wanted to increase this maximum to 2 minutes, we could specify a MaxBackoff property in milliseconds, as follows:
<appendername="FLUME"
class="org.apache.flume.clients.log4jappender.LoadBalancingLog4jAppender">
<paramname="Hosts"value="server1:42424server2:42424"/>
<paramname="Selector"value="RANDOM"/>
<paramname="MaxBackoff"value="120000"/>
</appender>
In this example, we have also overridden the default ROUND_ROBIN selector to use random selection.
The embedded agent
If you are writing a Java program that creates data, you may choose to send the data directly as structured data using a special mode of Flume called the Embedded Agent. It is basically a simple single source/single channel Flume agent that you run inside your JVM.
There are benefits and drawbacks to this approach. On the positive side, you don't need to monitor an additional process on your servers to relay data. The embedded channel also allows the data producer to continue executing its code immediately after queuing the event to the channel. The SinkRunner thread handles taking events from the channel and sending them to the configured sinks. Even if you didn't use embedded Flume to perform this handoff from the calling thread, you would most likely use some kind of synchronized queue (such as BlockingQueue) to isolate the sending of the data from the main execution thread. Using embedded Flume provides the same functionality without you having to worry about whether you've written your multithreaded code correctly.
The major drawback to embedding Flume in your application is the added memory pressure on your JVM's garbage collector. If you are using an in-memory channel, any unsent events are held in the heap and get in the way of cheap garbage collection of short-lived objects. However, this is why you set a channel size: to keep the maximum memory footprint at a known quantity. Furthermore, any configuration changes will require an application restart, which can be problematic if your overall system doesn't have sufficient redundancy built in to tolerate restarts.
Configuration and startup
Assuming you are not dissuaded (and you shouldn't be), you will first need to include the flume-ng-embedded-agent library (and dependencies) in your Java project. Depending on which build system you are using (Maven, Ivy, Gradle, and so on), the exact format will differ, so I'll just show you the Maven configuration here. You can look up the alternative formats at http://mvnrepository.com/:
<dependency>
<groupId>org.apache.flume</groupId>
<artifactId>flume-ng-embedded-agent</artifactId>
<version>1.5.2</version>
</dependency>
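For reference, the equivalent dependency in a Gradle build script of this era would look something like this (assuming you have already applied the java plugin and configured a repository; this is my illustration, not from the book):
dependencies {
    compile 'org.apache.flume:flume-ng-embedded-agent:1.5.2'
}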
Start by creating an EmbeddedAgent object in your Java code by calling the constructor (and passing a string name, which is used only in error messages):
EmbeddedAgent toHadoop = new EmbeddedAgent("myData");
Next, you have to set properties for the channel, sinks, and sink processor via the configure() method, as shown in this example. Most of this configuration should look very familiar to you at this point:
Map<String, String> config = new HashMap<String, String>();
config.put("channel.type", "memory");
config.put("channel.capacity", "75");
config.put("sinks", "s1 s2");
config.put("sink.s1.type", "avro");
config.put("sink.s1.hostname", "foo.example.com");
config.put("sink.s1.port", "12345");
config.put("sink.s1.compression-type", "deflate");
config.put("sink.s2.type", "avro");
config.put("sink.s2.hostname", "bar.example.com");
config.put("sink.s2.port", "12345");
config.put("sink.s2.compression-type", "deflate");
config.put("processor.type", "failover");
config.put("processor.priority.s1", "10");
config.put("processor.priority.s2", "20");
toHadoop.configure(config);
Here, we define a memory channel with a capacity of 75 events, along with two Avro sinks in an active/standby configuration (first to foo.example.com, and if that fails, to bar.example.com) using Avro serialization with compression. Refer to Chapter 3, Channels, for specific settings for memory- or file-backed channel properties.
The sink processor only comes into play if you have more than one sink defined in your sinks property. Unlike the optional sink groups covered in Chapter 4, Sinks and Sink Processors, you need to specify a sink list even if it contains just one sink. Clearly, there is no behavioral difference between failover and load balancing when there is only one sink. Refer to each specific sink's properties in Chapter 4, Sinks and Sink Processors, for specific configuration parameters.
Finally, before you start using this agent, you need to call the start() method to instantiate everything based on your properties and start all the background processing threads, as follows:
toHadoop.start();
The class is now ready to start receiving and forwarding data.
You'll want to keep a reference to this object around for cleanup later, as well as to pass it to other objects that will be sending data, as the underlying channel provides thread safety. Personally, I use the Spring Framework's dependency injection to configure and pass references at startup rather than doing it programmatically, as I've shown in this example. Refer to the Spring website for more information (http://spring.io/), as proper use of Spring is a whole book unto itself, and it is not the only dependency injection framework available to you.
Sending data
Data can be sent either as single events or in batches. Batches can sometimes be more efficient in high-volume data streams.
To send a single event, just call the put() method, as shown in this example:
Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF-8"));
toHadoop.put(e);
Here,I’musingoneofmanymethodsavailableontheorg.apache.flume.event.EventBuilderclass.Hereisacompletelistofmethodsyoucanusetoconstructeventswiththishelperclass:
EventBuilder.withBody(byte[]body,Map<String,String>headers);
EventBuilder.withBody(byte[]body);
EventBuilder.withBody(Stringbody,Charsetcharset,Map<String,String>
headers);
EventBuilder.withBody(Stringbody,Charsetcharset);
In order to send several events in one batch, you can call the putAll() method with a list of events, as shown in this example, which also includes a time header:
Map<String, String> headers = new HashMap<String, String>();
headers.put("timestamp", Long.toString(System.currentTimeMillis()));
List<Event> events = new ArrayList<Event>();
events.add(EventBuilder.withBody("First".getBytes("UTF-8"), headers));
events.add(EventBuilder.withBody("Second".getBytes("UTF-8"), headers));
toHadoop.putAll(events);
As interceptors are not supported, you will need to add any headers you want on the events programmatically in Java before you call put() or putAll().
Shutdown
Finally, as there are background threads waiting for data to arrive on your embedded channel, most likely holding persistent connections to the configured destinations, you'll want Flume to do its cleanup when it is time to shut down your application. Simply call the stop() method on the configured EmbeddedAgent as follows:
toHadoop.stop();
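Putting the whole lifecycle together, here is a minimal, self-contained sketch of an embedded agent. The configuration values are the ones from this section; the class name and the simplified single-sink setup are mine, for brevity:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class EmbeddedAgentExample {
  public static void main(String[] args) throws EventDeliveryException {
    // The name is used only in error messages.
    EmbeddedAgent toHadoop = new EmbeddedAgent("myData");

    // Memory channel plus a single Avro sink; a sink list is
    // required even when there is only one sink.
    Map<String, String> config = new HashMap<String, String>();
    config.put("channel.type", "memory");
    config.put("channel.capacity", "75");
    config.put("sinks", "s1");
    config.put("sink.s1.type", "avro");
    config.put("sink.s1.hostname", "foo.example.com");
    config.put("sink.s1.port", "12345");

    toHadoop.configure(config);
    toHadoop.start(); // instantiate everything, start background threads

    try {
      Event e = EventBuilder.withBody("Hello Hadoop", Charset.forName("UTF-8"));
      toHadoop.put(e); // returns once the event is queued on the channel
    } finally {
      toHadoop.stop(); // clean up connections and threads on shutdown
    }
  }
}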
Routing
The routing of data to different destinations based on content should be fairly straightforward now that you've been introduced to all the various mechanisms in Flume.
The first step is to get the data you want to switch on into a Flume header by means of a source-side interceptor, if the header isn't already available. The second step is to use a Multiplexing Channel Selector on that header value to switch the data to an alternate channel.
For instance, let's say you wanted to capture all exceptions to HDFS. In this configuration, you can see events coming in on the s1 source via Avro on port 42424. The event is tested to see whether the body contains the text Exception. If it does, it creates an exception header key (with the value Exception). This header is used to switch these events to channel c1, and ultimately, HDFS. If the event didn't match the pattern, it wouldn't have the exception header and would get passed to the c2 channel via the default selector, where it would be forwarded via Avro serialization to port 12345 on the server foo.example.com:
agent.sources=s1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=42424
agent.sources.s1.channels=c1 c2
agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=regex_extractor
agent.sources.s1.interceptors.i1.regex=(Exception)
agent.sources.s1.interceptors.i1.serializers=ex
agent.sources.s1.interceptors.i1.serializers.ex.name=exception
agent.sources.s1.selector.type=multiplexing
agent.sources.s1.selector.header=exception
agent.sources.s1.selector.mapping.Exception=c1
agent.sources.s1.selector.default=c2
agent.channels=c1 c2
agent.channels.c1.type=memory
agent.channels.c2.type=memory
agent.sinks=k1 k2
agent.sinks.k1.type=hdfs
agent.sinks.k1.channel=c1
agent.sinks.k1.hdfs.path=/logs/exceptions/%Y/%m/%d/%H
agent.sinks.k2.type=avro
agent.sinks.k2.channel=c2
agent.sinks.k2.hostname=foo.example.com
agent.sinks.k2.port=12345
Summary
In this chapter, we covered various interceptors shipped with Flume, including:
Timestamp: This is used to add a timestamp header, possibly overwriting an existing one.
Host: This is used to add the Flume agent hostname or IP as a header in the event.
Static: This is used to add static String headers.
Regular expression filtering: This is used to include or exclude events based on a matched regular expression.
Regular expression extractor: This is used to create headers from matched regular expressions. It's useful for routing with Channel Selectors.
Morphline: This is used to delegate transformation to a Morphline command chain.
Custom: This is used to create any custom transformations you need that you can't find elsewhere.
We also covered tiering data flows using the Avro source and sink. Optional compression and SSL with Avro flows were covered as well. Finally, Thrift sources and sinks were briefly covered, as some environments may already have Thrift data flows to integrate with.
Next, we introduced two Log4J Appenders, a single-path version and a load-balancing version, for direct integration with Java applications.
The Embedded Flume Agent was covered for those wishing to directly integrate basic Flume functionality into their Java applications.
Finally, we gave you an example of using interceptors in conjunction with a Channel Selector to provide routing decision logic.
In the next chapter, we will dive into an example that stitches together everything we have covered so far.
Chapter 7. Putting It All Together
Now that we've walked through all the components and configurations, let's put together a working end-to-end configuration. This example is by no means exhaustive, nor does it cover every possible scenario you might need, but I think it should cover a couple of common use cases I've seen over and over:
Finding errors by searching logs across multiple servers in near real time
Streaming data to HDFS for long-term batch processing
In the first situation, your systems may be impaired, and you have multiple places where you need to search for problems. Bringing all of those logs to a single place that you can search means getting your systems restored quickly. In the second scenario, you are interested in capturing data in the long term for analytics and machine learning.
Weblogs to searchable UI
Let's simulate a web application by setting up a simple web server whose logs we want to stream into some searchable application. In this case, we'll be using a Kibana UI to perform ad hoc queries against Elasticsearch.
For this example, I'll start three servers in Amazon's Elastic Compute Cloud (EC2), as shown in this diagram:
Each server has a public IP (starting with 54) and a private IP (starting with 172). For interserver communication, I'll be using the private IPs in my configurations. My personal interaction with the web server (to simulate traffic) and with Kibana and Elasticsearch will require the public IPs, since I'm sitting in my house and not in Amazon's data centers.
Note: Pay careful attention to the IP addresses in the shell prompts, as we will be jumping from machine to machine, and I don't want you to get lost. For instance, on the Collector box, the prompt will contain its private IP: [ec2-user@ip-172-31-26-205 ~]$
If you try this out yourself in EC2, you will get different IP assignments, so adjust the configuration files and URLs referenced as needed. For this example, I'll be using Amazon's Linux AMI for the operating system and the t1.micro server size. Also, be sure to adjust the security groups to allow network traffic between these servers (and yourself). When in doubt about connectivity, use utilities such as telnet or nc to test connections between servers and ports.
Tip: While going through this exercise, I made mistakes in my security groups more than once, so perform these connectivity tests whenever you see connection exceptions.
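For example, a quick TCP reachability check from the web server to the collector's Avro port (using this example's private IP and port; the exact flags depend on which netcat variant your distribution ships) might look like this:
[ec2-user@ip-172-31-18-146 ~]$ nc -zv 172.31.26.205 12345
nc will report whether the connection succeeded, which tells you quickly whether a security group or a stopped agent is the problem.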
Setting up the web server
Let's start by setting up the web server. For this, we'll install the popular Nginx web server (http://nginx.org). Since web server logs are pretty much standardized nowadays, I could have chosen any web server, but the point here is that this application writes its logs (by default) to a file on the disk. Since I've warned you away from trying to use the tail program to stream them, we'll be using a combination of logrotate and the Spooling Directory Source to ingest the data. First, let's log in to the server. The default account with sudo for an Amazon Linux server is ec2-user. You'll need to pass this in your ssh command:
% ssh 54.69.112.61 -l ec2-user
The authenticity of host '54.69.112.61 (54.69.112.61)' can't be established.
RSA key fingerprint is dc:ee:7a:1a:05:e1:36:cd:a4:81:03:97:48:8d:b3:cc.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.69.112.61' (RSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2014.09-release-notes/
18 package(s) needed for security, out of 41 available
Run "sudo yum update" to apply all updates.
Now that you are logged in to the server, let's install Nginx (I'm not going to show all of the output to save paper):
[ec2-user@ip-172-31-18-146 ~]$ sudo yum -y install nginx
Loaded plugins: priorities, update-motd, upgrade-helper
Resolving Dependencies
--> Running transaction check
---> Package nginx.x86_64 1:1.6.2-1.22.amzn1 will be installed
[SNIP]
Installed:
  nginx.x86_64 1:1.6.2-1.22.amzn1
Dependency Installed:
  GeoIP.x86_64 0:1.4.8-1.5.amzn1           gd.x86_64 0:2.0.35-11.10.amzn1
  gperftools-libs.x86_64 0:2.0-11.5.amzn1  libXpm.x86_64 0:3.5.10-2.9.amzn1
  libunwind.x86_64 0:1.1-2.1.amzn1
Complete!
Next, let's start the Nginx web server:
[ec2-user@ip-172-31-18-146 ~]$ sudo /etc/init.d/nginx start
Starting nginx: [ OK ]
At this point, the web server should be running, so let's use our computer's web browser to go to the public IP, http://54.69.112.61/ in this case. If it is working, you should see the default welcome page, like this:
To help generate sample input, you can use a load generator program such as Apache benchmark (http://httpd.apache.org/docs/2.4/programs/ab.html) or wrk (https://github.com/wg/wrk). Here is the command I used to generate data for 20 minutes at a time:
% wrk -c 2 -d 20m http://54.69.112.61
Running 20m test @ http://54.69.112.61
  2 threads and 2 connections
Configuring log rotation to the spool directory
By default, the server's access logs are written to /var/log/nginx/access.log. The problem we need to overcome here is moving the log data to a separate directory while the file is still held open by the nginx process. This is where logrotate comes into the picture. Let's first create a spool directory that will become the input path for the Spooling Directory Source later on. For the purpose of this example, I'll just create the spool directory in the home directory of ec2-user. In a production environment, you would place the spool directory somewhere else:
[ec2-user@ip-172-31-18-146 ~]$ pwd
/home/ec2-user
[ec2-user@ip-172-31-18-146 ~]$ mkdir spool
Let's also create a logrotate configuration in the home directory of ec2-user. Open an editor and paste these contents (shown after the prompt) into a file called accessRotate.conf:
[ec2-user@ip-172-31-18-146 ~]$ cat accessRotate.conf
/var/log/nginx/access.log {
  missingok
  notifempty
  rotate 0
  copytruncate
  sharedscripts
  olddir /home/ec2-user/spool
  postrotate
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
  endscript
}
Let’sgooverthissothatyouunderstandwhatitisdoing.Thisisbynomeansacompleteintroductiontothelogrotateutility.Readtheonlinedocumentationathttp://linuxconfig.org/logrotateformoredetails.
Whenlogrotateruns,itwillcopy/var/log/nginx/access.logtothe/home/ec2-user/spooldirectory,butonlyifithasanonzerolength(meaning,thereisdatatosend).Therotate0commandtellslogrotatenottoremoveanyfilesfromthetargetdirectory(sincetheFlumesourcewilldothatafterithassuccessfullytransmittedeachfile).We’llmakeuseofthecopytruncatefeaturebecauseitkeepsthefileopenasfarasthenginxprocessisconcerned,whileresettingittozerolength.Thisavoidstheneedtosignalnginxthatarotationhasjustoccurred.Thedestinationoftherotatedfileinthespooldirectoryis/home/ec2-user/spool/access.log.1.Next,inthepostrotatescriptsection,wewillchangetheownershipofthefilefromtherootusertoec2-usersothattheFlumeagentcanreadit,andlater,deleteit.Finally,werenamethefilewithauniquetimestampsothatwhenthenextrotationoccurs,theaccess.log.1filewillnotexist.Thisalsomeansthatwe’llneedtoaddanexclusionruletotheFlumesourcelatertokeepitfromreadingaccess.log.1fileuntilthecopyingprocesshascompleted.Oncerenamed,itisreadytobeconsumedbyFlume.
Nowlet’stryrunningitindebugmode,whichwillbejustadryrunsothatwecanseewhatitcando.Normally,logrotatehasadailyrotationsettingthatpreventsitfromdoinganythingunlessenoughtimehaspassed:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -d /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log 1048576 bytes (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
log does not need rotating
not running postrotate script, since no logs were rotated
Therefore, we will also include the -f (force) flag to override the preceding behavior:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -df /home/ec2-user/accessRotate.conf
reading config file /home/ec2-user/accessRotate.conf
reading config info for /var/log/nginx/access.log
olddir is now /home/ec2-user/spool
Handling 1 logs
rotating pattern: /var/log/nginx/access.log forced from command line (no old logs will be kept)
olddir is /home/ec2-user/spool, empty log files are not rotated, old logs are removed
considering log /var/log/nginx/access.log
log needs rotating
rotating log /var/log/nginx/access.log, log->rotateCount is 0
dateext suffix '-20141207'
glob pattern '-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
renaming /home/ec2-user/spool/access.log.1 to /home/ec2-user/spool/access.log.2 (rotatecount 1, logstart 1, i 1),
renaming /home/ec2-user/spool/access.log.0 to /home/ec2-user/spool/access.log.1 (rotatecount 1, logstart 1, i 0),
copying /var/log/nginx/access.log to /home/ec2-user/spool/access.log.1
truncating /var/log/nginx/access.log
running postrotate script
running script with arg /var/log/nginx/access.log: "
    chown ec2-user:ec2-user /home/ec2-user/spool/access.log.1
    ts=$(date +%s)
    mv /home/ec2-user/spool/access.log.1 /home/ec2-user/spool/access.log.$ts
"
removing old log /home/ec2-user/spool/access.log.2
As you can see, logrotate will copy the log to the spool directory with the .1 extension as expected, followed by our script block at the end to change permissions and rename it with a unique timestamp.
Now let's run it again, but this time for real, without the debug flag, and then we will list the source and target directories to see what happened:
[ec2-user@ip-172-31-18-146 ~]$ sudo /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf
[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 188
-rw-r--r-- 1 ec2-user ec2-user 189344 Dec  7 17:27 access.log.1417973241
[ec2-user@ip-172-31-18-146 ~]$ ls -l /var/log/nginx/
total 4
-rw-r--r-- 1 root root   0 Dec  7 17:27 access.log
-rw-r--r-- 1 root root 520 Dec  7 16:44 error.log
[ec2-user@ip-172-31-18-146 ~]$
You can see the access log is empty and the old contents have been copied to the spool directory, with correct permissions and the filename ending in a unique timestamp.
Let’salsoverifythattheemptyaccesslogwillresultinnoactionifrunwithoutanydata.You’llneedtoincludethedebugflagtoseewhattheprocessisthinking:
[ec2-user@ip-172-31-18-146~]$sudo/usr/sbin/logrotate-df/home/ec2-
user/accessRotate.conf
readingconfigfileaccessRotate.conf
readingconfiginfofor/var/log/nginx/access.log
olddirisnow/home/ec2-user/spool
Handling1logs
rotatingpattern:/var/log/nginx/access.logforcedfromcommandline(1
rotations)
olddiris/home/ec2-user/spool,emptylogfilesarenotrotated,oldlogs
areremoved
consideringlog/var/log/nginx/access.log
logdoesnotneedrotating
Now that we have a working rotation script, we need something to run it periodically (not in debug mode) at some chosen interval. For this, we will use the cron daemon (http://en.wikipedia.org/wiki/Cron) by creating a file in the /etc/cron.d directory:
[ec2-user@ip-172-31-18-146 ~]$ cat /etc/cron.d/rotateLogsToSpool
# Move files to spool directory every 5 minutes
*/5 * * * * root /usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf
Here, I've indicated a five-minute interval and specified that the job run as the root user. Since the root user owns the original log file, we need elevated access to reassign ownership to ec2-user so that the file can be removed later by Flume.
Once you've saved this cron configuration file, you should see our script run every 5 minutes (along with other processes) by inspecting the cron daemon's log file:
[ec2-user@ip-172-31-18-146 ~]$ sudo tail /var/log/cron
Dec  7 17:45:01 ip-172-31-18-146 CROND[22904]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:50:01 ip-172-31-18-146 CROND[22929]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' started
Dec  7 17:54:01 ip-172-31-18-146 anacron[2426]: Job 'cron.weekly' terminated
Dec  7 17:55:01 ip-172-31-18-146 CROND[22955]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:00:01 ip-172-31-18-146 CROND[23046]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
Dec  7 18:01:01 ip-172-31-18-146 CROND[23060]: (root) CMD (run-parts /etc/cron.hourly)
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23060]: starting 0anacron
Dec  7 18:01:01 ip-172-31-18-146 run-parts(/etc/cron.hourly)[23069]: finished 0anacron
Dec  7 18:05:01 ip-172-31-18-146 CROND[23117]: (root) CMD (/usr/sbin/logrotate -f /home/ec2-user/accessRotate.conf)
An inspection of the spool directory also shows new files created at our arbitrarily chosen 5-minute interval:
[ec2-user@ip-172-31-18-146 ~]$ ls -l spool/
total 1072
-rw-r--r-- 1 ec2-user ec2-user 145029 Dec  7 18:02 access.log.1417975373
-rw-r--r-- 1 ec2-user ec2-user 164169 Dec  7 18:05 access.log.1417975501
-rw-r--r-- 1 ec2-user ec2-user 166779 Dec  7 18:10 access.log.1417975801
-rw-r--r-- 1 ec2-user ec2-user 613785 Dec  7 18:15 access.log.1417976101
Keep in mind that the Nginx RPM package also installed a rotation configuration located at /etc/logrotate.d/nginx to perform daily rotations. If you are going to use this example in production, you'll want to remove that configuration, since we don't want it clashing with our more frequently running cron script. I'll leave handling error.log as an exercise for you. You'll either want to send it someplace (and have Flume remove it) or rotate it periodically so that your disk doesn't fill up over time.
Setting up the target – Elasticsearch
Now let's move to the other server in our diagram and set up Elasticsearch, our destination for searchable data. I'm going to be using the instructions found at http://www.elasticsearch.org/overview/elkdownloads/. First, let's download and install the RPM package:
[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 18:25:35--  https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.92MB/s   in 2.5s
2014-12-07 18:25:38 (9.92 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]
[ec2-user@ip-172-31-26-120 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1    ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start
The installation is kind enough to tell me how to configure the service to start automatically on system boot-up, and it also tells me how to launch the service now, so let's do that:
[ec2-user@ip-172-31-26-120 ~]$ sudo /sbin/chkconfig --add elasticsearch
[ec2-user@ip-172-31-26-120 ~]$ sudo service elasticsearch start
Starting elasticsearch: [ OK ]
Let’sperformaquicktestwithcurltoverifythatitisrunningonthedefaultport,whichis9200:
[ec2-user@ip-172-31-26-120~]$curlhttp://localhost:9200/
{
"status":200,
"name":"Siege",
"cluster_name":"elasticsearch",
"version":{
"number":"1.4.1",
"build_hash":"89d3241d670db65f994242c8e8383b169779e2d4",
"build_timestamp":"2014-11-26T15:49:29Z",
"build_snapshot":false,
"lucene_version":"4.10.2"
},
"tagline":"YouKnow,forSearch"
}
Since the indexes are created as data comes in, and we haven't sent any data, we expect to see no indexes created yet:
[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index pri rep docs.count docs.deleted store.size pri.store.size
All we see is the header with no indexes listed, so let's head over to the collector server and set up the Flume relay that will write data to Elasticsearch.
Setting up Flume on collector/relay
The Flume configuration on the Flume collector will be compressed Avro coming in and Elasticsearch going out. Go ahead and log in to the collector server (172.31.26.205).
Let's start by downloading the Flume binary from the Apache website. Follow the download link and select a mirror:
[ec2-user@ip-172-31-26-205 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 19:50:30--  http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'
100%[==========================>] 25,323,459  8.38MB/s   in 2.9s
2014-12-07 19:50:33 (8.38 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]
Next, expand the archive and change directories:
[ec2-user@ip-172-31-26-205 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
Create the configuration file, called collector.conf, in Flume's configuration directory using your favorite editor:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es
collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.channels.m1.type=memory
collector.channels.m1.capacity=10000
collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120
Here, you can see that we are using a simple memory channel configured with a capacity of 10,000 events.
The source is configured to accept compressed Avro on port 12345 and pass it to our memory channel.
Finally, the sink is configured to write to Elasticsearch on the server we just set up at the 172.31.26.120 private IP. We are using the default settings, which means it'll write to the index named flume-YYYY-MM-DD with the log type.
Let's try running the Flume agent:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console
You'll see an exception in the log, including something like this:
2014-12-07 20:00:13,184 (conf-file-poller-0) [ERROR - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:145)] Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/elasticsearch/common/io/BytesStream
The Flume agent can't find the Elasticsearch classes. Remember that these are not packaged with Flume, as the libraries need to be compatible with the version of Elasticsearch you are running. Looking back at the Elasticsearch server, we can get an idea of what we need. Remember that this list includes many runtime server dependencies, so it is probably more than what you'll need for a functional Elasticsearch client:
[ec2-user@ip-172-31-26-120 ~]$ rpm -qil elasticsearch | grep jar
/usr/share/elasticsearch/lib/elasticsearch-1.4.1.jar
/usr/share/elasticsearch/lib/groovy-all-2.3.2.jar
/usr/share/elasticsearch/lib/jna-4.1.0.jar
/usr/share/elasticsearch/lib/jts-1.13.jar
/usr/share/elasticsearch/lib/log4j-1.2.17.jar
/usr/share/elasticsearch/lib/lucene-analyzers-common-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-core-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-expressions-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-grouping-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-highlighter-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-join-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-memory-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-misc-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queries-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-queryparser-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-sandbox-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-spatial-4.10.2.jar
/usr/share/elasticsearch/lib/lucene-suggest-4.10.2.jar
/usr/share/elasticsearch/lib/sigar/sigar-1.6.4.jar
/usr/share/elasticsearch/lib/spatial4j-0.4.1.jar
Really, you only need the elasticsearch.jar file and its dependencies, but we are going to be lazy and just download the RPM again on the collector machine, and copy the JAR files to Flume:
[ec2-user@ip-172-31-26-205 ~]$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
--2014-12-07 20:03:38--  https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.1.noarch.rpm
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26326154 (25M) [application/x-redhat-package-manager]
Saving to: 'elasticsearch-1.4.1.noarch.rpm'
100%[======================================>] 26,326,154  9.70MB/s   in 2.6s
2014-12-07 20:03:41 (9.70 MB/s) - 'elasticsearch-1.4.1.noarch.rpm' saved [26326154/26326154]
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh elasticsearch-1.4.1.noarch.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:elasticsearch-1.4.1-1    ################################# [100%]
### NOT starting on installation, please execute the following statements to configure elasticsearch to start automatically using chkconfig
 sudo /sbin/chkconfig --add elasticsearch
### You can start elasticsearch by executing
 sudo service elasticsearch start
This time we will not configure the service to start. Instead, we'll copy the JAR files we need into Flume's plugins directory architecture, which we learned about in the previous chapter:
[ec2-user@ip-172-31-26-205 ~]$ cd apache-flume-1.5.2-bin
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ mkdir -p plugins.d/elasticsearch/libext
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cp /usr/share/elasticsearch/lib/*.jar plugins.d/elasticsearch/libext/
Now try running the Flume agent again:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ ./bin/flume-ng agent -n collector -c conf -f conf/collector.conf -Dflume.root.logger=INFO,console
No exceptions this time around, but still no data. Let's go back to the web server machine and set up the final Flume agent.
Setting up Flume on the client
On the first server, the web server, let's download the Flume binaries again and expand the package:
[ec2-user@ip-172-31-18-146 ~]$ wget http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
--2014-12-07 20:12:53--  http://apache.arvixe.com/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz
Resolving apache.arvixe.com (apache.arvixe.com)... 198.58.87.82
Connecting to apache.arvixe.com (apache.arvixe.com)|198.58.87.82|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25323459 (24M) [application/x-gzip]
Saving to: 'apache-flume-1.5.2-bin.tar.gz'
100%[=========================>] 25,323,459  8.13MB/s   in 3.0s
2014-12-07 20:12:56 (8.13 MB/s) - 'apache-flume-1.5.2-bin.tar.gz' saved [25323459/25323459]
[ec2-user@ip-172-31-18-146 ~]$ tar -zxf apache-flume-1.5.2-bin.tar.gz
[ec2-user@ip-172-31-18-146 ~]$ cd apache-flume-1.5.2-bin
This time, our Flume configuration takes the spool directory we set up before, fed by logrotate, as input, and it needs to write compressed Avro to the collector server. Create the client.conf file using your favorite editor:
[ec2-user@ip-172-31-18-146 apache-flume-1.5.2-bin]$ cat conf/client.conf
client.sources=sd
client.channels=m1
client.sinks=av
client.sources.sd.type=spooldir
client.sources.sd.spoolDir=/home/ec2-user/spool
client.sources.sd.deletePolicy=immediate
client.sources.sd.ignorePattern=access.log.1$
client.sources.sd.channels=m1
client.channels.m1.type=memory
client.channels.m1.capacity=10000
client.sinks.av.type=avro
client.sinks.av.hostname=172.31.26.205
client.sinks.av.port=12345
client.sinks.av.compression-type=deflate
client.sinks.av.channel=m1
Again, for simplicity, we are using a memory channel with a 10,000-record capacity.
For the source, we configure the Spooling Directory Source with /home/ec2-user/spool as the input. Additionally, we configure the deletion policy to remove the files after sending is complete. This is also where we set up the exclusion rule for the access.log.1 filename pattern mentioned earlier. Note the dollar sign at the end of the pattern, denoting the end of the line. Without this, the exclusion pattern would also exclude valid files, such as access.log.1417975373.
Finally, an Avro sink is configured to point at the collector's private IP and port 12345. Additionally, we set the compression so that it matches the receiving Avro source's settings.
Nowlet’stryrunningtheagent:
[[email protected]]$./bin/flume-ngagent-n
client-cconf-fconf/client.conf-Dflume.root.logger=INFO,console
Noexceptions!Butmoreimportantly,Iseethelogfilesinthespooldirectorybeingprocessedanddeleted:
2014-12-07 20:59:04,041 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976401
2014-12-07 20:59:05,319 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417976701
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.deleteCurrentFile(ReliableSpoolingFileEventReader.java:390)] Preparing to delete file /home/ec2-user/spool/access.log.1417977001
2014-12-07 20:59:06,245 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
2014-12-07 20:59:06,746 (pool-4-thread-1) [INFO - org.apache.flume.source.SpoolDirectorySource$SpoolDirectoryRunnable.run(SpoolDirectorySource.java:254)] Spooling Directory Source runner has shutdown.
Once all the files have been processed, the last lines are repeated every 500 milliseconds. This is a known bug in Flume (https://issues.apache.org/jira/browse/FLUME-2385). It has already been fixed and is slated for the 1.6.0 release, so be sure to set up log rotation on your Flume agent's own log and clean up this mess before you run out of disk space.
On the collector box, we see writes occurring to Elasticsearch:
2014-12-07 22:10:03,694 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.execute(ElasticSearchTransportClient.java:181)] Sending bulk to elasticsearch cluster
Querying the Elasticsearch REST API, we can see an index with records:
[ec2-user@ip-172-31-26-120 ~]$ curl http://localhost:9200/_cat/indices?v
health status index            pri rep docs.count docs.deleted store.size pri.store.size
yellow open   flume-2014-12-07   5   1      32102            0      1.3mb          1.3mb
Let’sreadthefirst5records:
[ec2-user@ip-172-31-26-120elasticsearch]$curl-XGET
'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
"took":26,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":32102,
"max_score":1.0,
"hits":[{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVVbObD75ecNqJg",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJj",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJo",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJt",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
},{
"_index":"flume-2014-12-07",
"_type":"log",
"_id":"AUomjGVWbObD75ecNqJy",
"_score":1.0,
"_source":{"@message":"207.222.127.224--[07/Dec/2014:18:01:47
+0000]\"GET/HTTP/1.1\"2003770\"-\"\"-\"\"-\"","@fields":{}}
}]
}
}
As you can see, the log lines are in there under the @message field, and they can now be searched. However, we can do better. Let's break that message down into searchable fields.
Creating more search fields with an interceptor
Let's borrow some code from what we covered earlier in this book to extract some Flume headers from this common log format, knowing that all Flume headers will become fields in Elasticsearch. Since we are creating fields to be searched in Elasticsearch, I'm going to add them to the collector's configuration rather than the web server's Flume agent.
Change the agent configuration on the collector to include a Regular Expression Extractor interceptor:
[ec2-user@ip-172-31-26-205 apache-flume-1.5.2-bin]$ cat conf/collector.conf
collector.sources=av
collector.channels=m1
collector.sinks=es
collector.sources.av.type=avro
collector.sources.av.bind=0.0.0.0
collector.sources.av.port=12345
collector.sources.av.compression-type=deflate
collector.sources.av.channels=m1
collector.sources.av.interceptors=e1
collector.sources.av.interceptors.e1.type=regex_extractor
collector.sources.av.interceptors.e1.regex=^([\\d.]+) \\S+ \\S+ \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+)
collector.sources.av.interceptors.e1.serializers=ip dt url sc bc
collector.sources.av.interceptors.e1.serializers.ip.name=source
collector.sources.av.interceptors.e1.serializers.dt.type=org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
collector.sources.av.interceptors.e1.serializers.dt.pattern=dd/MMM/yyyy:HH:mm:ss Z
collector.sources.av.interceptors.e1.serializers.dt.name=timestamp
collector.sources.av.interceptors.e1.serializers.url.name=http_request
collector.sources.av.interceptors.e1.serializers.sc.name=status_code
collector.sources.av.interceptors.e1.serializers.bc.name=bytes_xfered
collector.channels.m1.type=memory
collector.channels.m1.capacity=10000
collector.sinks.es.type=org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es.channel=m1
collector.sinks.es.hostNames=172.31.26.120
To follow the Logstash convention, I've renamed the hostname header to source. Now, to make the new format easy to find, I'm going to delete the existing index:
[ec2-user@ip-172-31-26-120 ~]$ curl -XDELETE 'http://localhost:9200/flume-2014-12-07/'
{"acknowledged":true}
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "error" : "IndexMissingException[[flume-2014-12-07] missing]",
  "status" : 404
}
Next, I create more traffic on my web server and wait for it to appear at the other end. Then I query some records to see what they look like:
[ec2-user@ip-172-31-26-120 ~]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_search?pretty=true&q=*.*&size=5'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 12083,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_H",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:15 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:15.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975995000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_M",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_R",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_W",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    }, {
      "_index" : "flume-2014-12-07",
      "_type" : "log",
      "_id" : "AUomq4FnbObD75ecNx_a",
      "_score" : 1.0,
      "_source" : {"@message":"207.222.127.224 - - [07/Dec/2014:18:13:16 +0000] \"GET / HTTP/1.1\" 200 3770 \"-\" \"-\" \"-\"","@timestamp":"2014-12-07T18:13:16.000Z","@source":"207.222.127.224","@fields":{"timestamp":"1417975996000","status_code":"200","source":"207.222.127.224","http_request":"GET / HTTP/1.1","bytes_xfered":"3770"}}
    } ]
  }
}
As you can see, we now have additional fields that we can search by. Additionally, you can see that Elasticsearch has taken our millisecond-based timestamp field and created its own @timestamp field in ISO-8601 format.
Now we can run some more interesting queries, such as finding out how many successful (status 200) pages we saw:
[ec2-user@ip-172-31-26-120 elasticsearch]$ curl -XGET 'http://localhost:9200/flume-2014-12-07/_count' -d '{"query":{"term":{"status_code":"200"}}}'
{"count":46018,"_shards":{"total":5,"successful":5,"failed":0}}
Setting up a better user interface – Kibana
While it appears as if we are done, there is one more thing we should do. The data is all there, but you need to be an expert in querying Elasticsearch to make good use of the information. After all, if the data is difficult to consume and gets ignored, then why bother collecting it at all? What you really need is a nice, searchable web interface that humans with nontechnical backgrounds can use. For this, we are going to set up Kibana. In a nutshell, Kibana is a web application that runs as a dynamic HTML page in your browser, making calls for data to Elasticsearch when necessary. The result is an interactive web interface that doesn't require you to learn the details of the Elasticsearch query API. This is not the only option available to you; it is just what I'm using in this example. Let's download Kibana 3 from the Elasticsearch website and install it on the same server (although you could easily serve this from another HTTP server):
[ec2-user@ip-172-31-26-120 ~]$ wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
--2014-12-07 18:31:16--  https://download.elasticsearch.org/kibana/kibana/kibana-3.1.2.tar.gz
Resolving download.elasticsearch.org (download.elasticsearch.org)... 54.225.133.195, 54.243.77.158, 107.22.222.16, ...
Connecting to download.elasticsearch.org (download.elasticsearch.org)|54.225.133.195|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1074306 (1.0M) [application/octet-stream]
Saving to: 'kibana-3.1.2.tar.gz'
100%[======================================>] 1,074,306  1.33MB/s   in 0.8s
2014-12-07 18:31:17 (1.33 MB/s) - 'kibana-3.1.2.tar.gz' saved [1074306/1074306]
[ec2-user@ip-172-31-26-120 ~]$ tar -zxf kibana-3.1.2.tar.gz
[ec2-user@ip-172-31-26-120 ~]$ cd kibana-3.1.2
Note: At the time of writing this book, a newer version of Kibana (Kibana 4) is in beta. These instructions may be outdated by the time this book is released, but rest assured that somewhere in the Kibana setup, you can edit a configuration file to point to your Elasticsearch server. I have not tried out the newer version yet, but there is a good overview of it on the Elasticsearch blog at http://www.elasticsearch.org/blog/kibana-4-beta-3-now-more-filtery.
Open the config.js file to edit the line with the elasticsearch key. This needs to be the URL of the public name or IP of the Elasticsearch API. For our example, this line should look as shown here:
elasticsearch: 'http://ec2-54-148-230-252.us-west-2.compute.amazonaws.com:9200',
Now we need to serve this directory with a web server. Let's download and install Nginx again:
[ec2-user@ip-172-31-26-120 ~]$ sudo yum install nginx
By default, the root directory is /usr/share/nginx/html. We could change this configuration in Nginx, but to make things easy, let's just create a symbolic link to point to the right location. First, move the original path out of the way by renaming it:
[ec2-user@ip-172-31-26-120 ~]$ sudo mv /usr/share/nginx/html /usr/share/nginx/html.dist
Next, link the configured Kibana directory as the new web root:
[ec2-user@ip-172-31-26-120 ~]$ sudo ln -s ~/kibana-3.1.2 /usr/share/nginx/html
Finally, start the web server:
[ec2-user@ip-172-31-26-120 ~]$ sudo /etc/init.d/nginx start
Starting nginx: [ OK ]
From your computer, go to http://54.148.230.252/. If you see this page, it means you may have made a mistake in your Kibana configuration:
This error page means your web browser can't connect to Elasticsearch. You may need to clear your browser cache after you fix the configuration, as web pages are typically cached locally for some period of time. If you got it right, the screen should look like this:
Go ahead and select Sample Dashboard, the first option. You should see something like what is shown in the next screenshot. This includes some of the data we ingested, record counts in the center, and the filtering fields in the left-hand margin.
I'm not going to claim to be a Kibana expert (I'm far from that), so I'll leave further customization of this to you. Use this as a base to go back and make additional modifications to the data to make it easier to consume, search, or filter. To do that, you'll probably need to get more familiar with how Elasticsearch works, but that's okay because knowledge isn't a bad thing. A copious amount of documentation is waiting for you at http://www.elasticsearch.org/guide/en/kibana/current/index.html.
At this point, we have completed an end-to-end implementation of data from a web server streamed in a near real-time fashion to a web-based tool for searching. Since both the source and target formats were dictated by others, we used an interceptor to transform the data en route. This use case is very good for short-term troubleshooting, but it's clearly not very "Hadoopy" since we have yet to use core Hadoop.
Archiving to HDFS
When people speak of Hadoop, they usually refer to storing lots of data for a long time, usually in HDFS, so that more interesting data science or machine learning can be done later. Let's extend our use case by splitting the data flow at the collector to store an extra copy in HDFS for later use.
So, back in Amazon AWS, I start a fourth server to run Hadoop. If you plan on doing all your work in Hadoop, you'll probably want to write this data to S3, but for this example, let's stick with HDFS. Now our server diagram looks like this:
I used Cloudera's one-line installation instructions to speed up the setup. The instructions can be found at http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_qs_mrv1_pseudo.html.
Since the Amazon AMI is compatible with Enterprise Linux 6, I selected the EL6 RPM repository and imported the corresponding GPG key:
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Retrieving http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
Preparing...                  ################################# [100%]
Updating / installing...
   1:cloudera-cdh-5-0         ################################# [100%]
[ec2-user@ip-172-31-7-175 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
Next, I installed the pseudo-distributed configuration to run a single-node Hadoop cluster:
[ec2-user@ip-172-31-7-175 ~]$ sudo yum install -y hadoop-0.20-conf-pseudo
This might take a while, as it downloads all the Cloudera Hadoop distribution dependencies.
Since this configuration is for single-node use, we need to adjust the fs.defaultFS property in /etc/hadoop/conf/core-site.xml to advertise our private IP instead of localhost. If we don't do this, the namenode process will bind to 127.0.0.1, and other servers, such as our collector's Flume agent, will not be able to contact it:
<property>
<name>fs.defaultFS</name>
<value>hdfs://172.31.7.175:8020</value>
</property>
Next, we format the new HDFS volume and start the HDFS daemons (since that is all we need for this example):
[ec2-user@ip-172-31-7-175 ~]$ sudo -u hdfs hdfs namenode -format
[ec2-user@ip-172-31-7-175 ~]$ for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
starting datanode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-31-7-175.out
Started Hadoop datanode (hadoop-hdfs-datanode): [ OK ]
starting namenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-namenode-ip-172-31-7-175.out
Started Hadoop namenode: [ OK ]
starting secondarynamenode, logging to /var/log/hadoop-hdfs/hadoop-hdfs-secondarynamenode-ip-172-31-7-175.out
Started Hadoop secondarynamenode: [ OK ]
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -df
Filesystem                Size        Used   Available   Use%
hdfs://172.31.7.175:8020  8318783488  24576  6661824512  0%
Now that HDFS is running, let's go back to the collector box configuration to create a second channel and an HDFS Sink by adding these lines:
collector.channels.h1.type=memory
collector.channels.h1.capacity=10000
collector.sinks.hadoop.type=hdfs
collector.sinks.hadoop.channel=h1
collector.sinks.hadoop.hdfs.path=hdfs://172.31.7.175/access_logs/%Y/%m/%d/%H
collector.sinks.hadoop.hdfs.filePrefix=access
collector.sinks.hadoop.hdfs.rollInterval=60
collector.sinks.hadoop.hdfs.rollSize=0
collector.sinks.hadoop.hdfs.rollCount=0
Then we modify the top-level channels and sinks keys:
collector.channels=m1 h1
collector.sinks=es hadoop
As you can see, I've gone with a simple memory channel again, but feel free to use a durable file channel if you need it. For the HDFS configuration, I'll be using a dated file path under the /access_logs root directory with a 60-second rotation regardless of size. We are not altering the source just yet, so don't worry.
If we attempt to start the collector now, we see this exception:
java.lang.NoClassDefFoundError: org/apache/hadoop/io/SequenceFile$CompressionType
Remember that for Apache Flume to speak to HDFS, we need compatible HDFS classes and dependencies for the version of Hadoop we are speaking to. Let's get some help from our friends at Cloudera and install the Hadoop client RPM (output removed to save paper):
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm -ivh http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
[ec2-user@ip-172-31-26-205 ~]$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
[ec2-user@ip-172-31-26-205 ~]$ sudo yum install hadoop-client
Test that it works by creating the destination directory for our data. We'll also set permissions to match the account we'll be running the Flume agent under (ec2-user in this case):
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -mkdir hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ sudo -u hdfs hadoop fs -chown ec2-user:ec2-user hdfs://172.31.7.175/access_logs
[ec2-user@ip-172-31-26-205 ~]$ hadoop fs -ls hdfs://172.31.7.175/
Found 1 items
drwxr-xr-x   - ec2-user ec2-user          0 2014-12-11 04:35 hdfs://172.31.7.175/access_logs
Now, if we run the collector Flume agent, we see no exceptions due to missing HDFS classes. The Flume startup script detects that we have a local Hadoop installation and appends its classpath to its own.
Finally, we can go back to the Flume configuration file and split the incoming data at the source by listing both channels on the source as destination channels:
collector.sources.av.channels=m1 h1
By default, a replicating channel selector is used. This is what we want, so no further configuration is needed. Save the configuration file and restart the Flume agent.
When data is flowing, you should see the expected HDFS activity in the Flume collector agent's logs:
2014-12-11 05:05:04,141 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:261)] Creating hdfs://172.31.7.175/access_logs/2014/12/11/05/access.1418274302083.tmp
You should also see data appearing in HDFS:
[ec2-user@ip-172-31-7-175 ~]$ hadoop fs -ls -R /access_logs
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:05 /access_logs/2014/12/11
drwxr-xr-x   - ec2-user ec2-user      0 2014-12-11 05:06 /access_logs/2014/12/11/05
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302082
-rw-r--r--   3 ec2-user ec2-user  10486 2014-12-11 05:05 /access_logs/2014/12/11/05/access.1418274302083
-rw-r--r--   3 ec2-user ec2-user   6429 2014-12-11 05:06 /access_logs/2014/12/11/05/access.1418274302084
If you run this command from a server other than the Hadoop node, you'll need to specify the full HDFS URI (hdfs://172.31.7.175/access_logs) or set the fs.defaultFS property in Hadoop's core-site.xml configuration file.
Note: In retrospect, I probably should have configured the fs.defaultFS property on any installed Hadoop client that will use HDFS to save myself a lot of typing. Live and learn!
Like the preceding Elasticsearch implementation, you can now use this example as a base end-to-end configuration for HDFS. The next step will be to go back and modify the format, compression, and so on, on the HDFS Sink to better match what you'll do with the data later.
Summary
In this chapter, we iteratively assembled an end-to-end data flow. We started by setting up an Nginx web server to create access logs. We also configured cron to execute a logrotate configuration periodically to safely rotate old logs to a spooling directory.
Next, we installed and configured a single-node Elasticsearch server and tested some insertions and deletions. Then we configured a Flume client to read input from our spooling directory filled with weblogs and relay them to a Flume collector using compressed Avro serialization. The collector then relayed the incoming data to our Elasticsearch server.
Once we saw data flowing from one end to the other, we set up a single-node HDFS server and modified our collector configuration to split the input data feed and relay a copy of the messages to HDFS, simulating archival storage. Finally, we set up a Kibana UI in front of our Elasticsearch instance to provide an easy search function for nontechnical consumers.
In the next chapter, we will cover monitoring Flume data flows using Ganglia.
Chapter 8. Monitoring Flume
The user guide for Flume states:
"Monitoring in Flume is still a work in progress. Changes can happen very often. Several Flume components report metrics to the JMX platform MBean server. These metrics can be queried using Jconsole."
While JMX is fine for casual browsing of metric values, the number of eyeballs looking at Jconsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. What you need is a way to watch everything at once. However, what are the important things to look for? That is a very difficult question, but I'll try to cover several of the items that are important as we cover monitoring options in this chapter.
Monitoring the agent process
The most obvious type of monitoring you'll want to perform is Flume agent process monitoring, that is, making sure the agent is still running. There are many products that do this kind of process monitoring, so there is no way we can cover them all. If you work at a company of any reasonable size, chances are there is already a system in place for this. If this is the case, do not go off and build your own. The last thing operations wants is yet another screen to watch 24/7.
Monit
If you do not already have something in place, one freemium option is Monit (http://mmonit.com/monit/). The developers of Monit have a paid version that provides more bells and whistles you may want to consider. Even in the free form, it can provide you with a way to check whether the Flume agent is running, restart it if it isn't, and send you an e-mail when this happens so that you can look into why it died.
Monit does much more, but this functionality is what we will cover here. If you are smart, and I know you are, you will add checks on the disk, CPU, and memory usage as a minimum, in addition to what we cover in this chapter.
Nagios
Another option for Flume agent process monitoring is Nagios (http://www.nagios.org/). Like Monit, you can configure it to watch your Flume agents and alert you via a web UI, e-mail, or an SNMP trap. That said, it doesn't have restart capabilities. The community is quite strong, and there are many plugins available for other applications.
My company uses this to check the availability of Hadoop web UIs. While not a complete picture of health, it does provide more information to the overall monitoring of our Hadoop ecosystem.
Again, if you already have tools in place at your company, see whether you can reuse them before bringing in another tool.
Monitoring performance metrics
Now that we have covered some options for process monitoring, how do you know whether your application is actually doing the work you think it is? On many occasions, I've seen a stuck syslog-ng process that appeared to be running, but it just wasn't sending any data. I'm not picking on syslog-ng specifically; all software does this when conditions it was not designed for occur.
When talking about Flume data flows, you need to monitor the following:
Data entering sources is within expected rates
Data isn't overflowing your channels
Data is exiting sinks at expected rates
Flume has a pluggable monitoring framework, but as mentioned at the beginning of the chapter, it is still very much a work in progress. This does not mean you shouldn't use it, as that would be foolish. It means you'll want to prepare extra testing and integration time any time you upgrade.
Note: While not covered in the Flume documentation, it is common to enable JMX in your Flume JVM (http://bit.ly/javajmx) and use the Nagios JMX plugin (http://bit.ly/nagiosjmx) to alert you about performance abnormalities in your Flume agents.
Ganglia
One of the available monitoring options to watch Flume internal metrics is Ganglia integration. Ganglia (http://ganglia.sourceforge.net/) is an open source monitoring tool that is used to collect metrics and display graphs, and it can be tiered to handle very large installations. To send your Flume metrics to your Ganglia cluster, you need to pass some properties to your agent at startup time:

Java property                   Value                     Description
flume.monitoring.type           ganglia                   Set to ganglia
flume.monitoring.hosts          host1:port1,host2:port2   A comma-separated list of host:port pairs for your gmond process(es)
flume.monitoring.pollFrequency  60                        The number of seconds between sending of data (the default is 60 seconds)
flume.monitoring.isGanglia3     false                     Set to true if using the older Ganglia 3 protocol. The default is to send data using the v3.1 protocol.
Look at each instance of gmond within the same network broadcast domain (as reachability is based on multicast packets), and find the udp_recv_channel block in gmond.conf. Let's say I had two nearby servers with these two corresponding configuration blocks:
udp_recv_channel {
  mcast_join = 239.2.14.22
  port = 8649
  bind = 239.2.14.22
  retry_bind = true
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  retry_bind = true
}
In this case, the IP and port are 239.2.14.22/8649 for the first server and 239.2.11.71/8649 for the second, leading to these startup properties:
-Dflume.monitoring.type=ganglia
-Dflume.monitoring.hosts=239.2.14.22:8649,239.2.11.71:8649
Here, I am using defaults for the poll interval, and I'm also using the newer Ganglia wire protocol.
Note: While receiving data via TCP is supported in Ganglia, the current Flume/Ganglia integration only supports sending data using multicast UDP. If you have a large or complicated network setup, you'll want to get educated by your network engineers if things don't work as you expect.
Internal HTTP server
You can configure the Flume agent to start an HTTP server that will output JSON that can be queried by outside mechanisms. Unlike the Ganglia integration, an external entity has to call into the Flume agent to poll the data. In theory, you could use Nagios to poll this JSON data and alert on certain conditions, but I have personally never tried it. Of course, this setup is very useful in development and testing, especially if you are writing custom Flume components and want to be sure they are generating useful metrics. Here is a summary of the Java properties you'll need to set at the startup of the Flume agent:

Java property          Value  Description
flume.monitoring.type  http   Set to http
flume.monitoring.port  PORT   The port number to bind the HTTP server to

The URL for metrics will be http://SERVER_OR_IP_OF_AGENT:PORT/metrics.
Let’slookatthefollowingFlumeconfiguration:
agent.sources=s1
agent.channels=c1
agent.sinks=k1
agent.sources.s1.type=avro
agent.sources.s1.bind=0.0.0.0
agent.sources.s1.port=12345
agent.sources.s1.channels=c1
agent.channels.c1.type=memory
agent.sinks.k1.type=avro
agent.sinks.k1.hostname=192.168.33.33
agent.sinks.k1.port=9999
agent.sinks.k1.channel=c1
Start the Flume agent with these properties:
-Dflume.monitoring.type=http
-Dflume.monitoring.port=44444
Now, when you go to http://SERVER_OR_IP:44444/metrics, you might see something like this:
{
  "SOURCE.s1": {
    "OpenConnectionCount": "0",
    "AppendBatchAcceptedCount": "0",
    "AppendBatchReceivedCount": "0",
    "Type": "SOURCE",
    "EventAcceptedCount": "0",
    "AppendReceivedCount": "0",
    "StopTime": "0",
    "EventReceivedCount": "0",
    "StartTime": "1365128622891",
    "AppendAcceptedCount": "0"
  },
  "CHANNEL.c1": {
    "EventPutSuccessCount": "0",
    "ChannelFillPercentage": "0.0",
    "Type": "CHANNEL",
    "StopTime": "0",
    "EventPutAttemptCount": "0",
    "ChannelSize": "0",
    "StartTime": "1365128621890",
    "EventTakeSuccessCount": "0",
    "ChannelCapacity": "100",
    "EventTakeAttemptCount": "0"
  },
  "SINK.k1": {
    "BatchCompleteCount": "0",
    "ConnectionFailedCount": "4",
    "EventDrainAttemptCount": "0",
    "ConnectionCreatedCount": "0",
    "BatchEmptyCount": "0",
    "Type": "SINK",
    "ConnectionClosedCount": "0",
    "EventDrainSuccessCount": "0",
    "StopTime": "0",
    "StartTime": "1365128622325",
    "BatchUnderflowCount": "0"
  }
}
As you can see, each source, sink, and channel is broken out separately with its corresponding metrics. Each type of source, channel, and sink provides its own set of metric keys, although there is some commonality, so be sure to check what looks interesting. For instance, this Avro source has OpenConnectionCount, that is, the number of connected clients (who are most likely sending data in). This may help you decide whether you have the expected number of clients relying on that data or, perhaps, too many clients, in which case you need to start tiering your agents.
Generally speaking, the channel's ChannelSize or ChannelFillPercentage metrics will give you a good idea whether data is coming in faster than it is going out. They will also tell you whether the channel is sized large enough to ride out maintenance windows or outages given your data volume.
Looking at the sink, EventDrainSuccessCount versus EventDrainAttemptCount will tell you how often output is successful compared to the number of attempts. In this example, I configured an Avro sink pointing at a nonexistent target. As you can see, the ConnectionFailedCount metric is growing, which is a good indicator of persistent connection problems. Even a growing ConnectionCreatedCount metric can indicate that connections are dropping and reopening too often.
Really, there are no hard and fast rules besides watching ChannelSize/ChannelFillPercentage. Each use case will have its own performance profile, so start small, set up your monitoring, and learn as you go.
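If you do want to script a check against this endpoint, here is a minimal, hypothetical Java sketch that fetches the JSON and flags a filling channel. The localhost address, the 90.0 threshold, and the naive string search are assumptions for illustration; a real check would use a proper JSON parser:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ChannelFillCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical agent location; matches the -Dflume.monitoring.port example above
    URL url = new URL("http://localhost:44444/metrics");
    StringBuilder json = new StringBuilder();
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        json.append(line);
      }
    } finally {
      in.close();
    }
    // Naive extraction of ChannelFillPercentage; use a JSON library in real code
    String key = "\"ChannelFillPercentage\":\"";
    int idx = json.indexOf(key);
    if (idx >= 0) {
      int start = idx + key.length();
      int end = json.indexOf("\"", start);
      double fill = Double.parseDouble(json.substring(start, end));
      System.out.println(fill > 90.0
          ? "WARNING: channel is " + fill + "% full"
          : "OK: channel is " + fill + "% full");
    }
  }
}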
Custom monitoring hooks

If you already have a monitoring system, you may want to make the extra effort to develop a custom monitoring reporting mechanism. You may think this is as simple as implementing the org.apache.flume.instrumentation.MonitorService interface. You do need to do this, but if you look at the interface, you will see only start() and stop() methods. Unlike the more obvious interceptor paradigm, the agent expects that your MonitorService implementation will start and stop a thread to send data at the expected or configured interval if it is the type that sends data to a receiving service. If you are going to operate a service, such as the HTTP service, then start/stop would be used to start and stop your listening service. The metrics themselves are published internally to JMX by the various sources, sinks, channels, and interceptors, using object names that start with org.apache.flume. Your implementation will need to read these from MBeanServer.
Note: The best advice I can give you, should you decide to implement your own, is to look at the source of the two existing implementations (included in the source download referenced in Chapter 2, A Quick Start Guide to Flume) and do what they do. To use your monitoring hook, set the flume.monitoring.type property to the fully qualified class name of your implementation class. Expect to have to rework any custom hooks with new Flume versions until the framework matures and stabilizes.
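To make the shape of such a hook concrete, here is a minimal sketch. The class name, package, and 60-second interval are assumptions, and it simply prints to standard output; you would substitute calls to your own monitoring system. The JMX query pattern follows from the org.apache.flume object name prefix mentioned earlier:

package com.example.flume;

import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import org.apache.flume.instrumentation.MonitorService;

public class LoggingMonitorService implements MonitorService {
  private ScheduledExecutorService scheduler;

  @Override
  public void start() {
    // The agent only calls start/stop; we own the reporting thread
    scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(new Runnable() {
      public void run() { report(); }
    }, 60, 60, TimeUnit.SECONDS); // assumed interval, not a Flume default
  }

  @Override
  public void stop() {
    if (scheduler != null) {
      scheduler.shutdown();
    }
  }

  private void report() {
    try {
      MBeanServer server = ManagementFactory.getPlatformMBeanServer();
      // Flume publishes its metrics under object names starting with org.apache.flume
      for (ObjectName name : server.queryNames(new ObjectName("org.apache.flume.*:*"), null)) {
        for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
          // Replace this println with a push to your monitoring system
          System.out.println(name + " " + attr.getName() + "="
              + server.getAttribute(name, attr.getName()));
        }
      }
    } catch (Exception e) {
      e.printStackTrace(); // keep the scheduled task alive on transient errors
    }
  }
}

You would then start the agent with -Dflume.monitoring.type=com.example.flume.LoggingMonitorService.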
Summary

In this chapter, we covered monitoring Flume agents both at the process level and at the level of internal metrics (whether the agent is actually doing work).

Monit and Nagios were introduced as open source options for process watching.

Next, we covered the Flume agent's internal monitoring metrics with the Ganglia and JSON-over-HTTP implementations that ship with Apache Flume.

Finally, we covered how to integrate a custom monitoring implementation if you need to directly integrate with some other tool that isn't supported by Flume by default.

In our final chapter, we will discuss some general considerations for your Flume deployment.
Chapter 9. There Is No Spoon – the Realities of Real-time Distributed Data Collection

In this last chapter, I thought we should cover some of the less concrete, random thoughts I have around data collection into Hadoop. There's no hard science behind some of this, and you should feel perfectly alright disagreeing with me.
While Hadoop is a great tool to consume vast quantities of data, I often think of a picture of the logjam that occurred in 1886 in the St. Croix River in Minnesota (http://www.nps.gov/sacn/historyculture/stories.htm). When dealing with too much data, you want to make sure you don't jam your river. Be sure to take the previous chapter on monitoring seriously and not just as nice-to-have information.
Transport time versus log time

I had a situation where data was being placed into HDFS using date patterns in the filename and/or path, and the dates didn't match the contents of the directories. The expectation was that the data in the 2014/12/29 directory path contained all the data for December 29, 2014. However, the reality was that the date was being pulled from the transport. It turns out that the version of syslog we were using was rewriting the header, including the date portion, causing the data to take on the transport time rather than the original time of the record. Usually, the offsets were tiny, just a second or two, so nobody really took notice. However, one day, one of the relay servers died, and when the data that had got stuck on upstream servers was finally sent, it carried the current time. In this case, it was shifted by a couple of days, causing a significant data cleanup effort.
Besurethisisn’thappeningtoyouifyouareplacingdatabydate.Checkthedateedgecasestoseethattheyarewhatyouexpect,andmakesureyoutestyouroutagescenariosbeforetheyhappenforrealinproduction.
As I mentioned previously, these retransmits due to planned or unplanned maintenance (or even a tiny network hiccup) will most likely cause duplicate and out-of-order events to arrive, so be sure to account for this when processing raw data. Flume offers no single-delivery or ordering guarantees. If you need those, use a transactional database or a distributed transaction log such as Apache Kafka (http://kafka.apache.org/) instead. Of course, if you are going to use Kafka, you would probably only use Flume for the final leg of your data path, with your source consuming events from Kafka (https://github.com/baniuyao/flume-ng-kafka-source).
Note: Remember that you can always work around duplicates in your data at query time as long as you can uniquely identify your events from one another. If you cannot distinguish events easily, you can add a Universally Unique Identifier (UUID) (http://en.wikipedia.org/wiki/Universally_unique_identifier) header using the bundled interceptor, UUIDInterceptor (configuration details are in the Flume User Guide).
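For orientation, a minimal configuration might look like the following sketch, which attaches the interceptor to the source from the earlier example; verify the type string and property names against the Flume User Guide before relying on this:

agent.sources.s1.interceptors=i1
agent.sources.s1.interceptors.i1.type=org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
agent.sources.s1.interceptors.i1.headerName=id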
Time zones are evil

In case you missed my bias against using local time in Chapter 4, Sinks and Sink Processors, I'll repeat it here a little stronger: time zones are evil. Evil like Dr. Evil (http://en.wikipedia.org/wiki/Dr._Evil), and let's not forget his Mini-Me counterpart (http://en.wikipedia.org/wiki/Mini-Me): Daylight Savings Time.
We live in a global world now. You are pulling data from all over the place into your Hadoop cluster. You may even have multiple data centers in different parts of the country (or the world). The last thing you want to be doing while trying to analyze your data is to deal with skewed data. Daylight Savings Time changes somewhere on Earth at least a dozen times a year. Just look at the history: ftp://ftp.iana.org/tz/releases/. Save yourself the headache and just normalize everything to UTC. If you want to convert it to "local time" on its way to human eyeballs, feel free. However, while it lives in your cluster, keep it normalized to UTC.
Note: Consider adopting UTC everywhere via this Java startup parameter (if you can't set it system-wide): -Duser.timezone=UTC
Also, use the ISO 8601 (http://en.wikipedia.org/wiki/ISO_8601) time standard where possible, and be sure to include time zone information (even if it is UTC). Every modern tool on the planet supports this format and will save you pain down the road.
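As a quick sketch of what this looks like in practice, here is one way to emit an ISO 8601 timestamp in UTC from Java (the class name is illustrative; the pattern is standard SimpleDateFormat syntax):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class Iso8601Example {
  public static void main(String[] args) {
    // ISO 8601 with an explicit offset, for example 2014-12-29T14:30:05+0000
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    System.out.println(fmt.format(new Date()));
  }
}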
I live in Chicago, and our computers at work use Central Time, which adjusts for daylight savings. In our Hadoop cluster, we like to keep data in a YYYY/MM/DD/HH directory layout. Twice a year, things break slightly. In the fall, we have twice as much data in our 2 a.m. directory. In the spring, there is no 2 a.m. directory. Madness!
Capacity planning

Regardless of how much data you think you have, things will change over time. New projects will pop up, and data creation rates for your existing projects will change (up or down). Data volume will usually ebb and flow with the traffic of the day. Finally, the number of servers feeding your Hadoop cluster will change over time.
There are many schools of thought on how much extra storage capacity you should keep in your Hadoop cluster (we use the totally unscientific value of 20 percent, which means that we usually plan for 80 percent full when ordering additional hardware, but don't start to panic until we hit the 85-90 percent utilization number). Generally, you want to keep enough extra space so that the failure and/or maintenance of a server or two won't cause the HDFS block replication to consume all the remaining space.
You may also need to set up multiple flows inside a single agent. The source and sink processors are currently single-threaded, so there is a limit to what tuning batch sizes can accomplish under heavy data volumes. Be very careful in situations where you split your data flow at the source, using a replicating channel selector to write to multiple channels/sinks (as in the sketch that follows). If one of the paths' channels fills up and that channel is not marked as optional, an exception is thrown back to the source and the source will stop consuming new data. This effectively jams the agent for all the other channels attached to that source. If the full channel is marked as optional, the data destined for it is simply dropped, which you may not want because the data is important. Unfortunately, this is the only fan-out mechanism provided in Flume to send to multiple destinations, so make sure you catch issues quickly so that all your data flows are not impaired due to a cascading backup of events.
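Here is a minimal sketch of such a fan-out, reusing the source name from the earlier example; the channel names and the choice of which channel to sacrifice are assumptions for illustration:

agent.sources.s1.channels=c1 c2
agent.sources.s1.selector.type=replicating
agent.sources.s1.selector.optional=c2

With c2 marked optional, a full c2 silently drops events for that path while c1 continues; remove the optional line and a full c2 stalls the source for both paths.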
For the many Flume agents feeding Hadoop, channel capacity, too, should be adjusted based on real numbers. Watch the channel size to see how well the writes are keeping up under normal loads. Adjust the maximum channel capacity to handle whatever amount of overhead makes you feel comfortable. You can always purchase way more hardware than you need, but even a prolonged outage may overflow the most conservative estimates. This is when you have to pick and choose which data is more important to you and adjust your channel capacities to reflect that. This way, if you exceed your limits, the least important data will be the first to be dropped.
Chances are your company doesn't have an infinite amount of money, and at some point, the value of the data versus the cost of continuing to expand your cluster will start to be questioned. This is why setting limits on the volume of data collected is very important. This is just one aspect of your data retention policy, where cost is the driving factor. In a moment, we'll discuss some of the compliance aspects of this policy. Suffice it to say, any project sending data into Hadoop should be able to say what the value of that data is and what the loss would be if the older data were deleted. This is the only way the people writing the checks can make an informed decision.
Considerations for multiple data centers

If you run your business out of multiple data centers and have a large volume of data collected, you may want to consider setting up a Hadoop cluster in each data center rather than sending all your collected data back to a single data center. There may be regulatory implications regarding data crossing certain geographic boundaries. Chances are there is somebody in your company who knows much more about compliance than you or I, so seek them out before you start copying data across borders. Of course, not collating your data will make it more difficult to analyze, as you can't just run one MapReduce job against all the data. Instead, you would have to run parallel jobs and then combine the results in a second pass. Adjusting your data processing procedures is better than potentially breaking the law. Be sure to do your homework.
Pulling all your data into a single cluster may also be more than your networking can handle. Depending on how your data centers are connected to each other, you simply may not be able to transmit the desired volume of data. If you use public cloud services, there are surely data transfer costs between data centers. Finally, consider that a complete cluster failure or corruption may wipe out everything, as most clusters are usually too big to back up anything except high-value data. Having some of the old data in this case is sometimes better than having nothing. With multiple Hadoop clusters, you have the ability to use a Failover Sink Processor to forward data to a different cluster if you don't want to wait to send to the local one (see the sketch that follows).
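As a reminder of the sink group syntax from Chapter 4, a failover setup across two clusters might look like this sketch, where k1 writes to the local cluster and k2 to the remote one; the group name, sink names, and priorities are assumptions for illustration:

agent.sinkgroups=sg1
agent.sinkgroups.sg1.sinks=k1 k2
agent.sinkgroups.sg1.processor.type=failover
agent.sinkgroups.sg1.processor.priority.k1=10
agent.sinkgroups.sg1.processor.priority.k2=5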
If you do choose to send all your data to a single destination, consider adding a machine with large disk capacity as a relay server for the data center. This way, if there is a communication issue or extended cluster maintenance, you can let data pile up on a machine that's different from the ones trying to service your customers. This is sound advice even in a single data center situation.
Compliance and data expiry

Remember that the data your company is collecting from your customers should be considered sensitive information. You may be bound by additional regulatory limitations on accessing data, such as:

- Personally identifiable information (PII): how you handle and safeguard your customers' identities (http://en.wikipedia.org/wiki/Personally_identifiable_information)
- Payment Card Industry Data Security Standard (PCI DSS): how you safeguard credit card information (http://en.wikipedia.org/wiki/PCI_DSS)
- Service Organization Control (SOC-2): how you control access to information/systems (http://www.aicpa.org/InterestAreas/FRC/AssuranceAdvisoryServices/Pages/AICPASOC2Report.aspx)
- Statements on Standards for Attestation Engagements (SSAE-16): how you manage changes (http://www.aicpa.org/Research/Standards/AuditAttest/DownloadableDocuments/AT-00801.pdf)
- Sarbanes-Oxley (SOX): http://en.wikipedia.org/wiki/Sarbanes%E2%80%93Oxley_Act
This is by no means a definitive list, so be sure to seek out your company's compliance experts for what does and doesn't apply to your situation. If you aren't properly handling access to this data in your cluster, the government will lean on you, or worse, you won't have customers anymore if they feel you aren't protecting their personal information. Consider scrambling, trimming, or obfuscating your data of personal information. Chances are the business insight you are looking for falls more into the category of "how many people who search for 'hammer' actually buy one?" rather than "how many customers are named Bob?" As you saw in Chapter 6, Interceptors, ETL, and Routing, it would be very easy to write an interceptor to obfuscate PII as you move it around; a sketch follows.
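For instance, here is a minimal, hypothetical interceptor that replaces a configurable header's value with its SHA-256 hash, so events stay distinguishable without carrying the raw value. The class, package, and default header name are illustrative assumptions, not a bundled Flume component:

package com.example.flume;

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.List;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class HashHeaderInterceptor implements Interceptor {
  private final String header;

  private HashHeaderInterceptor(String header) {
    this.header = header;
  }

  @Override public void initialize() {}
  @Override public void close() {}

  @Override
  public Event intercept(Event event) {
    String value = event.getHeaders().get(header);
    if (value != null) {
      try {
        // One-way hash keeps events distinguishable without exposing the PII itself
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(value.getBytes(Charset.forName("UTF-8")));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        event.getHeaders().put(header, hex.toString());
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  public static class Builder implements Interceptor.Builder {
    private String header = "username"; // assumed default header to obfuscate

    @Override
    public void configure(Context context) {
      header = context.getString("header", header);
    }

    @Override
    public Interceptor build() {
      return new HashHeaderInterceptor(header);
    }
  }
}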
Your company probably has a document retention policy that includes the data you are putting into Hadoop. Make sure you remove data that your policy says you aren't supposed to be keeping around anymore. The last thing you want is a visit from the lawyers.
Summary

In this chapter, we covered several real-world considerations you need to think about when planning your Flume implementation, including:

- Transport time does not always match event time
- The mayhem introduced by Daylight Savings Time into time-based logic
- Capacity planning considerations
- Items to consider when you have more than one data center
- Data compliance
- Data retention and expiration
I hope you enjoyed this book. Hopefully, you will be able to apply much of this information directly in your application/Hadoop integration efforts.
Thanks, this was fun!
Index

A
ActiveMQ: URL / JMS source
agent / Sources, channels, and sinks
agent command / Starting up with "Hello, World!"
agent identifier (name) / An overview of the Flume configuration file
agent process: monitoring / Monitoring the agent process
agent process, monitoring: Monit / Monit; Nagios / Nagios
AOP Spring Framework / Interceptors, channel selectors, and sink processors
Apache Avro: about / Apache Avro; URL / Apache Avro
Apache benchmark: URL / Setting up the web server
Apache BigTop project: URL / Flume in Hadoop distributions
Apache Flume: URL / The memory channel
Apache Kafka: URL / Transport time versus log time
Avro source/sink: used, for tiering data flows / The Avro source/sink, The Thrift source/sink; compressed Avro / Compressing Avro
avro_event serializer / Apache Avro

B
basenameHeaderKey property / Spooling Directory Source
batchDurationMillis property / Sink configuration
batchSize property / Sink configuration
batchTimeout property / The Exec source
Best effort (BE) / Flume 0.9
byteCapacity: URL / The memory channel
byteCapacityBufferPercentage: URL / The memory channel

C
capacity planning / Capacity planning
CDH3 distribution / Flume 0.9
CDH5: URL / Archiving to HDFS
centralized master (masters) / Flume 0.9
cf-engine tool / Flume 1.X (Flume-NG)
ChannelProcessor function / The memory channel
channels / Sources, channels, and sinks
channel selector: about / Channel selectors; replicating / Replicating; multiplexing / Multiplexing
channel selectors / Interceptors, channel selectors, and sink processors
checkpointInterval property / The file channel
Chef tool / Flume 1.X (Flume-NG)
Cloudera: URL / Flume in Hadoop distributions
Cloudera distribution: issue solution, URL / An overview of the Flume configuration file
codecs: about / Compression codecs
command-line Avro: about / Using command-line Avro
compliance / Compliance and data expiry
CompressedStream / CompressedStream
consumeOrder property / Spooling Directory Source
cron daemon: URL / Configuring log rotation to the spool directory
custom interceptors: about / Custom interceptors; plugins directory / The plugins directory
custom monitoring reporting mechanism: developing / Custom monitoring hooks

D
-Dflume.root.logger property / Starting up with "Hello, World!"
data/logs: streaming / The problem with HDFS and streaming data/logs
data expiry / Compliance and data expiry
data flows: tiering / Tiering flows
data flows, tiering: Avro source/sink, using / The Avro source/sink; SSL Avro / SSL Avro flows; Thrift source/sink, using / The Thrift source/sink; command-line Avro / Using command-line Avro; Log4J appender / The Log4J appender; Log4J load-balancing appender / The Log4J load-balancing appender
DataStream / DataStream
destinationName property / JMS source
Disk Failover (DFO) / Flume 0.9

E
Elastic Compute Cluster (EC2) / Weblogs to searchable UI
ElasticSearch: about / ElasticSearch Sink; URL / ElasticSearch Sink, Setting up the target – Elasticsearch, Setting up a better user interface – Kibana; versus Apache Solr / Dynamic Serializer; setting up / Setting up the target – Elasticsearch
ElasticSearchDynamicSerializer serializer / Dynamic Serializer
ElasticSearchLogStashEventSerializer / LogStash Serializer
ElasticSearch Sink: about / ElasticSearch Sink; settings / ElasticSearch Sink; ElasticSearchLogStashEventSerializer / LogStash Serializer; ElasticSearchDynamicSerializer serializer / Dynamic Serializer
Embedded Agent: about / The embedded agent; configuration / Configuration and startup; startup / Configuration and startup; alternative formats, URL / Configuration and startup; data, sending / Sending data; shutdown / Shutdown
End-to-End (E2E) / Flume 0.9
Event Serializer: about / Event Serializers; text output / Text output; text_with_headers serializer / Text with headers; Apache Avro / Apache Avro; user-provided Avro schema / User-provided Avro schema; file type / File type; timeouts / Timeouts and workers; workers / Timeouts and workers
Exec source: about / The Exec source; properties / The Exec source

F
file channel: about / The file channel; configuration parameters / The file channel
file type: about / File type; SequenceFile / SequenceFile; DataStream / DataStream; CompressedStream / CompressedStream
Flume: about / Flume 0.9; events / Flume events; URL / Downloading Flume; downloading / Downloading Flume; in Hadoop distributions / Flume in Hadoop distributions; configuration file / An overview of the Flume configuration file; setting up, on collector/relay / Setting up Flume on collector/relay; setting up, on client / Setting up Flume on the client; agent process, monitoring / Monitoring the agent process
Flume-NG (Flume the Next Generation) / Flume 0.9
flume-ng-kafka-source: URL / Transport time versus log time
Flume 0.9 / Flume 0.9
Flume 1.X (Flume-NG) / Flume 1.X (Flume-NG); URL / Flume 1.X (Flume-NG)
Flume configuration file: overview / An overview of the Flume configuration file
Flume events: about / Flume events; interceptors / Interceptors, channel selectors, and sink processors; channel selectors / Interceptors, channel selectors, and sink processors; sink processors / Interceptors, channel selectors, and sink processors
Flume JVM: URL / Monitoring performance metrics
Flume User Guide: URL / The file channel, HDFS sink

G
Ganglia: about / Ganglia; URL / Ganglia
grok command: URL / The Kite SDK

H
Hadoop distributions: Flume / Flume in Hadoop distributions; benefits / Flume in Hadoop distributions; limitations / Flume in Hadoop distributions
HDFS: issue / The problem with HDFS and streaming data/logs; archiving to / Archiving to HDFS
hdfs.batchSize parameter / File rotation
hdfs.maxOpenFiles property / Path and filename
hdfs.timeZone property / Path and filename
hdfs.useLocalTimeStamp boolean property / Path and filename
HDFS sink: about / HDFS sink; configuration parameters / HDFS sink; path / Path and filename; filename / Path and filename; file rotation / File rotation; compression codecs / Compression codecs
Hello, World! example: file configuration / Starting up with "Hello, World!"
help command / Starting up with "Hello, World!"
Hortonworks: URL / Flume in Hadoop distributions
Host interceptor: about / Host; properties / Host
Human Optimized Configuration Object Notation (HOCON): URL / Morphline configuration files

I
indexName property / ElasticSearch Sink
interceptor / Interceptors, channel selectors, and sink processors; used, for creating search fields / Creating more search fields with an interceptor
interceptors: about / Interceptors; adding / Interceptors; Timestamp / Timestamp; Host / Host; Static / Static; regular expression filtering / Regular expression filtering; regular expression / Regular expression extractor; Morphline / Morphline interceptor; custom / Custom interceptors
internal HTTP server: using / Internal HTTP server
ISO 8601: URL / Time zones are evil

J
Java KeyStore (JKS) / SSL Avro flows
Java properties: flume.monitoring.type / Ganglia; flume.monitoring.hosts / Ganglia; flume.monitoring.isGanglia3 / Ganglia
JMS message selectors: URL / JMS source
JMS source: about / JMS source; configuring / JMS source; settings / JMS source

K
keep-alive parameter / The file channel
keepFields property / The syslog UDP source, The syslog TCP source
Kibana: URL / ElasticSearch Sink, Setting up a better user interface – Kibana; setting up / Setting up a better user interface – Kibana
Kite SDK: about / The Kite SDK; URL / The Kite SDK

L
Log4J appender: about / The Log4J appender; URL / The Log4J appender
Log4J load-balancing appender: about / The Log4J load-balancing appender
logrotate utility: URL / Configuring log rotation to the spool directory
Logstash: URL / ElasticSearch Sink
log time: versus transport time / Transport time versus log time

M
MapR: URL / Flume in Hadoop distributions
memoryCapacity property / Spillable Memory Channel
memory channel: about / The memory channel; configuration parameters / The memory channel
metrics: URL / Internal HTTP server
Mini-Me counterpart / Time zones are evil
minimumRequiredSpace property / The file channel
Monit: URL / Monit; about / Monit
Morphline: configuration file / Morphline configuration files; URL / Morphline configuration files, Typical SolrSink configuration, Morphline interceptor
Morphline commands: URL / The Kite SDK
morphlineId property / Sink configuration
Morphline interceptor: about / Morphline interceptor
MorphlineSolrSink: about / MorphlineSolrSink; Morphline configuration file / Morphline configuration files; SolrSink configuration / Typical SolrSink configuration; sink configuration / Sink configuration
multiple data centers: considerations / Considerations for multiple data centers
Multiport Syslog TCP source: about / The multiport syslog TCP source; properties / The multiport syslog TCP source; Flume headers / The multiport syslog TCP source

N
Nagios: URL / Nagios; about / Nagios
Nagios JMX plugin: URL / Monitoring performance metrics
nc command / Starting up with "Hello, World!"
Nginx web server: URL / Setting up the web server

O
overflowCapacity property / Spillable Memory Channel
overflowDeactivationThreshold property / Spillable Memory Channel
overflowTimeout property / Spillable Memory Channel

P
Payment Card Industry Data Security Standard (PCI DSS): URL / Compliance and data expiry
performance metrics: monitoring / Monitoring performance metrics
performance metrics, monitoring: Ganglia / Ganglia; internal HTTP server / Internal HTTP server; custom monitoring hooks / Custom monitoring hooks
Personally identifiable information (PII): URL / Compliance and data expiry
plugins directory: about / The plugins directory; $FLUME_HOME/plugins.d directory / The plugins directory; lib directory / The plugins directory; native directory / The plugins directory
pollTimeout property / JMS source
pom.xml file: URL / Dynamic Serializer
POSIX-style filesystem / The problem with HDFS and streaming data/logs
processor.backoff property / Load balancing
Puppet tool / Flume 1.X (Flume-NG)

R
Red Hat Enterprise Linux (RHEL) / Flume in Hadoop distributions
regular expression extractor interceptor: about / Regular expression extractor; properties / Regular expression extractor
regular expression filtering interceptor: about / Regular expression filtering; URL / Regular expression filtering
routing / Routing
Runners / Flume 1.X (Flume-NG)

S
Sarbanes-Oxley (SOX): URL / Compliance and data expiry
SequenceFile file type / SequenceFile
serializer.syncIntervalBytes property / Apache Avro
serializers / Regular expression extractor
sink group: about / Sink groups; load balancing / Load balancing; failover / Failover
Sink Processor: about / Sink groups
sink processors / Interceptors, channel selectors, and sink processors
sinks / Sources, channels, and sinks
Solr / MorphlineSolrSink
SolrCloud / MorphlineSolrSink
SolrSink: configuration / Typical SolrSink configuration
sources / Sources, channels, and sinks
Spillable Memory Channel: about / Spillable Memory Channel; configuration parameters / Spillable Memory Channel
spool directory: log rotation, configuring / Configuring log rotation to the spool directory
Spooling Directory Source: about / Spooling Directory Source; creating / Spooling Directory Source; properties / Spooling Directory Source
Spring: URL / Configuration and startup
start() method / Custom monitoring hooks
Statements on Standards for Attestation Engagements (SSAE-16): URL / Compliance and data expiry
Static interceptor: about / Static; properties / Static
syslog sources: about / Syslog sources; URL / Syslog sources; UDP source / The syslog UDP source; TCP source / The syslog TCP source; Multiport Syslog TCP source / The multiport syslog TCP source
Syslog TCP source: about / The syslog TCP source; creating / The syslog TCP source; Flume headers / The syslog TCP source
syslog terms: URL / Syslog sources
Syslog UDP source: about / The syslog UDP source; properties / The syslog UDP source; Flume headers / The syslog UDP source

T
tail: URL / The problem with using tail; about / The problem with using tail; issues / The problem with using tail
tail -F command / The Exec source
take / The memory channel
text_with_headers serializer / Text with headers
Thrift: URL / The Thrift source/sink
Thrift source/sink: used, for tiering data flows / The Thrift source/sink
tiered data collection / Tiered data collection (multiple flows and/or agents)
Timestamp interceptor: about / Timestamp; properties / Timestamp
timestamp key / Regular expression extractor
time zones / Time zones are evil
transactionCapacity property / The file channel, Spillable Memory Channel
transport time: versus log time / Transport time versus log time

U
Universally Unique Identifier (UUID): URL / Transport time versus log time
user-provided Avro schema / User-provided Avro schema

V
VeriSign / SSL Avro flows

W
web application: simulating / Weblogs to searchable UI
web server: setting up / Setting up the web server; log rotation, configuring to spool directory / Configuring log rotation to the spool directory
worker daemons (agents) / Flume 0.9
Write Ahead Log (WAL) / The file channel
wrk: URL / Setting up the web server

Z
Zookeeper / Flume 0.9