Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Description: Slides from a presentation given to the Chicago Hadoop User Group on April 9, 2014.

Transcript of Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Page 1: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Apache Flume: Getting Logs/Data to Hadoop

Steve Hoffman Chicago Hadoop User Group (CHUG)

2014-04-09T10:30:00Z

Pages 2-4: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

About Me

• Steve Hoffman

• twitter: @bacoboy else: http://bit.ly/bacoboy

• Tech Guy @Orbitz

• Wrote a book on Flume

Page 5: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Why do I need Flume?

• Created to deal with streaming data/logs to HDFS

• Can’t mount HDFS (usually)

• Can’t “copy” files to HDFS if the files aren’t closed (aka log files)

• Need to buffer “some”, then write and close a file — repeat

• May involve multiple hops due to topology (# of machines, datacenter separation, etc.)

• A lot can go wrong here…

Page 6: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Agent

• Java daemon

• Has a name (usually ‘agent’)

• Receives data from Sources and writes Events to 1 or more Channels

• Moves Events from 1 Channel to a Sink, removing them from the Channel only if successfully written

Page 7: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Events

• Headers = Key/Value Pairs — Map<String, String>

• Body = byte array — byte[]

• For example, this log line:

10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] "HEAD /ping.html HTTP/1.1" 200 0 "-" "-" "-"

becomes an Event with headers plus the raw bytes as the body (shown here as hex):

headers: {"timestamp":"1391986793111", "host":"server1.example.com"}
body: 31302e31302e312e31202d202d205b32392f4a616e2f323031343a30333a33363a3034202d303630305d202248454144202f70696e672e68746d6c20485454502f312e312220323030203020222d2220222d2220222d22
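In code, an Event like this can be built with the Flume SDK's EventBuilder; a minimal sketch (the header values are illustrative):

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
  public static void main(String[] args) {
    // Headers are just a String-to-String map.
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("timestamp", String.valueOf(System.currentTimeMillis()));
    headers.put("host", "server1.example.com");

    // The body is an opaque byte array; here, a raw access-log line.
    String line = "10.10.1.1 - - [29/Jan/2014:03:36:04 -0600] \"HEAD /ping.html HTTP/1.1\" 200 0";
    Event event = EventBuilder.withBody(line.getBytes(Charset.forName("UTF-8")), headers);

    System.out.println(event);
  }
}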

Page 8: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Channels

• Place to hold Events

• Memory or File Backed (also JDBC, but why?)

• Bounded - Size is configurable

• Resources aren’t infinite
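For durability, a file-backed channel might be configured like this sketch (the paths and capacity are illustrative; check the Flume User Guide for your version's defaults):

agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /var/flume/checkpoint
agent.channels.c1.dataDirs = /var/flume/data
agent.channels.c1.capacity = 100000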

Page 9: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Sources

• Feed data to one or more Channels

• Usually data is pushed to them (listening for data on a socket, e.g. the HTTP Source, or receiving from the Avro log4j appender)

• Or they can periodically poll another system and generate Events (e.g. run a command every minute and parse the output into an Event, or query a DB/Mongo/etc.)
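A minimal sketch of the push style, using the HTTP Source (the port is illustrative):

agent.sources.r1.type = http
agent.sources.r1.bind = 0.0.0.0
agent.sources.r1.port = 8080
agent.sources.r1.channels = c1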

Page 10: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Sinks

• Move Events from a single Channel to a destination

• Only removes from Channel if write successful

• HDFSSink you’ll use the most — most likely…

Page 11: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Configuration Sample

# Agent named 'agent'
# Input (source)
agent.sources.r1.type = seq
agent.sources.r1.channels = c1

# Output (sink)
agent.sinks.k1.type = logger
agent.sinks.k1.channel = c1

# Channel
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Wire everything together
agent.sources = r1
agent.sinks = k1
agent.channels = c1
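To launch the agent with this file, startup looks something like the following sketch (the config file name is illustrative; -n must match the agent name used in the properties):

$ bin/flume-ng agent -n agent -c conf -f conf/sample.conf \
    -Dflume.root.logger=INFO,console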

Pages 12-16: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Startup

Given the configuration sample above, the agent works through the properties on startup:

• Read the component lists: name.{sources|sinks|channels}

• Find each instance name + type

• Connect channel(s)

• Apply type-specific configurations

RTM - Flume User Guide: https://flume.apache.org/FlumeUserGuide.html

or my book :)

Page 17: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Configuration Sample (logs)

Creating channels

Creating instance of channel c1 type memory

Created channel c1

Creating instance of source r1, type seq

Creating instance of sink: k1, type: logger

Channel c1 connected to [r1, k1]

Starting new configuration:{ sourceRunners:{r1=PollableSourceRunner: { source:org.apache.flume.source.SequenceGeneratorSource{name:r1,state:IDLE} counterGroup:{ name:null counters:{} } }} sinkRunners:{k1=SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor@19484a05 counterGroup:{ name:null counters:{} } }} channels:{c1=org.apache.flume.channel.MemoryChannel{name: c1}} }

Event: { headers:{} body: 30 0 }

Event: { headers:{} body: 31 1 }

Event: { headers:{} body: 32 2 }

and so on…

Page 18: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Using Cloudera Manager

• Same stuff, just in a GUI

• Centrally managed in a Database (instead of source control/Git)

• Distributed from central location (instead of Chef/Puppet)

Page 19: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Multiple destinations need multiple channels

Page 20: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Channel Selector

• When more than 1 channel specified on Source

• Replicating (Each channel gets a copy) - default

• Multiplexing (Channel picked based on a header value)

• Custom (If these don’t work for you - code one!)

Page 21: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Channel Selector Replicating

• Copy sent to all channels associated with Source

agent.sources.r1.selector.type=replicating
agent.sources.r1.channels=c1 c2 c3

• Can specify “optional” channels

agent.sources.r1.selector.optional=c3

• Transaction success if all non-optional channels take the event (in this case c1 & c2)

Page 22: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Channel Selector Multiplexing

• Copy sent to only some of the channels

agent.sources.r1.selector.type=multiplexing
agent.sources.r1.channels=c1 c2 c3 c4

• Switch based on a header key (e.g. {"currency":"USD"} → c1)

agent.sources.r1.selector.header=currency
agent.sources.r1.selector.mapping.USD=c1
agent.sources.r1.selector.mapping.EUR=c2 c3
agent.sources.r1.selector.default=c4

Page 23: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Interceptors

• Zero or more on Source (before written to channel)

• Zero or more on Sink (after read from channel)

• Or Both

• Use for transformations of data in-flight (headers OR body)

public Event intercept(Event event);
public List<Event> intercept(List<Event> events);

• Return null or empty List to drop Events
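As a sketch of the custom route, here is a hypothetical interceptor that drops Events with an empty body; the class is invented for illustration, but the Interceptor and Builder interfaces are Flume's:

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical example: drop Events with an empty body.
public class DropEmptyInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // No setup needed for this sketch.
  }

  @Override
  public Event intercept(Event event) {
    byte[] body = event.getBody();
    // Returning null drops the Event.
    return (body == null || body.length == 0) ? null : event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<Event>(events.size());
    for (Event e : events) {
      Event kept = intercept(e);
      if (kept != null) {
        out.add(kept);
      }
    }
    return out;
  }

  @Override
  public void close() {
    // Nothing to release.
  }

  // Flume instantiates interceptors through a nested Builder.
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new DropEmptyInterceptor();
    }

    @Override
    public void configure(Context context) {
      // No parameters in this sketch.
    }
  }
}

It would be wired in by type, e.g. agent.sources.r1.interceptors.i1.type=com.example.DropEmptyInterceptor$Builder (class name hypothetical).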

Page 24: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Interceptor Chaining

• Processed in order listed in configuration (source r1 example):

agent.sources.r1.interceptors=i1 i2 i3
agent.sources.r1.interceptors.i1.type=timestamp
agent.sources.r1.interceptors.i1.preserveExisting=true
agent.sources.r1.interceptors.i2.type=static
agent.sources.r1.interceptors.i2.key=datacenter
agent.sources.r1.interceptors.i2.value=CHI
agent.sources.r1.interceptors.i3.type=host
agent.sources.r1.interceptors.i3.hostHeader=relay
agent.sources.r1.interceptors.i3.useIP=false

• Resulting Headers added before writing to Channel:

{"timestamp":"1392350333234", "datacenter":"CHI", "relay":"flumebox.example.com"}

Page 25: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Morphlines

• Interceptor and Sink forms.

• See Cloudera Website/Blog

• Created to ease transforms and Cloudera Search/Flume integration.

• An example:

# convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
# The input may match one of "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
# or "yyyy-MM-dd'T'HH:mm:ss" or "yyyy-MM-dd".
convertTimestamp {
  field : timestamp
  inputFormats : ["yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
  inputTimezone : America/Chicago
  outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
  outputTimezone : UTC
}

Page 26: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Avro

• Apache Avro - Data Serialization

• http://avro.apache.org/

• Storage Format and Wire Protocol

• Self-Describing (schema written with the data)

• Supports Compression of Data (not container — so MapReduce friendly — “splittable”)

• Binary friendly — Doesn’t require records separated by \n

Page 27: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Avro Source/Sink

• Preferred inter-agent transport in Flume

• Simple Configuration (host + port for sink and port for source)

• Minimal transformation needed for Flume Events

• Versions of Avro in the client & server don’t need to match — only payload versioning matters (think protocol buffers vs Java serialization)

Page 28: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Avro Source/Sink Config

foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=avro
foo.sinks.sink-foo.hostname=bar.example.com
foo.sinks.sink-foo.port=12345
foo.sinks.sink-foo.compression-type=deflate

bar.sources=datafromfoo
bar.sources.datafromfoo.type=avro
bar.sources.datafromfoo.bind=0.0.0.0
bar.sources.datafromfoo.port=12345
bar.sources.datafromfoo.compression-type=deflate
bar.sources.datafromfoo.channels=channel-bar
bar.channels=channel-bar
bar.channels.channel-bar.type=memory
bar.sinks=…

Page 29: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

log4j Avro Sink

• Remember that Web Server pushing data to a Source?

• Use the Flume Avro log4j appender!

• log level, category, etc. become headers in Event

• “message” String becomes the body

Page 30: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

log4j Configuration

• log4j.properties sender (include flume-ng-sdk-1.X.X.jar in project):

log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=example.com
log4j.appender.flume.Port=12345
log4j.appender.flume.UnsafeMode=true
log4j.logger.org.example.MyClass=DEBUG,flume

• flume avro receiver:

agent.sources=logs
agent.sources.logs.type=avro
agent.sources.logs.bind=0.0.0.0
agent.sources.logs.port=12345
agent.sources.logs.channels=…

Page 31: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Avro Client

• Send data to an AvroSource from the command line

• Run the flume-ng program with the avro-client parameter instead of agent

$ bin/flume-ng avro-client -H server.example.com -p 12345 [-F input_file]

• Each line of the file (or stdin if no file given) becomes an event

• Useful for testing or injecting data from outside Flume sources (ExecSource vs. a cron job that pipes output to avro-client)
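A hypothetical crontab entry in that spirit (the script and paths are invented for illustration):

# Once a day at 01:00, pipe a report into a remote AvroSource.
0 1 * * * /usr/local/bin/make_report.sh | /opt/flume/bin/flume-ng avro-client -H server.example.com -p 12345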

Page 32: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

HDFSSink

• Reads from a Channel and writes to a file in HDFS in chunks

• Until 1 of 3 things happens:

• some amount of time elapses (rollInterval)

• some number of records have been written (rollCount)

• some size of data has been written (rollSize)

• Close that file and start a new one
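A sketch showing all three roll triggers together (the values are illustrative; whichever fires first closes the file, and 0 disables a trigger):

# rollInterval is in seconds, rollSize in bytes, rollCount in events
agent.sinks.k1.hdfs.rollInterval = 300
agent.sinks.k1.hdfs.rollSize = 134217728
agent.sinks.k1.hdfs.rollCount = 10000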

Page 33: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

HDFS Configuration

foo.sources=…
foo.channels=channel-foo
foo.channels.channel-foo.type=memory
foo.sinks=sink-foo
foo.sinks.sink-foo.channel=channel-foo
foo.sinks.sink-foo.type=hdfs
foo.sinks.sink-foo.hdfs.path=hdfs://NN/data/%Y/%m/%d/%H
foo.sinks.sink-foo.hdfs.rollInterval=60
foo.sinks.sink-foo.hdfs.filePrefix=log
foo.sinks.sink-foo.hdfs.fileSuffix=.avro
foo.sinks.sink-foo.hdfs.inUsePrefix=_
foo.sinks.sink-foo.serializer=avro_event
foo.sinks.sink-foo.serializer.compressionCodec=snappy

Page 34: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

HDFS writing…

drwxr-x---   - flume flume     0 2014-02-16 17:04 /data/2014/02/16/23
-rw-r-----   3 flume flume     0 2014-02-16 17:04 /data/2014/02/16/23/_log.1392591607925.avro.tmp
-rw-r-----   3 flume flume  1877 2014-02-16 17:01 /data/2014/02/16/23/log.1392591607923.avro
-rw-r-----   3 flume flume  1955 2014-02-16 17:02 /data/2014/02/16/23/log.1392591607924.avro
-rw-r-----   3 flume flume  2390 2014-02-16 17:04 /data/2014/02/16/23/log.1392591798436.avro

• The zero-length .tmp file is the current file. You won’t see its real size until it closes (just like when you do a hadoop fs -put)

• Use …hdfs.inUsePrefix=_ to prevent open files from being included in MapReduce jobs

Page 35: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Event Serializers

• Defines how the Event gets written out by the Sink

• Just the body as a UTF-8 String

agent.sinks.foo-sink.serializer=text

• Headers and Body as UTF-8 String

agent.sinks.foo-sink.serializer=header_and_text

• Avro (Flume record Schema)

agent.sinks.foo-sink.serializer=avro_event

• Custom (none of the above meets your needs)

Page 36: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Lessons Learned

Page 37: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Too Many… [xkcd comic about the proliferation of date formats]

Source: https://xkcd.com/1179/

Page 38: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Timezones are Evil

• Daylight saving time causes problems twice a year (in Spring: no 2am hour. In Fall: twice the data during the 2am hour — 02:15? Which one?)

• Date processing in MapReduce jobs: Hourly jobs, filters, etc.

• Dated paths: hdfs://NN/data/%Y/%m/%d/%H

• Use UTC: -Duser.timezone=UTC

• Use one of the ISO8601 formats like 2014-02-26T18:00:00.000Z

• Sorts the way you usually want

• Every time library supports it* - and if not, easy to parse.
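One place to apply this to the agent itself is JAVA_OPTS in conf/flume-env.sh (a sketch; the file's location varies by packaging):

# conf/flume-env.sh
export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"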

Page 39: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Generally Speaking…

• Async handoff doesn’t work under load when bad stuff happens

[Diagram: a writer and a reader decoupled by a buffer (a filesystem, queue, database, or whatever). The buffer is Not ∞.]

Pages 40-47: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Async Handoff Oops

[Animation over eight slides: a Flume Agent runs "tail -F foo.log" while log rotation repeatedly renames foo.log to foo.log.1 and foo.log.2 and a new foo.log appears. When rotation outruns the reader, a rotated file is removed before tail reaches its data, and those events are silently lost.]

Page 48: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Don’t Use Tail

• Tailing a file for input is bad - assumptions are made that aren’t guarantees.

• Direct support removed during Flume rewrite

• Handoff can go bad with files: when the writer is faster than the reader

• With a Queue: when the reader doesn’t read before the expire time

• No way to apply “back pressure” to tell tail there is a problem. It isn’t listening…

Page 49: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

What can I use?

• If you can’t use the log4j Avro Appender…

• Use logrotate to move old logs to “spool” directory

• SpoolingDirectorySource

• Finally, a cron job to remove .COMPLETED files (for delayed delete) OR set deletePolicy=immediate

• Alternatively, use logrotate with avro-client? (probably other ways too…)
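A minimal SpoolingDirectorySource sketch with immediate deletes (the directory path is illustrative):

agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/flume-spool
agent.sources.spool.deletePolicy = immediate
agent.sources.spool.channels = c1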

Page 50: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

RAM or Disk Channels?

Source:http://blog.scoutapp.com/articles/2011/02/10/understanding-disk-i-o-when-should-you-be-worried

Page 51: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Duplicate Events

• Transactions only at Agent level

• You may see Events more than once

• Distributed Transactions are expensive

• Just deal with it in the query/scrub phase — much less costly than trying to prevent it from happening

Page 52: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Late Data

• Data could be “late”/delayed

• Outages

• Restarts

• Act of Nature

• Only sure thing is a “database” — single write + ACK

• Depending on your monitoring, it could be REALLY LATE.

Page 53: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Monitoring

• Know when it breaks so you can fix it before you can’t ingest new data (and it is lost)

• This time window is small if volume is high

• Flume Monitoring still WIP, but hooks are there
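For example, the built-in HTTP JSON reporting can be switched on with system properties at startup; a sketch (the port is arbitrary):

$ bin/flume-ng agent -n agent -c conf -f conf/sample.conf \
    -Dflume.monitoring.type=http \
    -Dflume.monitoring.port=41414

# Channel/source/sink counters come back as JSON:
$ curl http://localhost:41414/metrics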

Page 54: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Other Operational Concerns

• resource utilization - number of open files when writing (file descriptors), disk space used for file channel, disk contention, disk speed*

• number of inbound and outbound sockets - may need to tier (Avro Source/Sink)

• minimize hops if possible - another place for data to get stuck

Page 55: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Not everything is a nail

• Flume is great for handling individual records

• What if you need to compute an average?

• Get a Stream Processing system

• Storm (Twitter’s)

• Samza (LinkedIn’s)

• Others…

• Flume can co-exist with these — use most appropriate tool

Page 56: Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014

Questions? …and thanks!

Slides @ http://slideshare.net/bacoboy