How to collect Big Data into Hadoop
Sadayuki Furuhashi (fluentd.org)
How to collect Big Data into Hadoop
Big Data processing to collect Big Data
Self-introduction
> Sadayuki Furuhashi
> Treasure Data, Inc. - Founder & Software Architect
> Open source projects:
  MessagePack - efficient serializer (original author)
  Fluentd - event collector (original author)
Today’s topic
The Big Data pipeline: Collect → Store → Process → Visualize → Report & Monitor

Store & Process: Cloudera, Hortonworks, MapR
Visualize: Tableau, Excel, R

These stages keep getting easier and take less time. How do we shorten the Collect stage?
Problems in collecting data
Poor man’s data collection
1. Copy files from servers using rsync
2. Create a RegExp to parse the files
3. Parse the files and generate a 10GB CSV file
4. Put it into HDFS
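A naive script covering the four steps above might look like this Ruby sketch (the hosts, paths, and regexp are illustrative, not from the deck):

# A naive one-shot collector (sketch)
require 'csv'

system('rsync -a app1:/var/log/httpd/ logs/')            # 1. copy files with rsync
PATTERN = /^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\]/    # 2. hand-written regexp
CSV.open('out.csv', 'w') do |csv|                        # 3. parse into one big CSV
  Dir.glob('logs/**/*.log') do |path|
    File.foreach(path) do |line|
      m = PATTERN.match(line) or next                    #    broken lines silently dropped
      csv << [m[:host], m[:time]]
    end
  end
end
system('hadoop fs -put out.csv /data/access.csv')        # 4. put it into HDFS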
Problems in collecting “big data”
> Includes broken values: needs error handling & retrying
> Time-series data are changing and unclear: parse logs before storing
> Takes time to read/write: tools have to be optimized and parallelized
> Takes time for trial & error: causes network traffic spikes
Problems of poor man’s data collection
> Wastes time implementing error handling
> Wastes time maintaining a parser
> Wastes time debugging the tool
> Not reliable
> Not efficient
Basic theories to collect big data
Divide & Conquer: split big data into small chunks, so an error affects only one chunk.

Divide & Conquer & Retry: when a chunk fails, retry only that chunk until it succeeds.
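In code, divide & conquer & retry boils down to something like this Ruby sketch (upload_chunk is a hypothetical stand-in for any transfer step):

# Retry one failed chunk with exponential backoff (sketch)
def upload_chunk(chunk)
  # stand-in for the real transfer (HTTP POST, HDFS write, ...)
  raise 'transient failure' if rand < 0.3
end

def upload_with_retry(chunk, max_retries = 5)
  retries = 0
  wait = 1.0
  begin
    upload_chunk(chunk)   # send one small chunk, not the whole file
  rescue
    raise if retries >= max_retries
    retries += 1
    sleep wait
    wait *= 2.0           # exponential retry wait
    retry
  end
end

['chunk-1', 'chunk-2', 'chunk-3'].each { |c| upload_with_retry(c) }   # errors affect only one chunk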
Streaming: transfer data continuously in small pieces. Don’t handle big files on the collecting side; handle them where the data is stored and processed.
Apache Flume and Fluentd
Apache Flume
Agents running on each server collect access logs, app logs, system logs, etc. and forward them to Collectors.
Apache Flume - network topology
Flume OG: Agents send to Collectors (send / ack), coordinated by a central Master.
Flume NG: Agents send directly to Collectors with send/ack; components are plugins.
Apache Flume - pipeline
Flume OG: Source → Sink
Flume NG: Source → Channel → Sink
Apache Flume - configuration
In Flume NG, a Master node manages all configuration for Agents and Collectors (optional).
Apache Flume - configuration

# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
Fluentd
Fluentd - network topology
Every node runs the same fluentd daemon: fluentd instances forward events to other fluentd instances with send/ack, covering the roles Flume NG splits into Agents and Collectors. Everything is a plugin.
Fluentd - pipeline
Fluentd: Input → Buffer → Output
(Flume NG: Source → Channel → Sink)
Fluentd - configuration
No central node - keep things simple. Use chef, puppet, etc. for configuration (they do things better).
Fluentd - configuration

<source>
  type forward
  port 24224
</source>

<match **>
  type file
  path /var/log/logs
</match>

Compare with the equivalent Flume NG configuration:

# source
host1.sources = avro-source1
host1.sources.avro-source1.type = avro
host1.sources.avro-source1.bind = 0.0.0.0
host1.sources.avro-source1.port = 41414
host1.sources.avro-source1.channels = ch1

# channel
host1.channels = ch_avro_log
host1.channels.ch_avro_log.type = memory

# sink
host1.sinks = log-sink1
host1.sinks.log-sink1.type = logger
host1.sinks.log-sink1.channel = ch1
Fluentd - Users
Fluentd - plugin distribution platform
$ fluent-gem search -rd fluent-plugin
$ fluent-gem install fluent-plugin-mongo
94 plugins!
Concept of Fluentd
> Customization is essential: small core + many plugins
> The Fluentd core helps you implement plugins: common features are already implemented

Fluentd core: divide & conquer, retrying, parallelizing, error handling, message routing
Plugins: read/receive data, write/send data
Fluentd plugins
in_tail
apache writes access.log → fluentd tails it (in_tail)
✓ read a log file
✓ custom regexp
✓ custom parser in Ruby
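A minimal in_tail configuration might look like this (the path and tag are illustrative):

<source>
  type tail
  path /var/log/apache2/access.log
  tag apache.access
  format apache
</source>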
out_mongo
apache → access.log → fluentd (in_tail) → buffer → MongoDB (out_mongo)
✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
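A sketch of a matching out_mongo configuration (the database, collection, and buffer path are illustrative):

<match apache.access>
  type mongo
  database apache
  collection access
  host localhost
  port 27017

  buffer_type file                       # persist the buffer on a file
  buffer_path /var/log/fluent/mongo.buf
</match>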
out_s3
apache → access.log → fluentd (in_tail) → buffer → Amazon S3 (out_s3)
✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time:
2013-01-01/01/access.log.gz
2013-01-01/02/access.log.gz
2013-01-01/03/access.log.gz
...
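An out_s3 configuration sketch; the bucket, credentials, and paths are placeholders. time_slice_format controls the time-based slicing shown above:

<match apache.access>
  type s3
  aws_key_id YOUR_AWS_KEY_ID
  aws_sec_key YOUR_AWS_SECRET_KEY
  s3_bucket your-log-bucket
  path logs/
  time_slice_format %Y-%m-%d/%H          # one file per hour
  buffer_path /var/log/fluent/s3.buf
</match>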
out_hdfs
apache → access.log → fluentd (in_tail) → buffer → HDFS (out_hdfs)
✓ retry automatically
✓ exponential retry wait
✓ persistent on a file
✓ slice files based on time
✓ custom text formatter
2013-01-01/01/access.log.gz
2013-01-01/02/access.log.gz
2013-01-01/03/access.log.gz
...
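With fluent-plugin-webhdfs, the HDFS output can be configured along these lines (the host and path are illustrative):

<match apache.access>
  type webhdfs
  host namenode.example.com
  port 50070
  path /logs/access/%Y%m%d_%H.log        # sliced by time
</match>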
fluentd → fluentd → fluentd
✓ automatic fail-over
✓ load balancing
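Fail-over and load balancing are configured in out_forward by listing multiple servers (the hostnames are illustrative):

<match **>
  type forward
  <server>
    host log1.example.com
    port 24224
  </server>
  <server>
    host log2.example.com
    port 24224
    standby                              # used only when log1 fails
  </server>
</match>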
Fluentd examples
Fluentd at Treasure Data - REST API logs
Each Rails app on the API servers sends its logs with fluent-logger-ruby to a local fluentd (in_forward); the local fluentds forward (out_forward) to a watch server.
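From the application side, emitting an event with fluent-logger-ruby takes a couple of lines (the tag and record are illustrative):

require 'fluent-logger'

# connect to the local fluentd's in_forward socket
Fluent::Logger::FluentLogger.open('myapp', :host => 'localhost', :port => 24224)

# the tag becomes myapp.access
Fluent::Logger.post('access', {'method' => 'GET', 'path' => '/api/v1/items'})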
Fluentd at Treasure Data - backend logs
Rails apps on the API servers and Ruby apps on the worker servers send logs with fluent-logger-ruby to local fluentds (in_forward), which forward (out_forward) to the fluentd on the watch server.
Fluentd at Treasure Data - monitoring
API servers (Rails apps) hand work to worker servers (Ruby apps) through PerfectQueue; every server logs with fluent-logger-ruby to its local fluentd (in_forward). The watch server collects the streams, with a monitoring script attached via in_exec and out_forward.
Fluentd at Treasure Data - Hadoop logs
A script on the watch server polls the Hadoop JobTracker via a Thrift API call and feeds the results into fluentd through in_exec.
✓ resource consumption statistics for each user
✓ capacity monitoring
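The in_exec source runs a command periodically and turns its output into events; a sketch for the JobTracker-polling script (the command path and interval are hypothetical):

<source>
  type exec
  command /opt/td/bin/jobtracker_stats.rb   # hypothetical script calling the Thrift API
  format json
  tag hadoop.jobtracker
  run_interval 60s
</source>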
Fluentd at Treasure Data - store & analyze
The watch server’s fluentd streams data to Librato Metrics for realtime analysis (out_metricsense, ✓ streaming aggregation) and to Treasure Data for historical analysis (out_tdlog).
Plugin development
class SomeInput < Fluent::Input
  Fluent::Plugin.register_input('myin', self)

  config_param :tag, :string

  def start
    Thread.new {
      while true
        time = Engine.now
        record = {"user" => 1, "size" => 1}
        Engine.emit(@tag, time, record)
      end
    }
  end

  def shutdown
    ...
  end
end

<source>
  type myin
  tag myapp.api.heartbeat
</source>
class SomeOutput < Fluent::BufferedOutput
  Fluent::Plugin.register_output('myout', self)

  config_param :myparam, :string

  def format(tag, time, record)
    [tag, time, record].to_json + "\n"
  end

  def write(chunk)
    puts chunk.read
  end
end

<match **>
  type myout
  myparam foobar
</match>
class MyTailInput < Fluent::TailInput
  Fluent::Plugin.register_input('mytail', self)

  def configure_parser(conf)
    ...
  end

  def parse_line(line)
    array = line.split("\t")
    time = Engine.now
    record = {"user" => array[0], "item" => array[1]}
    return time, record
  end
end

<source>
  type mytail
</source>
Fluentd v11
> Error stream
> Streaming processing
> Better DSL
> Multiprocess