Fluentd and Embulk Game Server 4
Masahiro Nakagawa, Apr 18, 2015
Game Server meetup #4
Fluentd / Embulk: for reliable transfer
Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> Living at OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of several meetups (Presto, DTM, etc…)
> etc…
Structured logging!
Reliable forwarding!
Pluggable architecture
http://fluentd.org/
What’s Fluentd?
> Data collector for unified logging layer
> Streaming data transfer based on JSON
> Written in Ruby
> Gem-based various plugins
> http://www.fluentd.org/plugins
> Working in production
> http://www.fluentd.org/testimonials
Background
Data Analytics Flow
Data source → Collect → Store → Process → Visualize → Reporting / Monitoring
Data Analytics Flow (who covers what)
> Store / Process: Cloudera, Hortonworks, Treasure Data
> Visualize: Tableau, Excel, R (easier & shorter time)
> Collect: ???
TD Service Architecture
Time to Value: Acquire → Store → Analyze → Result Push (send query result)
> Data sources: Web Log, App Log, Sensor, CRM, ERP, RDBMS, POS
> Acquire: Treasure Agent (Server) and SDKs (JS, Android, iOS, Unity) via the Streaming Collector; Bulk Uploader (Embulk, TD Toolbelt)
> Store: Plazma DB, a flexible, scalable, columnar storage (@AWS or @IDCF)
> Analyze: SQL-based query (batch / reliability, ad-hoc / low latency) via REST API and ODBC / JDBC (SQL, Pig)
> Result Push: KPI Dashboard, BI Tools (Metric Insights, Tableau, Motion Board, etc.), other products (RDBMS, Google Docs, AWS S3, FTP Server, etc.)
Connectivity / Economy & Flexibility / Simple & Supported
Dive into Concept
Divide & Conquer & Retry
[Diagram: a batch is divided into a stream of chunks; each chunk retries independently on error]
[Diagram, Before: applications on Server1, Server2, Server3, … send logs to a central log server in daily batches. High latency! Must wait for a day…]
[Diagram, After: a local Fluentd on each server forwards logs to aggregator Fluentd nodes. In streaming!]
Why JSON / MessagePack? (1)
> Schema on Write (traditional MPP DB): write data using a schema to improve query performance
> Pros: minimum query overhead
> Cons:
> Need to design schema and workload beforehand
> Data load is an expensive operation
Why JSON / MessagePack? (2)
> Schema on Read (Hadoop): write data without a schema and map a schema at query time
> Pros:
> Robust against schema and workload changes
> Data load is a cheap operation
> Cons: high overhead at query time
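The trade-off can be sketched in a few lines of Ruby (illustrative names, not Fluentd or Treasure Data code): with schema on read, raw JSON lines load as-is, and a schema is applied only when a query runs.

```ruby
require 'json'

# Schema on Read, sketched: raw events are stored as-is (JSON lines) and a
# schema is applied only at query time. All names here are illustrative.
RAW_LOG = [
  '{"user":"alice","age":"28","note":"first visit"}',
  '{"user":"bob"}'   # a field is missing -- the load still succeeds
]

# "Query-time" schema: pick fields and coerce types on read (the overhead
# moves from load time to query time).
def query(raw, schema)
  raw.map do |line|
    record = JSON.parse(line)
    schema.map { |field, type| [field, record[field] && Kernel.send(type, record[field])] }.to_h
  end
end

rows = query(RAW_LOG, { 'user' => :String, 'age' => :Integer })
# rows == [{"user"=>"alice", "age"=>28}, {"user"=>"bob", "age"=>nil}]
```

Loading never fails on a missing or extra field; the cost is the per-row parse and coercion at every query.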
Features
Core / Plugins
Core (common concerns):
> Divide & Conquer
> Buffering & Retrying
> Error handling
> Message routing
> Parallelism
Plugins (use-case specific):
> Read / receive data
> Parse data
> Filter data
> Buffer data
> Format data
> Write / send data
Event structure (log message)
✓ Time
> second unit by default
> from the data source
✓ Tag
> for message routing
> where is it from?
✓ Record
> JSON format
> MessagePack internally
> schema-free
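The three parts above can be shown as a plain Ruby value (a conceptual sketch of the event triple, with JSON standing in for the MessagePack wire format; the tag, time and record contents are illustrative):

```ruby
require 'json'

# One Fluentd event, conceptually: a (tag, time, record) triple.
tag    = 'apache.access'                   # routing key: "where is it from?"
time   = 1_429_351_200                     # seconds by default, from the data source
record = { 'host' => '127.0.0.1', 'method' => 'GET' }  # schema-free record

event = [tag, time, record]                # JSON here; MessagePack internally
puts JSON.generate(event)
# => ["apache.access",1429351200,{"host":"127.0.0.1","method":"GET"}]
```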
Architecture (v0.12 or later)
Engine (not pluggable) drives the pluggable components:
> Input: Forward, File tail, ...
> Parser / Formatter
> Filter: grep, record_transformer, …
> Output: Forward, File, ...
> Buffer: File, Memory
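The flow through these components can be sketched in miniature (illustrative Ruby, not Fluentd's plugin API): the engine routes each event from an input through the filters into an output's buffer, which flushes in chunks.

```ruby
# A toy v0.12-style pipeline: Input -> Filter -> Buffer -> Output,
# driven by a non-pluggable engine loop. Not Fluentd's real API.
input   = -> { [['apache.access', 1_429_351_200, { 'code' => 500 }]] }
filters = [->(tag, time, record) {
  record.merge('severity' => record['code'] >= 500 ? 'error' : 'info')
}]
buffer  = []                                   # events are buffered, then flushed as a chunk
output  = ->(chunk) { chunk.each { |tag, _t, r| puts "#{tag} #{r['severity']}" } }

input.call.each do |tag, time, record|         # engine: route every event
  filters.each { |f| record = f.call(tag, time, record) }
  buffer << [tag, time, record]
end
output.call(buffer)                            # flush; prints "apache.access error"
```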
Configuration and operation
> Apache-like syntax
> No central / master node
> @include helps configuration sharing
> Operation depends on your environment
> Use your daemon / deploy tools
> We use Chef at Treasure Data
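A small sketch of the Apache-like syntax with @include-based sharing (the file path, port and tag below are illustrative, not from the talk):

```
# pull in settings shared across nodes -- no master node needed
@include /etc/fluent/shared.conf

<source>
  @type forward
  port 24224
</source>

<match app.**>
  @type forward
  <server>
    host aggregator.example.com
  </server>
</match>
```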
How to use
Setup fluentd (e.g. Ubuntu)
$ apt-get install ruby
$ gem install fluentd
$ edit fluent.conf
$ fluentd -c fluent.conf
http://docs.fluentd.org/articles/faq#w-what-version-of-ruby-does-fluentd-support
Treasure Agent (td-agent)
> Treasure Data distribution of Fluentd
> includes Ruby, popular plugins, etc.
> Treasure Agent 2 is the current stable
> v2 is recommended over v1
> rpm, deb and dmg packages
> Latest version is 2.2.0, bundling Fluentd v0.12
Setup td-agent
$ curl -L http://toolbelt.treasuredata.com/sh/install-redhat-td-agent2.sh | sh
$ edit /etc/td-agent/td-agent.conf
$ sudo service td-agent start
See: http://docs.fluentd.org/categories/installation
Apache to Mongo
Web Server → tail → Fluentd (event buffering, routing) → insert → MongoDB

Raw access log:
127.0.0.1 - - [11/Dec/2014:07:26:27] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:30] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:32] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:26:40] "GET / ...
127.0.0.1 - - [11/Dec/2014:07:27:01] "GET / ...
...

Parsed event:
2014-02-04 01:33:51 apache.log { "host": "127.0.0.1", "method": "GET", ... }
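Conceptually, in_tail with `format apache` turns each raw line into a structured record like the one above. A simplified Ruby sketch of that step (not Fluentd's actual parser, which handles the full combined log format):

```ruby
require 'json'

# Parse one Apache-style access log line into a schema-free record,
# the way a tail input + apache parser does conceptually.
LINE = '127.0.0.1 - - [11/Dec/2014:07:26:27] "GET / HTTP/1.1" 200 777'
PATTERN = %r{^(?<host>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+)[^"]*" (?<code>\d+)}

m = LINE.match(PATTERN)
record = { 'host' => m[:host], 'method' => m[:method], 'path' => m[:path], 'code' => m[:code].to_i }
puts JSON.generate(record)
# => {"host":"127.0.0.1","method":"GET","path":"/","code":200}
```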
Plugins - use rubygems
$ fluent-gem search -rd fluent-plugin
$ fluent-gem search -rd fluent-mixin
$ fluent-gem install fluent-plugin-mongo

In td-agent: /usr/sbin/td-agent-gem install fluent-plugin-mongo
# receive events via HTTP
<source>
  @type http
  port 8888
</source>

# read logs from a file
<source>
  @type tail
  path /var/log/httpd.log
  format apache
  tag apache.access
</source>

# save access logs to MongoDB
<match apache.access>
  @type mongo
  database apache
  collection log
</match>
# save alerts to a file
<match alert.**>
  @type file
  path /var/log/fluent/alerts
</match>

# forward other logs to servers
<match **>
  @type forward
  <server>
    host 192.168.0.11
    weight 20
  </server>
  <server>
    host 192.168.0.12
    weight 60
  </server>
</match>

@include http://example.com/conf
Filter
> Apply a filtering routine to the event stream
> No more tag tricks!

Before (v0.10):
<match access.**>
  @type record_reformer
  tag reformed.${tag}
</match>

<match reformed.**>
  @type growthforecast
</match>

After (v0.12):
<filter access.**>
  @type record_transformer
  …
</filter>

<match access.**>
  @type growthforecast
</match>
[Diagram: Fluentd in the middle, doing buffering / processing / routing]
> Inputs: access logs (Apache, frontend), app logs and system logs (syslogd, backend), databases
> Outputs: alerting (Nagios), analysis (MongoDB or Embulk, Hadoop), archiving (Amazon S3), MySQL
M x N → M + N
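The M x N claim in numbers (the figures for M and N below are illustrative, not from the talk):

```ruby
# With M log sources and N destinations wired point-to-point you maintain
# M * N integrations; with Fluentd as the hub, each side only talks to the
# hub, so the count drops to M + N.
m, n = 4, 5                # e.g. 4 log types, 5 storage systems
direct  = m * n            # point-to-point integrations to maintain
via_hub = m + n            # connections through Fluentd
puts "direct=#{direct} via_hub=#{via_hub}"   # => direct=20 via_hub=9
```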
Roadmap
> v0.10 (old stable)
> v0.12 (current stable)
> Filter / Label / At-least-once
> v0.14 (spring - early summer, 2015)
> New plugin APIs, ServerEngine, Time…
> v1 (summer - fall, 2015)
> Fix new features / APIs
https://github.com/fluent/fluentd/wiki/V1-Roadmap
Use-cases
Simple forwarding
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag backend.apache
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>
# store logs to MongoDB
<match backend.*>
  type mongo
  database fluent
  collection test
</match>
# Ruby
Fluent.open("myapp")
Fluent.event("login", {"user" => 38})
#=> 2014-12-11 07:56:01 myapp.login {"user":38}
Client libraries
> Ruby, Java, Perl, PHP, Python, D, Scala, ...
Less Simple Forwarding
> At-most-once / At-least-once
> HA (failover)
> Load-balancing
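A sketch of a forward output covering all three concerns; require_ack_response (Fluentd v0.12) upgrades delivery from at-most-once to at-least-once, and the hostnames are illustrative:

```
<match **>
  type forward
  require_ack_response     # at-least-once; omit for at-most-once
  <server>                 # load-balancing by weight
    host 192.168.0.11
    weight 60
  </server>
  <server>
    host 192.168.0.12
    weight 60
  </server>
  <server>                 # HA: promoted only when the others fail
    host 192.168.0.13
    standby
  </server>
</match>
```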
Near-realtime and batch combo: hot data to Elasticsearch, all data to HDFS
# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  pos_file /tmp/pos_file
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>
# store logs to ES and HDFS
<match web.*>
  type copy
  <store>
    type elasticsearch
    logstash_format true
  </store>
  <store>
    type webhdfs
    host namenode
    port 50070
    path /path/on/hdfs/
  </store>
</match>
CEP for Stream Processing
Norikra is a SQL-based CEP engine: http://norikra.github.io/
Container Logging
> Kubernetes
> Google Compute Engine
> https://cloud.google.com/logging/docs/install/compute_install
Fluentd on Kubernetes / GCE
Treasure Data's own metrics:
> Frontend, Job Queue, Worker, Hadoop and Presto push metrics to Fluentd (via local Fluentd)
> Fluentd sums up data per minute (partial aggregation)
> Datadog for realtime monitoring
> Treasure Data for historical analysis
Cookpad
✓ Over 100 RoR servers (2012/2/4)
> Hundreds of app servers (Rails app + local td-agent) send event logs to aggregator td-agents
> Destinations: Google Spreadsheet, Treasure Data, MySQL
> Logs are available after several minutes
> Daily/hourly batch for KPI visualization and feedback rankings
✓ Unlimited scalability
✓ Flexible schema
✓ Realtime
✓ Less performance impact
Slideshare
http://engineering.slideshare.net/2014/04/skynet-project-monitor-scale-and-auto-heal-a-system-in-the-cloud/
Log Analysis System and Its Designs in LINE Corp. (early 2014)
LINE Business Connect
http://developers.linecorp.com/blog/?p=3386
Eco-system
fluent-bit
> Made for Embedded Linux
> OpenEmbedded & Yocto Project
> Intel Edison, Raspberry Pi & BeagleBone Black boards
> https://github.com/fluent/fluent-bit
> Standalone application or library mode
> Built-in plugins (input: cpu, kmsg; output: fluentd)
> First release at the end of Mar 2015
fluentd-forwarder
> Forwarding agent written in Go
> Focused on log forwarding to Fluentd
> Works on Windows
> Bundles TCP input/output and TD output
> No flexible plugin mechanism; plans to add more inputs/outputs
> Similar products: fluent-agent-lite, fluent-agent-hydra, ik
fluentd-ui
> Manage Fluentd instances via a Web UI
> https://github.com/fluent/fluentd-ui
The problems at Treasure Data
> Treasure Data Service runs on the cloud
> Customers want to try Treasure Data, but
> SEs write scripts to bulk load their data. Hard work :(
> Customers want to migrate their big data, but
> Hard work :(
> Fluentd solved streaming data collection, but
> bulk data loading is another problem
Embulk
> Bulk Loader version of Fluentd
> Pluggable architecture (JRuby, JVM languages)
> High performance parallel processing
> Share your script as a plugin
> https://github.com/embulk
The problems of bulk load
> Data cleaning (normalization)
> How to normalize broken records?
> Error handling
> How to remove broken records?
> Idempotent retrying
> How to retry without duplicate loading?
> Performance optimization
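One way to get idempotent retries, sketched in Ruby after Embulk's last_path mechanism for file inputs (the file names and helper are illustrative): process inputs in deterministic order and record a high-water mark, so a re-run loads nothing twice.

```ruby
# Idempotent, deterministic bulk load: sort inputs, skip everything up to
# the recorded high-water mark (`last_path`), and advance the mark at the
# end. A repeated or retried run never double-loads a file.
def load_batch(files, last_path)
  todo = files.sort.select { |f| last_path.nil? || f > last_path }
  todo.each do |f|
    # ... decode, validate and load f here ...
  end
  todo.last || last_path          # new high-water mark for the next run
end

files = %w[sample_01.csv.gz sample_02.csv.gz sample_03.csv.gz]
mark = load_batch(files, nil)     # first run loads all three
mark = load_batch(files, mark)    # retry loads nothing; mark is unchanged
puts mark                         # => sample_03.csv.gz
```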
[Diagram: Embulk bulk-loads data between storages via plugins]
> Storages: HDFS, MySQL, Amazon S3, CSV files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis, …
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behaviour
✓ Idempotent retrying
http://www.embulk.org/plugins/
How to use
Setup embulk (e.g. Linux/Mac)
$ curl --create-dirs -o ~/.embulk/bin/embulk -L "http://dl.embulk.org/embulk-latest.jar"
$ chmod +x ~/.embulk/bin/embulk
$ echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
$ source ~/.bashrc
Try example
$ embulk example ./try1
$ embulk guess ./example.yml -o config.yml
$ embulk preview config.yml
$ embulk run config.yml
# install
$ wget http://dl.embulk.org/embulk-latest.jar -O embulk.jar
$ chmod 755 embulk.jar

# guess
$ vi example.yml
$ ./embulk guess example.yml -o config.yml

Guess format & schema (by guess plugins)

Before (example.yml):
in:
  type: file
  path_prefix: /path/to/sample_
out:
  type: stdout

After guess (config.yml):
in:
  type: file
  path_prefix: /path/to/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
out:
  type: stdout
# preview
$ ./embulk preview config.yml
$ vi config.yml  # if necessary

Preview & fix config

+-------------------------+----------+-------------+
| time:timestamp          | uid:long | word:string |
+-------------------------+----------+-------------+
| 2015-01-27 19:23:49 UTC |   32,864 | embulk      |
| 2015-01-27 19:01:23 UTC |   14,824 | jruby       |
| 2015-01-28 02:20:02 UTC |   27,559 | plugin      |
| 2015-01-29 11:54:36 UTC |   11,270 | fluentd     |
+-------------------------+----------+-------------+
Deterministic run

# run
$ ./embulk run config.yml -o config.yml

The output config records the last loaded file (last_path):

exec: {}
in:
  type: file
  path_prefix: /path/to/sample_
  decoders:
  - {type: gzip}
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    skip_header_lines: 1
    columns:
    - {name: id, type: long}
    - {name: account, type: long}
    - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
    - {name: purchase, type: timestamp, format: '%Y%m%d'}
    - {name: comment, type: string}
  last_path: /path/to/sample_001.csv.gz
out:
  type: stdout
Repeat

# repeat: files up to last_path are skipped, so re-runs only load new files
$ ./embulk run config.yml -o config.yml
$ ./embulk run config.yml -o config.yml
Use-cases
Quipper from GDS slide
Other cases
> Treasure Data: Embulk worker for automatic import
> Web services: send existing logs to Elasticsearch
> Business / batch systems: database to database
> etc…
Check: treasuredata.com
Cloud service for the entire data pipeline