Fluentd and Docker - running fluentd within a docker container
Fluentd meetup #3
Transcript of Fluentd meetup #3
Sadayuki Furuhashi, Treasure Data, Inc., Founder & Software Architect
Collecting app metrics in decentralized systems: decision making based on facts
Self-introduction
> Sadayuki Furuhashi
> Treasure Data, Inc., Founder & Software Architect
> Open source projects:
  MessagePack - efficient serializer (original author)
  Fluentd - event collector (original author)
My Talk
> What's our service?
> What problems did we face?
> How did we solve them?
> What did we learn?
> We open sourced the system
What’s Treasure Data?
Treasure Data provides cloud-based data warehouse as a service.
[Diagram: Treasure Data Service Architecture. Data sources (Apache logs, apps, RDBMS, other sources) feed td-agent (open sourced), which streams into the Treasure Data columnar data warehouse. A query processing cluster runs MapReduce jobs (Hive; Pig to be supported) behind a Query API (JDBC, REST), used via td-command and BI apps.]
Example Use Case – MySQL to TD (before)
[Diagram: hundreds of Rails app servers write logs to text files; a nightly INSERT loads them into sharded MySQL; daily/hourly batch jobs produce KPI visualization and feedback rankings in Google Spreadsheet.]
- Limited scalability
- Fixed schema
- Not realtime
- Unexpected INSERT latency
Example Use Case – MySQL to TD (after)
[Diagram: hundreds of Rails app servers each run td-agent, which sends event logs to Treasure Data; logs are available after several minutes. Daily/hourly batch jobs still produce KPI visualization and feedback rankings, flowing to Google Spreadsheet and MySQL.]
✓ Unlimited scalability
✓ Flexible schema
✓ Realtime
✓ Less performance impact
What’s Treasure Data?
Key differentiators:
> TD delivers BigData analytics
> in days, not months
> without specialists or IT resources
> for 1/10th the cost of the alternatives
Why? Because it’s a multi-tenant service.
Problem 1: investigating problems took time
Customers need support...
> "I uploaded data but can't see it in queries"
> "Downloading query results takes time"
> "Our queries have been taking longer recently"
Investigating these problems took time because:

    doubts.count.times {
      servers.count.times {
        ssh to a server
        grep logs
      }
    }
* The actual facts:
> Actually the data was not uploaded (the client had a problem: disk full).
We ought to have monitored uploads so that we immediately know when we're not getting data from a user.
> Our servers were getting slower because of increasing load.
We ought to have noticed it and added servers before the problem occurred.
> There was a bug which occurred under a specific condition.
We ought to have collected unexpected errors and fixed them as soon as possible, so that both we and our users save time.
Problem 2: many tasks to do but hard to prioritize
We want to...
> fix bugs
> improve performance
> increase the number of sign-ups
> increase the number of queries by customers
> increase the number of periodic queries
What's the "bottleneck" which should be solved first?
data: Performance is getting worse. decision: Let's add servers.
data: Many customers upload data but few customers issue queries. decision: Let's improve the documentation.
data: A customer stopped uploading data. decision: They might have a problem on the client side.
We need data to make decisions.
How did we solve it?
We collected application metrics.
Treasure Data's backend architecture
[Diagram: Frontend, Job Queue, Worker, and Hadoop clusters.]
Solution v1:
[Diagram: Fluentd pulls metrics every minute from the Frontend, Job Queue, Worker, and Hadoop nodes (in_exec plugin), then forwards them to Librato Metrics for realtime analysis and to Treasure Data for historical analysis.]
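The v1 pull model can be sketched as a fluentd.conf fragment like the one below. The script path, keys, and tag are hypothetical placeholders; `command`, `format`, `keys`, `tag`, and `run_interval` are real parameters of the in_exec input plugin, and out_forward is Fluentd's standard forwarding output.

```
# in_exec runs a command periodically and emits its output as events.
<source>
  type exec
  command ruby /opt/monitor/collect_jobqueue_metrics.rb
  format tsv
  keys metric,value
  tag metrics.jobqueue
  run_interval 1m
</source>

# Forward collected metrics to the central aggregator node.
<match metrics.**>
  type forward
  <server>
    host aggregator.example.com
  </server>
</match>
```

Note the structural weakness this config makes visible: every new metric means another `<source>` block on the monitoring server, which is the complexity and SPOF problem described next.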
What's solved
We can monitor the overall behavior of the servers.
We can notice performance degradation.
We can get alerts when a problem occurs.
What's not solved
We can't get detailed information.
> how much data is "this user" uploading?
The configuration file is complicated.
> we need to add lines to declare new metrics
The monitoring server is a SPOF (single point of failure).
Solution v2:
[Diagram: applications push metrics to a local Fluentd, which forwards them upstream; Fluentd sums up the data every minute (partial aggregation) and sends it to Librato Metrics for realtime analysis and to Treasure Data for historical analysis.]
What's solved by v2
We can get detailed information directly from the applications
> graphs for each customer
DRY - we can keep configuration files simple
> just add one line to the apps
> no need to update fluentd.conf
Decentralized streaming aggregation
> partial aggregation on Fluentd, total aggregation on Librato Metrics
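The partial aggregation step can be sketched as follows. This is a toy reimplementation for illustration, not fluent-plugin-metricsense itself: events arriving within the same minute are summed per (metric, segment), so only one record per minute per series goes upstream for total aggregation.

```ruby
# Sum event values into per-minute buckets keyed by (minute, metric, segment).
# In the real pipeline Fluentd would flush these sums upstream every minute.
def partial_aggregate(events)
  events.each_with_object(Hash.new(0)) do |ev, sums|
    minute = ev[:time] / 60  # integer division: one bucket per minute
    sums[[minute, ev[:metric], ev[:segment]]] += ev[:value]
  end
end

events = [
  { time: 0,  metric: 'import.size', segment: 'a001', value: 10 },
  { time: 30, metric: 'import.size', segment: 'a001', value: 22 },
  { time: 90, metric: 'import.size', segment: 'a002', value: 5  },
]
partial_aggregate(events)
# minute 0 of 'import.size'/'a001' sums to 32; minute 1 of 'import.size'/'a002' stays 5
```

Because the sums are associative, each Fluentd node can aggregate its own minute locally and the upstream store only merges pre-summed records, which is what makes the aggregation decentralized.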
API
MetricSense.value(:size => 32)
MetricSense.segment(:account => 1)
MetricSense.fact(:path => '/path1')
MetricSense.measure!
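The MetricSense API shown above can be illustrated with a toy stand-in. This module is an assumption for demonstration, not the real metricsense.gem: it just buffers values, segments, and facts, then flushes them as one event on measure!.

```ruby
# Toy stand-in for the MetricSense client API (not the real gem).
# value/segment/fact accumulate fields; measure! flushes one event.
module MetricSense
  @pending = {}
  @flushed = []  # in the real gem this would be sent to the local Fluentd

  class << self
    attr_reader :flushed

    def value(fields)
      @pending.merge!(fields)
    end

    def segment(fields)
      (@pending[:segments] ||= {}).merge!(fields)
    end

    def fact(fields)
      (@pending[:facts] ||= {}).merge!(fields)
    end

    # Emit the buffered event and start a fresh one.
    def measure!
      @flushed << @pending
      @pending = {}
    end
  end
end

MetricSense.value(:size => 32)
MetricSense.segment(:account => 1)
MetricSense.fact(:path => '/path1')
MetricSense.measure!
```

The DRY property follows from this shape: an application adds one `value`/`measure!` line where the metric originates, and no central configuration has to be touched.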
What did we learn?
> We always have lots of tasks;
  we need data to prioritize them.
> Problems are usually complicated;
  we need data to save time.
> Adding metrics should be DRY;
  otherwise it becomes tedious and you stop adding metrics.
> Realtime analysis is useful, but we still need batch analysis.
  > "who is not issuing queries, despite storing data last month?"
  > "which pages did users look at before sign-up?"
  > "which pages did users not look at before running into trouble?"
We open sourced MetricSense
https://github.com/treasure-data/metricsense
Components of MetricSense
metricsense.gem
> client library for Ruby to send metrics
fluent-plugin-metricsense
> plugin for Fluentd to collect metrics
> pluggable backends:
  > Librato Metrics backend
  > RDBMS backend
RDB backend for MetricSense
Aggregate metrics on an RDBMS in a form optimized for time-series data.
> Borrowed concepts from OpenTSDB and the OLAP cube.
Each data row stores one hour of one series: 60 one-minute values (m0..m59).

data:
  base_time  metric_id  segment_id  m0  m1  m2  ...  m59
  19:00      1          5           25  31  19  ...  21
  21:00      2          5           75  94  68  ...  72
  21:00      2          6           63  82  55  ...  63

metric_tags:
  metric_id  metric_name    segment_name
  1          "import.size"  NULL
  2          "import.size"  "account"

segment_values:
  segment_id  name
  5           "a001"
  6           "a002"
Solution v3 (future work):
Alerting using historical data
> simple machine learning to adjust thresholds
[Chart: measured values diverging from the historical average trigger an alert.]
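One simple way to alert against a historical average, sketched below as an assumption about what such a v3 could look like (not the actual implementation): flag a value as anomalous when it deviates from the mean of recent samples by more than k standard deviations, so the threshold adjusts itself as the history changes.

```ruby
# Flag `value` as anomalous if it deviates from the mean of `history`
# by more than k standard deviations (k is a tunable sensitivity).
def anomaly?(history, value, k: 3.0)
  mean = history.sum.to_f / history.size
  variance = history.sum { |v| (v - mean)**2 } / history.size
  stddev = Math.sqrt(variance)
  (value - mean).abs > k * stddev
end

history = [100, 102, 98, 101, 99, 100, 103, 97]  # mean 100, stddev ~1.87
anomaly?(history, 101)  # => false (within 3 stddevs)
anomaly?(history, 250)  # => true  (far above historical average)
```

The appeal over a fixed threshold is that no constant needs hand-tuning per metric; the per-minute sums already collected by v2 provide the history.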
We're Hiring!
Sales Engineer
Evangelize TD/Fluentd. Get everyone excited!
Help customers deploy and maintain TD successfully.
Preferred experience: OS, DB, BI, statistics and data science
Devops Engineer
Development, operation and monitoring of our large-scale, multi-tenant system
Preferred experience: large-scale system development and management
Competitive salary + equity package
Who we want
STRONG business and customer support DNA
Everyone is equally responsible for customer support. Customer success = our success.
Self-disciplined and responsible. Be your own manager.
Team player with excellent communication skills. Distributed team and global customer base.
Contact me: [email protected]