Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB

1 Stat, 2 Stat, 3 Stat, A Trillion
Cody A. Ray, Dev-Ops @ BrightTag

Description

Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB. Many startups collect and display stats and other time-series data for their users. A supposedly simple NoSQL option such as MongoDB is often chosen to get started... which soon becomes 50 distributed replica sets as volume increases. This talk describes how we designed a scalable distributed stats infrastructure from the ground up. KairosDB, a rewrite of OpenTSDB built on top of Cassandra, provides a solid foundation for storing time-series data. Unfortunately, it has some limitations: millisecond time granularity and a lack of atomic upsert operations, which make counting (critical to any stats infrastructure) a challenge. Additionally, running KairosDB atop Cassandra inside AWS brings its own set of challenges, such as managing Cassandra seeds and AWS security groups as you grow or shrink your Cassandra ring. In this deep-dive talk, we explore how we've used a mix of open-source and in-house tools to tackle these challenges and build a robust, scalable, distributed stats infrastructure.

Transcript of Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB

Page 1: Building a Scalable Distributed Stats Infrastructure with Storm and KairosDB

1 Stat, 2 Stat, 3 Stat, A Trillion

Cody A. Ray, Dev-Ops @ BrightTag

Page 2

Outline
1. Initial Attempt: MongoDB
2. Ideal Stats System: KairosDB?
3. Making KairosDB Work for Us

Page 3

What Kind of Stats? Counting!

sum, min, max, etc.

Any recurrence relation: y_n = f(x, y_0, …, y_{n-1})
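In code, each of these stats is just a fold over the incoming values, carrying the previous result forward. A small illustrative sketch (not from the talk):

```python
from functools import reduce

# Each stat is a recurrence y_n = f(x_n, y_{n-1}): feed the next
# observation x and the previous accumulator y to get the new value.
def count(acc, x):
    return acc + 1

def total(acc, x):
    return acc + x

def minimum(acc, x):
    return x if acc is None or x < acc else acc

def run_stat(f, xs, y0):
    """Fold the recurrence f over the stream xs, starting from y0."""
    return reduce(f, xs, y0)

values = [5, 3, 8, 1]
print(run_stat(count, values, 0))      # 4 observations
print(run_stat(total, values, 0))      # sum = 17
print(run_stat(minimum, values, None)) # min = 1
```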

Page 4

The First Pass: MongoDB
● JSON Documents, Schema-less, Flexible
● Aggregation Pipeline, MapReduce
● Master-Slave Replication
● Atomic Operators!

Page 5

http://fearlessdeveloper.com/race-condition-java-concurrency/

[Race condition diagram: two clients increment the same counter concurrently]

Client A: read counter (counter = 0)
Client B: read counter (counter = 0)
Client A: increment value by 1
Client B: increment value by 1
Client A: write value to counter (counter = 1)
Client B: write value to counter (counter = 1)

Incorrect value of counter = 1 (two increments, but the stored count is 1)
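A minimal, deterministic sketch (hypothetical, for illustration) of that lost update: two clients each perform a non-atomic read-increment-write, and one increment disappears:

```python
class Counter:
    """A store without atomic increments: clients must read, modify, write."""
    def __init__(self):
        self.value = 0

store = Counter()

# Both clients read before either writes -- the interleaving shown above.
a = store.value          # client A reads 0
b = store.value          # client B reads 0
store.value = a + 1      # client A writes 1
store.value = b + 1      # client B also writes 1, clobbering A's update

assert store.value == 1  # should be 2: one increment was lost
```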

Page 6
Page 7
Page 8
Page 9

Simple, Right? What’s the Problem?

Only 3500 writes/second! (m1.large)

up to 7000 wps (with m1.xlarge)

Page 10

Scale Horizontally?

Page 11

Redundancy → Mongo Explosion!!!

Page 12
Page 13

Feel the Pain
● Scale 3x. 3x != x. Big-O be damned.
● Managing 50+ Mongo replica sets globally
● Tens of thousands of dollars "wasted" each year

Page 14

Ideal Stats System?
● Linearly scalable time-series database
● Store arbitrary metrics and metadata
● Support aggregations and other complex queries
● Bonus points for:
  o good for storing both application and system metrics
  o Graphite web integration

Page 15

Enter KairosDB
● "fast distributed scalable time series" db
● General metric storage and retrieval
● Based upon Cassandra
  o linearly scalable
  o tuned for fast writes
  o eventually consistent, tunable replication

Page 16

Adding Data

[
  {
    "name": "archive_file_tracked",
    "datapoints": [[1359788400000, 123]],
    "tags": {
      "host": "server1",
      "data_center": "DC1"
    }
  }
]
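Posting such a datapoint from Python might look like the sketch below. The /api/v1/datapoints endpoint is KairosDB's documented ingest path; the base URL is a placeholder.

```python
import json
import urllib.request

def build_datapoint(name, timestamp_ms, value, tags):
    """Build a KairosDB ingest payload: a list of metric objects."""
    return [{"name": name,
             "datapoints": [[timestamp_ms, value]],
             "tags": tags}]

payload = build_datapoint("archive_file_tracked", 1359788400000, 123,
                          {"host": "server1", "data_center": "DC1"})

def post_datapoints(payload, base_url="http://localhost:8080"):
    """POST the payload to KairosDB (host/port here are assumptions)."""
    req = urllib.request.Request(
        base_url + "/api/v1/datapoints",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req).status
```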

Page 17

Querying Data

{
  "start_absolute": 1357023600000,
  "end_absolute": 1357455600000,
  "metrics": [{
    "name": "abc.123",
    "tags": {
      "host": ["foo", "foo2"],
      "type": ["bar"]
    },
    "aggregators": [{
      "name": "sum",
      "sampling": { "value": 10, "unit": "minutes" }
    }]
  }]
}
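Assembling the same query programmatically keeps the nesting straight. An illustrative sketch (field names follow KairosDB's query API; the helper is our own):

```python
def build_query(start_ms, end_ms, metric, tags=None, aggregators=None):
    """Assemble a KairosDB query payload for one metric."""
    m = {"name": metric}
    if tags:
        m["tags"] = tags
    if aggregators:
        m["aggregators"] = aggregators
    return {"start_absolute": start_ms,
            "end_absolute": end_ms,
            "metrics": [m]}

# Sum datapoints into 10-minute buckets, as on the slide.
query = build_query(
    1357023600000, 1357455600000, "abc.123",
    tags={"host": ["foo", "foo2"], "type": ["bar"]},
    aggregators=[{"name": "sum",
                  "sampling": {"value": 10, "unit": "minutes"}}])
```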

Page 18

The Catch(es)
● Lack of atomic operations
  o + millisecond time granularity
● Bad support for high-cardinality "tags"
● Headache managing Cassandra in AWS

Page 19

The Catch(es)
● Lack of atomic operations
  o + millisecond time granularity
● Bad support for high-cardinality "tags"
● Headache managing Cassandra in AWS

Page 20

Cassandra on AWS

Page 21

Agathon

Page 22

The Catch(es)
● Lack of atomic operations
  o + millisecond time granularity
● Bad support for high-cardinality "tags"
● Headache managing Cassandra in AWS

Page 23

Cassandra Schema

http://prezi.com/ajkjic0jdws3/kairosdb-cassandra-schema/

Page 24

Cassandra Schema

http://prezi.com/ajkjic0jdws3/kairosdb-cassandra-schema/

Page 25

Custom Data

[
  {
    "name": "archive_file_tracked",
    "datapoints": [[1359788400000, "value,metadata,...", "string"]],
    "tags": {
      "host": "server1",
      "data_center": "DC1"
    }
  }
]

https://github.com/proofpoint/kairosdb/tree/feature/custom_data

Page 26

Custom Data

[
  {
    "name": "archive_file_tracked",
    "datapoints": [[1359788400000, "value,metadata,...", "string"]],
    "tags": {
      "host": "server1",
      "data_center": "DC1"
    }
  }
]

https://github.com/proofpoint/kairosdb/tree/feature/custom_data

Page 27

The Catch(es)
● Lack of atomic operations
  o + millisecond time granularity
● Bad support for high-cardinality "tags"
● Headache managing Cassandra in AWS

Page 28

Pieces of the Solution
● Shard the data
  o avoids concurrency race conditions
● Pre-aggregation
  o solves time-granularity issue
● Stream processing, exactly-once semantics

Page 29

Queue/Worker Stream Processing

https://www.youtube.com/watch?v=bdps8tE0gYo

Page 30

Enter Storm/Trident

Page 31

[Stats pipeline architecture diagram: App Server 1 and App Server 2 each ship events via suro into Kafkas; Storms, coordinated by Zoos, consume the stream and write to both Mongos (Stats 1.5) and Kairoses (Stats 2.0), the two pipelines running side by side.]

Page 32

[Topology diagram, layer by layer:]
● Kafka Layer: Kafka Brokers, one Partition each
● Spout Layer: Kafka Spouts reading the partitions (round-robin?)
● Transform Layer: Transforms, fed via shuffle()
● Persistence Layer: 30s and 30m Writer Bolts, fed via groupBy(timeRange, metric, tags)
● KairosDB Layer: KairosDB Cluster, fed via round-robin (haproxy)

Page 33
Page 34
Page 35
Page 36
Page 37
Page 38

Pieces of the Solution
● Shard the data
  o avoids concurrency race conditions
● Pre-aggregation
  o solves time-granularity issue
● Stream processing, exactly-once semantics

Page 39

Pieces of the Solution
● Shard the data
  o avoids concurrency race conditions
● Pre-aggregation
  o solves time-granularity issue
● Stream processing, exactly-once semantics

Page 40
Page 41

Pieces of the Solution
● Shard the data
  o avoids concurrency race conditions
● Pre-aggregation
  o solves time-granularity issue
● Stream processing, exactly-once semantics

Page 42
Page 43

Pieces of the Solution
● Shard the data
  o avoids concurrency race conditions
● Pre-aggregation
  o solves time-granularity issue
● Stream processing, exactly-once semantics

Page 44

Non-Transactional: 123
Transactional: "[9, 123]" (txid, value)
Opaque Transactional: "[9, 123, 120]" (txid, value, previous value)
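The opaque-transactional triple stores the current batch id, the new value, and the value before this batch, which lets the state absorb batch replays without double-counting. A sketch of the update rule (following Trident's opaque-state logic; the function name is illustrative):

```python
def opaque_update(state, batch_txid, delta):
    """Apply a (possibly replayed) batch delta to an opaque value.

    state is (txid, value, prev): the last batch applied, the current
    total, and the total before that batch was applied.
    """
    txid, value, prev = state
    if batch_txid == txid:
        # Replay of the same batch: recompute from prev instead of
        # adding on top of value a second time.
        return (txid, prev + delta, prev)
    # A new batch: the current value becomes the new prev.
    return (batch_txid, value + delta, value)

state = (9, 123, 120)                # the "[9, 123, 120]" from the slide
state = opaque_update(state, 9, 3)   # batch 9 replayed: still 123
assert state == (9, 123, 120)
state = opaque_update(state, 10, 7)  # new batch 10
assert state == (10, 130, 123)
```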

Page 45

Transactional Tags

[
  {
    "name": "archive_file_tracked",
    "datapoints": [[1359788400000, 123]],
    "tags": {
      "txid": 9,
      "prev": 120
    }
  }
]

Page 46
Page 47
Page 48
Page 49

The Catch(es)
● Lack of atomic operations
  o + millisecond time granularity
● Bad support for high-cardinality "tags"
● Headache managing Cassandra in AWS

Page 50

Does It Work?

… the counts still match! (whew)

Page 51

Average latency remains < 10 seconds

Page 52

Stats 1.0 vs Stats 1.5 Performance

Replacing 9 Mongo replica sets with 2

Page 53

Cody A. Ray, [email protected]

Open Source: github.com/brighttag
Slides: bit.ly/gluecon-stats

Page 54

The following slides weren't presented at Gluecon. You may find them interesting anyway. :)

Page 55

Trident → Storm Topology Compilation

[Diagram: a Trident flow of spout, each, shuffle, group by, and persistent aggregate operations (producing TridentStates) compiles down to a Storm topology of a Spout and Bolts.]

Page 56
Page 57


Page 58

Tuning Rules

1. Number of workers should be a multiple of number of machines

2. Number of partitions should be a multiple of spout parallelism

3. Parallelism should be a multiple of number of workers

4. Persistence parallelism should be equal to the number of workers
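These rules are easy to sanity-check in code before deploying a topology. A sketch (the function and its parameter names are our own, not a Storm API):

```python
def check_tuning(machines, workers, partitions, spout_parallelism,
                 parallelism, persistence_parallelism):
    """Return the list of tuning rules the configuration violates."""
    violations = []
    if workers % machines != 0:
        violations.append("workers not a multiple of machines")        # rule 1
    if partitions % spout_parallelism != 0:
        violations.append("partitions not a multiple of spout parallelism")  # rule 2
    if parallelism % workers != 0:
        violations.append("parallelism not a multiple of workers")     # rule 3
    if persistence_parallelism != workers:
        violations.append("persistence parallelism != workers")        # rule 4
    return violations

# e.g. 4 machines, 8 workers, 16 partitions, 8 spouts, 16-way transforms
assert check_tuning(4, 8, 16, 8, 16, 8) == []
```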

Page 59

[Diagram: a Batch from Kafka of (ts, metric)-keyed values is grouped by (ts, metric); each group performs a multi get of current values (e.g. (ts1, metric1), (ts2, metric2), (ts2, metric3)), runs a reducer/combiner, then a multi put of the merged values into the Persistent Aggregate.]

http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/
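In code, this persistentAggregate pattern boils down to: group the batch by key, multi-get current values from the backing state, combine, multi-put. A dict-backed sketch (method names echo Trident's IBackingMap, but this is illustrative, with KairosDB standing behind the state in the talk):

```python
from collections import defaultdict

class DictState:
    """Stand-in for the persistent store (KairosDB in the talk)."""
    def __init__(self):
        self.data = {}

    def multi_get(self, keys):
        return [self.data.get(k, 0) for k in keys]

    def multi_put(self, keys, vals):
        self.data.update(zip(keys, vals))

def persistent_aggregate(state, batch):
    """batch: iterable of ((ts, metric), value) tuples."""
    grouped = defaultdict(int)
    for key, value in batch:          # group by (ts, metric), pre-sum
        grouped[key] += value
    keys = list(grouped)
    current = state.multi_get(keys)   # one read per key per batch
    merged = [c + grouped[k] for k, c in zip(keys, current)]
    state.multi_put(keys, merged)     # one write per key per batch

state = DictState()
persistent_aggregate(state, [(("ts1", "m1"), 2), (("ts2", "m2"), 5),
                             (("ts1", "m1"), 1)])
assert state.data[("ts1", "m1")] == 3
persistent_aggregate(state, [(("ts1", "m1"), 4)])
assert state.data[("ts1", "m1")] == 7
```

Grouping before the multi get is what keeps the read/write volume proportional to distinct keys per batch rather than to raw tuple count.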

Page 60

[Diagram: the same multi get → reducer/combiner → multi put flow, shown per (ts, metric) group of the Batch from Kafka feeding the Persistent Aggregate.]

http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/

Page 61

[Diagram: detail of one multi get / reducer-combiner / multi put step, distinguishing which (ts, metric) values come from the batch and which come from the underlying persistent state before being merged and written back.]

http://svendvanderveken.wordpress.com/2013/07/30/scalable-real-time-state-update-with-storm/