How we scaled Rudder to 10k, and the road to 50k

How we scaled Rudder to 10k nodes And the road to 50k nodes Nicolas CHARLES Co-founder and COO @nico_charles

Transcript of How we scaled Rudder to 10k, and the road to 50k

Page 1: How we scaled Rudder to 10k, and the road to 50k

How we scaled Rudder to 10k nodes

And the road to 50k nodes

Nicolas CHARLES Co-founder and COO

@nico_charles

Page 2: How we scaled Rudder to 10k, and the road to 50k

Scalability?

Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth

https://en.wikipedia.org/wiki/Scalability

Page 3: How we scaled Rudder to 10k, and the road to 50k

Scalability – why is it an issue in Rudder?

What does Rudder do?
● Users define policies
● Apply them on groups of nodes
● Rudder computes the policies for each node
● Agents apply them, and send back information
● Rudder computes the compliance
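
As a rough illustration of the loop above, the Python sketch below models policies applied to groups, the per-node policy computation, and a compliance ratio. Every name in it (`Policy`, `Group`, `compute_node_policies`, `compliance`) is hypothetical and is not Rudder's actual data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str


@dataclass
class Group:
    name: str
    nodes: set[str]
    policies: list[Policy]


def compute_node_policies(groups: list[Group]) -> dict[str, set[str]]:
    """Server side: derive, for each node, the policies it must apply."""
    per_node: dict[str, set[str]] = {}
    for group in groups:
        for node in group.nodes:
            per_node.setdefault(node, set()).update(p.name for p in group.policies)
    return per_node


def compliance(expected: set[str], reported_ok: set[str]) -> float:
    """Share of the expected policies that the agent reported as applied."""
    if not expected:
        return 1.0
    return len(expected & reported_ok) / len(expected)


groups = [
    Group("all", {"web1", "db1"}, [Policy("ntp")]),
    Group("web", {"web1"}, [Policy("nginx")]),
]
node_policies = compute_node_policies(groups)
# web1 is expected to apply "ntp" and "nginx" but only reported "ntp".
web1_compliance = compliance(node_policies["web1"], {"ntp"})
```

Each step of this loop runs once per node, which is why every one of them becomes a scalability concern as the node count grows.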

Page 4: How we scaled Rudder to 10k, and the road to 50k

Scalability – why is it an issue in Rudder?

Each of these points needs to go fast
● Process node inventories quickly
● Have a fast UI
● Generate policies in a reasonable time
● Have fast agents, and don’t overflow the network
● Compliance of actual state available

Page 5: How we scaled Rudder to 10k, and the road to 50k

Rudder Architecture

Page 6: How we scaled Rudder to 10k, and the road to 50k

Rudder Architecture

[Figure: Rudder architecture. Rudder Server Root: Interfaces (CLI, Web UI, API), Uses, Applications (Compliance, Configuration, Inventory), Plugins, Rudder Engine, Techniques. Nodes: Rudder Agent, Rudder relay.]

Page 7: How we scaled Rudder to 10k, and the road to 50k

The origin of Rudder

● At first, Rudder was designed for hundred(s) of nodes
  ● No real goal for scalability
● It was, in retrospect, an MVP

Page 8: How we scaled Rudder to 10k, and the road to 50k

The origin of Rudder

● Scalability went up, driven by
  ● Users and usages
    – Frustration over slowdowns
    – More managed servers
  ● Features
    – Some features needed much improved performance
    – Some needed massive architectural change

Page 9: How we scaled Rudder to 10k, and the road to 50k

First bottlenecks to tackle

● Reporting in Rudder
  ● Display compliance of nodes
    – Change the data model, as everything was Rule Centric in Rudder 2.3
  ● Slow display of reports and compliance
    – Remember, we were supporting PostgreSQL 8.x
    – Adding relevant indexes
● Agent side
  ● Agent was already used in critical systems, but impacted performance of nodes
    – Rewrite some policies
    – Add tooling around agent to prevent clogging
● Rudder 2.5 was not more scalable, but more consistent

Page 10: How we scaled Rudder to 10k, and the road to 50k

Scalability – Step by Step

[Figure: Rudder architecture diagram, annotated with the callout below]

Bandwidth & Network
- Flag files to detect new policies
- Relay servers
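
One way to picture the flag-file idea: the server writes a small marker whenever a new policy generation is available, and the agent only downloads the full policy set when that marker differs from the one it last saw. The sketch below is a minimal illustration under that assumption; the file name and the functions `publish` and `needs_update` are hypothetical, not Rudder's actual mechanism.

```python
import tempfile
from pathlib import Path
from typing import Optional


def publish(flag_file: Path, generation_id: str) -> None:
    """Server side: mark that a new policy generation is available."""
    flag_file.write_text(generation_id)


def needs_update(flag_file: Path, last_seen: Optional[str]) -> bool:
    """Agent side: fetch policies only if the flag content changed."""
    if not flag_file.exists():
        return False
    return flag_file.read_text() != last_seen


with tempfile.TemporaryDirectory() as tmp:
    flag = Path(tmp) / "policies_updated.flag"  # hypothetical file name
    checks = [needs_update(flag, None)]          # no flag yet: nothing to fetch
    publish(flag, "gen-1")
    checks.append(needs_update(flag, None))      # new generation: fetch
    checks.append(needs_update(flag, "gen-1"))   # already up to date: skip
```

Reading one tiny file per run is far cheaper on the network than comparing every policy file, which is the bandwidth saving the slide points at.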

Page 11: How we scaled Rudder to 10k, and the road to 50k

Scalability – Step by Step

[Figure: Rudder architecture diagram, annotated with the callout below]

Scale the uses
- Validation workflow
- Synchronisation of Rudder servers
- API
- More Techniques

Page 12: How we scaled Rudder to 10k, and the road to 50k

Scalability – Step by Step

[Figure: Rudder architecture diagram, annotated with the callout below]

Improve performance
- Save only changes of Inventories (several orders of magnitude faster)
- Change data model for Compliance (30% faster compliance)
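
"Save only changes of Inventories" can be sketched as a diff between the stored inventory and the one just received, so that an unchanged inventory costs no writes at all. The code below is a minimal illustration using an in-memory dict as the store; the function names and inventory fields are assumptions, not Rudder's implementation.

```python
_MISSING = object()  # sentinel so that a stored None is not confused with "absent"


def inventory_changes(stored: dict, received: dict):
    """Return (changed_or_new_entries, removed_keys) between two inventories."""
    changed = {k: v for k, v in received.items() if stored.get(k, _MISSING) != v}
    removed = sorted(k for k in stored if k not in received)
    return changed, removed


def save_inventory(db: dict, node_id: str, received: dict) -> int:
    """Write only what changed; return the number of writes performed."""
    stored = db.setdefault(node_id, {})
    changed, removed = inventory_changes(stored, received)
    stored.update(changed)
    for key in removed:
        del stored[key]
    return len(changed) + len(removed)


db: dict = {}
first = save_inventory(db, "node1", {"os": "debian", "ram_mb": 2048, "cpu": 2})
second = save_inventory(db, "node1", {"os": "debian", "ram_mb": 4096, "cpu": 2})
third = save_inventory(db, "node1", {"os": "debian", "ram_mb": 4096, "cpu": 2})
```

Here the first inventory costs three writes, the RAM upgrade costs one, and the identical resend costs none. Since most inventories rarely change between runs, this is where a large speed-up can come from.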

Page 13: How we scaled Rudder to 10k, and the road to 50k

Scalability – 2.9 & 2.10

● Improving performance is one of the focuses
  ● Refactoring and code improvements to improve policy generation time
    – Use of hashes and caches
  ● Fighting with the ORM to have lighter queries
    – Far fewer commits
● Make impact on network and node adjustable
  ● Configure agent run frequency: it can be set based on the performance of nodes and available bandwidth

Page 14: How we scaled Rudder to 10k, and the road to 50k

Scalability – 2.9 & 2.10

● First industrialized performance tests – with Tsung
  ● Generated inventories automatically, and sent them to the endpoint
  ● Tests with thousands of inventories
● Thank you @cscmeu!

http://tsung.erlang-projects.org/

Page 15: How we scaled Rudder to 10k, and the road to 50k

Scalability – 2.11

● Goal: manage thousands of nodes
  ● Distributed setup
    – Make Rudder scale by adding more servers for components
  ● UI more responsive to user requests
    – Async
    – LDAP optimizations
      ● No more indexes (everything fits in RAM)
  ● Much faster policy generation
    – Changed variable lookup, more caching
    – Used a bit of parallelism when it was easy
  ● More performance tests
    – A big thank-you to users pushing the limits

Page 16: How we scaled Rudder to 10k, and the road to 50k

Scale the uses – Rudder 2.11

● Technique Editor: everyone can create techniques
  ● Uses ncf
  ● Graphical User Interface to make Techniques easier to write

Page 17: How we scaled Rudder to 10k, and the road to 50k

Rudder 3

[Figure: Rudder architecture diagram, annotated with the callouts below]

Complete change of UI
- Design and layout

Compliance is everywhere
- Everything is async
- Everything is cached

Page 18: How we scaled Rudder to 10k, and the road to 50k

Rudder 3

[Figure: Rudder architecture diagram, annotated with the callout below]

New data model: Node Centric
- Compliance is per node
- Cached
- And lazily computed
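
"Per node, cached and lazily computed" can be pictured as a cache keyed by node id that only computes a value when somebody asks for it, and drops it when new reports arrive. This is a minimal sketch under that reading; `ComplianceCache` and the report format are hypothetical.

```python
class ComplianceCache:
    """Per-node compliance, computed on first access and then cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cache = {}
        self.computations = 0  # counter kept only to show how lazy the cache is

    def get(self, node_id):
        if node_id not in self._cache:
            self.computations += 1
            self._cache[node_id] = self._compute(node_id)
        return self._cache[node_id]

    def invalidate(self, node_id):
        """Call when new reports arrive for a node."""
        self._cache.pop(node_id, None)


reports = {"node1": ["ok", "ok", "error"], "node2": ["ok"]}


def compute(node_id):
    statuses = reports[node_id]
    return statuses.count("ok") / len(statuses)


cache = ComplianceCache(compute)
cache.get("node2")
cache.get("node2")                  # served from the cache, not recomputed
reports["node2"] = ["ok", "error"]  # new reports arrive for node2
cache.invalidate("node2")
value = cache.get("node2")          # recomputed on demand
```

In this run `node1` is never computed because nobody asked for it, which is the point of the lazy approach: the cost follows what users actually look at, not the total number of nodes.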

Page 19: How we scaled Rudder to 10k, and the road to 50k

Rudder 3

[Figure: Rudder architecture diagram, annotated with the callout below]

Lightweight reports
- Changes-only reporting
- Send reports only for changes

And much less disk usage
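
A minimal sketch of what changes-only reporting could look like on the agent side: results where nothing had to be changed are kept silent, and only repairs and errors are sent. The status names and the tuple format below are illustrative assumptions, not the agent's actual report format.

```python
# Statuses that mean "nothing changed on the node": not worth a report.
SILENT_STATUSES = {"success"}


def lightweight_reports(run_results):
    """Keep only the results that describe a change or a problem."""
    return [(component, status) for component, status in run_results
            if status not in SILENT_STATUSES]


run = [
    ("ntp", "success"),
    ("sshd", "repaired"),
    ("users", "success"),
    ("motd", "error"),
]
sent = lightweight_reports(run)
```

With this filter, a node in a steady state sends almost nothing. The trade-off is that the server has to interpret silence as success, so it presumably needs some other signal that the agent is still running.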

Page 20: How we scaled Rudder to 10k, and the road to 50k

Rudder 3

● For this release, devs had between 1000 and 2000 nodes on their dev systems
● A lot of timing info embedded in Rudder
● This made it possible to identify low-hanging fruit
● As a result, everything was much faster
  ● 500ms compute time with 2000 nodes was considered slow, and reported as a bug

Page 21: How we scaled Rudder to 10k, and the road to 50k

Rudder 3.1 – 5000 nodes

● Rudder 3.1 – reaching the 5000 nodes limit (well – 7500 at the end of its life)
● This is the land of micro-optimization, pushing the limits of the model
  – Lazy variables to prevent computation of unwanted values
● Micro tuning of techniques to make policy generation faster
  – But we are still talking about 45 minutes for 5000 nodes with policy validation
● Massive performance upgrade of the agent
  – Changed the complexity of managing big policies
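
The "lazy variables" point can be illustrated with a value that is only computed the first time it is read, and never computed at all if nothing reads it. The sketch below uses Python's `functools.cached_property` for this; the `NodeContext` class and its fields are hypothetical.

```python
from functools import cached_property


class NodeContext:
    """Holds per-node values; expensive ones are computed only if accessed."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.expansions = 0  # counter kept only to show when work happens

    @cached_property
    def expanded_variables(self) -> dict:
        # Stand-in for an expensive variable expansion.
        self.expansions += 1
        return {"hostname": f"{self.node_id}.example.org"}


ctx = NodeContext("node1")
before = ctx.expansions                     # nothing computed yet
host = ctx.expanded_variables["hostname"]   # first access triggers the computation
_ = ctx.expanded_variables                  # second access reuses the cached value
after = ctx.expansions
```

Across thousands of nodes, skipping values that no policy ends up using is the kind of micro-optimization the slide describes.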

Page 22: How we scaled Rudder to 10k, and the road to 50k

Rudder 3.1 – 5000 nodes

● Tooling to generate compliance reports from nodes
  ● Load servers, detect issues in compliance computing
● Extensive use of PgBadger to analyze PostgreSQL logs
  – From both test benches and production systems
  – Finding the slow queries and the limits
● Thank you @matya_j !!

https://github.com/dalibo/pgbadger

Page 23: How we scaled Rudder to 10k, and the road to 50k

Rudder 4: going beyond

Page 24: How we scaled Rudder to 10k, and the road to 50k

Rudder 4.0: massive changes

● Policies
  ● Each policy is identified by an id
  ● Change database model
    – Use Doobie, an excellent JDBC layer (not an ORM) that lets you write proper SQL
    – Configuration is stored in JSON rather than spread over JOINed tables
  ● No « leaking » of policy changes from one node to another
    – Regenerate only for the nodes that have been changed
  ● Policy generation is much faster
    – About 30 times faster (without policy validation)
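
"Regenerate only for the nodes that have been changed" can be sketched by fingerprinting each node's configuration and comparing it with the fingerprint from the previous generation. This is an illustration under that assumption; the function names and configuration fields are hypothetical, and Rudder's actual change detection may work differently.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Stable fingerprint of a node's configuration."""
    payload = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def nodes_to_regenerate(previous_hashes: dict, node_configs: dict) -> list:
    """Nodes whose configuration differs from the last generation."""
    return sorted(
        node for node, config in node_configs.items()
        if previous_hashes.get(node) != config_hash(config)
    )


configs = {
    "node1": {"ntp_server": "ntp1.example.org"},
    "node2": {"ntp_server": "ntp1.example.org"},
}
hashes = {node: config_hash(config) for node, config in configs.items()}

configs["node2"]["ntp_server"] = "ntp2.example.org"  # only node2 changes
changed = nodes_to_regenerate(hashes, configs)
```

Only `node2` is regenerated here. With thousands of nodes and a change that touches a handful of them, the generation time then depends on the size of the change rather than the size of the fleet.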

Page 25: How we scaled Rudder to 10k, and the road to 50k

Rudder 4.0: massive changes

● Compliance
  ● Compliance is computed when reports are received server side, and cached
    – Display of compliance is twice as fast with 1000 nodes, and an order of magnitude faster with 5000 nodes
● Audit mode
● New LDAP backend (lmdb based)

Page 26: How we scaled Rudder to 10k, and the road to 50k

Rudder 4.1: the road to 10k

● UI is much faster
  ● All resources are cached
  ● Compress everything (big impact on bad networks with large installs and distant servers)
● Policy generation is pretty fast (if we don’t validate them)
  ● About 3 minutes for 7000 nodes
● External data sources
  ● We can trigger changes from a remote tool
● Hooks on events
  ● Allow fine-tuning the behaviour of node acceptance/deletion/policy generation
● Thank you @FlorianHeigl1!
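
The "hooks on events" idea from this slide can be pictured as a registry that maps an event name to the callbacks to run when it fires. The sketch below is an in-process illustration only; the `Hooks` class and the event names are hypothetical and do not describe how Rudder actually discovers or runs its hooks.

```python
from collections import defaultdict


class Hooks:
    """Registry mapping an event name to the callbacks to run for it."""

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, event: str, callback) -> None:
        self._hooks[event].append(callback)

    def fire(self, event: str, **context) -> list:
        """Run every callback registered for the event, in registration order."""
        return [callback(**context) for callback in self._hooks[event]]


hooks = Hooks()
audit_log = []

# Hypothetical event name; the callback records which node was accepted.
hooks.register("node-accepted", lambda node_id: audit_log.append(f"accepted {node_id}"))

hooks.fire("node-accepted", node_id="node42")
results = hooks.fire("node-deleted", node_id="node7")  # no hook registered: nothing runs
```

Keeping site-specific behaviour in hooks means large installations can adapt node acceptance or deletion without changing the core policy generation path.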

Page 27: How we scaled Rudder to 10k, and the road to 50k

Rudder 4.3: 10k

● Policy engine has been rewritten
  ● Pluggable, less mutable, a bit faster
● We can manage 10k nodes on one Rudder server
  ● Recommended configuration is 11GB for the Web Interface for 10k nodes
  ● Adding more RAM/CPU/IO is enough to go to 15k nodes
● Still not perfect
  ● Policy generation is long with 10k nodes and policy validation activated
  ● UI will be sluggish – because of DOM computations
    – Might be ok with Firefox 59
  ● API will be ok

Page 28: How we scaled Rudder to 10k, and the road to 50k

What’s next?

● Improve tooling suite
● Working with Florian Heigl to automate a super large test platform
  – Automatically create nodes, rules, reports
  – At high rate
  – Checks application response rate and loads
● Find new bottlenecks using sysdig

Page 29: How we scaled Rudder to 10k, and the road to 50k

What’s next?

● Improve tooling suite
● Improve usability and documentation of load tools
  – So that more users/contributors can use them
● Automated tests of UI, measuring the response time at each commit

Page 30: How we scaled Rudder to 10k, and the road to 50k

The road to 50k nodes

● Several types of bottleneck
● Policy validation
  – We can’t realistically validate 50 000 policies on the server
  – Policy validation on client side via two-step policy updates
● GUI
  – Paginate results on the server side
    ● Ease client side burden
    ● Improve response rate (especially over slow networks)
  – Switch from Angular to Elm
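
Server-side pagination, as mentioned in the GUI item above, means the browser receives one page of rows plus the totals instead of the whole node list. A minimal sketch follows; the function name and response fields are assumptions, not Rudder's API.

```python
def paginate(rows: list, page: int, page_size: int) -> dict:
    """Return one page of rows plus the totals the UI needs to draw a pager."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    total_pages = (len(rows) + page_size - 1) // page_size  # ceiling division
    return {
        "page": page,
        "total": len(rows),
        "total_pages": total_pages,
        "items": rows[start:start + page_size],
    }


nodes = [f"node{i:05d}" for i in range(50_000)]
result = paginate(nodes, page=3, page_size=100)
```

With 50 000 nodes and pages of 100, the browser handles 100 rows at a time instead of building DOM elements for the full list, which addresses both the client-side burden and slow networks.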

Page 31: How we scaled Rudder to 10k, and the road to 50k

The road to 50k nodes

● Several types of bottleneck
● Network
  – Current protocol is not fit to update hundreds of thousands of files
  – Reports are sent back from nodes to Rudder server via syslog
    ● Missing compression
    ● Rsyslog-psql does one insert/commit in database per received log :(
● Policy generation
  – Upgrade or replace StringTemplate to lessen IO
  – More static files
● Database
  – Use PostgreSQL 10 partitioning to speed up compliance and archiving
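
To see why one insert/commit per received log hurts, the sketch below compares the number of commits for row-by-row inserts against batched inserts. It uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL, and the table layout and batch size are illustrative assumptions.

```python
import sqlite3

INSERT = "INSERT INTO reports (node_id, message) VALUES (?, ?)"


def insert_one_by_one(conn, reports) -> int:
    """One INSERT and one COMMIT per report, as in the per-log behaviour."""
    commits = 0
    for report in reports:
        conn.execute(INSERT, report)
        conn.commit()
        commits += 1
    return commits


def insert_batched(conn, reports, batch_size: int = 500) -> int:
    """Group reports and commit once per batch."""
    commits = 0
    for start in range(0, len(reports), batch_size):
        conn.executemany(INSERT, reports[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (node_id TEXT, message TEXT)")
reports = [(f"node{i}", "result_success") for i in range(2000)]

slow_commits = insert_one_by_one(conn, reports)
fast_commits = insert_batched(conn, reports)
row_count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
```

Both paths store the same rows, but the batched path needs 4 commits where the row-by-row path needs 2000. On a real database each commit typically implies a durable write, so the difference grows with the report rate.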

Page 32: How we scaled Rudder to 10k, and the road to 50k

The road to 50k nodes

● Missing features
● We can’t expect every user of a given installation to need to manage the whole 50k nodes
  – Fine grained authorization (OrBAC)
  – Multi-tenancy
  – Federation/Synchronisation of different Rudder servers
    ● A lot of thinking needs to be put into this
● Improve collaboration
  – Notifications everywhere!
  – Warn if another user is modifying the current object
● Change management
  – Canary testing
  – Ramp-up deployment
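
Canary testing and ramp-up deployment can be sketched as splitting the nodes into successive waves of growing size, where the first small wave acts as the canary. The cumulative percentages and the function name below are illustrative assumptions, not a planned Rudder feature design.

```python
def ramp_up_waves(nodes: list, percentages=(1, 10, 50, 100)) -> list:
    """Split nodes into successive waves; percentages are cumulative."""
    waves = []
    done = 0
    for percent in percentages:
        target = max(done, len(nodes) * percent // 100)
        if target > done:
            waves.append(nodes[done:target])
            done = target
    return waves


nodes = [f"node{i}" for i in range(1000)]
waves = ramp_up_waves(nodes)
sizes = [len(wave) for wave in waves]  # the first wave is the canary group
```

For 1000 nodes this yields waves of 10, 90, 400 and 500 nodes. A real rollout would check compliance after each wave and stop before the next one if the change misbehaves.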

Page 33: How we scaled Rudder to 10k, and the road to 50k

Final words

● We are very lucky to have great users pushing the limits

● A special thank-you to all of you

Dennis, Olivier, Florian, Christophe, Janos, Pierre, Stéphane, Marc, Alexander, David, Fabrice, Daniel, Dmitry, Ferenc, François, Vincent, Jean, Lionel, Maxime, Michael, Enrico, Ilan, Jean Marie, Jeremy, …

(and I’m terribly sorry for all those that I did not mention)

● Tools, software and resources evolved during Rudder’s life

● They helped improve the scalability as well

Page 34: How we scaled Rudder to 10k, and the road to 50k

How we scaled Rudder to 10k nodes

Questions?

Nicolas CHARLES Co-founder and COO

@nico_charles