Post on 16-Mar-2018
How we scaled Rudder to 10k nodes
And the road to 50k nodes
Nicolas CHARLES Co-founder and COO
@nico_charles
Scalability?
Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth
https://en.wikipedia.org/wiki/Scalability
Scalability – why is it an issue in Rudder?
What does Rudder do?
● Users define policies
● Apply them on groups of nodes
● Rudder computes the policies for each node
● Agents apply them, and send back information
● Rudder computes the compliance
Scalability – why is it an issue in Rudder?
Each of these points needs to go fast
● Process node inventories quickly
● Have a fast UI
● Generate policies in a reasonable time
● Have fast agents, and don’t overflow the network
● Compliance of actual state available
Rudder Architecture
Rudder Architecture
[Figure: Rudder architecture diagram. Rudder Server Root: Interfaces (CLI, Web UI, API); Uses; Applications (Compliance, Configuration, Inventory, Plugins); Rudder Engine; Techniques. Nodes: Rudder Agent, Rudder relay, Rudder Agent.]
The origin of Rudder
● At first, Rudder was designed for one to a few hundred nodes
● No real goal for scalability
● In retrospect, it was an MVP
The origin of Rudder
● Scalability went up, driven by:
● Users and usages
– Frustration over slowdowns
– More managed servers
● Features
– Some features needed much improved performance
– Some needed massive architectural change
First bottlenecks to tackle
● Reporting in Rudder
● Display compliance of nodes
– Change the data model, as everything was Rule Centric in Rudder 2.3
● Slow display of reports and compliance
– Remember, we were supporting PostgreSQL 8.x at the time
– Adding relevant indexes
● Agent side
● Agent was already used in critical systems, but impacted the performance of nodes
– Rewrite some policies
– Add tooling around agent to prevent clogging
● Rudder 2.5 was not more scalable, but more consistent
Scalability – Step by Step
[Figure: Rudder architecture diagram]
Bandwidth & Network
– Flag files to detect new policies
– Relay servers
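The flag-file idea above can be sketched as follows. This is only an illustration, not Rudder's actual protocol: the file names, the flag content and both function names are hypothetical. The point it shows is that an agent can compare one tiny file before deciding whether a full policy download is worth the bandwidth.

```python
from pathlib import Path
import tempfile


def needs_policy_update(server_flag: Path, local_flag: Path) -> bool:
    """Return True when the agent should download the full policy set."""
    if not local_flag.exists():
        return True  # first run: nothing has been downloaded yet
    # The flag is tiny, so comparing it is far cheaper than comparing policies.
    return server_flag.read_text() != local_flag.read_text()


def record_update(server_flag: Path, local_flag: Path) -> None:
    """Remember which policy generation was fetched (hypothetical helper)."""
    local_flag.write_text(server_flag.read_text())


# Demo: temporary files stand in for the server share and the node.
with tempfile.TemporaryDirectory() as tmp:
    server_flag = Path(tmp) / "server_policy.flag"
    local_flag = Path(tmp) / "local_policy.flag"

    server_flag.write_text("generation-1")
    first_check = needs_policy_update(server_flag, local_flag)   # nothing fetched yet
    record_update(server_flag, local_flag)
    second_check = needs_policy_update(server_flag, local_flag)  # flag unchanged
    server_flag.write_text("generation-2")
    third_check = needs_policy_update(server_flag, local_flag)   # new policies published
```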
Scalability – Step by Step
[Figure: Rudder architecture diagram]
Scale the uses
– Validation workflow
– Synchronisation of Rudder servers
– API
– More Techniques
Scalability – Step by Step
[Figure: Rudder architecture diagram]
Improve performance
– Save only changes of Inventories (several orders of magnitude faster)
– Change data model for Compliance (30% faster compliance)
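A minimal sketch of "save only changes of inventories", assuming an inventory can be represented as a flat dictionary (the field names below are made up, and Rudder's real inventory model is richer). The function returns only the fields that differ, so an unchanged inventory results in nothing to write.

```python
def inventory_changes(stored: dict, received: dict) -> dict:
    """Return {field: (old, new)} for every field that differs.

    A field missing on one side is reported with None on that side.
    """
    changes = {}
    for key in stored.keys() | received.keys():
        old, new = stored.get(key), received.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical inventory fields, for illustration only.
stored = {"hostname": "web1", "ram_mb": 4096, "os": "Debian 9"}
received = {"hostname": "web1", "ram_mb": 8192, "os": "Debian 9"}

delta = inventory_changes(stored, received)      # only ram_mb differs
nothing = inventory_changes(stored, dict(stored))  # identical inventory: empty result
```

When `delta` is empty, the write to the backend can be skipped entirely, which is where most of the gain would come from if inventories rarely change between runs.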
Scalability – 2.9 & 2.10
● Improving performance is one of the focuses
● Refactoring and code improvements to improve policy generation time
– Use of hashes and caches
● Fighting with the ORM to have lighter queries
– Far fewer commits
● Make impact on network and node adjustable
● Configure agent run frequency: it can be set based on the performance of nodes and available bandwidth
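To see why the agent run frequency matters, here is a back-of-the-envelope calculation. The node count, run interval and report size are assumptions chosen for illustration, not measurements from Rudder, and the model assumes agent runs are spread evenly over the interval.

```python
def average_runs_per_second(node_count: int, run_interval_minutes: int) -> float:
    """Average number of agent runs hitting the server each second."""
    return node_count / (run_interval_minutes * 60)


def report_traffic_kib_per_second(node_count: int, run_interval_minutes: int,
                                  report_kib_per_run: float) -> float:
    """Average report traffic, given an assumed report size per run."""
    return average_runs_per_second(node_count, run_interval_minutes) * report_kib_per_run


# Assumed values: 3000 nodes, 20 KiB of reports per run.
fast = average_runs_per_second(3000, 5)             # 5-minute interval
slow = average_runs_per_second(3000, 15)            # 15-minute interval
traffic = report_traffic_kib_per_second(3000, 5, 20)
```

Under these assumptions, moving from a 5-minute to a 15-minute interval divides the average load on the server and the network by three.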
Scalability – 2.9 & 2.10
● First industrialized performance tests, with Tsung
● Generated inventories automatically, and sent them to the endpoint
● Tests with thousands of inventories
● Thank you @cscmeu!
http://tsung.erlang-projects.org/
Scalability – 2.11
● Goal: manage thousands of nodes
● Distributed setup
– Make Rudder scale by adding more servers for components
● UI more responsive to user requests
– Async
– LDAP optimizations: no more indexes (everything fits in RAM)
● Much faster policy generation
– Changed variable lookup, more caching
– Used a bit of parallelism when it was easy
● More performance tests
– A big thank you to users pushing the limits
Scale the uses – Rudder 2.11
● Technique Editor: everyone can create techniques
● Uses ncf
● Graphical User Interface to make Techniques easier to write
Rudder 3
[Figure: Rudder architecture diagram]
Complete change of UI
– Design and layout
Compliance is everywhere
– Everything is async
– Everything is cached
Rudder 3
[Figure: Rudder architecture diagram]
New data model: Node Centric
– Compliance is per node
– Cached
– And lazily computed
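A node-centric, cached and lazily computed compliance can be sketched like this. It is a toy model under stated assumptions: the class name, the status strings ("success", "error") and the percentage formula are all hypothetical and much simpler than Rudder's real compliance computation.

```python
class ComplianceCache:
    """Per-node compliance, computed on first read and cached until new reports arrive."""

    def __init__(self):
        self._reports = {}      # node_id -> list of (component, status)
        self._cache = {}        # node_id -> compliance percentage
        self.computations = 0   # counts real computations, to show the laziness

    def add_report(self, node_id, component, status):
        self._reports.setdefault(node_id, []).append((component, status))
        self._cache.pop(node_id, None)  # invalidate this node only, not the others

    def compliance(self, node_id):
        if node_id not in self._cache:  # lazy: compute only when someone asks
            reports = self._reports.get(node_id, [])
            ok = sum(1 for _, status in reports if status == "success")
            self._cache[node_id] = 100.0 * ok / len(reports) if reports else 0.0
            self.computations += 1
        return self._cache[node_id]


cache = ComplianceCache()
for component, status in [("ntp", "success"), ("ssh", "success"),
                          ("motd", "success"), ("users", "error")]:
    cache.add_report("node-1", component, status)

first = cache.compliance("node-1")   # computed now
second = cache.compliance("node-1")  # served from the cache
```

Because the cache is keyed by node, a new report for one node does not force any recomputation for the others.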
Rudder 3
[Figure: Rudder architecture diagram]
Lightweight reports
– Changes-only reporting
– Send reports only for changes
And much less disk usage
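As a rough sketch of changes-only reporting, an agent could drop every report that says nothing changed before sending anything. The status names below ("unchanged", "repaired", "error") and the report structure are assumptions made for the example, not Rudder's report format.

```python
def reports_to_send(reports, changes_only=True):
    """Keep only the reports worth sending to the server.

    In changes-only mode, reports with the (hypothetical) status "unchanged"
    are dropped; everything else is a change or a problem and is kept.
    """
    if not changes_only:
        return list(reports)
    return [r for r in reports if r["status"] != "unchanged"]


run_reports = [
    {"component": "ntp", "status": "unchanged"},
    {"component": "ssh", "status": "repaired"},
    {"component": "motd", "status": "unchanged"},
    {"component": "users", "status": "error"},
]

sent = reports_to_send(run_reports)
sent_components = [r["component"] for r in sent]
```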
Rudder 3
● For this release, devs had between 1000 and 2000 nodes on their dev systems
● A lot of timing info embedded in Rudder
● This made it possible to identify low-hanging fruit
● As a result, everything was much faster
● 500 ms compute time with 2000 nodes was considered slow, and reported as a bug
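Embedding timing information can be as simple as a decorator that records how long each call takes and logs the slow ones. This is a generic sketch, not Rudder's instrumentation; the decorator name and the `timings` list are hypothetical, and the 0.5 second threshold simply mirrors the 500 ms figure above.

```python
import functools
import logging
import time

SLOW_THRESHOLD_SECONDS = 0.5  # mirrors the 500 ms mentioned above

timings = []  # (function name, elapsed seconds), kept for later analysis


def timed(func):
    """Record the duration of each call and warn when it is slow."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings.append((func.__name__, elapsed))
            if elapsed > SLOW_THRESHOLD_SECONDS:
                logging.warning("%s took %.3f s", func.__name__, elapsed)
    return wrapper


@timed
def compute_something(n):
    # Stand-in for a real computation such as a compliance calculation.
    return sum(range(n))


result = compute_something(1000)
```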
Rudder 3.1 – 5000 nodes
● Rudder 3.1 – reaching the 5000 nodes limit (well – 7500 at the end of its life)
● This is the land of micro-optimization, pushing the limits of the model
– Lazy variables to prevent computation of unwanted values
● Micro tuning of techniques to make policy generation faster
– But we are still talking about 45 minutes for 5000 nodes with policy validation
● Massive performance upgrade of the agent
– Changed the algorithmic complexity of managing big policies
Rudder 3.1 – 5000 nodes
● Tooling to generate compliance reports from nodes
● Load servers, detect issues in compliance computing
● Extensive use of pgBadger to analyze PostgreSQL logs
– From both test benches and production systems
– Finding the slow queries and the limits
● Thank you @matya_j!
https://github.com/dalibo/pgbadger
Rudder 4: going beyond
Rudder 4.0: massive changes
● Policies
● Each policy is identified by an id
● Change database model
– Use Doobie, an excellent JDBC layer (not an ORM) that lets you write proper SQL
– Configuration is stored in JSON rather than JOINs
● No “leaking” of policy changes from one node to another
– Regenerate only for the nodes that have been changed
● Policy generation is much faster
– About 30 times faster (without policy validation)
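One way to regenerate policies only for the nodes that changed is to hash each node's configuration and compare it with the hash from the previous generation. The sketch below assumes a node configuration fits in a JSON-serialisable dictionary; the function names and the sample configuration are hypothetical, and this is not presented as Rudder's implementation.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Stable digest of a node configuration (keys sorted so ordering does not matter)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def nodes_to_regenerate(previous_hashes: dict, current_configs: dict):
    """Return (sorted ids of changed nodes, new hash table)."""
    new_hashes = {node_id: config_hash(cfg) for node_id, cfg in current_configs.items()}
    changed = sorted(node_id for node_id, digest in new_hashes.items()
                     if previous_hashes.get(node_id) != digest)
    return changed, new_hashes


# Made-up configurations for two nodes.
configs = {
    "node-1": {"ntp": "pool.example.org"},
    "node-2": {"ntp": "pool.example.org", "motd": "hello"},
}
initial, hashes = nodes_to_regenerate({}, configs)   # first generation: every node
configs["node-2"]["motd"] = "maintenance tonight"
changed, hashes = nodes_to_regenerate(hashes, configs)  # only node-2 differs now
```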
Rudder 4.0: massive changes
● Compliance
● Compliance is computed server side when reports are received, then cached
– Compliance display is twice as fast with 1000 nodes, and an order of magnitude faster with 5000 nodes
● Audit mode
● New LDAP backend (LMDB based)
Rudder 4.1: the road to 10k
● UI is much faster
● All resources are cached
● Compress everything (big impact on bad networks with large installs and a distant server)
● Policy generation is pretty fast (if we don’t validate the policies)
● About 3 minutes for 7000 nodes
● External data sources
● Changes can be triggered from a remote tool
● Hooks on events
● Allow fine-tuning the behaviour of node acceptance/deletion/policy generation
● Thank you @FlorianHeigl1!
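The "compress everything" point is easy to illustrate: generated policies and reports tend to be repetitive text, which compresses well. The sample text below is invented, and zlib is used only as a convenient stand-in for whatever compression the transport actually applies.

```python
import zlib

# Invented, repetitive policy-like text (real policies differ).
policy_text = "\n".join(
    f"rule_{i}: ensure package ntp is installed" for i in range(200)
).encode("utf-8")

compressed = zlib.compress(policy_text, 6)   # 6 is zlib's usual default level
restored = zlib.decompress(compressed)

original_size = len(policy_text)
compressed_size = len(compressed)
```

On a slow link between a distant server and its nodes, sending `compressed_size` bytes instead of `original_size` is the whole benefit; the CPU cost of compressing is usually small in comparison.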
Rudder 4.3: 10k
● Policy engine has been rewritten
● Pluggable, less mutable, a bit faster
● We can manage 10k nodes on one Rudder server
● Recommended configuration is 11GB for the Web Interface for 10k nodes
● Adding more RAM/CPU/IO is enough to go to 15k nodes
● Still not perfect
● Policy generation is long with 10k and policy validation activated
● UI will be sluggish – because of DOM computations
– Might be ok with Firefox 59
● API will be ok
What’s next?
● Improve tooling suite
● Working with Florian Heigl to automate a super large test platform
– Automatically create nodes, rules, reports
– At high rate
– Check application response rate and load
● Find new bottlenecks using sysdig
What’s next?
● Improve tooling suite
● Improve usability and documentation of load tools
– So that more users/contributors can use them
● Automated UI tests, measuring the response time at each commit
The road to 50k nodes
● Several types of bottleneck
● Policy validation
– We can’t realistically validate 50,000 policies on the server
– Policy validation on the client side, via two-step policy updates
● GUI
– Paginate results on the server side
● Ease client side burden
● Improve response rate (especially over slow networks)
– Switch from Angular to Elm
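Server-side pagination, mentioned above for the GUI, means the server slices the result set and the browser only receives one page. A minimal sketch, with a hypothetical function name and response shape:

```python
def paginate(items, page, page_size):
    """Return one page of results plus the total count (pages start at 1)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    start = (page - 1) * page_size
    return {
        "total": len(items),
        "page": page,
        "items": items[start:start + page_size],
    }


# 10,000 made-up node names; the client asks for page 3 with 50 rows per page.
nodes = [f"node-{i}" for i in range(1, 10001)]
response = paginate(nodes, page=3, page_size=50)
```

The browser then builds 50 rows instead of 10,000, which is the kind of relief the DOM-related sluggishness calls for.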
The road to 50k nodes
● Several types of bottleneck
● Network
– Current protocol is not fit to update hundreds of thousands of files
– Reports are sent back from nodes to the Rudder server via syslog: missing compression, and rsyslog-pgsql does one insert/commit in the database per received log :(
● Policy generation
– Upgrade or replace StringTemplate to lessen IO
– More static files
● Database
– Use PostgreSQL 10 partitioning to speed up compliance and archiving
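The "one insert/commit per received log" problem is usually addressed by batching: many rows per statement batch, one commit per batch. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL; the table, columns and batch size are assumptions for the example.

```python
import sqlite3

INSERT_SQL = "INSERT INTO reports (node_id, message) VALUES (?, ?)"


def insert_batched(conn, reports, batch_size=500):
    """Insert reports in batches, committing once per batch instead of once per row."""
    commits = 0
    for start in range(0, len(reports), batch_size):
        conn.executemany(INSERT_SQL, reports[start:start + batch_size])
        conn.commit()
        commits += 1
    return commits


# SQLite in memory stands in for the real report database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (node_id TEXT, message TEXT)")

reports = [(f"node-{i % 100}", f"report {i}") for i in range(1200)]
commits = insert_batched(conn, reports)
row_count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
```

Here 1200 reports cost 3 commits instead of 1200; on a real database each commit typically implies a disk sync, so the difference grows with the number of nodes.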
The road to 50k nodes
● Missing features
● We can’t expect every user of a given installation to need to manage the whole 50k nodes
– Fine grained authorization (OrBAC)
– Multi-tenancy
– Federation/Synchronisation of different Rudder servers (a lot of thought needs to go into this)
● Improve collaboration
– Notifications everywhere!
– Warn if another user is modifying the current object
● Change management
– Canary testing
– Ramp-up deployment
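Canary testing and ramp-up deployment both come down to applying a change to a growing share of nodes. Here is one possible way to cut a node list into such waves; the function name and the default fractions (1%, 10%, 50%, 100%) are arbitrary assumptions, since the slides only name the feature.

```python
def ramp_up_batches(node_ids, fractions=(0.01, 0.10, 0.50, 1.0)):
    """Split nodes into successive waves; each fraction is the cumulative share deployed."""
    batches = []
    deployed = 0
    total = len(node_ids)
    for fraction in fractions:
        target = min(total, max(deployed, round(total * fraction)))
        if target > deployed:  # skip waves that would be empty
            batches.append(node_ids[deployed:target])
            deployed = target
    return batches


nodes = [f"node-{i}" for i in range(1000)]
waves = ramp_up_batches(nodes)
wave_sizes = [len(wave) for wave in waves]  # canary wave first, then larger ones
```

The first wave plays the role of the canary: if compliance degrades there, the later waves are simply not started.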
Final words
● We are very lucky to have great users pushing the limits
● A special thank you to all of you
Dennis, Olivier, Florian, Christophe, Janos, Pierre, Stéphane, Marc, Alexander, David, Fabrice, Daniel, Dmitry, Ferenc, François, Vincent, Jean, Lionel, Maxime, Michael, Enrico, Ilan, Jean Marie, Jeremy, …
(and I’m terribly sorry to all those I did not mention)
● Tools, software and resources evolved during Rudder’s life
● They helped improve the scalability as well
Questions?