SaltConf14 - Craig Sebenik, LinkedIn - SaltStack at Web Scale
©2014 LinkedIn Corporation. All Rights Reserved.
Salt at Web Scale
Craig Sebenik SRE
29 January 2014 SaltConf
Who Am I?
•Programming for 30-ish years
•Scientific computing
• Java and Perl Developer (web apps)
•HATE doing the same thing more than once
•Been at LinkedIn over 3 years
•From the very beginning of us using salt
•Manage/architect the entire salt infrastructure at LinkedIn
What is LinkedIn?
•Social media company connecting the world’s professionals
• 5000+ employees
•Offices throughout the world
• Based in Mountain View, CA
How Big Is LinkedIn.com?
•Several data centers
•Customer facing apps (aka “production”)
•Staging for production apps
• Internal only apps
• Several Hundred Apps
• 30+K Hosts
•90+% Linux
•Solaris
•Mac and Linux Desktops
LinkedIn Operations
•Several operations groups
•Systems (e.g. OS install/config, “rack and stack”)
•Database Admins
•Network
•Application (i.e. SRE)
•Different groups have different needs for automation
What Is An SRE?
•Help application developers deploy their apps
•Advise on rollout plans
•Coordinate rollouts
•Generally, the group in-between all of operations and all of the developers
•Lots of troubleshooting
• SREs write code (automation)
SREs Use Salt
•Using salt since 0.8.9
• Installation of new apps
•Config management
•Some troubleshooting
Salt Architecture
•Each physical data center
•multiple “fabrics” (logical grouping of hosts)
• single salt master (largest set of minions: 8,000+)
•warm backup (same private key)
•minions configured with CNAME to master
• Files stored in subversion
•states, grains, modules
• runners
• reactor
Building Salt
• Internal fork from github
•Add another number to the upstream version, e.g. 2014.01.0.0
•Allows for internal only patches
•Create specific package for testing
•same git repo, with same tags
•LNKD-salt-dev-2014.01.0.0-12345.noarch.rpm
•Allows for emergency changes elsewhere
• salt-dev is deployed on a set of virtual machines
•custom test suite is run
Installing Salt
•OS is managed by cfengine
• cfengine will push new salt releases and restart minions
•cfengine also manages minion configs
•master is a set of RPMs
• includes config
• Solaris install is handled by systems team
•Roll out to one data center at a time
•Entire process can take over a week
Salt Master
• salt master is wrapped in a “runit” script
• runit is a process supervisor
• restarts the master if it dies/stops
• salt API
• use the reactor system to send metrics
•metrics gathering is all home grown
• trying to open source it
• file updates (every 5 mins)
•modules, states, grains
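The metrics pipeline (reactor events in, home-grown collector out) isn't shown in the talk. A minimal sketch of the core translation step, assuming a graphite-style "path value timestamp" line protocol on the receiving end (the function name and output format are assumptions):

```python
import time


def event_to_metric(tag, value=1, ts=None):
    """Turn a Salt event-bus tag (e.g. 'salt/auth') into a graphite-style
    'path value timestamp' line for a metrics collector to ingest."""
    if ts is None:
        ts = time.time()
    return '%s %s %d' % (tag.replace('/', '.'), value, int(ts))
```

A reactor SLS matched on tags like `salt/auth` could hand the event tag to a small script built around this to count authentications, job returns, and so on.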
Master Access
• Logins to the host are managed via cfengine
•Have to be in a whitelisted group to log on
• Access to salt command controlled via sudo
•sudo logs provide audit trail
• Disable cmd.* from salt cli
• If you want to automate, write a state and/or module
• salt API access via a whitelist of IPs
•Auth using LDAP
•Only a handful of commands
Minions
• basic salt RPM
• includes “salt” command (unfortunately)
•module sync
•every hour
• small python script using client API
•minion metrics
• “age” of modules (via a tracker file)
•uptime of minion
Deployment With Salt
• LinkedIn.com apps are deployed via a custom app
•App is showing its age and needs to be replaced
• Team outside of operations is writing new deployment app
•Uses salt api
•Has a lot of custom code
•Not in salt
• Needs to deploy locally (for testing)
•This includes Mac desktop/laptops
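The salt-api side of a deployment app mostly amounts to POSTing "lowstate" chunks over HTTP. A hedged sketch of building one (the target and state names are made up; the lowstate keys — client, tgt, fun, arg — are the standard salt-api request format):

```python
def lowstate(tgt, fun, *args):
    """Build a salt-api 'lowstate' request body: a list of chunks,
    each asking the master to run `fun` on minions matching `tgt`."""
    return [{'client': 'local', 'tgt': tgt, 'fun': fun, 'arg': list(args)}]
```

For example, `lowstate('web*', 'state.sls', 'myapp')` asks salt-api to apply the hypothetical `myapp` state on all web minions, subject to the IP whitelist and LDAP auth described under Master Access.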
Custom Modules and States
• couchbase management (via runner)
• runit
• Apache Traffic Server
•metrics system
•alerts
•data collection
•data display
Module Promotion
•Small oversight last year caused massive issues
•Developed process to “promote” modules
• Salt environments:
•dev -> vm -> test -> stage -> prod
•different dirs in svn
•sparse directories
•minions are configured to look at certain environments
•Changes are managed with “review board”
Problems
•Education!
•Most salt customizations in 2 groups (out of 10)
•Few power users
•Corrupted keys
• Syncing only every hour
•No syncing on Solaris
•No highstate enforcement
More Problems
• Lots of CPU issues on master
• Key management
•Reinstall of OS with same host name
Future
•Multi master
•shared job cache via file system isn’t what we want
• investigating using a returner to share job info
•More training
•Whitelist of states
•Non-ops users
•E.g. devs that want to deploy just their code
• Increase amount of data in grains
More Future
•Pillar data
•Metrics
• Better visibility when things go wrong
•Tools to see job cache
•Logs on master are too chatty
•Ability to watch all traffic from specific minion(s)
• Key management
• reactor system, possibly