Pythian operational visibility
![Page 1: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/1.jpg)
Pythian Operational Visibility
Percona Live Santa Clara, 2015
April 13, 2015
Derek Downey
MySQL Principal Consultant
Laine Campbell
Co-Founder, Open Source Database Practice
![Page 2: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/2.jpg)
ABOUT PYTHIAN
• 200+ leading brands trust us to keep their systems fast,
reliable, and secure
• Elite DBA & SysAdmin workforce: 7 Oracle ACEs, 2 Oracle
ACE Directors, 5 Microsoft MVPs, 1 Cloudera Champion of
Big Data, 1 Datastax Platinum Administrator — More than any
other company, regardless of head count
• Oracle, Microsoft, MySQL, Hadoop, Cassandra, MongoDB,
and more.
• Infrastructure, Cloud, SRE, DevOps, and application expertise
• Big data practice includes architects, R&D, data scientists,
and operations capabilities
• Zero lock-in, utility billing model, easily blended into existing
teams.
• Pythian currently manages more than 10,000 systems.
• Pythian currently employs more than 350 people in 25 countries worldwide.
• Pythian was founded in 1997.
![Page 3: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/3.jpg)
content
© 2014 Pythian
● discussion
○ laine campbell
○ one hour
● set-up host environments to monitor
○ derek downey
○ 30 minutes
● review observability stack
○ laine campbell
○ 30 minutes
● attach to observability stack, hands-on, Q&A
○ derek downey and laine campbell
○ one hour
![Page 4: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/4.jpg)
Our Goals: to understand
● observability objectives, principles and outcomes
● current state and problems
● metrics
● observability architecture
● choosing what to measure
○ business KPIs and ways to track them
○ pre-emptive and diagnostics measurements
● the Pythian opsviz stack
○ what it is
○ how to set it up
○ how to visualize data
![Page 5: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/5.jpg)
operational visibility
continuous improvement
kaizen recognizes improvement can be small or large.
many small improvements can make a big change.
![Page 6: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/6.jpg)
to improve a system, you must...
● understand it
● describe it
● involve and motivate all stakeholders
![Page 7: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/7.jpg)
enabling kaizen
plan do
act study
![Page 8: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/8.jpg)
we can do none of this...
without visibility
![Page 9: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/9.jpg)
the objectives of observability
● business velocity
● business availability
● business efficiency
● business scalability
![Page 10: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/10.jpg)
the principles of observability
● store business and operations data together
● store at low resolution for core KPIs
● support self-service visualization
● keep your architecture simple and scalable
● democratize data for all
● collect and store once
![Page 11: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/11.jpg)
the outcomes of observability
● high trust and transparency
● continuous deployment
● engineering velocity
● happy staff (in all groups)
● cost-efficiency
![Page 12: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/12.jpg)
state of the union
starting to address the traditional problems with
opsviz
![Page 13: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/13.jpg)
traditional monitoring
![Page 14: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/14.jpg)
the problems
● too many dashboards
● data collected multiple
times
● resolution much too
high
● does not support
ephemeral
● hard to automate
● logs not centralized
![Page 15: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/15.jpg)
better...
● telemetry collected
once
● logs centralized
● logs alerted on and
graphed
● 1 second resolution
possible
● supports ephemeral
● plays well with CM
● database table data
into dashboards
![Page 16: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/16.jpg)
what must we improve?
● architectural component complexity and fragility
● functional automation and ephemeral support
● storage and summarization
● naive alerting and anomaly detection
● not understanding and using good math
● insufficient visualization and analysis
![Page 17: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/17.jpg)
what’s in a metric?
resolution
latency
diversity
telemetry
● counters
● gauges
events
traditional
synthetic
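The difference between counters and gauges matters once the data is graphed: a gauge can be stored as sampled, while a counter is usually turned into a rate first. A minimal Python sketch, assuming two samples of a monotonically increasing counter (the function name and the reset handling are illustrative assumptions, not something from the slides):

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate between two samples of an increasing counter."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards: assume a restart reset it to zero,
        # so only the new value is known growth.
        delta = curr_value
    return delta / elapsed


# A query counter sampled 10 seconds apart.
print(counter_rate(1000, 0, 1500, 10))  # 50.0
```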
![Page 18: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/18.jpg)
architectural components
sensing
collecting
analysis
storage
visualization
alerting
![Page 19: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/19.jpg)
architectural components
● Telemetry
● Events and Logs
● Applications
● Databases and SQL
● Servers and Resources
sensing
![Page 20: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/20.jpg)
architectural components
● Agent or Agentless
● Push and Pull
● Filtering and Tokenizing
● Scaling
● Performance Impact
sensing
collecting
![Page 21: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/21.jpg)
architectural components
● In-Stream
● Feeding into Automation
● Anomaly Detection
● Aggregation and
Calculations
sensing
collecting
analysis
![Page 22: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/22.jpg)
architectural components
● Telemetry
● Events
● Resolution and
Aggregation
● Backends
sensing
collecting
analysis
storage
![Page 23: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/23.jpg)
architectural components
● rules-based processing
● notification routing
● event aggregation and
management
● under, not over paging
● actionable alerts
sensing
collecting
analysis
storage
alerting
![Page 24: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/24.jpg)
architectural components
● executive dashboards
● operational dashboards
● anomaly identification
● capacity planning
sensing
collecting
analysis
storage
alerting
visualization
![Page 25: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/25.jpg)
what to measure?
we measure to support our KPIs
we measure to pre-empt incidents
we measure to diagnose problems
we alert when customers feel the pain
![Page 26: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/26.jpg)
supporting our KPIs
velocity
efficiency
security
performance
availability
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
![Page 27: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/27.jpg)
velocity
velocity
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
how fast can the org push new features?
how fast can the org pivot?
how fast can the org scale up or down?
deployment counts
DB object changes
data loads and changes
provisioning counts
cluster add/removal
member add/removal
engineering support
query review turnaround
data model review
turnaround
deployment time
DDL timings
data load timings
provisioning timing
cluster add/removal
member add/removal
deployment errors
failed DDL/DML
schema mismatch
provisioning errors
cluster add/removal
member add/removal
![Page 28: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/28.jpg)
efficiency
velocity
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
how cost-efficient is our environment?
how elastic is our environment?
cloud spend
data storage costs
data compute costs
data in/out costs
physical spend
data storage costs
data compute costs
data in/out costs
staffing spend
DBE spend
Ops spend
cloud utilization
database capacity
database utilization
physical utilization
database capacity
database utilization
staffing utilization
DBE utilization
Ops utilization
provisioning counts
cluster add/removal
member add/removal
application utilization
percent capacity used
mapped to product or
feature
staffing elasticity
DBE/ops hiring time
DBE/ops training time
efficiency
![Page 29: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/29.jpg)
security
velocity
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
how secure is our environment?
penetration tests
frequency
success
classified storage
live
in backups
audit results
frequency
results
audit trail data
utilization
access
users with access
account access
account audit
infosec incidents
event frequency
efficiency
security
![Page 30: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/30.jpg)
performance
velocity
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
What is the AppDex of our environment?
AppDex(n), where n is the latency
AppDex(2.5), score of 95 indicates:
● 70% of queries under 2.5
● 25% of queries tolerable (5)
● 5% of queries as outliers
efficiency
security
performance
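The AppDex(2.5) breakdown can be reproduced from raw query latencies. The sketch below follows the slide's reading, where the score of 95 is the share of queries that are not outliers and "tolerable" means up to twice the threshold (2.5 to 5). Both of those are assumptions drawn from the example numbers; the published Apdex formula instead counts tolerating samples at one half and sets the tolerating bound at four times the target, which would give a lower score for the same data. Function names are illustrative.

```python
def latency_breakdown(latencies, threshold):
    """Count queries that are satisfied, tolerable, or outliers.

    satisfied: at or under the threshold
    tolerable: over the threshold, up to 2x the threshold (assumed)
    outliers:  everything slower
    """
    if not latencies:
        raise ValueError("need at least one latency sample")
    satisfied = sum(1 for x in latencies if x <= threshold)
    tolerable = sum(1 for x in latencies if threshold < x <= 2 * threshold)
    outliers = len(latencies) - satisfied - tolerable
    return satisfied, tolerable, outliers


def slide_score(latencies, threshold):
    """Percent of queries that are not outliers, as the slide's example reads."""
    satisfied, tolerable, _ = latency_breakdown(latencies, threshold)
    return 100 * (satisfied + tolerable) / len(latencies)


samples = [1.0] * 70 + [4.0] * 25 + [9.0] * 5
print(latency_breakdown(samples, 2.5))  # (70, 25, 5)
print(slide_score(samples, 2.5))        # 95.0
```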
![Page 31: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/31.jpg)
![Page 32: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/32.jpg)
availability
velocity
https://derpicdn.net/img/view/2012/8/3/65841__safe_fluttershy_tank_scooter_artist-colon-giantmosquito_tortoise_vespa.jpg
how available is our environment to
customers?
how available is each component to the
application?
external response
pings
websites
APIs
system availability
server uptime
daemon uptime
accessibility to app
resource consumption
CPU
storage
memory
network
efficiency
security
performance
availability
![Page 33: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/33.jpg)
pre-empting incidents
supporting diagnostics
identify anomalies in latency or utilization
identify dangerous trends in latency or utilization
identify error rates indicating potential failure
![Page 34: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/34.jpg)
measure as much as possible
alert on as little as possible
● align alerts to customer pain
● automate remediation if possible
● use metrics and events to solve
what cannot be remediated
https://fc04.deviantart.net/fs71/i/2012/110/0/1/ninja_lyra_by_x72assassin-d4x0xqn.png
![Page 35: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/35.jpg)
https://lockerdome.com/vox.com/7101308691941396
when measuring, understand...
your artificial bounds, which exist:
● to ensure resources are not exhausted
● to ensure application concurrency stays in control
your resource constraints
![Page 36: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/36.jpg)
compromising on storage
![Page 37: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/37.jpg)
telemetry data
collect and then flush
storing more than just averages
● min/max
● standard deviation
● percentiles for outlier removal
averages lie
storing histograms
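To make "averages lie" concrete, here is a minimal Python sketch of what could be stored for each flush interval instead of a single mean. The function name, the bucket boundaries, and the choice of percentiles are assumptions for illustration only.

```python
import statistics


def summarize(samples, buckets=(1, 5, 10, 50, 100)):
    """Summarize one flush interval with more than just the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)
    # With n=100, statistics.quantiles returns the 99 percentile cut points.
    cuts = statistics.quantiles(ordered, n=100)
    # Cumulative histogram: how many samples fall at or under each bound.
    histogram = {upper: sum(1 for x in ordered if x <= upper) for upper in buckets}
    histogram["inf"] = len(ordered)
    return {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "stdev": statistics.stdev(ordered),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "histogram": histogram,
    }


# 99 fast requests and one very slow one: the mean is far above the median.
summary = summarize([1] * 99 + [1000])
print(summary["mean"], summary["p50"], summary["max"])
```

In this example the mean lands near 11 while the median stays at 1, which is the gap that min/max, percentiles, and histograms are meant to expose.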
![Page 38: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/38.jpg)
http://s3.amazonaws.com/bronibooru/941e30f3cb2129951df0d7d674fefcad.png
the application
closest to perceived customer experience
documenting application components in SQL
understanding end to end
transactions
prioritization by latency
budgets
![Page 39: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/39.jpg)
the application
application performance management tools
application logging to logstash
fire and forget to an event processing system
telemetry for occurrence counts
histograms for visualizing
![Page 40: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/40.jpg)
the server
the basics, resource utilization, process behavior and the
network
log aggregation and measuring
● syslogs
● mysql logs
● cron, authentication, mail logs
aggregation up in distributed systems
![Page 41: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/41.jpg)
https://lockerdome.com/vox.com/7101308691941396
the database: mysql
exposed database metrics
sql analytics and metrics
connection layer
how do they impact:
● availability KPIs
● concurrency KPIs
● latency KPIs
![Page 42: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/42.jpg)
database metrics: workload
generic workload distribution
impacts to latency budgets
how fast are we hitting our resource and concurrency bounds?
● selects
● prepared statements
● ddl - data definition language
● dml - data manipulation language
● administrative commands
correlate DDL and admin to availability impacts
measure shifts in workload that may be impacting latency
![Page 43: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/43.jpg)
database metrics: workload
data access behavior
● sort statistics
● join statistics
● handler status variables
○ index scans vs. full table scans
○ key index access
○ commits and rollbacks
how are we impacting our latency
budgets?
https://e621.net/data/e4/e4/e4e434d480672a67550c7c4113c85c73.jpg
![Page 44: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/44.jpg)
database metrics: workload
event metrics, inside out
statement event
sql stage
● stage/sql/creating sort event
● stage/sql/copying to tmp table
● stage/sql/checking query cache for sql
com stage
● stage/com/ping
● stage/com/quit
● stage/com/error
helps identify where time is spent in SQL statements
this helps to diagnose and improve performance
![Page 45: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/45.jpg)
database metrics: workload
where is time being spent?
wait event
event name
event source code
event operation
(read|write|lock)
stage/com/ping
stage/com/quit
stage/com/error
joined with threads, allows for trending of counts of wait events.
this helps to diagnose and improve concurrency and performance issues
![Page 46: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/46.jpg)
database metrics: sql
all sql should be logged, with context
● comments pointing to application source code
● latency, and the components therein
● resources consumed
● data access paths taken
performance schema -> event and log analysis and visualization
slow logs -> event and log analysis and visualization
network sniffing -> event and log analysis and visualization
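One way to carry context with every statement is to prefix it with a SQL comment naming the code that issued it, so the comment shows up wherever the statement is logged. A minimal Python sketch; the helper name and the comment layout are hypothetical, not a convention taken from the talk:

```python
def annotate_sql(sql, source_file, line_number):
    """Prefix a statement with a comment pointing at the calling code."""
    # Strip anything that could close the comment early.
    safe_source = source_file.replace("*/", "")
    return f"/* {safe_source}:{line_number} */ {sql}"


print(annotate_sql("SELECT id FROM orders WHERE user_id = %s", "orders/dao.py", 42))
# /* orders/dao.py:42 */ SELECT id FROM orders WHERE user_id = %s
```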
![Page 47: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/47.jpg)
![Page 48: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/48.jpg)
database metrics: sql
all SQL should be logged, with context
THIS IS HARD
solutions like vivid cortex can radically improve velocity
![Page 49: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/49.jpg)
the database: connection layer
this is key to latency and availability KPIs
you must connect to your database
for latency, you have a fixed budget
getting to your network, other transaction
components and SQL consume much of it
for availability, you must understand
your bounds
![Page 50: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/50.jpg)
tcp ports and mysql
max_connections (mysql)
● each connection takes one tcp port
time_wait (kernel)
● how long the port stays open
● 60 seconds in most linux kernels
● effectively reduces port range by a factor of 60
port range of 30,000 limits to roughly 500 new network connections per second (30,000 / 60)
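The port arithmetic is easy to check. If every closed connection holds its port for the TIME_WAIT period, a port can be reused at most once per period, so dividing a 30,000-port range by 60 seconds gives about 500 new connections per second. A small Python sketch of that calculation (the function name is an assumption):

```python
def sustained_connection_rate(port_range_size, time_wait_seconds=60):
    """New connections per second before the port range is exhausted.

    Each closed connection keeps its port in TIME_WAIT for
    time_wait_seconds, so a port is reusable once per that interval.
    """
    if time_wait_seconds <= 0:
        raise ValueError("time_wait_seconds must be positive")
    return port_range_size / time_wait_seconds


print(sustained_connection_rate(30_000))  # 500.0
```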
![Page 51: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/51.jpg)
http://fc06.deviantart.net/fs71/f/2012/119/d/3/fat_pinkie_pie_by_nice123456-d4xy2w3.png
the database: network
mysql (5.6)
OS fixed amounts
● max tcp ports
● max tcp backlog
● network bandwidth
![Page 52: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/52.jpg)
the database: network
mysql (5.6)
mysql configuration
● max_connections
● back_log
● max_connect_errors
● connect_timeout
● net_read_timeout
● net_write_timeout
● open_files_limit
![Page 53: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/53.jpg)
connections: putting it together
use your network bounds
monitor proximity
status counters - network
● packets per sec
● request times
● requests per sec
● connection state (time_wait, listen, established)
● backlog
● socket queue drops and overflows
● time to get a tcp connection
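Connection state counts can be collected on Linux by parsing `/proc/net/tcp`, where the fourth column is a hex state code (`01` established, `06` time_wait, `0A` listen). A minimal Python sketch that takes the file contents as a string; the function name and the decision to lump every other state into "other" are assumptions:

```python
from collections import Counter

# Hex state codes used in /proc/net/tcp for the states named on the slide.
TCP_STATES = {"01": "established", "06": "time_wait", "0A": "listen"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per state from the text of /proc/net/tcp."""
    counts = Counter()
    lines = proc_net_tcp_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "other")] += 1
    return counts
```

On a Linux host this could be fed with `count_tcp_states(open("/proc/net/tcp").read())` and the three counts shipped as gauges.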
![Page 54: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/54.jpg)
the database: memory
operating system
shared memory, file descriptors, semaphores
● max shared memory segment size
● max number of segments
● max total of all memory available
● max file_descriptors per system and user
● max semaphores
![Page 55: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/55.jpg)
the database: memory
mysql (5.6)
global memory bounds
● key_buffer_size
● innodb_buffer_pool_size
● innodb_additional_mem_pool_size
● innodb_log_buffer_size
● query_cache_size
![Page 56: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/56.jpg)
the database: memory mysql (5.6)
connection memory (max_connections)
● stack (thread_stack)
● connection and result buffers (net_buffer_length)
○ up to max_allowed_packet
● random read buffer (read_rnd_buffer_size)
● sequential read buffer (read_buffer_size)
● sort buffer (max of sort_buffer_size or max_heap_table_size)
● join buffer (join_buffer_size)
● (binlog_cache_size)
![Page 57: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/57.jpg)
the database: memory mysql (5.6)
max_connections = 1000
● thread_stack = 256k
● net_buffer_length x2 = 16k x 2 = 32k
○ up to max_allowed_packet x2 = 1m x2 = 2m
● read_rnd_buffer_size = 256k
● read_buffer_size = 128k
● sort_buffer_size = 256k max_heap_table_size = 16M
● join_buffer_size = 128k
● binlog_cache_size = 32k
total = 18.78m (reduce to ~3m by reducing max_heap…)
1000 connections = 2.9 GB
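The totals above can be reproduced with a few lines of Python. The sketch follows the slide's accounting: worst case, the connection and result buffers grow to max_allowed_packet twice, and the sort allocation is the larger of sort_buffer_size and max_heap_table_size. The function name is an assumption, and lowering max_heap_table_size to 256k in the second case is a guess at the reduction the slide has in mind.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def per_connection_bytes(settings):
    """Worst-case memory one connection can claim, per the slide's accounting."""
    return (
        settings["thread_stack"]
        + 2 * settings["max_allowed_packet"]  # buffers can grow to this size
        + settings["read_rnd_buffer_size"]
        + settings["read_buffer_size"]
        + max(settings["sort_buffer_size"], settings["max_heap_table_size"])
        + settings["join_buffer_size"]
        + settings["binlog_cache_size"]
    )


slide = {
    "thread_stack": 256 * KIB,
    "max_allowed_packet": 1 * MIB,
    "read_rnd_buffer_size": 256 * KIB,
    "read_buffer_size": 128 * KIB,
    "sort_buffer_size": 256 * KIB,
    "max_heap_table_size": 16 * MIB,
    "join_buffer_size": 128 * KIB,
    "binlog_cache_size": 32 * KIB,
}

full = per_connection_bytes(slide)
reduced = per_connection_bytes(dict(slide, max_heap_table_size=256 * KIB))
print(round(full / MIB, 2))            # 18.78
print(round(reduced / MIB, 2))         # 3.03
print(round(1000 * reduced / GIB, 2))  # 2.96
```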
![Page 58: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/58.jpg)
connections: putting it together
understand mysql impact to latency
understand proximity to mysql bounds
status counters: mysql
● processlist: connection and state counts
● thread statistics (thread_xxx)
● connection durations
● open_tables and open_files
● semaphores globally and per thread
● aborted_clients
● aborted_connects
● connection high water marks
● mysql network traffic
● mysql handlers - for buffer usage
● query response times
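Proximity to a bound such as max_connections is a simple ratio once the counters are collected. A minimal Python sketch, assuming the values have already been read into dictionaries (for example from `SHOW GLOBAL STATUS` and `SHOW GLOBAL VARIABLES`); the function name and output keys are illustrative:

```python
def connection_headroom(status, variables):
    """Percent of max_connections in use now and at the high water mark."""
    limit = int(variables["max_connections"])
    if limit <= 0:
        raise ValueError("max_connections must be positive")
    return {
        "current_pct": 100 * int(status["Threads_connected"]) / limit,
        "high_water_pct": 100 * int(status["Max_used_connections"]) / limit,
    }


print(connection_headroom(
    {"Threads_connected": "450", "Max_used_connections": "900"},
    {"max_connections": "1000"},
))
# {'current_pct': 45.0, 'high_water_pct': 90.0}
```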
![Page 59: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/59.jpg)
what’s next?
better time series storage
● automatically distributed and federated, thus manageable and scalable
● leverage parallelism and data aggregation
● proper data consistency, backup and recovery
● instrumented and tuneable
● can consume billions of metrics and store them
![Page 60: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/60.jpg)
what’s next?
machine learning
● using code to pull out the signal from the noise
● easier correlation of metrics
● anomaly detection
● incident prediction
● capacity prediction
● REAL MATH
![Page 61: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/61.jpg)
what’s next?
consolidation
● business metrics
● telemetry data
● event and log text data
![Page 62: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/62.jpg)
lab time
● set-up your hosts
● stretch, hydrate and use facilities
● 30 minutes
![Page 63: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/63.jpg)
setup ec2
● https://github.com/dtest/plsc15-opvis
![Page 64: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/64.jpg)
asbolus
the pythian opsviz stack
https://github.com/pythian/opsviz
http://dashboard.pythian.asbol.us/
resides in AWS, built via cloudformation and opsworks
internet facing rabbitMQ listener
● for external logstash/statsd/sensu clients
● using AMQP, SSL elastic load balancer
● in AWS VPC
![Page 65: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/65.jpg)
asbolus
the pythian opsviz stack
originally conceived and built by blackbird devops team
● taylor ludwig https://github.com/taylorludwig
● jonathan dietz https://github.com/jonathandietz
● aaron lee https://github.com/aaronmlee
continued development by pythian
● alex lovell-troy https://github.com/alexlovelltroy
● derek downey https://github.com/dtest
● laine campbell https://github.com/lainevcampbell
● dennis walker https://github.com/denniswalker
![Page 66: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/66.jpg)
asbolus
![Page 67: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/67.jpg)
telemetry data: sensu
generated from sensu agent
● sensu agent on host polls from 1 to 60 seconds
● agent pushes to rabbitMQ
● rabbitMQ sends to sensu
once in sensu
● event handlers review
● flushed to carbon/graphite
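Metrics bound for carbon/graphite are commonly written in graphite's plaintext format, one `path value timestamp` line per metric. A minimal Python sketch of a collector producing such lines; the function name, the metric prefix, and the idea that a check simply prints these lines are assumptions for illustration:

```python
import time


def graphite_lines(prefix, metrics, timestamp=None):
    """Format metrics as graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [f"{prefix}.{name} {value} {ts}" for name, value in sorted(metrics.items())]


for line in graphite_lines("db01.mysql", {"threads_connected": 42, "questions": 1500}, 1428900000):
    print(line)
# db01.mysql.questions 1500 1428900000
# db01.mysql.threads_connected 42 1428900000
```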
![Page 68: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/68.jpg)
telemetry data: logstash
generated from logstash agent
● logstash agent on host pushes to logstash server
● logstash server tokenizes and submits to statsD
● statsD flushes and sends to carbon/graphite
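What reaches statsD in this path is a short text line per event, in the form `name:value|type`, where the type is `c` for a counter, `ms` for a timer, or `g` for a gauge. A minimal Python sketch of building those lines; the function and metric names are assumptions, and in this stack logstash rather than custom code would normally emit them:

```python
STATSD_TYPES = {"counter": "c", "timer": "ms", "gauge": "g"}


def statsd_line(name, value, metric_type):
    """Build one statsd line such as 'mysql.slow_queries:1|c'."""
    if metric_type not in STATSD_TYPES:
        raise ValueError(f"unknown metric type: {metric_type}")
    return f"{name}:{value}|{STATSD_TYPES[metric_type]}"


print(statsd_line("mysql.slow_queries", 1, "counter"))    # mysql.slow_queries:1|c
print(statsd_line("mysql.query_time", 320, "timer"))      # mysql.query_time:320|ms
```

Each line is typically sent as a UDP datagram, which fits the fire-and-forget style: the sender never waits on the metrics pipeline.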
![Page 69: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/69.jpg)
event data: logstash
generated from logstash agent
● logstash agent on host pushes to logstash server
● logstash server tokenizes and submits to elasticsearch
![Page 70: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/70.jpg)
monitoring: sensu
sensu server receives data from
● sensu agents
● statsD on logstash host
sensu handlers:
● flush to graphite
● send alerts to pagerduty
● create tickets in jira
● send messages to chat rooms
![Page 71: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/71.jpg)
sensu
architecture
![Page 72: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/72.jpg)
monitoring: why sensu?
clients subscribe to checks, supporting ephemeral hosts
sensu server can be parallelized and ephemeral
clients easily added to configuration management
backwards compatible to nagios
multiple multi-site strategies available
excellent API
![Page 73: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/73.jpg)
telemetry storage: graphite
why graphite?
● works with many different pollers, to graph everything!
● combines maturity with functionality better than others
● can be clustered for scale
what are the limitations?
● clustering for scale is complex, not native
● flat files means no joining multiple series for complex
queries
● advanced statistical analysis not easy
![Page 74: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/74.jpg)
event storage: elasticsearch
why?
● native distribution via clustering and sharding
● performant indexing and querying
● elasticity (unicorn scale)
what are the limitations?
● security still minimal
● enterprise features becoming available (at a price)
![Page 75: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/75.jpg)
visualization
telemetry: grafana
logs/events: kibana
incidents/alerts: uchiwa
![Page 76: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/76.jpg)
scaled and available sensu
rabbitMQ
● network partition
○ pause-minority
● node failures
○ mirrored queues
○ tcp load balancers
● AZ failures
○ multiple availability zones
○ pause minority recovery
● scaling concerns
○ auto-scaling nodes
○ elastic load balancer
![Page 77: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/77.jpg)
scaled and available sensu
sensu servers
● redis failure
○ elasticache, multi-AZ
● sensu main host
○ multiple hosts
○ use the same redis service
○ multi-AZ
![Page 78: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/78.jpg)
monitoring your monitor
rabbitMQ
● monitor queue growth for anomalies
● monitor for network partitions
● monitor auto scaling cluster size
sensu
● sensu cluster size (n+1 sensu hosts)
● redis availability
![Page 79: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/79.jpg)
workflow
● metric sent to ELB
● ELB sends to rabbitMQ cluster
● rabbitMQ
○ writes to master
○ replicates to mirror
● rabbitMQ sends to sensu
![Page 80: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/80.jpg)
scaled and available graphite
carbon cache (tcp daemon, listening for metrics)
● scale with multiple caches on each host
carbon relay
● used to distribute to multiple carbon caches
● by metric name, or consistent hashing
● can be redundant, using load balancers
whisper (flat file database, storage)
● can be replicated at the relay level
● running out of capacity and having to grow requires
rehashing
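Consistent hashing is what keeps a given metric name on the same carbon cache, and it limits how much data moves when a cache is added. The Python sketch below shows the general technique only; it is not carbon's own implementation, and the class name, the md5 hash, and the 100 virtual points per node are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """A generic consistent hash ring mapping metric names to nodes."""

    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric_name):
        # Walk clockwise to the first point at or after the metric's hash.
        idx = bisect.bisect(self._keys, self._hash(metric_name)) % len(self._ring)
        return self._ring[idx][1]


names = [f"host{i}.cpu.user" for i in range(1000)]
before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = [n for n in names if before.node_for(n) != after.node_for(n)]
print(f"{len(moved)} of {len(names)} metrics moved after adding a cache")
```

Only the metrics claimed by the new node change owner, which is why growing a ring is cheaper than re-sharding by a plain modulo. The existing whisper files for those metrics still have to be relocated, which is the rehashing cost the slide mentions.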
![Page 81: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/81.jpg)
workflow
● metric sent to ELB
● ELB sends to carbon relay
● carbon relay
○ chooses carbon cache
○ replicates as needed
● carbon cache flushes to
whisper
![Page 82: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/82.jpg)
scaled and available elasticsearch
node and cluster scaling
● clustering scales reads
● distribute across availability zones
● sharding indices allows for distributing data
● multiple clusters for multiple indexes
network partitions
● running masters on dedicated nodes
● running data nodes on dedicated nodes
● run search load balancers on dedicated nodes
![Page 83: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/83.jpg)
workflow
● master/replica nodes route data
and manage the cluster
● client nodes redirect queries
● data nodes store index shards
![Page 84: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/84.jpg)
what’s next?
visualization
● use sensu API to get incidents/alerts into graphite/
● merge kibana and grafana to one page
monitoring
● integrate flapjack for event aggregation and routing
● continue to add more metrics
full stack
● anomaly detection via heka or skyline
● influxdb for storage
![Page 85: Pythian operational visibility](https://reader034.fdocuments.us/reader034/viewer/2022042615/55a694001a28ab604d8b4804/html5/thumbnails/85.jpg)
lab time
● work on dashboards!
● 15 minute walkthrough w/ derek
● 45 minute play time