Winston - Netflix's event driven auto remediation and diagnostics tool
Winston
Diagnostic and Remediation Engineering (DaRE)
Vinay Shah & Jean-Sebastien Jeannotte
Topics
● Introduction
● Internals - How does it work?
● Demo - See it in action!
● Learnings and challenges
● Metrics & road ahead
● Additional resources
Introduction
Landscape
● Operational load vs. new features
● Scale and growth
● Availability
Application or Service → Monitoring → Alerting → PagerDuty / Email / Winston
Business goals
● Reduce MTTR
● Reduce the risk of human error
● Reduce pager fatigue and provide tier 1 support
● Don’t worry about infrastructure; focus on your business logic
● Best practices for runbook lifecycle management
Winston is an event-driven runbook automation platform, designed to host runbooks and execute them in response to operational events.
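As a minimal sketch of that event-driven pattern, runbooks can be registered against an event name and executed when a matching event arrives. The `Event`, `RunbookEngine`, and `restart_process` names are hypothetical and are not Winston's API. A real runbook would act on infrastructure rather than return a string.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """An operational event, e.g. an alert raised by monitoring (hypothetical shape)."""
    source: str                      # e.g. "alerting"
    name: str                        # e.g. "process_down"
    payload: Dict[str, str] = field(default_factory=dict)


Runbook = Callable[[Event], str]


class RunbookEngine:
    """Hosts runbooks and executes the ones registered for an incoming event."""

    def __init__(self) -> None:
        self._runbooks: Dict[str, List[Runbook]] = defaultdict(list)

    def register(self, event_name: str, runbook: Runbook) -> None:
        self._runbooks[event_name].append(runbook)

    def handle(self, event: Event) -> List[str]:
        # Run every runbook subscribed to this event name; events with no
        # registered runbook produce an empty result.
        return [runbook(event) for runbook in self._runbooks.get(event.name, [])]


def restart_process(event: Event) -> str:
    # Placeholder remediation: a real runbook would call out to the host.
    return f"restart requested for {event.payload.get('host', 'unknown')}"


engine = RunbookEngine()
engine.register("process_down", restart_process)
print(engine.handle(Event("alerting", "process_down", {"host": "kafka-01"})))
# prints ['restart requested for kafka-01']
```

Keeping the dispatch table keyed on event name means adding a new automation is a registration, not a change to the engine. That separation matches the "focus on your business logic" goal.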
Internals
How is it deployed?
Execution Flow
Winston Studio
● One-stop portal for all things Winston
● Supports create, read, update, delete, execute and diagnose functionality
● Implements best practices
○ Compliance/auditing
○ Persistence
○ Security (authentication/authorization)
● Self-serve & scalable
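The compliance/auditing and security points can be illustrated with a small sketch. A decorator checks whether a user may run an action and records every attempt, whether it is denied, fails, or succeeds. `audited`, `AuditRecord`, the in-memory `AUDIT_LOG`, and the user allow-list are all illustrative assumptions, since the deck does not describe how Winston Studio implements these features. A production system would persist records and delegate authorization to an identity service.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Set


@dataclass
class AuditRecord:
    user: str
    action: str
    params: dict
    status: str          # "denied", "failed" or "succeeded"
    timestamp: float


AUDIT_LOG: List[AuditRecord] = []  # hypothetical stand-in for a persistent audit store


def audited(action_name: str, allowed_users: Set[str]) -> Callable:
    """Wrap an action so every execution is authorized and recorded."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(user: str, **params: Any) -> Any:
            if user not in allowed_users:
                AUDIT_LOG.append(AuditRecord(user, action_name, params, "denied", time.time()))
                raise PermissionError(f"{user} may not run {action_name}")
            try:
                result = func(**params)
            except Exception:
                AUDIT_LOG.append(AuditRecord(user, action_name, params, "failed", time.time()))
                raise
            AUDIT_LOG.append(AuditRecord(user, action_name, params, "succeeded", time.time()))
            return result
        return wrapper
    return decorator


@audited("cleanup_disk", allowed_users={"alice"})
def cleanup_disk(host: str) -> str:
    # Hypothetical action body; a real one would free space on `host`.
    return f"cleaned {host}"
```

Calling `cleanup_disk("alice", host="web-1")` returns `"cleaned web-1"` and appends a `succeeded` record. The same call made by `"bob"` raises `PermissionError` after logging a `denied` record.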
Terminology
● Pack: a group of related automations, typically organized around a discrete service or product
● Action: a set of diagnostic or remediation steps, written as code
● Event & event source: the events that trigger a runbook, and the external services they come from
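One way to picture how these terms relate is a small data model. A pack owns its actions and binds each one to an (event source, event name) pair. StackStorm, linked in the resources below, also organizes automation into packs and actions, but the `Pack` and `Action` classes here are hypothetical and are not its API or Winston's.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Action:
    """A set of diagnostic or remediation steps, written as code."""
    name: str
    steps: Callable[[dict], str]


@dataclass
class Pack:
    """Related automations grouped around one service or product."""
    name: str
    actions: Dict[str, Action] = field(default_factory=dict)
    # (event source, event name) -> names of actions in this pack to run
    bindings: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def add(self, action: Action, source: str, event: str) -> None:
        self.actions[action.name] = action
        self.bindings.setdefault((source, event), []).append(action.name)

    def actions_for(self, source: str, event: str) -> List[Action]:
        return [self.actions[n] for n in self.bindings.get((source, event), [])]


kafka = Pack("kafka")
kafka.add(Action("restart_broker", lambda p: f"restarting {p['host']}"),
          source="alerting", event="broker_down")
for action in kafka.actions_for("alerting", "broker_down"):
    print(action.name, action.steps({"host": "kafka-01"}))
```

Grouping by pack gives each service team a self-contained unit to own, while event bindings keep the trigger logic next to the actions it starts.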
Demo
Winston Studio
DEMO
Sample use cases
● False positives
○ Cassandra ring health
● Diagnostics: correlation could point toward causation, e.g.:
○ Querying Chronos events
○ Querying upstream and downstream dependencies for anomalous behaviour
● Remediation
○ Clean up disk space
○ Restart Kafka process
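The Chronos diagnostic can be sketched as a time-window query. Given an alert, it lists recent changes to the alerting service and its upstream and downstream dependencies. This is a sketch under assumptions. Chronos's real query interface is not shown in the deck, so `ChangeEvent` and `recent_changes` are hypothetical, the events are held in memory, and the 30-minute window is an arbitrary default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ChangeEvent:
    """A recorded change, e.g. a deployment or config push (hypothetical shape)."""
    service: str
    description: str
    at: datetime


def recent_changes(alert_time: datetime, services: List[str], events: List[ChangeEvent],
                   window: timedelta = timedelta(minutes=30)) -> List[ChangeEvent]:
    """Return changes to `services` in the window before the alert, newest first."""
    start = alert_time - window
    hits = [e for e in events if e.service in services and start <= e.at <= alert_time]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

A hit is only a correlation. As the slide notes, it could point toward causation, so a sketch like this surfaces candidates for a human to confirm.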
Learnings & challenges
Common patterns
Insights
● Usage
○ A culture of automating the manual and repeatable
○ Noisy signals become more interesting
○ The lesser the control, the greater the opportunity
● Product
○ Safety is crucial
○ Usability is important
○ Resiliency
Recommendations to get started
● Don’t reinvent the wheel
● Start simple and iterate
● Allow experimentation
● Pay special attention to the usability of your product
● Push for a change in culture; usage will follow
● Talk to us or others who have gone through some of the same pains and learnings
Metrics and Road ahead
The road ahead
● Adoption. Adoption. Adoption.
● Usability
○ Polyglot support (Groovy-based actions)
○ Deeper integrations
● Safety
○ Resource isolation (containers)
○ Rate limiting
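Rate limiting matters because an automated remediation that fires on every alert can make a flapping system worse. One common approach is a sliding-window limit per action and target. The deck does not say which scheme Winston adopted, so `RemediationRateLimiter` and its parameters are hypothetical. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class RemediationRateLimiter:
    """Allow at most `max_runs` executions per (action, target) within `window_s` seconds."""

    def __init__(self, max_runs: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_runs = max_runs
        self.window_s = window_s
        self.clock = clock
        self._history: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, action: str, target: str) -> bool:
        now = self.clock()
        runs = self._history[(action, target)]
        # Forget executions that have aged out of the sliding window.
        while runs and now - runs[0] >= self.window_s:
            runs.popleft()
        if len(runs) >= self.max_runs:
            # Over the limit: refuse. A caller might escalate to a human here (assumption).
            return False
        runs.append(now)
        return True
```

Keying on the (action, target) pair means one noisy host cannot exhaust the budget for the same action elsewhere. For example, `allow("restart_kafka", "kafka-01")` is counted separately from `kafka-02`.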
Links & resources
● Introducing Winston: http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
● StackStorm: https://docs.stackstorm.com/
● Reach out: [email protected] or [email protected]
We are hiring
Senior Software Engineer: https://jobs.netflix.com/jobs/860752
Thank you.