Posted on 18-Jan-2017
The Netflix API Platform for Server-Side Scripting
The risks of modifying running production servers
Problem identified: new servers aren’t coming up healthy!
Ugh! There’s a problem. Errors from API are up.
Escalating customer impact
Stream starts per second drift further and further from the expected value.
[Chart: stream starts per second, expected value vs. actual value]
Resolving the issue
Finally root-caused! Now restarting all unhealthy servers.
Back to normal!
Resolving the issue
Stream starts per second also back to normal.
[Chart: stream starts per second, expected value vs. actual value]
The Netflix API
Access to Netflix mid-tier services
Today’s system
[Diagram: clients A, B, C ... Y, Z; network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Today’s system (simplified)
What we need
Today’s architecture
[Diagram: clients A, B, C ... Y, Z; scripts (Groovy, ~700 active); network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Flexibility for devices
[...]
Device1VideoCommon.formatKidsSeason(apiRequest, [...], imageUrl)
[...]
[...]
Device2Common.formatAllSeasons([...])
[...]
[...]
dataPublishingService.getShowFeedbackBuilder(user, video)
[...]
What we need
Developer Velocity: Decoupled deployments of versions
[Diagram: independently deployed version sequences: n, n+1, n+2, n+3; i, i+1, i+2, i+3, i+4; j, j+1; k, k+1; l]
What we need
Challenge #1: Resiliency
Resiliency in today’s system
[Diagram: clients A, B, C ... Y, Z; scripts; network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Strong resiliency with Hystrix
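The kind of protection Hystrix provides around dependency calls can be illustrated with a minimal circuit breaker with a fallback. This is only a sketch of the general pattern in JavaScript: Hystrix itself is a Java library, and the class name, option names, and thresholds below are hypothetical and do not mirror its API.

```javascript
// Minimal circuit breaker with fallback (illustrative only; NOT Hystrix's API).
// All names and defaults here are assumptions made for this sketch.
class CircuitBreaker {
  constructor(action, fallback, options = {}) {
    const { failureThreshold = 3, resetAfterMs = 5000, now = Date.now } = options;
    this.action = action;           // the call to a dependency
    this.fallback = fallback;       // what to return when the call fails or is skipped
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;                 // injectable clock, handy for testing
    this.failures = 0;
    this.openedAt = null;           // null means the circuit is closed
  }

  call(...args) {
    if (this.openedAt !== null) {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        return this.fallback(...args); // open: fail fast, skip the dependency
      }
      this.openedAt = null;            // half-open: allow one trial call
    }
    try {
      const result = this.action(...args);
      this.failures = 0;               // success closes the circuit fully
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = this.now();    // too many failures: open the circuit
      }
      return this.fallback(...args);
    }
  }
}

// Usage: a dependency that always fails is called twice, then skipped.
let dependencyCalls = 0;
const ratings = new CircuitBreaker(
  () => { dependencyCalls += 1; throw new Error("dependency down"); },
  () => "default rating",
  { failureThreshold: 2 }
);
ratings.call();
ratings.call();
ratings.call(); // answered by the fallback; the dependency is not called again
```

The point of the pattern is that a failing dependency costs the caller a cheap fallback rather than a pile-up of slow or failing calls.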
What about resiliency on this side (the Groovy scripts)?
Example: memory usage
Periodic cleanup
New upload increases memory usage.
1-2 years ago
[Diagram: clients A, B, C ... Y, Z; scripts (Groovy); network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Few, small scripts; fewer uploads
Today
[Diagram: clients A, B, C ... Y, Z; scripts/apps (Groovy); network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
~700 more complex scripts/apps, 10-50 uploads per day
Streaming Hours Per Year in Billions
Changing risk profile
Lack of process isolation is a growing risk.
Some possible mitigations
Velocity vs. Resiliency
Moving toward our ideal API: What will change
The (near) future
[Diagram: clients A, B, C ... Y, Z; Node.js scripts with process isolation; network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Why containers?
Isolated failures: scripts don’t affect each other (usually)
API
Temporarily unavailable!
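As a rough analogy for isolated failures, the sketch below runs each script in its own Node.js process, so a crash in one leaves the others untouched. It uses plain operating-system processes instead of containers, and the script names and bodies are hypothetical stand-ins, not real device scripts.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical stand-ins for device scripts: one crashes, the others are healthy.
const scripts = {
  deviceA: 'console.log("A ok")',
  deviceB: 'throw new Error("B crashed")',
  deviceC: 'console.log("C ok")',
};

// Run a script in its own Node.js process so that a crash stays contained.
// (Run sequentially here for simplicity; a real platform would run them concurrently.)
function runIsolated(source) {
  const result = spawnSync(process.execPath, ["-e", source], {
    encoding: "utf8",
    timeout: 10000, // a hung script is killed instead of blocking the others
  });
  return { ok: result.status === 0, output: (result.stdout || "").trim() };
}

const outcomes = {};
for (const [name, source] of Object.entries(scripts)) {
  outcomes[name] = runIsolated(source);
}

// Only the crashing script is reported as temporarily unavailable.
for (const [name, outcome] of Object.entries(outcomes)) {
  console.log(name, outcome.ok ? outcome.output : "temporarily unavailable");
}
```

When all scripts share one process, as in the single JVM described earlier, the uncaught error or a memory leak in one script can affect everything else running alongside it.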
Independent autoscaling
API
Fast startup
Challenge #2: Great developer experience
Step-through-debugging (today)
Local development (future)
[Diagram: local project; file watcher agent with live reload; docker build / run; Docker Machine; local container; Proxy; NetworkAgent; node-inspector debugger.]
Run-time debugging/optimization (today)
[Diagram: clients A, B, C ... Y, Z; scripts (Groovy); network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]
Problems are hard to root-cause, and performance is hard to measure and optimize.
Script → API interaction (today)
[Diagram: device client; device server-side script; API]
Script → API interaction (future)
[Diagram: device server-side script; API]
Platform for device teams
Default configuration
Titus
ATLAS
NeWT: Netflix Workflow Toolkit (CLI)
Corresponding UI
Versioning
Easy access to instances
Rollback
Initial impressions
[Diagram: clients A, B, C, E; Node.js script; network boundary; API Server JVM; Netflix microservices.]
First end-to-end implementation and shadow traffic
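The idea behind shadow traffic can be sketched as follows: each request is served by the existing path, a copy is also sent to the new path, and differences are only recorded, never returned to the caller. The function and handler names are hypothetical, and the mirroring is synchronous here for brevity, whereas a real setup would typically mirror asynchronously.

```javascript
// Wrap a primary handler so each request is also mirrored to a shadow handler.
// The caller only ever sees the primary response; mismatches and shadow errors
// are passed to `report`. All names here are assumptions made for this sketch.
function withShadow(primary, shadow, report) {
  return function handle(request) {
    const response = primary(request);
    try {
      const shadowResponse = shadow(request);
      if (JSON.stringify(shadowResponse) !== JSON.stringify(response)) {
        report({ request, expected: response, actual: shadowResponse });
      }
    } catch (error) {
      // A failing shadow path must never affect the real response.
      report({ request, expected: response, error: String(error) });
    }
    return response;
  };
}

// Usage with two hypothetical implementations of the same endpoint.
const mismatches = [];
const handler = withShadow(
  (req) => ({ title: req.id.toUpperCase() }), // stand-in for the existing path
  (req) => ({ title: req.id }),               // stand-in for the new Node.js script
  (entry) => mismatches.push(entry)
);
const served = handler({ id: "show-1" }); // served from the existing path
```

Because the shadow result is discarded, the new implementation can be exercised with production-shaped requests before any client depends on it.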
Problem isolation (ex: memory leak)
Memory leak makes RSL blow up. Clearer idea of where the problem is.
node.js
Problem isolation (ex: memory leak)
Same with node script.
Request tracing: clearer picture of fan-out
[Diagram: clients A, B, C ... Y, Z; Node.js scripts; network boundary; API Server JVM; Netflix microservices. Language label: JS (mostly).]
How not to compromise what we’re good at
What we need
Other Netflix talks at QCon New York
Thank you!
Script management
Outtakes
Limited device-server chattiness
Operational insights
[Diagram: clients A, B, C ... Y, Z; Node.js scripts with process isolation; network boundary; API Server JVM; Netflix microservices. Language labels: JS (mostly), Java.]