Simple Solutions for Complex Problems
Tyler Treat / Workiva
Boulder NATS Meetup 6/7/2016
• Embracing the reality of complex systems
• Using simplicity to your advantage
• Why NATS?
• How Workiva uses NATS
ABOUT THIS TALK
• Messaging tech lead at Workiva
• Platform infrastructure
• Distributed systems
• bravenewgeek.com
ABOUT THE SPEAKER
There are a lot of parallels between real-world systems and distributed software systems.
The world is eventually consistent…
…and the database is just an optimization.[1]
[1] https://christophermeiklejohn.com/lasp/erlang/2015/10/27/tendency.html
“There will be no further print editions [of the Merck Manual]. Publishing a printed book every five years and sending reams of paper around the world on trucks, planes, and boats is no longer the optimal way to provide medical information.”
Dr. Robert S. Porter
Editor-in-Chief, The Merck Manuals
Programmers find asynchrony hard to reason about, but the truth is…
Life is mostly asynchronous.
What does this mean for us as programmers?
time / complexity
timesharing
monoliths
soa
virtualization
microservices
???
Complicated made complex…
Distributed!
Distributed computation is inherently asynchronous and the network is inherently unreliable[2]…
[2] http://queue.acm.org/detail.cfm?id=2655736
…but the natural tendency is to build distributed systems as if they aren’t distributed at all because it’s easy to reason about.
strong consistency - reliable messaging - predictability
• Complicated algorithms
• Transaction managers
• Coordination services
• Distributed locking
What’s in a guarantee?
• Message handed to the transport layer?
• Enqueued in the recipient’s mailbox?
• Recipient started processing it?
• Recipient finished processing it?
What’s a delivery guarantee?
Each of these has a very different set of conditions, constraints, and costs.
Guaranteed, ordered, exactly-once delivery is expensive (if not impossible[3]).
[3] http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
Over-engineered
Complex
Difficult to deploy & operate
Fragile
Slow
At large scale, guarantees will give out.
0.1% failure at scale is huge: at 8 million messages per second, that's 8,000 failures every second.
Replayable > Guaranteed
Idempotent > Exactly-once
Commutative > Ordered
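The "idempotent over exactly-once" preference can be sketched as a consumer that deduplicates on a message ID, so redeliveries become no-ops. This is a minimal illustration, not Workiva's implementation; the `Message` type and its `ID` field are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical envelope carrying a unique ID so that
// redelivered messages can be detected by the consumer.
type Message struct {
	ID   string
	Body string
}

// IdempotentConsumer remembers which message IDs it has already
// processed, so redeliveries are skipped instead of applied twice.
type IdempotentConsumer struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotentConsumer() *IdempotentConsumer {
	return &IdempotentConsumer{seen: make(map[string]bool)}
}

// Handle processes a message at most once per ID. It returns true if
// the message was processed, false if it was a duplicate delivery.
func (c *IdempotentConsumer) Handle(m Message) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[m.ID] {
		return false // redelivery: safe to ignore
	}
	c.seen[m.ID] = true
	fmt.Println("processing:", m.Body)
	return true
}

func main() {
	c := NewIdempotentConsumer()
	m := Message{ID: "evt-1", Body: "hello"}
	fmt.Println(c.Handle(m)) // true: first delivery is processed
	fmt.Println(c.Handle(m)) // false: redelivery is a no-op
}
```

With this in place, an at-least-once transport is enough: the broker may redeliver freely, and correctness lives in the consumer.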
But delivery != processing
Also, what does it even mean to “process” a message?
It depends on the business context!
If you need business-level guarantees, build them into the business layer.
We can always build stronger guarantees on top, but we can’t always remove them from below.
End-to-end system semantics matter much more than the semantics of an individual building block[4].
[4] http://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf
Embrace the chaos!
“Simplicity is the ultimate sophistication.”
EMBRACING THE CHAOS MEANS LOOKING AT THE NEGATIVE SPACE.
A simple technology in a sea of complexity.
Simple doesn’t mean easy.[5]
[5] https://blog.wearewizards.io/some-a-priori-good-qualities-of-software-development
“Simple can be harder than complex. You have to work hard to get your thinking clean to make it simple. But it’s worth it in the end because once you get there, you can move mountains.”
• Wdesk: platform for enterprises to collect, manage, and report critical business data in real time
• Increasing amounts of data and complexity of formats
• Cloud solution: - Accurate - Secure - Highly available - Scalable - Mobile-enabled
About Workiva
• First solution built on Google App Engine
• Scaling new solutions requires service-oriented approach
• Scaling new services requires a low-latency communication backplane
About Workiva
Why NATS?
Availability over everything.
• Always on, always available
• Protects itself at all costs—no compromises on performance
• Disconnects slow consumers and lazy listeners
• Clients have automatic failover and reconnect logic
• Clients buffer messages while temporarily partitioned
Availability over Everything
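The last two bullets describe client-side behavior during partitions: buffer outgoing messages, then flush on reconnect. Here is a toy sketch of that idea; real NATS clients bound the reconnect buffer by bytes, while this simplified version bounds by message count and drops the oldest, favoring availability over guaranteed delivery:

```go
package main

import "fmt"

// ReconnectBuffer is a toy sketch of client-side buffering while
// disconnected: messages are held up to a bound and flushed on
// reconnect; beyond the bound the oldest are dropped.
type ReconnectBuffer struct {
	limit   int
	pending []string
}

func NewReconnectBuffer(limit int) *ReconnectBuffer {
	return &ReconnectBuffer{limit: limit}
}

// Publish queues a message while partitioned, evicting the oldest
// message rather than blocking when the buffer is full.
func (b *ReconnectBuffer) Publish(msg string) {
	if len(b.pending) == b.limit {
		b.pending = b.pending[1:] // drop oldest
	}
	b.pending = append(b.pending, msg)
}

// Flush sends everything buffered once the connection is restored.
func (b *ReconnectBuffer) Flush(send func(string)) {
	for _, m := range b.pending {
		send(m)
	}
	b.pending = nil
}

func main() {
	buf := NewReconnectBuffer(2)
	buf.Publish("a")
	buf.Publish("b")
	buf.Publish("c") // "a" is evicted; buffer now holds b, c
	buf.Flush(func(m string) { fmt.Println("sent:", m) })
}
```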
Simplicity as a feature.
• Single, lightweight binary
• Embraces the “negative space”: - Simplicity → high performance - No complicated configuration or external dependencies (e.g. ZooKeeper) - No fragile guarantees → face complexity head-on, encourage async
• Simple pub/sub semantics provide a versatile primitive: - Fan-in - Fan-out - Request/response - Distributed queueing
• Simple text-based wire protocol
Simplicity as a Feature
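The primitives above can be illustrated with a toy in-memory bus. This is not the NATS client API, just a sketch of the two delivery semantics NATS provides on a subject: plain subscriptions fan out every message to every subscriber, while queue groups deliver each message to exactly one member (distributed queueing / load balancing):

```go
package main

import (
	"fmt"
	"sync"
)

type handler func(msg string)

// Bus is a toy in-memory pub/sub bus mimicking NATS delivery
// semantics: fan-out to plain subscribers, and per-group delivery
// to exactly one queue-group member.
type Bus struct {
	mu     sync.Mutex
	subs   map[string][]handler            // subject -> fan-out subscribers
	queues map[string]map[string][]handler // subject -> group -> members
	next   map[string]int                  // round-robin cursor per subject/group
}

func NewBus() *Bus {
	return &Bus{
		subs:   make(map[string][]handler),
		queues: make(map[string]map[string][]handler),
		next:   make(map[string]int),
	}
}

// Subscribe registers a fan-out subscriber on a subject.
func (b *Bus) Subscribe(subject string, h handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.subs[subject] = append(b.subs[subject], h)
}

// QueueSubscribe registers a handler as a member of a queue group.
func (b *Bus) QueueSubscribe(subject, group string, h handler) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.queues[subject] == nil {
		b.queues[subject] = make(map[string][]handler)
	}
	b.queues[subject][group] = append(b.queues[subject][group], h)
}

// Publish fans out to every plain subscriber and delivers to exactly
// one member of each queue group on the subject (round-robin).
func (b *Bus) Publish(subject, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, h := range b.subs[subject] {
		h(msg)
	}
	for group, members := range b.queues[subject] {
		key := subject + "/" + group
		h := members[b.next[key]%len(members)]
		b.next[key]++
		h(msg)
	}
}

func main() {
	bus := NewBus()
	bus.Subscribe("events", func(m string) { fmt.Println("sub A:", m) })
	bus.Subscribe("events", func(m string) { fmt.Println("sub B:", m) })
	bus.QueueSubscribe("events", "workers", func(m string) { fmt.Println("worker 1:", m) })
	bus.QueueSubscribe("events", "workers", func(m string) { fmt.Println("worker 2:", m) })

	bus.Publish("events", "hello") // both subs see it; only one worker does
	bus.Publish("events", "world") // the other worker gets this one
}
```

Request/response falls out of the same primitives: the requester subscribes to a unique reply subject and the responder publishes its answer there.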
Fast as hell.
[6] http://bravenewgeek.com/benchmarking-message-queue-latency/
• Fast, predictable performance at scale and at tail
• ~8 million messages per second
• Auto-pruning of interest graph allows efficient routing
• When SLAs matter, it’s hard to beat NATS
Fast as Hell
• Low-latency service bus
• Pub/Sub
• RPC
How We Use NATS
[Architecture diagrams: Web Clients connect through a Service Gateway to NATS, which routes to backend Services; successive slides scale out additional Services and Web Clients on both sides of the gateway.]
Pub/Sub
“Just send this thing containing these fields serialized in this way using that encoding to this topic!”
“Just subscribe to this topic and decode using that encoding then deserialize in this way and extract these fields from this thing!”
Pub/Sub is meant to decouple services but often ends up coupling the teams developing them.
How do we evolve services in isolation and reduce development overhead?
• Extension of Apache Thrift
• IDL and cross-language, code-generated pub/sub APIs
• Allows developers to think in terms of services and APIs rather than opaque messages and topics
• Allows APIs to evolve while maintaining compatibility
• Transports are pluggable (we use NATS)
Frugal RPC
struct Event {
    1: i64 id,
    2: string message,
    3: i64 timestamp,
}
scope Events prefix {user} {
    EventCreated: Event
    EventUpdated: Event
    EventDeleted: Event
}
subscriber.SubscribeEventCreated(
    "user-1",
    func(e *event.Event) {
        fmt.Println(e)
    },
)
. . .
publisher.PublishEventCreated(
    "user-1",
    event.NewEvent(),
)
generated
• Service instances form a queue group
• Client “connects” to instance by publishing a message to the service queue group
• Serving instance sets up an inbox for the client and sends it back in the response
• Client sends requests to the inbox
• Connecting is cheap—no service discovery and no sockets to create, just a request/response
• Heartbeats used to check health of server and client
• Very early prototype code: https://github.com/workiva/thrift-nats
RPC over NATS
• Store JSON containing cluster membership in S3
• Container reads JSON on startup and creates routes w/ correct credentials
• Services only talk to the NATS daemon on their VM via localhost
• Don’t have to worry about encryption between services and NATS, only between NATS peers
NATS per VM
• Only messages intended for a process on another host go over the network since NATS cluster maintains interest graph
• Greatly reduces network hops (usually 0 vs. 2-3)
• If local NATS daemon goes down, restart it automatically
NATS per VM
• Doesn’t scale to large number of VMs
• Fairly easy to transition to floating NATS cluster or running on a subset of machines per AZ
• NATS communication abstracted from service
• Send messages to services without thinking about routing or service discovery
• Queue groups provide service load balancing
NATS per VM
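The membership-JSON-to-routes step described above might look roughly like this sketch. The JSON shape and credentials are hypothetical, though the `nats-route://` URL form matches NATS cluster route configuration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Membership is a hypothetical shape for the cluster-membership JSON
// stored in S3: the NATS peers this daemon should create routes to,
// plus the cluster credentials to embed in each route URL.
type Membership struct {
	User     string   `json:"user"`
	Password string   `json:"password"`
	Peers    []string `json:"peers"` // host:port of each NATS peer
}

// RoutesConfig renders the routes section of a NATS server config
// from the membership document.
func RoutesConfig(doc []byte) (string, error) {
	var m Membership
	if err := json.Unmarshal(doc, &m); err != nil {
		return "", err
	}
	var b strings.Builder
	b.WriteString("routes = [\n")
	for _, p := range m.Peers {
		fmt.Fprintf(&b, "  nats-route://%s:%s@%s\n", m.User, m.Password, p)
	}
	b.WriteString("]\n")
	return b.String(), nil
}

func main() {
	doc := []byte(`{"user":"ruser","password":"secret","peers":["10.0.0.1:6222","10.0.0.2:6222"]}`)
	cfg, err := RoutesConfig(doc)
	if err != nil {
		panic(err)
	}
	fmt.Print(cfg)
}
```

On startup the container would fetch the document from S3, render this section into the daemon's config, and start NATS listening on localhost for its co-located services.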
• We’re a SaaS company, not an infrastructure company
• High availability
• Operational simplicity
• Performance
• First-party clients: Go, Java, C, C#, Python, Ruby, Elixir, Node.js
NATS as a Messaging Backplane
• Handle failure at the client - The less state in your middleware & infrastructure, the easier it is to scale - Exponential backoffs with jitter
• But never trust the client - Rate limits, message size limits, back pressure - Be strict in what you accept - Limit failure domain by forcing applications to make design decisions upfront instead of punting
Important Corollaries
Assume every client is trying to DoS you (because they probably are, intentionally or not).
“Every solution to every problem is simple… It's the distance between the two where the mystery lies.”
–Derek Landy, Skulduggery Pleasant
@tyler_treat
github.com/tylertreat
bravenewgeek.com
Thanks!