Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-scale applications
Transcript of Atmosphere Conference 2015: Need for Async: In pursuit of scalable internet-scale applications
Konrad 'ktoso' MalawskiGeeCON 2014 @ Kraków, PL
Konrad `@ktosopl` Malawski
need for asynchot pursuit
for Scalable apps
Konrad `ktoso` Malawski
Akka Team, Reactive Streams TCK
Konrad `@ktosopl` Malawski
akka.iotypesafe.comgeecon.org
Java.pl / KrakowScala.plsckrk.com / meetup.com/Paper-Cup @ London
GDGKrakow.pl lambdakrk.pl
Nice to meet you! Who are you guys?
Why scalability matters:
High Performance Software Development
For the majority of the time, high performance software development
is not about compiler hacks and bit twiddling.
It is about fundamental design principles that are key to doing any effective software development.
Martin Thompson
practicalperformanceanalyst.com/2015/02/17/getting-to-know-martin-thompson-...
Agenda• Why?
• Async and Synch basics / definitions
• Back-pressure applied
• Async where it matters: Scheduling
• How NOT to measure Latency
• Concurrent < lock-free < wait-free
• I/O: IO, AIO, NIO, Zero
• C10K: select, poll, epoll / kqueue
• Distributed Systems: Where Async is at Home
• Wrapping up and Q/A
Sync / Async Basics
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Sync / Async
Asynchronous systems rely on buffering.
Asynchronous systems rely on buffering.
What if they overflow? How do you back-pressure?
Back-pressure? Example Without
Fast Publisher Slow Subscriber
Back-pressure? (example without)
Subscriber usually has some kind of buffer.
Back-pressure? (example without)
Back-pressure? (example without)
Back-pressure? (example without)
What if the buffer overflows?
Back-pressure? (example without)
Use bounded buffer, drop messages + require re-sending
Back-pressure? (example without)
Kernel does this!Routers do this!
(TCP)
Use bounded buffer, drop messages + require re-sending
Back-pressure “not needed” when:
speed(publisher) < speed(subscriber)
Back-pressure? Fast Subscriber, No Problem
No problem!
Back-pressure?NACKing is NOT enough.
Back-pressure? (example without)
Back-pressure? Example NACKingTelling the Publisher to slow down / stop sending…
Back-pressure? Example NACKingNACK did not make it in time,
because M was in-flight!
Back-pressure?Reactive-Streams
= “Dynamic Push/Pull”
Just push – not safe when Slow Subscriber
Just pull – too slow when Fast Subscriber
Back-pressure? RS: Dynamic Push/Pull
Solution:Dynamic adjustment
Back-pressure? RS: Dynamic Push/Pull
Just push – not safe when Slow Subscriber
Just pull – too slow when Fast Subscriber
Back-pressure? RS: Dynamic Push/PullSlow Subscriber sees it’s buffer can take 3 elements. Publisher will never blow up it’s buffer.
Back-pressure? RS: Dynamic Push/PullFast Publisher will send at-most 3 elements. This is pull-based-backpressure.
Back-pressure? RS: Dynamic Push/Pull
Fast Subscriber can issue more Request(n), before more data arrives!
Back-pressure? RS: Dynamic Push/PullFast Subscriber can issue more Request(n), before more data arrives.
Publisher can accumulate demand.
Back-pressure? RS: Accumulate demandPublisher accumulates total demand per subscriber.
Back-pressure? RS: Accumulate demandTotal demand of elements is safe to publish. Subscriber’s buffer will not overflow.
Back-pressure? Reactive StreamsMini in-line pitch
Specification and TCKreactive-streams.org
ImplementationsAkka Streams / Akka Http
Slick RxJava
ReactorRatPackVert.x
MongoDB driver
http://www.reactive-streams.org/announce-1.0.0
Highly parallel systemsEvent loops
Async where it matters:
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Async where it matters: Scheduling
Scheduling (notice they grey sync call)
Scheduling (notice they grey sync call)
Scheduling (now with Async db call)
Scheduling (now with Async db call)
Scheduling (now with Async db call)
Scheduling (now with Async db call)
Latency
Latency
Time interval between
the stimulation and response.
Latency Quiz
Gil Tene style, see: “How NOT to Measure Latency”
Is 10s latency acceptable in your app? Is 200ms latency acceptable?
How about most responses within 200ms?So mostly 20ms and some 1 minute latencies is OK?
Do people die when we go above 200ms?
So 90% below 200ms, 99% bellow 1s, 99.99% below 2s?
Latency in the “real world”
Gil Tene style, see: “How NOT to Measure Latency”
“Our response time is 200ms average, stddev is around 60ms”
— a typical quote
Latency in the “real world”
Gil Tene style, see: “How NOT to Measure Latency”
“Our response time is 200ms average, stddev is around 60ms”
— a typical quote
Latency does NOT behave like normal distribution!
“So yeah, our 99,99%’ is…”
Gil Tene style, see: “How NOT to Measure Latency”
Hiccups
Gil Tene style, see: “How NOT to Measure Latency”
Hiccups
Gil Tene style, see: “How NOT to Measure Latency”
Hiccups
Gil Tene style, see: “How NOT to Measure Latency”
Hiccups
Gil Tene style, see: “How NOT to Measure Latency”
Hiccups
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent data structure
A locks; B tries lock... A writes;
A unlocks; B acquires the lock... B (finally) writes & unlocks
What can happen in concurrent data structures:
Concurrent < lock-free < wait-free
A locks; B tries lock... A writes;
A unlocks; B acquires the lock... B (finally) writes & unlocks
What can happen in concurrent data structures:
Moral? 1) Thread A is not very nice to B. 2) A lot of time is wasted.
Concurrent < lock-free < wait-freeW
asted time
Concurrent < lock-free < wait-less
Concurrency is NOT Parallelism.
Rob Pike - Concurrency is NOT Parallelism (video)
def offer(a: A): Boolean // returns on failuredef add(a: A): Unit // throws on failure
def put(a: A): Boolean // blocks until able to enqueue
Concurrent < lock-free < wait-free
concurrent data structure
< lock-free* data structure
* lock-free a.k.a. lockless
What lock-free programming looks like:
An algorithm is lock-free if it satisfies that:When the program threads are run sufficiently long,
at least one of the threads makes progress.
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-freeWhat can happen in concurrent data structures:
A tries to write; B tries to write; B wins!A tries to write; C tries to write; C wins!A tries to write; D tries to write; D wins!A tries to write; B tries to write; B wins!A tries to write; E tries to write; E wins!A tries to write; F tries to write; F wins!…
Moral? 1) Thread A is a complete loser. 2) Thread A eventually should be able to progress…
* Both versions are used: lock-free / lockless
class CASBackedQueue[A] { val _queue = new AtomicReference(Vector[A]())
// does not block, may spin though @tailrec final def put(a: A): Unit = { val queue = _queue.get val appended = queue :+ a
if (!_queue.compareAndSet(queue, appended)) put(a) }}
Concurrent < lock-free < wait-free
class CASBackedQueue[A] { val _queue = new AtomicReference(Vector[A]())
// does not block, may spin though @tailrec final def put(a: A): Unit = { val queue = _queue.get val appended = queue :+ a
if (!_queue.compareAndSet(queue, appended)) put(a) }}
Concurrent < lock-free < wait-free
class CASBackedQueue[A] { val _queue = new AtomicReference(Vector[A]())
// does not block, may spin though @tailrec final def put(a: A): Unit = { val queue = _queue.get val appended = queue :+ a
if (!_queue.compareAndSet(queue, appended)) put(a) }}
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
Concurrent < lock-free < wait-free
“concurrent” data structure
< lock-free* data structure
< wait-free data structure
* Both versions are used: lock-free / lockless
Concurrent < lock-free < wait-free
Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue AlgorithmsMaged M. Michael Michael L. Scott
An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the
operation completes.
wait-free: j.u.c.ConcurrentLinkedQueue
Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue AlgorithmsMaged M. Michael Michael L. Scott
public boolean offer(E e) { checkNotNull(e); final Node<E> newNode = new Node<E>(e);
for (Node<E> t = tail, p = t;;) { Node<E> q = p.next; if (q == null) { // p is last node if (p.casNext(null, newNode)) { // Successful CAS is the linearization point // for e to become an element of this queue, // and for newNode to become "live". if (p != t) // hop two nodes at a time casTail(t, newNode); // Failure is OK. return true; } // Lost CAS race to another thread; re-read next } else if (p == q) // We have fallen off list. If tail is unchanged, it // will also be off-list, in which case we need to // jump to head, from which all live nodes are always // reachable. Else the new tail is a better bet. p = (t != (t = tail)) ? t : head; else // Check for tail updates after two hops. p = (p != t && t != (t = tail)) ? t : q; }}
This is a modification of the Michael & Scott algorithm,adapted for a garbage-collected environment, with supportfor interior node deletion (to support remove(Object)). For explanation, read the paper.
I / O
IO / AIO
IO / AIO / NIO
IO / AIO / NIO / Zero
Synchronous I / O
— Havoc Pennington(HAL, GNOME, GConf, D-BUS, now Typesafe)
When I learned J2EE about 2008 with some of my desktop colleagues our reactions included something like:
”wtf is this sync IO crap,
where is the main loop?!” :-)
Interruption!CPU: User Mode / Kernel Mode
Kernels and CPUs
Kernels and CPUs
Kernels and CPUs
Kernels and CPUs
Kernels and CPUs
[…] switching from user-level to kernel-level on a (2.8 GHz) P4 is 1348 cycles.
[…] Counting actual time, the P4 takes 481ns […]
http://wiki.osdev.org/Context_Switching
I / O
I / O
I / O
I / O
I / O
I / O
“Don’t worry. It only gets worse!”
I / O“Don’t worry.
It only gets worse!”
Same data in 3 buffers!4 mode switches!
Asynchronous I / O [Linux]
Linux AIO = JVM NIO
Asynchronous I / O [Linux]
NewIO… since 2004!(No-one calls it “new” any more)
Linux AIO = JVM NIO
Asynchronous I / O [Linux]
Less time wasted waiting. Same amount of buffer copies.
ZeroCopy = sendfile [Linux]
“Work smarter. Not harder.”
http://fourhourworkweek.com/
ZeroCopy = sendfile [Linux]
ZeroCopy = sendfile [Linux]
Data never leaves kernel mode!
ZeroCopy…
C10K and beyond
C10K and beyond
“10.000 concurrent connections”
Not a new problem, pretty old actually: ~12 years old.
http://www.kegel.com/c10k.html
C10K and beyond
It’s not about performance.It’s about scalability.
These are orthogonal things.
Threading differences: apache httpd / nginx / akka
(1 thread = 1 request) == Very heavy
select/poll
C10K – poll
C10K – poll
C10K – poll
epoll
C10K – epoll [Linux]
C10K – epoll [Linux]
C10K – epoll [Linux]
C10K – epoll [Linux]
C10K – epoll [Linux]
C10K
O(n) is a no-go for epic scalability.
C10K
O(n) is a no-go for epic scalability.
State of Linux scheduling: O(n) ⇒ O(1) ⇒ CFS (O(1)/ O(log n))
And Socket selection: Select/Poll O(n) ⇒ EPoll (O(1))
C10K
O(n) is a no-go for epic scalability.
State of Linux scheduling: O(n) ⇒ O(1) ⇒ CFS (O(1))
And Socket selection: Select/Poll O(n) ⇒ EPoll (O(1))
O(1) IS a go for epic scalability.Moral:
Distributed Systems
Distributed Systems“… in which the failure of a computer you didn't even know existed canrender your own computer unusable.”
— Leslie Lamport
http://research.microsoft.com/en-us/um/people/lamport/pubs/distributed-system.txt
Distributed SystemsThe bigger the system, the more “random” latency / failure noise.
Embrace instead of hiding it.
Distributed SystemsBackup Requests
Backup requests
A technique for fighting “long tail latencies”.By issuing duplicated work, when SLA seems in danger.
Backup requests
Backup requests
Backup requests - send
Backup requests - send
Backup requests - send
Backup requests
Avg Std dev 95%ile 99%ile 99.9%ile
No backups 33 ms 1524 ms 24 ms 52 ms 994 ms
After 10ms 14 ms 4 ms 20 ms 23 ms 50 ms
After 50ms 16 ms 12 ms 57 ms 63 ms 68 ms
Jeff Dean - Achieving Rapid Response Times in Large Online Services Peter Bailis - Doing Redundant Work to Speed Up Distributed Queries
Akka - Krzysztof Janosz @ Akkathon, Kraków - TailChoppingRouter (docs, pr)
Distributed SystemsCombined Requests
Combined requests
A technique for avoiding duplicated work.By aggregating requests, possibly increasing latency.
Combined requests
A technique for avoiding duplicated work.By aggregating requests, possibly increasing latency.
“Wat? Why would I increase latency!?”
Combined requests
Combined requests
Combined requests
Combined requests
Combined requests
Combined requests
World of Tradeoffs
Combined requests with backpressure
reactive-streams.org
Timer can be replaced with back-pressure.
Wrapping up
Wrapping up
• Someone has to bite the bullet though!
• We’re all running on real hardware.
• We do it so you don’t have to.
Summing up the goal of this talk:
• Keep your apps pure and simple
• Be aware of internals
• Use libraries which apply these techniques
• Maybe indeed
• Messaging all the way!
These techniques are hard to get right, yet:
Links• akka.io • reactive-streams.org• akka-user
• Gil Tene - How NOT to measure latency, 2013• Jeff Dean @ Velocity 2014• Alan Bateman, Jeanfrancois Arcand (Sun) Async IO Tips @ JavaOne• http://linux.die.net/man/2/select• http://linux.die.net/man/2/poll• http://linux.die.net/man/4/epoll• giltene/jHiccup• Linux Journal: ZeroCopy I, Dragan Stancevis 2013
• Last slide car picture: http://actu-moteurs.com/sprint/gt-tour/jean-philippe-belloc-un-beau-challenge-avec-le-akka-asp-team/2000
Links• http://wiki.osdev.org/Context_Switching• CppCon: Herb Sutter "Lock-Free Programming (or, Juggling Razor Blades)" • http://www.infoq.com/presentations/reactive-services-scale• Gil Tene’s HdrHistogram.org
• http://hdrhistogram.github.io/HdrHistogram/plotFiles.html• Rob Pike - Concurrency is NOT Parallelism (video)• Brendan Gregg - Systems Performance: Enterprise and the Cloud (book)• http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html• Jeff Dean, Luiz Andre Barroso - The Tail at Scale (whitepaper, ACM)• http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-
more-machines-equal.html• http://www.ulduzsoft.com/2014/01/select-poll-epoll-practical-difference-for-system-
architects/• Marcus Lagergren - Oracle JRockit: The Definitive Guide (book)• http://mechanical-sympathy.blogspot.com/2013/08/lock-based-vs-lock-free-
concurrent.html• Handling of Asynchronous Events - http://www.win.tue.nl/~aeb/linux/lk/lk-12.html• http://www.kegel.com/c10k.html
Links• www.reactivemanifesto.org/• Seriously the only right way to micro benchmark on the JVM:
• JMH openjdk.java.net/projects/code-tools/jmh/• JMH for Scala: https://github.com/ktoso/sbt-jmh
• http://www.ibm.com/developerworks/library/l-async/• http://lse.sourceforge.net/io/aio.html• https://code.google.com/p/kernel/wiki/AIOUserGuide• ShmooCon: C10M - Defending the Internet At Scale (Robert Graham)
• http://blog.erratasec.com/2013/02/scalability-its-question-that-drives-us.html#.VO6E11PF8SM
• User-level threads....... with threads. - Paul Turner @ Linux Plumbers Conf 2013• Jan Pustelnik’s talk yesterday: https://github.com/gosubpl/sortbench
Special thanks to:
• Aleksey Shipilëv• Andrzej Grzesik• Gil Tene• Kirk Pepperdine• Łukasz Dubiel• Marcus Lagergren• Martin Thompson• Mateusz Dymczyk• Nitsan Wakart
Thanks guys, you’re awesome.
alphabetically, mostly
• Peter Lawrey• Richard Warburton• Roland Kuhn• Sergey Kuksenko• Steve Poole• Viktor Klang a.k.a. √• Antoine de Saint Exupéry :-)• and the entire Akka Team• the Mechanical Sympathy Mailing List
Learn more at:
• SCKRK.com –
• KrakowScala.pl – Kraków Scala User Group
• LambdaKRK.pl – Lambda Lounge Kraków
• GeeCON.org – awesome “all around the JVM” conference, [Kraków, Poznań, Prague, and more…!]
Software Craftsmanship KrakówComputer Science Whitepaper Reading Club Kraków
Questions?
ktoso @ typesafe.com twitter: ktosopl
github: ktosoteam blog: letitcrash.com
home: akka.io
©Typesafe 2015 – All Rights Reserved