On the way to low latency
![Page 1: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/1.jpg)
On the way to low latency
Artem Orobets Smartling Inc
![Page 2: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/2.jpg)
Long story short
We realized that latency is important for us
Our fabulous architecture was supposed to work, but it didn’t
The issues that we have faced on the way
![Page 3: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/3.jpg)
Those guys consider 10µs latencies slow
We only have a 100 ms threshold
We are not a trading company
![Page 4: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/4.jpg)
Latencies of about 50 ms are barely noticeable to humans
Trans-Atlantic Path 91 ms*
Trans-Pacific Path 141 ms*
From Earth to Mars 3-22 min
![Page 5: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/5.jpg)
Importance of latency
• SLA
• Negative correlation to income
![Page 6: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/6.jpg)
How to measure it?
![Page 7: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/7.jpg)
Duration of a single test run
![Page 8: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/8.jpg)
Average of test run durations
![Page 9: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/9.jpg)
![Page 10: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/10.jpg)
Quantiles of test run durations
(usually 95th, 99th percentiles)
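A minimal sketch of how such quantiles could be computed from recorded durations, using the nearest-rank method. The class name, the method and the sample values are assumptions made up for illustration, not the tooling used in the talk.

```java
import java.util.Arrays;

/** Nearest-rank percentile over measured durations (an illustrative sketch). */
final class LatencyPercentiles {

    /** Returns the smallest sample such that at least p percent of samples are <= it. */
    static long percentile(long[] durations, double p) {
        if (durations.length == 0 || p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = durations.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based nearest rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical durations in milliseconds: mostly fast, with one slow outlier.
        long[] durationsMs = {1, 1, 2, 1, 1, 1, 3, 1, 1, 250};
        System.out.println("avg = " + Arrays.stream(durationsMs).average().orElse(0));
        System.out.println("p50 = " + percentile(durationsMs, 50));
        System.out.println("p95 = " + percentile(durationsMs, 95));
    }
}
```

With these made-up samples the single outlier dominates the average, while the median stays at the typical value and only the high percentile exposes the slow request.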
![Page 11: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/11.jpg)
Performance benchmark
Throughput: 750 rps
Latency percentiles:
98.47% <= 2 ms
99.95% <= 10 ms
99.98% <= 16 ms
99.99% <= 17 ms
100.00% <= 18 ms
![Page 12: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/12.jpg)
Benchmarks are hard
Almost all latency benchmarks are broken because almost all benchmarking tools are broken.
![Page 13: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/13.jpg)
Real system (time axis: 0, 100, 200)
• System Stalled: 10000 req, Avg = 50 sec
• Normal Operation: 10000 req, Avg = 1 ms
![Page 14: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/14.jpg)
Real system (time axis: 0, 100, 200), including the System Stalled period
• Avg = 25 sec
• 50th percentile = 1 ms
• 75th percentile = 50 sec
• 99.99th percentile = 100 sec
![Page 15: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/15.jpg)
Performance test (time axis: 0, 100, 200)
• System Stalled: 1 req, Avg = 100 sec
• Normal Operation: 10000 req, Avg = 1 ms
![Page 16: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/16.jpg)
Performance test (time axis: 0, 100, 200), including the System Stalled period
• Avg = 25 sec
• 50th percentile = 1 ms
• 75th percentile = 1 ms
• 99.99th percentile = 1 ms
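The gap between the real system and the performance test can be reproduced with a small simulation. This is a sketch under stated assumptions: the time axis is in seconds, requests arrive at a steady 100 per second, and the naive load generator blocks on the stalled request so the whole stall becomes a single sample. The class and method names are made up for illustration.

```java
import java.util.Arrays;

/** Simulates both measurement styles; all numbers are assumptions for illustration. */
final class StallSimulation {
    static final int RATE_PER_SEC = 100;   // assumed steady request rate
    static final int NORMAL_SECONDS = 100; // system answers in 1 ms
    static final int STALL_SECONDS = 100;  // system answers nothing
    static final double SERVICE_MS = 1.0;

    /** What users see: requests keep arriving during the stall and wait until it ends. */
    static double[] realSystem() {
        int normal = RATE_PER_SEC * NORMAL_SECONDS;
        int stalled = RATE_PER_SEC * STALL_SECONDS;
        double[] latencies = new double[normal + stalled];
        Arrays.fill(latencies, 0, normal, SERVICE_MS);
        double intervalMs = 1000.0 / RATE_PER_SEC;
        for (int i = 0; i < stalled; i++) {
            // Request i arrives i * intervalMs into the stall and waits for the rest of it.
            latencies[normal + i] = STALL_SECONDS * 1000.0 - i * intervalMs;
        }
        return latencies;
    }

    /** What a naive load generator records: the stall shows up as one slow request. */
    static double[] naiveTest() {
        int normal = RATE_PER_SEC * NORMAL_SECONDS;
        double[] latencies = new double[normal + 1];
        Arrays.fill(latencies, 0, normal, SERVICE_MS);
        latencies[normal] = STALL_SECONDS * 1000.0;
        return latencies;
    }

    /** Nearest-rank percentile. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static void report(String name, double[] samples) {
        System.out.printf("%s: avg=%.1f ms, p50=%.1f ms, p75=%.1f ms, p99.99=%.1f ms%n",
                name,
                Arrays.stream(samples).average().orElse(0),
                percentile(samples, 50), percentile(samples, 75), percentile(samples, 99.99));
    }

    public static void main(String[] args) {
        report("real system", realSystem());
        report("naive test ", naiveTest());
    }
}
```

Under these assumptions the real-system run has a 1 ms median and a 75th percentile of 50 seconds, while the naive test reports 1 ms at every percentile listed on the slide, because the requests that would have queued during the stall were never sent.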
![Page 17: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/17.jpg)
Imagine we’ve made some amazing improvement
![Page 18: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/18.jpg)
After improvement (time axis: 0, 100, 200)
• Degradation: 10000 req, Avg = 5 ms
• Normal Operation: 10000 req, Avg = 1 ms
![Page 19: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/19.jpg)
Performance test (time axis: 0, 100, 200), including the Degradation period
• Avg = 25 sec
• 50th percentile = 1 ms
• 75th percentile = 2.5 ms (was 1 ms)
• 99.99th percentile = 5 ms (was 1 ms)
![Page 20: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/20.jpg)
But what can we do?
![Page 21: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/21.jpg)
A good tool can give you a clue
![Page 22: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/22.jpg)
KPI
![Page 23: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/23.jpg)
Problem that we faced
![Page 24: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/24.jpg)
How much time could GC take?
![Page 25: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/25.jpg)
GC logging
• -Xloggc:path_to_log_file
• -XX:+PrintGCDetails
• -XX:+PrintGCDateStamps
• -XX:+PrintHeapAtGC
• -XX:+PrintTenuringDistribution
![Page 26: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/26.jpg)
-XX:+PrintGCDetails
[GC (Allocation Failure) 260526.491: [ParNew
…
[Times: user=0.02 sys=0.00, real=0.01 secs]
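To find long pauses without reading the log by eye, the `real=` field of the `[Times: ...]` entry can be extracted programmatically. A minimal sketch, assuming log lines in the format shown above; the class name and the 50 ms threshold are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the wall-clock pause from a "[Times: ...]" GC log entry (illustrative sketch). */
final class GcPauseScanner {
    // Matches the wall-clock part of "[Times: user=0.02 sys=0.00, real=0.01 secs]".
    private static final Pattern REAL_TIME = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Returns the wall-clock pause in milliseconds, or -1 if the line has no "real=" field. */
    static double realPauseMillis(String logLine) {
        Matcher m = REAL_TIME.matcher(logLine);
        return m.find() ? Double.parseDouble(m.group(1)) * 1000.0 : -1.0;
    }

    public static void main(String[] args) {
        String line = "[Times: user=0.02 sys=0.00, real=0.01 secs]";
        double thresholdMs = 50.0; // hypothetical alerting threshold
        double pauseMs = realPauseMillis(line);
        System.out.println(pauseMs >= thresholdMs
                ? "long pause: " + line
                : "ok: " + pauseMs + " ms");
    }
}
```

The `real` value is the elapsed wall-clock time, which is the number that matters for request latency; `user` and `sys` are CPU time and can exceed it when several GC threads run in parallel.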
![Page 27: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/27.jpg)
-XX:+PrintHeapAtGC
Heap after GC invocations=43363 (full 3):
par new generation total 59008K, used 1335K
eden space 52480K, 0%
from space 6528K, 20% used
to space 6528K, 0% used
concurrent mark-sweep generation total 2031616K, used 1830227K
![Page 28: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/28.jpg)
-XX:+PrintTenuringDistribution
Desired survivor size 3342336 bytes, new threshold 2 (max 2)
- age 1: 878568 bytes, 878568 total
- age 2: 1616 bytes, 880184 total
: 53829K->1380K(59008K), 0.0083140 secs] 1884058K->1831609K(2090624K), 0.0084006 secs]
![Page 29: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/29.jpg)
~100ms GC pauses in logs
![Page 30: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/30.jpg)
-XX:+UseConcMarkSweepGC
![Page 31: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/31.jpg)
Note: CMS collector on young generation uses the same algorithm
as that of the parallel collector.
Java GC documentation at oracle.com
* http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html
![Page 32: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/32.jpg)
![Page 33: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/33.jpg)
Too many live objects during young gen GC
• Minimize survivors
• Watch the tenuring threshold; you might need to tune it to tenure long-lived objects faster
• Reduce NewSize
• Reduce survivor spaces
![Page 34: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/34.jpg)
Watch your GC
*time span is 2h
![Page 35: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/35.jpg)
Watch your GC
![Page 36: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/36.jpg)
Some requests take almost a second
And it always seems to happen after a deploy
![Page 37: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/37.jpg)
is so lazy
![Page 38: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/38.jpg)
Smoke tests
• A good practice when you have continuous delivery
• It makes all your code initialized by the time real load comes in
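One way to get that effect is to replay a representative request a number of times right after startup, before the instance is put behind the load balancer. A minimal sketch; the handler, the iteration count and the class name are hypothetical stand-ins, not the setup from the talk.

```java
import java.util.function.Supplier;

/** Replays a request at startup so lazy initialization happens before real traffic. */
final class WarmUp {
    /** Runs the given request a number of times and returns how many calls succeeded. */
    static int run(Supplier<?> request, int iterations) {
        int succeeded = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                request.get();
                succeeded++;
            } catch (RuntimeException e) {
                // A failing warm-up call should not block startup; it is just not counted.
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // Hypothetical handler standing in for a real endpoint.
        Supplier<String> handler = () -> String.format("translated:%d", System.nanoTime());
        int ok = run(handler, 1_000);
        System.out.println("warm-up calls completed: " + ok + ", now ready for traffic");
    }
}
```

The warm-up only helps if it exercises the same code paths as production requests, which is why smoke tests that hit real endpoints are a natural fit.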
![Page 39: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/39.jpg)
Logging
Synchronous logging is not appropriate for an asynchronous application
![Page 40: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/40.jpg)
log4j2: Asynchronous Loggers for Low-Latency Logging
http://logging.apache.org/log4j/2.x/manual/async.html
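The idea behind asynchronous logging can be shown with the standard library alone: the request thread only enqueues the message, and a background thread performs the slow write. This is a toy sketch, not log4j2's implementation (its Async Loggers are built on the LMAX Disruptor ring buffer rather than a blocking queue), and every name in it is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Toy asynchronous logger: callers enqueue, one background thread does the slow write. */
final class AsyncLogSketch implements AutoCloseable {
    private static final String STOP = new String("STOP"); // unique sentinel, compared by identity
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Thread writer;

    AsyncLogSketch(Consumer<String> sink) {
        writer = new Thread(() -> {
            try {
                for (String msg = queue.take(); msg != STOP; msg = queue.take()) {
                    sink.accept(msg); // the slow I/O happens here, off the request thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Returns immediately; the unbounded queue means the caller never blocks on I/O. */
    void log(String message) {
        queue.add(message);
    }

    /** Lets the writer drain everything queued so far, then stops it. */
    @Override
    public void close() throws InterruptedException {
        queue.add(STOP);
        writer.join();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> written = new ArrayList<>();
        try (AsyncLogSketch log = new AsyncLogSketch(written::add)) {
            log.log("request handled");
            log.log("response sent");
        }
        System.out.println(written);
    }
}
```

An unbounded queue keeps the sketch short but trades latency for memory under sustained overload; a production logger needs a policy for a full buffer.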
![Page 41: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/41.jpg)
Logging: Sync vs Async
Sync: 769.05 rps
98.47% <= 2 ms
99.95% <= 10 ms
99.98% <= 16 ms
99.99% <= 17 ms
100.00% <= 18 ms
Async: 1658 rps
98.85% <= 1 ms
99.95% <= 7 ms
99.98% <= 13 ms
99.99% <= 15 ms
100.00% <= 18 ms
![Page 42: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/42.jpg)
Pauses of 50-150 ms
According to the logs, they come from the network
![Page 43: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/43.jpg)
They disappear when I scroll through the logs via SSH
![Page 44: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/44.jpg)
![Page 45: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/45.jpg)
Any ideas?
![Page 46: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/46.jpg)
TCP_NODELAY
![Page 47: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/47.jpg)
![Page 48: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/48.jpg)
Nagle's algorithm
• the "small packet problem"
• TCP packets have a 40-byte header (20 bytes for TCP, 20 bytes for IPv4)
• combining a number of small outgoing messages, and sending them all at once
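In Java, Nagle's algorithm is switched off per socket with `Socket.setTcpNoDelay(true)`, which sets TCP_NODELAY. A minimal sketch; the loopback server exists only to make the example self-contained, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Opens a socket with TCP_NODELAY set (illustrative sketch). */
final class NoDelayExample {
    /** Connects to the given address with Nagle's algorithm disabled. */
    static Socket connectNoDelay(InetAddress host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setTcpNoDelay(true); // send small writes immediately instead of coalescing them
        return socket;
    }

    public static void main(String[] args) throws IOException {
        // Loopback server on an ephemeral port, only so the example runs on its own.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = connectNoDelay(server.getInetAddress(), server.getLocalPort())) {
            System.out.println("TCP_NODELAY enabled: " + client.getTcpNoDelay());
        }
    }
}
```

The trade-off is more small packets on the wire, each carrying the 40-byte header, in exchange for not waiting to batch them.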
![Page 49: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/49.jpg)
• Pauses ~100 ms every couple of hours
• During connection creation
• Doesn’t reproduce on a local setup
![Page 50: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/50.jpg)
How to diagnose that?
![Page 51: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/51.jpg)
tcpdump -i eth0
![Page 52: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/52.jpg)
TCPDUMP
15:47:57.250119 IP (tos 0x0, ttl 64, id 44402, offset 0, flags [DF], proto TCP (6), length 569)
    192.168.3.131.58749 > 93.184.216.34.80: Flags [P.], cksum 0x76b5 (correct), seq 3847355529:3847356046, ack 3021125542, win 4096, options [nop,nop,TS val 848825338 ecr 1053000005], length 517: HTTP, length: 517
    GET / HTTP/1.1
    Host: example.com
    Connection: keep-alive
    …
![Page 53: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/53.jpg)
TCPDUMP
15:58:32.009884 IP (tos 0x0, ttl 255, id 39809, offset 0, flags [none], proto UDP (17), length 63) 192.168.3.131.56546 > 192.168.3.1.53: [udp sum ok] 52969+ A? www.google.com.ua. …
15:58:32.012844 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 127) 192.168.3.1.53 > 192.168.3.131.56546: [udp sum ok] 52969 q: A? www.google.com.ua. …
![Page 54: On the way to low latency](https://reader034.fdocuments.us/reader034/viewer/2022042706/58a994bb1a28abc2518b4b89/html5/thumbnails/54.jpg)
DNS lookups
• After hours of looking through tcp dumps
• We have found that DNS lookups sometimes take more than 100ms
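Once DNS is suspected, lookup time can also be measured directly from the JVM instead of through packet captures. A minimal sketch, assuming resolution goes through `InetAddress.getByName`; the class name and the `localhost` default are hypothetical, and this is not the diagnostic used in the talk.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Times host name resolution as seen by the JVM (illustrative sketch). */
final class DnsLookupTimer {
    /** Resolves the host name and returns how long the call took, in milliseconds. */
    static double timeLookupMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = args.length > 0 ? args[0] : "localhost"; // hypothetical default
        for (int i = 0; i < 3; i++) {
            // Repeated lookups may be served from the JVM's own address cache.
            System.out.printf("lookup %d of %s: %.3f ms%n", i + 1, host, timeLookupMillis(host));
        }
    }
}
```

Because the JVM caches successful lookups, a slow resolution typically shows up only on the first call or after the cache entry expires, which fits pauses that appear at connection creation only every now and then.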