Network latency - measurement and improvement

Transcript of Network latency - measurement and improvement

What is latency?

The time delay between cause and effect

What is latency?

• Latency impacts the user experience

• Lower latency = more responsive = better experience

• A fast download over a high latency link can take longer than a slow download over a low latency link

Why measure latency?

• Efficiency:

• Improved resource usage

• Improved user experience

• Spotting and diagnosing defects

Where is Latency?

• Between:

• A CPU and its cache

• Client and server over a network

• Application and disk

• Anywhere a system does work

Where is latency?

• L1 cache reference 0.5 ns

• Branch mispredict 5 ns

• L2 cache reference 7 ns

• Mutex lock/unlock 100 ns

• Main memory reference 100 ns

• Compress 1K bytes with Zippy 10,000 ns

• Send 2K bytes over 1 Gbps network 20,000 ns

• Read 1 MB sequentially from memory 250,000 ns

• Round trip within same datacenter 500,000 ns

• Disk seek 10,000,000 ns

• Read 1 MB sequentially from network 10,000,000 ns

• Read 1 MB sequentially from disk 30,000,000 ns

• Send packet CA->Netherlands->CA 150,000,000 ns

Causes of network latency

• Physical limitations - speed of light, wire speeds

• Congestion at switches, routers and servers

• Packet loss due to noise, congestion, faults

Round Trip Times

• aka RTT

• Time to go there and back again

• The return route may be different from the outbound route

Network Latency Tools

• Ping. Time between sending an ICMP Echo Request and receiving an ICMP Echo Reply

• Traceroute. Time between sending a packet with an incremented TTL value and receiving an ICMP Time Exceeded packet

• tcptraceroute. traceroute using TCP packets to configurable ports

• mtr - does ICMP, UDP and TCP traceroute
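
Where raw ICMP sockets are not available (ping normally needs elevated privileges), a rough application-level RTT can be estimated by timing a plain TCP connection. A minimal sketch in Python; the hostname and port are placeholders:

    # Rough RTT estimate by timing TCP connection setup (no raw sockets needed).
    import socket
    import time

    def tcp_rtt(host, port=443, samples=5):
        """Return a list of TCP connect times in milliseconds."""
        times = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=5):
                pass  # connection established; close immediately
            times.append((time.perf_counter() - start) * 1000)
        return times

    if __name__ == "__main__":
        rtts = tcp_rtt("example.com")  # placeholder host
        print("min/avg/max ms: %.1f / %.1f / %.1f"
              % (min(rtts), sum(rtts) / len(rtts), max(rtts)))

A TCP connect costs roughly one round trip, so the figures track ping's numbers, with a little extra for connection setup.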

Transmission Control Protocol (TCP)

• Reliable connections, with retransmission

Transmission Control Protocol

• Stateful, connection-oriented protocol for reliable data transmission

• Guarantees data delivery and ordering

• Servers maintain state tables of connections

• HTTP, SMTP, SSL/TLS, IRC, SSH…

TCP

• Three-way handshake. 1.5 round trips to set up a connection

TCP Latency Improvements

• By reducing the number of round trips:

• Compress content into fewer packets. A 1500-byte MTU gives a 1460-byte payload

• TCP timestamps take an extra 12 bytes, leaving a 1448-byte payload. Timestamps can be disabled.
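
A back-of-envelope sketch of why compression saves round trips, using the per-packet payload sizes above; the 120 KB and 30 KB response sizes are made-up illustrative figures, not numbers from the talk:

    # Packet counts for an example response, before and after compression.
    import math

    MSS_PLAIN = 1460       # 1500-byte MTU minus IP and TCP headers
    MSS_TIMESTAMPS = 1448  # minus a further 12 bytes for TCP timestamps

    def packets_needed(payload_bytes, mss):
        return math.ceil(payload_bytes / mss)

    uncompressed = 120 * 1024   # e.g. raw HTML (illustrative)
    compressed = 30 * 1024      # e.g. the same HTML after gzip (illustrative)

    for label, size in (("uncompressed", uncompressed), ("compressed", compressed)):
        print(label, packets_needed(size, MSS_TIMESTAMPS), "packets")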

TCP Improvements

• Move your content closer to your users:

• Make good use of local caches (e.g. browser)

• Content Delivery Networks (Cloudflare, CloudFront, Akamai)

• Host geographically closely

• Host at locations with low latency links

HTTP Latency

• Use HTTP/1.1, HTTP/2 (née SPDY)

• Ensure pipelining is enabled

• Tune TCP keep alive

• Try TCP corking (buffer the stream, then send) and TCP_NODELAY (send small payloads immediately instead of buffering them) - see the socket sketch below
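
A minimal sketch of how those two knobs are exposed as socket options in Python; the host, port and request are placeholders, and TCP_CORK is Linux-only:

    import socket

    HOST, PORT = "example.com", 80  # placeholders
    REQUEST = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"

    sock = socket.create_connection((HOST, PORT), timeout=5)

    # Disable Nagle's algorithm so small writes go out immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    if hasattr(socket, "TCP_CORK"):          # Linux only
        # Cork: hold partial frames while the request is assembled...
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
        sock.sendall(REQUEST)
        # ...then uncork to flush the buffered bytes in full-sized packets.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
    else:
        sock.sendall(REQUEST)

    print(sock.recv(200))  # first bytes of the reply
    sock.close()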

HTTP Latency

• Take care over caching and provide well-formed headers (see the example response below)

• Use tools like PageSpeed Insights to analyse performance

• Use the PageSpeed module to modify content on the server
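
As an illustration of well-formed caching headers, a small standard-library handler that serves a static asset with explicit Cache-Control and ETag headers; the values are made up for the example:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CachedAsset(BaseHTTPRequestHandler):
        def do_GET(self):
            body = b"body { margin: 0 }\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/css")
            self.send_header("Content-Length", str(len(body)))
            # Let browsers and CDNs keep the asset for a day without revalidating.
            self.send_header("Cache-Control", "public, max-age=86400")
            self.send_header("ETag", '"css-v1"')
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), CachedAsset).serve_forever()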

SSL/TLS

• Use AES and compatible libraries on processors with AES-NI for hardware acceleration

• Elliptic Curve (ECDSA) for smaller certs & keys and better performance.

• Terminate SSL at the edge and consider using lightweight or no encryption inside the local network.
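
One quick way to check what a server actually negotiates (for example, whether you get an ECDHE key exchange with an AES-GCM cipher that AES-NI can accelerate) is to inspect the handshake result. A sketch with the standard ssl module; the hostname is a placeholder:

    import socket
    import ssl

    HOST = "example.com"  # placeholder

    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, 443), timeout=5) as raw:
        with ctx.wrap_socket(raw, server_hostname=HOST) as tls:
            name, version, bits = tls.cipher()
            print("protocol:", tls.version())
            print("cipher:  ", name, "(%d-bit)" % bits)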

User Datagram Protocol

• ‘Fire and forget’ - no inbuilt reliability, connectionless

• No handshake

• Ordering and retransmission are handled at the application level

• Stateless, so no connect states to manage

• DNS, VOIP, SNMP, RIP, VPNs, Games, Mosh
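
A minimal fire-and-forget UDP send, showing that there is no handshake and nothing to confirm delivery; the address and port are placeholders (TEST-NET-1):

    # If the datagram is lost, nothing here will notice - any retry or
    # ordering logic has to live in the application.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"hello", ("192.0.2.10", 9999))  # returns as soon as it's queued
    sock.close()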

Domain Name System (DNS)

• DNS lookups can hamper user experience significantly

• Synchronous lookup before each resource access (timed in the sketch below)

• Uses UDP (usually) for client/server lookups
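
To see what each synchronous lookup costs before a connection can even be opened, the blocking resolution can be timed directly; the hostname is a placeholder:

    import socket
    import time

    start = time.perf_counter()
    addrs = socket.getaddrinfo("example.com", 443, proto=socket.IPPROTO_TCP)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("resolved %d addresses in %.1f ms" % (len(addrs), elapsed_ms))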

DNS

• Caches are distributed nearer to the user (DNS resolvers/forwarders)

• Great for popular sites

• A lower-traffic site may still require an authoritative lookup

DNS CNAMES

• DNS CNAMEs - name -> name -> IP

• Two DNS lookups. Two round trips.

• Never use a CNAME at a zone apex if you have other records in that zone.

DNS Time to Live

• The time a DNS record is cached in non-authoritative servers.

• Need to strike a balance between keeping the record cached near the user and the ability to update the record

• 1 day is a good starting point. Decrease it before record switch-overs.

DNS clients

• Avoid synchronous DNS lookups where possible: use async libraries, or batch-process results later (see the thread-pool sketch below)

• Consider local hosts files, and use config management to distribute them
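
One standard-library way to avoid paying for lookups one at a time is to resolve a batch of names concurrently; the hostnames are placeholders, and a true async resolver library would avoid the threads altogether:

    import socket
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["example.com", "example.net", "example.org"]  # placeholders

    def resolve(host):
        try:
            return host, socket.gethostbyname(host)
        except socket.gaierror as exc:
            return host, "lookup failed: %s" % exc

    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        for host, addr in pool.map(resolve, HOSTS):
            print(host, "->", addr)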

DNS

• Keep DNS geographically close to users

• Use providers with anycast DNS servers

• Globally distribute records if the audience is global

• Can make initial load significantly faster

QUIC

• Experimental protocol from Google for encrypted, multiplexed streams over UDP

• Aims to reduce number of round trips

• May make the next TLS standard

• Supported by Chrome, prototype server

Client and server hosts

• Watch for queuing - something sitting in a queue means there is not enough resource to service the request

• Disk IO is historically a problem; throughput is measured in IOPS. SSDs are reducing this latency (a crude probe is sketched below).

• Be familiar with the standard system monitoring tools

• Be wary of multi-threaded processes and locks
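
A crude sketch for putting a number on disk write latency by timing small write-plus-fsync cycles; the path and block size are arbitrary, and purpose-built tools such as fio or iostat give far more trustworthy figures:

    import os
    import time

    samples = []
    with open("latency_probe.tmp", "wb") as f:
        for _ in range(20):
            start = time.perf_counter()
            f.write(os.urandom(4096))   # one 4 KiB block
            f.flush()
            os.fsync(f.fileno())        # force it to the device
            samples.append((time.perf_counter() - start) * 1000)
    os.remove("latency_probe.tmp")

    samples.sort()
    print("median %.2f ms, worst %.2f ms"
          % (samples[len(samples) // 2], samples[-1]))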

Cloud

• Get familiar with the cloud providers’ tools. They give useful views from outside the hosts.

• Load test for 5+ cycles of monitoring

• Can provide protocol level information

• Test apps from the point of view of the users - Nagios, Pingdom, hitting representative endpoints (see the probe sketch below)

• Don’t take their word for performance - measure it
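
In the spirit of "measure it", a minimal end-to-end probe of a representative endpoint, the same kind of check Nagios or Pingdom would run; the URL is a placeholder for one of your own endpoints:

    import time
    import urllib.request

    URL = "https://example.com/"  # placeholder endpoint

    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("%s -> %d, %d bytes in %.0f ms"
          % (URL, resp.status, len(body), elapsed_ms))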

Questions?