(SDD423) Elastic Load Balancing Deep Dive and Best Practices | AWS re:Invent 2014

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

November 13, 2014 | Las Vegas

Elastic Load Balancing

Deep Dive & Best Practices

David Brown, Director, Software Engineering

Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances.

Secure, Elastic, Integrated, Cost Effective

Load balancer used to route incoming requests to multiple EC2 instances.

[Diagram: ELB in front of multiple EC2 instances]

EC2-Classic

Load balance over classic EC2 instances.

Support for public IP addresses only.

No control over the load balancer security group.

EC2-VPC

Load balance over EC2 instances within a VPC.

Support for both public and private IP addresses.

Full control over the load balancer security group.

Tightly integrated into the associated VPC and subnets.

Architecture

[Diagram: Amazon Route 53, ELB nodes in the ELB VPC, and EC2 instances in the customer VPC, across us-west-1a and us-west-1b]

TCP/SSL

Incoming client connection bound to server connection

No header modification

Proxy Protocol prepends source and destination IP and ports to request

Round robin algorithm used for request routing

HTTP/HTTPS

Connection terminated at the load balancer and pooled to the server

Headers may be modified

X-Forwarded-For header contains client IP address

Least outstanding requests algorithm used for request routing

Sticky session support available
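The two ways of recovering the client address can be sketched on the back-end side as follows. This is a minimal sketch, not the load balancer's own code: it assumes the human-readable Proxy Protocol version 1 header line for TCP/SSL listeners and a comma-separated X-Forwarded-For value for HTTP/HTTPS listeners. The function names and the sample addresses are hypothetical.

```python
from typing import Dict, List


def parse_proxy_protocol_v1(header: bytes) -> Dict[str, object]:
    # Expected shape (assumption: version 1, TCP over IPv4 or IPv6):
    #   PROXY <family> <source ip> <destination ip> <source port> <destination port> CRLF
    text = header.decode("ascii").rstrip("\r\n")
    parts = text.split(" ")
    if len(parts) != 6 or parts[0] != "PROXY":
        raise ValueError("not a Proxy Protocol v1 TCP header: %r" % text)
    _, family, src_ip, dst_ip, src_port, dst_port = parts
    return {
        "family": family,
        "source": (src_ip, int(src_port)),
        "destination": (dst_ip, int(dst_port)),
    }


def forwarded_for_chain(header_value: str) -> List[str]:
    # X-Forwarded-For is a comma-separated list; each proxy appends the
    # address of the peer it received the request from.
    return [item.strip() for item in header_value.split(",") if item.strip()]
```

Because each proxy appends to X-Forwarded-For, only the entry added by a proxy you trust should be treated as authoritative; earlier entries are supplied by the client and can be forged.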

Health checks ensure that request traffic is shifted away from a failed instance.

[Diagram: ELB routing to three EC2 instances]

Health Checks

Support for TCP and HTTP health checks.

Customize the frequency and failure thresholds.

HTTP health checks must return a 2xx response.

Consider the depth and accuracy of your health checks.
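The trade-off between depth and accuracy can be illustrated with a small health endpoint. This is a sketch under assumptions: the `/health` path, the `CHECKS` table, and the handler name are hypothetical, and the choice of 503 for the unhealthy case is one option among several non-2xx codes.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    # Run every named dependency check; a check that raises counts as failed.
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(sorted(failed))
    return 200, "ok"


# Hypothetical checks. A deeper check (database, cache) is more accurate but
# can take every instance out of service if a shared dependency fails.
CHECKS: Dict[str, Callable[[], bool]] = {"app": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = evaluate_health(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)
```

To try it locally, the handler can be served with `HTTPServer(("", 8080), HealthHandler).serve_forever()` from `http.server`.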


Idle timeouts allow for connections to be closed by the load balancer when no longer in use.

Length of time that an idle connection should be kept open.

For both client and back-end connections.

Defaults to 60 seconds but can be set between 1 and 3,600 seconds.

Timeouts should decrease as you go up the stack.
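The allowed range can be checked before a new value is sent to the service. The sketch below only builds and validates the attribute structure; the helper name is hypothetical, and it assumes the attributes are later passed to the `modify_load_balancer_attributes` call of the boto3 `elb` client, which is not invoked here.

```python
from typing import Dict

MIN_IDLE_TIMEOUT = 1      # seconds, lower bound stated above
MAX_IDLE_TIMEOUT = 3600   # seconds, upper bound stated above


def idle_timeout_attributes(seconds: int) -> Dict[str, Dict[str, int]]:
    # Reject values outside the documented 1 to 3,600 second range.
    if not MIN_IDLE_TIMEOUT <= seconds <= MAX_IDLE_TIMEOUT:
        raise ValueError(
            "idle timeout must be between %d and %d seconds, got %d"
            % (MIN_IDLE_TIMEOUT, MAX_IDLE_TIMEOUT, seconds)
        )
    return {"ConnectionSettings": {"IdleTimeout": seconds}}
```

With boto3 installed and credentials configured, the result could be applied with `boto3.client("elb").modify_load_balancer_attributes(LoadBalancerName="my-test-loadbalancer", LoadBalancerAttributes=idle_timeout_attributes(60))`.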

Idle Timeouts

[Diagram: idle timeouts between the ELB, EC2 instances, and Amazon S3, Amazon RDS, and Amazon SWF; values shown are 15s, 3s, and 9s]

Using multiple Availability Zones

Multiple Availability Zones

[Diagram: Amazon Route 53, ELB nodes in the ELB VPC, and EC2 instances in the customer VPC, across us-west-1a and us-west-1b]

Always associate two or more subnets in different zones with the load balancer.

Using multiple Availability Zones does bring a few challenges.

Traffic Imbalances

[Chart: request count over time]

Imbalanced Instance Capacity

[Diagram: Amazon Route 53, ELB nodes in the ELB VPC, and EC2 instances in the customer VPC, across us-west-1a and us-west-1b]

Cross-Zone Load Balancing

[Diagram: Amazon Route 53, ELB nodes in the ELB VPC, and EC2 instances in the customer VPC, across us-west-1a and us-west-1b]

Traffic Imbalances, Cross-Zone Enabled

[Chart: request count over time]

Load balancer absorbs impact of DNS caching.

Eliminates imbalances in back-end instance utilization.

Requests distributed evenly across multiple Availability Zones.

Check connection limits before enabling.

No additional bandwidth charge for cross-zone traffic.

Cross-Zone Load Balancing
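The effect of cross-zone load balancing on imbalanced instance capacity can be worked through with a toy calculation. It assumes DNS splits traffic evenly between zones and that each load balancer node spreads requests evenly over the instances it can reach; the function name and the instance counts are hypothetical.

```python
from typing import Dict


def per_instance_share(instances_per_zone: Dict[str, int], cross_zone: bool) -> Dict[str, float]:
    # Returns, per zone, the fraction of total traffic received by EACH instance there.
    zones = [zone for zone, count in instances_per_zone.items() if count > 0]
    if not zones:
        raise ValueError("at least one zone needs an instance")
    if cross_zone:
        # Every node can reach every instance, so all instances share equally.
        total = sum(instances_per_zone[zone] for zone in zones)
        return {zone: 1.0 / total for zone in zones}
    # Without cross-zone, each zone gets an equal slice that is split
    # only among that zone's own instances.
    zone_share = 1.0 / len(zones)
    return {zone: zone_share / instances_per_zone[zone] for zone in zones}
```

With one instance in us-west-1a and three in us-west-1b, the single instance carries half of all traffic when cross-zone is disabled and a quarter when it is enabled.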

Each load balancer domain may contain multiple records.

Round robin used to balance traffic between Availability Zones.

DNS records will change over time; never target IP addresses directly.

After being removed from DNS, IP addresses are drained and quarantined for up to 7 days.

Understanding DNS

DNS caching by clients and ISPs can often cause clients to target a specific IP address or stop resolving at all.

Register a wildcard CNAME or ALIAS within Amazon Route 53.

// Create a wildcard CNAME or ALIAS in Route 53.

*.example.com ALIAS … elb-12345.us-east-1.elb.amazonaws.com

*.example.com CNAME elb-12345.us-east-1.elb.amazonaws.com

// Prepend random content for each lookup made by the application.

PROMPT> dig +short 25a8ade5-6557-4a54-a60e-8f51f3b195d1.example.com

192.0.2.1

192.0.2.2

DNS Optimization
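On the application side, the random prefix could be generated per lookup roughly as follows. This is a sketch assuming a wildcard record already exists for the domain; the function name is hypothetical and `example.com` stands in for the real domain.

```python
import uuid


def unique_lookup_name(domain: str) -> str:
    # A fresh UUID label makes each name a cache miss for intermediate
    # resolvers, so the wildcard record is resolved again.
    return "%s.%s" % (uuid.uuid4(), domain.lstrip("."))
```

The returned name can then be resolved with the standard library, for example `socket.getaddrinfo(unique_lookup_name("example.com"), 80)`.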


SSL Offloading

Support for both SSL and HTTPS is provided.

Support for latest ciphers and protocols including Elliptic Curve ciphers and Perfect Forward Secrecy.

Ability to fully customize ciphers and protocols to be used by each load balancer.

SSL Negotiation Suites provided to remove complexity of selecting ciphers and protocols.

SSL Negotiation Policies

Provide a selection of ciphers and protocols that adhere to the latest industry best practices.

Balance security best practices with clients’ ability to negotiate a connection; generated using traffic to Amazon.com.

Released on a regular cadence or when new vulnerabilities are published.

Default for all new load balancers.

POODLE Mitigation

Within 24 hours, 62% of load balancers migrated to the latest SSL Negotiation Policy, disabling SSLv3.

“@awscloud Thank-you #AWS for making it so easy to prevent #sslv3 #poodleattack Only took about 3 clicks of my mouse.” @granticini

13 CloudWatch metrics provided for each load balancer.

Provide detailed insight into the health of the load balancer and application stack.

CloudWatch alarms can be configured to notify or take action should any metric go outside of the acceptable range.

All metrics provided at 1-minute granularity.

Amazon CloudWatch Metrics

HealthyHostCount

The number of healthy instances in each Availability Zone.

The most common cause of unhealthy hosts is health checks exceeding the allocated timeout.

Test by making repeated requests to the back-end instance from another EC2 instance.

View at the zonal dimension.
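Viewing the metric at the zonal dimension can be sketched as the parameters for a per-zone alarm. The helper name, alarm name format, statistic, and three-period evaluation window are assumptions chosen for illustration; the sketch only builds the parameters and assumes they are later passed to the `put_metric_alarm` call of the boto3 `cloudwatch` client.

```python
from typing import Any, Dict


def healthy_host_alarm(load_balancer: str, zone: str, minimum_healthy: int) -> Dict[str, Any]:
    # One alarm per Availability Zone, so a zone that loses its hosts is not
    # hidden by healthy hosts elsewhere.
    return {
        "AlarmName": "%s-%s-healthy-hosts" % (load_balancer, zone),
        "Namespace": "AWS/ELB",
        "MetricName": "HealthyHostCount",
        "Dimensions": [
            {"Name": "LoadBalancerName", "Value": load_balancer},
            {"Name": "AvailabilityZone", "Value": zone},
        ],
        "Statistic": "Minimum",
        "Period": 60,               # metrics are provided at 1-minute granularity
        "EvaluationPeriods": 3,     # assumption: alarm after three consecutive minutes
        "Threshold": float(minimum_healthy),
        "ComparisonOperator": "LessThanThreshold",
    }
```

With boto3 installed and credentials configured, the alarm could be created with `boto3.client("cloudwatch").put_metric_alarm(**healthy_host_alarm("my-test-loadbalancer", "us-west-1a", 2))`.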

Latency

Measures the time elapsed in seconds after the request leaves the load balancer until the response is received.

Test by sending requests to the back-end instance from another instance.

The min, average, and max CloudWatch statistics provide upper and lower bounds for latency.

Debug individual requests using Access Logs.

SurgeQueue and Spillovers

Count of the requests that could not be sent to back-end instances.

Up to 1024 requests are queued per load balancer node, after which 503 errors will be returned.

Often caused by not being able to open connections to the back-end instance.

Normally a sign of an under-scaled application.
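The relationship between the surge queue and spillovers can be illustrated with a toy model of a single load balancer node. It is a sketch of the behavior described above, not the service's implementation; the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, List


class SurgeQueueModel:
    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self.queue: Deque[int] = deque()
        self.spillover_count = 0

    def offer(self, request_id: int) -> bool:
        # Returns True if the request is queued, False if it spills over
        # (the client would receive a 503).
        if len(self.queue) >= self.capacity:
            self.spillover_count += 1
            return False
        self.queue.append(request_id)
        return True

    def drain(self, available_connections: int) -> List[int]:
        # Hand queued requests to the back end, oldest first, as connections free up.
        sent = []
        while self.queue and len(sent) < available_connections:
            sent.append(self.queue.popleft())
        return sent
```

In this model the queue length corresponds to the surge queue metric and `spillover_count` to spillovers; a queue that stays non-empty suggests the back end cannot accept connections fast enough.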

CloudWatch and AutoScaling

All load balancer metrics can be used for AutoScaling.

Allow you to scale dynamically based on the load balancer’s view of the application.

It is important to consider all metrics when using AutoScaling; scaling on one metric may leave you unaware of resource contention on another.

You may be at peak multiple times a day.

Provide detailed information on each request processed by the load balancer.

Includes request time, client IP address, latencies, request path, and server responses.

Delivered to an Amazon S3 bucket every 5 or 60 minutes.

Access Logs

[Diagram: ELB nodes in the ELB VPC delivering logs to Amazon S3]

Logs are indexed by date but include the IP address of the load balancer node itself.

• timestamp

• elb name

• client:port

• backend:port

• request_processing_time

• backend_processing_time

• response_processing_time

• elb_status_code

• backend_status_code

• received_bytes

• sent_bytes

• "request"

2014-02-15T23:39:43.945958Z my-test-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 "GET http://www.example.com:80/ HTTP/1.1"

Access Logs
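Debugging individual requests from these logs usually starts with splitting each entry into the fields listed above. The sketch below assumes entries from an HTTP listener where every numeric field is populated; the function name is hypothetical, and the back-end address in the sample line (`10.0.0.1:80`) is an assumed well-formed value.

```python
import shlex
from typing import Dict

FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time", "response_processing_time",
    "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request",
]

# Sample entry; the backend address is an assumption for illustration.
SAMPLE_LINE = (
    '2014-02-15T23:39:43.945958Z my-test-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1"'
)


def parse_access_log_line(line: str) -> Dict[str, object]:
    # shlex keeps the double-quoted request as a single token.
    tokens = shlex.split(line)
    if len(tokens) < len(FIELDS):
        raise ValueError("expected at least %d fields, got %d" % (len(FIELDS), len(tokens)))
    entry: Dict[str, object] = dict(zip(FIELDS, tokens))
    for name in ("request_processing_time", "backend_processing_time", "response_processing_time"):
        entry[name] = float(entry[name])
    for name in ("elb_status_code", "backend_status_code", "received_bytes", "sent_bytes"):
        entry[name] = int(entry[name])
    return entry
```

Summing the three processing times for an entry gives the total time the request spent in and behind the load balancer, which helps separate back-end latency from time spent at the load balancer.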


“Everything fails all the time” Werner Vogels, CTO, Amazon.com

Be prepared to do nothing!

Mitigation, Isolation, Restore Redundancy

Mitigation

All load balancers scaled to handle loss of single Availability Zone.

Amazon Route 53 health checks shift traffic away from the failed Availability Zone.

Completed within 150 seconds.

No other external or control plane dependencies.

Isolation

Other zones must remain unaffected.

Avoid dependencies between zones.

Be careful of work generated as a result of the event.

Operating at reduced capacity but stable.

Constant Work

Health checkers and edge locations perform the same volume of activity whether endpoints are healthy or unhealthy.

[Chart: system activity over time, with time to react marked]

Work on Failure

When nothing is failing, volume of API calls is zero. When failure occurs, volume of API calls spikes.

[Chart: system activity over time, with time to react marked]
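The difference between the two patterns can be made concrete by counting operations per time step. This is a toy model under stated assumptions: each inner list holds the health of every endpoint at one tick, the constant-work system reports every endpoint each tick, and the work-on-failure system issues one call per unhealthy endpoint. The function names are hypothetical.

```python
from typing import List


def constant_work_ops(health_by_tick: List[List[bool]]) -> List[int]:
    # One operation per endpoint every tick, regardless of health.
    return [len(states) for states in health_by_tick]


def work_on_failure_ops(health_by_tick: List[List[bool]]) -> List[int]:
    # One operation per unhealthy endpoint; nothing to do while all is well.
    return [sum(1 for healthy in states if not healthy) for states in health_by_tick]
```

With four endpoints that all fail on the third tick, the constant-work count stays at four per tick while the work-on-failure count jumps from zero to four, which is the spike that arrives exactly when the system is least able to absorb it.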

Restore Redundancy

Restoring the system back to full capacity.

Avoid putting additional load on the system by rushing this step.

Ensure that recovered resources are left in a consistent state.

Fully recovered when done.

Please give us your feedback on this

presentation
