© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 13, 2014 | Las Vegas
Elastic Load Balancing
Deep Dive & Best Practices
David Brown, Director, Software Engineering
Elastic Load Balancing automatically distributes
incoming application traffic across multiple
Amazon EC2 instances.
Secure, Elastic, Integrated, Cost Effective
Load balancer used to route incoming requests to multiple EC2 instances.
[Diagram: an ELB distributing requests across EC2 instances]
EC2-Classic
Load balance over classic EC2 instances.
Support for public IP addresses only.
No control over the load balancer security group.
EC2-VPC
Load balance over EC2 instances within a VPC.
Support for both public and private IP addresses.
Full control over the load balancer security group.
Tightly integrated into the associated VPC and subnets.
Architecture
[Diagram: Amazon Route 53 in front of ELB nodes in the ELB VPC, one each in us-west-1a and us-west-1b, routing to EC2 instances in the customer VPC]
TCP/SSL
Incoming client connection bound to server connection
No header modification
Proxy Protocol prepends source and destination IP and ports to request
Round robin algorithm used for request routing
HTTP/HTTPS
Connection terminated at the load balancer and pooled to the server
Headers may be modified
X-Forwarded-For header contains client IP address
Least outstanding requests algorithm used for request routing
Sticky session support available
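The two ways a back-end instance can recover the client address (Proxy Protocol on TCP/SSL listeners, X-Forwarded-For on HTTP/HTTPS listeners) can be sketched as follows. This is a minimal illustration that assumes the human-readable version 1 Proxy Protocol header with TCP4/TCP6 only; the function and class names are hypothetical, and the addresses are documentation examples.

```python
from typing import NamedTuple


class ProxyInfo(NamedTuple):
    """Addresses carried by a Proxy Protocol version 1 header line."""
    protocol: str
    source_ip: str
    dest_ip: str
    source_port: int
    dest_port: int


def parse_proxy_v1(line: bytes) -> ProxyInfo:
    # Expected shape: PROXY TCP4 <src ip> <dst ip> <src port> <dst port> CRLF
    # (the "PROXY UNKNOWN" variant is not handled in this sketch)
    text = line.decode("ascii").rstrip("\r\n")
    parts = text.split(" ")
    if len(parts) != 6 or parts[0] != "PROXY":
        raise ValueError(f"not a Proxy Protocol v1 header: {text!r}")
    _, protocol, source_ip, dest_ip, source_port, dest_port = parts
    return ProxyInfo(protocol, source_ip, dest_ip, int(source_port), int(dest_port))


def client_ip_from_xff(header_value: str) -> str:
    # The left-most X-Forwarded-For entry is the client as reported by the
    # first proxy; later entries are intermediate proxies.
    return header_value.split(",")[0].strip()
```

The left-most X-Forwarded-For entry can be set by the client itself, so treat it as informational unless every hop in front of the instance is trusted.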
Health checks ensure that request traffic is shifted away from a failed instance.
[Diagram: ELB in front of three EC2 instances]
Health Checks
Support for TCP and HTTP health checks.
Customize the frequency and failure
thresholds.
Must return a 2xx response.
Consider the depth and accuracy of your
health checks.
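How failure thresholds turn individual check results into a healthy or unhealthy state can be sketched as below. This is an illustrative model, not the load balancer's implementation; the class name and the default threshold values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one instance's state from a stream of health check results."""
    unhealthy_threshold: int = 2   # assumed: consecutive failures before marking unhealthy
    healthy_threshold: int = 10    # assumed: consecutive successes before marking healthy
    healthy: bool = True
    _streak: int = 0               # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one result and return the (possibly updated) health state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state; reset the counter
            return self.healthy
        self._streak += 1
        threshold = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= threshold:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy
```

Lower thresholds and shorter intervals detect failures faster but raise the chance that one slow response takes a working instance out of service.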
Health Checks
Idle timeouts allow for connections to be closed by
the load balancer when no longer in use.
Length of time that an idle connection should be kept open.
For both client and back-end connections.
Defaults to 60 seconds but can be set between 1 and 3,600
seconds.
Timeouts should decrease as you go
up the stack.
Idle Timeouts
[Diagram: idle timeouts at each tier of a stack spanning the ELB, EC2 instances, and Amazon S3, Amazon RDS, and Amazon SWF, labeled with values of 15s, 9s, and 3s]
Idle Timeouts
Using multiple Availability Zones
Multiple Availability Zones
[Diagrams: Amazon Route 53 in front of ELB nodes in us-west-1a and us-west-1b (ELB VPC), routing to EC2 instances in the customer VPC]
Always associate two or more subnets in
different zones with the load balancer
Using multiple Availability Zones does
bring a few challenges.
Traffic Imbalances
[Chart: request count over time]
Imbalanced Instance Capacity
[Diagram: Amazon Route 53 and ELB nodes in us-west-1a and us-west-1b (ELB VPC) in front of EC2 instances in the customer VPC]
Cross-Zone Load Balancing
[Diagram: Amazon Route 53 and ELB nodes in us-west-1a and us-west-1b (ELB VPC) in front of EC2 instances in the customer VPC]
Traffic Imbalances, Cross-Zone Enabled
[Chart: request count over time]
Load balancer absorbs impact of DNS caching.
Eliminates imbalances in back-end instance utilization.
Requests distributed evenly across multiple
Availability Zones.
Check connection limits before enabling.
No additional bandwidth charge for
cross-zone traffic.
Cross-Zone Load Balancing
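The effect of cross-zone load balancing on instance utilization can be worked through with a small calculation. The sketch below assumes DNS splits traffic evenly between zones and that requests are spread evenly over the instances a node can reach; the function name and the instance counts are illustrative.

```python
def per_instance_load(total_requests: int,
                      instances_per_zone: dict[str, int],
                      cross_zone: bool) -> dict[str, float]:
    """Requests received by each instance in a zone, keyed by zone name."""
    zones = [zone for zone, count in instances_per_zone.items() if count > 0]
    if cross_zone:
        # Every load balancer node spreads requests over all instances in all zones.
        share = total_requests / sum(instances_per_zone[zone] for zone in zones)
        return {zone: share for zone in zones}
    # Without cross-zone, each zone's node routes only to instances in its own zone.
    per_zone = total_requests / len(zones)
    return {zone: per_zone / instances_per_zone[zone] for zone in zones}
```

With 1,200 requests, one instance in us-west-1a, and three in us-west-1b, the lone instance handles 600 requests without cross-zone (against 200 for each of the others) and 300 with it enabled.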
Each load balancer domain may contain multiple records.
Round robin used to balance traffic between Availability Zones.
DNS records will change over time; never
target IP addresses directly.
After being removed from DNS, IP addresses
are drained and quarantined for up to 7 days.
Understanding DNS
DNS caching by clients and ISPs can often cause clients to target
a specific IP address or stop resolving at all.
Register a wildcard CNAME or ALIAS within Amazon Route 53.
// Create a wildcard CNAME or ALIAS in Route 53.
*.example.com ALIAS … elb-12345.us-east-1.elb.amazonaws.com
*.example.com CNAME elb-12345.us-east-1.elb.amazonaws.com
// Prepend random content for each lookup made by the application.
PROMPT> dig +short 25a8ade5-6557-4a54-a60e-8f51f3b195d1.example.com
192.0.2.1
192.0.2.2
DNS Optimization
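On the application side, the random-prefix lookup might look like the sketch below. It assumes a wildcard record already exists for the domain; both function names are hypothetical, and `example.com` stands in for your own zone.

```python
import socket
import uuid


def cache_busting_name(wildcard_domain: str) -> str:
    """Return a unique hostname under a wildcard record, e.g. '<uuid>.example.com'."""
    return f"{uuid.uuid4()}.{wildcard_domain}"


def resolve_fresh(wildcard_domain: str, port: int = 443) -> list[str]:
    """Resolve a never-before-seen name so caches cannot answer from memory."""
    name = cache_busting_name(wildcard_domain)
    # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples;
    # sockaddr[0] is the IP address.
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```

Each call produces a new name, so every lookup reaches the authoritative answer. That trades extra DNS queries for fresher records, so it is usually better done per connection pool refresh than per request.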
SSL Offloading
Support for both SSL and HTTPS is provided.
Support for latest ciphers and protocols including
Elliptic Curve ciphers and Perfect Forward Secrecy.
Ability to fully customize ciphers and protocols to be
used by each load balancer.
SSL Negotiation Suites provided to remove complexity
of selecting ciphers and protocols.
SSL Negotiation Policies
Provide a selection of ciphers and protocols that adhere to the latest
industry best practices.
Balance security best practices against clients' ability to negotiate a
connection; policies are generated using traffic to Amazon.com.
Released on a regular cadence or when new
vulnerabilities are published.
Default for all new load balancers.
POODLE Mitigation
Within 24 hours, 62% of load
balancers migrated to the latest SSL
Negotiation Policy, disabling SSLv3.
“@awscloud Thank-you #AWS for making it
so easy to prevent #sslv3 #poodleattack Only
took about 3 clicks of my mouse.” @granticini
13 CloudWatch metrics provided for each load
balancer.
Provide detailed insight into the health of the load
balancer and application stack.
CloudWatch alarms can be configured to notify or
take action should any metric go outside of the
acceptable range.
All metrics provided at 1-minute granularity.
Amazon CloudWatch Metrics
HealthyHostCount
The count of the number of healthy instances
in each Availability Zone.
The most common cause of unhealthy hosts is
health checks exceeding the allocated timeout.
Test by making repeated requests to the back-end
instance from another EC2 instance.
View at the zonal dimension.
Latency
Measures the time elapsed in seconds after the request leaves the load
balancer until the response is received.
Test by sending requests to the back-end instance from another instance.
The min, average, and max CloudWatch statistics
provide upper and lower bounds for latency.
Debug individual requests using Access Logs.
SurgeQueue and Spillovers
Count of the number of requests that could not be sent to back-end
instances.
Queue up to 1024 requests per load balancer
node, after which 503 errors will be returned.
Often caused by not being able to open
connections to the back-end instance.
Normally a sign of an under-scaled application.
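The relationship between the surge queue and spillovers can be modeled as a bounded queue that rejects requests once it is full. This is a simplified sketch, not the load balancer's implementation; the class and method names are hypothetical, and only the 1024 limit comes from the slide.

```python
from collections import deque


class SurgeQueue:
    """Bounded queue for requests waiting on a back-end connection."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.pending = deque()
        self.spillover_count = 0  # requests rejected because the queue was full

    def offer(self, request: str) -> bool:
        """Queue a request; return False when it spills over (answered with a 503)."""
        if len(self.pending) >= self.capacity:
            self.spillover_count += 1
            return False
        self.pending.append(request)
        return True
```

A non-zero spillover count means clients are already seeing errors, so alarm on queue length before it reaches the limit rather than on spillovers alone.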
CloudWatch and AutoScaling
All load balancer metrics can be used for AutoScaling.
Allow you to scale dynamically based on the load
balancer's view of the application.
Important to consider all metrics when using
AutoScaling; a policy based on one metric may not be
aware of resource contention on another.
You may be at peak multiple times a day.
Provide detailed information on each
request processed by the load balancer.
Includes request time, client IP address,
latencies, request path, and server
responses.
Delivered to an Amazon S3 bucket every
5 or 60 minutes.
Access Logs
[Diagram: ELB nodes in the ELB VPC delivering logs to Amazon S3]
Logs are indexed by date but include the IP address of the load balancer node itself.
• timestamp
• elb name
• client:port
• backend:port
• request_processing_time
• backend_processing_time
• response_processing_time
• elb_status_code
• backend_status_code
• received_bytes
• sent_bytes
• "request"
2014-02-15T23:39:43.945958Z my-test-loadbalancer
192.168.131.39:2817 10.0.0.1:80 0.000073 0.001048 0.000057
200 200 0 29 "GET http://www.example.com:80/ HTTP/1.1"
Access Logs
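An access log entry can be split into named fields with a few lines of code. The sketch below is one possible approach: the keys follow the field list above (with the back-end code named `backend_status_code`), the function name is hypothetical, and the sample entry is modeled on the one above with an assumed back-end address of 10.0.0.1:80.

```python
import shlex

FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time", "response_processing_time",
    "elb_status_code", "backend_status_code", "received_bytes", "sent_bytes", "request",
]


def parse_access_log_line(line: str) -> dict[str, str]:
    """Split one access log entry into the twelve fields named in FIELDS."""
    values = shlex.split(line)  # keeps the quoted request as a single token
    if len(values) < len(FIELDS):
        raise ValueError(f"expected at least {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


SAMPLE = (
    '2014-02-15T23:39:43.945958Z my-test-loadbalancer 192.168.131.39:2817 10.0.0.1:80 '
    '0.000073 0.001048 0.000057 200 200 0 29 "GET http://www.example.com:80/ HTTP/1.1"'
)
entry = parse_access_log_line(SAMPLE)
# Total time in seconds across the three processing-time fields.
total_seconds = sum(float(entry[name]) for name in FIELDS[4:7])
```

Summing the three processing times gives the end-to-end time for one request as seen by the load balancer, which helps when debugging a single slow request.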
“Everything fails all the time”
Werner Vogels, CTO, Amazon.com
Be prepared to do nothing!
Mitigation, Isolation, Restore Redundancy
Mitigation
All load balancers are scaled to handle the loss
of a single Availability Zone.
Amazon Route 53 health checks shift
traffic away from the failed Availability
Zone.
Completed within 150 seconds.
No other external or control plane
dependencies.
Isolation
Other zones must remain unaffected.
Avoid dependencies between zones.
Be careful of work generated as a result
of the event.
Operating at reduced capacity but stable.
Constant Work
Health checkers and edge locations perform the same volume of activity
whether endpoints are healthy or unhealthy.
[Chart: system activity over time, with time to react marked]
Work on Failure
When nothing is failing, volume of API calls is zero. When failure occurs,
volume of API calls spikes.
[Chart: system activity over time, with time to react marked]
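The difference between the two designs can be made concrete by counting the work done in one cycle. This is a toy model under stated assumptions: one probe per endpoint per cycle for the constant-work design, and a fixed number of API calls per failed endpoint for the reactive design. Both function names are hypothetical.

```python
def constant_work(endpoints_healthy: list[bool]) -> int:
    """Probes issued per cycle: every endpoint is checked regardless of its state."""
    return len(endpoints_healthy)


def work_on_failure(endpoints_healthy: list[bool], calls_per_failure: int = 1) -> int:
    """API calls issued per cycle: only failed endpoints generate work."""
    failed = sum(1 for healthy in endpoints_healthy if not healthy)
    return calls_per_failure * failed
```

For 100 endpoints, the constant-work design does 100 units of work whether none or half have failed. The reactive design goes from 0 to 50 or more at the moment the system is already under stress.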
Restore Redundancy
Restoring the system back to full capacity.
Avoid putting additional load on the system
by rushing this step.
Ensure that recovered resources are left in
a consistent state.
Fully recovered when done.
Please give us your feedback on this
presentation
Join the conversation on Twitter with
#reinvent
SDD423