Post on 15-Apr-2017
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adam Boeglin, HPC Solutions Architect
July 13, 2016
Deep Dive on Delivering Amazon EC2 Instance Performance
What to Expect from the Session
Understanding the factors that go into choosing an EC2 instance
Defining system performance and how it is characterized for different workloads
How Amazon EC2 instances deliver performance while providing flexibility and agility
How to make the most of your EC2 instance experience through the lens of several instance types
Amazon Elastic Compute Cloud is Big
[Diagram: EC2 spans Instances, API, Networking, and Purchase options]
Amazon EC2 Instances
[Diagram: a host server runs a hypervisor, which runs Guest 1, Guest 2, through Guest n]
In the Past
First launched in August 2006
M1 instance
“One size fits all” M1
Amazon EC2 Instances History
[Timeline, 2006 to 2016] Instance types launched over this period:
m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.4xlarge, m2.2xlarge, cc1.4xlarge, t1.micro, cg1.4xlarge, cc2.8xlarge, m1.medium, hi1.4xlarge, m3.xlarge, m3.2xlarge, hs1.8xlarge, cr1.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge, g2.2xlarge, i2.xlarge, i2.2xlarge, i2.4xlarge, i2.8xlarge, m3.medium, m3.large, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, t2.micro, t2.small, t2.medium, c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge, d2.xlarge, d2.2xlarge, d2.4xlarge, d2.8xlarge, g2.8xlarge, t2.large, m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m4.10xlarge, x1.32xlarge, t2.nano
Anatomy of an instance type name, using c4.large:
Instance family: c
Instance generation: 4
Instance size: large
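The naming scheme can be sketched in a few lines of Python. This is only an illustration of the family/generation/size split for names like those in this deck; the `InstanceType` tuple, the `parse_instance_type` helper, and the regular expression are hypothetical and not part of any AWS tooling.

```python
import re
from typing import NamedTuple


class InstanceType(NamedTuple):
    family: str      # letters before the generation digit, e.g. "c"
    generation: int  # e.g. 4
    size: str        # e.g. "large"


# Illustrative pattern for names such as c4.large or x1.32xlarge.
_NAME = re.compile(r"^([a-z]+)(\d+)\.(\w+)$")


def parse_instance_type(name: str) -> InstanceType:
    """Split an instance type name into family, generation, and size."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, size = match.groups()
    return InstanceType(family, int(generation), size)


if __name__ == "__main__":
    print(parse_instance_type("c4.large"))
```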
Choices and Flexibility
Choice of Processor
Memory
Storage Options
Accelerated Graphics
Burstable Performance
Hiring a Server
Servers are hired to do jobs
Performance is measured differently depending on the job
Performance Factors
Resource | Performance factors | Key indicators
CPU | Sockets, number of cores, clock frequency, bursting capability | CPU utilization, run queue length
Memory | Memory capacity | Free memory, anonymous paging, thread swapping
Network interface | Max. bandwidth, packet rate | Receive throughput, transmit throughput over max. bandwidth
Disks | Input/output operations per second, throughput | Wait queue length, device utilization, device errors
Resource Utilization
For a given performance, how efficiently are resources being used?
Something at 100% utilization can’t accept any more work
Low utilization can indicate more resource is being purchased than needed
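As a rough sketch of that reasoning, utilization can be treated as used capacity over total capacity and bucketed. The `classify_utilization` helper and its 20% and 90% thresholds are assumptions chosen for illustration; they are not figures from the session.

```python
def classify_utilization(used: float, capacity: float,
                         low: float = 0.20, high: float = 0.90) -> str:
    """Bucket a resource by how much of its capacity is in use.

    The low/high thresholds are illustrative assumptions only.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ratio = used / capacity
    if ratio >= high:
        return "saturated"          # little or no room for more work
    if ratio <= low:
        return "over-provisioned"   # paying for more than is used
    return "ok"


if __name__ == "__main__":
    # Example: busy vCPUs out of 36 available.
    for busy in (2, 18, 36):
        print(busy, classify_utilization(busy, 36))
```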
Example: Web Application
MediaWiki installed on Apache with 140 pages of content
Load increased in intervals over time
[Charts: memory stats, disk stats, network stats, CPU stats]
“Launching new instances and running tests in parallel is easy…[when choosing an instance] there is no substitute for measuring the performance of your full application.”
- EC2 documentation
How Not to Choose an EC2 Instance
Brute force testing
Ignoring metrics
Favoring old generation instances
Guessing based on what you already have
EC2 Instance Families
General purpose: M4
Compute-optimized: C3, C4
Storage and IO-optimized: I2, D2
GPU-enabled: G2
Memory-optimized: R3, X1
Instance Selection = Performance Tuning
Give back instances as easily as you can acquire new ones
Find an ideal instance type and workload combination
EC2 Instance Pages provide “Use Case” Guidance
With Amazon Elastic Block Store, storage and instance size don’t need to be coupled
Instance Sizing
c4.8xlarge ≈ 2 x c4.4xlarge ≈ 4 x c4.2xlarge ≈ 8 x c4.xlarge
Choosing the Right Size
Understand your unit of work
Web request
Database/table
Batch process
What are that unit’s requirements?
CPU threads
Memory constraints
Disk and network
What are its availability requirements?
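One way to turn a unit of work into a size comparison is to count how many units fit on each instance size. The sketch below uses the vCPU and memory figures from the C4 table in this deck; the helper names and the example workload (40 units, each needing 2 threads and 3 GiB) are hypothetical, and disk, network, and availability requirements are left out.

```python
import math

# vCPU count and memory (GiB) per size, taken from the C4 table in this deck.
C4_SIZES = {
    "c4.large": (2, 3.75),
    "c4.xlarge": (4, 7.5),
    "c4.2xlarge": (8, 15.0),
    "c4.4xlarge": (16, 30.0),
    "c4.8xlarge": (36, 60.0),
}


def units_per_instance(size: str, threads_per_unit: int,
                       mem_gib_per_unit: float) -> int:
    """How many units of work fit on one instance, limited by CPU or memory."""
    vcpus, mem_gib = C4_SIZES[size]
    return min(vcpus // threads_per_unit, int(mem_gib // mem_gib_per_unit))


def instances_needed(size: str, total_units: int, threads_per_unit: int,
                     mem_gib_per_unit: float) -> int:
    """Instances of the given size required to host all units of work."""
    per_instance = units_per_instance(size, threads_per_unit, mem_gib_per_unit)
    if per_instance == 0:
        raise ValueError(f"one unit of work does not fit on {size}")
    return math.ceil(total_units / per_instance)


if __name__ == "__main__":
    # Hypothetical workload: 40 units, each needing 2 threads and 3 GiB.
    for size in C4_SIZES:
        print(size, instances_needed(size, 40, 2, 3.0))
```

Fewer, larger instances and more, smaller instances can both cover the same workload; the availability question above is what usually decides between them.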
CPU Instructions and Protection Levels
CPU has at least two protection levels.
Privileged instructions can’t be executed in user mode, which protects the system. Applications leverage system calls to the kernel.
[Diagram: Application, Kernel]
x86 CPU Virtualization: Prior to Intel VT-x
Binary translation for privileged instructions
Paravirtualization (PV)
PV requires going through the VMM, adding latency
Applications that are system call bound are most affected
[Diagram (PV): Application, Kernel, VMM]
x86 CPU Virtualization: After Intel VT-x
Hardware-assisted virtualization (HVM)
PV-HVM uses PV drivers opportunistically for operations that are slow when emulated, e.g., network and block I/O
[Diagram (PV-HVM): Application, Kernel, VMM]
Tip: Use HVM AMIs with Amazon EBS
Time Keeping Explained
Time keeping in an instance is deceptively hard
gettimeofday(), clock_gettime(), QueryPerformanceCounter()
The TSC
CPU counter, accessible from userspace
Requires calibration, vDSO
Invariant on Sandy Bridge+ processors
Xen pvclock; does not support vDSO
On current generation instances, use TSC as clocksource
Tip: Use TSC as Clocksource
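On Linux, the active and available clocksources are exposed under sysfs, so the tip can be checked from a short script. This is a minimal sketch assuming a Linux guest with the standard `/sys/devices/system/clocksource/clocksource0` directory; the helper names are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple

# Standard Linux sysfs location for clocksource selection.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base: Path = CLOCKSOURCE_DIR) -> Tuple[str, List[str]]:
    """Return (current clocksource, list of available clocksources)."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def tsc_advice(current: str, available: List[str]) -> str:
    """Describe whether the TSC is in use or could be selected."""
    if current == "tsc":
        return "already using tsc"
    if "tsc" in available:
        return f"tsc is available; current clocksource is {current}"
    return f"tsc is not offered; current clocksource is {current}"


if __name__ == "__main__":
    try:
        print(tsc_advice(*read_clocksources()))
    except OSError:
        print("clocksource information is not available on this system")
```

Switching the clocksource means writing `tsc` to `current_clocksource` as root; the sketch only reads the files.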
Review: C4 Instances
Custom Intel E5-2666 v3 at 2.9 GHz
P-state and C-state controls
Model | vCPU | Memory (GiB) | EBS (Mbps)
c4.large | 2 | 3.75 | 500
c4.xlarge | 4 | 7.5 | 750
c4.2xlarge | 8 | 15 | 1,000
c4.4xlarge | 16 | 30 | 2,000
c4.8xlarge | 36 | 60 | 4,000
Batch and HPC Workloads, Game Servers, Ad Serving, and High Traffic Web Servers
What’s New in C4: P-State and C-State Control
Intel Turbo Boost up to 3.5 GHz
By entering deeper idle states, non-idle cores can achieve up to 300 MHz higher clock frequencies
But deeper idle states require more time to exit, so they may not be appropriate for latency-sensitive workloads
Tip: P-State Control for AVX2
If an application makes heavy use of AVX2 on all cores, the processor may attempt to draw more power than it should
Processor will transparently reduce frequency
Frequent changes of CPU frequency can slow an application
Review: T2 Instances
Lowest-cost EC2 instance at $0.0065 per hour
Burstable performance
Fixed allocation enforced with CPU credits
Model | vCPU | Baseline | CPU Credits/Hour | Memory (GiB) | Storage
t2.nano | 1 | 5% | 3 | 0.5 | EBS Only
t2.micro | 1 | 10% | 6 | 1 | EBS Only
t2.small | 1 | 20% | 12 | 2 | EBS Only
t2.medium | 2 | 40%** | 24 | 4 | EBS Only
t2.large | 2 | 60%** | 36 | 8 | EBS Only
General Purpose, Web Serving, Developer Environments, Small Databases
How Credits Work
A CPU credit provides the performance of a full CPU core for one minute
An instance earns CPU credits at a steady rate
An instance consumes credits when active
Credits expire (leak) after 24 hours
[Diagram labels: baseline rate, credit balance, burst rate]
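The credit rules above can be approximated with a minute-by-minute simulation. This is a simplified sketch, not the actual EC2 accounting: the `simulate_credits` helper is hypothetical, load is expressed as the fraction of one full core requested in each minute, and the 24-hour expiry is modeled as a cap of 24 hours of earnings on the balance.

```python
from typing import Iterable, List


def simulate_credits(credits_per_hour: float, load_per_minute: Iterable[float],
                     start_balance: float = 0.0) -> List[float]:
    """Return the credit balance after each simulated minute.

    One credit is one full core for one minute, so a load of 1.0 for a
    minute spends 1.0 credit. Assumption: expiry after 24 hours is
    approximated by capping the balance at 24 hours of earnings.
    """
    earn_per_minute = credits_per_hour / 60.0
    max_balance = credits_per_hour * 24.0
    balance = start_balance
    history = []
    for load in load_per_minute:
        balance += earn_per_minute
        spend = min(load, balance)  # cannot spend credits that are not there
        balance = min(balance - spend, max_balance)
        history.append(balance)
    return history


if __name__ == "__main__":
    # t2.micro from the table above: 6 credits/hour, bursting at a full core.
    history = simulate_credits(6, [1.0] * 60, start_balance=30.0)
    print("balance after one hour at full load:", round(history[-1], 2))
```

Once the balance reaches zero, the instance can only spend what it earns each minute, which corresponds to running at the baseline rate.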
Tip: Monitor CPU Credit Balance
Tip: How to Interpret Steal Time
Fixed CPU allocations can be offered through CPU caps
Steal time happens when CPU cap is enforced
Leverage Amazon CloudWatch metrics
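Inside a Linux guest, steal time appears as the eighth counter on the aggregate `cpu` line of `/proc/stat`. The sketch below samples that line twice and reports the share of time stolen in between; the helper names and the one-second interval are illustrative choices, and it is a complement to, not a replacement for, the CloudWatch metrics.

```python
import time
from typing import Dict

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_times(stat_text: str) -> Dict[str, int]:
    """Extract the aggregate CPU counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            if len(values) < len(FIELDS):
                raise ValueError("cpu line has no steal counter")
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Percentage of elapsed CPU time reported as steal between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas["steal"] / total if total else 0.0


if __name__ == "__main__":
    try:
        with open("/proc/stat") as stat:
            first = parse_cpu_times(stat.read())
        time.sleep(1)
        with open("/proc/stat") as stat:
            second = parse_cpu_times(stat.read())
        print(f"steal: {steal_percent(first, second):.1f}%")
    except (OSError, ValueError):
        print("/proc/stat is not available in the expected format")
```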
Announced: X1 Instances
Largest memory instance with 2 TB of DRAM
Quad socket, Intel E7 processors with 128 vCPUs
Model | vCPU | Memory (GiB) | Local Storage
x1.32xlarge | 128 | 1952 | 2 x 1920 GB
In-Memory Databases, Big Data Processing, HPC Workloads
NUMA
Non-uniform memory access
Each processor in a multi-CPU system has local memory that is accessible through a fast interconnect
Each processor can also access memory from other CPUs, but local memory access is a lot faster than remote memory access
Performance is related to the number of CPU sockets and how they are connected: Intel QuickPath Interconnect (QPI)
[Diagram: r3.8xlarge has two sockets joined by QPI, each with 16 vCPUs and 122 GB of local memory]
[Diagram: x1.32xlarge has four sockets joined by QPI links, each with 32 vCPUs and 488 GB of local memory]
Tip: Kernel Support for NUMA Balancing
An application will perform best when the threads of its processes are accessing memory on the same NUMA node.
NUMA balancing moves tasks closer to the memory they are accessing.
The Linux kernel does this automatically when automatic NUMA balancing is active (version 3.8+ of the Linux kernel).
Windows support for NUMA first appeared in the Enterprise and Data Center SKUs of Windows Server 2003.
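On a Linux guest, the state of automatic NUMA balancing and the number of NUMA nodes can be read from procfs and sysfs. This is a minimal sketch assuming the usual `/proc/sys/kernel/numa_balancing` and `/sys/devices/system/node` locations; kernels built without NUMA balancing will not expose the first file, and the helper names are hypothetical.

```python
import os
import re
from typing import Iterable

NUMA_BALANCING = "/proc/sys/kernel/numa_balancing"
NODE_DIR = "/sys/devices/system/node"


def numa_balancing_enabled(raw: str) -> bool:
    """Interpret the sysctl value: 0 means off, non-zero means on."""
    return int(raw.strip()) != 0


def count_numa_nodes(entries: Iterable[str]) -> int:
    """Count directory entries named node0, node1, and so on."""
    return sum(1 for name in entries if re.fullmatch(r"node\d+", name))


if __name__ == "__main__":
    try:
        with open(NUMA_BALANCING) as sysctl:
            print("automatic NUMA balancing:", numa_balancing_enabled(sysctl.read()))
        print("NUMA nodes:", count_numa_nodes(os.listdir(NODE_DIR)))
    except OSError:
        print("NUMA information is not exposed on this system")
```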
Review: I2 Instances
16 vCPU: 3.2 TB SSD; 32 vCPU: 6.4 TB SSD
365 K random read IOPS for 32 vCPU instance
Model | vCPU | Memory (GiB) | Storage (GB) | Read IOPS | Write IOPS
i2.xlarge | 4 | 30.5 | 1 x 800 SSD | 35,000 | 35,000
i2.2xlarge | 8 | 61 | 2 x 800 SSD | 75,000 | 75,000
i2.4xlarge | 16 | 122 | 4 x 800 SSD | 175,000 | 155,000
i2.8xlarge | 32 | 244 | 8 x 800 SSD | 365,000 | 315,000
NoSQL Databases, Clustered Databases, Online Transaction Processing (OLTP)
Split-Driver Model
[Diagram: hardware (physical CPU, physical memory, network device); VMM (virtual CPU, virtual memory, CPU scheduling); driver domain (back-end driver, device driver); guest domains (front-end driver, sockets, application); the I/O path is marked in steps 1 to 5]
Granting in Pre-3.8.0 Kernels
Requires “grant mapping” prior to 3.8.0
Grant mappings are expensive operations due to TLB flushes
read(fd, buffer,…)
I/O domain Instance
Granting in 3.8.0+ Kernels, Persistent and Indirect
Grant mappings are set up in a pool once
Data is copied in and out of the grant pool
read(fd, buffer…)
Copy to and from grant pool
2009–Longer Ago Than You Think
Avatar was the top movie in the theaters
Facebook overtook MySpace in active users
President Obama was sworn into office
The 2.6.32 Linux kernel was released
Tip: Use 3.8+ Kernel
Amazon Linux 13.09 or later
Ubuntu 14.04 or later
RHEL/CentOS 7 or later
Etc.
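A quick check of the running kernel against the 3.8 threshold might look like the sketch below. It only compares the upstream version number reported by the kernel; distribution kernels can backport features, so treat the result as a hint. The helper names are hypothetical.

```python
import platform
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '3.13.0-24-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_3_8_or_later(release: str) -> bool:
    """True when the kernel release is at least 3.8."""
    return kernel_version(release) >= (3, 8)


if __name__ == "__main__":
    release = platform.release()
    try:
        print(release, "3.8+" if is_3_8_or_later(release) else "older than 3.8")
    except ValueError:
        print(release, "is not a Linux-style kernel release")
```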
Device Passthrough: Enhanced Networking
SR-IOV eliminates need for driver domain
Physical network device exposes virtual function to instance
Requires a specialized driver, which means:
Your instance OS needs to know about it
EC2 needs to be told your instance can use it
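From inside a Linux instance, the driver bound to a network interface shows which path is in use. The sketch below resolves the driver symlink in sysfs and checks it against `ixgbevf` (the Intel virtual function driver) and `ena` (the Elastic Network Adapter driver covered later in this deck). The interface name `eth0` and the helper names are assumptions, and this covers only the OS side, not the instance attribute that EC2 needs.

```python
import os
from typing import Optional

# Kernel drivers associated with enhanced networking on EC2.
ENHANCED_DRIVERS = {"ixgbevf", "ena"}


def interface_driver(iface: str = "eth0") -> Optional[str]:
    """Return the kernel driver bound to a network interface, if visible."""
    link = f"/sys/class/net/{iface}/device/driver"
    if not os.path.exists(link):
        return None
    # The symlink points at the driver's directory; its name is the driver.
    return os.path.basename(os.path.realpath(link))


def uses_enhanced_networking(driver: Optional[str]) -> bool:
    """True when the driver is one of the SR-IOV or ENA drivers."""
    return driver in ENHANCED_DRIVERS


if __name__ == "__main__":
    driver = interface_driver("eth0")
    print("driver:", driver, "enhanced networking:", uses_enhanced_networking(driver))
```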
After Enhanced Networking
[Diagram: hardware (physical CPU, physical memory, SR-IOV network device); VMM (virtual CPU, virtual memory, CPU scheduling); driver domain; guest domains (NIC driver, sockets, application); the I/O path is marked in steps 1 to 3]
Elastic Network Adapter
Next Generation of Enhanced Networking
Hardware checksums
Multi-Queue support
Receive-side steering
20 Gbps in a placement group
New Open Source Amazon Network Driver
EBS Performance
Instance size matters
Match your volume size and type to your instance
Use EBS optimization if EBS performance is important
Summary: Getting the Most Out of EC2 Instances
Choose HVM AMIs
Time keeping: use TSC
C state and P state controls
Monitor T2 CPU credits
Use a modern Linux kernel
NUMA balancing
Persistent grants for I/O performance
Enhanced networking
Virtualization Themes
Bare metal performance is the goal, and in many scenarios it is already there
History of eliminating hypervisor intermediation and driver domains
Hardware-assisted virtualization
Scheduling and granting efficiencies
Device passthrough
Next Steps
Visit the Amazon EC2 documentation
Launch an instance and try your app!
Remember to complete your evaluations!
Thank you!