2011 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
SMB 3.0 Performance
Dan Lovinger
Principal Architect
Microsoft
Overview
Stats & Methods
Scenario: OLTP Database
Scenario: Cluster Motion
SMB 3.0 Multi Channel
Agenda: challenges encountered while developing SMB 3.0 for Windows Server 2012 (and Windows 8)
Summary Statistics
Then there is the man who drowned crossing a
stream with an average depth of six inches.
W.I.E. Gates (?)
Starting Point – Metric v. Time
[Figure: Transaction Rate. x-axis: Time (s); y-axis: Transactions/s]
• All sorts of interesting things afoot!
Average
• Classic problem – completely different behaviors have the same average
• understates steady state
• hides very interesting behavior
• Performance is often in the variation
• How often … how consistently … where are the outliers to analyze …
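A minimal sketch of the point above, using made-up numbers rather than the measured run: two traces with the same average and very different behavior. The names `steady` and `bursty` are illustrative only.

```python
from statistics import mean

# Two synthetic transaction-rate traces (transactions/s); values are invented.
steady = [4000] * 10                 # flat steady state
bursty = [6000] * 5 + [2000] * 5     # alternating fast and slow phases

# Both traces report the same average...
print(mean(steady), mean(bursty))

# ...but only the spread reveals how differently they behave.
print(min(steady), max(steady))
print(min(bursty), max(bursty))
```

The average alone cannot distinguish the two; the minimum and maximum (or a full distribution) can.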
[Figure: Transaction Rate. x-axis: Time (s); y-axis: Transactions/s; marked: Average 4103]
Distribution
• Histogram / Distribution
• How often was the system running this fast?
• Note that data must be resampled into buckets
• In this example
• buckets are 20 transactions/s wide
• average located in the [4100 – 4120) transactions/s bucket, with 21 samples
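The resampling step can be sketched as follows, assuming half-open buckets [lo, lo + 20) as in the example. The function name `bucketize` and the sample values are hypothetical, not the data behind the slide.

```python
from collections import Counter

def bucketize(samples, width=20):
    """Count samples per half-open bucket [lo, lo + width), keyed by lo."""
    return Counter(int(s // width) * width for s in samples)

# Invented transaction-rate samples (transactions/s).
samples = [4103, 4119, 4120, 4300, 4305]
hist = bucketize(samples)

for lo in sorted(hist):
    print(f"[{lo} - {lo + 20}): {hist[lo]} samples")
```

Note that 4119 and 4120 land in different buckets: the upper edge of each bucket is exclusive.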
[Figure: Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: # of Samples; marked: Average in 4100-4120 bucket]
Standard Deviation
• Familiar first approximation – here, +/- 1100 transactions/s
• Strong meaning if the type of distribution is known
• Gaussian / Normal – the classic Bell Curve - ~34% to each side
• … which this is not
• Spans >90% of the total distribution, here
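A quick way to check whether standard deviation carries its Gaussian meaning is to measure how much of the data actually falls within one sigma of the mean. For a normal distribution that is about 68% (the ~34% to each side noted above). The sketch below uses an invented, skewed sample, not the measured run; `fraction_within_one_sigma` is a hypothetical helper.

```python
from statistics import mean, pstdev

def fraction_within_one_sigma(samples):
    """Fraction of samples inside [mean - sigma, mean + sigma]."""
    mu, sigma = mean(samples), pstdev(samples)
    inside = sum(1 for s in samples if mu - sigma <= s <= mu + sigma)
    return inside / len(samples)

# Invented sample: a tight peak near 4300 plus two slow outliers.
skewed = [4300] * 18 + [500, 1000]
print(fraction_within_one_sigma(skewed))
```

A result far from 0.68 is a hint that the distribution is not Gaussian and that sigma should not be read as if it were.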
[Figure: Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: # of Samples; marked: -σ, Average, +σ]
Median
• The mid-point of all of the data points
• 50% higher
• 50% lower
• 50th Percentile
• Coincidentally matches the peak of this distribution - 4300 transactions/s
[Figure: Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: # of Samples; marked: Average, Median]
Percentiles
• Percentage of the dataset – cut lines
• Relevant to guarantees of behavior / performance
• Median locates the center
• 10’s and 90’s
• No worse than / At least X, Y% of the time
• What makes sense for latency? Bandwidth?
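Percentile cut lines can be computed in several ways. The sketch below assumes the nearest-rank convention, which is only one of them and is not necessarily what produced the charts here. The helper name and data are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented transaction rates: 100, 200, ..., 1000 transactions/s.
rates = list(range(100, 1100, 100))

for p in (10, 50, 90):
    print(f"{p}th percentile: {percentile(rates, p)}")
```

For a "higher is better" metric such as bandwidth, the low percentiles give the guarantee ("at least X, 90% of the time" is the 10th percentile); for latency, the high percentiles do.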
[Figure: Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: # of Samples; marked: 10th through 90th percentiles, with the Median at the 50th]
Percentiles
[Figure: Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: # of Samples; marked: -σ, 10th through 90th percentiles (Median at the 50th), +σ]
• Looking back to standard deviation, it really didn’t work.
• … for a 30 minute real workload with a well defined steady state!
Cumulative Distributions
• Visualization of percentiles
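A cumulative distribution is the sorted data plotted against the fraction of samples at or below each value. A minimal sketch, with a hypothetical helper name and invented samples:

```python
def cumulative_distribution(samples):
    """Return (value, fraction of samples <= value) pairs in ascending order."""
    ordered = sorted(samples)
    n = len(ordered)
    # Later duplicates overwrite earlier ones, so each value keeps its final fraction.
    cdf = {v: (i + 1) / n for i, v in enumerate(ordered)}
    return list(cdf.items())

# Invented transaction rates (transactions/s).
points = cumulative_distribution([4300, 4000, 5000, 4300])
for value, fraction in points:
    print(f"{value}: {fraction:.0%}")
```

Reading a percentile off the curve is then a lookup: find the first value whose fraction reaches the percentage of interest.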
[Figure: Cumulative Distribution : Transaction Rate. x-axis: Transactions/s; y-axis: Percentile]
OLTP
[Figure: Transaction Rate - Cumulative Distribution, OLTP Simulation - Full Run. x-axis: Transactions/s; y-axis: Percentile; series: SMB, SMB+, DAS]
• Note: for this section, SMB and SMB+ refer to before and after a change, not to a revision of the protocol.
• The issue was not in SMB and … very specific to workload. See summary.
OLTP
OLTP – OnLine Transaction Processing
Log File – small-midsize sequential IO
Database File(s) – small random IO
Fundamental dependency on the SMB 3.0 feature set
for continuously available connections to the
database content
New workloads … new problems for an
implementation
Simulated OLTP Hardware Configuration
                          Client                Server
CPU Type                  2 sockets with 6 cores @ 2.66 GHz (12 cores total)
Memory Amount             36 GiB
Network Type              Onboard 1GbE network interface
Number of Network Links   1 x 1GbE
Storage Adapter           N/A                   1 Fibre Channel Adapter, 2x 4Gb/s connectivity
Storage                   N/A                   14x 10KRPM HDD RAID0
Simulated OLTP Hardware Configuration
[Diagram: Client to Server over 1 Gb Ethernet (SMB); Server to 14 Disk RAID10 over 2x4Gb Fibre Channel (DAS)]
18 Years Later
The Write Bubble
NT Filesystem Lingo: Valid Data Length (VDL)
Offset in the file up to which user data has been written
Allows efficient physical zeroing of fully allocated files
Code sharing from the FAT implementation
May 1994 – clever trick for VDL extension
Nov 1994 – ported over with a slight mislocation
Causal requirements
Simultaneous Read/Write, Async, NonCached IO
… to the same file
No other workload affected!
Simulated OLTP
[Figure: Average Disk Queue Length Distribution. x-axis: Disk Queue Length; y-axis: Seconds @ Average Queue Length; series: SMB, SMB+, DAS]
Simulated OLTP
[Figure: 8KiB Random IO/s Cumulative Distribution, SMB v. DAS, Windows 2008 R2 RTM. x-axis: IO Operations per Second; y-axis: Percentile; series: SMB, SMB+, DAS]
Simulated OLTP
[Figure: SMB+ Client v. Server DAS Latency, Windows 2008 R2 RTM. x-axis: Latency (ms), ticks at 1, 10, 100; y-axis: Samples; series: Client Read, Server Read, Client Write, Server Write; annotation: 500us]
Percentile Read Shift (%) Write Shift (%) Read Shift (us) Write Shift (us)
10% 13% 7% 489 495
50% 6% 4% 506 521
90% 3% 2% 709 931
95% 2% 1% 978 832
99% 1% 1% 982 881
Simulated OLTP
[Figure: 8KiB Random IO/s Cumulative Distribution, SMB v. DAS, Windows Server 2012. x-axis: IO Operations per Second; y-axis: Percentile; series: SMB, SMB+, DAS; annotations: Methodology, Effect]
Simulated OLTP
[Figure: SMB+ Client v. Server DAS Latency, Windows Server 2012. x-axis: Latency (ms), ticks at 1, 10, 100; y-axis: Samples; series: Client Read, Server Read, Client Write, Server Write; annotation: 410us]
Percentile Read Shift (%) Write Shift (%) Read Shift (us) Write Shift (us)
10% 12% 6% 414 411
50% 5% 4% 415 416
90% 2% 1% 425 430
95% 1% 0% 425 421
99% 0% 0% 439 461
OLTP Hardware Configuration
                          Client                Server
CPU Type                  2 sockets with 4 cores @ 2.26 GHz (8 cores total)
Memory Amount             24 GiB
Network Type              Onboard 1GbE network interface
Number of Network Links   1 x 1GbE
Storage Adapter           N/A                   1 Fibre Channel Adapter, 2x 4Gb/s connectivity
Storage                   N/A                   24x 10KRPM HDD: 2x 10 HDD RAID0 (DB), 2 HDD RAID0 (LOG)
OLTP Hardware Configuration
[Diagram: Client to Server over 1 Gb Ethernet (SMB); Server to 2x 10 Disk RAID0 (Data) and 2 Disk RAID0 (Log) over 2x4Gb Fibre Channel (DAS)]
OLTP
[Figure: Transaction Rate - Cumulative Distribution, OLTP Simulation - Full Run. x-axis: Transactions/s; y-axis: Percentile; series: SMB, SMB+, DAS]
• Full result is intuitively close
• Median 4270 transactions/s
• Mean 3920 transactions/s
• 4Gb v. 1Gb effect?
OLTP – First Correction
[Figure: SMB v. DAS Transaction Rate - Ingest. x-axis: Time (seconds); y-axis: Transactions/s; series: SMB+ Tps, DAS Tps]
OLTP – First Correction
[Figure: SMB v. DAS Bandwidth - Ingest. x-axis: Time (seconds); y-axis: Million Bytes/s; series: SMB+ Read Bytes, DAS Read Bytes]
OLTP – After First Correction
[Figure: Transaction Rate - Cumulative Distribution, Post Ingest. x-axis: SQL Transactions/s; y-axis: Percent of Total; series: SMB+, DAS]
• Far closer!
OLTP – Second Correction
[Figure: Transaction Rate To Equal Work Steady State - 5.62 GB OLTP Log. x-axis: Time (seconds); y-axis: Transactions/s; series: SMB+, DAS]
OLTP – Second Correction
[Figure: SMB v. DAS Transition Read Bandwidth To Equal Work Steady State - 5.62 GB OLTP Log. x-axis: Time (seconds); y-axis: Million Bytes/s; series: SMB+, DAS; marked: 1Gb]
OLTP – After Second Correction
[Figure: Transaction Rate - Cumulative Distribution, Steady State - Work Matched. x-axis: SQL Transactions/s; y-axis: Percent of Total; series: SMB+, DAS]
• Again intuitive – random IO bandwidth << 1Gb
• Only -3.0% performance relative to DAS over final 19 minutes of the run (+33s)
• 1Gb Ethernet very nearly meets 4Gb FC

                       Log Volume             Data Volumes
                       DAS    SMB2            DAS    SMB2
Mean IO Operations/s   2125   2010 (-5.4%)    4250   4020 (-5.4%)
Median                 2135   2055 (-3.7%)    4275   4120 (-3.6%)
80th                   2195   2130 (-3.1%)    4360   4230 (-3.1%)
90th                   2225   2155 (-3.0%)    4405   4265 (-3.1%)
95th                   2240   2175 (-3.1%)    4435   4305 (-3.0%)
OLTP – More than 4/8 KiB
[Figure: Database File IO Size. x-axis: IO Size (KiBytes); y-axis: # IOs, ticks from 10 to 10,000,000; series: Read, Write; annotation: 8 KiB]
[Figure: Database File IO - Bytes Transferred by IO. x-axis: IO Size (KiBytes); y-axis: Percent of Total; series: Read, Write]
Note: 8 KiB @ 49%
OLTP – Log Matters
[Figure: Log Write Size. x-axis: IO Size (KiBytes); y-axis: # IOs (Thousands)]
OLTP
Understanding workload is critical for performance
Enabling new workloads is both a protocol and platform effort
Windows 7 / 2008 R2 KB 2536493
http://support.microsoft.com/kb/2536493
Mixed Read/Write
Async
NonCached IO
Same File
… otherwise, no issue. It took a true database to hit it.
No other workload affected!
Cluster Motion
SMB 3.0 Continuous Availability Cluster
Implementations should look at performance of
essential cluster management operations
Bet: scale of clustered resources will rise
significantly, relative to pre-SMB 3.0 services
Planned v. Unplanned Motion
Goal: define the time budget.
Cluster Hardware Configuration
                          Client                Server
CPU Type                  2 sockets with 2 cores @ 2.0 GHz (4 cores total)
Memory Amount             4GiB
Network Type              Onboard 1GbE network interface (Cluster Network)
Number of Network Links   1 x 1GbE
Storage Adapter           N/A                   1 Fibre Channel Adapter, 4Gb/s connectivity
Storage                   N/A                   120x 10KRPM HDD, Enterprise FC Array
Slow & Old!
OLTP Hardware Configuration
Cluster Motion
Resource Group
Container of objects representing the file service
High scale on disks depending on implementation
HW/SW Volume
Smaller scale on total number of groups
Config: 3 FS Groups, 40 disks each
[Diagram: resource group containing Net Name, IPv4, IPv6, File Server, Disk 1, Disk 2 … Disk N]
Cluster Motion
Resource Control Manager
handles the process for
Windows, generalizable to
other platforms
Cluster log follows the
state changes, similar to:
RCM Offline: Online -> WaitingToGoOffline -> OfflineCallIssued -> OfflinePending -> OfflineSavingCheckpoints -> Offline
RCM Online: Offline -> WaitingToComeOnline -> OnlineCallIssued -> OnlinePending -> Online
000006b4.000010cc::2011/08/25-00:23:04.890 INFO [RCM] TransitionToState(IP Address 10.10.10.202) OfflineCallIssued-->OfflineSavingCheckpoints.
PS> Get-ClusterLog
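Timing each state requires pulling the timestamp and the old and new states out of every transition entry. The sketch below parses the single sample entry shown above. The regular expression is an assumption based on that one entry only, and real cluster logs may contain variations it does not handle; `parse_transition` is a hypothetical helper.

```python
import re
from datetime import datetime

# The sample entry from the slide, joined onto one line.
LINE = ("000006b4.000010cc::2011/08/25-00:23:04.890 INFO [RCM] "
        "TransitionToState(IP Address 10.10.10.202) "
        "OfflineCallIssued-->OfflineSavingCheckpoints.")

# Assumed layout: <ids>::<timestamp> INFO [RCM] TransitionToState(<resource>) <old>--><new>.
PATTERN = re.compile(
    r"::(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+INFO\s+\[RCM\]\s+"
    r"TransitionToState\((?P<resource>.+?)\)\s+(?P<old>\w+)-->(?P<new>\w+)\."
)

def parse_transition(line):
    """Return (timestamp, resource, old_state, new_state), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return ts, m["resource"], m["old"], m["new"]

print(parse_transition(LINE))
```

Subtracting the timestamps of consecutive transitions for the same resource gives the time spent in each state, which is the quantity charted in the slides that follow.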
Cluster Motion
[Figure: Offline -> Online (Disks) - Initial. y-axis: Clock Time (seconds)]
[Figure: Offline -> Online (NonDisk) Initial. y-axis ticks from 0 to 60000]
Cluster Motion
Crossover Ethernet cluster network
Simultaneous disconnect of crossover network and
loss of node (e.g., node power loss)
Unplanned motion – 80s delay
Windows 2008 R2 KB 2575625
http://support.microsoft.com/kb/2575625
Cluster Motion - Disks
[Figure: Hard Disk Offline - Initial. x-axis: Disks - 120 in 3 resource groups; y-axis: Clock Time (seconds)]
Cluster Motion - Disks
Hard disk resource internals
Removal of FSCTL_LOCK_VOLUME prior to dismount
Removal of timer used to simplify async completion of PR reservation release
Removal of soft guard between IOCTL_OFFLINE_VOLUME and the online process
Reduction of conservative polling interval for disk onlines
Cluster Motion - Disks
[Figure: Hard Disk Offline - Final. x-axis: Disks - 120 in 3 resource groups; y-axis: Clock Time (seconds)]
Cluster Motion - Final
[Figure: Offline -> Online (Disks) - Final. x-axis: Disks - 120 in 3 resource groups; y-axis: Clock Time (seconds)]
Cluster Motion - Final
[Figure: Offline -> Online (NonDisks) Final. y-axis: Clock Time (seconds); series: OnlinePending, OnlineCallIssued, WaitingToComeOnline, Offline, OfflineSavingCheckpoints, OfflinePending, OfflineCallIssued, WaitingToGoOffline, Online]
Cluster Motion
Summary
Large outlier in unplanned loss removed
Relative: ~80% faster at scale
Absolute: sub-10s motion … at scale
Larger or smaller than possible reality?
Implementers
Look at your platform
… repeat under loads …
Multichannel – Goals
Expand session beyond a single TCP stream / CPU
Availability / Network fault tolerance
Make the SMB2 protocol resilient to interface, link
or switch failures.
Move “link awareness” higher up the stack to
enable more intelligent decision making.
Augment NIC teaming at the network layer.
Keep fallback paths ready, prioritize available links.
React quickly to changes to network availability.
Performance – utilize the available resource
Multichannel - Terminology
Channel: underlying transport connection
Session: authenticated user context.
Session Binding: map Session ↔ Channel
N:N relationship
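The N:N relationship can be pictured as a two-way lookup table. The sketch below is a hypothetical in-memory model for illustration only; it is not the SMB wire format or any actual implementation, and `BindingTable` and its methods are invented names.

```python
from collections import defaultdict

class BindingTable:
    """Hypothetical N:N map between sessions and channels."""

    def __init__(self):
        self._channels_of = defaultdict(set)   # session -> channels
        self._sessions_on = defaultdict(set)   # channel -> sessions

    def bind(self, session, channel):
        self._channels_of[session].add(channel)
        self._sessions_on[channel].add(session)

    def channels(self, session):
        return set(self._channels_of[session])

    def drop_channel(self, channel):
        """Remove a failed channel; return sessions left with no channel at all."""
        orphaned = []
        for session in self._sessions_on.pop(channel, set()):
            self._channels_of[session].discard(channel)
            if not self._channels_of[session]:
                orphaned.append(session)
        return orphaned

table = BindingTable()
table.bind("User A", "tcp0")
table.bind("User A", "tcp1")
table.bind("User B", "tcp1")
table.bind("User B", "rdma0")

# Losing tcp1 orphans nobody: both sessions still have another channel.
print(table.drop_channel("tcp1"))
print(table.channels("User A"), table.channels("User B"))
```

The point of the model is the fault-tolerance goal: a session survives the loss of a channel as long as at least one binding remains.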
[Diagram: Sessions (User A, User B) connected by Bindings to Channels (TCP, TCP, RDMA)]
Test Hardware Configuration
                          Client                Server
CPU Type                  2 sockets with 6 cores @ 2.66 GHz (12 cores total)
Memory Amount             48 GiB
Network Type              2 network interface adapter cards; each card has 2 10GbE interfaces
Number of Network Links   4 x 10GbE
Storage Adapter           N/A                   2 RAID Host Bus Adapters – 6Gb/s SAS connectivity
Storage                   N/A                   12 3G SATA SSD / HBA
Test Hardware Configuration
[Diagram: Client and Server connected by 10GbE; two RAID0 sets of 12 SSDs each]
IOMETER Configuration
I/O Size (bytes)   Number of Workers   Queue Depth   Total Queued
512                32                  8             256
1024               32                  8             256
4096               32                  8             256
8192               16                  32            512
16384              16                  16            256
32768              16                  8             128
65536              16                  8             128
131072             16                  8             128
262144             16                  8             128
524288             16                  8             128
1048576            16                  8             128
Row groups: Constant #Op, Constant Bytes, Scaled Bytes
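The Total Queued column is workers times queue depth, and multiplying by I/O size gives the bytes outstanding at once. The sketch below recomputes both from the table values. Mapping the three labels (Constant #Op, Constant Bytes, Scaled Bytes) onto row groups is one reading of the arithmetic, not something the slide states explicitly.

```python
# (io_size_bytes, workers, queue_depth), copied from the table above.
CONFIG = [
    (512, 32, 8), (1024, 32, 8), (4096, 32, 8),
    (8192, 16, 32), (16384, 16, 16), (32768, 16, 8),
    (65536, 16, 8), (131072, 16, 8), (262144, 16, 8),
    (524288, 16, 8), (1048576, 16, 8),
]

def outstanding(io_size, workers, depth):
    """Return (total queued operations, total bytes outstanding)."""
    total_queued = workers * depth
    return total_queued, total_queued * io_size

for io_size, workers, depth in CONFIG:
    ops, nbytes = outstanding(io_size, workers, depth)
    print(f"{io_size:>8} B: {ops:>3} queued, {nbytes / 2**20:g} MiB outstanding")
```

Under this reading, the 512 to 4096 byte rows hold the operation count at 256, the 8192 to 32768 byte rows hold outstanding bytes at 4 MiB, and from 65536 bytes up the operation count stays at 128 while outstanding bytes grow with I/O size.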
Server (Local) Baseline
[Figure: Server/Local Read Throughput. x-axis: IO Size (bytes), 512 to 1048576; y-axis: MB/sec; series: Read Sequential, Read Random]
SMB 3.0 Client 10GbE Interface Scaling
[Figure: SMB 3.0 Client Interface Scaling - Throughput. x-axis: IO Size (bytes), 512 to 1048576; y-axis: MB/sec; series: 1 x 10GbE, 2 x 10GbE, 4 x 10GbE]
Server to SMB 3.0 Client Comparison
[Figure: Server/Local vs. Client Throughput. x-axis: IO Size (bytes), 512 to 1048576; y-axis: MB/sec; series: Server/Local Throughput, Client (4 x 10GbE) Throughput]
Server to SMB 3.0 Client IOps
[Figure: Server/Local vs. Client IOps with CPU %. x-axis: IO Size, 512 to 1048576; y-axes: I/Os per second and Priv CPU %; series: Server/Local IOps, Client (4 x 10GbE) IOps, Client (4 x 10GbE) Priv CPU, Server/Local Priv CPU]
Multi Channel
Implementation Challenges
Nodes with varying interface counts (1, 2, 4 NIC)
Operation distribution across interfaces
TCP v. RDMA
TCP CPU is in the host
RDMA CPU is primarily in the HCA
Different connection scale points
Performance goal: scale
Summary
SMB 3.0 enables new workloads
Workload demands will expose new issues in platforms and implementations – be prepared
New protocol features create new performance requirements
Downlevel platform improvements
Windows 2008 R2 KB 2575625 – Cluster Unplanned with Crossover Ethernet
Windows 7 / 2008 R2 KB 2536493 – Single-File Mixed Read/Write Async NonCached IO