Matt Gossage Senior Program Manager Microsoft Corporation UNC321.


Agenda Exchange storage background Storage technology Large mailbox value E2010 storage architecture Store innovations ESE database innovations E2010 storage design Summary


Page 1: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.
Page 2: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Storage in Microsoft Exchange Server 2010

Matt Gossage
Senior Program Manager
Microsoft Corporation
UNC321

Page 3: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Agenda

Exchange storage background
Storage technology 2010+
Large mailbox value
E2010 storage architecture
  Store innovations
  ESE database innovations
E2010 storage design
Summary

Page 4: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange 2003 HA/Storage Design
MSIT 4+3 SCC SAN example

4 Active Nodes
3 Passive Nodes
8 Processor cores
4 GB of RAM
4000 Users/Server
250 MB Mailboxes
Backups:
  Daily Full
  Stream to disk/tape

SAN Fabric B

SAN Fabric A

+1 IOPS/Mailbox

RAID10 3.5” 10K FC Disks

Storage is single point of failure

Page 5: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange 2007 HA/Storage DesignMSIT CCR + DAS example

File Share Witness

Hub Transport

Server:

Transport Dumpster

Public Network

Private Network

Active Node Passive Node

CCR

RAID

Transaction Log Shipping

Replay

RAID

RAID

RAID

RAID5 2.5” 10K SAS Disks

.33 IOPS/Mailbox

No single points of failure!

~4000 Mailboxes/Cluster
8 Processor cores
16 GB of RAM
2 GB Mailboxes
Backups: DPM
  15 min Incremental
  Daily Express Full

Page 6: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Disk Technology

Disk capacity trend predicted to continue
  2TB desktop-class SATA disks available
  1TB Nearline/Midline SAS disks available

Sequential throughput increasing linearly based on areal density
  2010 SATA = ~250MB/sec

Random I/O performance not expected to improve substantially
  15K RPM is the ceiling

Page 7: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Random vs. Sequential Disk IO

Random IO
  Disk head has to move to process subsequent IO
  Head movement = High IO latency
  Seek latency limits IOPS

Sequential IO
  Disk head does not move to process subsequent IO
  Stationary head = Low IO latency
  Disk RPM speed limits IOPS

7.2K SATA Disk (20ms Latency)
  Random = 50 IOPS
  Sequential = +300 IOPS!

IOPS = Input/Output operations (IOs) per second
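The 50 IOPS figure follows directly from the 20ms latency quoted on the slide. A minimal sketch, assuming the disk services one IO at a time (the function names are illustrative, not from the presentation):

```python
def iops_from_latency(avg_latency_ms: float) -> float:
    """IOPS a single disk sustains when each IO takes avg_latency_ms
    and IOs are serviced one at a time (no queuing or reordering)."""
    return 1000.0 / avg_latency_ms


def latency_from_iops(iops: float) -> float:
    """Inverse relation: average service time in ms implied by an IOPS rate."""
    return 1000.0 / iops


# 7.2K SATA disk, 20 ms per random IO (figure from the slide)
print(iops_from_latency(20.0))            # random IO rate
# 300 sequential IOPS implies a much shorter per-IO service time
print(round(latency_from_iops(300), 2))   # ms per sequential IO
```

Under the same assumption, 300 sequential IOPS corresponds to roughly 3.3ms per IO, which is why avoiding head movement matters so much on slow spindles.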

Page 8: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

FLASH/SSD: E2010 Scenarios

Cache

SSD

PCM

Enterprise SAN ArrayHybrid HDD

SATA SSD

NAND

NAND HBA / RAID

Flash best utilized by E2010 when used as a cache within storage stack

E2010 Mailbox Server

?

Page 9: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

E-mail Trends

The average corporate user, today, can expect to send and receive about 156 messages a day, and this number is expected to grow to about 233 messages a day by 2012, an increase of 33% over the four-year period. (Radicati, 2008)

Business users report that they currently spend 19% of their work day, or close to 2 hours/day, on email. (Radicati, 2007)

[Chart: Messages Sent/Received Per User/Day; x-axis: 2008, 2010, 2012; y-axis: 0 to 250]

Page 10: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Large Mailbox Value

Large Mailbox = 1-10GB+
  “Aggregate Mailbox” = Primary mailbox + Archive Mailbox
  ~1 Year of mail (minimum)

Increased knowledge worker productivity
Reduced mailbox management
Client Accessibility (Outlook/OWA/Mobile)
Eliminate/Reduce PSTs
Eliminate/Reduce 3rd Party Archive

Time      Items     Mailbox Size (MB)
1 Day     200       10
1 Month   4,000     200
1 Year    48,000    2,400
4 Years   192,000   9,600

*Very Heavy Profile = 150 Receive + 50 Send/Day, 50KB per message, no deletions
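The table can be reproduced from the Very Heavy profile. The sketch below assumes 20 working days per month and decimal megabytes (1MB = 1000KB); both are inferred from the table values rather than stated on the slide.

```python
ITEMS_PER_DAY = 150 + 50        # received + sent (Very Heavy profile)
ITEM_SIZE_KB = 50               # average message size from the slide
WORK_DAYS_PER_MONTH = 20        # assumption: inferred from 4,000 items/month


def mailbox_growth(days: int) -> tuple[int, float]:
    """Return (item count, mailbox size in MB) after `days` with no deletions."""
    items = ITEMS_PER_DAY * days
    size_mb = items * ITEM_SIZE_KB / 1000   # decimal MB, as the table appears to use
    return items, size_mb


periods = {
    "1 Day": 1,
    "1 Month": WORK_DAYS_PER_MONTH,
    "1 Year": WORK_DAYS_PER_MONTH * 12,
    "4 Years": WORK_DAYS_PER_MONTH * 12 * 4,
}
for label, days in periods.items():
    items, size_mb = mailbox_growth(days)
    print(f"{label:8} {items:>8,} items {size_mb:>8,.0f} MB")
```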

Page 11: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Large Mailbox Challenges & Solutions
Client Experience

Performance Improvements: Office 2007 SP2 (KB953195)

Updated OST sizing guidance (10GB)
Utilize the E2010 Archive Mailbox to reduce data cached to OST
E2010 Store/ESE changes

Outlook 2007 Performance (Cached Mode)

Outlook 2007 (Online)/OWA Performance

Items/folder Limitations
View Creation Performance

Client Search Performance

E2010 Store/ESE changes

E2010 Search Performance Improvements

Real-time result views
2x increase in indexing performance

E2010 Store/ESE changes

Page 12: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Large Mailbox Challenges & Solutions
Deployment/Operations

Backup off passive copies

Daily Incremental/Weekly Full backups
DPM Express Full Backups
E2010 HA + Hold Policy is your backup

Long Backup Times

Fast Recovery Requirements (RTO)

High Storage Costs
IOPS (efficiently utilizing low performance/high capacity disks)
RAID overhead

E2010 HA

E2010 Store/ESE changes

Move Mailbox Downtime
E2010 Online Move Mailbox

Database Maintenance
Online Maintenance Duration (OLD)
DB corruption (-1018) pain point
DB re-seed performance hit on active copy

E2010 Store/ESE changes

Page 13: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange 2010 Storage Vision

IO Reduction
Sequential IO

Large, Fast, Low-cost Mailboxes

SATA/Tier 2 Disk Optimization

Storage Design Flexibility

RAID’less Storage (JBOD)

Page 14: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reductions: Store Schema Changes

Store Schema = The way the Store organizes data in the ESE Database

E2010: One simple theme
  Move away from doing many, random, small size, disk IOs to doing fewer, sequential, large size, disk IOs.

Significant Benefits
Fast/Efficient...

OWA/Outlook Online Mode
  ...end user viewing for “cold” states/first time view creation
  ...Calendar Operations
  ...Search performance

Outlook Cached Mode/Exchange ActiveSync
  OST sync = sequential IO
  EAS sync = sequential IO

Server Management
  ...Move mailbox
  ...Content Index Crawls

Page 15: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: Store Table Architecture

E2007

Message/Folder Table (MFT)

Joe:Inbox:H3

Joe:Inbox:H2

Joe:Inbox:H1

Per Database Per Folder

Mailbox Table

Jeff’s Mbx

Ann’s Mbx

Joe’s Mbx

Attachments Table

Jeff:Excel.xls

Ann:Pic.bmp

Joe:Help.doc

Message Table (Msg)

Joe:Msg10

Jeff:Msg32

Ann:Msg180

Folders Table

Jeff:Inbox

Ann:Drafts

Joe:Unread

E2010

View Tables (e.g. From)

Joe:H920

Joe:H302

Joe:H10

Secondary Indexes used for Views

Per Mailbox

Mailbox Table

Jeff’s Mbx

Ann’s Mbx

Joe’s Mbx

Body

Joe:Msg10

Joe:Help.doc

Joe:Msg302

Message Header Table

Joe:H10

Joe:H302

Joe:H920

Folders Table

Joe:Inbox

Joe:Drafts

Joe:Unread

Per Database

New Store Schema = no more single instance storage within a DB

Per View

Page 16: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Store Schema Changes: Physical Contiguity

1078

B+ Tree

92 4577 6 872 7210 3278 21 9346

1078

B+ Tree

1079 1080 1081 1082 1083 3456 3457 3458

E2007

E2010

Many, small size, IOs (1 per 8K page)

Fewer, larger size, sequential IOs

DB Pages (Page Numbers)

B+Tree = Table

Page 17: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Store Schema Changes: Logical Contiguity

E2007

E2010

Many, small size, IOs

Fewer, large size, IOs

Inbox

M1

Calendar

M3

Drafts

M5

For Follow-up

M4

DL Mail

M2

Mailbox

DL Mail M1

Calendar M2

Drafts M3

For Follow-up M4

Inbox M5

Mailbox

Random

Sequential

Page 18: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Store Schema Changes: Lazy View Updates

E2007

E2010

Many, random, IOs (1 per update)

Fewer, sequential, IOs (1 per view)

All Unread or Flagged items (view)

TimeM1 arrives M2 arrives M1 flagged M3 arrives M2 deleted

User uses OWA/Outlook Online and switches to this view

All Unread or Flagged items (view)

M1 M2 M1 M3 M2

M1 M2 M1 M3 M2

Nickel & Dime Approach

Pay to Play Approach

DB I/O

Reducing IO by deferring view updatesView updates utilize sequential IO

Page 19: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Outlook 2007 SP2 Large Mailbox Performance on E2010

demo

Page 20: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: ESE Changes

Optimize for new Store Schema
  Allocate database space in contiguous manner
  Maintain database contiguity over time
  Utilize space efficiently (Database compression)

Increase IO Sizes
  DB page size increased from 8KB to 32KB
  Improved read/write IO coalescing (Gap coalescing)
  Provide improved async read capability (Pre-read)

Increase Cache Effectiveness
  100MB Checkpoint Depth (HA configurations only)
  DB Cache Compression (aka Dehydration)
  DB Cache Priority (aka Fast Evict)

Page 21: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: Space ManagementAllocate space based on contiguity

Page 1

Used

Page 3

Used

Disk

Database Space Allocation Hints:
• Allocate DB space based on either data compactness or data contiguity (usage pattern)

DB CachePage X

Msg Header

Page Y

Msg Header

Page Z

Event History

Contiguity

Space Contiguity

Space Compactness

Page 4

Msg Header

Page 5

Msg Header

Page 2

Event History

Sequential/BloatRandom/Compact

Page 22: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: Maintain Contiguity New database maintenance architecture

ESE Function: E2007 SP1 vs. E2010

Cleanup (deleted items/mailboxes)
  E2007 SP1: Cleanup performed during Online Defrag (OLD), which occurs during the Online Maintenance (OLM) time window.
  E2010: Cleanup performed at run time (when hard delete occurs). Happens during Store dumpster cleanup (OLM); pages are zeroed by default.

Space Compaction
  E2007 SP1: Database is compacted and space reclaimed during Online Defrag (OLD).
  E2010: Database is compacted and space reclaimed at run time. Auto-throttled.

Maintain Contiguity
  E2007 SP1: N/A: Contiguity is compromised by space compaction.
  E2010: Database is analyzed for contiguity and space at run time and is defragmented in the background (B+Tree Defrag/OLD2). Auto-throttled.

Database Checksum
  E2007 SP1: When configured, ½ of OLD maintenance window reserved for sequential scan (Checksum), manual throttle. Active DB copy only.
  E2010: Two options (both Active and Passive copies):
    1. Run DB Checksum in the background 24x7 (default). Sequential IO.
    2. Run DB Checksum during OLM window. Sequential IO.

Page 23: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: DB Contiguity Results

E2007 Message Folder Table (aka MFT)

E2010 Message Header Table (aka MsgHeader)

Blue = contiguous (good)
Red = fragmented (bad)

*Production database analysis

Random Deletes at the tail

FRAGMENTED

CONTIGUOUS

DB Page Numbers

Page 24: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Mitigate DB Space Growth: Database Compression

Store Schema change, Space Hints, B+Tree Defrag & 32KB page size combine to increase DB file size by 20%.
Growth is 100% mitigated by Database Compression
  7bit/XPRESS Compression for message headers and text/html bodies (Long Values)

DB File Size Comparison (relative file size; y-axis 0.00 to 1.40)
E2007/RTF    1.00
E2010/RTF    1.20
E2010/Mix    1.00
E2010/HTML   0.88

DB Space Analysis
Counts                E2007 SP1     E2010
Mailbox Count         750           750
Tables                14754         92435
Secondary Indexes     85784         4557
Pages                 28,486,144    5,814,032
Used Pages (%)        85.7%         86.7%
Available Pages (%)   14.3%         13.3%

Chart callouts: Msg Views, 32KB Pages

1 Database, 750 x 250MB mailboxes, RTF = RTF Compressed, Mix = 77% HTML, 15% RTF, 8% Text, Avg. Message size = ~50KB

Page 25: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: DB Page Size Increased to 32KB

Page 1

Msg Header

Page 2

X

Page 3

Msg Body

DiskPage 4

X

Page 5

MsgBody

DBCache

Page 1

Msg Header

Page 3

Msg Body

Page 5

MsgBody

3 Read IO’s

Page 1 (32KB)

Msg Header, Msg Body

Disk

DBCache

1 Read IO

E2007 DB Read 20KB Message

E2010 DB Read 20KB Message

~20KB Message

8 KB Pages

32 KB Pages

Page 2 (32KB)

X

Page 1 (32KB)

Msg Header, Msg Body
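The read counts in the 8KB vs. 32KB comparison can be checked with simple arithmetic. This is a rough sketch only: it ignores per-page overhead and assumes each page costs one read IO when the pages are not contiguous, as in the E2007 picture.

```python
import math


def pages_needed(message_kb: float, page_kb: int) -> int:
    """Minimum number of database pages needed to hold a message,
    ignoring page headers and other per-page overhead."""
    return math.ceil(message_kb / page_kb)


# ~20KB message, as on the slide
print(pages_needed(20, 8))    # E2007: 8KB pages
print(pages_needed(20, 32))   # E2010: 32KB pages
```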

Page 26: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: IO Gap CoalescingRead Case

Page 1

Msg Header

Page 2

X

Page 3

Msg Body

DiskPage 4

X

Page 5

Msg Body

E2007 DB Read Behavior

E2010 DB Read Behavior

DBCache

Page 1

Msg Header

Page 3

Msg Body

Page 5

Msg Body

3 Read IO’s

Page 1

Msg Header

Page 2

X

Page 3

Msg Body

DiskPage 4

X

Page 5

Msg Body

DBCache

Page 1

Msg Header

Page 3

Msg Body

Page 5

Msg Body

Page 2

Temp Buffer

Page 4

TempBuffer

1 Read IO
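The read case above can be sketched as a range-merging step: wanted pages that are separated by only a few unwanted pages are fetched in one larger IO, and the unwanted pages are discarded (the "Temp Buffer" in the diagram). The `max_gap` parameter is a hypothetical knob for illustration; the slides do not state what gap size ESE tolerates.

```python
def coalesce_reads(pages, max_gap):
    """Group wanted page numbers into (first, last) read ranges.

    A new page joins the current range when at most `max_gap` unwanted
    pages lie between it and the end of that range; otherwise it starts
    a new read IO. `max_gap` is an illustrative parameter, not an ESE setting.
    """
    ranges = []
    for page in sorted(pages):
        if ranges and page - ranges[-1][1] - 1 <= max_gap:
            ranges[-1][1] = page          # extend the current IO across the gap
        else:
            ranges.append([page, page])   # start a new IO
    return [tuple(r) for r in ranges]


wanted = [1, 3, 5]                        # Msg Header + two Msg Body pages
print(coalesce_reads(wanted, max_gap=0))  # no coalescing: one IO per page
print(coalesce_reads(wanted, max_gap=1))  # bridge single-page gaps: one IO
```

With no gap tolerance the three wanted pages cost three read IOs; allowing a one-page gap turns them into a single read of pages 1 through 5, at the price of reading two pages that are thrown away.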

Page 27: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: 100MB Checkpoint Depth

Checkpoint Depth = The amount of data that is waiting to be committed to the database file (edb).
E2010 default Checkpoint Depth Max is increasing from 20MB to 100MB only on databases protected by E2010 HA (standalone still 20MB).

Loadgen Test: 3000 Mailbox, 12 DB, Outlook 2007 Online Very Heavy Profile

[Chart: Database Pages Repeatedly Written/sec and DB Writes/sec (avg) vs. Checkpoint Depth (MB); x-axis: 20, 40, 60, 80, 100, 200; y-axis: 0 to 120]

100MB Checkpoint Depth = 40% DB write IO reduction

Deep Checkpoint Benefit = Efficient DB writes (~40% reduction)

Deep Checkpoint Risks = long store shutdown times, long crash recovery times.
Risk Mitigation: shut down databases in parallel, failover on store crash

Page 28: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: DB Cache Compression

Problem: New Store Schema + 32KB pages can reduce efficiency of cache. E.g. a page with 8KB of data consumes 32KB of memory in the DB Cache.
Solution: Implement DB Cache Compression to shrink partially used cached pages in memory, allowing a more effective cache.

Page 1 (32KB)

8KB

Disk

DBCache

Page 1 (32KB)

8KB

1. 32KB Page with only 8KB of data is read off disk

2. 32KB page is compressed to a 8KB in-memory image

Up to 30% more cache/mailbox server
More Cache = Less DB IO!

Page 1 (8KB)

8KB

Page 29: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: DB Cache Priority

Problem: Background and recovery DB operations can pollute the cache. E.g. DB checksumming, OLD2, HA log replay.
Solution: Implement DB Cache Priority to allow lower cache priorities for background/replay operations.

Now Past Future

DB Cache Time

Outlook Message Read

HA Log Replay (Passive)

DB Maintenance

Cache Eviction Cache Entry

ESE Caching Algorithm = LRU-K (Least Recently Used)

Page 30: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange 2010 Storage Speeds and Feeds

Mailbox IO Characteristics: E2007 vs. E2010

DB IO                     E2007     E2010
IO Type                   Random    “Sequentialish”
Read:Write                1:1       3:2
Avg Read IO Size (KB)     12        52
Avg Write IO Size (KB)    8         60

Log IO                    E2007        E2010
IO Type                   Sequential   Sequential
Read:Write                0:1          0:1
Avg Read IO Size (KB)     n/a          n/a
Avg Write IO Size (KB)    10           10

3000 Mailboxes, 12 DBs, 4MB DB Cache/Mailbox, Loadgen Outlook 2007 Online Very Heavy Profile, 250MB Mailbox Size

DB IO Sizes increase by 5x!!

Log IO Write Size is the same...

Page 31: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: E2007 vs. E2010 Results

[Chart: DB IOPS Comparison, E2007 vs. E2010; series: DB Read IO/Sec, DB Write IO/Sec, DB IO/Sec; y-axis: 0 to 500]

+70% Reduction!

3000 Mailboxes, 3MB DB Cache/user, Loadgen Outlook 2007 Online Very Heavy Profile, 250MB Mailbox Size, E2010 Beta

Page 32: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange IOPS Trend

[Chart: DB IOPS/Mailbox for Exchange 2003, Exchange 2007, Exchange 2010; y-axis: IOPS/Mailbox, 0 to 1]

+90% Reduction!

Page 33: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Optimize for SATA/Tier 2 Disks
DB Write IO “Burstiness”

Problem: Bursty DB writes negatively affect DB read and Log write latency
• The more write IOs issued at a time, the more disk contention.

[Chart: IO Latency Based on Max DB Write IOs (ms); x-axis: Maximum DB Write IOs Issued (2, 4, 8, 16, 32, 64); y-axis: Latency (ms), 0 to 120; series: DB Read IO, Log Write IO]

Single 7.2k SATA disk, logs/db on same spindle, Loadgen load generating 250 RPC Operations/second, ~50 IOPS

Solution: Throttle DB writes based on Checkpoint target (QoS), DB Write Smoothing

Page 34: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

DB Write Smoothing: Results

3000 Mailboxes, 3MB DB Cache/user, 12 x 7.2k SATA disks (DB/Logs on same spindles), Loadgen Outlook 2007 Online Very Heavy Profile

E2010 Smooth DB IO Benefit (chart values; y-axis 0 to 50)

                          Exchange 2010 Baseline   Exchange 2010 Smooth DB IO
DB Read Latency (ms)      49                       34
Log Write Latency (ms)    3.7                      0.7
RPC Average Latency       10.1                     5.1

50% Reduction!

Page 35: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Putting It All Together: Mailboxes/Disk

[Chart: Mailboxes/Disk; Exchange 2007 = 125, Exchange 2010 = +500]

250MB Mailbox Size, 3MB DB Cache/user, 12 x 7.2k SATA disks (DB/Logs on same spindles), Loadgen Outlook 2007 Online Very Heavy Profile, measured at <20ms RPC Average latency

E2010 Storage improvements cannot be quantified in IOPS reductions alone

+4X Mailboxes/Disk!

Page 36: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

JBOD/RAID'less Storage: Now an option!

JBOD: 1 disk = 1 Database/Log
Requires E2010 HA (3+ DB Copies)
Annual Disk Failure Rate (AFR) = ~5%

JBOD Advantages
  Reducing Storage Costs/Complexity
  Eliminates unnecessary DB copies: Server and Storage redundancy can be symmetrical
  Reduces Disk IO: Eliminates RAID write penalty
  Enables Simple Storage Design: 1 disk = 1 database
  Enables Simple Storage Failure Recovery

JBOD Challenges
  Exchange HA/Storage must replace RAID functionality
  Disk Striping performance (e.g. RAID10) cannot be leveraged
  Disk Failure = Database Failover (~30 second outage)
  Re-enabling Resiliency = Spare disk assignment/partitioning/format/DB re-seed (scriptable)
  Soft Disk Errors (bad blocks) must be detected and repaired

Page 37: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

JBOD/RAID'less Storage: E2010 Optimizations

Optimize HA Failovers/Switchovers
  Failovers < 30 seconds
  ESE tuned to maintain DB cache after failover (Cache warming)

Improve HA storage failure detection and failover
  HA now detects storage failures and automatically fails over

Improve storage failure detection (bad blocks/corruption)
  Active/Passive copy background scan (Checksum)
  Active/Passive copy Lost Write Detection

Improve Database Seeding/Repair
  Utilize DB passive copy for seeding source
  Seed capability for Content Index Catalog
  Reduce re-seeds by using Single Page Restore (Active and Passive)

Page 38: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Mailbox Server Node 1

Mailbox Server Node 2

Database Availability Group (DAG)

Page1

Page2

Page3

Mailbox Server Node 3

1. Page corruption detected on Active Copy (e.g. -1018)

2. Active DB places marker in log stream to notify passive copies to ship up to date page

3. Passive receives log and replays up to marker, retrieves good page, invokes Replay Service callback and ships page

4. Active receives good page, writes page to DB. Page is restored.

DB1-Active

Database

Log

Page1

Page2

Page3

DB1-CopyA

Database

Log

Page1

Page2

Page3

DB1-CopyB

Database

Log

5. Subsequent page repair from additional copies ignored

JBOD/RAID'less Storage: Single Page Restore (Active)

Page 39: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

E2010 HA Storage Design Flexibility

SAN
• HA = Shared Storage Clustering
• +1.0 IOPS/Mailbox
• 3.5” 15K 146GB FC Disks
• RAID10 for DB & Logs
• Dedicated Spindles
• Multi-path (HBAs, FC Switches, SAN array controllers)
• Backup = Streaming off active
• Fast Recovery = Hardware VSS (Snapshots/Clones)

DAS (SAS)
• HA = CCR
• .33 IOPS/Mailbox
• 2.5” 146GB 10K SAS Disks
• RAID5 for DB
• RAID10 for Logs
• SAS Array Controller (/w BBU)
• Backup = VSS Snapshot
• Fast Recovery = CCR

DAS (SATA/Tier2)
• HA = DAG (2+ DB copies)
• .11 IOPS/Mailbox
• 3.5” 1TB 7.2K SATA/Tier2 Disks
• RAID10 for DB & Logs
• SAS Array Controller (/w BBU)
• Backup = VSS Snapshot/Optional
• Fast Recovery = Database Failover

JBOD (SATA/Tier2)
• HA = DAG (3+ DB copies)
• .11 IOPS/Mailbox
• 3.5” 1TB 7.2K SATA/Tier2 Disks
• 1 DB = 1 Disk
• SAS Array Controller (/w BBU)
• Backup = VSS Snapshot/Optional
• Fast Recovery = Database Failover

More options to reduce storage cost

Page 40: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

E2010 Storage Design Flexibility

Exchange Online Archive provides mailbox storage flexibility
One Mailbox per user or two
E2010 optimized for DAS storage, SAN storage is fully supported
IOPS reductions/SATA optimizations enable lower performing storage
E2010 HA architected for DAS (simpler)
JBOD* and RAID storage support
E2010 optimized for Tier 2 (SATA) disks, Enterprise disks are fully supported
SSD storage supported but not recommended for mainstream due to high $/GB
Storage Groups are gone; Max 100 Databases/Server
Max recommended DB Size = 2TB*
Max recommended Folder Item Count = 100K**

* 2+ copy E2010 HA only
** Assuming no 3rd party applications

Page 41: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

E2010 Storage Requirements

Storage Guidance columns: Stand Alone | E2010 HA (2 copies) | E2010 HA (3+ copies)

Storage Type: DAS, SAN (Fibre Channel, iSCSI)
Disk Type: SAS, Fibre Channel, SATA/Tier2, SSD
RAID: RAID recommended (Stand Alone, 2 copies) | RAID optional (3+ copies)
RAID Type: RAID-1/0, RAID-5, RAID-6 | JBOD (3+ copies)
DB/Log Isolation: Best Practice (Stand Alone) | Not required (E2010 HA)
Windows Disk Type: Basic (recommended), Dynamic (supported)
Partition Type: GPT (recommended), MBR (supported)
Partition Alignment: Windows 2008/R2 Default (1MB)
File System: NTFS
NTFS Allocation Unit Size: 64KB for both database and log volumes
Encryption Support: Outlook Protection Rules, BitLocker

See Appendix for full details

Page 42: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

E2010 HA/JBOD Storage Example
Single Site, 3 Node, 3 Copy DAG

DB1 DB1

DB1 DB2 DB3 DB4 DB5 DB6

DB7 DB8 DB9 DB10 DB11 DB12

DB13 DB14 DB15 DB16 DB17 DB18

DB19 DB20 DB21 DB22 DB23 DB24

DB25 DB26 DB27 DB28 DB29 DB30

Legend

Active copy Passive copy Spare Disk

DB1 DB1

DB1 DB2 DB3 DB4 DB5 DB6

DB7 DB8 DB9 DB10 DB11 DB12

DB13 DB14 DB15 DB16 DB17 DB18

DB19 DB20 DB21 DB22 DB23 DB24

DB25 DB26 DB27 DB28 DB29 DB30

DB1 DB1

DB1 DB2 DB3 DB4 DB5 DB6

DB7 DB8 DB9 DB10 DB11 DB12

DB13 DB14 DB15 DB16 DB17 DB18

DB19 DB20 DB21 DB22 DB23 DB24

DB25 DB26 DB27 DB28 DB29 DB30

Mbx Server 1

10,000 Mailboxes
3,333 Active Mailboxes/Server
3 Nodes, 3 Copies = double disk failure resiliency
Each server: 8 Cores, 32GB RAM
2GB Mailbox Size
.11 IOPS/Mailbox
1TB 7.2k disks (SAS/SATA/Tier2)
Online Spares
Battery Backed Caching Array Controller
Heavy Profile: 120 Messages/day
JBOD: 30 Disks/node
Database Availability Group (DAG)

Mbx Server 2 Mbx Server 3
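The figures on this slide are consistent with each other, which a back-of-envelope calculation shows. In the sketch below the database count of 30 is read off the diagram (DB1 to DB30), the ~5% annual failure rate comes from the earlier JBOD slide, and spare disks and database overhead (whitespace, dumpster, content index) are deliberately ignored. The function name is illustrative.

```python
def jbod_sizing(mailboxes=10_000, databases=30, copies=3, nodes=3,
                mailbox_gb=2, iops_per_mailbox=0.11, afr=0.05):
    """Rough JBOD sizing where 1 disk holds 1 database copy and its logs."""
    disks_total = databases * copies            # one disk per database copy
    mailboxes_per_db = mailboxes / databases
    return {
        "disks_per_node": disks_total // nodes,
        "mailboxes_per_db": mailboxes_per_db,
        "db_size_gb": mailboxes_per_db * mailbox_gb,        # mailbox data only
        "iops_per_db": mailboxes_per_db * iops_per_mailbox,
        "expected_disk_failures_per_year": disks_total * afr,
    }


for name, value in jbod_sizing().items():
    print(f"{name}: {value:.1f}")
```

Each database carries roughly 333 mailboxes, or about 667GB of mailbox data on a 1TB disk, and 90 data disks at ~5% AFR means a handful of disk failures per year should be planned for, which is where the online spares and scripted re-seed come in.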

Page 43: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Key Takeaways

Exchange Server 2010...
  Reduces DB IOPS by +70%...again!
  Optimizes for large mailboxes (+10GB) and 100K Item counts
  Optimizes for large/slow/low-cost disks (SATA/Tier2)
  Makes JBOD/RAID'less storage a viable option
  Enables unmatched storage flexibility to reduce costs

Page 44: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

question & answer

Page 45: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

www.microsoft.com/teched

Sessions On-Demand & Community

http://microsoft.com/technet

Resources for IT Professionals

http://microsoft.com/msdn

Resources for Developers


www.microsoft.com/learning

Microsoft Certification & Training Resources

Resources

Page 46: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Related Content

UNC314 – Information Protection and Control in Microsoft Exchange Server 2010
UNC315 – Federation in Microsoft Exchange Server 2010
UNC312 – Archiving and Retention in Microsoft Exchange Server 2010
UNC320 – Microsoft Exchange Server Outlook Web Access 2010: The Future of Web-Based E-mail
UNC317 – Microsoft Exchange Server 2010 Management Tools
UNC318 – Microsoft Exchange Server 2010 Transition and Deployment
UNC313 – High Availability in Microsoft Exchange Server 2010
UNC321 – Storage in Microsoft Exchange Server 2010
UNC324 – What's New in Exchange Web Services in Microsoft Exchange Server 2010
UNC319 – Unified Messaging in Microsoft Exchange Server 2010

Page 47: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Call to Action

Learn More!
  Related Content at TechEd on “Related Content” Slide
  Attend in-person or consume post-event at TechEd Online
  Check out online learning/training resources
    http://technet.microsoft.com/exchange/2010
    http://technet.microsoft.com/office/ocs

Try It Out!
  Download the Exchange Server 2010 Beta Evaluation
    http://www.microsoft.com/exchange/2010/try-it
  Get a 5-Day Trial of Office Communications Server 2007 R2
    https://r2.uctrial.com/

Page 48: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

appendix

Page 49: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reductions: Store Schema Elements

Element E2007 E2010

Physical Contiguity (ESE)

Poor physical contiguity of leaf pages. Hence many, small size, IOs (1 for each page)

Excellent physical contiguity of leaf pages. So fewer, large size IOs, spanning N pages (N ≈100)

Logical Contiguity (Store)

Headers for each folder kept in separate table. So many, small size, IOs spread over many tables

Headers for an entire mailbox kept in a single table. Hence fewer, large sized, IOs on a single table

Temporal Contiguity (View)

All views and indexes updated each time a mail is delivered. So many, small size, IOs spread over time

Views and indexes updated only when they are accessed by user. So fewer, large sized, IOs done together

How do you move from random IO to Sequential IO?

Page 50: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: Maintain Contiguity Over Time

Mailbox Messages

1. Delivery 2. Random Delete 3. Defragmentation

M1

M2

M3

M4

M5

M6

M7

M8

M9

M10

Mailbox Messages

M1

M3

M5

M7

M10

Mailbox Messages

M1

M2

M3

M4

M5

M6

M7

M8

M9

M10

Contiguous

Contiguous

FragmentedM11

M12

M13

M14

M15

New E2010 behavior…

Page 51: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

IOPS Reduction: Write IO Gap Coalescing

E2007 DB Write Behavior
  DB Cache: Page 1 (Dirty), Page 2 (Clean), Page 3 (Dirty), Page 4 (Clean), Page 5 (Dirty)
  Disk: 3 Write IOs

E2010 DB Write Behavior
  DB Cache: Page 1 (Dirty), Page 2 (Clean), Page 3 (Dirty), Page 4 (Clean), Page 5 (Dirty)
  Disk: 1 Write IO

Writes spaced out over time

Page 52: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Big IO: How Big is Too Big?

[Chart: Random DB IO Latency Based on Size; x-axis: IO Size (KB), 0 to 1024; y-axis: IO Latency (ms), 0 to 25; series: Write, Read]

SqlIO Test, 1x 750GB 7.2k SATA, no caching array controller

E2010 Max IO Size = 256KB for Read 384KB for Write

IO Latency increases with IO size

Page 53: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Optimize for SATA/Tier 2 Disks
Solution: Smooth DB Write IO

Throttle DB writes based on Checkpoint target (QoS)
• When Checkpoint Depth is between 1x and 1.24x of Checkpoint target, limit Max Outstanding DB writes/LUN to 1
• When Checkpoint Depth meets or exceeds 1.25x of Checkpoint target, ratchet up Max Outstanding DB writes/LUN
• The further behind on checkpoint, the more aggressively we raise the Max Outstanding DB writes/LUN (Maximum = 512/LUN)

Works for everything from JBOD SATA through RAID10 SAN
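The throttling rule can be sketched as a function from checkpoint depth to a write limit. The 1.25x threshold and the 512/LUN ceiling come from the slide; the doubling ramp and the treatment of depths below the target are assumptions made purely for illustration, since the slide does not give the actual curve ESE uses.

```python
MAX_OUTSTANDING_PER_LUN = 512   # ceiling stated on the slide
RATCHET_THRESHOLD = 1.25        # multiple of checkpoint target, from the slide


def max_outstanding_writes(depth_mb: float, target_mb: float) -> int:
    """Max outstanding DB writes per LUN for a given checkpoint depth.

    Below 1.25x of the target the limit stays at 1 (the slide covers
    1x to 1.24x; treating depths under 1x the same way is an assumption).
    Above that, this sketch doubles the limit for every further 0.25x the
    checkpoint falls behind. The doubling ramp is hypothetical.
    """
    ratio = depth_mb / target_mb
    if ratio < RATCHET_THRESHOLD:
        return 1
    steps = int((ratio - RATCHET_THRESHOLD) / 0.25) + 1
    return min(MAX_OUTSTANDING_PER_LUN, 2 ** steps)


for depth in (20, 24, 25, 30, 40, 200):
    print(depth, max_outstanding_writes(depth, target_mb=20))
```

The point of the shape is that writes trickle out one at a time while the checkpoint is close to target, which keeps DB read and log write latency low, and only burst when the checkpoint is falling well behind.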

[Chart: Max Outstanding DB Writes vs. Checkpoint Depth, 20MB Max Checkpoint example; x-axis: Log Checkpoint Depth (MB), 25.5 to 43.5; y-axis: Max Outstanding DB Writes, 0 to 40]

Page 54: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

JBOD/RAID'less Storage: Lost Flush Detection

What is a lost flush?
A DB write IO that the disk subsystem/OS returned as completed, but that did not actually get written to media or was written in the wrong location (aka lost write).

Why are they so bad?
Your database may be logically corrupt and you do not know it!

How can they be detected in E2010?
Two methods:

1. In Memory Flush Map (Active & Passive): memory overhead of 2 bits/page. Event ID 530 is fired when detected (-1119) and page can be patched.

2. Database Recovery: Event is fired (ID 516: timestamp mismatch, (-567)) and database must be re-seeded.
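The 2 bits/page overhead of the in-memory flush map is small even for the largest recommended database. A quick calculation, assuming the 32KB page size and the 2TB maximum recommended database size from earlier slides, and treating 2TB as 2 x 2^40 bytes:

```python
def flush_map_bytes(db_size_bytes: int,
                    page_size_bytes: int = 32 * 1024,
                    bits_per_page: int = 2) -> int:
    """Memory needed for a flush map that tracks every page in the database."""
    pages = db_size_bytes // page_size_bytes
    return pages * bits_per_page // 8


two_tb = 2 * 2**40                       # assumption: binary terabytes
overhead = flush_map_bytes(two_tb)
print(overhead // 2**20, "MiB")          # flush map size for a 2TB database
```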

Page 55: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Mailbox Server

Exchange 2010 High Availability

Simplified Mailbox High Availability and Disaster Recovery with New Unified Platform

Evolution of Continuous Replication technology (Database Mobility)
Easier than traditional clustering to deploy and manage
Allows each database to have 16 replicated copies
Provides full redundancy of Exchange roles on as few as two servers

DB1

DB3DB2

DB4DB5

Recover quickly from disk and database failures

Mailbox Server

DB1DB2

DB4DB5

DB3

Mailbox Server

DB1DB2

DB4DB5

DB3

Replicate databases to remote datacenter

San Jose New York

Page 56: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Client

DB2

DB3

DB2

DB3

DB4

DB4

DB5

CAS/HUB

Mailbox Server 1

Mailbox Server 2

Mailbox Server 3

Mailbox Server 6

Mailbox Server 4

AD site: Dallas

AD site: San Jose

Mailbox Server 5

DB5

DB2

DB3

DB4

DB5

DB1

DB3

DB5

DB1

DB1DB1

DB1

Database Availability Group (DAG)

E2010 High Availability Architecture

Page 57: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Mailbox Server Node 1

Mailbox Server Node 2

Database Availability Group (DAG)

Page1

Page2

Page3

Mailbox Server Node 3

1. Page corruption detected on DB Copy (e.g. -1018)

2. Passive copy pauses log replay (log copying continues)

3. Passive retrieves the corrupted page # from the active using DB seeding infrastructure

4. Passive copy waits till log file which meets max required generation requirement is copied/inspected, then patches page

DB1-Active

Database

Log

Page1

Page2

Page3

DB1-CopyA

Database

Log

Page1

Page2

Page3

DB1-CopyB

Database

Log

5. Passive resumes log replay

JBOD/RAID'less Storage: Single Page Restore Passive

Page 58: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

Exchange 2010 Storage Guidance

Configurations compared: Stand Alone | Database Availability Group: 2 nodes, 2 Database copies | Database Availability Group: 3+ nodes, 3+ Database copies

Storage Type
  Direct Attached Storage (DAS): Supported (all configurations)
  Storage Area Network (SAN): iSCSI: Supported (all configurations). Best Practice = Do not share physical disks backing Exchange data with other applications.
  Storage Area Network (SAN): Fibre Channel (FC)
    Stand Alone: Supported. Best Practice = Do not share physical disks backing Exchange data with other applications.
    DAG (2 copies) and DAG (3+ copies): Supported. Best Practice = Do not share physical disks backing Exchange data with other applications. Best Practice = Do not place both database copies on the same physical spindles.
  Network Attached Storage (NAS): SMB: Not Supported (all configurations)

Physical Disk Type
  SATA: Supported, requires battery backed caching array controller for data integrity (all configurations)
  SAS: Supported (all configurations)
  FC: Supported (all configurations)
  SSD (Flash Disk): Supported (all configurations)

Physical Disk Write Caching (enabled): Not Supported (all configurations)

Storage RAID
  RAID: RAID recommended (Stand Alone, DAG 2 copies) | RAID optional (DAG 3+ copies)
  EDB Volume: RAID5/6, RAID10, RAID1 (Stand Alone, DAG 2 copies) | JBOD, RAID5/6, RAID10, RAID1 (DAG 3+ copies)
  Log Volume: RAID1, RAID10 (Stand Alone, DAG 2 copies) | JBOD, RAID1, RAID10 (DAG 3+ copies)
  Disk Array RAID Stripe Size: 256KB (all configurations)
  Storage Array Cache Settings: 75% Write Cache, 25% Read Cache (with Battery Backed Cache) (all configurations)

Database/Log file placement
  Database/Log Isolation
    Stand Alone: Best Practice (for recoverability) = separate database file (.edb) and logs from same Database on to different volumes backed by different physical disks
    DAG (2 copies): Database file (.edb) and logs from same Database can share same volume and same physical disk.
    DAG (3+ copies): Database file (.edb) and logs from same Database can share same volume and same physical disk. This is a best practice for JBOD/RAID'less storage scenario where one or more volumes store the edb and log files backed by the same physical disk.
  Database Files/Volume
    Stand Alone, DAG (2 copies): Based on backup methodology
    DAG (3+ copies): RAID = based on backup methodology, JBOD = one DB file/volume is recommended
  Log Streams/Volume
    Stand Alone, DAG (2 copies): Based on backup methodology
    DAG (3+ copies): RAID = based on backup methodology, JBOD = one log stream/volume is recommended

Windows Disk Type
  Basic Disk: Recommended (all configurations)
  Dynamic Disk: Supported (all configurations)

Partition Type
  GUID Partition Table (GPT): Recommended (all configurations)
  Master Boot Record (MBR): Supported (all configurations)

Partition Alignment: Windows 2008 Default: 1MB (all configurations)

Partition Alignment Windows 2008 Default: 1MB Windows 2008 Default: 1MB Windows 2008 Default: 1MBVolume Path Drive Letter or Mount Point (mount point host

volume must be RAIDed)Drive Letter or Mount Point (mount point host volume must be RAIDed)

Drive Letter or Mount Point (mount point host volume must be RAIDed)

File System NTFS support only NTFS support only NTFS support onlyNTFS Defragmentation Not required, not recommended Not required, not recommended Not required, not recommendedNTFS Allocation Unit Size 64KB for both edb and log volumes 64KB for both edb and log volumes 64KB for both edb and log volumes

NTFS Compression Not Supported for Exchange Database files Not Supported for Exchange Database files Not Supported for Exchange Database files

NTFS Encrypted File System (EFS) Not Supported for Exchange Database files Not Supported for Exchange Database files Not Supported for Exchange Database files

Windows Bitlocker (volume encryption) Supported for all Exchange database and log files Supported for all Exchange database and log files Supported for all Exchange database and log files
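The alignment-related values in the guidance above (1MB partition alignment, 256KB RAID stripe size, 64KB NTFS allocation unit) fit together arithmetically, which the short check below illustrates. It is a sketch only: the helper `is_aligned` is a hypothetical name, and the 63-sector legacy offset is included as an outside example of a misaligned partition, not something stated on the slide.

```python
KB = 1024
MB = 1024 * KB

partition_offset = 1 * MB        # Windows 2008 default partition alignment
raid_stripe_size = 256 * KB      # disk array RAID stripe size
ntfs_allocation_unit = 64 * KB   # NTFS allocation unit for edb and log volumes


def is_aligned(offset, unit):
    """True when offset falls exactly on a multiple of unit."""
    return offset % unit == 0


# The 1MB offset starts the partition on a stripe boundary.
offset_on_stripe = is_aligned(partition_offset, raid_stripe_size)

# Allocation units tile each stripe evenly, so no unit straddles two stripes.
units_per_stripe = raid_stripe_size // ntfs_allocation_unit
units_tile_stripe = is_aligned(raid_stripe_size, ntfs_allocation_unit)

# Assumption for contrast: a legacy 63-sector offset with 512-byte sectors.
legacy_offset = 63 * 512
legacy_on_stripe = is_aligned(legacy_offset, raid_stripe_size)
```

With these values the 1MB offset lands on a stripe boundary and four allocation units fit in each stripe, while the legacy offset does not land on a stripe boundary.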

Preliminary Storage Guidance: Subject to Change!
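The EDB Volume and Log Volume rows of the guidance can be expressed as a small lookup for checking a proposed layout. This is a minimal sketch, assuming the slide's "RAID5/6" means both RAID5 and RAID6; the configuration keys and the function `check_layout` are hypothetical names introduced here, and the guidance itself is marked preliminary.

```python
# Layouts listed for the EDB volume, per configuration column of the slide.
EDB_VOLUME = {
    "stand_alone": {"RAID5", "RAID6", "RAID10", "RAID1"},
    "dag_2_nodes_2_copies": {"RAID5", "RAID6", "RAID10", "RAID1"},
    "dag_3plus_nodes_3plus_copies": {"JBOD", "RAID5", "RAID6", "RAID10", "RAID1"},
}

# Layouts listed for the log volume, per configuration column of the slide.
LOG_VOLUME = {
    "stand_alone": {"RAID1", "RAID10"},
    "dag_2_nodes_2_copies": {"RAID1", "RAID10"},
    "dag_3plus_nodes_3plus_copies": {"JBOD", "RAID1", "RAID10"},
}


def check_layout(configuration, edb_layout, log_layout):
    """Return a list of problems; an empty list means the layout is listed."""
    if configuration not in EDB_VOLUME:
        raise ValueError(f"unknown configuration: {configuration}")
    problems = []
    if edb_layout not in EDB_VOLUME[configuration]:
        problems.append(f"EDB volume: {edb_layout} is not listed for {configuration}")
    if log_layout not in LOG_VOLUME[configuration]:
        problems.append(f"Log volume: {log_layout} is not listed for {configuration}")
    return problems


jbod_three_copies = check_layout("dag_3plus_nodes_3plus_copies", "JBOD", "JBOD")
jbod_stand_alone = check_layout("stand_alone", "JBOD", "RAID5")
```

Here `jbod_three_copies` comes back empty, while `jbod_stand_alone` reports both volumes, since JBOD appears only in the 3+ nodes, 3+ copies column and RAID5 is not listed for log volumes in any column.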

Page 60: Matt Gossage Senior Program Manager Microsoft Corporation UNC321.

© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.