BARC: Microsoft Bay Area Research Center, 6/20/97 (PUBLIC VERSION)

Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen

http://www.research.Microsoft.com/barc/

Telepresence

• The next killer app

• Space shifting:

»Reduce travel

• Time shifting:

»Retrospective

»Offer condensations

»Just in time meetings.

• Example: ACM 97

»NetShow and Web site.

»More web visitors than attendees

• People-to-People communication

Telepresent: Scalable Reliable Multicast (Jim Gemmell)

Outline

• What Reliable Multicast is & why it is hard to scale

• Fcast file transfer

• ECSRM

• Layered Telepresentations

Multiple Unicast

• Sender must repeat

• Link sees repeats

IP Multicast

• Pruned broadcast

• Unreliable

Reliable Multicast

• Difficult to scale:

»Sender state explosion

»Message implosion

State: receiver 1, receiver 2, …, receiver n

Receiver-Reliable

• Receiver’s job to NACK

State: receiver 1, receiver 2, …, receiver n

SRM Approaches

• Hierarchy / local recovery

• Forward Error Correction (FEC)

• Suppression

• *HYBRID*

_________________________________

• Fcast is FEC only

• ECSRM is suppression + FEC

(n,k) linear block encoding

Original packets: 1, 2, …, k

Encode (copy 1st k): 1, 2, …, k, k+1, k+2, …, n

Decode (take any k): 1, 2, …, k (original packets)
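The encode/decode picture above can be sketched in a few lines. This is a minimal illustration, not the codec Fcast or ECSRM actually uses: it assumes a polynomial code over the prime field of size 257, and the names `P`, `encode`, `decode`, and `_interpolate` are hypothetical. The first k output symbols are copies of the originals, and any k of the n symbols are enough to rebuild them.

```python
P = 257  # prime modulus (an assumption of this sketch); byte values 0..255 fit in the field


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat's little theorem)
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total


def encode(originals, n):
    """Return n symbols: the k originals (copied) followed by n - k parity symbols."""
    k = len(originals)
    assert k <= n < P
    points = list(zip(range(1, k + 1), originals))
    return list(originals) + [_interpolate(points, x) for x in range(k + 1, n + 1)]


def decode(received, k):
    """received maps 1-based packet index to symbol; any k entries recover the originals."""
    points = sorted(received.items())[:k]
    assert len(points) == k, "need at least k packets"
    return [_interpolate(points, x) for x in range(1, k + 1)]


originals = [10, 200, 33, 7]          # k = 4
coded = encode(originals, 7)          # n = 7
# Lose packets 1, 3 and 4; packets 2, 5, 6 and 7 still suffice.
survivors = {i: coded[i - 1] for i in (2, 5, 6, 7)}
recovered = decode(survivors, 4)
```

Real packets would apply the same arithmetic to each byte position across the k packets of a group.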

Fcast

• File transfer protocol

• FEC-only

• Files transmitted in parallel

Fcast send order

[Figure: send-order grid. Each row holds packets 1, 2, …, k, k+1, k+2, …, n; rows are grouped into File 1 and File 2; X marks a lost packet.]

Need k from each row

Fcast reception

[Figure: timeline (time axis). Sender: Files + FEC, repeated. Low loss receiver: join, leave. High loss receiver: join, leave.]

Fcast demo

(Demo shortcuts: Fcast Receive.lnk, FcastSend.lnk)

ECSRM - Erasure Correcting SRM

• Combines:

» suppression

» erasure correction

Suppression

• Delay a NACK or repair in the hopes that someone else will do it.

• NACKs are multicast

• After NACKing, re-set timer and wait for repair

• If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
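The suppression rules above can be illustrated with a toy model. This is a sketch under simplifying assumptions that are not from the slides: every receiver lost the same packet, each holds a NACK timer (in milliseconds), and a multicast NACK reaches everyone after a fixed propagation delay. The function name `nacks_sent` is hypothetical.

```python
def nacks_sent(timers_ms, propagation_ms):
    """Count the receivers that actually multicast a NACK.

    The receiver with the shortest timer sends first. Any receiver whose
    timer fires before that NACK arrives also sends; everyone else hears
    the NACK in time and is suppressed.
    """
    first = min(timers_ms)
    heard_at = first + propagation_ms
    return sum(1 for t in timers_ms if t == first or t < heard_at)


# Four receivers lost the same packet; timers are 500, 200, 900 and 250 ms.
sent = nacks_sent([500, 200, 900, 250], propagation_ms=100)  # 200 ms and 250 ms fire
```

Spreading the timers over a window that is wide compared with the propagation delay is what keeps the count small.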

ECSRM - adding FEC to suppression

• Assign each packet to an EC group of size k

• NACK: (group, # missing)

• NACK of (g,c) suppresses all (g,x) with x ≤ c.

• Don’t re-send originals; send EC packets using (n,k) encoding
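A minimal sketch of the NACK rule above, assuming a NACK is represented as a `(group, missing_count)` tuple; the representation and the function name `is_suppressed` are illustrative, not taken from the ECSRM implementation.

```python
def is_suppressed(pending, heard):
    """True if a pending NACK need not be sent after hearing another NACK.

    Both arguments are (group, missing_count) tuples. A heard NACK for the
    same group that asks for at least as many EC packets covers the pending one.
    """
    pending_group, pending_missing = pending
    heard_group, heard_missing = heard
    return pending_group == heard_group and pending_missing <= heard_missing


covered = is_suppressed(pending=(1, 1), heard=(1, 2))      # same group, fewer missing
not_covered = is_suppressed(pending=(1, 3), heard=(1, 2))  # needs more than was asked for
```

Because a repair is an EC packet rather than a specific original, one request for c packets helps every receiver in the group that is missing c or fewer.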

Example: EC group size (k) = 7

[Figure: columns of packets 1 to 7 (one EC group), with X marking lost packets.]

…example

NACK: Group 1, 1 lost

NACKs suppressed

…example

Erasure correcting packet

…example: summary

• Normal suppression needs:

»7 NACKs, 7 repairs

• ECSRM requires:

»1 NACK, 1 repair

• Large group: each packet lost by someone

»Without FEC, 1/2 of traffic is repairs

»With ECSRM, only 1/8 of traffic is repairs

»NACK traffic reduced by factor of 7
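The fractions in this summary follow from simple counting, sketched here for a group of k = 7 packets where every packet is lost by some receiver. The helper name `repair_fraction` is illustrative.

```python
from fractions import Fraction


def repair_fraction(originals, repairs):
    """Share of all transmitted packets that are repairs."""
    return Fraction(repairs, originals + repairs)


k = 7
# Without FEC every original must be re-sent: k originals + k repairs.
without_fec = repair_fraction(k, k)
# With ECSRM one EC packet repairs a single loss at every receiver: k originals + 1 repair.
with_ecsrm = repair_fraction(k, 1)
# NACKs drop from one per lost packet (k) to one per group.
nack_reduction = Fraction(k, 1)
```

With these inputs `without_fec` is 1/2 and `with_ecsrm` is 1/8, matching the slide.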

Simulation: 112 receivers

[Figure: packets received (0 to 60) vs. time (5 to 25); series: SRM, ECSRM.]

Simulation: 112 receivers

[Figure: NACKs received (0 to 50) vs. time (5 to 20); series: SRM, ECSRM.]

Multicast PowerPoint Add-in

Slides, annotations, control information: ECSRM

Slide master: Fcast

Multicast PowerPoint - Late Joiners

• Viewers joining late don’t impact others with session persistent data (slide master)

[Figure: timeline (time axis). Fcast: join, leave. ECSRM: join.]

Future Work

• Adding hierarchy (e.g. PGM by Cisco)

• Do we need 2 protocols?

RAGS: RAndom SQL test Generator

• Microsoft spends a LOT of money on testing. (60% of development according to one source).

• Idea: test SQL by

» generating random correct queries

» executing queries against database

» comparing results with SQL 6.5, DB2, Oracle, Sybase

• Being used in SQL 7.0 testing.

» 375 unique bugs found (since 2/97)

» Very productive test tool
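The generate, execute, and compare loop can be sketched as follows. This is not the RAGS tool: it assumes Python's built-in `sqlite3` as a stand-in for the database systems, a hypothetical three-column `titles` table, and a far smaller grammar than the real generator. All function names are illustrative.

```python
import random
import sqlite3

COLUMNS = ["price", "advance", "royalty"]  # hypothetical numeric columns


def random_expr(rng, depth):
    """Build a random, syntactically valid arithmetic expression."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(COLUMNS + [str(rng.randint(-9, 9))])
    op = rng.choice(["+", "-", "*"])
    return f"({random_expr(rng, depth - 1)} {op} {random_expr(rng, depth - 1)})"


def random_select(rng, depth=3):
    """Build a random SELECT with a random predicate."""
    cmp_op = rng.choice(["<", "<=", "=", ">=", ">"])
    return (f"SELECT {random_expr(rng, depth)} FROM titles "
            f"WHERE {random_expr(rng, depth)} {cmp_op} {random_expr(rng, depth)}")


def make_db():
    """Create one system under test with a small fixed data set."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE titles (price INTEGER, advance INTEGER, royalty INTEGER)")
    db.executemany("INSERT INTO titles VALUES (?, ?, ?)",
                   [(20, 5000, 10), (12, 0, 24), (3, 800, 12)])
    return db


def run(db, sql):
    """Return the sorted result rows, or an error marker if the statement fails."""
    try:
        return sorted(db.execute(sql).fetchall())
    except sqlite3.Error as exc:
        return ("error", type(exc).__name__)


def find_suspects(db_a, db_b, count, seed=0):
    """Run the same random statements on two systems; keep the ones that disagree."""
    rng = random.Random(seed)
    suspects = []
    for _ in range(count):
        sql = random_select(rng)
        if run(db_a, sql) != run(db_b, sql):
            suspects.append(sql)
    return suspects


suspects = find_suspects(make_db(), make_db(), count=50)
```

Two copies of the same engine agree on every statement, so `suspects` is empty here; pointing `db_b` at a different engine or build is what would surface disagreements worth examining by hand.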

Sample Rags Generated Statement

SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (

SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (

SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )

This statement yields an error: SQLState=37000, Error=8623 Internal Query Processor Error: Query processor could not produce a query plan.

Automation

• Simpler Statement with same error

SELECT roysched.royalty

FROM titles, roysched

WHERE EXISTS (

SELECT DISTINCT TOP 1 titles.advance

FROM sales

ORDER BY 1)

• Control statement attributes

» complexity, kind, depth, ...

• Multi-user stress tests

» tests concurrency, allocation, recovery

One 4-Vendor RAGS Test: 3 of them vs Us

• 60 k Selects on MSS, DB2, Oracle, Sybase.

• 17 SQL Server Beta 2 suspects: 1 suspect per 3350 statements.

• Examine 10 suspects, filed 4 Bugs! One duplicate. Assume 3/10 are new

• Note: This is the SS Beta 2 Product. Quality rising fast (and RAGS sees that)

RAGS Next Steps

• Done:

»Patents,

»Papers,

»talks,

» tech transfer to development

• Next steps:

»Extend to other parts of SQL and Tsql

»“Crawl” the config space (look for new holes)

»Apply ideas to other domains (ole db).

Scaleup - Big Database

• Build a 1 TB SQL Server database

» Show off Windows NT and SQL Server scalability

» Stress test the product

• Data must be

» 1 TB

» Unencumbered

» Interesting to everyone everywhere

» And not offensive to anyone anywhere

• Loaded

» 1.1 M place names from Encarta World Atlas

» 1 M Sq Km from USGS (1 meter resolution)

» 2 M Sq Km from Russian Space agency (2 m)

• Will be on web (world’s largest atlas)

• Sell images with commerce server.

• USGS CRDA: 3 TB more coming.

The System

• DEC Alpha + 8400

• 324 StorageWorks Drives (2.9 TB)

• SQL Server 7.0

• USGS 1-meter data (30% of US)

• Russian Space data: two-meter resolution images (SPIN-2)

World’s Largest PC!

• 324 disks (2.9 terabytes)

• 8 x 440Mhz Alpha CPUs

• 10 GB DRAM

Hardware

1 TB Database Server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks disks, 10-drive tape library (STC Timber Wolf DLT7000)

[Figure: hardware diagram. Labels: AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM); Enterprise Storage Array (groups of 48 9 GB drives); STC 9740 DLT tape library; 100 Mbps Ethernet switch; DS3; Internet; map server; site servers; SPIN-2.]

Software

[Figure: software architecture diagram. Labels: Web client (browser, HTML, Java viewer); the Internet; Terra-Server Web Site (Internet Information Server 4.0, Active Server Pages, MTS, Image Server); Terra-Server DB (Sphinx (SQL Server), Terra-Server stored procedures); Automap Server (Microsoft Automap ActiveX Server, Internet Info Server 4.0); Image Provider Site(s) (Image Delivery Application, SQL Server 7, Microsoft Site Server EE, Internet Information Server 4.0).]

System Management & Maintenance

• Backup and Recovery

» STC 9717 Tape robot

» Legato NetWorker™

» Sphinx Backup/Restore Utility

» Clocked at 80 MBps (peak) (~ 200 GB/hr)

• SQL Server Enterprise Mgr

» DBA Maintenance

» SQL Performance Monitor

TerraServer File Group Layout

• Convert 324 disks to 28 RAID5 sets plus 28 spare drives

• Make 4 NT volumes (RAID 50), 595 GB per volume

• Build 30 20GB files on each volume

• DB is File Group of 120 files

[Figure: seven HSZ70 A/B controller pairs; NT volumes E:, F:, G:, H:.]

Demo

Http://TerraWeb2

Technical Challenge: Key idea

• Problem: Geo-Spatial Search without geo-spatial access methods (just standard SQL Server).

• Solution: Geo-spatial search key:

» Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)

» Z-transform X & Y into single Z value, build B-tree on Z

» Adjacent images stored next to each other

• Search Method: Latitude and Longitude => X, Y, then Z

» Select on matching Z value
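A minimal sketch of the search key, assuming that "Z-transform" means interleaving the bits of X and Y (a Z-order, or Morton, code). The coordinate offsets, the 16-bit width, and the function names are assumptions of this sketch; the slides do not specify how TerraServer numbers its cells.

```python
def cell(lat, lon):
    """Map a position to grid cell (X, Y): 1/48th degree of longitude by 1/96th of latitude."""
    x = int((lon + 180.0) * 48)  # offset so X is non-negative (assumption)
    y = int((lat + 90.0) * 96)   # offset so Y is non-negative (assumption)
    return x, y


def z_value(x, y, bits=16):
    """Interleave the bits of x and y: x fills the even bit positions, y the odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


x, y = cell(lat=0.0, lon=0.0)
key = z_value(x, y)  # the value an ordinary B-tree index would be built on
```

Cells that are close in both X and Y share their high-order interleaved bits, so they tend to land near each other in the B-tree, which is the "adjacent images stored next to each other" property.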

Now What?

• New Since S-Day:

» More data: 4.8 TB USGS DOQ, .5 TB Russian

» Bigger Server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB Disk

» Improved Application: Better UI, Uses ASP, Commerce App

• Cut images and Load Sept & Feb

• Built Commerce App for USGS & Spin-2

• Launch at Fed Scalability Day, SQL 7 Beta 3 (6/24/98)

• Operate on Internet for 18 months

• Add more data (double)

• Working with Sloan Digital Sky Survey

» 40 TB of images

» 3 TB of “objects”

NT Clusters (Wolfpack)

• Scale DOWN to PDA: WindowsCE

• Scale UP an SMP: TerraServer

• Scale OUT with a cluster of machines

• Single-system image

»Naming

»Protection/security

»Management/load balance

• Fault tolerance

»“Wolfpack”

• Hot pluggable hardware & software

Symmetric Virtual Server Failover Example

[Figure: failover diagram. Labels: Browser; Server 1; Server 2; Web site; Database; Web site files; Database files.]

Clusters & BackOffice

• Research: Instant & Transparent failover

• Making BackOffice PlugNPlay on Wolfpack

»Automatic install & configure

• Virtual Server concept makes it easy

»simpler management concept

»simpler context/state migration

»transparent to applications

• SQL 6.5E & 7.0 Failover

• MSMQ (queues), MTS (transactions).

1.2 B tpd

• 1 B tpd ran for 24 hrs.

• Out-of-the-box software

• Off-the-shelf hardware

• AMAZING!

• Sized for 30 days

• Linear growth

• 5 micro-dollars per transaction
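To put the transactions-per-day (tpd) figures in more familiar units, here is a small worked conversion. It only restates the slide's numbers; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(transactions_per_day):
    """Average sustained transaction rate in transactions per second."""
    return transactions_per_day / SECONDS_PER_DAY


def daily_cost_dollars(transactions_per_day, micro_dollars_per_transaction):
    """Total cost of one day's transactions at the quoted per-transaction price."""
    return transactions_per_day * micro_dollars_per_transaction / 1_000_000


rate = per_second(1_000_000_000)             # roughly 11,600 transactions per second
cost = daily_cost_dollars(1_000_000_000, 5)  # 5000.0 dollars for a 1 B transaction day
```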

The Memory Hierarchy

• Measured & Modeled Sequential IO

• Where is the bottleneck?

• How does it scale with

»SMP, RAID, new interconnects

• Goals:

» balanced bottlenecks

» Low overhead

» Scale many processors (10s)

» Scale many disks (100s)

[Figure: data path. Labels: App address space; File cache; Memory; Mem bus; PCI; Adapter; SCSI; Controller.]

Sequential IO: your mileage will vary

[Figure: MB/sec (0 to 30) vs. transfer size (2 to 128 KB); series: 4 disk read, 4 disk write, 1 disk read, 1 disk write. Caption: Striping helps; controller is bottleneck.]

40 MB/sec Advertised UW SCSI

35r-23w MB/sec Actual disk transfer

29r-17w MB/sec 64 KB request (NTFS)

9 MB/sec Single disk media

3 MB/sec 2 KB request (SQL Server)

• Measured hardware & Software

• Find software fixes..

• “out of the box” 1/2 power point: 50% of peak power “out of the box”

[Figure: MB/sec (0 to 10) vs. transfer size (2 to 128 KB); series: 1 disk read, 1 disk write, 1 disk read (NTFS buffer), 1 disk write (NTFS buffer). Caption: NTFS read is good at 8 KB, but writes are uniformly slow.]

PAP (peak advertised Performance) vs RAP (real application performance)

• Goal: RAP = PAP / 2 (the half-power point)

• http://research.Microsoft.com/BARC/Sequential_IO/
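The half-power goal is a one-line check. The sketch below applies it to the sequential IO figures quoted earlier in the deck (40 MB/sec advertised UW SCSI; 29 and 17 MB/sec for 64 KB NTFS reads and writes; 3 MB/sec for 2 KB requests). Pairing those particular numbers is an assumption made for illustration, and the function name is hypothetical.

```python
def meets_half_power(rap_mbps, pap_mbps):
    """True if real application performance reaches at least half of the advertised peak."""
    return rap_mbps >= pap_mbps / 2


PAP = 40  # MB/sec, advertised UW SCSI
ntfs_read_ok = meets_half_power(29, PAP)      # 64 KB NTFS reads
ntfs_write_ok = meets_half_power(17, PAP)     # 64 KB NTFS writes
small_request_ok = meets_half_power(3, PAP)   # 2 KB requests
```

Under that pairing only the 64 KB read case clears the 20 MB/sec half-power line.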

Disk Bottleneck Analysis

• NTFS Read/Write 12 disk, 4 SCSI, 2 PCI (not measured, we had only one PCI bus available, 2nd one was “internal”)

~ 120 MBps Unbuffered read

~ 80 MBps Unbuffered write

~ 40 MBps Buffered read

~ 35 MBps Buffered write

[Figure: data path. Memory Read/Write ~150 MBps; PCI ~70 MBps; Adapter ~30 MBps (multiple adapters per PCI bus); 120 MBps.]
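The bottleneck reasoning can be sketched as taking the minimum over the stages of the data path. The per-stage rates are the ones on the slide; treating 4 adapters and 2 PCI buses as adding linearly is an assumption of this sketch, and the names are illustrative.

```python
def bottleneck(stages):
    """Return (name, MBps) of the slowest stage; it caps end-to-end throughput."""
    return min(stages.items(), key=lambda item: item[1])


stages = {
    "memory": 150,        # ~150 MBps memory read/write
    "pci": 2 * 70,        # 2 PCI buses at ~70 MBps each
    "adapters": 4 * 30,   # 4 SCSI adapters at ~30 MBps each
}
name, mbps = bottleneck(stages)
```

With these inputs the adapters limit the system at 120 MBps, which lines up with the ~120 MBps unbuffered read figure.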

Penny Sort Ground Rules

http://research.microsoft.com/barc/SortBenchmark

• How much can you sort for a penny?

» Hardware and Software cost

» Depreciated over 3 years

» 1M$ system gets about 1 second,

» 1K$ system gets about 1,000 seconds.

» Time (seconds) = 946,080 / SystemPrice ($)

• Input and output are disk resident

• Input is

» 100-byte records (random data)

» key is first 10 bytes.

• Must create output file and fill with sorted version of input file.

• Daytona (product) and Indy (special) categories
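The time budget follows from the depreciation rule: a system depreciated over 3 years costs price / (3 years in seconds) dollars per second, so one cent buys 946,080 / price seconds. A short worked version, with illustrative names:

```python
SECONDS_IN_3_YEARS = 3 * 365 * 24 * 60 * 60  # 94,608,000
PENNY = 0.01                                 # dollars


def penny_seconds(system_price_dollars):
    """Seconds of machine time that one penny buys on a system of the given price."""
    return PENNY * SECONDS_IN_3_YEARS / system_price_dollars


million_dollar_budget = penny_seconds(1_000_000)  # about 1 second
thousand_dollar_budget = penny_seconds(1_000)     # about 1,000 seconds
```

The cheaper the machine, the longer it may run, which is why the benchmark favors commodity hardware.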

PennySort

• Hardware

» 266 Mhz Intel P2

» 64 MB SDRAM (10ns)

» Dual UDMA 3.2GB EIDE disk

• Software

» NT workstation 4.3

» NT 5 sort

• Performance

» sort 15 M 100-byte records (~1.5 GB)

» Disk to disk

» elapsed time 820 sec

» cpu time = 404 sec or 100 sec

PennySort Machine ($1107) cost breakdown: cpu 32%; Disk 25%; Other 22% (Cabinet + Assembly 7%; Network, Video, floppy 9%; Software 6%); board 13%; Memory 8%

Networking: BIG!! Changes coming!

• Technology

» 10 GBps bus “now”

» 1 Gbps links “now”

» 1 Tbps links in 10 years

» Fast & cheap switches

• Standard interconnects

» processor-processor

» processor-device (=processor)

• Deregulation WILL work someday

• CHALLENGE

» reduce software tax on messages

» Today 30 K ins + 10 ins/byte

» Goal: 1 K ins + .01 ins/byte

• Best bet:

» SAN/VIA

» Smart NICs

» Special protocol

» User-Level Net IO (like disk)
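The "software tax" figures on this slide (30 K instructions + 10 instructions/byte today, 1 K instructions + .01 instructions/byte as the goal) amount to a linear cost model per message. A small sketch; the 8 KB message size and the function name are assumptions chosen for illustration.

```python
def instructions_per_message(fixed, per_byte, size_bytes):
    """Linear cost model: a fixed per-message overhead plus a per-byte cost."""
    return fixed + per_byte * size_bytes


SIZE = 8 * 1024  # an 8 KB message (assumed size)
today = instructions_per_message(30_000, 10, SIZE)  # 111,920 instructions
goal = instructions_per_message(1_000, 0.01, SIZE)  # a little under 1,100 instructions
improvement = today / goal                           # roughly a hundredfold
```

For small messages the fixed term dominates, so the drop from 30 K to 1 K instructions matters most; for large messages the per-byte term does.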

What if Networking Was as Cheap As Disk IO?

• TCP/IP

»Unix/NT 100% cpu @ 40MBps

• Disk

»Unix/NT 8% cpu @ 40MBps

Why the Difference?

• Host Bus Adapter does SCSI packetizing, checksum, … flow control, DMA

• Host does TCP/IP packetizing, checksum, … flow control, small buffers

The Promise of SAN/VIA: 10x better in 2 years

http://www.viarch.org/

• Today:

» wires are 10 MBps (100 Mbps Ethernet)

» ~20 MBps tcp/ip saturates 2 cpus

» round-trip latency is ~300 us

• In University & NT lab

» wires are 1 Gbps Ethernet, ServerNet,…

» Fast user-level communication

» tcp/ip ~ 100 MBps, 10% of each processor

» round-trip latency is 15 us

Public Service

• Gordon Bell

» Computer Museum

» Vanguard Group

» Edits column in CACM

• Jim Gray

» National Research Council Computer Science and Telecommunications Board

» Presidential Advisory Committee on NGI-IT-HPPC

» Edit Journals & Conferences.

• Tom Barclay

» USGS and Russian cooperative research

BARC: Microsoft Bay Area Research Center

Tom Barclay, Gordon Bell, Joe Barrera, Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen

http://www.research.Microsoft.com/barc/