
1

BARC

BARC: Microsoft Bay Area Research Center, 6/20/97 (PUBLIC VERSION)

Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen

http://www.research.Microsoft.com/barc/

2

Telepresence

• The next killer app

• Space shifting:

»Reduce travel

• Time shifting:

»Retrospective

»Offer condensations

»Just in time meetings.

• Example: ACM 97

»NetShow and Web site.

»More web visitors than attendees

• People-to-People communication

3

Telepresent Jim Gemmell: Scaleable Reliable Multicast

Outline

• What Reliable Multicast is & why it is hard to scale

• Fcast file transfer

• ECSRM

• Layered Telepresentations

4

Multiple Unicast

• Sender must repeat

• Link sees repeats

5

IP Multicast

• Pruned broadcast

• Unreliable

6

Reliable Multicast

• Difficult to scale:

»Sender state explosion

»Message implosion

State: receiver 1, receiver 2, …, receiver n

7

Receiver-Reliable

• Receiver’s job to NACK

State: receiver 1, receiver 2, …, receiver n

8

SRM Approaches

• Hierarchy / local recovery

• Forward Error Correction (FEC)

• Suppression

• *HYBRID*

_________________________________

• Fcast is FEC only

• ECSRM is suppression + FEC

9

(n,k) linear block encoding

[Diagram: original packets 1..k are encoded into packets 1..n, where the first k are copies of the originals; the decoder can take any k of the n packets and reconstruct the original k.]
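To make the "take any k" idea concrete, here is a minimal sketch using the simplest possible erasure code: a single XOR parity packet per group, so n = k+1 and any one lost packet can be rebuilt. A real (n,k) encoder of the kind this slide implies (e.g. Reed-Solomon) tolerates up to n-k losses per group; the names and packet size below are illustrative.

/* Illustration only: a (k+1, k) XOR parity code. Any ONE missing packet
   in a group of k can be rebuilt from the survivors plus the parity packet.
   A real (n,k) code (e.g. Reed-Solomon) tolerates up to n-k losses. */
#include <string.h>

#define PKT_SIZE 1024                          /* illustrative packet size */

/* parity = pkt[0] XOR pkt[1] XOR ... XOR pkt[k-1] */
void make_parity(unsigned char pkts[][PKT_SIZE], int k,
                 unsigned char parity[PKT_SIZE])
{
    int i, j;
    memset(parity, 0, PKT_SIZE);
    for (i = 0; i < k; i++)
        for (j = 0; j < PKT_SIZE; j++)
            parity[j] ^= pkts[i][j];
}

/* Rebuild the single missing packet: XOR the parity with the k-1 packets
   that did arrive. */
void rebuild_missing(unsigned char pkts[][PKT_SIZE], int k, int missing,
                     const unsigned char parity[PKT_SIZE])
{
    int i, j;
    memcpy(pkts[missing], parity, PKT_SIZE);
    for (i = 0; i < k; i++) {
        if (i == missing)
            continue;
        for (j = 0; j < PKT_SIZE; j++)
            pkts[missing][j] ^= pkts[i][j];
    }
}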

10

Fcast

• File transfer protocol

• FEC-only

• Files transmitted in parallel

11

Fcast send order

[Diagram: each file (File 1, File 2, …) is shown as a row of original packets 1..k followed by FEC packets k+1..n; losses are marked X, and the receiver needs any k packets from each row.]
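One plausible reading of the send order pictured above, as a sketch: the sender walks the files round-robin, sending packet index 1 of every file, then index 2, and so on through the FEC packets, so a burst of loss is spread across files and a receiver still needs only any k packets per file. The function and parameter names are hypothetical, not taken from the Fcast implementation.

/* Hypothetical round-robin send loop (names are illustrative).
   num_files files, each encoded into n packets: k originals plus n-k FEC.
   Sending "column by column" spreads loss across files; a receiver is done
   as soon as it holds any k packets of every file. */
void send_round_robin(int num_files, int n,
                      void (*send_packet)(int file, int index))
{
    int index, file;
    for (index = 0; index < n; index++)          /* 1..k, then k+1..n */
        for (file = 0; file < num_files; file++)
            send_packet(file, index);
}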

12

Fcast reception

[Diagram: the sender transmits Files + FEC continuously over time; a low-loss receiver joins and leaves after a short interval, while a high-loss receiver stays joined longer.]

13

Fcast demo

FcastReceive.lnk   FcastSend.lnk

14

ECSRM - Erasure Correcting SRM

• Combines:

» suppression

» erasure correction

15

Suppression

• Delay a NACK or repair in the hopes that someone else will do it.

• NACKs are multicast

• After NACKing, re-set timer and wait for repair

• If you hear a NACK that you were waiting to send, then re-set your timer as if you did send it.
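A rough sketch of the timer logic these bullets describe, assuming a random back-off per missing packet; the type, function names, and back-off routine are illustrative, not taken from a particular SRM implementation.

/* Illustrative NACK-suppression timer logic (names are assumptions).
   On detecting a loss, arm a random timer instead of NACKing immediately.
   If another receiver's multicast NACK for the same packet arrives first,
   restart the timer as if we had sent the NACK ourselves, and wait for
   the repair. */
typedef struct {
    int    packet_id;          /* the packet we believe is missing     */
    double fire_time;          /* when our own NACK would be sent      */
} nack_timer;

double now(void);              /* current time, in seconds             */
double random_backoff(void);   /* e.g. uniform over [T1, T1 + T2)      */
void   multicast_nack(int packet_id);

void on_loss_detected(nack_timer *t, int packet_id)
{
    t->packet_id = packet_id;
    t->fire_time = now() + random_backoff();
}

void on_timer_fired(nack_timer *t)
{
    multicast_nack(t->packet_id);                 /* NACKs are multicast  */
    t->fire_time = now() + random_backoff();      /* re-arm, await repair */
}

void on_nack_heard(nack_timer *t, int packet_id)
{
    if (t->packet_id == packet_id)                /* someone beat us to it */
        t->fire_time = now() + random_backoff();  /* act as if we sent it  */
}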

16

ECSRM - adding FEC to suppression

• Assign each packet to an EC group of size k

• NACK: (group, # missing)

• NACK of (g, c) suppresses all (g, x) with x ≤ c.

• Don’t re-send originals; send EC packets using (n,k) encoding
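A minimal sketch of the suppression test the bullets above imply: a pending NACK for x missing packets in group g is suppressed by any heard NACK (g, c) with c ≥ x, because the c erasure-correcting packets the sender will emit for that group also cover the x losses here. The names below are illustrative.

/* Illustrative ECSRM NACK bookkeeping (names are assumptions).
   A NACK carries (group, number of packets missing in that group).
   Hearing NACK (g, c) suppresses a pending NACK (g, x) whenever x <= c:
   the c fresh (n,k) EC packets the sender will transmit for group g can
   repair any x <= c losses at this receiver. */
typedef struct {
    int group;      /* EC group number                        */
    int missing;    /* how many packets of the group are lost */
} ecsrm_nack;

int nack_is_suppressed(const ecsrm_nack *pending, const ecsrm_nack *heard)
{
    return pending->group == heard->group &&
           pending->missing <= heard->missing;
}

/* Sender side: never retransmit originals; on NACK (g, c), send c
   previously unsent EC packets (indices k+1..n) for group g. */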

17

Example: EC group size (k) = 7

[Diagram: several receivers each hold packets 1-7 of an EC group (k = 7); each receiver has lost a different packet, marked X.]

18

…example

NACK: Group 1, 1 lost

NACKs suppressed

19

…example

Erasure correcting packet

20

…example: summary

• Normal suppression needs:

»7 NACKs, 7 repairs

• ECSRM requires:

»1 NACK, 1 repair

• Large group: each packet lost by someone

»Without FEC, 1/2 of traffic is repairs

»With ECSRM, only 1/8 of traffic is repairs

»NACK traffic reduced by factor of 7

21

Simulation: 112 receivers

[Chart: packets received (0-60) vs. time (5-25) for SRM and ECSRM.]

22

Simulation: 112 receivers

[Chart: NACKs received (0-50) vs. time (5-20) for SRM and ECSRM.]

23

Multicast PowerPoint Add-in

[Diagram: slides, annotations, and control information are carried over ECSRM; the slide master is carried over Fcast.]

24

Multicast PowerPoint - Late Joiners

• Viewers joining late don’t impact others with session persistent data (slide master)

[Diagram: timeline showing a viewer joining the ongoing Fcast and ECSRM sessions late and leaving before the end.]

25

Future Work

• Adding hierarchy (e.g. PGM by Cisco)

• Do we need 2 protocols?

26

RAGS: RAndom SQL test Generator

• Microsoft spends a LOT of money on testing. (60% of development according to one source).

• Idea: test SQL by
» generating random correct queries
» executing the queries against the database
» comparing results with SQL 6.5, DB2, Oracle, Sybase

• Being used in SQL 7.0 testing.
» 375 unique bugs found (since 2/97)

» Very productive test tool
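A toy sketch of the random-query idea, not the RAGS implementation: recursively generate a bounded-depth, syntactically valid expression tree, print it as SQL, run it against each vendor's engine, and diff the results. The table and column names below come from the pubs sample database used in the example statements; everything else is made up.

/* Toy random-SELECT generator in the spirit of RAGS (illustration only).
   Real RAGS drives the full SQL grammar and compares result sets across
   SQL Server 6.5, DB2, Oracle, and Sybase; this just shows the recursion. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static const char *cols[] = { "royalty", "lorange", "hirange" };

static void gen_expr(int depth)
{
    switch (depth > 0 ? rand() % 3 : 0) {
    case 0:                                  /* leaf: column or literal */
        if (rand() % 2)
            printf("%s", cols[rand() % 3]);
        else
            printf("%d", rand() % 100 - 50);
        break;
    case 1:                                  /* binary arithmetic node  */
        printf("(");
        gen_expr(depth - 1);
        printf(" %c ", "+-*"[rand() % 3]);
        gen_expr(depth - 1);
        printf(")");
        break;
    default:                                 /* unary function node     */
        printf("ABS(");
        gen_expr(depth - 1);
        printf(")");
        break;
    }
}

int main(void)
{
    srand((unsigned)time(NULL));
    printf("SELECT ");
    gen_expr(3);
    printf("\nFROM roysched\nWHERE ");
    gen_expr(2);
    printf(" > ");
    gen_expr(2);
    printf("\n");
    return 0;
}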

27

Sample Rags Generated Statement

SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (

SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (

SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )

This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.

28

Automation

• Simpler statement with the same error:

SELECT roysched.royalty

FROM titles, roysched

WHERE EXISTS (

SELECT DISTINCT TOP 1 titles.advance

FROM sales

ORDER BY 1)

• Control statement attributes
» complexity, kind, depth, ...

• Multi-user stress tests
» tests concurrency, allocation, recovery

29

One 4-Vendor Rags Test: 3 of them vs. us

• 60 k Selects on MSS, DB2, Oracle, Sybase.

• 17 SQL Server Beta 2 suspects (1 suspect per 3,350 statements).

• Examined 10 suspects, filed 4 bugs! One duplicate; assume 3/10 are new.

• Note: This is the SQL Server Beta 2 product. Quality is rising fast (and RAGS sees that).

30

RAGS Next Steps

• Done:

»Patents,

»Papers,

»talks,

» tech transfer to development

• Next steps:

» Extend to other parts of SQL and T-SQL

» “Crawl” the config space (look for new holes)

» Apply ideas to other domains (OLE DB).

31

Scaleup: Big Database

• Build a 1 TB SQL Server database

» Show off Windows NT and SQL Server scalability

» Stress test the product

• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere

• Loaded
» 1.1 M place names from Encarta World Atlas
» 1 M Sq Km from USGS (1 meter resolution)
» 2 M Sq Km from Russian Space agency (2 m)

• Will be on web (world’s largest atlas)
• Sell images with commerce server.
• USGS CRDA: 3 TB more coming.

32

The System

• DEC Alpha + 8400

• 324 StorageWorks Drives (2.9 TB)

• SQL Server 7.0

• USGS 1-meter data (30% of US)

• Russian Space data (two-meter resolution images)

SPIN-2

33

World’s Largest PC!

• 324 disks (2.9 terabytes)

• 8 x 440 MHz Alpha CPUs

• 10 GB DRAM

34

1 TB Database Server
• AlphaServer 8400 4x400, 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STC Timber Wolf DLT7000)

SPIN-2

Hardware

[Diagram: AlphaServer 8400 (8 x 440 MHz Alpha CPUs, 10 GB DRAM) attached to an Enterprise Storage Array of 48 x 9 GB drive shelves, an STC 9740 DLT tape library, a 100 Mbps Ethernet switch connecting the map server and site servers, and a DS3 Internet link.]

35

Software

[Diagram: Web clients (browser with an HTML/Java viewer) reach the TerraServer Web site over the Internet. The Web site runs Internet Information Server 4.0 with Active Server Pages and MTS, the Microsoft Automap ActiveX server, the image delivery application, and Microsoft Site Server EE. The TerraServer DB / Automap server runs Sphinx (SQL Server 7) with TerraServer stored procedures; image provider site(s) run Internet Information Server 4.0 as image servers.]

36

System Management & Maintenance

• Backup and Recovery
» STC 9717 Tape robot
» Legato NetWorker™
» Sphinx Backup/Restore Utility
» Clocked at 80 MBps peak (~200 GB/hr)

• SQL Server Enterprise Mgr
» DBA Maintenance
» SQL Performance Monitor

37

TerraServer File Group Layout

• Convert 324 disks to 28 RAID5 sets plus 28 spare drives

• Make 4 NT volumes (RAID 50), 595 GB per volume

• Build 30 20GB files on each volume

• DB is File Group of 120 files

[Diagram: seven pairs of HSZ70 A/B RAID controllers behind the four NT volumes E:, F:, G:, and H:.]

38

Demo

Http://TerraWeb2

39

Technical Challenge: Key Idea

• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server).

• Solution: a geo-spatial search key:

Divide earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)

Z-transform X & Y into single Z value, build B-tree on Z

Adjacent images stored next to each other

• Search method: Latitude and Longitude => X, Y, then Z

Select on matching Z value
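A sketch of the Z-value computation just described, assuming the X and Y cell numbers each fit in 16 bits: interleaving their bits yields one 32-bit key whose B-tree order keeps spatially adjacent cells near each other on disk. The cell sizes follow the slide (1/48 degree of longitude, 1/96 degree of latitude); the function names and example coordinates are illustrative.

/* Illustrative Z-order (Morton) key for the TerraServer search scheme.
   X = longitude cell (1/48 degree), Y = latitude cell (1/96 degree);
   interleave their bits into one value and index it with a plain B-tree. */
#include <stdio.h>

unsigned int z_value(unsigned short x, unsigned short y)
{
    unsigned int z = 0;
    int bit;
    for (bit = 0; bit < 16; bit++) {
        z |= (unsigned int)((x >> bit) & 1) << (2 * bit);      /* even bits */
        z |= (unsigned int)((y >> bit) & 1) << (2 * bit + 1);  /* odd bits  */
    }
    return z;
}

int main(void)
{
    double lon = -122.33, lat = 47.61;                 /* example: Seattle  */
    unsigned short x = (unsigned short)((lon + 180.0) * 48);   /* 1/48 deg  */
    unsigned short y = (unsigned short)((lat +  90.0) * 96);   /* 1/96 deg  */
    printf("Z = %u\n", z_value(x, y));     /* then: SELECT ... WHERE Z = @z */
    return 0;
}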

40

Now What?

• New Since S-Day:
» More data: 4.8 TB USGS DOQ, .5 TB Russian
» Bigger server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB disk
» Improved application: better UI, uses ASP, commerce app

• Cut images and Load Sept & Feb

• Built Commerce App for USGS & Spin-2

• Launch at Fed Scalability Day, SQL 7 Beta 3 (6/24/98)

• Operate on Internet for 18 months

• Add more data (double)

• Working with Sloan Digital Sky Survey
» 40 TB of images
» 3 TB of “objects”

41

NT Clusters (Wolfpack)

• Scale DOWN to PDA: Windows CE

• Scale UP an SMP: TerraServer

• Scale OUT with a cluster of machines

• Single-system image

»Naming

»Protection/security

»Management/load balance

• Fault tolerance

»“Wolfpack”

• Hot pluggable hardware & software

42

Symmetric Virtual Server Failover Example

[Diagram: a browser reaches a Web site virtual server and a database virtual server; Server 1 and Server 2 each host a Web site and a database virtual server, with the Web site files and database files on shared disks, so either node can take over the other's virtual servers.]

43

Clusters & BackOffice

• Research: Instant & Transparent failover

• Making BackOffice PlugNPlay on Wolfpack

»Automatic install & configure

• Virtual Server concept makes it easy

»simpler management concept

»simpler context/state migration

»transparent to applications

• SQL 6.5E & 7.0 Failover

• MSMQ (queues), MTS (transactions).

44

1.2 B tpd

• 1 B tpd ran for 24 hrs.

• Out-of-the-box software

• Off-the-shelf hardware

• AMAZING!

• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction

45

The Memory Hierarchy

• Measured & modeled sequential IO

• Where is the bottleneck?

• How does it scale with

»SMP, RAID, new interconnects

[Diagram: the sequential IO path from the app address space through the file cache, memory, memory bus, PCI, adapter, SCSI, controller, and disk.]

Goals: balanced bottlenecks, low overhead, scale to many processors (10s) and many disks (100s)

46

Sequential IO (your mileage will vary)

[Chart: throughput (MB/sec, 0-30) vs. transfer size (2-128 KB) for 1-disk and 4-disk reads and writes.]

Striping helps; the controller is the bottleneck.

40 MB/sec Advertised UW SCSI

35r-23w MB/sec Actual disk transfer

29r-17w MB/sec 64 KB request (NTFS)

9 MB/sec Single disk media

3 MB/sec 2 KB request (SQL Server)

• Measured hardware & Software

• Find software fixes..

• “Out of the box” half-power point: 50% of peak power

[Chart: single-disk throughput (MB/sec, 0-10) vs. transfer size (2-128 KB), unbuffered vs. NTFS-buffered reads and writes.]

NTFS Read is good at 8KB, but writes are uniformly slow
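A rough sketch of how the buffered vs. unbuffered curves can be timed on NT with the Win32 file API. The file name and 64 KB request size are placeholders; FILE_FLAG_NO_BUFFERING bypasses the NTFS file cache and requires sector-aligned buffers and request sizes (VirtualAlloc's page alignment satisfies this), and error handling is omitted.

/* Illustrative NT sequential-read timing (file name and request size are
   placeholders; error handling omitted).  FILE_FLAG_NO_BUFFERING bypasses
   the file cache, giving the "unbuffered" curves; drop the flag to measure
   the NTFS-buffered case. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD req = 64 * 1024;                     /* 64 KB requests      */
    HANDLE h = CreateFileA("E:\\bigfile.dat", GENERIC_READ, 0, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                           NULL);
    BYTE *buf = (BYTE *)VirtualAlloc(NULL, req, MEM_COMMIT, PAGE_READWRITE);
    DWORD got, start = GetTickCount(), elapsed;
    unsigned __int64 total = 0;

    while (ReadFile(h, buf, req, &got, NULL) && got > 0)
        total += got;                                /* bytes actually read */

    elapsed = GetTickCount() - start;                /* milliseconds        */
    if (elapsed == 0) elapsed = 1;
    printf("%.1f MB/sec\n", (double)total / 1000.0 / (double)elapsed);
    CloseHandle(h);
    return 0;
}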

47

PAP (peak advertised Performance) vs RAP (real application performance)

• Goal: RAP = PAP / 2 (the half-power point)

• http://research.Microsoft.com/BARC/Sequential_IO/

48

Disk Bottleneck Analysis

• NTFS read/write, 12 disks, 4 SCSI, 2 PCI

(not measured, we had only one PCI bus available, 2nd one was “internal”)

~ 120 MBps Unbuffered read

~ 80 MBps Unbuffered write

~ 40 MBps Buffered read

~ 35 MBps Buffered write

[Diagram: measured limits along the IO path: memory read/write ~150 MBps, PCI ~70 MBps, adapter ~30 MBps, ~120 MBps aggregate across the adapters.]

49

Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark

• How much can you sort for a penny?
» Hardware and software cost
» Depreciated over 3 years
» 1M$ system gets about 1 second
» 1K$ system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($)  (see the derivation below)

• Input and output are disk resident

• Input is
» 100-byte records (random data)
» key is first 10 bytes.

• Must create output file and fill with sorted version of input file.

• Daytona (product) and Indy (special) categories
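Where the 946,080 constant comes from, as a quick check: three years of depreciated life is 3 x 365 x 24 x 3,600 = 94,608,000 seconds, and a penny buys a 1/100-dollar share of it, so

    seconds per penny = 94,608,000 x ($0.01 / SystemPrice)
                      = 946,080 / SystemPrice ($)

which gives about 946 seconds for a $1,000 system and about 0.95 seconds for a $1,000,000 system, matching the bullets above.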

50

PennySort

• Hardware
» 266 MHz Intel P2

» 64 MB SDRAM (10ns)

» Dual UDMA 3.2GB EIDE disk

• Software
» NT workstation 4.3

» NT 5 sort

• Performance
» sort 15 M 100-byte records (~1.5 GB)

» Disk to disk

» elapsed time 820 sec
» cpu time = 404 sec or 100 sec
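A quick consistency check against the ground rules above: a $1,107 machine earns 946,080 / 1,107, or roughly 855 seconds, of work per penny, so the 820-second elapsed sort of 15 M 100-byte records (~1.5 GB) just fits inside the penny budget.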

[Pie chart: PennySort machine ($1,107) cost breakdown: cpu 32%, disk 25%, other 22%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%.]

51

Networking: BIG!! Changes coming!

• Technology
» 10 GBps bus “now”
» 1 Gbps links “now”
» 1 Tbps links in 10 years
» Fast & cheap switches

• Standard interconnects
» processor-processor
» processor-device (= processor)

• Deregulation WILL work someday

• CHALLENGE
» reduce software tax on messages
» Today: 30 K ins + 10 ins/byte
» Goal: 1 K ins + .01 ins/byte (sized below)

• Best bet:
» SAN/VIA
» Smart NICs
» Special protocol
» User-Level Net IO (like disk)
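To size the software tax mentioned above: at today's roughly 30 K instructions per message plus 10 instructions per byte, an 8 KB message costs about 30,000 + 10 x 8,192 ≈ 112 K instructions; at the stated goal of 1 K instructions plus .01 per byte it would cost about 1,080, roughly a hundred-fold reduction.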

52

What if Networking Was as Cheap As Disk IO?

• TCP/IP

»Unix/NT 100% cpu @ 40MBps

• Disk

»Unix/NT 8% cpu @ 40MBps

Why the difference?

• For disk, the host bus adapter does the SCSI packetizing, checksum, flow control, and DMA.

• For TCP/IP, the host does the packetizing, checksum, and flow control, with small buffers.

53

The Promise of SAN/VIA: 10x better in 2 years

http://www.viarch.org/

• Today:
» wires are 10 MBps (100 Mbps Ethernet)

»~20 MBps tcp/ip saturates 2 cpus

»round-trip latency is ~300 us

• In University & NT lab

»wires are 1 Gbps Ethernet, ServerNet,…

» Fast user-level communication
• tcp/ip ~ 100 MBps, 10% of each processor

• round-trip latency is 15 us

54

Public Service

• Gordon Bell

» Computer Museum

» Vanguard Group

» Edits column in CACM

• Jim Gray
» National Research Council Computer Science and Telecommunications Board

» Presidential Advisory Committee on NGI-IT-HPPC

» Edit Journals & Conferences.

• Tom Barclay
» USGS and Russian cooperative research

55

BARC: Microsoft Bay Area Research Center

Tom Barclay, Gordon Bell, Joe Barrera, Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen

http://www.research.Microsoft.com/barc/