1
BARC
Microsoft Bay Area Research Center
6/20/97 (PUBLIC VERSION)
Tom Barclay, Tyler Bean (U VA), Gordon Bell, Joe Barrera, Josh Coates (UC B), Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen
http://www.research.Microsoft.com/barc/
2
Telepresence
• The next killer app
• Space shifting:
»Reduce travel
• Time shifting:
»Retrospective
»Offer condensations
»Just in time meetings.
• Example: ACM 97
»NetShow and Web site.
»More web visitors than attendees
• People-to-People communication
3
Telepresent: Jim Gemmell
Scaleable Reliable Multicast
Outline
• What Reliable Multicast is & why it is hard to scale
• Fcast file transfer
• ECSRM
• Layered Telepresentations
4
Multiple Unicast
• Sender must repeat
• Link sees repeats
5
IP Multicast
• Pruned broadcast
• Unreliable
6
Reliable Multicast
• Difficult to scale:
»Sender state explosion
»Message implosion
[Figure: sender state table listing receiver 1, receiver 2, …, receiver n]
7
Receiver-Reliable
• Receiver’s job to NACK
[Figure: sender state table listing receiver 1, receiver 2, …, receiver n]
8
SRM Approaches
• Hierarchy / local recovery
• Forward Error Correction (FEC)
• Suppression
• *HYBRID*
_________________________________
• Fcast is FEC only
• ECSRM is suppression + FEC
9
(n,k) linear block encoding
Original packets: 1, 2, …, k
Encode (copy first k): 1, 2, …, k, k+1, k+2, …, n
Decode (take any k): recovers original packets 1, 2, …, k
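The "take any k" property can be illustrated with the simplest possible (n, k) code: a single XOR parity packet, i.e. n = k + 1. This is only a sketch of the idea; Fcast itself used more general (n, k) block codes that tolerate more than one loss per group, and the function names here are illustrative:

```python
from functools import reduce

def xor(a, b):
    # Byte-wise XOR of two equal-length packets.
    return bytes(x ^ y for x, y in zip(a, b))

def encode(originals):
    """k original packets -> n = k + 1 packets (append one XOR parity)."""
    return originals + [reduce(xor, originals)]

def recover(received):
    """Given any k of the n packets (one slot is None), the missing
    packet is just the XOR of the k packets that did arrive."""
    survivors = [p for p in received if p is not None]
    missing = reduce(xor, survivors)
    return [p if p is not None else missing for p in received]

k_originals = [b"pkt1", b"pkt2", b"pkt3"]      # k = 3
sent = encode(k_originals)                     # n = 4 on the wire
got = [sent[0], None, sent[2], sent[3]]        # packet 2 was lost
assert recover(got)[:3] == k_originals         # any k of n suffice
```

With one parity packet, any single loss per group is repairable; covering c losses per group requires n − k ≥ c encoding packets, which is what the larger block codes provide.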
10
Fcast
• File transfer protocol
• FEC-only
• Files transmitted in parallel
11
Fcast send order
[Figure: rows of original packets 1 … k for File 1, File 2, …, followed by encoding packets k+1, k+2, … n per row; a receiver needs any k packets from each row]
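The diagrammed send order can be sketched as a simple interleave: packet i of every file (row) is sent before packet i+1 of any file, so a receiver only ever needs *any* k packets per row, not specific ones. The function and its signature are illustrative, not taken from the Fcast implementation:

```python
def fcast_send_order(n_groups, n):
    """Yield (group, packet_index) pairs in the interleaved order the
    slide shows: column-major across the groups, so consecutive losses
    on the wire hit different groups rather than one group repeatedly."""
    for i in range(n):              # packet index within each group
        for g in range(n_groups):   # visit every group before advancing
            yield (g, i)

order = list(fcast_send_order(n_groups=3, n=5))
# The first n_groups transmissions cover packet 0 of every group.
assert order[:3] == [(0, 0), (1, 0), (2, 0)]
```

A burst of losses then erases at most one packet per group per cycle, which the (n, k) code repairs without any retransmission.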
12
Fcast reception
[Figure: the sender cycles files + FEC continuously over time; a low-loss receiver joins and leaves after a short interval, while a high-loss receiver must stay joined longer to collect k packets per group]
13
Fcast demo
[Demo: FcastSend and FcastReceive shortcuts]
14
ECSRM - Erasure Correcting SRM
• Combines:
» suppression
» erasure correction
15
Suppression
• Delay a NACK or repair in the hope that someone else will do it.
• NACKs are multicast.
• After NACKing, reset the timer and wait for the repair.
• If you hear a NACK that you were waiting to send, reset your timer as if you had sent it.
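The four suppression rules above can be sketched as a small state machine. The class name, timer values, and random backoff window here are illustrative assumptions, not details from the talk:

```python
import random

class NackSuppression:
    """Sketch of SRM-style NACK suppression: delay each NACK by a
    random backoff; if another receiver's multicast NACK for the same
    packet arrives first, suppress ours and just wait for the repair."""

    def __init__(self, max_backoff=0.5):
        self.pending = {}                # packet id -> scheduled send time
        self.max_backoff = max_backoff   # illustrative window, in seconds

    def detect_loss(self, pkt, now):
        # Delay the NACK in the hope that someone else sends it first.
        self.pending[pkt] = now + random.uniform(0, self.max_backoff)

    def hear_nack(self, pkt, now):
        # Someone else NACKed first: reset our timer as if we had sent
        # the NACK ourselves, and wait for the repair.
        if pkt in self.pending:
            self.pending[pkt] = now + self.max_backoff

    def due(self, now):
        # Multicast any NACK whose timer expired, then re-arm it so we
        # wait for the repair before NACKing again.
        fired = [p for p, t in self.pending.items() if t <= now]
        for p in fired:
            self.pending[p] = now + self.max_backoff
        return fired
```

The random backoff is what makes suppression scale: with many receivers missing the same packet, only the earliest timer fires and everyone else stays quiet.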
16
ECSRM - adding FEC to suppression
• Assign each packet to an EC group of size k
• NACK: (group, # missing)
• A NACK of (g, c) suppresses all NACKs (g, x) with x ≤ c
• Don’t re-send originals; send EC packets using (n,k) encoding
17
Example: EC group size (k) = 7
[Figure: seven EC groups of packets 1–7; scattered packets (marked X) are lost by different receivers]
18
…example
NACK: Group 1, 1 lost
Other NACKs suppressed
19
…example
Erasure correcting packet
20
…example: summary
• Normal suppression needs:
»7 NACKs, 7 repairs
• ECSRM requires:
»1 NACK, 1 repair
• Large group: each packet lost by someone
»Without FEC, 1/2 of traffic is repairs
»With ECSRM, only 1/8 of traffic is repairs
»NACK traffic reduced by factor of 7
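The traffic figures on this slide follow from a line of arithmetic, assuming group size k = 7 and that every original packet is lost by some receiver, each receiver losing at most one packet per group:

```python
k = 7                       # EC group size from the slide
originals = k               # every original in the group lost by someone

# Plain suppression: one repair per lost original.
plain_repairs = originals
assert plain_repairs / (originals + plain_repairs) == 0.5   # half the traffic

# ECSRM: one erasure-correcting packet repairs every single-loss
# receiver in the group at once.
ec_repairs = 1
assert ec_repairs / (originals + ec_repairs) == 1 / 8       # 1/8 of traffic

# One aggregate NACK (group, count) replaces k per-packet NACKs.
assert originals / 1 == 7                                   # NACKs cut 7x
```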
21
Simulation: 112 receivers
[Chart: packets received vs. time for SRM and ECSRM]
22
Simulation: 112 receivers
[Chart: NACKs received vs. time for SRM and ECSRM]
23
Multicast PowerPoint Add-in
[Figure: slides, annotations, and control information are carried by ECSRM; the slide master is carried by Fcast]
24
Multicast PowerPoint - Late Joiners
• Viewers joining late don’t impact others with session-persistent data (slide master)
[Figure: over time, a late joiner fetches the slide master via Fcast while annotations and control continue over ECSRM]
25
Future Work
• Adding hierarchy (e.g. PGM by Cisco)
• Do we need 2 protocols?
26
RAGS: RAndom SQL test Generator
• Microsoft spends a LOT of money on testing (60% of development, according to one source).
• Idea: test SQL by
» generating random correct queries
» executing queries against the database
» comparing results with SQL 6.5, DB2, Oracle, Sybase
• Being used in SQL 7.0 testing
» 375 unique bugs found (since 2/97)
» Very productive test tool
27
Sample Rags Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996 10:23AM" , T0.notesFROM titles T0, roysched T1WHERE EXISTS ( SELECT DISTINCT TOP 9 $3.11 , "Apr 15 1996 10:23AM" , T0.advance , ( "<v3``VF;" +(( UPPER(((T2.ord_num +"22\}0G3" )+T2.ord_num ))+("{1FL6t15m" + RTRIM( UPPER((T1.title_id +((("MlV=Cf1kA" +"GS?" )+T2.payterms )+T2.payterms ))))))+(T2.ord_num +RTRIM((LTRIM((T2.title_id +T2.stor_id ))+"2" ))))), T0.advance , (((-(T2.qty ))/(1.0 ))+(((-(-(-1 )))+( DEGREES(T2.qty )))-(-(( -4 )-(-(T2.qty ))))))+(-(-1 )) FROM sales T2 WHERE EXISTS ( SELECT "fQDs" , T2.ord_date , AVG ((-(7 ))/(1 )), MAX (DISTINCT -1 ), LTRIM("0I=L601]H" ), ("jQ\" +((( MAX(T3.phone )+ MAX((RTRIM( UPPER( T5.stor_name ))+((("<" +"9n0yN" )+ UPPER("c" ))+T3.zip ))))+T2.payterms )+ MAX("\?" ))) FROM authors T3, roysched T4, stores T5 WHERE EXISTS ( SELECT DISTINCT TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE ( (-(-(5 )))>= T4.royalty ) AND (( ( ( LOWER( UPPER((("9W8W>kOa" + T6.stor_address )+"{P~" ))))!= ANY (
SELECT TOP 2 LOWER(( UPPER("B9{WIX" )+"J" )) FROM roysched T7 WHERE ( EXISTS (
SELECT (T8.city +(T9.pub_id +((">" +T10.country )+ UPPER( LOWER(T10.city))))), T7.lorange , ((T7.lorange )*((T7.lorange )%(-2 )))/((-5 )-(-2.0 )) FROM publishers T8, pub_info T9, publishers T10 WHERE ( (-10 )<= POWER((T7.royalty )/(T7.lorange ),1)) AND (-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) ) ) --EOQ ) AND (NOT (EXISTS ( SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9, stores T10 WHERE ( (T10.city + LOWER(T10.stor_id )) BETWEEN (("QNu@WI" +T10.stor_id )) AND ("DT" ) ) AND ("R|J|" BETWEEN ( LOWER(T10.zip )) AND (LTRIM( UPPER(LTRIM( LOWER(("_\tk`d" +T8.title_id )))))) ) GROUP BY T9.i3, T8.royalty, T9.i3 HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty ))))) AND (COUNT(*)) ) --EOQ ) ) ) --EOQ ) AND (((("i|Uv=" +T6.stor_id )+T6.state )+T6.city ) BETWEEN ((((T6.zip +( UPPER(("ec4L}rP^<" +((LTRIM(T6.stor_name )+"fax<" )+("5adWhS" +T6.zip )))) +T6.city ))+"" )+"?>_0:Wi" )) AND (T6.zip ) ) ) AND (T4.lorange BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ GROUP BY ( LOWER(((T3.address +T5.stor_address )+REVERSE((T5.stor_id +LTRIM( T5.stor_address )))))+ LOWER((((";z^~tO5I" +"" )+("X3FN=" +(REVERSE((RTRIM( LTRIM((("kwU" +"wyn_S@y" )+(REVERSE(( UPPER(LTRIM("u2C[" ))+T4.title_id ))+( RTRIM(("s" +"1X" ))+ UPPER((REVERSE(T3.address )+T5.stor_name )))))))+ "6CRtdD" ))+"j?]=k" )))+T3.phone ))), T5.city, T5.stor_address ) --EOQ ORDER BY 1, 6, 5 )
This statement yields an error: SQLState=37000, Error=8623, Internal Query Processor Error: Query processor could not produce a query plan.
28
Automation
• Simpler statement with the same error:
SELECT roysched.royalty
FROM titles, roysched
WHERE EXISTS (
SELECT DISTINCT TOP 1 titles.advance
FROM sales
ORDER BY 1)
• Control statement attributes
» complexity, kind, depth, ...
• Multi-user stress tests
» test concurrency, allocation, recovery
29
One 4-Vendor Rags Test (3 of them vs. us)
• 60 K SELECTs on MSS, DB2, Oracle, Sybase
• 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements
• Examined 10 suspects, filed 4 bugs (one duplicate); assume 3/10 are new
• Note: this is the SQL Server Beta 2 product. Quality is rising fast (and RAGS sees that)
30
RAGS Next Steps
• Done:
»Patents,
»Papers,
»talks,
» tech transfer to development
• Next steps:
»Extend to other parts of SQL and T-SQL
»“Crawl” the config space (look for new holes)
»Apply ideas to other domains (OLE DB)
31
Scaleup - Big Database
• Build a 1 TB SQL Server database
» Show off Windows NT and SQL Server scalability
» Stress test the product
• Data must be
» 1 TB
» Unencumbered
» Interesting to everyone everywhere
» And not offensive to anyone anywhere
• Loaded
» 1.1 M place names from Encarta World Atlas
» 1 M sq km from USGS (1-meter resolution)
» 2 M sq km from Russian Space Agency (2 m)
• Will be on the web (world’s largest atlas)
• Sell images with commerce server
• USGS CRDA: 3 TB more coming
32
The System
• DEC Alpha + 8400
• 324 StorageWorks Drives (2.9 TB)
• SQL Server 7.0
• USGS 1-meter data (30% of US)
• Russian Space data (SPIN-2): two-meter resolution images
33
World’s Largest PC!
• 324 disks (2.9 terabytes)
• 8 x 440Mhz Alpha CPUs
• 10 GB DRAM
34
Hardware
• 1 TB Database Server: AlphaServer 8400, 10 GB RAM
• 324 StorageWorks disks
• 10-drive tape library (STC TimberWolf DLT7000)
• SPIN-2
[Diagram: STC 9740 DLT tape library; Enterprise Storage Array racks of 48 x 9 GB drives; AlphaServer 8400 with 8 x 440 MHz Alpha CPUs and 10 GB DRAM; 100 Mbps Ethernet switch; DS3 Internet link; map server; site servers]
35
Software
[Diagram: a web client (browser with HTML/Java viewer) connects over the Internet to the Terra-Server web site (Internet Information Server 4.0, Active Server Pages, MTS, Microsoft Site Server EE, the image delivery application, and the Microsoft Automap ActiveX server), backed by the Terra-Server DB (Sphinx/SQL Server 7 with Terra-Server stored procedures) and image provider sites running an image server]
36
System Management & Maintenance
• Backup and Recovery
» STC 9717 tape robot
» Legato NetWorker™
» Sphinx Backup/Restore Utility
» Clocked at 80 MBps (peak) (~200 GB/hr)
• SQL Server Enterprise Mgr
» DBA maintenance
» SQL Performance Monitor
37
TerraServer File Group Layout
• Convert 324 disks to 28 RAID5 sets, plus 28 spare drives
• Make 4 NT volumes (RAID 50), 595 GB per volume
• Build 30 x 20 GB files on each volume
• DB is a File Group of 120 files
[Diagram: four volumes E:, F:, G:, H:, each behind paired HSZ70 A/B controllers]
38
Demo
http://TerraWeb2
39
Technical Challenge: Key Idea
• Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
• Solution: a geo-spatial search key
» Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
» Z-transform X & Y into a single Z value; build a B-tree on Z
» Adjacent images are stored next to each other
• Search method: latitude and longitude => X, Y, then Z; select on matching Z value
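The Z-transform is the standard bit-interleave (Morton order). This sketch uses the slide's 1/48- and 1/96-degree cells, but the bit width and exact key layout TerraServer used are assumptions:

```python
def interleave(x, y, bits=16):
    """Morton/Z-order key: interleave the bits of the X and Y cell
    numbers so that cells near each other on the earth tend to land
    near each other in the one-dimensional B-tree key space."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits in even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits in odd positions
    return z

def z_key(lon_deg, lat_deg):
    # Cell sizes from the slide; origin offsets are illustrative.
    x = int((lon_deg + 180) * 48)   # 1/48th degree longitude cells
    y = int((lat_deg + 90) * 96)    # 1/96th degree latitude cells
    return interleave(x, y)

# x bits occupy even positions, y bits odd positions:
assert interleave(0b11, 0b00) == 0b0101
assert interleave(0b00, 0b11) == 0b1010
```

A B-tree range scan over Z then pulls in spatially adjacent imagery with ordinary SELECTs, which is the whole trick: no spatial index, just a clustered key.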
40
Now What?
• New since S-Day:
» More data: 4.8 TB USGS DOQ, .5 TB Russian
» Bigger server: Alpha 8400, 8 proc, 8 GB RAM, 2.5 TB disk
» Improved application: better UI, uses ASP, commerce app
• Cut images and load Sept & Feb
• Built commerce app for USGS & Spin-2
• Launch at Fed Scalability Day, SQL 7 Beta 3 (6/24/98)
• Operate on Internet for 18 months
• Add more data (double)
• Working with Sloan Digital Sky Survey
» 40 TB of images
» 3 TB of “objects”
41
NT Clusters (Wolfpack)
• Scale DOWN to PDA: WindowsCE
• Scale UP an SMP: TerraServer
• Scale OUT with a cluster of machines
• Single-system image
»Naming
»Protection/security
»Management/load balance
• Fault tolerance
»“Wolfpack”
• Hot pluggable hardware & software
42
Symmetric Virtual Server Failover Example
[Figure: a browser reaches a web site and a database hosted as virtual servers; Server 1 and Server 2 share the web site files and database files, and on failover both virtual servers migrate to the surviving node]
43
Clusters & BackOffice
• Research: instant & transparent failover
• Making BackOffice PlugNPlay on Wolfpack
»Automatic install & configure
• Virtual Server concept makes it easy
»simpler management concept
»simpler context/state migration
»transparent to applications
• SQL 6.5E & 7.0 Failover
• MSMQ (queues), MTS (transactions).
44
1.2 B tpd
• 1 B tpd ran for 24 hrs
• Out-of-the-box software
• Off-the-shelf hardware
• AMAZING!
• Sized for 30 days
• Linear growth
• 5 micro-dollars per transaction
45
The Memory Hierarchy
• Measured & modeled sequential IO
• Where is the bottleneck?
• How does it scale with
» SMP, RAID, new interconnects
• Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s)
[Diagram: IO path from app address space through file cache, memory bus, PCI, adapter, SCSI, and controller to disk]
46
Sequential IO (your mileage will vary)
[Chart: throughput (MB/sec, 0–30) vs. transfer size (2–128 KB) for 1-disk and 4-disk reads and writes: striping helps; the controller is the bottleneck]
• 40 MB/sec: advertised UW SCSI
• 35r-23w MB/sec: actual disk transfer
• 29r-17w MB/sec: 64 KB request (NTFS)
• 9 MB/sec: single disk media
• 3 MB/sec: 2 KB request (SQL Server)
• Measured hardware & software
• Find software fixes
• “Out of the box” 1/2 power point: 50% of peak power
[Chart: 1-disk read/write with and without NTFS buffering: NTFS read is good at 8 KB, but writes are uniformly slow]
47
PAP (Peak Advertised Performance) vs. RAP (Real Application Performance)
• Goal: RAP = PAP / 2 (the half-power point)
• http://research.Microsoft.com/BARC/Sequential_IO/
48
Disk Bottleneck Analysis
• NTFS read/write: 12 disks, 4 SCSI, 2 PCI
(not measured; we had only one PCI bus available, the 2nd one was “internal”)
» ~120 MBps unbuffered read
» ~80 MBps unbuffered write
» ~40 MBps buffered read
» ~35 MBps buffered write
[Diagram: memory read/write ~150 MBps; PCI ~70 MBps; each adapter ~30 MBps; 120 MBps aggregate across adapters]
49
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
• How much can you sort for a penny?
» Hardware and software cost
» Depreciated over 3 years
» 1 M$ system gets about 1 second
» 1 K$ system gets about 1,000 seconds
» Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
» 100-byte records (random data)
» key is first 10 bytes
• Must create output file and fill with sorted version of input file
• Daytona (product) and Indy (special) categories
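The time budget follows directly from depreciating the system price over three years: a penny buys the fraction 0.01/price of the machine's three-year life. A quick check against the two example price points:

```python
THREE_YEARS_SEC = 3 * 365 * 24 * 3600     # 94,608,000 seconds
PENNY = 0.01                               # the whole budget, in dollars

def budget_seconds(system_price_dollars):
    """Seconds of machine time one penny buys, depreciating the full
    system price linearly over three years."""
    return PENNY * THREE_YEARS_SEC / system_price_dollars

# 0.01 * 94,608,000 = 946,080, hence Time = 946,080 / Price.
assert round(budget_seconds(1_000_000), 2) == 0.95   # ~1 second
assert round(budget_seconds(1_000)) == 946           # ~1,000 seconds
```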
50
PennySort
• Hardware
» 266 MHz Intel P2
» 64 MB SDRAM (10 ns)
» Dual UDMA 3.2 GB EIDE disks
• Software
» NT Workstation 4.3
» NT 5 sort
• Performance
» sort 15 M 100-byte records (~1.5 GB)
» disk to disk
» elapsed time 820 sec (cpu time = 404 sec or 100 sec)
[Chart: PennySort machine cost breakdown (1,107 $): cpu 32%, disk 25%, other 22%, board 13%, network/video/floppy 9%, memory 8%, cabinet + assembly 7%, software 6%]
51
Networking: BIG!! changes coming!
• Technology
» 10 GBps bus “now”
» 1 Gbps links “now”
» 1 Tbps links in 10 years
» Fast & cheap switches
• Standard interconnects
» processor-processor
» processor-device (= processor)
• Deregulation WILL work someday
• CHALLENGE: reduce the software tax on messages
» Today: 30 K ins + 10 ins/byte
» Goal: 1 K ins + .01 ins/byte
• Best bet:
» SAN/VIA
» Smart NICs
» Special protocol
» User-level net IO (like disk)
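The per-message cost model above (a fixed instruction count per message plus an instruction count per byte) makes the "software tax" concrete. The 8 KB message size used here is an illustrative assumption:

```python
def cpu_instructions(bytes_per_msg, per_msg_ins, per_byte_ins):
    """Protocol overhead per message under the slide's cost model:
    a fixed per-message cost plus a per-byte cost."""
    return per_msg_ins + per_byte_ins * bytes_per_msg

# Today vs. the stated goal, for an 8 KB message:
today = cpu_instructions(8192, 30_000, 10)    # 30,000 + 81,920 instructions
goal  = cpu_instructions(8192, 1_000, 0.01)   # 1,000 + ~82 instructions
assert today / goal > 100                     # roughly 100x cheaper messaging
```

At large message sizes the per-byte term dominates, which is why the goal cuts it by three orders of magnitude, not just the fixed cost.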
52
What if Networking Was as Cheap As Disk IO?
• TCP/IP
»Unix/NT 100% cpu @ 40MBps
• Disk
»Unix/NT 8% cpu @ 40MBps
Why the difference? The host bus adapter does the SCSI packetizing, checksums, flow control, and DMA; for TCP/IP the host itself does the packetizing, checksums, and flow control, with small buffers.
53
The Promise of SAN/VIA: 10x better in 2 years
http://www.viarch.org/
• Today:
» wires are 10 MBps (100 Mbps Ethernet)
» ~20 MBps tcp/ip saturates 2 cpus
» round-trip latency is ~300 µs
• In university & NT labs:
» wires are 1 Gbps (Ethernet, ServerNet, …)
» fast user-level communication
• tcp/ip ~100 MBps at 10% of each processor
• round-trip latency is 15 µs
54
Public Service
• Gordon Bell
» Computer Museum
» Vanguard Group
» Edits column in CACM
• Jim Gray
» National Research Council Computer Science and Telecommunications Board
» Presidential Advisory Committee on NGI-IT-HPPC
» Edits journals & conferences
• Tom Barclay
» USGS and Russian cooperative research
55
BARC
Microsoft Bay Area Research Center
Tom Barclay, Gordon Bell, Joe Barrera, Jim Gemmell, Jim Gray, Erik Riedel (CMU), Eve Schooler (Cal Tech), Don Slutz, Catherine Van Ingen
http://www.research.Microsoft.com/barc/