1 PennySort Award Ceremony Beijing China 23 October 2006.
1
PennySort Award Ceremony
Beijing, China, 23 October 2006
2
Outline
• PennySort history and award
• What I have been doing.
3
Benchmark History
[Figure: timeline of database benchmarks, 1970 to 2010]
• Wisconsin (Bitton, Boral, DeWitt, Turbyfill)
• IBM TP 1-7: CA and Tony Lukes
• Debit Credit (Gray)
• Datamation (Anon et al.)
• MCC (Boral & ...)
• Teradata (Bollinger & ...)
• TPC-A, TPC-B, TPC-C, TPC-D, TPC-H, TPC-W ?
• Sort: PennySort, MinuteSort
4
A Short History of Sort
• April Fools 1985: Datamation Sort
  – Sort 1M 100-byte records
  – An IO benchmark: 15 min to 1 hr!
• 1993: {Minute | Penny} x {Daytona | Indy}
• 1998: TeraByte Sort
• Web site: http://research.Microsoft.com/barc/SortBenchmark/
5
Ground Rules
• How much can you sort for a penny (or in a minute)?
  – Hardware cost
  – Depreciated over 3 years
  – 1M$ system gets about 1 second,
  – 1K$ system gets about 1,000 seconds.
  – Time (seconds) = 946,080 / SystemPrice ($)
• Input and output are disk resident
• Input is
  – 100-byte records (random data)
  – key is first 10 bytes.
• Must create output file and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
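The penny time budget follows from the 3-year depreciation rule: a penny buys 0.01 / price of the machine's 94,608,000-second lifetime. A minimal sketch of that arithmetic (the function and constant names are illustrative, not from the benchmark rules):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000
DEPRECIATION_YEARS = 3                  # from the ground rules


def penny_time_budget(system_price_dollars, budget_dollars=0.01):
    """Seconds of machine time that `budget_dollars` buys on a system
    of the given price, depreciated linearly over 3 years."""
    if system_price_dollars <= 0:
        raise ValueError("system price must be positive")
    lifetime_seconds = DEPRECIATION_YEARS * SECONDS_PER_YEAR  # 94,608,000
    return budget_dollars * lifetime_seconds / system_price_dollars


if __name__ == "__main__":
    for price in (1_000_000, 1_000, 1_107):
        print(f"{price}$ system: {penny_time_budget(price):.1f} seconds per penny")
```

A 1M$ system gets just under 1 second and a 1K$ system just under 1,000 seconds, matching the rule of thumb above.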
6
1998 PennySort
• Hardware
  – 266 MHz Intel PPro
  – 64 MB SDRAM (10 ns)
  – Dual Fujitsu DMA 3.2 GB EIDE disks
• Software
  – NT workstation 4.3
  – NT 5 sort
• Performance
  – sort 15 M 100-byte records (~1.5 GB)
  – Disk to disk
  – elapsed time 820 sec
  – cpu time = 404 sec

PennySort Machine (1107$) price breakdown
• cpu 32%
• Disk 25%
• board 13%
• Memory 8%
• Other 22%, which the chart breaks out as:
  – Cabinet + Assembly 7%
  – Network, Video, floppy 9%
  – Software 6%
7
2004 Daytona Terabyte Sort
• NEC Express/5800/1320Xd: 32x Itanium2 1.5 GHz, 128 GB, 900 disk TPC-C machine
• Striped across 20 HBA
  – Read and write at 3.5 GBps
  – Sort 34 GB in 60 seconds.
  – Sort 1 TB in 33 minutes
[Figure: Input Phase of 1 TB nSort]
8
1999 Sort Records
2006 Sort Records (Daytona and Indy)
• Penny
  – 590 M records (55 GB) in 644 seconds. GpuTeraSort, 1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP. Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, Jim Gray. U. North Carolina at Chapel Hill, USA
  – 344 million records (32 GB) in 1,679 seconds. Bytes-Split-Index Sort (BSIS), $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP. Xing Huang and BinHeng Song, School of Software, Tsinghua U., Beijing, China; Bo Huang, Math&CS, Hunan U. of Technology, Zhuzhou, China
• Minute
  – 40 GB (400 million records). NeoSort. Windows, Fujitsu 32 Itanium2, 128 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology
  – (2005) 116GB (125 M records) in 58.7 seconds. SCS. Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research
• TeraByte
  – (2004) 33 minutes. Nsort. Windows, 32 Itanium2, 2,350 SAN disks. Chris Nyberg, Charles Koester, Ordinal Technology
  – (2005) 435 seconds (7.25 minutes). SCS. Linux, 80 Itanium2, 2,520 SAN disks. Jim Wyllie, IBM Almaden Research
9
Bytes Split Index Sort (BSIS)
Xing Huang & BinHeng Song, Tsinghua
Bo Huang, Hunan U. of Technology
• A radix-partition sort.
• Then merge the partitions.
• 344 million records (32 GB) in 1,679 seconds
  $760 system: 1.8 GHz AMD, 1 GB RAM, 4x80GB SATA disks, WindowsXP
• Phase 1: 66 MB/s, Phase 2: 28 MB/s
• See http://research.microsoft.com/barc/SortBenchmark/BSIS-PennySort_2006.pdf
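The general radix-partition idea can be sketched in memory as follows. This is not the authors' BSIS code: it assumes partitioning on the first key byte only, and it reads "merge" as concatenating the sorted partitions in bucket order. All names are illustrative.

```python
import random

RECORD_SIZE = 100  # bytes per record, per the benchmark rules
KEY_SIZE = 10      # the key is the first 10 bytes


def make_records(count, seed=0):
    """Generate `count` random 100-byte records (reproducible via seed)."""
    rng = random.Random(seed)
    return [bytes(rng.getrandbits(8) for _ in range(RECORD_SIZE))
            for _ in range(count)]


def radix_partition_sort(records):
    """Partition records by the first key byte, sort each partition on the
    full key, then concatenate the partitions in byte order."""
    partitions = [[] for _ in range(256)]
    for record in records:
        partitions[record[0]].append(record)   # pass 1: distribute
    output = []
    for partition in partitions:               # pass 2: sort and combine
        partition.sort(key=lambda r: r[:KEY_SIZE])
        output.extend(partition)
    return output


if __name__ == "__main__":
    result = radix_partition_sort(make_records(10_000))
    keys = [r[:KEY_SIZE] for r in result]
    print("sorted:", keys == sorted(keys))
```

Because every key in partition i is smaller than every key in partition i+1, the combine step needs no comparisons; a disk-based version would write each partition to its own run file in the first pass.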
10
Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
• Sort recs/s/cpu plateaued in 1995
[Figure: Records per Second per CPU, slow improvement after 1995. Log-scale y axis: records/sec/cpu, 1.E+1 to 1.E+6; x axis: 1985 to 2005. Labeled groups: Mini, Super, cache conscious.]
11
Technology Trends: CPU and GPU
[Figure: log of relative processing power vs. year, 2002 to 2008, with CPU and GPU curves. Chart labels: Corporate DT SW Requirements; Moore’s Law Trajectory; Value; Leading Edge; Mobile; Mainstream Desktop; DT ‘Replacement’; Enthusiast / Specialty; Cooling (Cost) Limitations; Moore’s Law 3 for 18 mo, then Moore’s Law trajectory; Graphics Req’mts (enhanced experience); Value / UMA. Data labels: 0.8 GHz, 1.6 GHz, 2.2 GHz, 4.4 GHz, 31 GHz, 4.2, 11.2.]
12
Moore’s Wall: Chip Heat Death
• Processor power density going to infinity.
• Solution: stabilize clock at ~5 GHz; multi-core (aka MTA) (1,000 core?)
13
GPU TeraSort
Naga Govindaraju, Ritesh Kumar, Dinesh Manocha, U. North Carolina at Chapel Hill
• Use GPU for Phase 1 bitonic sort
• 590 M records (55 GB) in 644 seconds
  1,469$ system: 3 GHz Pentium IV, 2 GB RAM, 7800GT Nvidia graphics card, 9x80GB SATA disks (4 data and 5 “runs”), WindowsXP
• Phase 1: 185 MB/s, Phase 2: 150 MB/s
• See http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-183
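For readers unfamiliar with bitonic sort, here is a small CPU-side sketch of the sorting network. It is not the GpuTeraSort implementation; it only illustrates why the algorithm suits a GPU: within each inner step, every compare-exchange touches a disjoint pair of elements, so all of them could run in parallel. The input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Sort a sequence whose length is a power of two with a bitonic network.

    Returns a new ascending list. The compare-exchanges inside each
    `for i` sweep are independent, which is what a GPU would exploit.
    """
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

The network performs O(n log² n) comparisons, more than a comparison sort needs, but the fixed, data-independent access pattern is what maps well onto graphics hardware.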
14
Sort 100 byte records (minute / penny) Shows We Hit Memory Ceiling in 1995
http://research.microsoft.com/barc/SortBenchmark/
• Sort recs/s/cpu plateaued in 1995
• Had to get GPU to get better memory bandwidth
• SIGMOD 2006 GpuTeraSort
[Figure: Records per Second per CPU, slow improvement after 1995. Log-scale y axis: records/sec/cpu, 1.E+1 to 1.E+6; x axis: 1985 to 2005. Labeled groups: Mini, Super, cache conscious. Annotation: GPU better memory architecture, so finally more records/second.]
15
2006 PennySort Price Breakdown

| Component | BSIS ($760) | GpuTeraSort ($1470) |
|---|---|---|
| Motherboard | 14% | 16% |
| CPU | 26% | 12% |
| GPU | 0% | 18% |
| RAM | 11% | 10% |
| Disk controller | 0% | 6% |
| Disks | 36% | 33% |
| Case, power, fan | 9% | 3% |
| Assembly | 4% | 2% |
16
Sort Performance/Price improved
• Based on parallelism and “commodity” not per-cpu performance.
• Sort 68%/y performance/price improvement
[Figure: Sort Records/second vs Time, 1985 to 2005, log scale 1E+2 to 1E+8. Systems plotted: M68000, Cray YMP, IBM 3090, Tandem, Hardware Sorter, Sequent, Intel Hyper, SGI, IBM RS6000, NOW, Alpha, NOW, PennySort, TeraByte Sort, Minute Sort.]
[Figure: Speed (Sorted Records/Sec) and Performance/Price (GB Sorted/$) vs. time, 1985 to 2005, log scale 1E+0 to 1E+9.]
17
Musings: PennySort=TBsort
• 2 pass so 3TB of disk
• = 8 disks if 400GB/disk
• = 0.5GBps (if each disk = 65 MBps)
• So, 6000 seconds (3TB/0.5GBps)
• So, node can cost 200$
• Costs 10x that today
• maybe in 5 years?
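The chain of estimates above can be replayed as a short calculation. This is a back-of-the-envelope sketch using the slide's inputs (3 TB of disk traffic, 400 GB and 65 MB/s per disk) plus the 946,080 second-dollars per penny that follows from 3-year depreciation; the function name and return shape are illustrative.

```python
import math

PENNY_SECOND_DOLLARS = 946_080  # seconds one penny buys on a 1$ system


def terabyte_pennysort(disk_space_tb=3.0, disk_capacity_gb=400.0,
                       disk_rate_mb_per_s=65.0):
    """Return (disks, bandwidth in GB/s, elapsed seconds, affordable price in $)."""
    disks = math.ceil(disk_space_tb * 1000 / disk_capacity_gb)
    bandwidth_gb_per_s = disks * disk_rate_mb_per_s / 1000
    seconds = disk_space_tb * 1000 / bandwidth_gb_per_s
    affordable_price = PENNY_SECOND_DOLLARS / seconds
    return disks, bandwidth_gb_per_s, seconds, affordable_price


if __name__ == "__main__":
    disks, bw, secs, price = terabyte_pennysort()
    print(f"{disks} disks, {bw:.2f} GB/s, {secs:.0f} s, node may cost about {price:.0f}$")
```

Without the slide's rounding (0.5 GBps, 6000 seconds) the result is roughly 5,800 seconds and a node price in the 150$ to 200$ range, the same order as the 200$ quoted above.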
18
Musings: MinuteSort=TBsort
• Sorts 1TB in 1 minute
• 1 pass so 1TB of ram
• 266Gbps bisection bandwidth
• 1 pass so 2TB of IO in 60 sec
  => 600 disks
  => ~80 nodes: 8 disks 2GB ram
  => interconnect with 10Gbps Ethernet
• or 300 nodes at 1Gbps Ethernet.
• doable today
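A quick check of where these figures come from, as a sketch under the slide's inputs (2 TB of IO in 60 seconds, 600 disks, 8 disks per node). The per-disk rate is not stated on the slide; it is derived here as the rate those numbers imply. Names are illustrative.

```python
def minute_sort_requirements(io_bytes=2e12, seconds=60.0,
                             disks=600, disks_per_node=8):
    """Return (aggregate Gbps, implied MB/s per disk, node count)."""
    bytes_per_second = io_bytes / seconds
    aggregate_gbps = bytes_per_second * 8 / 1e9       # bits per second, in Gbps
    per_disk_mb_per_s = bytes_per_second / disks / 1e6
    nodes = disks / disks_per_node
    return aggregate_gbps, per_disk_mb_per_s, nodes


if __name__ == "__main__":
    gbps, disk_rate, nodes = minute_sort_requirements()
    print(f"{gbps:.0f} Gbps aggregate, {disk_rate:.1f} MB/s per disk, {nodes:.0f} nodes")
```

Moving 2 TB in a minute is about 267 Gbps, which matches the 266Gbps bisection figure; spread over 1 Gbps links, that is on the order of the 300 nodes mentioned above.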
19
What I Have Been Doing
• Traveling & Talking
• Helping Build the SkyServer and the Virtual Observatory
• Doing spatial geometry in SQL (no kidding)!
• Trying to get all science literature and data online and interlinked.
• and…
  – to blob or not to blob
  – disk reliability
20
To Blob or Not To Blob
• For objects X smaller than 1MB,
  Select X into x from T where key = 123
  is faster than
  h = open(X); read(h,x,n); close(h)
• So, blob beats file for objects < 1MB (on SQL Server – what about other DBs?)
• Because DB is CISC and FS is RISC
• Most things are less than 1MB
• DB should work to make this 10MB
• File system should borrow ideas from DB.
“To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?” Rusty Sears, Catharine Van Ingen, Jim Gray, MSR-TR-2006-45, April 2006
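The two access paths being compared look roughly like this. The sketch uses SQLite from Python's standard library purely for illustration; the measurements cited above were made on SQL Server, and nothing here reproduces them or makes a performance claim. Table and column names follow the slide; everything else is an assumption.

```python
import os
import sqlite3
import tempfile


def read_both_ways(payload, key=123):
    """Store `payload` as a database blob and as a file, then read it back
    through each path. Returns (bytes_from_db, bytes_from_file)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Path 1: the object lives in a BLOB column of table T.
        conn = sqlite3.connect(os.path.join(tmp, "objects.db"))
        try:
            conn.execute("CREATE TABLE T (key INTEGER PRIMARY KEY, X BLOB)")
            conn.execute("INSERT INTO T (key, X) VALUES (?, ?)", (key, payload))
            conn.commit()
            (from_db,) = conn.execute(
                "SELECT X FROM T WHERE key = ?", (key,)).fetchone()
        finally:
            conn.close()

        # Path 2: the object lives in its own file; open, read, close.
        path = os.path.join(tmp, "object.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        with open(path, "rb") as handle:
            from_file = handle.read()
    return bytes(from_db), from_file


if __name__ == "__main__":
    blob, file_bytes = read_both_ways(os.urandom(64 * 1024))
    print("same contents:", blob == file_bytes)
```

Timing the two paths across a range of object sizes would be the natural next step for anyone asking the slide's question about other databases.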
21
How Often do Disks Fail?
Observed failure rates.
| System | Source | Type | Part-Years | Fails | Fails/Year |
|---|---|---|---|---|---|
| TerraServer SAN | Barclay | SCSI 10krpm | 858 | 24 | 2.8% |
| | | controllers | 72 | 2 | 2.8% |
| | | san switch | 9 | 1 | 11.1% |
| TerraServer Brick | Barclay | SATA 7krpm | 138 | 10 | 7.2% |
| Web Property 1 | anon | SCSI 10krpm | 15,805 | 972 | 6.0% |
| | | controllers | 900 | 139 | 15.4% |
| Web Property 2 | anon | PATA 7krpm | 22,400 | 740 | 3.3% |
| | | motherboard | 3,769 | 66 | 1.7% |
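The Fails/Year column is failures divided by part-years of service. A minimal sketch that recomputes it from the observed counts (the data tuples are copied from the slide; the names are illustrative):

```python
# (system, part type, part-years observed, failures)
OBSERVATIONS = [
    ("TerraServer SAN", "SCSI 10krpm", 858, 24),
    ("TerraServer SAN", "controllers", 72, 2),
    ("TerraServer SAN", "san switch", 9, 1),
    ("TerraServer Brick", "SATA 7krpm", 138, 10),
    ("Web Property 1", "SCSI 10krpm", 15_805, 972),
    ("Web Property 1", "controllers", 900, 139),
    ("Web Property 2", "PATA 7krpm", 22_400, 740),
    ("Web Property 2", "motherboard", 3_769, 66),
]


def annual_failure_rate(fails, part_years):
    """Failures per part-year, as a percentage."""
    return 100.0 * fails / part_years


if __name__ == "__main__":
    for system, part, part_years, fails in OBSERVATIONS:
        rate = annual_failure_rate(fails, part_years)
        print(f"{system:18} {part:12} {rate:5.1f}%")
```

Recomputed values can differ from the slide's column in the last digit, depending on how the original figures were rounded.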
22
What About Bit Error Rates
• Uncorrectable Errors on Read (UERs)
  – Quoted uncorrectable bit error rates 10^-13 to 10^-15
  – That’s 1 error in 1TB to 1 error in 100TB
  – WOW!!!
• We moved 1.5 PB looking for errors
• Saw 5 UER events
  – 3 real, 3 of them were masked by retry
• Many controller fails and system security reboots
• Conclusion:
  – UER not a useful metric – want mean time to data loss
  – UER better than advertised.
“Empirical Measurements of Disk Failure Rates and Error Rates,” Jim Gray, Catharine van Ingen, Microsoft Technical Report MSR-TR-2005-166
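To see why 5 events counts as "better than advertised", one can compute how many uncorrectable errors the quoted rates would predict for 1.5 PB. A small sketch, assuming decimal units (1 PB = 10^15 bytes) and that errors scale linearly with bits read; the function names are illustrative.

```python
def bytes_per_error(bit_error_rate):
    """Average number of bytes read per uncorrectable error."""
    return 1.0 / bit_error_rate / 8


def expected_uer_events(bytes_moved, bit_error_rate):
    """Expected uncorrectable read errors when moving `bytes_moved` bytes."""
    return bytes_moved * 8 * bit_error_rate


if __name__ == "__main__":
    moved = 1.5e15  # 1.5 PB
    for rate in (1e-13, 1e-15):
        print(f"rate {rate:g}: one error per {bytes_per_error(rate) / 1e12:.2f} TB, "
              f"about {expected_uer_events(moved, rate):.0f} errors expected in 1.5 PB")
```

The quoted range predicts somewhere between about a dozen and about 1,200 errors for that volume, so observing 5 sits below even the optimistic end.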
23
So, You Want to Copy a Petabyte?
• Today, that’s 4,000 disks (read 2k write 2k)
• Takes ~4 hours if they run in parallel, but…
• Probably not one file.
• You will see a few UERs.
• What’s the best strategy?
• How fast can you move a Petabyte from CERN to Pasadena? Is sneaker-net fastest and cheapest?
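The "~4 hours" estimate implies a per-disk transfer rate, which this sketch derives from the slide's own numbers (1 PB spread over 2,000 source disks, all running in parallel). The rate itself is not stated on the slide, and the function name is illustrative.

```python
def implied_disk_rate_mb_per_s(total_bytes=1e15, source_disks=2000, hours=4.0):
    """Sustained MB/s each disk must deliver to finish in `hours`,
    assuming the data is spread evenly and all disks run in parallel."""
    bytes_per_disk = total_bytes / source_disks
    return bytes_per_disk / (hours * 3600) / 1e6


if __name__ == "__main__":
    print(f"500 GB per disk in 4 hours needs about "
          f"{implied_disk_rate_mb_per_s():.1f} MB/s per disk")
```

That works out to roughly 35 MB/s per disk sustained for the whole copy, before any retries caused by UERs or failed controllers.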
24
UER things I wish I knew
• Better statistics from larger farms, and more diversity.
• What is the UER on a LAN, WAN?
• What is the UER over time:
  – for a file on disk
  – for a disk
• What’s the best replication strategy?
  – Symmetric (1+1)+(1+1) or triplex (1+1) + 1