1
Storage Bricks Jim Gray
Microsoft Research
http://Research.Micrsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002
Acknowledgements:
Dave Patterson explained this to me long ago.
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.
2
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer & accounting software (tubes, not transistors)
3
10 years later
[Photo label: 1.6 meters]
4
Disk Evolution
• Capacity: 100x in 10 years
  – 1 TB 3.5” drive in 2005
  – 20 GB 1” micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!
[Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
5
Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
[Diagram labels:
  Disk Ctlr + 1 GHz cpu + 1 GB RAM
  Comm: Infiniband, Ethernet, radio…
  Applications: Web, DBMS, Files
  OS]
6
Data Gravity: Processing Moves to Transducers
smart displays, microphones, printers, NICs, disks
[Diagram: Storage, Network, and Display, each with an ASIC]
Today:
  P = 50 mips
  M = 2 MB
In a few years:
  P = 500 mips
  M = 256 MB
Processing decentralized
• Moving to data sources
• Moving to power sources
• Moving to sheet metal
? The end of computers ?
7
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get
  – several network interfaces
  – a PostScript engine
    • cpu
    • memory
    • software
    • a spooler (soon)
  – and… a print engine.
8
The Absurd Design?
• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips
[Diagram labels: Processors ~1 Tips; RAM ~1 TB; Disks ~100 TB; links of 10 TBps and 100 GBps]
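The Amdahl balance figures on this slide can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the slide's numbers (about 1 Tips of processing, 10 bytes of bus traffic and 1 bit of I/O per instruction per second), and the function name is made up for the example.

```python
def amdahl_bandwidth(ips, bus_bytes_per_ips=10, io_bits_per_ips=1):
    """Return (bus bytes/s, io bytes/s) needed to balance `ips` instructions/s."""
    bus_bytes_per_s = ips * bus_bytes_per_ips
    io_bytes_per_s = ips * io_bits_per_ips / 8  # 8 bits per byte
    return bus_bytes_per_s, io_bytes_per_s


bus, io = amdahl_bandwidth(1e12)  # ~1 Tips, as on the slide
print(f"bus: {bus / 1e12:.0f} TBps, io: {io / 1e9:.0f} GBps")
```

For 1 Tips this gives 10 TBps of bus bandwidth and 125 GBps of I/O bandwidth, the same order as the 10 TBps and 100 GBps labels in the diagram.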
9
The “Absurd” Disk
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
• Optimizations:
  – Reduce management costs
  – Caching
  – Sequential 100x faster than random
[Drive figure: 1 TB, 100 MB/s, 200 Kaps, 200$]
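The scan time and access density follow directly from the drive's numbers. This is a minimal sketch, assuming decimal units and reading "200 Kaps" as 200 kilobyte-sized accesses per second; the function names are illustrative.

```python
CAPACITY_BYTES = 1e12            # 1 TB
BANDWIDTH_BYTES_PER_S = 100e6    # 100 MB/s sequential
ACCESSES_PER_S = 200             # assumed reading of "200 Kaps"


def scan_hours(capacity_bytes, bandwidth_bytes_per_s):
    """Hours needed to read the whole disk sequentially."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600


def gb_per_access_per_second(capacity_bytes, accesses_per_s):
    """Gigabytes of stored data behind each access/second the arm can deliver."""
    return capacity_bytes / accesses_per_s / 1e9


print(f"scan time: {scan_hours(CAPACITY_BYTES, BANDWIDTH_BYTES_PER_S):.1f} hr")
print(f"1 aps per {gb_per_access_per_second(CAPACITY_BYTES, ACCESSES_PER_S):.0f} GB")
```

With these inputs the scan takes a little under 3 hours, in line with the slide's rounded 2.5 hr, and each access per second has to cover 5 GB of data.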
10
Disk = Node
• magnetic storage (1 TB)
• processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Application
  – File
  – DB2/Oracle/SQL
  – Notes/Exchange/TeamServer
  – SAP/Siebel/…
  – Quickbooks/Tivo/PC…
[Software stack diagram labels:
  OS Kernel; LAN driver; Disk driver
  File System; RPC, ...; Services; DBMS
  Applications]
11
Implications
Conventional
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
Radical
• Move app to NIC/device controller
• higher-higher level protocols: SOAP/DCOM/RMI..
• Cluster parallelism is VERY important.
[Diagram labels: Central Processor & Memory; Terabyte/s Backplane]
12
Intermediate Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2x Gbps Ethernet
• General purpose OS
• 10k$/TB to 50k$/TB
• Shared
  – Sheet metal
  – Power
  – Support/Config
  – Security
  – Network ports
• These bricks could run applications (e.g. SQL or Mail or..)
Examples:
  Snap     ~1 TB    12 x 80 GB    NAS
  NetApp   ~.5 TB   8 x 70 GB     NAS
  Maxtor   ~2 TB    12 x 160 GB   NAS
13
Example
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320 MB RAM, 3u high, 4 x 100 GB IDE drives
• $4k/TB (street), 2.5 processors/TB, 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy
Slide courtesy of Brewster Kahle, @ Archive.org
14
What if Disk Replaces Tape? How does it work?
• Backup/Restore
  – RAID (among the federation)
  – Snapshot copies (in most OSs)
  – Remote replicas (standard in DBMS and FS)
• Archive
  – Use “cold” 95% of disk space
• Interchange
  – Send computers not disks.
15
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure,
  – use other copy until failure repaired,
  – refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
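The restore-time figure and the scrub-and-refresh idea can both be sketched in a few lines. This is an illustrative toy, not a design from the talk: the `GeoPlex` class, its in-memory dictionaries, and the SHA-256 checksums are all assumptions made for the example.

```python
import hashlib


def restore_days(size_bytes, bytes_per_s):
    """Days needed to stream size_bytes at bytes_per_s."""
    return size_bytes / bytes_per_s / 86400


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class GeoPlex:
    """Toy geo-plex: two online copies of every object, each kept with a checksum."""

    def __init__(self):
        self.sites = [{}, {}]  # one dict per site: key -> (data, checksum)

    def put(self, key, data):
        for site in self.sites:
            site[key] = (data, checksum(data))

    def scrub(self):
        """Verify every copy; refresh a bad or missing copy from the good one.

        Returns the list of (site_index, key) pairs that were repaired.
        """
        repaired = []
        keys = set(self.sites[0]) | set(self.sites[1])
        for key in sorted(keys):
            good = [i for i, site in enumerate(self.sites)
                    if key in site and checksum(site[key][0]) == site[key][1]]
            if not good:
                continue  # both copies lost: nothing safe to refresh from
            for i, site in enumerate(self.sites):
                if i not in good:
                    site[key] = self.sites[good[0]][key]
                    repaired.append((i, key))
        return repaired


print(f"1 PB at 1 GBps: {restore_days(1e15, 1e9):.1f} days")
```

A petabyte at 1 GBps comes to roughly 11.6 days, which the slide rounds to 12. Running `scrub()` in a loop is the "look for errors" step; a real system would also read from the surviving copy while the repair is in progress.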
16
Archive to Disk
100 TB for 0.5M$ + 1.5 “free” petabytes
• If you have 100 TB active you need 10,000 mirrored disk arms (see TPC-C)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the “empty” 95% for archive storage.
• No extra space or extra power cost.
• Very fast access (milliseconds vs hours).
• Snapshot is read-only (software enforced)
• Makes Admin easy (saves people costs)
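The "free petabytes" arithmetic is easy to reproduce. The sketch below uses only the slide's inputs (10,000 arms, 160 GB drives, 100 TB active) and decimal units; the variable and function names are illustrative.

```python
ARMS = 10_000          # disk arms needed for the active workload
DRIVE_GB = 160         # capacity per drive
ACTIVE_GB = 100_000    # 100 TB of active data


def archive_capacity(arms, drive_gb, active_gb):
    """Return (total GB, spare GB, spare fraction) for a farm sized by arms."""
    total_gb = arms * drive_gb
    spare_gb = total_gb - active_gb
    return total_gb, spare_gb, spare_gb / total_gb


total, spare, fraction = archive_capacity(ARMS, DRIVE_GB, ACTIVE_GB)
print(f"total {total / 1e6:.1f} PB, spare {spare / 1e6:.1f} PB ({fraction:.0%} empty)")
```

Sizing the farm by arms rather than by bytes leaves 1.5 PB of the 1.6 PB unused, about 94%, which matches the slide's rounded "95%".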
17
Disk as Tape Archive
• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
• Using removable hard drives to replace tape’s function has been successful
• When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Slide courtesy of Brewster Kahle, @ Archive.org
18
Disk as Tape Interchange
• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB send media not data
  – FTP takes too long (hour/GB)
  – Bandwidth still very expensive (1$/GB)
• Writing DVD not much faster than Internet
• New technology could change this
  – 100 GB DVD @ 10 MBps would be competitive.
• Write 1 TB disk in 2.5 hrs (at 100 MBps)
• But, how does interchange work?
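A rough comparison of the slide's rates shows why media wins beyond a few gigabytes. This sketch assumes decimal units and takes the slide's "hour/GB" and "1$/GB" at face value; shipping time for the media is deliberately left out because the slide gives no figure for it.

```python
FTP_HOURS_PER_GB = 1.0         # slide: "FTP takes too long (hour/GB)"
NETWORK_DOLLARS_PER_GB = 1.0   # slide: "Bandwidth still very expensive (1$/GB)"


def transfer_hours(size_gb, rate_mb_per_s):
    """Hours to write or read size_gb at rate_mb_per_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 3600


def ftp_hours(size_gb):
    return size_gb * FTP_HOURS_PER_GB


def network_cost_dollars(size_gb):
    return size_gb * NETWORK_DOLLARS_PER_GB


for size_gb in (1, 10, 100, 1000):
    print(f"{size_gb:>5} GB: FTP {ftp_hours(size_gb):7.1f} hr, "
          f"${network_cost_dollars(size_gb):.0f} bandwidth, "
          f"disk write at 100 MBps {transfer_hours(size_gb, 100):.2f} hr")
```

At these rates a terabyte takes about 1,000 hours over FTP but under 3 hours to write to a disk, and the hypothetical 100 GB DVD at 10 MBps would fill in about the same time.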
19
Disk As Tape Interchange: What format?
• Today I send 160 GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: Ship NFS/CIFS/ODBC servers (not disks)
• Plug “disk” into LAN.
  – DHCP then file or DB server via standard interface.
  – “pull” data from server.
20
Some Questions
• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?
21
What is the Product?
• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance
[Diagram labels: power, network]
22
How Does Scale Out Work?
• Files: well known designs:
  – rooted tree partitioned across nodes
  – Automatic cooling (migration)
  – Mirrors or Chained declustering
  – Snapshots for backup/archive
• Databases: well known designs
  – Partitioning, remote replication similar to files
  – distributed query processing.
• Applications: (hypothetical)
  – Must be designed as mobile objects
  – Middleware provides object migration system
    • Objects externalize methods to migrate (== backup/restore/archive)
    • Web services seem to have key ideas (xml representation)
  – Example: eMail object is mailbox
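Chained declustering, one of the file designs named above, is simple to illustrate: node i holds the primary copy of partition i, and its backup lives on the next node in the chain. The sketch below is a minimal illustration with made-up function names, not code from any of the systems mentioned.

```python
def chained_declustering(num_nodes):
    """Map each partition to its primary node and its backup on the next node."""
    return {
        p: {"primary": p, "backup": (p + 1) % num_nodes}
        for p in range(num_nodes)
    }


def locate(partition, placement, failed=frozenset()):
    """Return a live node holding `partition`, preferring the primary."""
    copies = placement[partition]
    for role in ("primary", "backup"):
        if copies[role] not in failed:
            return copies[role]
    raise RuntimeError(f"partition {partition} is unavailable")


placement = chained_declustering(4)
print(locate(2, placement))               # primary is up
print(locate(2, placement, failed={2}))   # falls over to the next node
```

Because each node backs up its neighbor, a single failure leaves every partition reachable, and in the full scheme the extra read load can be shifted along the chain instead of landing on one node.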
23
Auto Manage Storage
• 1980 rule of thumb:
  – A DataAdmin per 10 GB, SysAdmin per mips
• 2000 rule of thumb:
  – A DataAdmin per 5 TB
  – SysAdmin per 100 clones (varies with app).
• Problem:
  – 5 TB is 50k$ today, 5k$ in a few years.
  – Admin cost >> storage cost !!!!
• Challenge:
  – Automate ALL storage admin tasks
24
Admin: TB and “guessed” $/TB
(does not include cost of application, overhead, not “substance”)
• Google:   1 : 100 TB   5k$/TB/y
• Yahoo!:   1 : 50 TB    20k$/TB/y
• DB:       1 : 5 TB     60k$/TB/y
• Wall St.: 1 : 1 TB     400k$/TB/y (reported)
• Hardware is the dominant cost only at Google.
• How can we waste hardware to save people cost?
25
How do I manage 10,000 nodes?
• You can’t manage 10,000 x (for any x).
• They manage themselves.
  – You manage exceptional exceptions.
• Auto Manage
  – Plug & Play hardware
  – Auto-load balance & placement of storage & processing
  – Simple parallel programming model
  – Fault masking
26
How do I program 10,000 nodes?
• You can’t program 10,000 x (for any x).
• They program themselves.
  – You write embarrassingly parallel programs
  – Examples: SQL, Web, Google, Inktomi, HotMail,….
  – PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto Parallelism is ESSENTIAL
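An embarrassingly parallel program has the shape of a SQL-style scan: each node works on its own partition and only small partial results are combined. The sketch below imitates that with a thread pool standing in for the nodes; the partitioning scheme and all names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Deal records round-robin into num_nodes partitions."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def local_count(part, predicate):
    """Work done independently on one node: count matching records."""
    return sum(1 for record in part if predicate(record))


def parallel_count(records, predicate, num_nodes=8):
    """Run local_count on every partition and add up the partial counts."""
    parts = partition(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = pool.map(lambda part: local_count(part, predicate), parts)
        return sum(partials)


print(parallel_count(list(range(1000)), lambda r: r % 3 == 0))
```

The caller writes only the predicate. Partitioning, scheduling, and combining are handled for them, which is the sense in which the parallelism is automatic.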
27
Summary
• Disks will become supercomputers so
  – Lots of computing to optimize the arm
  – Can put app close to the data (better modularity, locality)
  – Storage appliances (self-organizing)
• The arm/capacity tradeoff: “waste” space to save access.
  – Compression (saves bandwidth)
  – Mirrors
  – Online backup/restore
  – Online archive (vault to other drives or geoplex if possible)
• Disks do not replace tapes: storage appliances replace tapes.
• Self-organizing storage servers (file systems) (prototypes of this software exist)