Page 1: Storage Bricks
Jim Gray, Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002

Acknowledgements: Dave Patterson explained this to me long ago. Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments.

Page 2: First Disk, 1956
• IBM 305 RAMAC
• 4 MB
• 50 x 24" disks
• 1200 rpm
• 100 ms access
• 35 k$/y rent
• Included a computer & accounting software (tubes, not transistors)

Page 3: 10 Years Later
• 1.6 meters
[photo of the drive]

Page 4: Disk Evolution
• Capacity: 100x in 10 years
  – 1 TB 3.5" drive in 2005
  – 20 GB 1" micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is a super computer!
[capacity scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

Page 5: Disks Are Becoming Computers
• Smart drives
• Camera with micro-drive
• Replay / TiVo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
[diagram: disk controller + 1 GHz CPU + 1 GB RAM; comm: Infiniband, Ethernet, radio…; applications: Web, DBMS, files; OS]

Page 6: Data Gravity: Processing Moves to Transducers
Smart displays, microphones, printers, NICs, disks.
• Today: each ASIC (storage, network, display) has P = 50 mips, M = 2 MB
• In a few years: P = 500 mips, M = 256 MB
• Processing is decentralizing:
  – moving to data sources
  – moving to power sources
  – moving to sheet metal
• The end of computers?

Page 7: It's Already True of Printers (Peripheral = CyberBrick)
• You buy a printer, and you get:
  – several network interfaces
  – a PostScript engine: CPU, memory, software, a spooler (soon)
  – and… a print engine.

Page 8: The Absurd Design?
• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl's laws: bus: 10 B/ips; I/O: 1 b/ips
[diagram: processors (~1 Tips) + RAM (~1 TB) on a 10 TBps bus; disks (~100 TB) behind a 100 GBps I/O path]
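The bandwidth figures in the diagram follow directly from Amdahl's balanced-system rules quoted on the slide. A quick sanity check, assuming an aggregate 1 Tips of processing:

```python
# Amdahl's balanced-system rules, as quoted on the slide:
# bus bandwidth ~ 10 bytes per instruction/second, I/O ~ 1 bit per instruction/second.
TIPS = 1e12                    # aggregate instructions per second (~1 Tips)

bus_bw = 10 * TIPS             # bytes/s on the processor-memory bus
io_bw = (1 / 8) * TIPS         # 1 bit/ips -> bytes/s to storage

print(f"bus: {bus_bw / 1e12:.0f} TBps")   # 10 TBps, the CPU-RAM figure
print(f"io:  {io_bw / 1e9:.0f} GBps")     # 125 GBps, i.e. the ~100 GBps shown
```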

Page 9: The "Absurd" Disk
1 TB, 100 MB/s, 200 aps, 200 $
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It's a tape!
• Optimizations:
  – reduce management costs
  – caching
  – sequential is 100x faster than random
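The "it's a tape" arithmetic works out as follows, taking the slide's 1 TB / 100 MB/s / 200 accesses-per-second drive:

```python
# Why a big disk starts to look like a tape: scan time grows with capacity,
# while random accesses per second stay roughly flat.
capacity_gb = 1000             # 1 TB drive
bandwidth_mb_s = 100           # sequential bandwidth, MB/s
accesses_per_sec = 200         # random accesses per second (aps)

scan_time_hr = capacity_gb * 1000 / bandwidth_mb_s / 3600
gb_per_aps = capacity_gb / accesses_per_sec

print(f"full scan: {scan_time_hr:.1f} hours")  # ~2.8 hr (the slide rounds to 2.5)
print(f"{gb_per_aps:.0f} GB per aps")          # 5 GB per access/sec: VERY cold data
```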

Page 10: Disk = Node
• Magnetic storage (1 TB)
• Processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Applications:
  – File
  – DB2/Oracle/SQL
  – Notes/Exchange/TeamServer
  – SAP/Siebel/…
  – QuickBooks / TiVo / PC…
[software stack: applications; services, DBMS; file system, RPC, …; OS kernel, LAN driver, disk driver]

Page 11: Implications
Conventional:
• Offload device handling to the NIC/HBA
• Higher-level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and cluster parallelism is important.
Radical:
• Move the app to the NIC/device controller
• Higher-higher-level protocols: SOAP/DCOM/RMI…
• Cluster parallelism is VERY important.
[diagram: central processor & memory with a terabyte/s backplane]

Page 12: Intermediate Step: Shared Logic
• Brick with 8–12 disk drives
• 200 mips/arm (or more)
• 2x Gbps Ethernet
• General-purpose OS
• 10 k$/TB to 50 k$/TB
• Shared:
  – sheet metal
  – power
  – support/config
  – security
  – network ports
• These bricks could run applications (e.g. SQL, or mail, or …)
Examples: Snap ~1 TB (12x80 GB) NAS; NetApp ~0.5 TB (8x70 GB) NAS; Maxtor ~2 TB (12x160 GB) NAS

Page 13: Example
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4k/TB (street); 2.5 processors/TB; 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy
Slide courtesy of Brewster Kahle, @ Archive.org

Page 14: What if Disk Replaces Tape? How Does It Work?
• Backup/restore:
  – RAID (among the federation)
  – snapshot copies (in most OSs)
  – remote replicas (standard in DBMSs and file systems)
• Archive:
  – use the "cold" 95% of disk space
• Interchange:
  – send computers, not disks.

Page 15: It's Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure:
  – use the other copy until the failure is repaired,
  – refresh the lost copy from the safe copy.
• The two copies can be organized differently (e.g. one by time, one by space)
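The "12 days" figure is simple arithmetic on the stated restore rate:

```python
# Time to move a petabyte at 1 GBps: the reason archive copies must stay online.
PB = 1e15                      # bytes in a petabyte (decimal)
rate = 1e9                     # restore bandwidth, bytes/s (1 GBps)

days = PB / rate / 86400       # seconds in a day
print(f"{days:.1f} days")      # ~11.6 days, which the slide rounds to 12
```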

Page 16: Archive to Disk: 100 TB for 0.5 M$, Plus 1.5 "Free" Petabytes
• If you have 100 TB active, you need 10,000 mirrored disk arms (see tpcC)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the "empty" 95% for archive storage
• No extra space or power cost
• Very fast access (milliseconds vs. hours)
• Snapshots are read-only (software enforced)
• Makes administration easy (saves people costs)
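The "free petabytes" claim follows from sizing the array by arms rather than bytes, as the slide argues:

```python
# Arm-limited sizing: 100 TB of hot data needs 10,000 mirrored arms (per the
# slide's tpcC-style sizing), so capacity comes along almost for free.
active_tb = 100
arms = 10_000
drive_gb = 160

raw_tb = arms * drive_gb / 1000       # total raw capacity across the arms
free_tb = raw_tb - active_tb          # capacity left over for archive

print(f"raw: {raw_tb / 1000:.1f} PB")       # 1.6 PB of (mirrored) storage
print(f"free: {free_tb / 1000:.1f} PB")     # 1.5 PB, roughly the "empty" 95%
```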

Page 17: Disk as Tape Archive
• Tape is unreliable, specialized, slow, low-density, not improving fast, and expensive
• Using removable hard drives to replace tape's function has been successful
• When a "tape" is needed, the drive is put in a machine and it is online; there is no need to copy from tape before it is used
• Portable, durable, fast, and dense; media cost equals raw tape; longevity is unknown but suspected good
Slide courtesy of Brewster Kahle, @ Archive.org

Page 18: Disk as Tape Interchange
• Tape interchange is frustrating (often unreadable)
• Beyond 1–10 GB, send media, not data:
  – FTP takes too long (an hour per GB)
  – bandwidth is still very expensive (1 $/GB)
• Writing a DVD is not much faster than the Internet
• New technology could change this: a 100 GB DVD at 10 MBps would be competitive
• A 1 TB disk can be written in 2.5 hrs (at 100 MBps)
• But how does interchange work?
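The media-versus-wire comparison can be made concrete. A small sketch using the slide's rough rates (an hour and a dollar per GB over the net) as assumptions:

```python
# Sending a 1 TB disk by mail vs. pushing it over the network, at the
# slide's assumed rates of ~1 hour/GB and ~1 $/GB transferred.
def transfer_hours(gb, gb_per_hour=1.0):
    """Hours to move `gb` gigabytes over the wire."""
    return gb / gb_per_hour

def transfer_cost(gb, dollars_per_gb=1.0):
    """Bandwidth cost in dollars to move `gb` gigabytes."""
    return gb * dollars_per_gb

tb = 1000  # one 1 TB disk, in GB
print(f"{transfer_hours(tb) / 24:.0f} days over the net")  # ~42 days
print(f"{transfer_cost(tb):.0f} $ in bandwidth")           # 1000 $
# versus ~2.5 hours to write the disk at 100 MBps, plus a few days in the mail.
```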

Page 19: Disk as Tape Interchange: What Format?
• Today I send 160 GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: ship NFS/CIFS/ODBC servers, not disks.
• Plug the "disk" into the LAN:
  – DHCP, then a file or DB server via a standard interface
  – "pull" data from the server.

Page 20: Some Questions
• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I back up a PB?
• How do I restore a PB?

Page 21: What Is the Product?
• Concept: plug it in and it works!
• Music/video/photo appliance (home)
• Game appliance
• "PC"
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance
[diagram: the only connections are power and network]

Page 22: How Does Scale-Out Work?
• Files: well-known designs
  – rooted tree partitioned across nodes
  – automatic cooling (migration)
  – mirrors or chained declustering
  – snapshots for backup/archive
• Databases: well-known designs
  – partitioning and remote replication, similar to files
  – distributed query processing
• Applications (hypothetical):
  – must be designed as mobile objects
  – middleware provides an object-migration system
  – objects externalize methods to migrate (== backup/restore/archive)
  – web services seem to have the key ideas (XML representation); example: the eMail object is the mailbox
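Chained declustering, named above as a file-placement design, can be sketched briefly. A minimal illustration (node numbering and function names are illustrative, not from the talk): each node holds its own primary partition plus a copy of its left neighbor's, so after one failure reads for the lost partition shift to a single neighbor rather than overloading one mirror pair.

```python
# Minimal sketch of chained declustering across n nodes.
def chained_layout(n_nodes):
    """Map node -> (primary partition it owns, backup partition it mirrors)."""
    return {node: (node, (node - 1) % n_nodes) for node in range(n_nodes)}

def readers_for(partition, failed, n_nodes):
    """Nodes still able to serve `partition` after the nodes in `failed` die."""
    primary = partition                     # the owner
    backup = (partition + 1) % n_nodes      # right neighbor holds the copy
    return [n for n in (primary, backup) if n not in failed]

layout = chained_layout(4)
print(layout)                                   # {0: (0, 3), 1: (1, 0), 2: (2, 1), 3: (3, 2)}
print(readers_for(1, failed={1}, n_nodes=4))    # [2] -- node 2 still serves partition 1
```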

Page 23: Auto-Manage Storage
• 1980 rule of thumb:
  – a DataAdmin per 10 GB, a SysAdmin per mips
• 2000 rule of thumb:
  – a DataAdmin per 5 TB
  – a SysAdmin per 100 clones (varies with the app)
• Problem:
  – 5 TB is 50 k$ today, 5 k$ in a few years
  – admin cost >> storage cost!
• Challenge: automate ALL storage admin tasks

Page 24: Admins per TB and "Guessed" $/TB
(does not include the cost of the application or overhead, not "substance")

  Google:    1 admin : 100 TB     5 k$/TB/y
  Yahoo!:    1 admin :  50 TB    20 k$/TB/y
  DB:        1 admin :   5 TB    60 k$/TB/y
  Wall St.:  1 admin :   1 TB   400 k$/TB/y (reported)

• Hardware is the dominant cost only at Google.
• How can we waste hardware to save people cost?

Page 25: How Do I Manage 10,000 Nodes?
• You can't manage 10,000 x (for any x).
• They manage themselves:
  – you manage exceptional exceptions.
• Auto-manage:
  – plug-and-play hardware
  – auto load-balance & placement of storage & processing
  – simple parallel programming model
  – fault masking

Page 26: How Do I Program 10,000 Nodes?
• You can't program 10,000 x (for any x).
• They program themselves:
  – you write embarrassingly parallel programs
  – examples: SQL, Web, Google, Inktomi, HotMail, …
  – PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto-parallelism is ESSENTIAL.

Page 27: Summary
• Disks will become supercomputers, so:
  – there is lots of computing available to optimize the arm
  – the app can sit close to the data (better modularity, locality)
  – storage appliances can self-organize
• The arm/capacity tradeoff: "waste" space to save accesses:
  – compression (saves bandwidth)
  – mirrors
  – online backup/restore
  – online archive (vault to other drives, or a geo-plex if possible)
• It is not that disks replace tapes: storage appliances replace tapes.
• Self-organizing storage servers (file systems); prototypes of this software exist.