
1

Storage Bricks Jim Gray

Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002, Monterey, CA, 29 Jan 2002

Acknowledgements:

• Dave Patterson explained this to me long ago
• Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments

2

First Disk 1956

• IBM 305 RAMAC

• 4 MB

• 50x24” disks

• 1200 rpm

• 100 ms access

• 35k$/y rent

• Included computer & accounting software (tubes not transistors)

3

10 years later

(Photo label: 1.6 meters)

4

Disk Evolution

• Capacity: 100x in 10 years
  – 1 TB 3.5” drive in 2005
  – 20 GB 1” micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is super computer!

Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
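The "100x in 10 years" capacity claim can be restated as a yearly growth rate. The sketch below is only arithmetic on that one figure; the function names are illustrative and it assumes growth compounds at a constant rate.

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over years."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor, years):
    """Years needed to double capacity at that constant rate."""
    return years * math.log(2) / math.log(total_factor)


# Figures from the slide: 100x in 10 years.
print(f"growth per year: {annual_growth_factor(100, 10):.2f}x")
print(f"doubling time:   {doubling_time_years(100, 10):.2f} years")
```

Under that assumption, capacity grows by roughly 58% per year and doubles about every year and a half.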

5

Disks are becoming computers

• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…

Diagram labels:
Disk Ctlr + 1 GHz CPU + 1 GB RAM
Comm: Infiniband, Ethernet, radio…
Applications: Web, DBMS, Files
OS

6

Data Gravity: Processing Moves to Transducers

(smart displays, microphones, printers, NICs, disks)

Diagram: Storage, Network, and Display devices, each with an ASIC

Today: P = 50 mips, M = 2 MB

In a few years: P = 500 mips, M = 256 MB

Processing decentralized

Moving to data sources

Moving to power sources

Moving to sheet metal

? The end of computers ?

7

It’s Already True of Printers: Peripheral = CyberBrick

• You buy a printer
• You get
  – several network interfaces
  – a PostScript engine
    • cpu,
    • memory,
    • software,
    • a spooler (soon)
  – and… a print engine.

8

The Absurd Design?

• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips

Diagram: Processors (~1 Tips), RAM (~1 TB), Disks (~100 TB); link bandwidths 100 GBps and 10 TBps
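Amdahl's balance rules quoted above can be applied to the ~1 Tips processor complex to see where the two bandwidth figures plausibly come from. This is a sketch under the assumption that "B/ips" means bytes per second per instruction per second and "b/ips" means bits per second per instruction per second; the names are illustrative.

```python
BUS_BYTES_PER_IPS = 10   # bus: 10 B/ips (from the slide)
IO_BITS_PER_IPS = 1      # io: 1 b/ips (from the slide)


def balanced_bandwidths(ips):
    """Return (bus, io) bandwidth in bytes/s for a processor running ips."""
    bus_bytes_per_s = ips * BUS_BYTES_PER_IPS
    io_bytes_per_s = ips * IO_BITS_PER_IPS / 8.0  # convert bits to bytes
    return bus_bytes_per_s, io_bytes_per_s


bus, io = balanced_bandwidths(1e12)  # ~1 Tips
print(f"bus: {bus / 1e12:.0f} TBps")
print(f"io:  {io / 1e9:.0f} GBps")
```

The bus figure comes out at 10 TBps and the I/O figure at 125 GBps, the same order as the 100 GBps in the diagram. All of that traffic crosses the gap between processors and storage, which is the data movement the slide calls useless.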

9

The “Absurd” Disk

• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
• Optimizations:
  – Reduce management costs
  – Caching
  – Sequential 100x faster than random

Diagram: 1 TB, 100 MB/s, 200 Kaps, 200$
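The scan-time and access-density figures follow from the drive numbers in the diagram. The sketch below redoes that arithmetic; it assumes "200 Kaps" means 200 kilobyte-sized accesses per second and uses decimal units (1 TB = 10^12 bytes). Function names are illustrative.

```python
def scan_time_hours(capacity_bytes, bandwidth_bytes_per_s):
    """Time to read the whole drive sequentially, in hours."""
    return capacity_bytes / bandwidth_bytes_per_s / 3600.0


def gb_per_access_per_second(capacity_bytes, accesses_per_s):
    """How many GB of stored data share each access per second."""
    return (capacity_bytes / 1e9) / accesses_per_s


CAPACITY = 1e12    # 1 TB
BANDWIDTH = 100e6  # 100 MB/s
ACCESSES = 200     # accesses per second (assumed reading of "200 Kaps")

print(f"scan time: {scan_time_hours(CAPACITY, BANDWIDTH):.2f} hours")
print(f"density:   1 aps per {gb_per_access_per_second(CAPACITY, ACCESSES):.0f} GB")
```

The scan works out to about 2.8 hours, which the slide rounds to 2.5, and the density to one access per second for every 5 GB stored.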

10

Disk = Node

• magnetic storage (1TB)
• processor + RAM + LAN
• Management interface (HTTP + SOAP)
• Application execution environment
• Application
  – File
  – DB2/Oracle/SQL
  – Notes/Exchange/TeamServer
  – SAP/Siebel/…
  – Quickbooks/Tivo/PC…

Diagram labels: OS Kernel, LAN driver, Disk driver; File System, RPC, ...; Services; DBMS; Applications

11

Implications

Conventional

• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.

Radical

• Move app to NIC/device controller
• higher-higher level protocols: SOAP/DCOM/RMI..
• Cluster parallelism is VERY important.

Diagram labels: Central Processor & Memory; Terabyte/s Backplane

12

Intermediate Step: Shared Logic

• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General purpose OS
• 10k$/TB to 50k$/TB
• Shared
  – Sheet metal
  – Power
  – Support/Config
  – Security
  – Network ports
• These bricks could run applications (e.g. SQL or Mail or..)

Examples:
Snap     ~1 TB     12 x 80 GB    NAS
NetApp   ~0.5 TB   8 x 70 GB     NAS
Maxtor   ~2 TB     12 x 160 GB   NAS

13

Example

• Homogeneous machines lead to quick response through reallocation
• HP desktop machines, 320MB RAM, 3u high, 4 x 100GB IDE drives
• $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy

Slide courtesy of Brewster Kahle, @ Archive.org

14

What if Disk Replaces Tape? How does it work?

• Backup/Restore
  – RAID (among the federation)
  – Snapshot copies (in most OSs)
  – remote replicas (standard in DBMS and FS)
• Archive
  – Use “cold” 95% of disk space
• Interchange
  – Send computers not disks.

15

It’s Hard to Archive a Petabyte: It takes a LONG time to restore it.

• At 1GBps it takes 12 days!
• Store it in two (or more) places online: a geo-plex
• Scrub it continuously (look for errors)
• On failure,
  – use other copy until failure repaired,
  – refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
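The 12-day restore figure is straightforward arithmetic. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes/s) and a transfer that runs at full rate the whole time; the function name is illustrative.

```python
def restore_days(size_bytes, bytes_per_s):
    """Days needed to stream size_bytes at a sustained bytes_per_s."""
    return size_bytes / bytes_per_s / 86400.0


PETABYTE = 1e15
GBPS = 1e9

print(f"restore time: {restore_days(PETABYTE, GBPS):.1f} days")
```

That is a million seconds, or roughly 11.6 days, which is why the slide argues for keeping a second online copy rather than planning to restore.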

16

Archive to Disk: 100TB for 0.5M$ + 1.5 “free” petabytes

• If you have 100 TB active you need 10,000 mirrored disk arms (see tpcC)
• So you have 1.6 PB of (mirrored) storage (160GB drives)
• Use the “empty” 95% for archive storage.
• No extra space or extra power cost.
• Very fast access (milliseconds vs hours).
• Snapshot is read-only (software enforced)
• Makes Admin easy (saves people costs)
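The "free" capacity can be checked from the arm count and drive size. The sketch below follows the slide's apparent accounting (total capacity minus active data) and does not subtract extra space for the mirror copy of the active data; that simplification is an assumption, as are the names.

```python
def archive_capacity(arms, drive_gb, active_tb):
    """Return (total_tb, free_tb, free_fraction) for a farm of disk arms."""
    total_tb = arms * drive_gb / 1000.0
    free_tb = total_tb - active_tb
    return total_tb, free_tb, free_tb / total_tb


total, free, fraction = archive_capacity(arms=10_000, drive_gb=160, active_tb=100)
print(f"total: {total / 1000:.1f} PB")
print(f"free:  {free / 1000:.1f} PB ({fraction:.0%} of capacity)")
```

This gives 1.6 PB in total and 1.5 PB free, about 94% of capacity, which the slide rounds to 95%.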

17

Disk as Tape Archive

• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive

• Using removable hard drives to replace tape’s function has been successful

• When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.

• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.

Slide courtesy of Brewster Kahle, @ Archive.org

18

Disk as Tape Interchange

• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB send media not data
  – FTP takes too long (hour/GB)
  – Bandwidth still very expensive (1$/GB)
• Writing DVD not much faster than Internet
• New technology could change this
  – 100 GB DVD @ 10MBps would be competitive.
• Write 1TB disk in 2.5 hrs (at 100MBps)
• But, how does interchange work?
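The "send media not data" argument can be made concrete by comparing the two paths for a 1 TB transfer. The sketch uses the slide's rates (one hour per GB and 1$ per GB over the network, 100 MBps disk writes); the 24-hour shipping time is a hypothetical placeholder, not a figure from the talk, and the function names are illustrative.

```python
def network_transfer(size_gb, hours_per_gb=1.0, dollars_per_gb=1.0):
    """Return (hours, dollars) to move size_gb over the network."""
    return size_gb * hours_per_gb, size_gb * dollars_per_gb


def disk_shipment_hours(size_gb, write_mb_per_s=100.0, shipping_hours=24.0):
    """Hours to write a disk and ship it; shipping_hours is a made-up default."""
    write_hours = size_gb * 1000.0 / write_mb_per_s / 3600.0
    return write_hours + shipping_hours


net_hours, net_dollars = network_transfer(1000)
print(f"network: {net_hours:.0f} hours, ${net_dollars:.0f}")
print(f"disk:    {disk_shipment_hours(1000):.1f} hours including shipping")
```

At these rates the network path takes about six weeks and costs around a thousand dollars, while writing the disk takes under three hours, so even a slow courier wins easily.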

19

Disk As Tape Interchange: What format?

• Today I send 160GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: Ship NFS/CIFS/ODBC servers (not disks)
• Plug “disk” into LAN.
  – DHCP then file or DB server via standard interface.
  – “pull” data from server.

20

Some Questions

• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?

21

What is the Product?

• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance

Diagram labels: power, network

22

How Does Scale Out Work?

• Files: well known designs:
  – rooted tree partitioned across nodes
  – Automatic cooling (migration)
  – Mirrors or Chained declustering
  – Snapshots for backup/archive
• Databases: well known designs
  – Partitioning, remote replication similar to files
  – distributed query processing.
• Applications: (hypothetical)
  – Must be designed as mobile objects
  – Middleware provides object migration system
    • Objects externalize methods to migrate ( == backup/restore/archive)
    • Web services seem to have key ideas (xml representation)
  – Example: eMail object is mailbox
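Chained declustering, named above as one of the well-known file designs, can be sketched in a few lines. This is a minimal illustration of the usual placement rule (the primary copy of partition i lives on node i, the backup on the next node in the ring), not code from any system mentioned in the talk; the function names are hypothetical.

```python
def chained_declustering(num_nodes):
    """Map each partition to (primary_node, backup_node) around a ring."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes to hold a backup copy")
    return {p: (p, (p + 1) % num_nodes) for p in range(num_nodes)}


def surviving_copy(placement, partition, failed_nodes):
    """Return a live node holding the partition, or None if both copies are lost."""
    for node in placement[partition]:
        if node not in failed_nodes:
            return node
    return None


placement = chained_declustering(4)
print(placement)
print(surviving_copy(placement, 1, failed_nodes={1}))  # served by the backup node
```

Because each node's backup load lands on its neighbor, a single failure can be absorbed by shifting work around the chain, and data is lost only if two adjacent nodes fail together.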

23

Auto Manage Storage

• 1980 rule of thumb:
  – A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
  – A DataAdmin per 5TB
  – SysAdmin per 100 clones (varies with app).
• Problem:
  – 5TB is 50k$ today, 5k$ in a few years.
  – Admin cost >> storage cost !!!!
• Challenge:
  – Automate ALL storage admin tasks

24

Admin: TB and “guessed” $/TB (does not include cost of application, overhead, not “substance”)

Admins per TB and guessed cost:
• Google: 1 : 100TB, 5k$/TB/y
• Yahoo!: 1 : 50TB, 20k$/TB/y
• DB: 1 : 5TB, 60k$/TB/y
• Wall St.: 1 : 1TB, 400k$/TB/y (reported)

• Hardware is the dominant cost only @ Google.
• How can we waste hardware to save people cost?

25

How do I manage 10,000 nodes?

• You can’t manage 10,000 x (for any x).
• They manage themselves.
  – You manage exceptional exceptions.
• Auto Manage
  – Plug & Play hardware
  – Auto-load balance & placement storage & processing
  – Simple parallel programming model
  – Fault masking

26

How do I program 10,000 nodes?

• You can’t program 10,000 x (for any x).
• They program themselves.
  – You write embarrassingly parallel programs
  – Examples: SQL, Web, Google, Inktomi, HotMail,….
  – PVM and MPI prove it must be automatic (unless you have a PhD)!

• Auto Parallelism is ESSENTIAL
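An embarrassingly parallel program has the shape sketched below: the same function runs independently on every partition of the data and the partial results are combined at the end. This is a toy illustration only, with in-memory lists standing in for the per-disk partitions and a thread pool standing in for the nodes; the function names and sample records are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(partition, keyword):
    """Work done locally on one node: count records containing keyword."""
    return sum(1 for record in partition if keyword in record)


def parallel_count(partitions, keyword, max_workers=4):
    """Run count_matches on every partition independently, then combine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = pool.map(lambda part: count_matches(part, keyword), partitions)
        return sum(partial_counts)


# Each inner list plays the role of the data held by one disk node.
partitions = [
    ["disk brick", "tape archive"],
    ["disk node", "network port"],
    ["smart disk"],
]
print(parallel_count(partitions, "disk"))
```

The programmer writes only the per-partition function and the combining step; how the work is spread over nodes is left to the runtime, which is the kind of automatic parallelism the slide calls essential.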

27

Summary

• Disks will become supercomputers so
  – Lots of computing to optimize the arm
  – Can put app close to the data (better modularity, locality)
  – Storage appliances (self-organizing)
• The arm/capacity tradeoff: “waste” space to save access.
  – Compression (saves bandwidth)
  – Mirrors
  – Online backup/restore
  – Online archive (vault to other drives or geoplex if possible)
• Disks do not replace tapes: storage appliances replace tapes.
• Self-organizing storage servers (file systems) (prototypes of this software exist)