Transcript of Eric Barton's Lustre presentation, ISW 2008

1

Lustre

Eric Barton
Lead Engineer – Lustre Group
Sun Microsystems

2

What is Lustre?

• Shared POSIX file system, primarily for Linux today
• Key benefits
> Open source under the GPL
> POSIX-compliant
> Multi-platform and multi-vendor
> Heterogeneous networking
> Aggregates petabytes of storage
> Extremely scalable I/O performance
> Serves tens of thousands of clients
> Production-quality stability and high availability

3

Lustre deployments today

• Think “extreme computing”
• Largest market share in HPC (IDC's HPC User Forum Survey 2007)
• Adopted by the largest systems in the world
> 7 of the top 10 run Lustre, including #1
> 30% of the top 100 (www.top500.org, November 2007 list)
• Partners
> Bull, Cray, DDN, Dell, HP, Hitachi, SGI, Terascala...
• Growth in commercial deployments
> Big wins – Oil & Gas, Rich Media, ISPs, Chip Design

4

Lustre today

#Clients:    25,000 clients (Red Storm); 130,000 processes (BlueGene/L)
#Servers:    Metadata servers: 1 + failover; OSS servers: up to 450; OSTs: up to 4,000
Capacity:    Number of files: 2 billion; file system size: 32 PB; max file size: 1.2 PB
Performance: Single client or server: 2+ GB/s; BlueGene/L first week: 74M files, 175 TB written; aggregate I/O (one FS): ~130 GB/s (PNNL); pure MD operations: ~15,000 ops/second
Stability:   Software reliability on par with hardware reliability; increased failover resiliency
Networks:    Native support for many different networks, with routing
Features:    Quota, failover, POSIX, ACLs, secure ports
Varia:       Training: Levels 1, 2 & Internals; certification for Level 1

(Three of these figures are flagged WORLD RECORD on the original slide.)

5

How does it work?

[Diagram: clients (with LOV layer), MDS and OSS. Clients ↔ MDS: directory operations, file open/close metadata, and concurrency. MDS ↔ OSS: recovery, file status and file creation. Clients ↔ OSS: file I/O and file locking.]

6

Lustre Stripes Files with Objects

• Currently objects are simply files on OSS-resident file systems
• Enables parallel I/O to one file (sketch below)
> Lustre scales that to 100 GByte/sec to one file
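To make the striping idea concrete, here is a minimal Python sketch of the usual round-robin layout, mapping a file offset to an object index and an offset within that object. It is illustrative only (not Lustre code), and the stripe size and stripe count are assumed values.

```python
# Minimal sketch of round-robin striping: a file offset maps to one object
# (one OST) and an offset within that object. Not Lustre source code; the
# layout parameters below are illustrative assumptions.

STRIPE_SIZE = 1 << 20   # 1 MiB per stripe chunk (assumed)
STRIPE_COUNT = 4        # file striped over 4 objects/OSTs (assumed)

def locate(offset: int) -> tuple[int, int]:
    """Return (object index, offset within that object) for a file offset."""
    chunk = offset // STRIPE_SIZE             # which stripe-sized chunk
    obj_index = chunk % STRIPE_COUNT          # round-robin over the objects
    obj_offset = (chunk // STRIPE_COUNT) * STRIPE_SIZE + offset % STRIPE_SIZE
    return obj_index, obj_offset

if __name__ == "__main__":
    for off in (0, 1 << 20, 5 << 20):
        print(off, "->", locate(off))
```

Because consecutive chunks land on different objects, many clients (or one client with multiple streams) can read and write the same file through different OSSs in parallel.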

7

[Diagram: Lustre clients (10's – 10,000's) connect over multiple networks (TCP/IP, QSNet, Myrinet, InfiniBand, iWARP, Cray Seastar), through routers where needed, to the metadata servers (MDS 1 active, MDS 2 standby, with failover) and to the Lustre cluster I/O servers (OSS 1–7), backed by commodity storage servers or enterprise-class storage arrays and SAN fabrics. Shared storage enables failover.]

8

Vision

• Broader Adoption
> Client platform support
> pNFS
• Improved Recovery and Resilience
> AT / VBR
> ZFS
> End-to-end integrity checks
• Improved Performance and Scalability
> Network Request Scheduler
> Clustered MDS
> Flash Cache
> Metadata Write-back Cache
• Improved deployment
> Stability
> Version “smear”
> Resource Management
> HSM
• WAN
> Security
> Proxies

9

Broader Adoption

• Client platform support
> Windows
– CIFS via clustered SAMBA
– Native client
> Solaris
> OS X

10

Layered & direct pNFS

[Diagram, two configurations:
– pNFS layered on Lustre clients: pNFS clients use the pNFS file layout to reach NFSD/pNFSD servers, each running a Lustre client on the global namespace, backed by the Lustre servers (MDS & OSS).
– pNFS and Lustre servers on the Lustre/DMU storage system: pNFS clients with the pNFS file layout talk directly to an MD node and stripe nodes.]

11

ZFS

• Easier Administration
> Pooled storage model
> No volume manager
> Snapshots
• Immense Capacity
> 128-bit file system
• End-to-end data integrity
> Everything is checksummed
> Copy-on-write, transactional design
> MD block replication
> RAID-Z / Mirroring
> Resilvering

12

ZFS: End-to-End Checksums

• Checksums are separated from the data (toy sketch below)
• Entire I/O path is self-validating (uber-block)
• Prevents:
> Silent data corruption
> Corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
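The toy Python sketch below illustrates the "checksums separated from the data" idea: the checksum lives with the block pointer rather than with the block, so every read is validated against the parent's expectation. This is a simplification for illustration only, not ZFS code.

```python
# Toy sketch of "checksums separated from the data": each block pointer
# stores the checksum of the block it points to, so a read is validated
# against the parent's expectation rather than against anything stored
# alongside the (possibly corrupted) data. Illustrative only, not ZFS code.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Block:
    def __init__(self, data: bytes):
        self.data = data

class BlockPointer:
    def __init__(self, block: Block):
        self.block = block
        self.expected = checksum(block.data)   # checksum kept with the pointer

    def read(self) -> bytes:
        if checksum(self.block.data) != self.expected:
            raise IOError("checksum mismatch – corruption detected")
        return self.block.data

ptr = BlockPointer(Block(b"file contents"))
assert ptr.read() == b"file contents"
ptr.block.data = b"silently corrupted"         # e.g. a phantom/misdirected write
try:
    ptr.read()
except IOError as e:
    print(e)                                    # corruption is caught on read
```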

13

ZFS: Copy-on-Write / Transactional

[Diagram, four stages: (1) initial block tree under the uber-block; (2) changed data is written as new copies alongside the original data; (3) copy-on-write of the indirect blocks produces new pointers alongside the original pointers; (4) the uber-block is rewritten to commit the new tree.]
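A toy sketch of the copy-on-write commit described above, assuming nothing beyond the slide: changed blocks and their ancestors are written as new copies, and the update becomes visible only when the single root reference (the uber-block) is swapped. Illustrative Python, not ZFS code.

```python
# Toy sketch of copy-on-write commit: changed blocks and their ancestors are
# written as new copies, sharing unchanged subtrees with the original, and the
# change becomes visible only when the single root ("uber-block") reference is
# swapped. Illustrative only, not ZFS code.

def cow_update(tree, path, value):
    """Return a new tree that shares all unchanged subtrees with the original."""
    if not path:
        return value                         # leaf: the new data block
    new = dict(tree)                         # copy only the node on the path
    new[path[0]] = cow_update(tree[path[0]], path[1:], value)
    return new

uber_block = {"dir": {"a.txt": "old", "b.txt": "keep"}}
new_root = cow_update(uber_block, ["dir", "a.txt"], "new")   # nothing overwritten yet
uber_block = new_root                        # atomic "rewrite the uber-block" step
print(uber_block)                            # {'dir': {'a.txt': 'new', 'b.txt': 'keep'}}
```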


14

Request Visualisation


15

Request Visualisation

16

Network Request Scheduler

• Today, requests processed in FIFO order
> Only as fair as the network
> Over-reliance on disk elevator
• NRS will re-order requests on arrival (sketch below)
> Enforce fairness
> Re-order before bulk buffers assigned
> Work with block allocator
• 2nd-generation NRS will coordinate servers
> QoS
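As a rough illustration of the re-ordering idea (not the actual NRS implementation), the sketch below queues incoming requests per client and dispatches them round-robin, so service order no longer depends purely on arrival order. All names are made up for the example.

```python
# Minimal sketch of one fair-scheduling policy: instead of serving requests in
# strict arrival (FIFO) order, queue them per client and dispatch round-robin
# so a burst from one client cannot starve the rest.
# Illustrative only – not Lustre's NRS implementation.
from collections import OrderedDict, deque

class FairScheduler:
    def __init__(self):
        self.queues = OrderedDict()          # client id -> deque of requests

    def enqueue(self, client: str, request: str) -> None:
        self.queues.setdefault(client, deque()).append(request)

    def dispatch(self):
        """Yield requests round-robin across clients."""
        while self.queues:
            client, q = next(iter(self.queues.items()))
            self.queues.move_to_end(client)  # rotate this client to the back
            yield q.popleft()
            if not q:
                del self.queues[client]

s = FairScheduler()
for i in range(3):
    s.enqueue("client-A", f"A{i}")           # a burst from one client
s.enqueue("client-B", "B0")
print(list(s.dispatch()))                    # ['A0', 'B0', 'A1', 'A2'] – B is not starved
```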

17

Clustered Metadata

[Diagram: clients, a load-balanced MDS server pool, and OSS server pools.]

• Previous scalability, plus:
> Enlarging the MD pool enhances NetBench/SpecFS and client scalability
> Limits: 100's of billions of files, millions of ops/sec

18

Flash Cache

• Exploit the storage hardware revolution
> Very high bandwidth available from flash
• Flash Cache OSTs
> Capacity: ~= RAM of the cluster
> Cost: a fraction of the cost of the RAM of the cluster

19

Flash Cache interactions

[Diagram: a Flash Cache OSS in front of traditional OSSs.]

• WRITE – all writes go to the Flash Cache
• READ from a traditional OSS – forces a Flash Cache flush first
• DRAIN – puts data in its final position

20

Flash Cache

• Better cluster utilisation (worked example below)
> Compute/checkpoint cycle of 1 hour
> Cluster writes the checkpoint to flash in 10 mins
> Flash drains to disk in 50 mins
> Disk needs only ~1/5 of the bandwidth, so roughly 1/5 of the disk servers
• Lustre manages file system coherency
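The worked example below spells out the arithmetic behind the bullets above; the checkpoint size is a hypothetical figure chosen only to make the ratio concrete.

```python
# Worked example of the checkpoint arithmetic above. The checkpoint size is a
# hypothetical figure chosen only to make the ratio concrete.
checkpoint_tb   = 100          # TB per checkpoint (assumed)
write_window_s  = 10 * 60      # cluster writes the checkpoint to flash in 10 min
drain_window_s  = 50 * 60      # flash drains to disk over the remaining 50 min

flash_bw = checkpoint_tb / write_window_s   # bandwidth the flash tier must absorb
disk_bw  = checkpoint_tb / drain_window_s   # bandwidth the disk tier must sustain

print(f"flash tier: {flash_bw*1e3:.0f} GB/s, disk tier: {disk_bw*1e3:.0f} GB/s")
print(f"disk bandwidth (and hence disk server count) ratio: {disk_bw/flash_bw:.2f}")
# -> 0.20, i.e. roughly 1/5 of the disk servers for the same checkpoint cycle
```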

21

Metadata WBC

• Goal & problem:
> Disk file systems make updates in memory
> Network file systems do not – metadata ops require RPCs
> The Lustre WBC should only require synchronous RPCs for cache misses

22

MD ops today

[Diagram: for each operation the client sends an RPC and waits. mkdir → MDS_MKDIR → mdt_reint → reply → update VFS; create → MDS_CREAT → mdt_reint → reply → update VFS.]

• 1 RPC per MD operation
• MD RPCs serialized

23

MD ops with WBC

[Diagram: the client executes mkdir and create locally (update VFS, cache); a later write-out sends one batched MDS_REINT* RPC, which the server executes as mdt_reint* (mdt_mkdir, mdt_create) before replying.]

• Operations executed locally
• Batched RPC (sketch below)
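A toy sketch of the contrast between slides 22 and 23: operations are executed locally and queued, then shipped in one batched RPC at write-out. It is illustrative only; the class and method names are invented, not the Lustre WBC interfaces.

```python
# Toy sketch of the write-back-cache idea: metadata updates are applied locally
# and queued, then shipped to the server as one batched RPC instead of one
# synchronous RPC per operation. Illustrative only – names are made up.
class Server:
    def reint_batch(self, ops):
        for op, path in ops:                  # replayed in order on the server
            print(f"server: {op} {path}")

class WBCClient:
    def __init__(self, server: Server):
        self.server = server
        self.cache = []                       # locally executed, not yet shipped

    def mkdir(self, path):                    # executed locally, no RPC yet
        self.cache.append(("mkdir", path))

    def create(self, path):                   # executed locally, no RPC yet
        self.cache.append(("create", path))

    def write_out(self):                      # one batched RPC for all cached ops
        self.server.reint_batch(self.cache)
        self.cache = []

c = WBCClient(Server())
c.mkdir("/a"); c.create("/a/f1"); c.create("/a/f2")   # 0 RPCs so far
c.write_out()                                          # 1 RPC carrying 3 ops
```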

24

WBC Benefits

• Dramatically improved MD performance
> More efficient network usage
– Batched transfer
– Latency hiding
– WAN
> More concurrency on the client
> More efficient execution on the server
• Disconnected-mode operation

25

WBC Challenges

• How to cache
> Log / cumulative changes
• Dependent operations
> Cross-directory rename
> Creation of name and file
• Recovery
• Efficient resource leasing is critical
> Allocation grants
> Subtree locks
• Security


26

Metadata WBC - epochs

27

Metadata WBC

• Key elements of the design
> Clients can determine file identifiers for new files (toy sketch below)
> MD updates cached on the client
> Reintegration in parallel on clustered MD servers
> Sub-tree locks – enlarge lock granularity
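One way to read "clients can determine file identifiers for new files" is that the server grants a client a range of identifiers up front, so creating files needs no per-file round trip. The sketch below illustrates that idea in simplified form; it is an assumption-laden toy, not Lustre's FID/sequence code.

```python
# Toy sketch: the server grants a client an exclusive block of identifiers, so
# the client can name new files locally without a round trip per create.
# Illustrative simplification – not Lustre's FID/sequence implementation.
class FidServer:
    def __init__(self, width: int = 1 << 10):
        self.next_start = 0
        self.width = width

    def grant_range(self):
        """Hand out an exclusive block of IDs: [start, start + width)."""
        start = self.next_start
        self.next_start += self.width
        return start, start + self.width

class FidClient:
    def __init__(self, server: FidServer):
        self.server = server
        self.lo, self.hi = server.grant_range()   # one RPC covers many creates

    def new_fid(self) -> int:
        if self.lo == self.hi:                    # range exhausted: ask again
            self.lo, self.hi = self.server.grant_range()
        fid, self.lo = self.lo, self.lo + 1
        return fid

srv = FidServer()
c = FidClient(srv)
print([c.new_fid() for _ in range(3)])            # [0, 1, 2] – no per-create RPC
```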

28

Uses of the WBC

• HPC
> I/O forwarding makes Lustre clients I/O call servers
> These servers can run on WBC clients
• Exa-scale clusters
> WBC enables last-minute resource allocation
• WAN Lustre
> Eliminate latency from wide-area use for updates
• HPCS
> Dramatically increase small-file performance

29

Lustre with I/O forwarding

[Diagram: forwarding clients connect to FW server / Lustre client nodes, which connect to the Lustre servers.]

• FW servers should be Lustre WBC-enabled clients

30

Migration – many uses

• Between ext3 / ZFS servers
• For space rebalancing
• To empty servers and replace them
• In conjunction with HSM
• To manage caches & replicas

31

Migration

[Diagram: MDS Pool A and OSS Pool A migrate to MDS Pool B and OSS Pool B through a virtual migrating MDS pool and a virtual migrating OSS pool, driven by a coordinator and data-moving agents.]

32

General-purpose replication

• Driven by major content distribution networks
> DoD, ISPs
> Keep multi-petabyte file systems in sync
• Implementing scalable synchronization (sketch below)
> Changelog-based
> Works on live file systems
> No scanning, immediate resume, parallel
• Many other applications
> Search, basic server network striping

33

Caches / Proxies

• Many variants
> HSM – the Lustre cluster is a proxy cache for 3rd-tier storage
> Collaborative read cache
– BitTorrent-style reading, or
– When concurrency increases, use other OSSs as proxies
> Wide-area cache – repeated reads come from the cache
• Technical elements
> Migrate data between storage pools
> Re-validate cached data with versions
> Hierarchical management of consistency

34

Collaborative cache

[Diagram: Lustre clients and a row of OSSs.]

• Scenario: all clients read one file
• Initial reads come from the primary OSS
• With increasing load, reads are redirected and other OSSs act as caches

35

Proxy clusters

[Diagram: master Lustre servers serve local clients; remote clients reach proxy Lustre servers that cache data from the master.]

• Local performance after the first read

36

Summary

• Lustre is high-performance and scalable enough to meet the HPC petascale demands of today
• Sun is leveraging Lustre and other open source technologies to deliver a comprehensive Linux HPC solution
• We will continue to push the limits of scalability and performance with Lustre to address the exascale demands of the coming decade
> Trillions of files
> TB/s throughput
> Millions of clients

37

Eric Barton
[email protected]