Lustre
Eric Barton
Lead Engineer – Lustre Group
Sun Microsystems
What is Lustre?
• Shared POSIX file system, primarily for Linux today
• Key benefits
> Open source under the GPL
> POSIX-compliant
> Multi-platform and multi-vendor
> Heterogeneous networking
> Aggregates petabytes of storage
> Extremely scalable I/O performance
> Serves tens of thousands of clients
> Production-quality stability and high availability
Lustre deployments today
• Think “extreme computing”
• Largest market share in HPC (IDC's HPC User Forum Survey, 2007)
• Adopted by the largest systems in the world
> 7 of the top 10 run Lustre, including #1
> 30% of the top 100 (www.top500.org, November 2007 list)
• Partners
> Bull, Cray, DDN, Dell, HP, Hitachi, SGI, Terascala...
• Growth in commercial deployments
> Big wins – Oil & Gas, Rich Media, ISPs, Chip Design
Lustre today
#Clients: 25,000 clients (Red Storm); 130,000 processes (BlueGene/L)
#Servers: Metadata servers: 1 + failover; OSS servers: up to 450; OSTs: up to 4,000
Capacity: Number of files: 2 billion; file system size: 32 PB; max file size: 1.2 PB
Performance: Single client or server: 2 GB/s+; BlueGene/L, first week: 74M files, 175 TB written; aggregate I/O (one FS): ~130 GB/s (PNNL); pure MD operations: ~15,000 ops/second
Stability: Software reliability on par with hardware reliability; increased failover resiliency
Networks: Native support for many different networks, with routing
Features: Quota, failover, POSIX, ACL, secure ports
Varia: Training: Level 1, 2 & Internals; certification for Level 1
(Several of these figures are marked as world records on the slide.)
How does it work?
[Diagram: Lustre clients (with the LOV layer) talk to the MDS for directory operations, file open/close metadata and concurrency; the MDS and OSS exchange recovery, file status and file creation messages; file I/O and file locking go directly between clients and the OSS.]
Lustre Stripes Files with Objects
• Currently objects are simply files on OSS-resident file systems
• Enables parallel I/O to one file
> Lustre scales that to 100 GByte/sec to one file
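The mapping from file offsets to objects is essentially RAID-0 over objects. A minimal sketch, assuming a fixed stripe size and stripe count; the helper below is illustrative, not Lustre's layout code:

```python
# Minimal sketch of RAID-0-style striping of a file over objects.
# stripe_size, stripe_count and locate() are illustrative assumptions,
# not Lustre's actual layout interface.

def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file offset to (object index, offset within that object)."""
    stripe_no = offset // stripe_size           # which stripe the byte falls in
    obj_index = stripe_no % stripe_count        # object (OST) holding that stripe
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset

# Consecutive stripes land on successive objects, so a large transfer can be
# issued to all objects in parallel.
for off in (0, 1 << 20, 2 << 20, 4 << 20):
    print(hex(off), "->", locate(off))
```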
[Diagram: Lustre clients (10's – 10,000's) reach the servers over multiple networks (TCP/IP, QSNet, Myrinet, InfiniBand, iWARP, Cray Seastar) and routers. Metadata servers MDS 1 (active) and MDS 2 (standby) share storage, which enables failover. The Lustre cluster I/O servers (OSS 1–7) run on commodity storage servers or enterprise-class storage arrays and SAN fabrics.]
Vision
• Broader Adoption
> Client platform support
> pNFS
• Improved Recovery and Resilience
> AT / VBR
> ZFS
> End-to-end integrity checks
• Improved Performance and Scalability
> Network Request Scheduler
> Clustered MDS
> Flash Cache
> Metadata Write-back Cache
• Improved Deployment
> Stability
> Version “smear”
> Resource Management
> HSM
• WAN
> Security
> Proxies
Broader Adoption
• Client platform support
> Windows
- CIFS via clustered Samba
- Native client
> Solaris
> OS X
Layered & direct pNFS
[Diagram, left – pNFS layered on Lustre clients: pNFS clients use the pNFS file layout to reach NFSD/pNFSD servers running the Lustre client file system over the global namespace, backed by the Lustre servers (MDS & OSS).
Diagram, right – pNFS and Lustre servers on a Lustre / DMU storage system: pNFS clients use the pNFS file layout to talk directly to an MD node and stripe nodes.]
ZFS
• Easier administration
> Pooled storage model
> No volume manager
> Snapshots
• Immense capacity
> 128-bit file system
• End-to-end data integrity
> Everything is checksummed
> Copy-on-write, transactional design
> MD block replication
> RAID-Z / mirroring
> Resilvering
ZFS: End-to-End Checksums
• Checksums are separated from the data
• The entire I/O path is self-validating (uber-block)
• Prevents:
> Silent data corruption
> Corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
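A minimal sketch of the idea (illustrative only, not ZFS code): the checksum lives with the block pointer that references the data, so corruption anywhere along the I/O path is detected when the pointer is followed:

```python
# Sketch: the checksum is stored with the parent pointer, separated from the
# data block it describes.  Illustrative only; not ZFS's on-disk format.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class BlockPointer:
    def __init__(self, data: bytes):
        self.data = data               # stands in for the on-disk block
        self.cksum = checksum(data)    # held by the parent, not next to the data

    def read(self) -> bytes:
        if checksum(self.data) != self.cksum:
            raise IOError("checksum mismatch: corruption detected on read")
        return self.data

bp = BlockPointer(b"file contents")
bp.data = b"phantom write / bit rot"   # simulate silent corruption on the device
try:
    bp.read()
except IOError as err:
    print(err)                         # the bad block is rejected, not returned
```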
ZFS: Copy-on-Write / Transactional
[Diagram, four stages: (1) initial block tree under the uber-block; (2) new data is written as a copy of the changed blocks; (3) copy-on-write of the indirect blocks creates new pointers alongside the original pointers; (4) the uber-block is rewritten to point at the new tree.]
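A minimal sketch of the copy-on-write update (illustrative, not the DMU implementation): modified blocks are written to fresh locations, indirect blocks are copied bottom-up, and the change only becomes visible when the uber-block pointer is switched:

```python
# Sketch of a copy-on-write block tree: nothing is modified in place; the new
# tree shares unchanged subtrees with the old one and becomes live only when
# the top-level (uber-block) pointer is swapped.  Illustrative only.

class Node:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children or []

def cow_update(node, path, new_value):
    """Return a new root reflecting the change, leaving the old tree intact."""
    if not path:
        return Node(new_value, node.children)        # new copy of the data block
    i = path[0]
    new_children = list(node.children)               # copy the indirect block
    new_children[i] = cow_update(node.children[i], path[1:], new_value)
    return Node(node.value, new_children)            # new pointers, old siblings shared

old_uberblock = Node(children=[Node("A"), Node("B")])
new_uberblock = cow_update(old_uberblock, [1], "B'")  # the atomic switch point
print(old_uberblock.children[1].value, new_uberblock.children[1].value)   # B B'
```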
Request Visualisation
[Two slides of request-visualisation graphics, not reproduced here.]
Network Request Scheduler
• Today, requests are processed in FIFO order
> Only as fair as the network
> Over-reliance on the disk elevator
• NRS will re-order requests on arrival (sketched below)
> Enforce fairness
> Re-order before bulk buffers are assigned
> Work with the block allocator
• 2nd-generation NRS will coordinate servers
> QoS
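As a rough illustration of the fairness goal (a sketch only; it does not show the actual NRS policies), re-ordering arrivals so clients are served round-robin rather than strictly FIFO looks like this:

```python
# Sketch: serve clients round-robin instead of strict FIFO, so one client with
# a deep backlog cannot starve the others.  Policy and names are illustrative.
from collections import defaultdict, deque

arrivals = [("clientA", "req1"), ("clientA", "req2"), ("clientA", "req3"),
            ("clientB", "req1"), ("clientC", "req1")]

queues = defaultdict(deque)          # one queue per client, filled on arrival
for client, req in arrivals:
    queues[client].append(req)

service_order = []
while any(queues.values()):
    for client in list(queues):      # one request per client per round
        if queues[client]:
            service_order.append((client, queues[client].popleft()))

print(service_order)   # A, B and C each get a turn before A's backlog drains
```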
Clustered Metadata
[Diagram: clients, a load-balanced MDS server pool, and OSS server pools.]
• Previous scalability +
• Enlarge the MD pool: enhance NetBench/SpecFS and client scalability
• Limits: 100's of billions of files, M-ops / sec
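One way to picture the load balancing (a sketch under the assumption of hash-based placement; not necessarily Lustre's clustered-MDS policy) is to spread new entries over the MDS pool by hashing the parent directory and name:

```python
# Sketch: distribute metadata operations over an MDS pool by hashing the
# (parent, name) pair.  The placement policy here is an assumption for
# illustration, not Lustre's clustered-MDS algorithm.
import zlib

MDS_POOL = ["mds0", "mds1", "mds2", "mds3"]

def mds_for(parent_fid: int, name: str) -> str:
    key = f"{parent_fid:#x}/{name}".encode()
    return MDS_POOL[zlib.crc32(key) % len(MDS_POOL)]

for name in ("job.0001", "job.0002", "job.0003", "job.0004"):
    print(name, "->", mds_for(0x200000401, name))
```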
Flash Cache
• Exploit the storage hardware revolution
> Very high bandwidth available from flash
• Flash Cache OSTs
> Capacity: ~= RAM of the cluster
> Cost: a fraction of the cost of the cluster's RAM
Flash Cache interactions
[Diagram: a Flash Cache OSS in front of traditional OSSs.]
• READ: the OSS forces a Flash Cache flush first
• WRITE: all writes go to the Flash Cache
• DRAIN: data is put in its final position
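A minimal sketch of those three interactions (illustrative only; class and method names are assumptions): writes land on the flash tier, a read served from a traditional OSS forces the cached copy to be flushed first, and a background drain moves data to its final position:

```python
# Sketch of the READ / WRITE / DRAIN interactions described above.
# Purely illustrative; not Lustre's flash-cache implementation.

class FlashCacheTier:
    def __init__(self):
        self.flash = {}     # object -> data still held on Flash Cache OSTs
        self.disk = {}      # object -> data in its final position on a trad OSS

    def write(self, obj, data):
        self.flash[obj] = data              # WRITE: all writes go to flash

    def read(self, obj):
        if obj in self.flash:
            self._flush(obj)                # READ: OSS forces a flush first
        return self.disk[obj]

    def drain(self):
        for obj in list(self.flash):        # DRAIN: background move to disk
            self._flush(obj)

    def _flush(self, obj):
        self.disk[obj] = self.flash.pop(obj)

tier = FlashCacheTier()
tier.write("ckpt.000", b"checkpoint data")
print(tier.read("ckpt.000"))                # flushed to disk, then served
```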
Flash Cache
• Better cluster utilisation
> Compute/checkpoint cycle of 1 hour
> Cluster writes the checkpoint to flash in 10 minutes
> Flash drains to disk in 50 minutes
> Roughly 1/5th as many disk servers are needed, since the disks absorb the checkpoint over 50 minutes instead of 10
• Lustre manages file system coherency
Metadata WBC
• Goal & problem:
> Disk file systems make updates in memory
> Network file systems do not – metadata ops require RPCs
> The Lustre WBC should only require synchronous RPCs for cache misses
MD ops today
[Diagram: for each operation the client issues an RPC – mkdir sends MDS_MKDIR, create sends MDS_CREAT; the server's mdt_reint handler applies the update and replies, after which the VFS is updated.]
• 1 RPC per MD operation
• MD RPCs are serialized
MD ops with WBC
[Diagram: mkdir and create update the VFS on the client and are held in its cache; a later write-out sends a batched MDS_REINT* RPC, which the server's mdt_reint* handler applies as mdt_mkdir and mdt_create before replying.]
• Operations executed locally
• Batched RPC
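A minimal sketch of the difference (the RPC names and batching shape are assumptions): today each metadata operation costs one serialized round trip, whereas with the WBC the operations are applied locally and reintegrated in a single batched RPC:

```python
# Sketch: one synchronous RPC per metadata operation versus local execution
# followed by a single batched write-out.  Illustrative only.

rpc_count = 0

def rpc(ops):
    """Stand-in for an MDS round trip carrying one or more updates."""
    global rpc_count
    rpc_count += 1
    return [f"server applied {op}" for op in ops]

ops = ["mkdir d", "create d/f1", "create d/f2"]

# Today: each operation is its own serialized round trip.
for op in ops:
    rpc([op])
print("without WBC:", rpc_count, "RPCs")

# With the WBC: operations update the client cache, then ship together.
rpc_count = 0
cache = list(ops)        # executed locally against the client's cached namespace
rpc(cache)               # single batched reintegration RPC
print("with WBC:", rpc_count, "RPC")
```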
WBC Benefits
• Dramatically improved MD performance
> More efficient network usage
– Batched transfer
– Latency hiding
– WAN
> More concurrency on the client
> More efficient execution on the server
• Disconnected-mode operation
WBC Challenges
• How to cache
> Log / cumulative changes
• Dependent operations
> Cross-directory rename
> Creation of name and file
• Recovery
• Efficient resource leasing is critical
> Allocation grants
> Subtree locks
• Security
Metadata WBC - epochs
Metadata WBC
• Key elements of the design
> Clients can determine file identifiers for new files (sketched below)
> MD updates cached on the client
> Reintegration in parallel on clustered MD servers
> Sub-tree locks – enlarge lock granularity
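For the first element, a minimal sketch (the field names and the granting step are assumptions): a client holding a pre-granted identifier range can hand out FIDs for new files locally, so cached creates never wait on the server:

```python
# Sketch: a client with a pre-granted identifier range assigns FIDs to new
# files without a round trip per create.  The (sequence, oid) shape and the
# granting protocol are assumptions for illustration.

class FidRange:
    def __init__(self, sequence, width=1 << 20):
        self.sequence = sequence      # granted once by the metadata server
        self.next_oid = 1
        self.limit = width

    def new_fid(self):
        if self.next_oid >= self.limit:
            raise RuntimeError("range exhausted; request another from the MDS")
        fid = (self.sequence, self.next_oid)
        self.next_oid += 1
        return fid                    # usable immediately in cached MD updates

client_range = FidRange(sequence=0x200000401)
print(client_range.new_fid(), client_range.new_fid())
```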
Uses of the WBC
• HPC
> With I/O forwarding, Lustre clients act as I/O call servers
> These servers can run on WBC clients
• Exa-scale clusters
> WBC enables last-minute resource allocation
• WAN Lustre
> Eliminate wide-area latency for updates
• HPCS
> Dramatically increase small-file performance
Lustre with I/O forwarding
[Diagram: forwarding clients connect to FW server / Lustre client nodes, which in turn talk to the Lustre servers.]
• FW servers should be Lustre WBC-enabled clients
Migration – many uses
• Between ext3 / ZFS servers
• For space rebalancing
• To empty servers and replace them
• In conjunction with HSM
• To manage caches & replicas
Migration
[Diagram: a coordinator drives data-moving agents that migrate between MDS pool A / OSS pool A and MDS pool B / OSS pool B, presented as a virtual migrating MDS pool and a virtual migrating OSS pool.]
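A minimal sketch of the coordinator/agent split (illustrative only; it does not reflect Lustre's migration or HSM interfaces): the coordinator fans per-file jobs out to data-moving agents, which copy data from the source pool to the target pool:

```python
# Sketch: a coordinator dispatches per-file migration jobs to data-moving
# agents, which copy data from pool A to pool B.  Names are illustrative.
from concurrent.futures import ThreadPoolExecutor

pool_a = {"file1": b"aaa", "file2": b"bbb", "file3": b"ccc"}   # source OSS pool
pool_b = {}                                                     # target OSS pool

def agent_copy(name):
    pool_b[name] = pool_a[name]       # one agent moves one file's objects
    return name

with ThreadPoolExecutor(max_workers=2) as agents:   # the coordinator fans out work
    for done in agents.map(agent_copy, list(pool_a)):
        print("migrated", done)

print("in sync:", pool_a == pool_b)
```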
General purpose replication
• Driven by major content distribution networks
> DoD, ISPs
> Keep multi-petabyte file systems in sync
• Implementing scalable synchronization (sketched below)
> Changelog based
> Works on live file systems
> No scanning, immediate resume, parallel
• Many other applications
> Search, basic server network striping
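A minimal sketch of changelog-driven synchronization (the record format and apply logic are assumptions): the replica replays recorded namespace changes instead of scanning the tree, and persisting the last applied index is what allows immediate resume:

```python
# Sketch: replay a changelog on the replica instead of scanning the live
# file system.  The record format here is an assumption for illustration.

changelog = [
    (1, "CREAT", "/proj/a.dat"),
    (2, "MKDIR", "/proj/results"),
    (3, "RENAME", ("/proj/a.dat", "/proj/results/a.dat")),
]

replica = set()
last_applied = 0

def apply(op, arg):
    if op in ("CREAT", "MKDIR"):
        replica.add(arg)
    elif op == "RENAME":
        old, new = arg
        replica.discard(old)
        replica.add(new)

for index, op, arg in changelog:
    apply(op, arg)
    last_applied = index      # persisting this is what allows immediate resume

print(sorted(replica), "| resume after record", last_applied)
```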
Caches/Proxies
• Many variants
> HSM – a Lustre cluster is a proxy cache for 3rd-tier storage
> Collaborative read cache
– BitTorrent-style reading, or
– when concurrency increases, use other OSSs as proxies
> Wide-area cache – repeated reads come from the cache
• Technical elements
> Migrate data between storage pools
> Re-validate cached data with versions
> Hierarchical management of consistency
Collaborative cache
• Scenario: all clients read one file
> Initial reads come from the primary OSS
> With increasing load, reads are redirected
> Other OSSs act as caches
[Diagram: Lustre clients reading from a row of OSSs.]
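A minimal sketch of the redirection (the threshold and placement are illustrative assumptions): the first readers hit the primary OSS, and once concurrency passes a threshold further readers are spread over the caching OSSs:

```python
# Sketch: redirect readers to caching OSSs once the primary is busy enough.
# The threshold and the round-robin spread are illustrative assumptions.

PRIMARY_OSS = "oss0"
CACHE_OSSES = ["oss1", "oss2", "oss3"]
REDIRECT_THRESHOLD = 2        # concurrent readers served by the primary alone

def pick_server(reader_index):
    if reader_index < REDIRECT_THRESHOLD:
        return PRIMARY_OSS
    return CACHE_OSSES[reader_index % len(CACHE_OSSES)]

for reader in range(6):
    print(f"client{reader} reads from {pick_server(reader)}")
```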
Proxy clusters
[Diagram: remote clients are served by proxy Lustre servers, which sit in front of the master Lustre servers and their local clients – giving local performance after the first read.]
Summary
• Lustre is high-performance and scalable enough to meet the HPC petascale demands of today
• Sun is leveraging Lustre and other open source technologies to deliver a comprehensive Linux HPC solution
• We will continue to push the limits of scalability and performance with Lustre to address the exascale demands of the coming decade
> Trillions of files
> TB/s throughput
> Millions of clients
Eric Barton
eeb@sun.com