Lustre
Eric Barton
Lead Engineer – Lustre Group
Sun Microsystems
What is Lustre?
• Shared POSIX file system, primarily for Linux today
• Key benefits
> Open source under the GPL
> POSIX-compliant
> Multi-platform and multi-vendor
> Heterogeneous networking
> Aggregates petabytes of storage
> Extremely scalable I/O performance
> Serves tens of thousands of clients
> Production-quality stability and high availability
Lustre deployments today
• Think “extreme computing”
• Largest market share in HPC (IDC's HPC User Forum Survey, 2007)
• Adopted by the largest systems in the world
> 7 of the top 10 run Lustre, including #1
> 30% of the top 100 (www.top500.org, November 2007 list)
• Partners
> Bull, Cray, DDN, Dell, HP, Hitachi, SGI, Terascala...
• Growth in commercial deployments
> Big wins – Oil & Gas, Rich Media, ISPs, Chip Design
Lustre today
#Clients: 25,000 clients (Red Storm); 130,000 processes (BlueGene/L)
#Servers: Metadata servers: 1 + failover; OSS servers: up to 450; OSTs: up to 4,000
Capacity: Number of files: 2 billion; file system size: 32 PB; max file size: 1.2 PB
Performance: Single client or server: 2 GB/s+; BlueGene/L, first week: 74M files, 175 TB written; aggregate I/O (one FS): ~130 GB/s (PNNL); pure MD operations: ~15,000 ops/second
Stability: Software reliability on par with hardware reliability; increased failover resiliency
Networks: Native support for many different networks, with routing
Features: Quota, failover, POSIX, ACL, secure ports
Varia: Training: Level 1, 2 & Internals; certification for Level 1
(Several of these figures are marked as world records on the slide.)
How does it work?
[Diagram: Lustre clients (with the LOV layer) talk to the MDS for directory operations, file open/close metadata and concurrency; the MDS and OSS exchange recovery, file status and file creation messages; file I/O and file locking go directly between clients and the OSS.]
Lustre Stripes Files with Objects
• Currently objects are simply files on OSS-resident file systems
• Enables parallel I/O to one file
> Lustre scales that to 100 GByte/sec to one file
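The mapping from file offsets to objects is essentially RAID-0 over objects. A minimal sketch, assuming a fixed stripe size and stripe count; the helper below is illustrative, not Lustre's layout code:

```python
# Minimal sketch of RAID-0-style striping of a file over objects.
# stripe_size, stripe_count and locate() are illustrative assumptions,
# not Lustre's actual layout interface.

def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file offset to (object index, offset within that object)."""
    stripe_no = offset // stripe_size           # which stripe the byte falls in
    obj_index = stripe_no % stripe_count        # object (OST) holding that stripe
    obj_offset = (stripe_no // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset

# Consecutive stripes land on successive objects, so a large transfer can be
# issued to all objects in parallel.
for off in (0, 1 << 20, 2 << 20, 4 << 20):
    print(hex(off), "->", locate(off))
```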
[Diagram: Lustre clients (10's – 10,000's) reach the servers over multiple networks (TCP/IP, QSNet, Myrinet, InfiniBand, iWARP, Cray Seastar) and routers. Metadata servers MDS 1 (active) and MDS 2 (standby) share storage, which enables failover. The Lustre cluster I/O servers (OSS 1–7) run on commodity storage servers or enterprise-class storage arrays and SAN fabrics.]
Vision
• Broader Adoption
> Client platform support
> pNFS
• Improved Recovery and Resilience
> AT / VBR
> ZFS
> End-to-end integrity checks
• Improved Performance and Scalability
> Network Request Scheduler
> Clustered MDS
> Flash Cache
> Metadata Write-back Cache
• Improved Deployment
> Stability
> Version “smear”
> Resource Management
> HSM
• WAN
> Security
> Proxies
Broader Adoption
• Client platform support
> Windows
- CIFS via clustered Samba
- Native client
> Solaris
> OS X
Layered & direct pNFS
[Diagram, left – pNFS layered on Lustre clients: pNFS clients use the pNFS file layout to reach NFSD/pNFSD servers running the Lustre client file system over the global namespace, backed by the Lustre servers (MDS & OSS).
Diagram, right – pNFS and Lustre servers on a Lustre / DMU storage system: pNFS clients use the pNFS file layout to talk directly to an MD node and stripe nodes.]
ZFS
• Easier administration
> Pooled storage model
> No volume manager
> Snapshots
• Immense capacity
> 128-bit file system
• End-to-end data integrity
> Everything is checksummed
> Copy-on-write, transactional design
> MD block replication
> RAID-Z / mirroring
> Resilvering
ZFS: End-to-End Checksums
• Checksums are separated from the data
• The entire I/O path is self-validating (uber-block)
• Prevents:
> Silent data corruption
> Corrupted metadata
> Phantom writes
> Misdirected reads and writes
> DMA parity errors
> Errors from driver bugs
> Accidental overwrites
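A minimal sketch of the idea (illustrative only, not ZFS code): the checksum lives with the block pointer that references the data, so corruption anywhere along the I/O path is detected when the pointer is followed:

```python
# Sketch: the checksum is stored with the parent pointer, separated from the
# data block it describes.  Illustrative only; not ZFS's on-disk format.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class BlockPointer:
    def __init__(self, data: bytes):
        self.data = data               # stands in for the on-disk block
        self.cksum = checksum(data)    # held by the parent, not next to the data

    def read(self) -> bytes:
        if checksum(self.data) != self.cksum:
            raise IOError("checksum mismatch: corruption detected on read")
        return self.data

bp = BlockPointer(b"file contents")
bp.data = b"phantom write / bit rot"   # simulate silent corruption on the device
try:
    bp.read()
except IOError as err:
    print(err)                         # the bad block is rejected, not returned
```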
ZFS: Copy-on-Write / Transactional
[Diagram, four stages: (1) initial block tree under the uber-block; (2) new data is written as a copy of the changed blocks; (3) copy-on-write of the indirect blocks creates new pointers alongside the original pointers; (4) the uber-block is rewritten to point at the new tree.]
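A minimal sketch of the copy-on-write update (illustrative, not the DMU implementation): modified blocks are written to fresh locations, indirect blocks are copied bottom-up, and the change only becomes visible when the uber-block pointer is switched:

```python
# Sketch of a copy-on-write block tree: nothing is modified in place; the new
# tree shares unchanged subtrees with the old one and becomes live only when
# the top-level (uber-block) pointer is swapped.  Illustrative only.

class Node:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children or []

def cow_update(node, path, new_value):
    """Return a new root reflecting the change, leaving the old tree intact."""
    if not path:
        return Node(new_value, node.children)        # new copy of the data block
    i = path[0]
    new_children = list(node.children)               # copy the indirect block
    new_children[i] = cow_update(node.children[i], path[1:], new_value)
    return Node(node.value, new_children)            # new pointers, old siblings shared

old_uberblock = Node(children=[Node("A"), Node("B")])
new_uberblock = cow_update(old_uberblock, [1], "B'")  # the atomic switch point
print(old_uberblock.children[1].value, new_uberblock.children[1].value)   # B B'
```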
Request Visualisation
[Two slides of request-visualisation graphics, not reproduced here.]
Network Request Scheduler
• Today, requests are processed in FIFO order
> Only as fair as the network
> Over-reliance on the disk elevator
• NRS will re-order requests on arrival (sketched below)
> Enforce fairness
> Re-order before bulk buffers are assigned
> Work with the block allocator
• 2nd-generation NRS will coordinate servers
> QoS
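As a rough illustration of the fairness goal (a sketch only; it does not show the actual NRS policies), re-ordering arrivals so clients are served round-robin rather than strictly FIFO looks like this:

```python
# Sketch: serve clients round-robin instead of strict FIFO, so one client with
# a deep backlog cannot starve the others.  Policy and names are illustrative.
from collections import defaultdict, deque

arrivals = [("clientA", "req1"), ("clientA", "req2"), ("clientA", "req3"),
            ("clientB", "req1"), ("clientC", "req1")]

queues = defaultdict(deque)          # one queue per client, filled on arrival
for client, req in arrivals:
    queues[client].append(req)

service_order = []
while any(queues.values()):
    for client in list(queues):      # one request per client per round
        if queues[client]:
            service_order.append((client, queues[client].popleft()))

print(service_order)   # A, B and C each get a turn before A's backlog drains
```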
Clustered Metadata
[Diagram: clients, a load-balanced MDS server pool, and OSS server pools.]
• Previous scalability +
• Enlarge the MD pool: enhance NetBench/SpecFS and client scalability
• Limits: 100's of billions of files, M-ops / sec
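One way to picture the load balancing (a sketch under the assumption of hash-based placement; not necessarily Lustre's clustered-MDS policy) is to spread new entries over the MDS pool by hashing the parent directory and name:

```python
# Sketch: distribute metadata operations over an MDS pool by hashing the
# (parent, name) pair.  The placement policy here is an assumption for
# illustration, not Lustre's clustered-MDS algorithm.
import zlib

MDS_POOL = ["mds0", "mds1", "mds2", "mds3"]

def mds_for(parent_fid: int, name: str) -> str:
    key = f"{parent_fid:#x}/{name}".encode()
    return MDS_POOL[zlib.crc32(key) % len(MDS_POOL)]

for name in ("job.0001", "job.0002", "job.0003", "job.0004"):
    print(name, "->", mds_for(0x200000401, name))
```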
Flash Cache
• Exploit the storage hardware revolution
> Very high bandwidth available from flash
• Flash Cache OSTs
> Capacity: ~= RAM of the cluster
> Cost: a fraction of the cost of the cluster's RAM
Flash Cache interactions
[Diagram: a Flash Cache OSS in front of traditional OSSs.]
• READ: the OSS forces a Flash Cache flush first
• WRITE: all writes go to the Flash Cache
• DRAIN: data is put in its final position
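A minimal sketch of those three interactions (illustrative only; class and method names are assumptions): writes land on the flash tier, a read served from a traditional OSS forces the cached copy to be flushed first, and a background drain moves data to its final position:

```python
# Sketch of the READ / WRITE / DRAIN interactions described above.
# Purely illustrative; not Lustre's flash-cache implementation.

class FlashCacheTier:
    def __init__(self):
        self.flash = {}     # object -> data still held on Flash Cache OSTs
        self.disk = {}      # object -> data in its final position on a trad OSS

    def write(self, obj, data):
        self.flash[obj] = data              # WRITE: all writes go to flash

    def read(self, obj):
        if obj in self.flash:
            self._flush(obj)                # READ: OSS forces a flush first
        return self.disk[obj]

    def drain(self):
        for obj in list(self.flash):        # DRAIN: background move to disk
            self._flush(obj)

    def _flush(self, obj):
        self.disk[obj] = self.flash.pop(obj)

tier = FlashCacheTier()
tier.write("ckpt.000", b"checkpoint data")
print(tier.read("ckpt.000"))                # flushed to disk, then served
```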
Flash Cache
• Better cluster utilisation
> Compute/checkpoint cycle of 1 hour
> Cluster writes the checkpoint to flash in 10 minutes
> Flash drains to disk in 50 minutes
> Roughly 1/5th as many disk servers are needed, since the disks absorb the checkpoint over 50 minutes instead of 10
• Lustre manages file system coherency
Metadata WBC
• Goal & problem:
> Disk file systems make updates in memory
> Network file systems do not – metadata ops require RPCs
> The Lustre WBC should only require synchronous RPCs for cache misses
MD ops today
[Diagram: for each operation the client issues an RPC – mkdir sends MDS_MKDIR, create sends MDS_CREAT; the server's mdt_reint handler applies the update and replies, after which the VFS is updated.]
• 1 RPC per MD operation
• MD RPCs are serialized
MD ops with WBC
[Diagram: mkdir and create update the VFS on the client and are held in its cache; a later write-out sends a batched MDS_REINT* RPC, which the server's mdt_reint* handler applies as mdt_mkdir and mdt_create before replying.]
• Operations executed locally
• Batched RPC
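A minimal sketch of the difference (the RPC names and batching shape are assumptions): today each metadata operation costs one serialized round trip, whereas with the WBC the operations are applied locally and reintegrated in a single batched RPC:

```python
# Sketch: one synchronous RPC per metadata operation versus local execution
# followed by a single batched write-out.  Illustrative only.

rpc_count = 0

def rpc(ops):
    """Stand-in for an MDS round trip carrying one or more updates."""
    global rpc_count
    rpc_count += 1
    return [f"server applied {op}" for op in ops]

ops = ["mkdir d", "create d/f1", "create d/f2"]

# Today: each operation is its own serialized round trip.
for op in ops:
    rpc([op])
print("without WBC:", rpc_count, "RPCs")

# With the WBC: operations update the client cache, then ship together.
rpc_count = 0
cache = list(ops)        # executed locally against the client's cached namespace
rpc(cache)               # single batched reintegration RPC
print("with WBC:", rpc_count, "RPC")
```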
WBC Benefits
• Dramatically improved MD performance
> More efficient network usage
– Batched transfer
– Latency hiding
– WAN
> More concurrency on the client
> More efficient execution on the server
• Disconnected-mode operation
WBC Challenges
• How to cache
> Log / cumulative changes
• Dependent operations
> Cross-directory rename
> Creation of name and file
• Recovery
• Efficient resource leasing is critical
> Allocation grants
> Subtree locks
• Security
Metadata WBC - epochs
Metadata WBC
• Key elements of the design
> Clients can determine file identifiers for new files (sketched below)
> MD updates cached on the client
> Reintegration in parallel on clustered MD servers
> Sub-tree locks – enlarge lock granularity
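For the first element, a minimal sketch (the field names and the granting step are assumptions): a client holding a pre-granted identifier range can hand out FIDs for new files locally, so cached creates never wait on the server:

```python
# Sketch: a client with a pre-granted identifier range assigns FIDs to new
# files without a round trip per create.  The (sequence, oid) shape and the
# granting protocol are assumptions for illustration.

class FidRange:
    def __init__(self, sequence, width=1 << 20):
        self.sequence = sequence      # granted once by the metadata server
        self.next_oid = 1
        self.limit = width

    def new_fid(self):
        if self.next_oid >= self.limit:
            raise RuntimeError("range exhausted; request another from the MDS")
        fid = (self.sequence, self.next_oid)
        self.next_oid += 1
        return fid                    # usable immediately in cached MD updates

client_range = FidRange(sequence=0x200000401)
print(client_range.new_fid(), client_range.new_fid())
```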
Uses of the WBC
• HPC
> With I/O forwarding, Lustre clients act as I/O call servers
> These servers can run on WBC clients
• Exa-scale clusters
> WBC enables last-minute resource allocation
• WAN Lustre
> Eliminate wide-area latency for updates
• HPCS
> Dramatically increase small-file performance
Lustre with I/O forwarding
[Diagram: forwarding clients connect to FW server / Lustre client nodes, which in turn talk to the Lustre servers.]
• FW servers should be Lustre WBC-enabled clients
Migration – many uses
• Between ext3 / ZFS servers
• For space rebalancing
• To empty servers and replace them
• In conjunction with HSM
• To manage caches & replicas
Migration
[Diagram: a coordinator drives data-moving agents that migrate between MDS pool A / OSS pool A and MDS pool B / OSS pool B, presented as a virtual migrating MDS pool and a virtual migrating OSS pool.]
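A minimal sketch of the coordinator/agent split (illustrative only; it does not reflect Lustre's migration or HSM interfaces): the coordinator fans per-file jobs out to data-moving agents, which copy data from the source pool to the target pool:

```python
# Sketch: a coordinator dispatches per-file migration jobs to data-moving
# agents, which copy data from pool A to pool B.  Names are illustrative.
from concurrent.futures import ThreadPoolExecutor

pool_a = {"file1": b"aaa", "file2": b"bbb", "file3": b"ccc"}   # source OSS pool
pool_b = {}                                                     # target OSS pool

def agent_copy(name):
    pool_b[name] = pool_a[name]       # one agent moves one file's objects
    return name

with ThreadPoolExecutor(max_workers=2) as agents:   # the coordinator fans out work
    for done in agents.map(agent_copy, list(pool_a)):
        print("migrated", done)

print("in sync:", pool_a == pool_b)
```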
General purpose replication
• Driven by major content distribution networks
> DoD, ISPs
> Keep multi-petabyte file systems in sync
• Implementing scalable synchronization (sketched below)
> Changelog based
> Works on live file systems
> No scanning, immediate resume, parallel
• Many other applications
> Search, basic server network striping
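A minimal sketch of changelog-driven synchronization (the record format and apply logic are assumptions): the replica replays recorded namespace changes instead of scanning the tree, and persisting the last applied index is what allows immediate resume:

```python
# Sketch: replay a changelog on the replica instead of scanning the live
# file system.  The record format here is an assumption for illustration.

changelog = [
    (1, "CREAT", "/proj/a.dat"),
    (2, "MKDIR", "/proj/results"),
    (3, "RENAME", ("/proj/a.dat", "/proj/results/a.dat")),
]

replica = set()
last_applied = 0

def apply(op, arg):
    if op in ("CREAT", "MKDIR"):
        replica.add(arg)
    elif op == "RENAME":
        old, new = arg
        replica.discard(old)
        replica.add(new)

for index, op, arg in changelog:
    apply(op, arg)
    last_applied = index      # persisting this is what allows immediate resume

print(sorted(replica), "| resume after record", last_applied)
```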
Caches/Proxies
• Many variants
> HSM – a Lustre cluster is a proxy cache for 3rd-tier storage
> Collaborative read cache
– BitTorrent-style reading, or
– when concurrency increases, use other OSSs as proxies
> Wide-area cache – repeated reads come from the cache
• Technical elements
> Migrate data between storage pools
> Re-validate cached data with versions
> Hierarchical management of consistency
Collaborative cache
• Scenario: all clients read one file
> Initial reads come from the primary OSS
> With increasing load, reads are redirected
> Other OSSs act as caches
[Diagram: Lustre clients reading from a row of OSSs.]
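A minimal sketch of the redirection (the threshold and placement are illustrative assumptions): the first readers hit the primary OSS, and once concurrency passes a threshold further readers are spread over the caching OSSs:

```python
# Sketch: redirect readers to caching OSSs once the primary is busy enough.
# The threshold and the round-robin spread are illustrative assumptions.

PRIMARY_OSS = "oss0"
CACHE_OSSES = ["oss1", "oss2", "oss3"]
REDIRECT_THRESHOLD = 2        # concurrent readers served by the primary alone

def pick_server(reader_index):
    if reader_index < REDIRECT_THRESHOLD:
        return PRIMARY_OSS
    return CACHE_OSSES[reader_index % len(CACHE_OSSES)]

for reader in range(6):
    print(f"client{reader} reads from {pick_server(reader)}")
```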
Proxy clusters
[Diagram: remote clients are served by proxy Lustre servers, which sit in front of the master Lustre servers and their local clients – giving local performance after the first read.]
Summary
• Lustre is high-performance and scalable enough to meet the HPC petascale demands of today
• Sun is leveraging Lustre and other open source technologies to deliver a comprehensive Linux HPC solution
• We will continue to push the limits of scalability and performance with Lustre to address the exascale demands of the coming decade
> Trillions of files
> TB/s throughput
> Millions of clients
Eric Barton
eeb@sun.com