Dilger Lustre HPCS May Workshop


Transcript of Dilger Lustre HPCS May Workshop


    Andreas Dilger, Senior Staff Engineer, Lustre Group, Sun Microsystems

    Lustre HPCS Design Overview

    Scalability Workshop ORNL May 2009


    Topics

    HPCS Goals
    HPCS Architectural Improvements
    Performance Enhancements
    Conclusion


    HPC Center of the Future

    [Diagram: a shared storage network connects the compute systems (Capability: 500,000 nodes; Capacity 1: 250,000 nodes; Capacity 2: 150,000 nodes; Capacity 3: 50,000 nodes; Test: 25,000 nodes; Viz1; Viz2; WAN access) at 10 TB/sec to User Data (1000's of OSTs), Metadata (100's of MDTs), and an HPSS archive.]


    HPCS Goals - Capacity

    Filesystem Limits
        100 PB+ maximum file system size (10 PB)
        1 trillion files (10^12) per file system (4 billion files)
        > 30k client nodes (> 30k clients)

    Single File/Directory Limits
        10 billion files per directory (15M)
        0 to 1 PB file size range (360 TB)
        1 B to 1 GB I/O request size range (1 B - 1 GB I/O req)
        100,000 open shared files per process (10,000 files)
        Long file names and 0-length files as data records (long file names and 0-length files)


    HPCS Goals - Performance

    Aggregate
        10,000 metadata operations per second (10,000/s)
        1500 GB/s file-per-process / shared-file I/O (180 GB/s)
        No performance impact from RAID rebuilds (variable)

    Single Client
        40,000 creates/s, up to 64 kB data (5k/s, 0 kB)
        30 GB/s full-duplex streaming I/O (2 GB/s)

    Miscellaneous
        POSIX I/O API extensions proposed at OpenGroup* (partial)

    * High End Computing Extensions Working Group: http://www.opengroup.org/platform/hecewg/


    HPCS Goals - Reliability

    End-to-end resiliency, T10 DIF equivalent (net only)
    No performance impact from RAID rebuild (variable)
    Uptime of 99.99% (99%?)


    HPCS Architectural Improvements


    Use Available ZFS Functionality
    End-to-End Data Integrity
    RAID Rebuild Performance Impact
    Filesystem Integrity Checking
    Clustered Metadata
    Recovery Improvements
    Performance Enhancements


    Use Available ZFS Functionality


    Capacity
        Single filesystem 100 TB+ (2^64 LUNs * 2^64 bytes)
        Trillions of files in a single file system (2^48 files)
        Dynamic addition of capacity

    Reliability and resilience
        Transaction based, copy-on-write
        Internal data redundancy (double parity, 3 copies)
        End-to-end checksum of all data/metadata
        Online integrity verification and reconstruction

    Functionality
        Snapshots, filesets, compression, encryption
        Online incremental backup/restore
        Hybrid storage pools (HDD + SSD)


    End-to-End Data Integrity


    Current Lustre checksumming
        Detects data corruption over the network
        Ext3/4 does not checksum data on disk

    ZFS stores data/metadata checksums
        Fast (Fletcher-4 default, or none; see the sketch below)
        Strong (SHA-256)

    HPCS Integration
        Integrate Lustre and ZFS checksums
        Avoid recomputing the full checksum of data
        Always overlap checksum coverage
        Use a scalable tree hash method
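    To illustrate the "fast" checksum class mentioned above, here is a minimal Python sketch of a Fletcher-4-style checksum over 32-bit little-endian words (four running 64-bit sums). This is an illustrative sketch only, not the ZFS implementation; the zero-padding of short buffers is an assumption.

    import struct

    def fletcher4(data: bytes):
        # Fletcher-4-style checksum: four running sums over 32-bit
        # little-endian words, truncated to 64 bits (illustrative only).
        if len(data) % 4:                      # pad short buffers (assumption)
            data += b"\0" * (4 - len(data) % 4)
        mask = (1 << 64) - 1
        a = b = c = d = 0
        for (word,) in struct.iter_unpack("<I", data):
            a = (a + word) & mask              # sum of words
            b = (b + a) & mask                 # sum of sums
            c = (c + b) & mask
            d = (d + c) & mask
        return (a, b, c, d)

    print(fletcher4(b"lustre end-to-end data integrity"))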


    Hash Tree and Multiple Block Sizes


    [Diagram: a hash tree covering data segments at multiple block sizes, from 4 kB / 8 kB (min/max page size) through 16 kB, 32 kB, 64 kB, 128 kB, 256 kB, 512 kB, and 1024 kB (ZFS block size and Lustre RPC size); leaf hashes cover data segments, interior hashes combine leaf hashes, and a single root hash covers the whole tree.]


    Hash Tree For Non-contiguous Data


    LH(x) = hash of data x (leaf); IH(x) = hash of data x (interior); x + y = concatenation of x and y

    C0 = LH(data segment 0)
    C1 = LH(data segment 1)
    C4 = LH(data segment 4)
    C5 = LH(data segment 5)
    C7 = LH(data segment 7)

    C01 = IH(C0 + C1)
    C45 = IH(C4 + C5)
    C457 = IH(C45 + C7)

    K = IH(C01 + C457) = ROOT hash
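    A minimal Python sketch of the construction above: leaf hashes over whichever data segments are present are folded pairwise into interior hashes up to a single root, so non-contiguous segments (0, 1, 4, 5, 7 in the example) still yield one root hash K. SHA-256 and the leaf/node prefixes are illustrative choices, not the hash specified by the design.

    import hashlib

    def LH(segment: bytes) -> bytes:
        # Leaf hash of one data segment (SHA-256 chosen for illustration).
        return hashlib.sha256(b"leaf:" + segment).digest()

    def IH(*children: bytes) -> bytes:
        # Interior hash over the concatenation of child hashes.
        return hashlib.sha256(b"node:" + b"".join(children)).digest()

    def root_hash(segments: dict) -> bytes:
        # segments maps segment index -> data; indices may be sparse.
        level = [LH(segments[i]) for i in sorted(segments)]
        while len(level) > 1:
            nxt = [IH(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
            if len(level) % 2:             # odd leftover folds into the last interior node
                nxt[-1] = IH(nxt[-1], level[-1])
            level = nxt
        return level[0]

    # Segments 0, 1, 4, 5, 7 as above: K = IH(IH(C0+C1) + IH(IH(C4+C5) + C7))
    K = root_hash({0: b"d0", 1: b"d1", 4: b"d4", 5: b"d5", 7: b"d7"})
    print(K.hex())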


    End-to-End Integrity Client Write


    End-to-End Integrity Server Write


    HPC Center of the Future

    [Diagram: the HPC center shared storage network shown earlier, now sizing the user data storage: 1224 96 TB OSTs (3 * 32 TB LUNs each), 36720 4 TB disks in RAID-6 8+2, 115 PB of data.]


    RAID Failure Rates: 115 PB filesystem, 1.5 TB/s

    36720 4 TB disks in RAID-6 8+2 LUNs, 4 year Mean Time To Failure
        1 disk fails every hour on average
        4 TB disk @ 30 MB/s: ~38 hr rebuild
        30 MB/s is 50%+ of disk bandwidth (seeks)
        May reduce aggregate throughput by 50%+

    1 disk failure may cost 750 GB/s aggregate
        38 hr * 1 disk/hr = 38 disks/OSTs degraded at any time (see the arithmetic below)
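    The failure arithmetic above in a few lines of Python, using the slide's numbers (the 4 TB = 4e6 MB conversion is an approximation, which is why it lands near the slide's 38-hour figure):

    disks       = 36720            # 4 TB disks in RAID-6 8+2 LUNs
    mttf_hours  = 4 * 365 * 24     # 4-year mean time to failure per disk
    rebuild_mbs = 30               # per-disk rebuild rate, MB/s

    failures_per_hour = disks / mttf_hours                  # ~1 disk/hour
    rebuild_hours     = 4e6 / rebuild_mbs / 3600            # ~37-38 hours per 4 TB disk
    degraded_osts     = failures_per_hour * rebuild_hours   # ~38 OSTs degraded at any time

    print(f"{failures_per_hour:.2f} failures/hour, {rebuild_hours:.0f} h rebuild, "
          f"~{degraded_osts:.0f} OSTs degraded, {degraded_osts / 1224:.1%} of 1224 OSTs")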


    RAID Failure Rates

    [Chart: Failure Rates vs. MTTF for 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 30 MB/s rebuild; curves for disks failed during rebuild, percent of OSTs degraded, days to OST double failure, and years to OST triple failure, plotted against MTTF (years).]


    [Chart: Failure Rates vs. RAID disks for 36720 4 TB disks, RAID-6 N+2, 1224 OSTs, 4 year MTTF; curves for disks failed during rebuild, percent of OSTs degraded, days to OST double failure, and years to OST triple failure, plotted against disks per RAID-6 stripe.]


    [Chart: Failure Rates vs. Rebuild Speed for 36720 4 TB disks, RAID-6 8+2, 1224 OSTs, 4 year MTTF; curves for disks failed during rebuild, percent of OSTs degraded, days to OST double failure, and years to OST triple failure, plotted against rebuild speed (MB/s).]


    Lustre-level Rebuild Mitigation


    Several related problems
        Avoid global impact from a degraded RAID set
        Avoid load on the rebuilding RAID set

    Avoid degraded OSTs for new files (see the sketch below)
        Little or no load on the degraded RAID set
        Maximizes rebuild performance
        Minimal global performance impact
        30 disks (3 LUNs) per OST, 1224 OSTs
        38 of 1224 OSTs = 3% aggregate cost
        Degraded OSTs remain available for existing files
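    A minimal sketch of the allocation policy described above, assuming the OST allocator knows which OSTs sit on a rebuilding RAID set and prefers healthy OSTs when striping new files; the names and the random selection are hypothetical, not Lustre's actual allocator.

    import random
    from dataclasses import dataclass

    @dataclass
    class Ost:
        index: int
        degraded: bool = False      # backing RAID set is currently rebuilding

    def choose_stripe_osts(osts, stripe_count):
        # Prefer healthy OSTs for new files; fall back to the full pool
        # only when there are not enough healthy ones (hypothetical policy).
        healthy = [o for o in osts if not o.degraded]
        pool = healthy if len(healthy) >= stripe_count else list(osts)
        return random.sample(pool, stripe_count)

    osts = [Ost(i, degraded=(i in {7, 42})) for i in range(1224)]
    print([o.index for o in choose_stripe_osts(osts, stripe_count=4)])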


    ZFS-level RAID-Z Rebuild

    RAID-Z/Z2 is not the same as RAID-5/6
        NEVER does read-modify-write
        Supports arbitrary block size/alignment
        RAID layout is stored in the block pointer

    ZFS metadata traversal for RAID rebuild
        Good: only rebuilds used storage (…)


    RAID-Z Rebuild Improvements

    RAID-Z optimized rebuild
        ~3% of storage is metadata
        Scan metadata first, build an ordered list (see the sketch below)
        Data reads are mostly linear
        Bookmark to restart rebuild
        ZFS itself is not tied to RAID-Z

    Distributed hot space
        Spread hot-spare rebuild space over all disks
        All disks' bandwidth/IOPS for normal I/O
        All disks' bandwidth/IOPS for rebuilding
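    A rough sketch of the rebuild ordering described above: scan the (small) metadata first to learn which blocks are allocated, then rebuild them in mostly linear disk order, keeping a bookmark so an interrupted rebuild can resume. The pool interface here (blkptr_offsets, rebuild_block) is a hypothetical stand-in, not a real ZFS API.

    def rebuild_raidz(pool, bookmark=-1):
        # Pass 1: traverse pool metadata (~3% of storage) and collect the
        # on-disk offsets of all allocated blocks into an ordered list.
        offsets = sorted(pool.blkptr_offsets())

        # Pass 2: rebuild allocated blocks in linear disk order, advancing
        # a bookmark so a restarted rebuild skips work already done.
        for off in offsets:
            if off <= bookmark:
                continue
            pool.rebuild_block(off)
            bookmark = off
        return bookmark

    class FakePool:
        # In-memory stand-in: a few allocated block offsets, for illustration.
        def __init__(self, offsets):
            self._offsets = offsets
            self.rebuilt = []
        def blkptr_offsets(self):
            return self._offsets
        def rebuild_block(self, off):
            self.rebuilt.append(off)

    pool = FakePool([4096, 0, 65536, 8192])
    print(rebuild_raidz(pool), pool.rebuilt)   # rebuilds in order: 0, 4096, 8192, 65536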


    HPC Center of the Future

    [Diagram: the HPC center shared storage network shown earlier, now sizing the metadata storage: 128 MDTs on 16 TB LUNs, 5120 800 GB SAS/SSD devices in RAID-1, 2 PB of metadata capacity.]


    Clustered Metadata


    100s of metadata servers

    Distributed inodes
        Files are normally local to the parent directory
        Subdirectories are often non-local

    Split directories (see the sketch after this list)
        Split directory: by name hash
        Striped file: by offset

    Distributed Operation Recovery
        Cross-directory mkdir, rename, link, unlink
        Strictly ordered distributed updates
        Ensures namespace coherency and recovery
        At worst, an inode refcount is too high (inode leaked)
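    A minimal sketch of the "split directory by name hash" idea: a directory split across several MDTs places each entry on the stripe chosen by hashing the entry name. The hash function and stripe count here are illustrative, not Lustre's actual implementation.

    import hashlib

    def mdt_stripe_for_name(name: str, stripe_count: int) -> int:
        # Hash the entry name and map it onto one of the directory's
        # MDT stripes (hash choice is illustrative).
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest[:8], "big") % stripe_count

    # A directory split over 4 MDTs: each entry name lands on one stripe.
    for name in ("job.0001.out", "job.0002.out", "checkpoint.h5"):
        print(name, "-> MDT stripe", mdt_stripe_for_name(name, 4))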


    Filesystem Integrity Checking


    The problem is among the hardest to solve
        1 trillion files checked in 100 hours
        2-4 PB of MDT filesystem metadata

    ~3 million files/sec, 3 GB/s+ for one pass (see the arithmetic below)
        3M * stripes checks/sec from MDSes to OSSes
        860 * stripes random metadata IOPS on the OSTs

    Need to handle CMD coherency as well
        Link count on files and directories
        Directory parent/child relationship
        Filename to FID to inode mapping
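    The rate arithmetic behind "1 trillion files in 100 hours", as a quick sketch; the stripe count below is an assumed illustrative value, not part of the original slide.

    files    = 1e12              # 1 trillion files
    window_s = 100 * 3600        # 100-hour check window
    stripes  = 4                 # assumed average stripe count (illustrative)

    files_per_s = files / window_s                               # ~2.8 million files/second
    print(f"{files_per_s / 1e6:.1f}M files/s checked on the MDTs")
    print(f"~{files_per_s * stripes / 1e6:.0f}M object checks/s from MDSes to OSSes")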


    Recovery Improvements

    Version Based Recovery
        Independent recovery stream per file
        Isolates the recovery domain to dependent operations

    Commit on Share
        Avoid the client getting any dependent state
        Avoid sync for single-client operations
        Avoid sync for independent operations


    Imperative Recovery

    Server-driven notification of failover
        Server notifies clients that failover has completed
        Clients reply immediately to the server

    Avoid clients waiting on RPC timeouts
    Avoid the server waiting for dead clients

    Can distinguish a slow server from a dead one
        No waiting for an RPC timeout to start recovery
        Can use external or internal notification


    Performance Enhancements

    SMP Scalability
    Network Request Scheduler
    Channel Bonding


    SMP Scalability


    Future nodes will have 100s of cores
        Need excellent SMP scaling on clients and servers
        Need to handle NUMA imbalances

    Remove contention on servers
        Per-CPU resources (queues, locks)
        Fine-grained locking
        Avoid cross-node memory access

    Bind requests to a specific CPU deterministically (see the sketch below)
        By client NID, object ID, parent directory

    Remove contention on clients
        Parallel copy_{to,from}_user, checksums
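    A minimal sketch of deterministic CPU binding as described above: hash the client NID and object ID to pick a CPU partition, so requests for the same client/object always land on the same CPU and its per-CPU queues and locks. The CRC-based hash and the example NID are illustrative, not Lustre's actual code.

    import zlib

    def cpu_partition(client_nid: str, object_id: int, ncpus: int) -> int:
        # Deterministic mapping: the same client NID + object ID always
        # binds to the same CPU partition, avoiding cross-CPU contention.
        key = f"{client_nid}:{object_id:x}".encode()
        return zlib.crc32(key) % ncpus

    print(cpu_partition("192.168.1.10@o2ib", 0x2001, ncpus=16))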


    Network Request Scheduler


    Much larger working set than a disk elevator
        Higher-level information: client NID, file/offset

    Read and write queue for each object on the server (see the sketch below)
        Requests sorted in object queues by offset
        Queues serviced round-robin, by operation count (variable)
        Deadline for request service time
        Scheduling input: opcount, offset, fairness, delay

    Future enhancements
        Job ID, process rank
        Gang scheduling across servers
        Quality of service: per UID/GID/cluster min/max bandwidth
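    A toy sketch of the queuing structure described above: one offset-sorted queue per data object, serviced round-robin a fixed number of requests at a time. Deadlines and the fairness/delay inputs from the slide are omitted for brevity, and all names here are hypothetical.

    import heapq
    from collections import defaultdict, deque

    class ObjectQueues:
        def __init__(self, batch=2):
            self.batch  = batch             # requests served per object per turn
            self.queues = defaultdict(list) # object_id -> heap of (offset, request)
            self.rr     = deque()           # round-robin order of active objects

        def enqueue(self, object_id, offset, request):
            if not self.queues[object_id]:  # object becomes active
                self.rr.append(object_id)
            heapq.heappush(self.queues[object_id], (offset, request))

        def next_batch(self):
            # Serve up to `batch` lowest-offset requests from the next object,
            # then rotate it to the back of the round-robin if work remains.
            if not self.rr:
                return []
            object_id = self.rr.popleft()
            q = self.queues[object_id]
            served = [heapq.heappop(q)[1] for _ in range(min(self.batch, len(q)))]
            if q:
                self.rr.append(object_id)
            return served

    nrs = ObjectQueues(batch=2)
    nrs.enqueue("obj_a", 1 << 20, "read  obj_a @ 1MB")
    nrs.enqueue("obj_a", 0,       "read  obj_a @ 0")
    nrs.enqueue("obj_b", 4096,    "write obj_b @ 4kB")
    print(nrs.next_batch())   # obj_a requests, in offset order
    print(nrs.next_batch())   # then obj_b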



    Network Request Scheduler

    [Diagram: Network Request Scheduler queues holding requests: a global FIFO queue (Q), per-data-object queues (q1, q2, q3, ..., qn), and the work queue (WQ).]


    HPC Center of the Future

    [Diagram: the HPC center shared storage network shown earlier (compute clusters, Viz, WAN access, 10 TB/sec aggregate, OSTs, MDTs, HPSS archive), repeated as context for channel bonding.]


    Channel Bonding


    Combine multiple network interfaces
        Increased performance: shared use balances load across all interfaces
        Improved reliability: failover uses backup links if the primary is down
        Flexible configuration
            Network interfaces of different types/speeds
            Peers do not need to share all networks
            Configuration server per network


    Conclusion


    Lustre can meet the HPCS filesystem goals
        Scalability roadmap is reasonable
        Incremental, orthogonal improvements

    HPCS provides the impetus to grow Lustre
        Accelerates the development roadmap
        Ensures that Lustre will meet future needs


    Questions?


    HPCS Filesystem Overview online at: http://wiki.lustre.org/index.php/Learn:Lustre_Publications


    THANK YOU


    Andreas Dilger, Senior Staff Engineer, Lustre Group, Sun Microsystems


    T10-DIF vs. Hash Tree

    T10-DIF
        CRC-16 Guard Word
            All 1-bit errors
            All adjacent 2-bit errors
            Single 16-bit burst error
            10^-5 bit error rate
        32-bit Reference Tag
            Misplaced write != 2n TB
            Misplaced read != 2n TB

    Hash Tree
        Fletcher-4 Checksum
            All 1-, 2-, 3-, 4-bit errors
            All errors affecting 4 or fewer 32-bit words
            Single 128-bit burst error
            10^-13 bit error rate
        Hash tree also detects
            Misplaced read
            Misplaced write
            Phantom write
            Bad RAID reconstruction


    Metadata Improvements

    Metadata Writeback Cache
        Avoids unnecessary server communication
            Operations logged/cached locally
            Performance of a local file system when uncontended
        Aggregated distributed operations
            Server updates batched and transferred using bulk protocols (RDMA)
            Reduced network and service overhead

    Sub-Tree Locking
        Lock aggregation: a single lock protects a whole subtree
        Reduces lock traffic and server load


    Metadata Improvements

    Metadata Protocol

    Size on MDT (SOM)
        Avoid multiple RPCs for attributes derived from OSTs
        OSTs remain definitive while the file is open
        Compute on close and cache on the MDT

    Readdir+
        Aggregation of directory I/O, getattrs, and locking


    LNET SMP Server Scaling


    [Charts: RPC throughput vs. total client processes for several client node counts, comparing a single lock with finer-grained locking.]


    Communication Improvements

    Flat communications model
        Stateful client/server connections required for coherence and performance
        Every client connects to every server
        O(n) lock conflict resolution

    Hierarchical communications model
        Aggregate connections, locking, I/O, and metadata operations
        Caching clients
            Lustre system calls
            Aggregate local processes (cores)
            I/O forwarders scale another 32x or more
        Caching proxies
            Lustre to Lustre
            Aggregate whole clusters
            Implicit broadcast: scalable conflict resolution


    Architectural Improvements

    Scalable Health Network
        Burden of monitoring clients is distributed, not replicated
        Fault-tolerant status reduction/broadcast network
            Servers and LNET routers
        LNET high-priority small message support
            Health network stays responsive
        Prompt, reliable detection
            Time constants in seconds
            Failed servers, clients, and routers
            Recovering servers and routers

    Interface with existing RAS infrastructure
        Receive and deliver status notifications


    [Diagram: a health monitoring network connecting clients to a primary health monitor, with a failover health monitor.]


    Operations support

    Lustre HSM
        Interface from Lustre to hierarchical storage
            Initially HPSS
            SAM/QFS soon afterward

    Tiered storage
        Combine HSM support with ZFS's SSD support and a policy manager to provide tiered storage management