Building Storage for Clouds (ONUG Spring 2015)

Building Storage for Open and Private Clouds

Transcript of Building Storage for Clouds (ONUG Spring 2015)

1. Building Storage for Open and Private Clouds

2. Your Not-So-Humble Speaker
- 30 years of consulting and writing for the trade press
- Columnist/blogger at NetworkComputing.com
- Chief Scientist, DeepStorage, LLC — an independent test lab and analyst firm
- @DeepStorageNet on Twitter, [email protected]

3. Our Agenda: Welcome to Storage
- A cloud needs storage
- An introduction to shared storage: block, file, object, and storage networking
- RAID for resiliency
- Storage system architectures and media
- Replication and erasure codes
- Beyond just storage, the services: snapshots, replication, and data reduction

4. Agenda 2: Joining Storage to Cloud
- Integrating storage and hypervisor: vSphere VAAI, VVols, etc.
- Managing storage for OpenStack: Swift for objects and Cinder for block
- The future of storage: storage QoS and managing the noisy neighbor
- The rise of hyper-converged infrastructures
- Storage analytics: getting smart about storage

5. A Cloud Needs Storage
- Storage for cloud compute instances: ephemeral storage, persistent block storage (e.g., EBS), file services
- Tier 2 bulk storage, typically object storage

6. Introducing Shared Storage
- Shared storage provides flexibility, centralized caching, and data resiliency
- Evolving services: snapshots, replication, data reduction
[Diagram: disk addressing (head, track, sector; analog data) and multiple servers attached to shared storage]
7. Storage Area Networks
- AKA block or SAN storage
- The array abstracts storage into virtual SCSI disks (LUNs)
- Transport via Fibre Channel, IP (iSCSI), or Ethernet (FCoE)
[Diagram: servers attached to a SAN]

8. Fibre Channel
- A network designed by storage guys
- Pre-allocates buffer credits for transit
- Inter-switch fabric; layer-2 name servers in the switches
- Interoperability issues are largely in the past
- Encapsulates SCSI in FCP; Gen 5 = 16 Gbps
- Best tools: view by target/LUN, multipath aware
[Diagram: dual fabrics (A and B) connecting servers to storage]

9. "I'll give up Fibre Channel when you pry it from my cold, dead hands."

10. FCoE and Convergence
- Promises FC performance and management at Ethernet prices
- Encapsulates FCP in Ethernet (not IP)
- Relies on Data Center Bridging to manage packet loss
- Primarily an edge technology; standard with Cisco UCS

11. iSCSI
- Encapsulates SCSI commands in IP, usually over Ethernet
- Usually via a software initiator; can be routed
- Relegated to the SMB market on GigE, but plenty of performance over 10 Gig
- DCB flow control recommended
12. File Storage
- AKA NAS (network-attached storage)
- Abstracts file systems; IP/Ethernet transport
- Locking for better sharing
- NAS has more context on the data, enabling selective services
- Required for VDI/DaaS
[Diagram: NAS filer with disk shelves]

13. Multipath IO
- SAN protocols expose paths: the hypervisor/OS can fail over, round-robin, or (with a vendor plugin) direct I/O to the owning controller
- NAS must rely on the network: NIC teaming, LACP
- MPIO included in protocol updates: NFS 4, SMB 3

14. Block vs. File
- Block: cleaner data path for fastest access; a block in a LUN is just an offset; the storage team owns FC end to end; clustered file system limitations; LUN connection limits
- File (and block on file): the file system adds overhead; uses standard Ethernet; lower cost; network team involvement; better locking and sharing

15. Object Storage
- Greater scale than file systems: billions of objects, petabytes
- HTTP Get/Put semantics; APIs standardizing on S3 and Swift (see the Get/Put sketch below)
- Objects are generally immutable — an overwrite creates a new version
- Requires application support
[Diagram: application doing Get/Put against an object store]

16. File Systems and Object Stores
- File system limits: disk capacity (16-100 TB), path length (255 characters), number of files, metadata
- File system syntax: Open(file), Lock(2343,100), Write(2343, "hello"), Close(file)
- Object store: store/retrieve a file by URI/URL; usually has extended metadata (retention, protection policy); no practical limits on path depth, files per folder, or total files

17. Old School RAID
- Data striped as N data plus Y parity strips
- Issues: small writes require a read and rewrite; larger drives stress rebuild times; limited data integrity assurance
[Diagram: a 4 KB data block within a 64 KB RAID strip]

18. Chunklet RAID
- Data is still striped in small (x KB) chunks across N+P drives
- Stripes are distributed across all drives in the pool
- Distributes load and spares; rebuilds run N-to-N, not N-to-1
[Diagram: parity chunklets spread across the drive pool]

19. Storage Architectures

20. Monolithic Storage Systems
- Four or more tightly clustered controllers
- Proprietary shared-memory interconnects
- Very high RAS: Reliability (seven 9s), Availability, Serviceability
- The old guard

21. Modular
- Two controllers: Active-Active, Asymmetric Logical Unit Access (ALUA), or Active-Passive failover
- Add capacity via JBOD
- Bottlenecks at the controllers; requires cache coherency
[Diagram: dual-controller array with disk shelves]
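As an illustration of the Get/Put semantics mentioned in slide 15, here is a minimal sketch of writing and reading an object over the S3 API with the boto3 Python library. The endpoint URL, bucket name, object key, and credentials are hypothetical placeholders, not anything from the deck.

```python
# Illustrative only: put and get an object over the S3 API with boto3.
# Endpoint, bucket, key, and credentials below are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.com",  # any S3-compatible store
    aws_access_key_id="DEMO_KEY",
    aws_secret_access_key="DEMO_SECRET",
)

# PUT: objects are written whole; an overwrite creates a new version
# (if versioning is enabled) rather than modifying data in place.
s3.put_object(Bucket="backups", Key="vm-images/web01.qcow2", Body=b"...image bytes...")

# GET: retrieve the whole object (or a byte range) over HTTP.
obj = s3.get_object(Bucket="backups", Key="vm-images/web01.qcow2")
data = obj["Body"].read()
print(len(data), "bytes retrieved")
```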
22. Shared-Nothing Scale-Out
- A cluster of nodes acts as one storage system
- Each node provides CPU/cache and storage; data is distributed across nodes
- Protected against node failure
- Lower storage efficiency; tune for performance or cost
- E.g., Isilon, SolidFire, Ceph, HDFS
[Diagram: cluster of commodity server nodes]

23. Resiliency in Scale-Out Storage
- N-way replication: low CPU overhead, but typically 200% capacity overhead
- Erasure coding: Reed-Solomon or similar maths — n data chunks where any y can resolve the data
- Commonly 10 of 16 chunks are needed; survive 6 failures with roughly 35% overhead
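The core idea behind the parity and erasure-code protection in slides 17 and 23 can be shown with single parity: XOR all the data chunks to make a parity chunk, and XOR the survivors to rebuild any one lost chunk. This is a minimal sketch of that idea only; real erasure codes (Reed-Solomon) generalize it to survive multiple simultaneous losses.

```python
# A minimal sketch of single-parity protection, the idea RAID 5 and
# erasure codes build on: XOR of all data chunks gives a parity chunk,
# and XOR of the survivors rebuilds any one lost chunk.
from functools import reduce

def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data chunks
parity = xor_chunks(data)            # one parity chunk

# Simulate losing chunk 1, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_chunks(survivors)
assert rebuilt == data[1]
print("rebuilt chunk:", rebuilt)
```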
24. Dispersal Coding
- Send erasure-coded data across multiple data centers
- With 10-of-16 coding, roughly 5 chunks go to each of 3 data centers
- Survive the loss of a data center without data loss — but with WAN latency, since every read fetches chunks across the WAN
- Many systems combine erasure codes and replication
[Diagram: erasure-coded chunks dispersed across multiple data centers]
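A quick sanity check of the dispersal math in slide 24, assuming the 16 chunks are spread as evenly as possible across 3 sites (6/5/5) — an assumed layout, since the slide says roughly 5 per site. Losing any single data center still leaves at least the 10 chunks needed to recover the data.

```python
# Toy check: 16 chunks, any 10 recover the data, spread across 3 sites.
CHUNKS_TOTAL = 16
CHUNKS_NEEDED = 10
DATA_CENTERS = 3

# Spread 16 chunks as evenly as possible across 3 sites: 6, 5, 5.
per_site = [CHUNKS_TOTAL // DATA_CENTERS + (1 if i < CHUNKS_TOTAL % DATA_CENTERS else 0)
            for i in range(DATA_CENTERS)]

for site, lost in enumerate(per_site):
    remaining = CHUNKS_TOTAL - lost
    ok = remaining >= CHUNKS_NEEDED
    print(f"lose site {site} ({lost} chunks): {remaining} left -> "
          f"{'recoverable' if ok else 'DATA LOSS'}")
```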
25. Other Scale-Out Architectures
- Clustered file systems: multiple hosts access a shared volume; metadata coordination and locking are complex; e.g., IBM GPFS
- Federated file systems: distribute folders across nodes; e.g., NetApp Cluster-Mode
[Diagram: cluster of nodes presenting one namespace]

26. Moore's Law Takes Over Storage
- Transistor cost halves every 2 years; processor speed doubles every 2 years
- Producing software-defined storage
- Flash memory cost declines

27. Today's Media Options — latency, and the same latency scaled to human terms (see the scaling sketch below)
- Local RAM: ~100 ns (human scale: 1 sec)
- Flash DIMM: 2.5-5 µs (25-50 sec)
- PCIe SSD: 20-50 µs (3-8 min)
- SAS or SATA SSD: 100-300 µs (16-50 min)
- All-flash array: 500-1100 µs (1.5-3 hr)
- Hybrid array: ~2 ms (6 hr)
- 15K RPM disk: 5 ms (15 hr)

28. Measuring Storage Performance, or Lies, Damned Lies, and Benchmarks
- IOPS: input/output operations per second — there's no such thing as a standard IOP; it depends on size (of the I/O and of the dataset), distribution (random, sequential, semi-random), and read/write ratio
- Latency: time per I/O operation; this is what determines application performance
- Many applications use chained I/O — a DB server may do 5-20 index reads per transaction

29. What Is Flash Memory?
- Solid-state, non-volatile memory; a stored-charge device
- Not as fast as DRAM, but retains data without power
- Reads and writes in blocks, but must erase 256 KB - 1 MB pages
- Erase takes 2 ms or more, and erases wear out cells
- Writes are always slower than reads

30. The Three (and a Half) Types of Flash
- Single Level Cell (SLC, 1 bit/cell): fastest; 100,000 program/erase cycle lifetime
- Multi Level Cell (MLC, 2 bits/cell): slower; 10,000 program/erase cycle lifetime
- eMLC or HET MLC (2 bits/cell): slightly slower writes; 30,000 cycles
- Triple Level Cell (TLC, 3 bits/cell): now ready for data center use; phones, tablets, maybe laptops
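A small sketch reproducing the "human scale" column of the media table in slide 27: every latency is multiplied by the same factor that turns local RAM's ~100 ns into 1 second (about 10^7). The per-device latencies below are the upper-end values from that table, used here only to show the arithmetic.

```python
# Scale device latencies to "human scale": 100 ns of RAM -> 1 second.
SCALE = 1.0 / 100e-9   # multiply by 1e7

media = {                      # representative latencies, in seconds
    "Local RAM":       100e-9,
    "Flash DIMM":      5e-6,
    "PCIe SSD":        50e-6,
    "SATA/SAS SSD":    300e-6,
    "All-flash array": 1100e-6,
    "Hybrid array":    2e-3,
    "15K RPM disk":    5e-3,
}

for name, latency in media.items():
    human = latency * SCALE
    if human < 120:
        print(f"{name:16s} {latency*1e6:8.1f} µs -> {human:6.0f} s")
    elif human < 7200:
        print(f"{name:16s} {latency*1e6:8.1f} µs -> {human/60:6.0f} min")
    else:
        print(f"{name:16s} {latency*1e6:8.1f} µs -> {human/3600:6.1f} hr")
```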
31. Flash's Future
- Today's state-of-the-art flash: 1x nm cells (16-19 nm)
- Smaller cells are denser, cheaper — and crappier
- Samsung is now shipping 3D NAND, because they couldn't get high-k to work
- Other foundries have 1-2 more shrinks left
- Other technologies post-2020: PCM, memristors, spin torque, etc.

32. Anatomy of an SSD
- Flash controller: provides the external interface (SATA, SAS, PCIe), wear leveling, error correction
- DRAM: write buffer and metadata
- Ultra- or other capacitor: dumps DRAM to flash on power failure (enterprise SSDs only)

33. Flash/SSD Form Factors
- SATA 2.5": the standard for laptops, good for servers
- SAS 2.5": dual ports for dual-controller arrays
- PCIe: lower latency, higher bandwidth; blades require special form factors
- SATA Express: 2.5" PCIe, frequently with NVMe

34. SSDs Use Flash, but Flash ≠ SSD
- Fusion-io cards: atomic writes (send multiple writes, e.g., the parts of a database transaction, as one), key-value store, FTL runs in the host CPU
- NVMe: PCIe, but with more and deeper queues
- Memory-channel flash (SanDisk UltraDIMM): block storage or direct memory; write latency as low as 3 µs; requires BIOS support; pricey

35. All-Flash Array Vendors Want You to Think of This

36. But Some Are This

37. Or, Worse, This

38. What You Really Want

39. Storage Services
- Snapshots: provide a point in time, typically per volume; clones are just read/write snapshots; not all are created equal
- Replication: synchronous (write to both targets before the ack; distance limits and performance impact), asynchronous (ack from the first target), or point-in-time (transfer and apply snapshots)
- Data reduction

40. Copy-on-Write Snapshots
- Copy-on-write: 3 I/Os per write
- Redirect-on-write: 1 I/O per write, plus a metadata update
- A 50% performance penalty was common
[Diagram: blocks 1-20; after a snapshot is created, the original contents of changed blocks 3, 7, 13, and 18 are written to the snapshot area, and the mounted snapshot still shows the original data]

41. Why Not vSphere Snapshots?
- Use log files, like linked clones: writes go to the log file while the snapshot exists
- Reads must walk the snapshot tree to find the current data
- Can have a substantial performance impact, which extends to the snapshot unwind
[Diagram: an application read checks snapshot 2, then snapshot 1, then falls through to the primary VMDK]

42. VMware Snapshot Overhead
- Log-based snapshots cause IOP multiplication on reads: each snap must be checked to see which has the latest data
[Chart: Iometer running a 4K OLTP workload — IOPS and average latency (ms) for the baseline, snapshot 1, and snapshot 2]

43. Data Reduction
- Trading CPU cycles for data space and writes
- Compression eliminates small data duplications, e.g., null padding in databases
- Data deduplication eliminates larger duplicates, e.g., all those copies of Windows
- Best performed inline rather than post-process (after the ack)

44. Data Deduplication (see the sketch below)
- Detect duplicate data: break the data into blocks, calculate a hash per block, and compare to a table of hashes already seen — that's the hard part
- Store unique data; for a dupe, use a pointer to the existing block (a volume/file is already a set of pointers) and add use-count metadata

45. Data Reduction and Flash
- Less data written means fewer IOPS consumed and reduced wear-out
- In hybrids: more data fits in the flash layer, so a higher flash hit ratio
- Possible performance hit on sequential reads

46. Enter Modern Storage
- At least some flash media
- Data layout is metadata driven: a volume is a list of logical blocks, and a snapshot/clone is a copy of the list
- Writes are coalesced in a high-performance log, then written in media-friendly blocks (a full page for SSDs, a full RAID stripe) with a metadata update
- Eliminates the small-write penalty
[Diagram: incoming data lands in the log, then is written to storage]
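A minimal sketch of the block-deduplication flow described in slide 44: hash fixed-size blocks, keep one copy of each unique block, and represent a volume as a list of pointers (hashes) plus a use count per block. The block size and the toy "VM" data are illustrative, not from the deck.

```python
# Block dedupe sketch: hash each block, store unique blocks once,
# and track references with a use count.
import hashlib

BLOCK_SIZE = 4096
store = {}        # hash -> block bytes (unique data only)
use_count = {}    # hash -> number of references

def write_volume(data: bytes):
    """Return the volume's block map (list of hashes) after dedupe."""
    block_map = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h not in store:              # new, unique block: store it
            store[h] = block
        use_count[h] = use_count.get(h, 0) + 1
        block_map.append(h)             # duplicate or not, just point at it
    return block_map

# Two "VMs" that share most of their contents dedupe down to one stored
# copy of the common blocks.
vm1 = write_volume(b"A" * 8192 + b"unique-to-vm1" + b"\x00" * 4083)
vm2 = write_volume(b"A" * 8192 + b"unique-to-vm2" + b"\x00" * 4083)
print("logical blocks:", len(vm1) + len(vm2), " unique stored:", len(store))
```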
47. Software-Defined Storage
- The x86 server is the building block of modern storage; 10 GigE provides a low-latency interconnect
- Delivered as software: Nexenta, Gluster
- Or wrapped in tin: Tintri, Nimble, Pure Storage

48. Enter Hyper-Converged Infrastructure
- Software that implements a shared-nothing scale-out storage system; typical cluster of 3-32 nodes
- Runs in the hypervisor kernel or in a VM
- Integrated storage and hypervisor management; per-VM data services
- Hybrid or all flash; e.g., Nutanix, VSAN

49. ServerSAN Architecture Differentiators
- Data protection model: per-node RAID? N-way replication? Network RAID or erasure codes?
- Flash usage: write-through or write-back cache, sub-LUN tiering
- Prioritization/storage QoS
- Data locality
- Data reduction
- Snapshots and cloning

50. Data Reduction and Flash
- Less data written means fewer IOPS consumed and reduced wear-out
- In hybrids: more data fits in the flash layer, so a higher flash hit ratio
- Possible performance hit on sequential reads

51. Accelerating at the Server (see the latency sketch below)
- Use memory in the server as a cache: DRAM and/or flash
- Leverage lower-latency local resources and capitalize on server economics
- Decouple capacity and performance; target performance dollars better
- Offload the storage systems
[Diagram: servers caching locally in front of an external disk array]

52. Capitalize on Server Economics
- Vendor gross margins: EMC ~60%, server vendors ~20%
- Resources cost less in servers. A 400 GB SSD: EMC for VMAX $6,600 (after a 40% discount); Dell for an R730 server $948-$2,500; Intel P3700 PCIe $1,400
- Server memory: $10-40/GB; server capacity 384-768 GB

53. Decouple Capacity and Performance
- Traditionally linked: each 15K RPM drive = 220 IOPS and up to 600 GB, resulting in vast overprovisioning
- Use a few big, slow disks for capacity; save drive slots (slots cost more than the disks they hold)
- Reduce capacity costs; more array SSDs don't always mean faster

54. Target Performance Better
- Application requirements vary, and arrays make addressing those needs hard: data is cached or migrated to flash by demand, regardless of whether the app deserves it
- The array is ignorant of the VMs in a datastore
- A server-side cache is VM-aware: critical VMs are cached, others are not

55. Offload Storage Systems
- Extend the life of the existing array; postpone the forklift upgrade at the primary and DR sites
- Wait for the new model, for VVol or other feature support
- Use a smaller model; run high-storage-I/O apps like VDI; maintain existing expertise

56. Optimizing Instant Recovery
- Instant recovery, the best new backup feature, spins up a VM directly from the backup repository
- But backup repositories are slow: big slow disks and disk-based dedupe
- A server-side cache accelerates the recovered VM and relieves the backup storage
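A rough sketch of why the server-side caching in slide 51 pays off: effective read latency is a weighted average of the local cache and the backend array. The latencies and hit ratios below are hypothetical round numbers, not measurements from the deck.

```python
# Effective read latency with a server-side cache in front of an array.
CACHE_LATENCY_US = 100      # local flash cache read, microseconds (assumed)
ARRAY_LATENCY_US = 2000     # hybrid array read over the SAN, microseconds (assumed)

def effective_latency(hit_ratio: float) -> float:
    """Weighted average of cache hits and array misses."""
    return hit_ratio * CACHE_LATENCY_US + (1 - hit_ratio) * ARRAY_LATENCY_US

for hit in (0.0, 0.5, 0.8, 0.9, 0.95):
    print(f"hit ratio {hit:4.0%}: effective read latency "
          f"{effective_latency(hit):6.0f} µs")
```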
57. Cache Types
- Write-through: caches reads; writes are acknowledged by the backend storage
- Write-back: replicates to cache across servers and acknowledges writes while they sit in cache on n servers
- Local or distributed? Reduced?
[Diagram: cached server cluster]

58. Write-Through and Write-Back
[Chart: TPC-C IOPS for baseline, write-through, and write-back configurations — 100 GB cache; the dataset starts at 330 GB and grows to 450 GB over the 3-hour test]

59. Storage and vSphere
- VASA: 1.0 passes volume info to vCenter; 2.0 is the control path for VVols
- VAAI: SCSI/NFS extensions for offload — locking, cloning

60. vStorage API for Array Integration (VAAI)
Empowers vSphere to offload tasks to the storage (primitive: function — benefit):
- Atomic Test and Set: locks a block range (a file lock for SAN) — no SCSI reservation, so more VMs per LUN
- Clone Blocks: copies data within the array — faster cloning and vMotion
- Block Zeroing: fills space with zeros (SCSI zero write) — space reclamation
- Out-of-space / thin-provision stun: suspends VMs when space runs out — allows more graceful recovery
- Unmap: releases and zeros blocks — returns free space to array thin provisioning

61. The LUN Must Die
- A 1:1 relation between apps and LUNs separates I/Os per LUN to identify sequentiality and provides data services: snapshots (especially application-consistent snapshots), clones, replication
- The timing of application quiescing makes datastore snapshots crash-consistent at best — only one VM per volume can be app-consistent
- Storage for virtualization should be VM-aware

62. What If We Just Had Lots of LUNs?
- Instead of many VMs = 1 LUN, make 1 LUN = 1 VMDK
- But provisioning and managing thousands of LUNs is real work, and SCSI (therefore FC and iSCSI) is limited to 256 connections
- VVols with VASA delivers per-VMDK interaction between the storage and ESXi, and therefore per-VM services
- Automated, policy-based provisioning

63. VVols Architecture
[Diagram]

64. The VVols Gotchas
- VVols on NFS is easier than VVols on block, so many first movers will be file — but VVols works only with NFS 3, not 4.1
- VVols is just an API; implementations will vary
- VVols will be hard for older array architectures: much more metadata is needed for tens of thousands of VVols vs. ~100 LUNs with 32 snapshots each

65. Storage and OpenStack
- Swift: an API and a scale-out object store; objects are stored in a file system on the nodes, with n-way object replication or (local) erasure codes
- Cinder: a storage provisioning API; LVM and NFS drivers ship in OpenStack, and vendors can provide their own drivers

66. Cinder API Functions
- Manage volume groups (à la storage policy — common characteristics)
- Manage volumes within a group: create, delete, etc.
- Snapshot management
- Backups
- QoS specs and extra specs: access to array features
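As a rough illustration of the Cinder provisioning functions listed in slide 66, here is a minimal sketch using the openstacksdk Python library. The cloud name, volume name, size, and volume type are hypothetical, and the exact method names may differ between SDK versions; treat this as a sketch of the workflow, not a definitive implementation.

```python
# Sketch: provision block storage through Cinder via openstacksdk.
# "mycloud", the volume name/size, and the volume type are hypothetical.
import openstack

conn = openstack.connect(cloud="mycloud")   # credentials come from clouds.yaml

# Create a 10 GB volume; a volume type can map to backend or QoS characteristics.
vol = conn.block_storage.create_volume(
    name="app-data-01",
    size=10,
    volume_type="ssd-tier",
)
conn.block_storage.wait_for_status(vol, status="available")

# Point-in-time snapshot of the volume (slide 66: snapshot management).
snap = conn.block_storage.create_snapshot(volume_id=vol.id, name="app-data-01-snap")
print(vol.id, snap.id)
```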
67. Quieting the Noisy Neighbor
- Storage Quality of Service: an emerging technology with varying forms
- Throttles: limit a workload to X IOPS; keeps neighbors from getting too noisy, but the limit applies even when performance is available — latency triggers reduce this impact (see the throttle sketch below)
- Prioritization: bronze, silver, gold; bronze workloads get starved when the system is stressed
- Minimum, maximum, burst: the most sophisticated, and therefore the most complex; direct allocation

68. Storage Analytics: Capacity Planning
- Capacity predictions let admins reduce overprovisioning
- Producing upgrade recommendations

69. But Why Is My Application Slow?
- View that VM's storage latency, broken down into host, network, and storage components

70. File Analytics
- Most of us just don't know when files are accessed, who accessed them, or what files they're storing
- Analytics are best done in the storage system; external scanners present load
- Leaders: DataGravity, Qumulo
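A minimal sketch of the "throttle" form of storage QoS from slide 67: a token bucket that caps a workload at a fixed IOPS rate while allowing a small burst. The limit and burst values are hypothetical, and real implementations add the latency triggers and priority tiers the slide mentions.

```python
# Token-bucket IOPS throttle: admit an I/O only if a token is available.
import time

class IopsThrottle:
    def __init__(self, iops_limit: int, burst: int):
        self.rate = iops_limit          # tokens added per second
        self.capacity = burst           # maximum tokens (burst size)
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self) -> bool:
        """Return True if one I/O may proceed now, False if it must wait."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Limit a noisy neighbor to 500 IOPS with a burst allowance of 100 I/Os.
throttle = IopsThrottle(iops_limit=500, burst=100)
admitted = sum(throttle.admit() for _ in range(1000))
print(f"admitted {admitted} of 1000 back-to-back I/O requests")
```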