2019 Storage Developer Conference. © Microsoft. All Rights Reserved. 1
Building File System Semantics for an Exabyte Scale Object Storage System
Shane Mainali, Raji Easwaran (Microsoft)
Agenda
- Analytics Workloads (access patterns & challenges)
- Azure Data Lake Storage overview
- Under the hood
- Q&A

Analytics Workloads: Access Patterns and Challenges
Analytics Workload Pattern
[Diagram: the analytics pipeline]
INGEST: sensors and IoT, files, media, logs (unstructured); business/custom apps (structured)
STORE: Azure Data Lake Storage Gen2
PREP & TRAIN: Azure Databricks, Azure Data Factory
MODEL & SERVE: Azure SQL Data Warehouse, Azure Analysis Services, Cosmos DB, Power BI, real-time apps
EXPLORE: Azure Data Explorer
Challenges
- Containers are mounted as filesystems on analytics engines like Hadoop and Databricks
- Client-side file system emulation impacts performance, semantics, and correctness
- Directory operations are expensive
- Coarse-grained access control
- Throughput is critical for big data
Storage for Analytics - Goals
- Address shortcomings of the client-side design
- First-class hierarchical namespace
- Interoperability with Object Storage (Blobs)
- Object-level ACLs (POSIX)
- Platform for future filesystem-based protocols (e.g. NFS)
Azure Data Lake Storage: File System Semantics on Object Storage
Hierarchical Namespace (HNS)
Azure Data Lake Storage Architecture
[Diagram: two APIs over a common foundation]
Blob API -> Unstructured Object Data (server backups, archive storage, semi-structured data)
ADLS Gen2 API -> File Data (Hadoop File System, file and folder hierarchy, granular ACLs, atomic file transactions)
Common Blob Storage Foundation:
- Object tiering and lifecycle policy management
- AAD integration, RBAC, storage account security
- HA/DR support through ZRS and RA-GRS
Blobs and Flat Namespace
[Diagram: the Blob API over a flat namespace, where data is addressed by a single opaque key such as "/foo/bar/file.txt"; there are no real directory entities.]
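As an illustrative sketch (not the actual service code), a flat namespace treats the whole path as one opaque key, so directory listings have to be emulated by scanning for a shared prefix, and "renaming a directory" means rewriting every key under it:

```python
# Hypothetical in-memory stand-in for a flat blob namespace.
flat_store = {
    "/foo/bar/file.txt": b"data1",
    "/foo/bar/other.txt": b"data2",
    "/foo/baz.txt": b"data3",
}

def list_directory(prefix: str) -> list[str]:
    """Emulate a directory listing with a prefix scan over all keys."""
    return sorted(k for k in flat_store if k.startswith(prefix))

print(list_directory("/foo/bar/"))  # ['/foo/bar/file.txt', '/foo/bar/other.txt']
```

This prefix-scan emulation is exactly the client-side behavior the earlier "Challenges" slide calls expensive.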
Files and Folders in HNS
[Diagram: with a hierarchical namespace, both the ADLS API and the Blob API reach the same data through real directory entities: foo / bar / baz.txt.]
Mapping the concepts

Same Storage account; URIs are identical except for the endpoint:
http://account.dfs.core.windows.net/container/videos/movie.mp4
http://account.blob.core.windows.net/container/videos/movie.mp4

File System == Container
- Create File System and Create Container APIs do the same thing
- Exactly the same metadata and objects under the covers

Directory ~= Blob
- Directories are first-class entities; both implicit and explicit creation supported
- Implicit creation when blobs are created
- ACLs and leases obeyed by both

File == Blob
- ADLS Gen2 adds Append and Flush semantics
- Existing Blob semantics supported as-is
- ACLs and leases obeyed by both

ADLS: Account / File System / Directory / File  <=>  Blob: Account / Container / Blob
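Since only the endpoint host differs between the two URIs above, translating between them is pure string rewriting. A minimal sketch:

```python
from urllib.parse import urlparse, urlunparse

def to_blob_uri(dfs_uri: str) -> str:
    """Rewrite an ADLS Gen2 (dfs) endpoint URI to the equivalent Blob URI.
    Account, container, and object path are shared; only the host changes."""
    parts = urlparse(dfs_uri)
    host = parts.netloc.replace(".dfs.core.windows.net", ".blob.core.windows.net")
    return urlunparse(parts._replace(netloc=host))

print(to_blob_uri("http://account.dfs.core.windows.net/container/videos/movie.mp4"))
# http://account.blob.core.windows.net/container/videos/movie.mp4
```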
API Interoperability
Either the Blob or the ADLS Gen2 APIs can be used to access the same data.
Existing Blob applications work without code changes and with no data movement on the Data Lake account.
Under the Hood: Designing for Performance, Scale & Throughput
Blob Storage Architecture
Front End: stateless, front door for request handling (auth, request metering/throttling, validation)
Partition Layer: serves data in key-value fashion based on partitions; enables batch transactions and strong consistency
Stream Layer: stores multiple replicas of the data; deals with failures, bit rot, etc.
Blob Storage with HNS Architecture
Front End: stateless, front door for request handling (auth, request metering/throttling, validation)
Hierarchical Namespace: serves metadata based on partitions, including file names, directory structure, and ACLs
Partition Layer: serves data in key-value fashion based on partitions; enables batch transactions and strong consistency
Stream Layer: stores multiple replicas of the data; deals with the media/devices; handles failures, bit rot, etc.
Hierarchical Namespace Topology
[Diagram: the paths below map to a tree of directory GUIDs; a (parent GUID, child GUID) pair resolves to a path component name.]

Paths: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3

GUID tree edges:
GUID1 -> GUID2
GUID1 -> GUID3
GUID3 -> GUID4

Name mappings:
--------, GUID1 <=> "/"
GUID1, GUID2 <=> "path1"
GUID1, GUID3 <=> "path2"
GUID3, GUID4 <=> "path3"

Files file1 and file2 hang off GUID3 (/path2/); file3 hangs off GUID4 (/path2/path3/).
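The topology above can be sketched in a few lines: directories are GUIDs, a (parent GUID, name) edge resolves to a child GUID, and resolving a path is a walk down these edges rather than a flat-key lookup. This is an illustrative model, not the service's actual data structures:

```python
# Edges from the diagram: (parent GUID, component name) -> child GUID.
EDGES = {
    ("GUID1", "path1"): "GUID2",
    ("GUID1", "path2"): "GUID3",
    ("GUID3", "path3"): "GUID4",
}
ROOT = "GUID1"  # GUID for "/"

def resolve(path: str) -> str:
    """Resolve a directory path like '/path2/path3/' to its GUID."""
    guid = ROOT
    for name in path.strip("/").split("/"):
        if name:  # skip empty components from leading/trailing slashes
            guid = EDGES[(guid, name)]
    return guid

print(resolve("/path2/path3/"))  # GUID4
```

Because names live on the edges rather than inside full-path keys, renaming a directory only relabels one edge, as the rename flow later in the deck shows.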
Partition Layer Schema
#   Parent ID    Name   CT     Del  File  Metadata  Child ID
1   GUID-ROOT    .      00001  N    Y     …         GUID-BLOB1
2   GUID-ROOT    path1  00100  N    N     …         GUID-PATH1
3   GUID-ROOT    path2  00200  N    N     …         GUID-PATH2
4   GUID-PATH1   .      00100  N    Y     …         GUID-BLOB2
6   GUID-PATH2   .      00200  N    N     …         GUID-BLOB3
    GUID-PATH2   file1  00300  N    Y     …         GUID-BLOB4
7   GUID-PATH2   file1  00350  N    Y     …         GUID-BLOB4
8   GUID-PATH2   file2  00400  N    Y     …         GUID-BLOB5
    GUID-PATH2   path3  00400  N    N     …         GUID-PATH3
10  GUID-PATH3   .      00400  N    Y     …         GUID-BLOB6
11  GUID-PATH3   file3  00401  N    N     …         GUID-BLOB7

Key example (row 2):
Partition Key: Account;FileSystem;GUID-ROOT    Row Key: path1    Columns: 00100 …

Paths represented: /, /path1/, /path2/, /path2/file1, /path2/file2, /path2/path3/, /path2/path3/file3
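The key scheme above can be sketched as follows: the partition key combines account, filesystem, and the parent directory's GUID, and the row key is the child name. All children of one directory therefore share a partition key, making a directory listing a single-partition range scan. A minimal illustration (the separator and key layout follow the slide; everything else is hypothetical):

```python
def make_keys(account: str, filesystem: str, parent_guid: str, name: str):
    """Build the (partition key, row key) pair for one namespace entry."""
    partition_key = f"{account};{filesystem};{parent_guid}"
    return partition_key, name

pk, rk = make_keys("Account", "FileSystem", "GUID-ROOT", "path1")
print(pk, rk)  # Account;FileSystem;GUID-ROOT path1

# Siblings share a partition key, so listing a directory is one range scan:
pk2, _ = make_keys("Account", "FileSystem", "GUID-ROOT", "path2")
print(pk == pk2)  # True
```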
Hierarchical Namespace Flow
[Diagram: Create File for Staging\Oscars\Movie.mp4. The Front End routes the request to the Hierarchical Namespace, which records one entry per path component:]

Parent  Name   Label
<Null>  Guid1  Staging
Guid1   Guid2  Oscars
Guid2   Guid3  Movie.mp4
Hierarchical Namespace Flow
[Diagram: Rename Directory, yielding Master\Oscars\Movie.mp4. Only the top-level directory's label changes from "Staging" to "Master"; the GUIDs and all descendant entries are untouched:]

Parent  Name   Label
<Null>  Guid1  Staging -> Master
Guid1   Guid2  Oscars
Guid2   Guid3  Movie.mp4
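The rename flow above can be sketched to show why it is cheap under HNS: the namespace maps a (parent, GUID) pair to a label, so renaming Staging to Master rewrites a single label. No blob data moves and descendants keep their GUIDs. An illustrative model, not the service's code:

```python
# (parent GUID, own GUID) -> label, mirroring the table in the diagram.
namespace = {
    (None, "Guid1"): "Staging",
    ("Guid1", "Guid2"): "Oscars",
    ("Guid2", "Guid3"): "Movie.mp4",
}

def rename(parent, guid, new_label):
    """Rename a directory: a single O(1) row update, no data movement."""
    namespace[(parent, guid)] = new_label

rename(None, "Guid1", "Master")
print(namespace[(None, "Guid1")])       # Master
print(namespace[("Guid1", "Guid2")])    # Oscars (descendants untouched)
```

Contrast this with the flat namespace, where the same rename would rewrite every key prefixed with the old directory path.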
Scale Out & Load Balancing

[Diagram: two Scale Units, each with many nodes; each node runs several Namespace Processors (NPs) that own GUID ranges, e.g.:]
Scale Unit 1: NP1: GUID 1-100, NP2: GUID 101-200, NP3: GUID 201-300, NP4: GUID 301-400, NP5: GUID 401-500, NP6: GUID 501-1000
Scale Unit 2: NP1: GUID 1001-1100, NP2: GUID 1101-1200, NP3: GUID 1201-1300, NP4: GUID 1301-1400, NP5: GUID 751-1000, NP6: GUID 501-750

- A Scale Unit contains hundreds of nodes
- Each node has many Namespace Processors (NPs)
- Each NP manages a portion of the namespace (a GUID range for each <Account, FileSystem>)
- Hot nodes are load balanced with other nodes in the Scale Unit by splitting managed GUID ranges among NPs
- An Azure region contains several Scale Units
- When a majority of nodes in a Scale Unit become hot, load balancing occurs across Scale Units
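The load-balancing move described above can be sketched as a range split: a hot NP's GUID range is halved and one half is handed to a peer. The midpoint split and NP names below are illustrative, matching the NP5/NP6 example in the diagram:

```python
# NP name -> inclusive (lo, hi) GUID range it owns.
assignments = {"NP6": (501, 1000)}

def split_range(hot_np: str, target_np: str):
    """Split a hot NP's GUID range and hand the upper half to a peer NP."""
    lo, hi = assignments[hot_np]
    mid = (lo + hi) // 2
    assignments[hot_np] = (lo, mid)         # hot NP keeps the lower half
    assignments[target_np] = (mid + 1, hi)  # peer NP takes the upper half

split_range("NP6", "NP5")
print(assignments)  # {'NP6': (501, 750), 'NP5': (751, 1000)}
```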
Transaction Processing and Caching
[Diagram: resolving /path2/file1 against the GUID tree spans multiple Namespace Processors (NP1-NP9); each component of the path may be owned by a different NP.]

Parent ID    Name   File  Child ID
GUID-ROOT    path2  N     GUID-PATH2
GUID-PATH2   file1  Y     GUID-BLOB4

ID           Name   Owning NP
GUID-ROOT    /      NP1
GUID-PATH2   path2  NP7
GUID-BLOB4   file1  NP6
High Throughput
- A single object/file contains multiple blocks
- The block range is partitioned uniformly across partitions
- A single write can potentially be served by all partition nodes
- Supports 100s of Gbps of ingress/egress for a single account or to a single file
- Two layers of caching enable high-throughput read performance

[Diagram: a 50 GB Movie.mp4 whose block ranges are served by multiple partition servers in parallel.]
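The uniform block partitioning above can be sketched as a mapping from byte offset to partition. The round-robin scheme, the 4 MiB block size, and the partition count are illustrative stand-ins, not the service's real parameters; the point is only that offsets one block apart land on different servers, so a large sequential write fans out across them:

```python
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, illustrative
NUM_PARTITIONS = 8            # illustrative partition-server count

def partition_for_offset(offset: int) -> int:
    """Map a byte offset within the file to the partition serving its block."""
    block_index = offset // BLOCK_SIZE
    return block_index % NUM_PARTITIONS

# Consecutive blocks spread across all partitions, then wrap:
print([partition_for_offset(i * BLOCK_SIZE) for i in range(10)])
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```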
Performance & Scale Implications
- Hierarchical Namespace is only in the path of namespace traversal and metadata operations; data reads and writes don't go through it
- Hierarchical Namespace leverages SSD for persistence and memory for caching to minimize latency overhead
- Separation of the distributed cache and persistent state (Partition) layers is critical
- Load balancing is very efficient and fast; it leverages the Partition Layer, with distinct partitioning for Blobs and HNS
- While distributed transactions are more expensive, they are less frequent
Opportunities
- Snapshots at any level of the hierarchy
- Time-travel operations with end-to-end built-in transaction timestamps
- Support for a wide variety of file systems
  - Interop across all
  - Zero data copying
- In-place upgrade from flat -> hierarchical namespace
- Cross-entity strongly consistent reads
- High-fidelity on-prem -> cloud migration/hybrid
Q & A