Post on 11-Aug-2020
IBM Research Lab at Haifa
Object Storage and beyond
Julian Satran
IBM Research Laboratory at Haifa
From Direct Attached Storage to Network Attached Storage…

[Diagram: DAS vs. NAS. With DAS, the application, the file system/database, and the block storage all reside in a single server. With NAS, the application moves to a separate machine and reaches the server hosting the file system/database and block storage over a network.]
…to Storage Area Networks (SAN)

[Diagram: SAN. The application server keeps the application and the file system/database; the block storage moves to a dedicated storage server reached over a network.]
Combined Technologies

[Diagram: the application server talks to a file/database server over an IP network using the NFS or CIFS protocol; the file/database server talks to the storage server (block storage) over a FibreChannel or IP network using FCP or iSCSI (SCSI).]
SAN promise – Unmediated Access to Data

[Diagram: the application server sends control traffic to the file/database server but moves data directly to and from the block storage on the storage server. Issue – security!]
SAN promise – Disk Sharing

[Diagram: many application servers share several storage servers directly over the SAN, with the file/database server on the control path only. Issue – security.]
Object Storage

                Today's Block Device       Object Store
Operations      read block, write block    read object offset, write object
                                           offset, create object, delete object
Security        weak (full disk)           strong (per object)
Allocation      external                   local
Object Store Operations

Basic operations:
- Create Object
- Delete Object
- Write Offset in Object
- Read Offset in Object

Administrative operations

Basic abstract flow:
- Create an object, getting back an object ID (it is the client's responsibility to remember the ID)
- Send requests to read and write the object, given the ID
- Delete the object when done using it
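The abstract flow above can be sketched as a minimal in-memory object store. This is a hypothetical illustration of the interface, not the T10 or Antara protocol; all names are invented.

```python
import itertools

class ObjectStore:
    """Minimal in-memory sketch of the basic object-store operations."""

    def __init__(self):
        self._objects = {}              # object ID -> bytearray
        self._ids = itertools.count(1)  # ObS-generated IDs

    def create_object(self):
        oid = next(self._ids)
        self._objects[oid] = bytearray()
        return oid                      # the client must remember this ID

    def write(self, oid, offset, data):
        obj = self._objects[oid]
        if len(obj) < offset + len(data):
            # grow the sparse tail with zeros up to the write offset
            obj.extend(b"\x00" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def delete_object(self, oid):
        del self._objects[oid]
```

A client would call `create_object`, keep the returned ID, issue `read`/`write` against it, and finally `delete_object`.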
Object Store Security

- All operations are secured by a credential
- Security is achieved by the cooperation of:
  - Admin – authenticates, authorizes, and generates credentials
  - ObS – validates the credential that a host presents
- The credential is cryptographically hardened; the ObS and the Admin share a secret
- The goals of Object Store security are:
  - Increased protection/security at the level of objects rather than whole LUs; hosts do not access metadata directly
  - Allow non-trusted clients to sit in the SAN
  - Allow shared access to storage without giving clients access to all data on a volume

[Diagram: the Security Admin and the Object Store hold a shared secret; the client sends an authorization request to the Admin, receives a credential, and presents that credential to the Object Store.]
Object Store Operations – Read

Read parameters: Object Store ID, Object ID, Offset, Length, Credentials

Basic steps an object store must provide:
- Receive the request
- Validate the credentials
- Find the allocation data for the indicated object
- Map the offset and length to a collection of LBAs in the underlying block storage
- Stage the data if necessary
- Gather the data and return it to the host

Issues and variants:
- Block alignment
- Reads of non-allocated data (sparse regions vs. past the "end" of the object)
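A minimal sketch of the map-and-gather steps, assuming a simple per-object table from logical block number to LBA (all names are illustrative; a real store maps to extents, as later slides describe):

```python
BLOCK = 512  # assumed block size

def read_object(block_table, disk, offset, length):
    """Map (offset, length) onto LBAs and gather the data.

    block_table: logical block number -> LBA (absent = unallocated).
    disk: LBA -> BLOCK-sized buffer. Sparse holes read back as zeros.
    """
    out = bytearray()
    pos, end = offset, offset + length
    while pos < end:
        lbn, in_block = divmod(pos, BLOCK)
        n = min(BLOCK - in_block, end - pos)  # stay within one block
        lba = block_table.get(lbn)
        if lba is None:
            out += b"\x00" * n                # non-allocated: return zeros
        else:
            out += disk[lba][in_block:in_block + n]
        pos += n
    return bytes(out)
```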
Object Store Operations – Write

Write parameters: Object Store ID, Object ID, Offset, Length, Credentials, Data

Basic steps an object store must provide:
- Receive the request
- Validate the credentials
- Find the allocation data for the indicated object
- Determine whether the indicated range is already bound to a collection of underlying LBAs
- If not already bound, determine the mapping and update the metadata
- Destage the data to the indicated LBAs

Issues and variants:
- Use of a non-volatile write cache
- Late binding
- Hardening the metadata updates; ensuring metadata updates take effect only once the data is hardened
- Block alignment
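The bind-then-destage steps can be sketched as follows, using the same hypothetical block-table layout as the read example; allocation is simplified to popping from a free list, and hardening/caching are omitted:

```python
BLOCK = 512  # assumed block size

def write_object(block_table, disk, free_lbas, offset, data):
    """Bind any unmapped logical blocks to fresh LBAs, then destage the data."""
    pos = 0
    while pos < len(data):
        lbn, in_block = divmod(offset + pos, BLOCK)
        n = min(BLOCK - in_block, len(data) - pos)
        if lbn not in block_table:                 # range not yet bound:
            block_table[lbn] = free_lbas.pop()     # determine the mapping
            disk[block_table[lbn]] = bytearray(BLOCK)  # update the metadata
        lba = block_table[lbn]
        disk[lba][in_block:in_block + n] = data[pos:pos + n]  # destage
        pos += n
```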
Capability Structure

- Type – does the credential apply to a specific object or to the entire object store?
- Object rights – Read, Write, Append, Truncate, Create (given an ID), Delete, Info
- Object Store (ObS) rights – Format, Create (the ObS generates the ID), Info on the object store
- Ver(sion) – used to allow the credential to time out

[Field layout: Type | ObS ID | Object ID | object rights R W A T C D I | ObS rights F C* I* | Ver]
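One way to encode the rights fields is as bit flags. This is an illustrative sketch of the idea, not the wire layout of the T10 draft:

```python
from enum import IntFlag

class ObjectRights(IntFlag):
    """Per-object rights from the capability (illustrative encoding)."""
    READ = 0x01
    WRITE = 0x02
    APPEND = 0x04
    TRUNCATE = 0x08
    CREATE = 0x10   # create, given an ID
    DELETE = 0x20
    INFO = 0x40

def permits(capability_rights: ObjectRights, requested: ObjectRights) -> bool:
    """A request is allowed only if every requested right is in the capability."""
    return (capability_rights & requested) == requested
```

For example, a read-write capability permits a write but not a delete.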
Credential Structure

- Capability – the operations that the credential entitles
- Nonce – a salt generated by the Admin; different for every credential
- MAC (Message Authentication Code) – a standard cryptographic hash over the capability and the encrypted secret; ensures that a host cannot alter or forge a credential
- Secret – the (un-encrypted) secret; used by the client to verify that it got the credential from the Admin

[Diagram: the credential fields are Capability | Nonce | MAC | Secret. The client passes the public credential (capability, nonce, MAC) to the ObS; the secret part of the credential stays private to the client.]
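A simplified sketch of issuing and validating such a credential, using an HMAC keyed with the Admin/ObS shared secret. The actual scheme hashes the capability together with an encrypted secret; that detail is elided here, and all names are assumptions.

```python
import hashlib
import hmac
import os

SHARED_SECRET = b"admin-obs-shared-secret"  # assumed key shared by Admin and ObS

def issue_credential(capability: bytes):
    """Admin side: salt the capability with a fresh nonce and MAC it."""
    nonce = os.urandom(16)  # different for every credential
    mac = hmac.new(SHARED_SECRET, capability + nonce, hashlib.sha256).digest()
    return capability, nonce, mac  # the public credential

def validate_credential(capability: bytes, nonce: bytes, mac: bytes) -> bool:
    """ObS side: recompute the MAC; any altered capability fails the check."""
    expected = hmac.new(SHARED_SECRET, capability + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(mac, expected)
```

Because the host never sees the shared secret, it cannot recompute the MAC for a forged capability.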
SAN File System without an Object Store

[Diagram: the client talks to the File Manager (steps 1, 2, 4, 5, 7) and performs I/O directly against the block device (steps 3, 6).]

1. Create file
2. Return allocation bitmap
3. I/O (unsecured)
4. Request an additional allocation bitmap
5. Return allocation bitmap
6. I/O (unsecured)
7. Return the actual allocation data to the File Manager
SAN File System with an Object Store

[Diagram: the client talks to the File Manager (steps 1, 2) and to the Object Store (steps 3, 4).]

1. Create file
2. Return locks and credentials
3. Create object and perform I/O with credentials
4. I/O with credentials
Blocks vs. Objects vs. Files

Block storage:
- Fastest
- Small set of operations
- No connection to the user's abstraction
- Dumb device
- No access security

Object storage:
- Fast
- Small set of operations
- An object usually maps to a user abstraction
- Local space management
- Enables end-to-end management
- Access is secure

File storage:
- Slow
- Rich set of operations (sometimes more than the application requires): locking, a hierarchical name service, …
- A file usually maps to a user abstraction
- Access is secure (strength depending on the protocol)
How Real Are Object Stores?

Object stores:
- First big push: Garth Gibson et al., NASD – CMU, Panasas
- DSF Storage Manager
- Antara – a prototype object store
- Lustre and the Object Store Target (LLNL)
- CAS (EMC Centera – a write-once ObS)
- …

Drivers:
- iSCSI – IP access to storage exacerbates SAN security problems
- Data Sharing Facility (DSF) – a highly scalable research file system which incorporated an object-store-like component to ensure local space allocation
- Storage Tank and other SAN file systems – shared access requires SAN security (or trusted clients!)
What is Antara*?

- A prototype implementation of an object store as a standalone control unit
- First-generation ObS prototype (generation zero: the DSF Storage Manager; generation two: …)
- Initial focus on integrity and recoverability of metadata and on a scalable design; current focus on demonstrating "reasonable" performance
- Multiple versions have been developed

Features:
- IP connectivity (but the design is fairly transport independent)
- Supports the Antara ObS protocol – similar in functionality to the T10 draft, but not SCSI
- Commands currently supported: Open/Close Session, Create/Delete Object, Read/Write/Append, Truncate, …
- Exports a single object store
- The current version assumes a non-volatile store (prior versions did not make this assumption)

* "interior" in Sanskrit
General Design

- A conceptual pipeline
- Uses completion ports to avoid context switches; in most cases the same operating-system thread handles multiple stages
- The modules are:
  - I_Module (communication input, connection management)
  - S_Module (security)
  - C_Module (control and dispatching)
  - L_Module (lookup, metadata tables, and locking mechanisms)
  - RW_Module (read-write of data and log of metadata)
  - O_Module (communication output)
- The C, L, and S modules are mostly unique to the object store; I, RW, and O provide functions found in most control units

[Diagram: abstraction of the Antara flow – a new request starts at the I_Module and passes through the S_Module, C_Module, L_Module, and RW_Module; replies leave through the O_Module, and an error path leads directly to the O_Module.]
Flow for Read and Write in Antara

1. (I) Receive the CB, determine the needed buffers, allocate them, and receive the data into the buffers
2. (C) Mark the request as running; delay the request if it clashes with another operation
3. (C,I) Notify the OS that we are ready to receive the next request from the client (additional requests are handled in parallel)
4. (S) Check security (e.g., proper session, credentials, etc.)
5. (L) Perform the necessary lookups (and/or allocations in the case of writes); put the allocation information in the request
6. (RW) Read/write the data from/to the cache
7. (L) Mark the request as done¹
8. (L) Log the request (a no-op in the case of a read); ensures hardening of the allocation information
9. (L) Mark the request as logged (a no-op in the case of a read); allows delayed requests to complete
10. (O) Send the reply
11. (L) Mark the request as sent
12. (O) Deallocate the buffers used for the request

¹ Allows delayed requests to start, e.g., on a read-write conflict, although the read response may need to wait for the log to complete.
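A compressed sketch of the security, lookup, cache-access, logging, and reply stages, with the state marks reduced to plain strings. Buffer management and the parallel delay logic are omitted, and every name here is an illustrative assumption rather than Antara's code:

```python
def handle_request(req, cache, log):
    """Walk one request through Antara-style stages (simplified sketch)."""
    req["state"] = "running"              # (C) mark as running
    if not req.get("credential_ok"):      # (S) security check
        return {"status": "error"}        # error path -> reply via O_Module
    key = (req["oid"], req["offset"])     # (L) lookup/allocation
    if req["op"] == "write":
        cache[key] = req["data"]          # (RW) write into the cache
        log.append(key)                   # (L) logging hardens allocation info
    req["state"] = "logged"               # (L) a no-op mark for reads
    reply = {"status": "ok"}
    if req["op"] == "read":
        reply["data"] = cache.get(key)    # (RW) read from the cache
    req["state"] = "sent"                 # (O) reply sent, buffers freed
    return reply
```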
Metadata

- Significant effort was invested in designing efficient, scalable data structures that minimize contention for concurrent clients and multiple processors
- The Object Directory maps from an OID to the object's metadata via a sparse hash table
- The object metadata includes the object size and a block number table
- The block number table maps from an offset in the object to extents:
  - Small objects use a dynamic, linear-probe hash table
  - Large objects use a B-tree
- Free space is managed via a buddy-list bitmap

[Diagram: the Object Directory points to the Object Meta Data, which points to the Block Number Table – implemented as hash tables for small objects and as B-trees for large objects.]
Metadata (cont.)

- The linear-probe hash table used for mapping OIDs to OMDs was chosen for parallelism:
  - Not all of the metadata can fit into main memory, so some metadata accesses will require disk accesses
  - Shared/exclusive locks, created on the fly, allow locking only specific entries of the hash table
- The block-number-table implementations were chosen to minimize page faults:
  - For large objects, B-trees were chosen based on similar choices in databases
  - For small objects, an entire B-tree page is inefficient in terms of both space and time
- Entries in the block number table represent extents; extents range from one block up to the maximum amount of data transferred in a write request
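An extent entry can be modeled as (first logical block, first LBA, length). The lookup below sketches what the hash-table or B-tree search must answer; it is an illustration of the mapping, not Antara's implementation:

```python
import bisect

def lookup_lba(extents, lbn):
    """Resolve a logical block number through a sorted extent list.

    extents: sorted list of (logical_start, lba_start, n_blocks).
    Returns the LBA, or None for a hole (unallocated block).
    """
    i = bisect.bisect_right([e[0] for e in extents], lbn) - 1
    if i >= 0:
        logical_start, lba_start, n_blocks = extents[i]
        if lbn < logical_start + n_blocks:
            return lba_start + (lbn - logical_start)
    return None
```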
Antara Space Allocation

Goals:
- Logically consecutive blocks of an object map to consecutive LBAs
- Avoid fragmentation
- Quick allocation decisions

Approach:
- Maintain an MRU cache of objects receiving allocating writes
- Non-persistently associate with each object a large extent of consecutive LBAs
- For an allocating write, take space from the large extent at the appropriate offset
- Persistently update the free-space bitmap
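A toy version of this approach. The extent size and the simple bump-pointer carving are assumptions; the real allocator uses a buddy-list bitmap and an MRU cache with eviction:

```python
class ExtentAllocator:
    """Reserve a large run of consecutive LBAs per actively-written object."""

    EXTENT = 1024  # blocks reserved per object (assumed size)

    def __init__(self, total_blocks):
        self.bitmap = [False] * total_blocks  # persistent free-space bitmap
        self.reserved = {}                    # oid -> base LBA (non-persistent)
        self._next = 0                        # next unreserved region

    def allocate(self, oid, logical_block):
        base = self.reserved.get(oid)
        if base is None:                      # carve a fresh large extent
            base, self._next = self._next, self._next + self.EXTENT
            self.reserved[oid] = base
        lba = base + logical_block            # consecutive logical -> consecutive LBAs
        self.bitmap[lba] = True               # persistently mark the block used
        return lba
```

Because each object draws from its own reserved extent, logically adjacent writes land on adjacent LBAs even when several objects are written concurrently.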
Storage Tank Background (IBM Research Almaden)

[Diagram: ST clients (UNIX, Windows) and CIFS/NFS gateways use a control network (IP) to talk to a cluster of meta-data servers, while moving data directly over the SAN (IP/FC); backup shares the same data path.]

Capabilities:
- Performance and semantics similar to a local file system
- Sharing like NAS
- Policy-based, centralized storage management
Object Store and Storage Tank

- Storage Tank was designed with an object store in mind
- The object store will be at the same level as the control units
- Integrating an object store into Storage Tank requires changes to the Meta Data Server and the Storage Tank client
- The main customer benefit will be security and protection; other benefits will follow
- The ObS initiator is the callable module that interacts with the ObS target:
  - The initiator is integrated into both the Storage Tank client and the Meta Data Server
  - The current initiator comes in several flavors: synchronous and asynchronous, user-mode and kernel-mode
Object Store added to Storage Tank

[Diagram: the same Storage Tank picture, with object stores replacing the plain storage on the SAN; ST agents on the clients talk to the meta-data servers over the control network (IP) and to the object stores over the SAN (IP/FC).]

The object store adds:
- SAN security
- Scalability
- Manageability
ObS vs. Alternatives

Complete segregation of the disk space (secure large entities):
- Pros: no infrastructure change
- Cons: no sharing; requires careful synchronization

NAS – the only complete alternative:
- Requires all applications to use the FS naming structure (heavy)
- Does not scale
What Next

- Data repositories with semantic access (next-generation FS)
- Closer "coupling" with search and indexing engines
- Passive and active access models (push/pull)
- Richer security models (including securing data at rest)
- Multimodal data
Data Repositories with Semantic Access (next-generation FS)

- With file systems holding over a billion files, access by file name is cumbersome
- Content-based indexes and "web-like" searching are far more convenient
- In specialized domains, indexing can be based on manually created metadata
- The most common case is "indexing" extracted automatically from the content (data, text, and media)
- This technology is required everywhere – from desktops to large servers
- Current access schemes based on a preset DAG should be seen as just instances of more general organization schemes
Closer Coupling with Indexing and Searching

- Large repositories will continuously re-index their content
- Issues/alternatives:
  - Who does it?
  - How does storage support it (poll, interrupt, specialized data structures)?
Passive and Active Access Models (push/pull)

- Repositories are regularly accessed in push or pull mode
- Should storage support both, or should it keep only the current pull (passive) mode?
Richer Security Models (including securing data at rest)

- Security of data at rest is of major importance
- The IEEE standardization effort is timely
- There is a strong relationship between ObS and securing data at rest
Multimodal Data

- Indexing and searching in multimedia are highly specialized
- They require substantial computing power but are unlikely to change frequently