RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Page 1: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS Architecture Overview: RozoFS components

edition 1.4

23/01/2015

Page 2: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS architecture overview: Components

[Figure: architecture diagram. Labels: client node running Rozofsmount (mounted on /fs1/home/); metadata server (Exportd, metadata); four Storage nodes (Sid1: host1 ...); control path; data path.]

Page 3: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

© This document is proprietary and confidential. No part of this document may be disclosed in any manner to a third party without the prior written consent of FIZIANS SAS. 3

Storage component

[Figure: a Storage node runs a storage process listening on IP@:port and serving the storages [cid1,sid1], [cid2,sid1] ... [cidn,sid1]. Each storage sits on devices 0 to n; each device is a file system (e.g. XFS) on RAID 0 (0+1, 5, 6) over physical disks.]

• A storage (cid/sid) is a set of logical disks (devices) with the same capacity and performance.
• On the same server, RozoFS can provide storages based on different technologies.
Note: configuration can be done with or without a RAID controller.

Page 4: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS clusters and Volumes

[Figure: storage nodes host_1 to host_n each contribute storages to Cluster 1, Cluster 2 ... Cluster n. Volume 1 groups Cluster 1 to Cluster n, each listed as Sid1:host_1 ... Sidn:host_n.]

• A RozoFS cluster (cid) is a uniform set of storages (sid) in terms of disk capacity and performance.
• A cluster id is unique within a RozoFS system.

Page 5: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Mapping filesystems on volumes

[Figure: Volume 1 (Cluster 1 to Cluster n) hosts Filesystem 1 to Filesystem j; Volume 2 (Cluster n+1 to Cluster n+p) hosts Filesystem j+1 to Filesystem j+k.]

• RozoFS supports configurations with multiple volumes.
• A volume can host more than one file system (thin provisioning).
• There are quotas (hard and soft) per file system.
• A file system is identified by a unique id (eid) within the configuration.
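The relations between volumes, clusters, storages and file systems can be sketched as a small data model. This is only an illustration: the class names, the quota fields and the `validate` helper are hypothetical and do not reflect the actual RozoFS configuration format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Storage:
    sid: int
    host: str

@dataclass
class Cluster:
    cid: int
    storages: list[Storage]

@dataclass
class Volume:
    vid: int
    clusters: list[Cluster]

@dataclass
class Export:                 # one file system
    eid: int
    root: str
    vid: int
    soft_quota: int = 0       # 0 means "no quota" in this sketch
    hard_quota: int = 0

def validate(volumes, exports):
    """Check the uniqueness rules stated on the slides."""
    cids = [c.cid for v in volumes for c in v.clusters]
    if len(cids) != len(set(cids)):
        raise ValueError("cluster ids must be unique within the system")
    eids = [e.eid for e in exports]
    if len(eids) != len(set(eids)):
        raise ValueError("export ids must be unique within the configuration")
    vids = {v.vid for v in volumes}
    for e in exports:
        if e.vid not in vids:
            raise ValueError(f"export {e.eid} refers to unknown volume {e.vid}")

hosts = [Storage(sid, f"host{sid}") for sid in range(1, 5)]
volume1 = Volume(vid=1, clusters=[Cluster(cid=1, storages=hosts)])
# Two file systems hosted by the same volume (thin provisioning).
exports = [Export(eid=1, root="/metadata/fs1", vid=1),
           Export(eid=2, root="/metadata/fs2", vid=1)]
validate([volume1], exports)
```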

Page 6: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


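File localization within a filesystem

[Figure: a file of Filesystem 1 ... Filesystem j (Volume 1, Cluster 1 to Cluster n) goes through the Mojette Transform; the resulting projections are stored on storage nodes, each identified as a Storage (cid/sid).]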

Page 7: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS configuration

Exportd node (export configuration):
  Eid1:/metadata/fs1,vid=1
  Volume 1 ... Volume i, each made of Cluster 1 ... Cluster n
  Each cluster: Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4

Storage node (Storage_conf):
  Listening_endpoints (@IP:port)
  [cid1,sid1]: pathname1, device_count
  [cid2,sid1]: pathname2, device_count

rozofsmount node (fstab):
  rozofsmount mount_path rozofs export@IP,/metadata/fs1

Page 8: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS architecture overview: Components

[Figure: the architecture diagram annotated with the configuration. Labels: client node running Rozofsmount (mounted on /fs1/home/); metadata server running Exportd with the RozoFS export conf. (Eid1:/metadata/fs1,vid=1; Volume 1 ... Volume i; Cluster 1 ... Cluster n, each with Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4); Storage nodes Sid1: host1 to Sid4: host4; control path; data path.]

Page 9: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Typical RozoFS deployments

Page 10: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS native mode (scale-out NAS)

[Figure: clients/applications on Linux clients with RozoFS (Rozofsmount) reach four Storage nodes and Exportd (storage and metadata) with the native protocol, over a GigE infrastructure shared by data storage and metadata.]

Note: the exportd function can also reside on some storage nodes.

Page 11: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS Cluster: NAS mode

[Figure: clients/applications on Windows, Linux, UNIX and Apple clients access four Rozofsmount nodes with SMB, NFS, AFP... over a GigE infrastructure; the Rozofsmount nodes reach four Storage nodes and Exportd over a second GigE infrastructure (data storage and metadata).]

Page 12: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Virtualisation solution with RozoFS: CloudStack+KVM

[Figure: clients/applications level connected through external networks over a standard GigE infrastructure; nodes combining Rozofsmount and Storage, plus Exportd, connected over a GigE infrastructure (data storage and metadata).]

Page 13: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges

Page 14: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: inter-component interfaces

[Figure: on the client node, Rozofsmount with Storcli 1 ... Storcli n; the metadata server; the storages (Sid1: host1 ...). Labels on the links: cluster conf.; metadata ops./mount; storage monitoring; projections deletion; read/write/truncate.]

Page 15: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: Filesystem mounting

[Figure: client node with Rozofsmount (mounted on /fs1/home/) and Storcli 1; metadata server running Exportd with the RozoFS export conf. (Eid1:/metadata/fs1,vid=1; Volume 1 ... Volume i; Cluster 1 ... Cluster n, each with Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4); storages Sid1: host1 to Sid4: host4. The exchange is numbered 1 to 4.]

Labels of the exchange:
• rozofsmount -H exportd_host -E /metadata/fs1 /fs1/home/
• Mount /metadata/fs1
• Step 3: Clusters list
• Step 4: TCP open

Page 16: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: file creation

[Figure: Application/VFS, Rozofsmount, Metadata Server (exportd) with the export Eid1:/metadata/fs1,vid=1 on disk, and Cluster 1 (Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4, Sid5:host5, Sid6:host6 ...).]

Exchange, in order:
• Application/VFS to Rozofsmount: open("/fs1/home/foo", O_CREAT|O_RDWR, 0640)
• Rozofsmount to exportd: mknod(EID, parent_fid, "foo", O_RDWR, 0640)
• exportd to Rozofsmount: attrs(FID, cid1:{sid1..sid4}, 0640, etc.)
• Rozofsmount to Application/VFS: File_descriptor

Export_mknod:
1) Allocate a unique file id (FID)
2) Volume distribute(EID)
3) Insert (FID, "foo") in the parent directory
4) Write the new file attributes
5) Update the parent attributes

Volume distribute(EID):
1) Get the volume associated with the EID (VID)
2) Get the cluster list (CID)
3) Get 4 storages for a cluster (SID)

FID: unique file identifier
Parent_fid: FID of the parent directory
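The Export_mknod and Volume distribute steps can be sketched as follows. This is a minimal in-memory illustration, not the exportd implementation: the `Exportd` class, the use of `uuid4` for the FID, the attribute layout and the cluster selection policy are all assumptions made for the example.

```python
import time
import uuid

class Exportd:
    def __init__(self, eid_to_vid, volumes):
        self.eid_to_vid = eid_to_vid      # EID -> VID
        self.volumes = volumes            # VID -> {CID: [SID, ...]}
        self.dirs = {}                    # parent FID -> {name: FID}
        self.attrs = {}                   # FID -> attributes

    def volume_distribute(self, eid, count=4):
        vid = self.eid_to_vid[eid]                       # 1) volume of the EID
        clusters = self.volumes[vid]                     # 2) cluster list
        # Assumed policy: take the cluster with the most storages and its
        # first `count` SIDs. The real selection policy is not described.
        cid = max(clusters, key=lambda c: len(clusters[c]))
        return cid, clusters[cid][:count]                # 3) storages of one cluster

    def export_mknod(self, eid, parent_fid, name, mode):
        fid = uuid.uuid4()                               # 1) unique file id
        cid, sids = self.volume_distribute(eid)          # 2) distribution
        self.dirs.setdefault(parent_fid, {})[name] = fid # 3) directory entry
        self.attrs[fid] = {"cid": cid, "sids": sids,     # 4) new file attributes
                           "mode": mode, "size": 0}
        parent = self.attrs.setdefault(parent_fid, {})
        parent["mtime"] = time.time()                    # 5) parent attributes
        return fid, self.attrs[fid]

exportd = Exportd({1: 1}, {1: {1: [1, 2, 3, 4, 5, 6]}})
root_fid = uuid.uuid4()
fid, attrs = exportd.export_mknod(1, root_fid, "foo", 0o640)
```

The returned attributes carry the cluster and the four storages chosen for the file, which is what the client needs to reach the storages without asking exportd again.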

Page 17: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: file opening

[Figure: sequence between the application, VFS, Rozofsmount and the Metadata Server (exportd), numbered 1 to 9. Labels: open("/fs1/home/foo", O_RDWR, 0640) from the application; lookup from VFS; lookup(EID, parent_fid, "foo", O_RDWR, 0640) towards exportd; File_attributes(attrs3) back from exportd; attrs3; open; Fd 1 returned by the file descriptor allocator of Rozofsmount, which keeps FID3, cid:{sid1,sid2,sid3,sid4}, atime, mtime ... for Fd 1.]

Exportd data, backed by disk (Eid1:/metadata/fs1,vid=1):
• Directory entries cache, per parent directory: Name1 -> FID1, Name2 -> FID2, foo -> FID3, ...
• Attributes cache: FID1 -> attrs1, FID2 -> attrs2, FID3 -> attrs3, ...
• attrs3: FID3, cid:{sid1,sid2,sid3,sid4}, atime, mtime ...

Export_lookup:
1) Get the file FID from the parent directory (cache or disk)
2) Get the file attributes (cache or disk)

Page 18: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: synchronous file write

[Figure: Application/VFS, Rozofsmount (file context for Fd 1: FID3, cid:{sid1,sid2,sid3,sid4}, file size, atime, mtime ...), the forward Mojette Transform, storages Sid1: host1 to Sid4: host4, and the Metadata server (exportd) with its attributes cache (FID1 -> attrs1, FID2 -> attrs2, FID3 -> attrs3, ...) and the export Eid1:/metadata/fs1,vid=1 on disk. The exchange is numbered 1 to 10.]

Exchange, in order (redundancy level 2+1: 2 reads, 3 writes):
• Application/VFS to Rozofsmount: len = pwrite(fd, offset, size, buffer), seen as write(fd1, offset, size, data)
• Rozofsmount to storcli: write(FID3, offset, data, size)
• storcli: forward Mojette Transform of (data, size) into prj1, prj2, prj3
• storcli to the storages, in parallel: write(FID3, prjN), one projection per storage
• storages to storcli: status
• storcli to Rozofsmount: size or errcode
• Rozofsmount to exportd: Wr_blks(EID1, FID3, offset, size)
• exportd to Rozofsmount: Attrs(attrs3)
• Rozofsmount to Application/VFS: size or errcode

Write:
1) Find the context associated with fd1
2) Submit the data to write to storcli
3) Wait for the end of the write
4) Update the blocks on exportd
5) Return the written size to the upper layer

Write projections:
1) Generate the projections
2) Send all the projection writes in parallel
3) Wait for all the write responses

Write_blocks (file attributes update):
1) Update time information
2) Update size if greater
3) Update cache and disk
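The "Write projections" steps (generate, send in parallel, wait for all responses) can be sketched as below. The Mojette transform itself is not reproduced: as a loudly labelled stand-in, the block is split in two halves plus their XOR, which gives the same 2+1 shape (3 pieces written, any 2 sufficient to read). The storages are plain dictionaries and every function name is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def make_projections(block: bytes) -> list[bytes]:
    """Stand-in for the forward transform: two halves plus their XOR."""
    if len(block) % 2:
        block += b"\0"                       # pad to an even length
    half = len(block) // 2
    a, b = block[:half], block[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def storage_write(store: dict, fid: str, index: int, prj: bytes) -> bool:
    """Pretend storage node: keep the projection in memory."""
    store[(fid, index)] = prj
    return True

def write_projections(stores: list[dict], fid: str, block: bytes) -> int:
    projections = make_projections(block)                 # 1) generate
    with ThreadPoolExecutor(max_workers=len(stores)) as pool:
        futures = [pool.submit(storage_write, s, fid, i, p)  # 2) send in parallel
                   for i, (s, p) in enumerate(zip(stores, projections))]
        statuses = [f.result() for f in futures]          # 3) wait for all responses
    return len(block) if all(statuses) else -1            # size or error code

stores = [{}, {}, {}]
written = write_projections(stores, "FID3", b"abcdef")
```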

Page 19: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: file read

[Figure: Application/VFS, Rozofsmount (file context for Fd 1: FID3, cid:{sid1,sid2,sid3,sid4}, file size, atime, mtime ...), Storcli with the inverse Mojette Transform, and storages Sid1: host1 to Sid4: host4. The exchange is numbered 1 to 8.]

Exchange, in order (redundancy level 2+1: 2 reads, 3 writes):
• Application/VFS to Rozofsmount: len = pread(fd, offset, size, buffer), seen as pread(fd1, offset, size)
• Rozofsmount to storcli: Read(FID3, offset, size)
• storcli to two storages, in parallel: Read(FID3, prj1, offset_prj) and Read(FID3, prj2, offset_prj)
• storages to storcli: prj1 and prj2
• storcli: inverse Mojette Transform of prj1, prj2
• storcli to Rozofsmount: data, length
• Rozofsmount to Application/VFS: data, length

Read:
1) Find the context associated with fd1
2) Request the data from storcli
3) Return the requested data to VFS

Read projections:
1) Send parallel read requests
2) Wait for the projection data returned from the storages
3) Rebuild the initial block
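The "2 reads" property of the 2+1 redundancy level can be illustrated with a toy rebuild. As an explicit substitution, XOR parity replaces the inverse Mojette transform here: projection 0 is the first half of the block, projection 1 the second half and projection 2 their XOR. The requests are issued sequentially for brevity, and all names are hypothetical.

```python
def _xor(p: bytes, q: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(p, q))

def rebuild_block(projections: dict[int, bytes]) -> bytes:
    """Rebuild a block from any 2 of the 3 pieces (0, 1 = halves, 2 = parity)."""
    if len(projections) < 2:
        raise ValueError("at least 2 projections are needed")
    if 0 in projections and 1 in projections:
        a, b = projections[0], projections[1]
    elif 0 in projections:
        a = projections[0]
        b = _xor(a, projections[2])
    else:
        b = projections[1]
        a = _xor(b, projections[2])
    return a + b

def read_block(stores: list[dict], fid: str) -> bytes:
    # 1) ask the storages; an unavailable storage simply has no entry here
    got = {i: s[fid] for i, s in enumerate(stores) if fid in s}
    # 2) two answers are enough, 3) rebuild the initial block
    first_two = dict(list(got.items())[:2])
    return rebuild_block(first_two)

parity = _xor(b"abc", b"def")
# The second storage is down: the block is rebuilt from pieces 0 and 2.
data = read_block([{"FID3": b"abc"}, {}, {"FID3": parity}], "FID3")
```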

Page 20: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


RozoFS basic exchanges: file deletion

[Figure: Application/VFS, Rozofsmount, Metadata Server (exportd) with its attributes cache (FID6 -> attrs6, FID7 -> attrs7, FID3 -> attrs3, ...), trash list and trash thread, the export Eid1:/metadata/fs1,vid=1 on disk, and storages Sid1: host1 to Sid4: host4. The exchange is numbered 1 to 4.]

Exchange, in order:
• Application/VFS: unlink("/fs1/home/foo"), passed to Rozofsmount as unlink(parent_fid, "/fs1/home/foo")
• Rozofsmount to exportd: unlink(EID, parent_fid, "foo")
• exportd to Rozofsmount: Parent_attributes
• Rozofsmount to Application/VFS: errcode
• Trash thread: takes attrs3 (FID3, cid:{sid1,sid2,sid3,sid4} ...) from the trash list and sends the projections deletions (FID3) to the storages

File deletion:
1) Remove the file from the parent directory (disk and cache)
2) Delete the attributes of the file (disk and cache)
3) Update the parent attributes
4) Insert the file reference in the trash (list and disk)

Page 21: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged

Page 22: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


• Storage node components
• Storaged processes
• Storaged node start-up sequence
• Multi-device feature
  - Introduction
  - Extendable storages
  - Faster rebuild process
  - Spreading files among devices
  - Projection file structure
  - Fault detection
• Configuration file

Page 23: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


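Storage node components

[Figure: on a storage node, rozolauncher instances supervise storaged and storio. storaged reads the config. file and has a TCP listening endpoint; storio has TCP listening endpoints and a disk request dispatcher feeding disk threads 1 to 16, on top of the local file systems (e.g. XFS).]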

Page 24: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged processes

• Storaged
  - Provides the exportd with disk space information (volume balance)
  - Provides the storcli process with the listening ports of the storio
  - Takes care of the projection files deletion
  - Controls the storio processes

• Storio
  The storio software is split into 2 types of threads:
  - a main thread handling the TCP connections from the clients, receiving and decoding the requests and posting them in a queue;
  - several disk threads reading from the queue the requests posted by the main thread, processing them and sending a response back to the main thread.
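The main thread and disk thread split can be sketched with two queues. This is only an illustration of the pattern: requests are assumed to be already decoded into `(id, operation, payload)` tuples, no TCP or disk access takes place, and the function names are hypothetical.

```python
import queue
import threading

def disk_thread(requests: queue.Queue, responses: queue.Queue) -> None:
    while True:
        req = requests.get()
        if req is None:                 # shutdown marker
            break
        req_id, op, payload = req
        # A real disk thread would read or write a projection file here.
        responses.put((req_id, f"{op} done ({len(payload)} bytes)"))

def run_storio(decoded_requests: list, nb_threads: int = 4) -> dict:
    requests, responses = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=disk_thread, args=(requests, responses))
               for _ in range(nb_threads)]
    for w in workers:
        w.start()
    for req in decoded_requests:        # main thread: post the decoded requests
        requests.put(req)
    # main thread: collect one response per request
    results = dict(responses.get() for _ in decoded_requests)
    for _ in workers:                   # stop the disk threads
        requests.put(None)
    for w in workers:
        w.join()
    return results

results = run_storio([(1, "write", b"abcd"), (2, "read", b"")])
```

The default of 4 threads mirrors the default quoted for the storio configuration; responses may arrive in any order, which is why they are keyed by request id.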

Page 25: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


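Storaged start-up sequence

[Figure: start-up chain. A rozolauncher starts the storaged process; the storaged process starts n rozolaunchers, each starting one storio process. Exit() labels appear between each process and its rozolauncher.]

Storaged process, started through rozolauncher:
  rozolauncher start /var/run/launcher_storaged_<hostname>.pid storaged -c <config_file> -H <hostname>
  storaged -c <config_file> -H <hostname>

Storio process (n instances), each started through its own rozolauncher:
  rozolauncher start /var/run/launcher_storio_slave_<hostname>_<storio_id>.pid storaged -i <storio_id> -c <config_file>
  storaged -i <storio_id> -c <config_file>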

Page 26: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Introduction

• A device can be either:
  - a physical hard drive, or
  - a logical volume made of one or several hard drives managed by hardware and/or software (e.g. LVM volume, RAID 0 behind a controller, ...).

• Formerly, a storage had one root path per cid/sid tuple in which to store the data files. Now the storage has N devices, numbered from 0 to N-1, that are mounted as directories under the root path to provide access to the devices. It is up to the storage to decide which device to use for each data file.

• Multi-device goals:
  - Provide a scale-up capability
  - Get rid of RAID 5/6 to avoid their weaknesses when a large number of disks are grouped
  - Provide a faster rebuild time by limiting the number of hard drives per local filesystem
  - Provide the capability to shrink a cluster

Page 27: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Extendable storages

• Before the multi-device feature, a storio had a root path that used to be a logical volume made of a group of hard drives in RAID 5 or RAID 6:
  - It is not possible to extend a RAID 5 or 6 group handled by a hardware RAID controller.
  - When there was no space left under the root path, it was not possible to add disk space to this cluster id/storage id.
  - When adding disks to the server, one had to create a new RozoFS cluster.

• With the multi-device feature, the storio handles several devices under its root path, and one can add a device to the storio.

Page 28: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Faster disk rebuild process

• Before the multi-device feature, when the RAID group of disks failed, all the data of the storage had to be rebuilt. For instance, on a group of 12 disks of 4 TB in RAID 6, when 3 disks have failed, the 10 x 4 TB of data have to be rebuilt.

• With the multi-device feature, one can group the disks in 6 RAID 0 groups of 2 disks (6 devices). When one disk fails, the data on the other devices is still available. Only one device is lost and has to be rebuilt, that is the space of 2 disks. While the rebuild process may occur more often, it will be faster.
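The figures of this example can be checked with a few lines of arithmetic. The only inputs are the numbers quoted above (12 disks of 4 TB, RAID 6 with two disks' worth of parity, RAID 0 groups of 2 disks); the variable names are illustrative.

```python
DISKS, DISK_TB = 12, 4

# Single RAID 6 group: two disks' worth of capacity is used by parity.
raid6_usable_tb = (DISKS - 2) * DISK_TB          # data to rebuild if the group is lost

# Multi-device: RAID 0 groups of 2 disks, each group being one device.
disks_per_device = 2
devices = DISKS // disks_per_device
device_tb = disks_per_device * DISK_TB           # data to rebuild per failed device

ratio = raid6_usable_tb / device_tb
print(raid6_usable_tb, devices, device_tb, ratio)  # 40 6 8 5.0
```

In this configuration a device failure therefore means rebuilding 8 TB instead of 40 TB, five times less data per incident.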

Page 29: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Spreading files among devices

• It is up to the storage:
  - to distribute the projection files among its devices, trying to equalize the free space on each device,
  - to remember where the projection files are located.

• RozoFS is able to store files of a size up to 8 TiB:
  - 8 TiB of data means 4 TiB of projection in layout 0, 2 TiB in layout 1 and 1 TiB in layout 2.
  - Since a device is limited in size, a projection has to be spread among the different devices of a storage.

• Each file is sub-divided into 64 GiB chunks (of user data) and can have a maximum of 128 chunks.

Page 30: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Spreading files among devices

• On-the-fly chunk allocation:
  - The chunks of a file that have not yet been written have no device allocated.
  - Each chunk of a file is allocated a device by the storage at the time it is written for the first time. The whole chunk will then be written on this device.
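On-the-fly allocation can be sketched as a lazily filled map from (FID, chunk) to device. The policy used here, picking the device with the most free space, is an assumption derived from the goal of equalizing free space; reserving the whole chunk size at allocation time is a simplification, and the class is hypothetical.

```python
class StorageDevices:
    def __init__(self, free_bytes):
        self.free = list(free_bytes)         # free space per device id
        self.chunk_device = {}               # (fid, chunk) -> device id

    def device_for(self, fid, chunk, chunk_bytes):
        key = (fid, chunk)
        if key not in self.chunk_device:     # first write of this chunk
            # Assumed policy: pick the device with the most free space.
            device = max(range(len(self.free)), key=self.free.__getitem__)
            self.free[device] -= chunk_bytes
            self.chunk_device[key] = device
        return self.chunk_device[key]        # later writes stay on this device

storage = StorageDevices([100, 300, 200])
first = storage.device_for("FID3", 0, 150)   # device 1 has the most free space
second = storage.device_for("FID3", 1, 150)  # device 2 is now the emptiest
again = storage.device_for("FID3", 0, 150)   # already allocated: unchanged
```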

•The size of a chunk of user data is 64 GiB. The size of a projection chunk depends on the layout:
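The chunk figures can be cross-checked against the 8 TiB limit. In the sketch below, the divisors 2, 4 and 8 per layout are an assumption inferred from the 4 TiB, 2 TiB and 1 TiB projection sizes quoted for an 8 TiB file; per-block overhead is ignored and the helper names are hypothetical.

```python
GIB = 1 << 30
TIB = 1 << 40
CHUNK_USER_BYTES = 64 * GIB
MAX_CHUNKS = 128

# Assumption: user data is divided by 2, 4 and 8 in layouts 0, 1 and 2,
# as inferred from the 4/2/1 TiB projection sizes of an 8 TiB file.
DIVISOR = {0: 2, 1: 4, 2: 8}

max_file_bytes = CHUNK_USER_BYTES * MAX_CHUNKS   # 128 chunks of 64 GiB

def projection_chunk_bytes(layout: int) -> int:
    """Approximate size of one projection chunk, ignoring per-block overhead."""
    return CHUNK_USER_BYTES // DIVISOR[layout]

def chunk_index(offset: int) -> int:
    """Chunk holding the user data at this file offset."""
    return offset // CHUNK_USER_BYTES
```

Under these assumptions a projection chunk is about 32 GiB in layout 0, 16 GiB in layout 1 and 8 GiB in layout 2, and 128 chunks of 64 GiB give exactly the 8 TiB maximum file size.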

Page 31: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Projection file structure

• The storage needs to remember the devices assigned to each chunk.

• The projection file formerly (in release 1) had an 8K header followed by the projected data blocks. It is now split into two types of files:
  - a header (or mapper) file that contains the former header complemented with a list of 128 devices allocated (or not yet allocated) for the chunks;
  - chunk files containing the projections of up to 64 GiB of user data.

• The location of the header file is given by the result of a hash on the FID modulo the number of devices on which header files can be found. The number of devices a storio handles can be increased, while the hash function on a FID must always give the same location. For this reason, the number of devices holding header files is determined at the storage installation and can never be changed. Devices added later do not hold header files.

• As all data on a device can be lost when one of its disks fails, it is mandatory to replicate the header files on several devices. There are therefore 3 new configuration parameters per cid/sid:
  - device-mapper = the number of devices hosting header files.
  - device-redundancy = the number of replicas of the header files.
  - device-total = the number of devices holding chunks of projection files.
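The header file placement rule can be sketched as follows. Two points are assumptions made for the example: the hash function (SHA-256 is used only because it is stable across runs) and the placement of the replicas on the following devices, wrapping around. The function name is hypothetical.

```python
import hashlib

def header_devices(fid: str, device_mapper: int, device_redundancy: int) -> list[int]:
    """Devices holding the header file replicas of a FID."""
    if not 1 <= device_redundancy <= device_mapper:
        raise ValueError("device-redundancy must be between 1 and device-mapper")
    # Assumption: any stable hash of the FID works for this sketch.
    digest = hashlib.sha256(fid.encode()).digest()
    first = int.from_bytes(digest[:8], "big") % device_mapper
    # Assumption: replicas go on the following devices, wrapping around.
    return [(first + i) % device_mapper for i in range(device_redundancy)]

locations = header_devices("example-fid", device_mapper=4, device_redundancy=2)
```

device-total does not appear in the computation, which is why devices can be added later without moving any header file, while device-mapper must stay fixed.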

Page 32: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature Projection file structure

•Example of an extract of a storage configuration file:

• The device-mapper must not be changed from the first storage installation.

• The device-total is increased when adding new devices.

Page 33: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Projection files location (path)

• 2 header files are located on devices 3 and 0.
• 2 chunks are written; the first on device 5 and the second on device 2.
• Note that the layout and distribution no longer appear in the file path. The new file path is built the following way:
  <root_path>/<device id>/<type>_<spare>/<slice>/<FID>
  - <type> is ‘bins’ for chunks of projection files and ‘hdr’ for header files.
  - <spare> is either 0 or 1.
  - <slice> is a hash computed from the FID to spread all the files among several directories.

Page 34: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Projection files location (path)

A file that is in release 1 would have been located under:

With multi-device:

• 2 header files are located on devices 3 and 0.
• 2 chunks are written; the first on device 5 and the second on device 2.
• The file path is built the following way:
  <root_path>/<device id>/<type>_<spare>/<slice>/<FID>
  - <type> is ‘bins’ for chunks of projection files and ‘hdr’ for header files.
  - <spare> is either 0 or 1.
  - <slice> is a hash computed from the FID to spread all the files among several directories.
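The path rule can be turned into a small helper. The slice hash and the number of slice directories are not given in the slides, so the CRC32-based hash and `NB_SLICES = 256` below are placeholders, as are the function name and the example root path.

```python
import zlib
from pathlib import PurePosixPath

NB_SLICES = 256   # assumption: the number of slice directories is not given

def projection_path(root: str, device: int, kind: str, spare: int, fid: str) -> PurePosixPath:
    """Build <root_path>/<device id>/<type>_<spare>/<slice>/<FID>."""
    if kind not in ("bins", "hdr"):
        raise ValueError("kind must be 'bins' or 'hdr'")
    if spare not in (0, 1):
        raise ValueError("spare must be 0 or 1")
    slice_id = zlib.crc32(fid.encode()) % NB_SLICES   # assumed slice hash
    return PurePosixPath(root) / str(device) / f"{kind}_{spare}" / str(slice_id) / fid

# Hypothetical root path and FID: a chunk file on device 5, a header file on device 3.
chunk_path = projection_path("/srv/storage_1_1", 5, "bins", 0, "example-fid")
header_path = projection_path("/srv/storage_1_1", 3, "hdr", 0, "example-fid")
```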

Page 35: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Fault detection

• In the storio, each time an abnormal error is encountered while accessing data on a device, an error counter is incremented. This counter should reveal a problem on some sector of the device.

• A periodic task checks that every device is still accessible in read and write. If it is not, a failure counter is incremented.

• The error and failure counters can be read through the rozodiag interface with the command “device”.

Page 36: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Fault detection: disk fault example 1

• This display shows the available blocks on each device as well as the error and failure counters.

• In this example, device 1 has encountered some errors and is no longer accessible in read and write.

• This device may need to be rebuilt.

Note: the Nagios plug-in nagios_rozofs_storaged.sh checks these error/failure counters. When an error is raised, the plug-in returns a critical status and shows the list of faulty devices that may require a rebuild.

Page 37: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storage node: Multi-device feature

Fault detection: disk fault example 2

• This next display shows errors on another storage. At the end of the display, the “Faulty FIDs:” paragraph displays FIDs that have encountered a problem.

• The displayed FID has encountered a fault that prevents its writing or reading.
• Since only one line is displayed and not 10 or more, one can guess that the device has not completely failed, but some disk sector used by the displayed FID may be corrupted.
• In this case, rebuilding this FID only could solve the problem.

Note: the output format is “-s <cid/sid> -f <FID>”, which is the input format of the rebuild command described later.

Page 38: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged configuration file

Page 39: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged configuration file

• Threads: number of threads associated with the storio process. The default is 4 and the maximum is 16.

• Nbcores: maximum number of core files kept for a storio process. By default 2 core files are kept.

• Storio mode:
  - Single mode ("single"): one storio process for all cid/sid. It listens on all the endpoints defined in the listen section of the configuration file.
  - Multiple mode ("multiple"): the master storaged starts one storio per cid (cluster) defined in the configuration file. As in single mode, each storio listens on all the listening addresses of the listen section. However, a rule is applied to the port number:
    Listening_port_number = config_file_port_number + <cid>
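The port rule can be written as a one-line computation per cluster. The function name and the base port 41000 used in the example are hypothetical; only the rule itself comes from the slide.

```python
def storio_ports(config_port: int, cids: list[int], mode: str) -> dict:
    """Listening port of each storio, per storio mode."""
    if mode == "single":
        return {"all": config_port}                       # one storio for every cid/sid
    if mode == "multiple":
        return {cid: config_port + cid for cid in cids}   # one storio per cid
    raise ValueError("mode must be 'single' or 'multiple'")

# Hypothetical base port taken from the listen section of the configuration file.
ports = storio_ports(41000, [1, 2], "multiple")
```

With a base port of 41000, the storio of cluster 1 listens on 41001 and the storio of cluster 2 on 41002, on every listening address.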

Page 40: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged configuration file

• Crc32c configuration: optionally, each transformed block can be protected with a CRC32C. The goal is to detect and repair blocks for which there is a CRC32C error, whatever the source of the error (hardware or software).

It is recommended to enable the CRC32C control, in particular when using non-enterprise drives. The CRC32C generation and control take place in the storio process, and the self-healing is controlled by the storcli process.

Upon a read failure due to a CRC32C error, the block in error is regenerated once the initial block has been fully rebuilt. The repair takes place on the storcli.

  - crc32c_check: set to "True" when the CRC must be checked on each block.
  - crc32c_generate: set to "True" to generate a CRC32C on each block written on disk.
  - crc32c_hw_forced: set to "True" to force the usage of the hardware-assisted CRC32C code. This option MUST be used only for virtual machines that report an incomplete set of CPU hardware features. It is typically the case with VirtualBox.

When a CPU does not provide hardware support for CRC generation, turning on the CRC generation and check will hurt the overall performance of the system.
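The generate and check steps can be illustrated with a table-driven software CRC32C (Castagnoli polynomial). This is a plain Python sketch, far slower than the hardware-assisted code the option refers to. Storing the CRC as 4 bytes appended to the block is an assumption of the example, not the actual on-disk format, and the function names are hypothetical.

```python
CRC32C_POLY = 0x82F63B78   # reflected Castagnoli polynomial

def _make_table() -> list[int]:
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table

_TABLE = _make_table()

def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def write_block(block: bytes) -> bytes:
    """crc32c_generate: append the CRC to the block (assumed layout)."""
    return block + crc32c(block).to_bytes(4, "little")

def read_block(stored: bytes) -> bytes:
    """crc32c_check: verify the CRC before returning the block."""
    block, crc = stored[:-4], int.from_bytes(stored[-4:], "little")
    if crc32c(block) != crc:
        raise IOError("CRC32C mismatch")
    return block

stored = write_block(b"projection data")
assert read_block(stored) == b"projection data"
```

A mismatch raised by the check is the event that would trigger the repair described above, once the initial block has been rebuilt from the other projections.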

Page 41: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storaged configuration file

Page 42: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Rozofsmount/Storcli

Page 43: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.


Storcli architecture overview

[Figure: storcli internals. North bound AF-UNIX interface (Rozofsmount AF-UNIX channel, shared memory); read/write/truncate requests dispatcher; thread selectors (inputs Block_sz and Th_enable) in front of the Mojette write threads (Moj. Fwd) and the Mojette read threads (Moj. Inverse), reached over AF-UNIX; storage nodes load balancer; South bound interface (TCP).]

• Storcli receives requests on an AF-UNIX socket from the north bound interface.
• The data payload is read and written within a shared memory.
• The Mojette transform pass-through mode depends on:
  - the size of the block to transform
  - the state of the thread (read or write)
• A request is processed by the dispatcher, which takes care of:
  - the dependency between requests
  - the communication with the storage nodes
  - the Mojette transform activation
• The south bound interface handles the load balancing group associated with each storage node.

Page 44: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.



Page 45: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Rozofsmount/storcli start-up sequence

[Figure: start-up sequence. Labels: rozofsmount process, rozolauncher, storcli process (1 or 2), Exit().]

Rozofsmount process:
rozofsmount -H <exportd_host_list> -E <export_path> <local_path> <mount_options>

Storcli start through rozolauncher:
rozolauncher start /var/run/launcher_rozofsmount_<rozofsmount_id>_storcli<storcli_id>.pid storcli -H <exportd_host_list> -i <storcli_id> -c <storcli_options>

Storcli process:
storcli -H <exportd_host_list> -i <storcli_id> -c <storcli_options>

Upon a fatal error, a storcli is automatically restarted.
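The automatic restart can be illustrated by a small supervision loop. This is not the rozolauncher implementation: the function name, the restart limit and the use of the exit status as the "fatal error" signal are assumptions chosen for the example.

```python
import subprocess
import sys


def supervise(cmd, max_restarts=3):
    """Run cmd and start it again each time it exits with a non-zero status."""
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:  # normal termination: no restart (assumption)
            break
    return exit_codes


if __name__ == "__main__":
    # A child that always fails is started once, then restarted twice.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(3)"], max_restarts=2))
```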

Page 46: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Storcli start configuration

• A storcli process provides information related to its starting configuration. This information is accessible through rozodiag:
• host: hostname of the export node. More than one address might be provisioned. This is typically the case when RozoFS is deployed in a routed environment.
• Module index: reference of the storcli within the rozofsmount that owns it.
• Site: site number. Relevant for geo-replication only.
• Nb_cores: number of core files that can be generated by the storcli process.

Page 47: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Storage node and cluster states seen by Storcli

_________________________________________________________
[127.0.0.1:50004] rzdbg> cid_state
____[storcli 1 of rozofsmount 0]__[ cid_state]____
  cid   |    state     |
--------+--------------+
   1    |      UP      |
_________________________________________________________
[127.0.0.1:50004] rzdbg> storaged_status
____[storcli 1 of rozofsmount 0]__[ storaged_status]____
 cid  | sid  |       hostname       |  lbg_id  | state  | Path state | Sel |  tmo  | Poll. |Per.|  poll state  |
------+------+----------------------+----------+--------+------------+-----+-------+-------+----+--------------+
 001  |  01  | localhost1           |    0     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  02  | localhost2           |    1     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  03  | localhost3           |    2     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  04  | localhost4           |    3     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  05  | localhost5           |    4     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  06  | localhost6           |    5     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  07  | localhost7           |    6     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  08  | localhost8           |    7     |   UP   |     UP     | YES |   0   |   0   | 50 |     IDLE     |
_________________________________________________________
[127.0.0.1:50004] rzdbg>

• Storcli provides status related to cid/sid connectivity.
• A cid (cluster) is considered to be up if at least one of its sids is reachable from the storcli.
• A cid/sid is in the UP state under the following conditions:
  - at least one TCP connection of its load balancing group is UP
  - the remote end has replied to a NULL-poll request
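The state rules above can be written as two predicates. This is a minimal sketch: the `Sid` class and its field names are hypothetical, and the two sid conditions are assumed to be required together.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sid:
    # One boolean per TCP connection of the load balancing group (hypothetical model).
    tcp_connections_up: List[bool]
    replied_to_null_poll: bool


def sid_is_up(sid):
    """UP if at least one TCP connection is UP and the NULL-poll was answered."""
    return any(sid.tcp_connections_up) and sid.replied_to_null_poll


def cid_is_up(sids):
    """A cluster is up as soon as one of its sids is reachable."""
    return any(sid_is_up(s) for s in sids)
```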

Page 48: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Mojette threads statistics and on-line configuration

• Rozodiag provides the administrator with the capability to change the Mojette thread settings.

• The Mojette thread configuration can be changed on the fly. This concerns the enable flag and the buffer size threshold for entering the threads.

_________________________________________________________[127.0.0.1:50004] rzdbg> MojetteThreads ?

____[storcli 1 of rozofsmount 0]__[ MojetteThreads ?]____

usage:

MojetteThreads reset : reset statistics

MojetteThreads <read|write> enable : enable Mojette threads

MojetteThreads <read|write> disable : disable Mojette threads

MojetteThreads : display statistics

MojetteThreads size <count> : adjust the bytes threshold for thread activation (unit byte)

_________________________________________________________

• Mojette threads menu

Page 49: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Mojette threads statistics and on-line configuration

_________________________________________________________[127.0.0.1:50004] rzdbg> MojetteThreads

____[storcli 1 of rozofsmount 0]__[ MojetteThreads]____

Thread activation threshold: 65536 bytes

max pending Mojette req cnt: 4

receive empty counter : 0

read/write thread status : DISABLE/ENABLE

Thread number | 0 | 1 | 2 | 3 | TOTAL |

Read Requests |__________________|__________________|__________________|__________________|__________________|

number | 0 | 0 | 0 | 0 | 0 |

Bytes | 0 | 0 | 0 | 0 | 0 |

Cumulative Time (us) | 0 | 0 | 0 | 0 | 0 |

Average Bytes | 0 | 0 | 0 | 0 | 0 |

Average Time (us) | 0 | 0 | 0 | 0 | 0 |

Average Cycle | 0 | 0 | 0 | 0 | 0 |

Throughput (MBytes/s) | 0 | 0 | 0 | 0 | 0 |

Write Requests |__________________|__________________|__________________|__________________|__________________|

number | 100 | 100 | 100 | 100 | 400 |

Bytes | 26214400 | 26214400 | 26214400 | 26214400 | 104857600 |

Cumulative Time (us) | 5758 | 4095 | 4414 | 4274 | 18541 |

Average Bytes | 262144 | 262144 | 262144 | 262144 | 262144 |

Average Time (us) | 57 | 40 | 44 | 42 | 46 |

Average Cycle | 126072 | 89207 | 98851 | 93165 | 101824 |

Throughput (MBytes/s) | 4552 | 6401 | 5938 | 6133 | 5655 |

|__________________|__________________|__________________|__________________|__________________|

_________________________________________________________

[127.0.0.1:50004] rzdbg>
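The derived rows of the MojetteThreads display can be recomputed from the "number", "Bytes" and "Cumulative Time (us)" rows. The helper below is a sketch with a hypothetical name; it assumes integer (truncating) division, which is what reproduces the values shown in this capture. Bytes per microsecond is numerically the same as MBytes/s with 1 MByte = 10^6 bytes.

```python
def thread_stats(number, total_bytes, cumulative_time_us):
    """Recompute the derived statistics of one thread column."""
    if number == 0 or cumulative_time_us == 0:
        return {"average_bytes": 0, "average_time_us": 0, "throughput_mbytes_s": 0}
    return {
        "average_bytes": total_bytes // number,
        "average_time_us": cumulative_time_us // number,
        # bytes per microsecond == MBytes per second (10^6 bytes)
        "throughput_mbytes_s": total_bytes // cumulative_time_us,
    }


# Write requests of thread 0 in the capture: 100 requests, 26214400 bytes, 5758 us.
print(thread_stats(100, 26214400, 5758))
```

The "Average Cycle" row is not covered here since the capture does not say how it is derived.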

Page 50: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Rozofsmount shared memory

• The rozofsmount uses an AF-UNIX socket and a shared memory to communicate with its associated storcli processes. A rozofsmount can own up to 2 storcli processes.

• The key of the shared memory is built as follows: 0x524f5a<rozofsmount_instance><storcli_instance>

• The following illustrates the case where there are 2 rozofsmounts, each of them owning 2 storclis.

root@debian:/home/rozofs/off_mgeo/src/exportd# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0x00000000 98307      didier     600        33554432   2
0x4558504f 294916     root       666        1216       9
0x524f5a30 163845     root       666        8421376    2
0x524f5a31 196614     root       666        8421376    2
0x524f5a32 229383     root       666        8421376    2
0x524f5a33 262152     root       666        8421376    2

Note: if for any reason two rozofsmount processes with the same instance id are started, all the I/O operations will fail since both rozofsmount processes use the same shared memory.

_________________________________________________________

[127.0.0.1:50004] rzdbg> shared_mem

____[storcli 1 of rozofsmount 0]__[ shared_mem]____

active | key | size | cnt | address |

--------+-----------+---------+------+----------------+

YES | 1380932144 | 263168 | 0032 | 0x7f320c2dc000 |

YES | 1380932145 | 263168 | 0032 | 0x7f320bad4000 |

_________________________________________________________
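The decimal keys printed by rozodiag and the hexadecimal keys printed by ipcs can be related with a few lines of arithmetic. The helper names below are hypothetical. The check only shows that the three upper bytes of the key spell "ROZ" in ASCII and that, in this capture, the segment size reported by ipcs equals cnt × size from the shared_mem display; the exact encoding of the instance numbers in the last byte is not derived here.

```python
def describe_key(key):
    """Split a 32-bit shared memory key into its hex form, ASCII prefix and last byte."""
    raw = key.to_bytes(4, "big")
    return hex(key), raw[:3].decode("ascii"), raw[3]


def segment_bytes(buffer_count, buffer_size):
    """Total size of a segment made of buffer_count buffers of buffer_size bytes."""
    return buffer_count * buffer_size


print(describe_key(1380932144))   # key shown by the rozodiag shared_mem command
print(segment_bytes(32, 263168))  # compare with the "bytes" column of ipcs -m
```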

Page 51: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Impact of the export configuration on storcli process

The storcli process communicates with storage nodes. For that purpose, it needs to be aware of the storage nodes used by the volume associated with the file system referenced by rozofsmount. To that end, storcli gets the storage configuration from the export:
  - Once it gets the list of the storage nodes, it establishes the TCP connections towards these nodes.
  - It periodically polls the export node to detect a change in the exportd configuration.

Changing the export configuration, such as adding/removing storages, does not require a restart of the storcli process. From the storcli standpoint the procedure is the following:
  - Upon an exportd configuration polling, the exportd informs the storcli that its configuration is out of date.
  - It gets the new configuration.
  - It establishes connections to any new storage detected within the configuration.
  - It removes the connections that are no longer referenced in the configuration.
The process does not stop the I/O operations that are in progress.
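The reconfiguration procedure boils down to comparing the storages currently connected with the storages of the new configuration. The sketch below uses plain sets of (cid, sid, hostname) tuples; this data model and the function name are assumptions made for the illustration.

```python
def reconcile(connected, new_config):
    """Return (storages to connect, storages to disconnect, storages left untouched)."""
    to_connect = new_config - connected
    to_remove = connected - new_config
    untouched = connected & new_config  # in-progress I/O on these is not disturbed
    return to_connect, to_remove, untouched


connected = {(1, 1, "host1"), (1, 2, "host2")}
new_config = {(1, 2, "host2"), (1, 3, "host3")}
print(reconcile(connected, new_config))
```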

Page 52: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Storcli I/O contexts

• The storcli process uses storcli buffers for processing the requests submitted by rozofsmount.
• A storcli process can handle up to 32 transactions in parallel.
• By default storcli attempts to process the transactions in parallel:
  - On each transaction submitted by rozofsmount, the storcli process checks that there is no overlap with an already ongoing transaction.
  - When there is an overlap, the current transaction is inserted in the ring.
  - Each time a transaction ends, the storcli checks among the pending requests of the ring whether the overlap condition has disappeared.
  - When the overlap condition disappears, the waiting transaction(s) is (are) processed.
  - The storcli buffer statistics report the number of transactions for which a collision occurred.
• However, it is possible to force the storcli to operate in a serialized mode.
• Any error during the processing of a transaction is logged internally by incrementing the appropriate error counter.
• When a storcli is in the idle state, the number of allocated transaction contexts MUST be 0.
• The information related to the state of the storcli buffer is accessible through rozodiag.

[127.0.0.1:50004] rzdbg> storcli_buf ?

____[storcli 1 of rozofsmount 0]__[ storcli_buf ?]____

usage:

storcli_buf : display statistics

storcli_buf serialize : serialize the requests for the same FID

storcli_buf parallel : process in parallel the requests for the same FID

_________________________________________________________

[127.0.0.1:50004] rzdbg>
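The parallel mode with overlap detection can be sketched as follows. The class and its attributes are hypothetical, and "overlap" is assumed here to mean two transactions on the same FID whose byte ranges intersect; the slides do not give the exact criterion.

```python
def overlaps(a, b):
    """a and b are (fid, offset, length) tuples."""
    return a[0] == b[0] and a[1] < b[1] + b[2] and b[1] < a[1] + a[2]


class Dispatcher:
    def __init__(self):
        self.running = []    # transactions being processed
        self.ring = []       # transactions waiting for an overlap to disappear
        self.collisions = 0  # counterpart of the "coll" counter of the statistics

    def submit(self, txn):
        """Start txn, or park it in the ring if it overlaps a running transaction."""
        if any(overlaps(txn, r) for r in self.running):
            self.ring.append(txn)
            self.collisions += 1
            return False
        self.running.append(txn)
        return True

    def end(self, txn):
        """Finish txn and start the waiting transactions that no longer overlap."""
        self.running.remove(txn)
        started, still_waiting = [], []
        for waiting in self.ring:
            if any(overlaps(waiting, r) for r in self.running):
                still_waiting.append(waiting)
            else:
                self.running.append(waiting)
                started.append(waiting)
        self.ring = still_waiting
        return started
```

In this model, when the dispatcher is idle both `running` and `ring` are empty, which mirrors the rule that no transaction context remains allocated.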

Page 53: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Storcli context statistics

[127.0.0.1:50004] rzdbg> storcli_buf

____[storcli 1 of rozofsmount 0]__[ storcli_buf]____

number of transaction contexts (initial/allocated) : 64/0

Statistics

serialize mode : NORMAL

req submit/coll: 400/0

FID in parallel: 229

buf. depletion : 0

ring full : 0

SEND : 0

SEND_ERR : 0

RECV_OK : 0

RECV_OUT_SEQ : 0

RTIMEOUT : 0

EMPTY READ : 0

EMPTY WRITE : 25600

Buffer Pool (name[size] :initial/current

North interface Buffers

small[ 1024] : 64/64

large[264192] : 64/64

South interface Buffers

small[ 1] : 1/1

large[163840] : 1024/1024

_________________________________________________________

[127.0.0.1:50004] rzdbg>

Page 54: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Metadata Server (exportd)

Page 55: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Exportd highlights

• The exportd handles the metadata operations of several exported file systems.

• For better performance the exportd supports up to 8 processes.

• Each process is responsible for a subset of the exported file systems.

• The exportd is responsible for the allocation of the storaged servers at creation time.

• The exportd controls the removal of the projection files upon file deletion.

• The exportd provides user and group disk quota accounting and enforcement.

• The exportd operates in active/stand-by mode thanks to pacemaker/DRBD.

Page 56: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Metadata Server High level architecture

[Figure: exportd server. Labels: Export M, Export S 1 .. Export S 8, conf, metadata disks (dentry files, inode files), interface mount, interface metadata, interface monitor, interface remove, rozofsmount (client node), storaged (storage node).]

• One exportd server can handle the metadata of more than one file system.
• One exportd server owns one Master Exportd that controls up to 8 Slave Exportd processes.
• A RozoFS configuration can include more than one exportd server.

Page 57: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Metadata Server interfaces

• Mount interface:
  That interface provides the following service to the rozofsmount client:
  - filesystem mount: the goal is to provide the rozofsmount clients with the information related to the filesystem associated with the rozofsmount (cluster id, storage nodes IP information, list of the Export Slices endpoints (IP@ and port), etc.).

• Metadata interface:
  That interface is used for all the metadata operations related to a file system: file/directory creation, file/directory lookup, get/set file/directory attributes, etc.

• Monitor interface:
  That interface is used by the Master Export to collect statistics on the storage nodes that are in the scope of its configuration.

• Remove interface:
  That interface is used by Slice Exports for removing the projections on storage nodes upon file deletion.

• The metadata of a file system are organized by slices. The slice notion is internal to the RozoFS metadata server. The goal of the slice is to distribute the processing of the metadata operations among several Slice Export processes to increase the throughput of the metadata server.

• The metadata server is organized around 256 slices. It might be possible to associate one Slice Export process per slice. The current configuration supports up to 16 Slice Export processes.

Page 58: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process architecture

Page 59: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Metadata server main thread

• main thread
  The main thread is the entry point of the Slice Export process. It has the following characteristics:
  - It supports up to 512 TCP connections on the metadata interfaces: all the messages are processed by the "metadata srv" building block.
  - It dispatches the requests towards:
    - the "metadata srv" module: that module processes all the operations related to files/directories: creation/deletion, update, etc.
    - the quota module: that module is interfaced by clients running on the same host as the exportd and is responsible for all the operations related to the quota configuration (user and group).

Page 60: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Volume balance thread

• volume balance thread
  - The role of the volume balance is to gather statistics from the storage nodes in order to establish the list of storage nodes on which files can be allocated (file distribution).
  - This is a periodic task that exists on each Export Slave process.
  - The TCP connections opened towards the storage nodes are ephemeral: as soon as the thread gets the information from a storage node, it closes its TCP connection.

Page 61: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Dirent write-back thread

• Dirent write-back thread
• The role of the thread is to push the pending updates related to modifications of the dentry files (dirent files).
• It is a periodic task whose period can be adjusted with the rozodiag tool. By default the period is 1 second.
• The write-back cache can be disabled. In that case all the modifications on dirent files are done synchronously, whereas they are done asynchronously when the thread is enabled.

[Figure: dirent write-back. Labels: MetaData SRV. (Update dirent_header, Update chunk), Cache Handler (insert), Dirent writeback cache, Dirent writeback thread (period, Enable/disable), DISK, Disk write (disable).]
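The difference between the enabled and disabled write-back modes can be summarised with a small model. The class below is a sketch, not the exportd code: a dictionary stands for the disk, and `flush()` stands for one activation of the periodic write-back thread.

```python
class WriteBackCache:
    def __init__(self, disk, enabled=True):
        self.disk = disk        # dict standing for the dirent files on disk
        self.enabled = enabled
        self.pending = {}       # updates waiting for the next thread activation

    def update(self, key, value):
        if self.enabled:
            self.pending[key] = value  # asynchronous: written at the next flush
        else:
            self.disk[key] = value     # synchronous: written immediately

    def flush(self):
        """One period of the write-back thread: push the pending updates to disk."""
        count = len(self.pending)
        self.disk.update(self.pending)
        self.pending.clear()
        return count
```

Several updates of the same entry within one period result in a single disk write in this model, which is the usual benefit expected from a write-back scheme.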

Page 62: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Dirent write-back thread: thread menu and statistics

•Dirent write-back thread statistics

• Dirent write-back thread menu

Page 63: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Dirent write-back thread: dirent cache statistics

[127.0.0.1:52001] rzdbg> dirent_cache
____[exportd-S1 ]__[ dirent_cache]____
Malloc size (MB/B) : 0/8720
Level 0 cache state : Enabled
Number of entries level 0 : 2
hit/miss : 9/2
collisions cumul level0/level1 : 0/0
LRU stats
global cpt (ok/err) : 0/0
coll cpt (ok/err) : 0/0
collisions Max level0/level1 : 0/0
Name chunk size : 64
Name chunk max : 9
Sectors (nb sectors/size) : 521/266752 Bytes
------------------+----------------------+--------------+
 field name       | start sector(offset) | sector count |
------------------+----------------------+--------------+
 header           | 0 (0x0 )             | 1            |
 name bitmap      | 1 (0x200 )           | 1            |
 hash buckets     | 2 (0x400 )           | 1            |
 hash entries     | 3 (0x600 )           | 6            |
 name chunks      | 9 (0x1200)           | 512          |
------------------+----------------------+--------------+

------------+-------------+---------------------+
 file_limit | mask        | put count           |
------------+-------------+---------------------+
 10000      | 1           | 2                   |
 100000     | f           | 0                   |
 0          | fff         | 0                   |
------------+-------------+---------------------+
File System usage statistics:
Total Read : memory 0 MBytes (0 Bytes) requests 0
Total Write: memory 0 MBytes (0 Bytes) requests 0

WriteBack cache statistics:
state :Enabled
NB entries :4096
hit/miss/flush : 0/0/0
invalidate : 0
total Write : memory 0 MBytes (13312 Bytes) requests 4 ejected chunks 0
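The sector table of the dirent_cache display can be recomputed from the sector counts alone. The sketch below assumes 512-byte sectors, which is what the offsets (0x200 for sector 1) and the total (521 sectors for 266752 bytes) of the capture imply; the names are chosen for the example.

```python
SECTOR_SIZE = 512  # assumption derived from the capture: 266752 / 521 = 512

FIELDS = [
    ("header", 1),
    ("name bitmap", 1),
    ("hash buckets", 1),
    ("hash entries", 6),
    ("name chunks", 512),
]


def layout(fields):
    """Return [(name, start sector, byte offset, sector count)] and the total sector count."""
    rows, start = [], 0
    for name, count in fields:
        rows.append((name, start, start * SECTOR_SIZE, count))
        start += count
    return rows, start


rows, total = layout(FIELDS)
for name, start, offset, count in rows:
    print(f"{name:<12} | {start} ({offset:#x}) | {count}")
print(total, total * SECTOR_SIZE)
```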

Page 64: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Remove bins thread

• RozoFS file deletion overview
• Upon receiving a file deletion from a client (rozofsmount), the export slave process cleans up the entry within the parent directory (dirent file) but defers the deletion of the associated projections on storage nodes.
• When a file is deleted, an entry is created in a dedicated directory (dir_trash) whose management is the same as the one used for the i-nodes.
• The files found within that directory contain the reference of the FID and the associated list of the storage nodes from which the file has to be removed.
• An entry is also created in memory for performance purposes.
• A periodic thread takes care of the in-memory list of the files to delete and addresses each storage node referenced with the FID in order to release the projection files from the storage nodes.
• A file is fully deleted once all its projections have been removed. In that case, its reference is removed from the associated file within the trash directory of its export.
• The period and the maximum number of files deleted per period are configurable. Default period: 5 seconds, max. files: 500
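One activation of the remove bins thread can be modelled as below. This is a sketch under assumptions: the in-memory list is a deque of (FID, list of sids) pairs, and `remove_projections` is a hypothetical callback returning True when a storage node has released the projections of a file.

```python
from collections import deque


def trash_period(pending, remove_projections, limit=500):
    """Handle at most `limit` files; return the number of files fully deleted."""
    done = 0
    for _ in range(min(limit, len(pending))):
        fid, sids = pending.popleft()
        # Keep only the storage nodes that still hold projections of this file.
        remaining = [sid for sid in sids if not remove_projections(fid, sid)]
        if remaining:
            pending.append((fid, remaining))  # retried on a later period
        else:
            done += 1  # all projections removed: the trash entry can be dropped
    return done
```

A file whose storage node is unreachable stays in the list and is retried, which matches the rule that a file is fully deleted only once all its projections have been removed.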

Page 65: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Remove bins thread: statistics

• This information is accessible through rozodiag.
• Any change made to the thread configuration will not survive a restart of the process.

• Remove bins thread menu:

[127.0.0.1:52001] rzdbg> trash ?

____[exportd-S1 ]__[ trash ?]____

usage:

trash limit [nb] : number of file deletions per period (default:500)

trash : display statistics

• Remove bins thread statistics:

[127.0.0.1:52001] rzdbg> trash

____[exportd-S1 ]__[ trash]____

Trash thread period : 2 seconds

file deletion per period : 500

delete stats (pending/done) : 0/0

Page 66: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process I-node tracking periodic thread

• The role of that thread is to periodically check the tracking files (files that contain the inodes) to figure out whether files can be truncated and/or removed in order to free space on the disks that hold the exportd files.
• That thread is common to all the "eid" handled by the slave exportd.
• The period of the thread can be adjusted through rozodiag. By default the period is 30 seconds.
• That thread can be de-activated, but this is not recommended since, in that case, the i-node tracking files are never removed nor truncated after i-node deletion.

Page 67: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Inode tracking periodic thread: statistics

[127.0.0.1:52001] rzdbg> inode_trck

____[exportd-S1 ]__[ inode_trck]____

period : 30 second(s)

statistics :

- buffer size :16384

- number of buffers :16

- update requests :5

- update errors :0

- flush errors :0

- activation counter:2662

- average time (us) :1

- total time (us) :4385

- total Write : memory 0 MBytes (65536 Bytes) requests 3

[127.0.0.1:52001] rzdbg> inode_trck ?

____[exportd-S1 ]__[ inode_trck ?]____

usage:

expt_thread reset : reset statistics

expt_thread disable : disable export tracking thread

expt_thread enable : enable export tracking thread

expt_thread period [ <period> ] : change thread period(unit is second)

• Inode tracking thread menu:

• Inode tracking thread statistics:

Page 68: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process Fstats thread

• For each export (eid) there is a global file that contains statistics about the filesystem.
• Each time a file is created/removed these statistics have to be updated. To avoid overloading the system in terms of disk accesses, RozoFS implements a write-behind mechanism.
• The role of the thread is to periodically update on disk the statistics of the exports handled by a slave exportd process.
• The updated statistics are:
  - the number of files
  - the number of i-nodes
• The corresponding file is located at the root of an export and is named fstat_<slave_id>, where slave_id is the index of the exportd slave process that is responsible for the eid associated with the export.
• The period of the thread can be changed on the fly through rozodiag. Any change takes place in memory only and will not survive a restart of the exportd process. The default period is 5 seconds.

Page 69: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process fstat thread: statistics

[127.0.0.1:52001] rzdbg> fstat_thread ?
____[exportd-S1 ]__[ fstat_thread ?]____
usage:
fstat_thread : display statistics
fstat_thread eid <value> : display eid filesystem statistics
fstat_thread reset : reset statistics
fstat_thread period [ <period> ] : change thread period(unit is second)

[127.0.0.1:52001] rzdbg> fstat_thread
___[exportd-S1 ]__[ fstat_thread ]___
period : 5 second(s)
 - activation counter:12725
 - average time (us) :2
 - total time (us) :28337
statistics :
 thread_update_count :2

• Fstat thread menu

• fstat thread statistics

Page 70: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process User/group quota writeback thread

• RozoFS natively supports quotas per user and per group. By default only accounting is enabled. Quota enforcement can be activated with the rozo_quota commands.

• For performance reasons, a write-behind mechanism associated with a quota cache is implemented. Any modification of a quota (user or group) takes place in memory (cache).

• A periodic thread takes care of the quota updates and flushes the modified information to disk.

• It is possible to disable the quota write-back thread. In that case, the quota updates are synchronous.

• By default the quota write-back thread is enabled.
• The period of the thread can be changed through rozodiag. The default period is 1 second.

• Note: a direct flush can take place when the insertion in the quota cache triggers an LRU eviction.

Page 71: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Slave exportd process User/group quota write back thread: statistics

[127.0.0.1:52001] rzdbg> quota_wb_thread ?
____[exportd-S1 ]__[ quota_wb_thread ?]____
usage:
quota_wbthread reset : reset statistics
quota_wbthread disable : disable writeback dirent cache
quota_wbthread enable : enable writeback dirent cache
quota_wbthread period [ <period> ] : change thread period(unit is second)

[127.0.0.1:52001] rzdbg> quota_wb_thread
____[exportd-S1 ]__[ quota_wb_thread]____
period : 1 second(s)
statistics :
 - wr chunk counter :6
 - write (hit/miss) :4/2
 - activation counter:64217
 - average time (us) :36
 - total time (us) :2352493
total Write : memory 0 MBytes (480 Bytes) requests 6

[127.0.0.1:52001] rzdbg> quota_cache
____[exportd-S1 ]__[ quota_cache]____
lv2 attributes cache : current/max 6/65536
hit 6 / miss 4 / lru_del 0
entry size 96 - current size 576 - maximum size 6291456

• Quota write-back thread menu

• Quota write-back thread statistics

Page 72: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS Metadata disk layout for one exported file system

[Figure: metadata disk layout below the file system metadata root path. Top-level entries: dentry_slices, file inode, directory inode, ext. attr. inode, sym. link inode.
  - file inode, directory inode, ext. attr. inode, sym. link inode: slice dir 1 .. slice dir 256, each slice dir holding inode file 1 .. inode file n
  - dentry_slices: slice dir 1 .. slice dir 256, each slice dir holding FID dirs; a FID dir holds dentry file 1 .. dentry file 4096, collision files and a bitmap file]

Page 73: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Page 74: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview Introduction

A dirent file is made of two main sections:
  - Management section: that part contains all the information required to allocate/release chunks in the data section for storing file information, and to perform the lookup of the unique FID associated with either a file or a directory.
  - Data section: that section is used to store the information related to the directories, regular files and hard links handled by RozoFS.

The information found in a dirent file is:
  - the external name of the directory/file
  - the unique FID associated with that directory/file at creation time

A dirent file is always relative to a parent directory.

Page 75: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview File types

Dirent file types
  - Root dirent file: the root dirent file is the file that is used for starting the lookup of a name within the data section of a dirent file.
  - Collision dirent file: a collision dirent file is indirectly accessible through a root dirent file. Theoretically a root dirent file may support up to 2048 collision files. The presence of a collision file associated with a root dirent file is indicated by a bitmap handled in the root dirent file.

Dirent file naming rules. The names of the dirent files have the following format:
  - Dirent root file: d_<root_idx>, where root_idx is the index of the root file (0..4095)
  - Collision dirent file: d_<root_idx>_<coll_idx>, where coll_idx is the local index within the associated dirent root file. coll_idx is in the range 0..2047.
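The naming rules can be expressed as two small helpers. The function names are hypothetical; the `d_` prefix and the index ranges are those given in the rules above.

```python
MAX_ROOT_IDX = 4095  # root_idx range: 0..4095
MAX_COLL_IDX = 2047  # coll_idx range: 0..2047


def root_dirent_name(root_idx):
    """Name of a root dirent file: d_<root_idx>."""
    if not 0 <= root_idx <= MAX_ROOT_IDX:
        raise ValueError("root_idx out of range")
    return f"d_{root_idx}"


def collision_dirent_name(root_idx, coll_idx):
    """Name of a collision dirent file: d_<root_idx>_<coll_idx>."""
    if not 0 <= coll_idx <= MAX_COLL_IDX:
        raise ValueError("coll_idx out of range")
    return f"{root_dirent_name(root_idx)}_{coll_idx}"


print(root_dirent_name(12), collision_dirent_name(12, 3))
```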

Page 76: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview Per parent directory capacity

   number of

entries per file  

Per Parent directory   384 620

Max number of root dirent file 4096    

Max Collision files per root dirent file 64    

Max entries per root/collision files 384    

Max entries (theorical:2048) (Millions) 102,23 3222,79 5203,47

Max entries (Millions)   100,66 162,52
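
The capacity figures of this table can be cross-checked with a little arithmetic. The sketch below is an inference from the numbers, not something the slide states: the "Max entries" row appears to be root files × collision files × entries per file, while the other row appears to also count the root file itself, with values truncated to two decimals. The function names are illustrative.

```python
ROOT_FILES = 4096   # max number of root dirent files per parent directory


def max_entries(coll_files, entries_per_file, count_root=False):
    """Entries reachable under one parent directory.

    count_root=True also counts the root file itself (an assumption that
    happens to reproduce the 'theoretical' row of the table).
    """
    files_per_root = coll_files + (1 if count_root else 0)
    return ROOT_FILES * files_per_root * entries_per_file


def in_millions_truncated(n):
    """Format n in millions, truncated (not rounded) to two decimals."""
    hundredths = n // 10_000
    return f"{hundredths // 100}.{hundredths % 100:02d}"


print(in_millions_truncated(max_entries(64, 384)))           # 100.66
print(in_millions_truncated(max_entries(64, 620)))           # 162.52
print(in_millions_truncated(max_entries(64, 384, True)))     # 102.23
print(in_millions_truncated(max_entries(2048, 384, True)))   # 3222.79
print(in_millions_truncated(max_entries(2048, 620, True)))   # 5203.47
```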

Page 77: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview: MDIRENT file layout (disk representation for the 384 entries case)

[Figure: on-disk layout of a dirent file. Labels recovered from the diagram, grouped by the block they appear next to:]
• block 0: dirent_header (version, dirent_file ref, parent fid, root_idx) and coll_entry_bitmap (2048 hash entries)
• block 1: hash_bitmap (384 hash entries) and name_bitmap (4096 chunks of 32 bytes)
• block 2: hash_table (256 buckets)
• block 3->8: hash entries #0 to #5 (64 entries of 8 bytes each)
• block 9->521: name entries #0 to #215 (16 chunks of 64 bytes each)

Page 78: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview: DIRENT file layout (memory representation for the 384 entries case)

[Figure: in-memory layout of a dirent file (dirent_cache_t). Labels recovered from the diagram:]
• dirent_cache_t fields: dirent_header, free_hash_bitmap_p, free_name_bitmap_p, hash_tbl_p[0..3], hash_entry_p[0..5], name_entry_lvl0_p[0..7], dirent_coll_lvl0_p[0..15]
• free_hash_bitmap: 384 entries -> 48 bytes
• free_name_bitmap: 384 entries -> 432 bytes
• hash_level_32: 64 entries -> 128 bytes
• hash_entry: 64 entries -> 512 bytes
• hash_name_lvl0_p (128): hash_name_lvl1_p[0..15]
• hash_name_lvl1_p (2048): 32 blocks of 64 bytes, pointing to the entry_name records
• coll_dirent_p (512): dirent_cache_p[0..63], 64 pointers per entry; the last pointer shown is dirent_cache_p[2047]

Page 79: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

MDirent file overview: disk dump

• Searching for the slice directories that contain dentries. The search starts at the root of the exported file system:

root@debian:/home/rozofs/off_mgeo/tests/export_1# ls */*/attributes
198/a091301b-742a-4c00-1400-000000000018/attributes
20/3706f468-22be-4b00-0000-000000000018/attributes

• The directory that follows the slice directory is the ASCII representation of the FID of the user directory. Within a user directory there MUST be a file called attributes and at least one d_<xx> file, where <xx> is the result of a hash applied to the name of the object (either a file, symlink or directory):

root@debian:/home/rozofs/off_mgeo/tests/export_1# cd 20/3706f468-22be-4b00-0000-000000000018/
root@debian:/home/rozofs/off_mgeo/tests/export_1/20/3706f468-22be-4b00-0000-000000000018# ls
attributes d_0 d_1

• The attributes file is a bitmap that records which dirent files have been created. It has a fixed size of 512 bytes:

root@debian:/home/rozofs/off_mgeo/tests/export_1/20/3706f468-22be-4b00-0000-000000000018# hexdump attributes
0000000 0003 0000 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200

• In this example, the bitmap indicates that 2 files have been created: d_0 and d_1.
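
As a rough illustration of how such a bitmap could be decoded, the sketch below lists the root dirent indexes whose bit is set. The bit ordering (least significant bit first within each byte) is an assumption; it is consistent with the dump above, where the 16-bit word 0003 printed by hexdump on a little-endian host corresponds to the bytes 03 00 and to the files d_0 and d_1. The function name is illustrative.

```python
ATTRIBUTES_SIZE = 512   # fixed size of the attributes file, from the slide


def created_dirent_roots(bitmap: bytes):
    """Return the root dirent indexes whose bit is set (LSB-first assumed)."""
    if len(bitmap) != ATTRIBUTES_SIZE:
        raise ValueError("unexpected attributes file size")
    roots = []
    for byte_idx, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                roots.append(byte_idx * 8 + bit)
    return roots


# Same content as the hexdump above: first byte 0x03, everything else zero.
example = bytes([0x03]) + bytes(ATTRIBUTES_SIZE - 1)
print(created_dirent_roots(example))   # [0, 1]
```

Note that 512 bytes hold 4096 bits, which matches the 0..4095 range of root_idx.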

Page 80: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS i-node (regular file)

• RozoFS i-node:
• 512 bytes structure
• Includes regular attributes
• RozoFS specific fields:
  FID: unique file identifier (RozoFS i-node number)
  File distribution: cid and list of sids (storage nodes where to find the projections)
  Parent FID: RozoFS i-node number of the parent directory
  Dirent_Name: user filename

[Figure: structure of the RozoFS i-node. Labels recovered from the diagram:]
• ext_mattr (512 bytes): mattr_t attrs, pfid, i_extra_isize, i_file_acl, i_link_name, dirent_name, extended attributes array
• mattr_t: fid, cid, sids[ROZOFS_SAFE_MAX], mode, uid, gid, nlink, ctime, atime, mtime, size, children
• dirent_name: type (1 bit), len (15 bits), hash_suffix (16 bits); the diagram distinguishes a DIRECT and an INDIRECT form
• Other dirent_name fields shown: coll (1 bit), root_idx (15 bits), coll_idx (16 bits), suffix (16 bytes), name (60 bytes), chunk_id (12 bits), nb_chunks (4 bits), dirent_file_idx (16 bits)

Page 81: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

I-node file management

• I-node file:
  An i-node file can contain up to 2044 i-nodes
  The header of the file contains:
    A timestamp: creation time of the first i-node
    A relative i-node index table

• Tracking main file: this file keeps track of the first and last i-node file within a slice

[Figure: layout of the tracking files. The main tracking file (track_main_[1..4]) holds an exp_trck_header_t of 16 bytes with first_idx = 0 and last_idx = 10. Each i-node file, from trk_0 to trk_10, starts with an exp_trck_file_header_t of 4096 bytes (creation_time, inode_idx_table[2044]) followed by the inode_array (inode idx 0 ... inode idx 2043).]

• The i-node file management is common to all the i-node types
• The difference resides in the payload and size of the i-node only
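
The header sizes quoted in the diagram (4096 bytes for the i-node file header, 16 bytes for the main tracking header) can be reproduced with a plausible field encoding. The sketch below assumes a 64-bit creation_time followed by 2044 signed 16-bit table entries, and two 64-bit indexes for the main header; these widths and the little-endian byte order are assumptions chosen because they add up to the published sizes, not a description of the actual C structures.

```python
import struct

INODES_PER_FILE = 2044
# Assumed encodings (little-endian): the field widths are guesses that match
# the 4096-byte and 16-byte sizes shown in the diagram.
FILE_HEADER_FMT = f"<q{INODES_PER_FILE}h"   # creation_time + inode_idx_table
MAIN_HEADER_FMT = "<qq"                     # first_idx, last_idx


def pack_file_header(creation_time, idx_table):
    """Serialize an i-node file header under the assumed encoding."""
    if len(idx_table) != INODES_PER_FILE:
        raise ValueError("index table must have 2044 entries")
    return struct.pack(FILE_HEADER_FMT, creation_time, *idx_table)


def unpack_file_header(raw):
    """Return (creation_time, idx_table) from a serialized header."""
    fields = struct.unpack(FILE_HEADER_FMT, raw)
    return fields[0], list(fields[1:])


header = pack_file_header(1421971200, list(range(INODES_PER_FILE)))
print(len(header))                        # 4096
print(struct.calcsize(MAIN_HEADER_FMT))   # 16
```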

Page 82: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

I-node access from FID

Page 83: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

I-node allocation

[Figure: a newly created i-node file. The exp_trck_file_header_t holds creation_time = T1 and inode_idx_table[0] = 0 ... inode_idx_table[2043] = 2043; the inode_array contains only inode idx 0.]

• RozoFS i-node time and space organization:
  RozoFS i-nodes are always allocated in increasing order of the i-node file number (40 bits) within a slice
  A RozoFS i-node that has been released is never re-allocated
  A RozoFS i-node file carries the timestamp of the first i-node allocated within that i-node file
  Thus each RozoFS i-node has embedded virtual time information
  The space allocation (i-node and data blocks) needed by the i-node file is handled by the file system (e.g. ext4) that RozoFS uses for storing its i-nodes

• I-node file creation:
  The i-node index table is initialized with the relative indexes within the header
  The i-node file header is then saved on disk
  The i-node content is then appended to the i-node file
  The i-node file is then saved on disk

• Double allocation prevention:
  In order to avoid allocating the same i-node twice, the header is saved on disk as if all the i-nodes of the i-node file were already allocated
  Upon a restart of the metadata server, the index of the first i-node to allocate within the file is deduced from the actual i-node file size
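
The restart rule described above (deducing the next i-node to allocate from the file size) could look like the following sketch. It assumes a 4096-byte header and fixed-size i-nodes appended one after another, with 512 bytes per i-node as for regular files; it ignores the later shrinking of the file by the release task. The function name is illustrative.

```python
HEADER_SIZE = 4096        # exp_trck_file_header_t size shown in the diagram
INODES_PER_FILE = 2044    # capacity of one i-node file


def next_inode_index(file_size, inode_size=512):
    """Deduce the first free i-node index from the i-node file size.

    Returns None when the file is full. inode_size=512 matches the regular
    file i-node; other i-node types may use another size.
    """
    if file_size < HEADER_SIZE:
        raise ValueError("file shorter than its header")
    used = (file_size - HEADER_SIZE) // inode_size
    return used if used < INODES_PER_FILE else None


print(next_inode_index(4096))             # 0
print(next_inode_index(4096 + 3 * 512))   # 3
```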

Page 84: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

I-node release

• The i-node release operation consists in clearing the entry of the relative i-node index table within the associated i-node file
• The space taken by a released i-node is reclaimed by resizing the i-node file
• This is done by a periodic task
• When all the indexes within an i-node file header are cleared (-1), the i-node file is deleted
• The main tracking file is updated if the deleted i-node file matches the index of the first file of the main tracking file

[Figure: i-node file before and after the space of the released i-node 1 is reclaimed.]
Before: creation_time = T1; inode_idx_table[0] = 0, inode_idx_table[1] = -1, inode_idx_table[2] = 2, ..., inode_idx_table[2043] = 2043; inode_array = inode idx 0, inode idx 1, inode idx 2, ..., inode idx 2043
After: creation_time = T1; inode_idx_table[0] = 0, inode_idx_table[1] = -1, inode_idx_table[2] = 1, ..., inode_idx_table[2043] = 2042; inode_array = inode idx 0, inode idx 2, ..., inode idx 2043
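
The release and resize steps can be modelled in memory as follows. This is only a sketch of the behaviour shown in the before/after diagram: the index table is a Python list, the i-node payload is a list of strings, and the function names are illustrative.

```python
RELEASED = -1   # value of a cleared index table entry


def release(idx_table, inode_idx):
    """Clear the index table entry of a released i-node."""
    idx_table[inode_idx] = RELEASED


def compact(idx_table, inode_array):
    """Drop released i-nodes from the payload and renumber the live ones."""
    new_array = []
    for inode_idx, pos in enumerate(idx_table):
        if pos != RELEASED:
            idx_table[inode_idx] = len(new_array)
            new_array.append(inode_array[pos])
    return new_array


def is_empty(idx_table):
    """True when every entry is cleared, i.e. the file can be deleted."""
    return all(pos == RELEASED for pos in idx_table)


table = list(range(2044))
array = [f"inode {i}" for i in range(2044)]
release(table, 1)
array = compact(table, array)
print(table[:3], table[2043], len(array))   # [0, -1, 1] 2042 2043
```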

Page 85: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Exportd start-up sequence

[Figure: start-up sequence between the Export Master Process, rozolauncher (x8) and the Export Slave Process. Commands shown in the diagram:]
Master: exportd -c <config_file> -d <rozodiag_port>
Launcher: rozolauncher start /var/run/launcher_exportd_slave_<slave_id>.pid exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
Slave: exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
The Exit() of a slave is reported to its rozolauncher.

• The Master exportd process is started first
• It launches 8 Slave exportd processes through rozolauncher
• Each rozolauncher starts an exportd process in slave mode and provides it with the slave_id reference as well as the export configuration file
• In case of failure (exit()), the slave exportd process is automatically restarted by its parent rozolauncher
• Before being restarted, a core dump file might be generated.
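
The command lines of the start-up sequence can be assembled as in the sketch below. The argument order is copied from the slide; the configuration file path, the slave identifiers 1..8 and the rozodiag ports (52000 + slave id, as in the slave status output of this deck) are illustrative values, and the helper names are hypothetical.

```python
NB_SLAVES = 8


def master_cmd(config_file, rozodiag_port):
    """Command line of the Master exportd process."""
    return ["exportd", "-c", config_file, "-d", str(rozodiag_port)]


def slave_launcher_cmd(slave_id, config_file, rozodiag_port):
    """rozolauncher command line that starts one Slave exportd process."""
    pid_file = f"/var/run/launcher_exportd_slave_{slave_id}.pid"
    return ["rozolauncher", "start", pid_file,
            "exportd", "-i", str(slave_id), "-s",
            "-c", config_file, "-d", str(rozodiag_port)]


config = "/etc/rozofs/export.conf"   # hypothetical path
print(" ".join(master_cmd(config, 52000)))
for slave_id in range(1, NB_SLAVES + 1):
    print(" ".join(slave_launcher_cmd(slave_id, config, 52000 + slave_id)))
```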

Page 86: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Exportd shared memory

root@debian:/home/rozofs/off_mgeo/src/exportd# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2
0x00000000 98307      didier     600        33554432   2          dest
0x4558504f 294916     root       666        1216       9
0x524f5a30 163845     root       666        8421376    2
0x524f5a31 196614     root       666        8421376    2
0x524f5a32 229383     root       666        8421376    2
0x524f5a33 262152     root       666        8421376    2

• The Master Exportd process supervises the Slave Exportd processes through a shared memory segment that it creates when it starts. This segment is used to report the information related to each exportd slave process.

• Note: the key of the Master Exportd shared memory is the constant "EXPO" (0x4558504f). It is not possible to run two Master exportd processes on the same host.
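
The keys listed by ipcs are 32-bit values whose bytes spell ASCII characters, which is how the constant key of the Master Exportd can be recognized in the listing. A minimal sketch (the function name is illustrative):

```python
def shm_key_to_ascii(key):
    """Decode a 32-bit System V IPC key into its 4 ASCII characters."""
    return key.to_bytes(4, "big").decode("ascii")


print(shm_key_to_ascii(0x4558504F))   # EXPO
print(shm_key_to_ascii(0x524F5A30))   # ROZ0
```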

Page 87: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Exportd Slave status

• Information relative to the exportd slave status can be obtained from the Master Exportd:

[127.0.0.1:52000] rzdbg> exp_slave
____[exportd-M]__[ exp_slave]____
 id | pid   | state   | uptime          | observer port | metadata port | reload count |
----+-------+---------+-----------------+---------------+---------------+--------------+
  0 | 11544 | running | 1 days, 4:20:53 | 52000         | 53000         | 0            |
  1 | 11637 | running | 1 days, 4:20:50 | 52001         | 53001         | 0            |
  2 | 11639 | running | 1 days, 4:20:50 | 52002         | 53002         | 0            |
  3 | 11640 | running | 1 days, 4:20:50 | 52003         | 53003         | 0            |
  4 | 11644 | running | 1 days, 4:20:50 | 52004         | 53004         | 0            |
  5 | 11646 | running | 1 days, 4:20:50 | 52005         | 53005         | 0            |
  6 | 11654 | running | 1 days, 4:20:50 | 52006         | 53006         | 0            |
  7 | 11650 | running | 1 days, 4:20:50 | 52007         | 53007         | 0            |
  8 | 11653 | running | 1 days, 4:20:50 | 52008         | 53008         | 0            |

• id: slave identifier. The value 0 corresponds to the Master Exportd process
• pid: the pid of the process
• State: current state of the process
  Starting: asserted by the Master exportd process when it starts a slave exportd
  Running: asserted by the Slave process once it is ready to process metadata requests
• Uptime: time elapsed since the process came up
• Observer port: port to be used with rozodiag
• Metadata port: listening port for all metadata operations
• Reload count: number of times the exportd configuration has been reloaded by the process

Page 88: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Export reload highlights

• The exportd reload takes place upon a change in the configuration file of the export: adding a storage node to a cluster, adding a volume, etc.
• The reload is done without stopping the exportd processes (Master and Slave)
• This is achieved by raising the SIGUSR1 Linux signal
• The operation currently in progress is suspended until the configuration file has been processed
• Nominal case:
  The Master process updates its own configuration and then informs its Slave processes that they must process the new configuration
  Once a new configuration has been successfully loaded, a new md5 is computed for the configuration. Clients use it as a trigger to detect a configuration change remotely and to ask for the new configuration.
• Failure case:
  In case of failure during the parsing of the new configuration, the system reverts to the current configuration. The Slave processes are not involved in that case.
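
The nominal and failure cases can be summarized by the sketch below. It is not the exportd implementation: JSON is used as a stand-in parser because the real configuration format is not described here, and the class and method names are hypothetical. What it illustrates is the rule itself: a failed parse leaves the current configuration untouched, and a successful load produces a new md5 that clients can compare with the one they hold.

```python
import hashlib
import json


class ExportConfig:
    """Toy model of a reloadable configuration (JSON is a stand-in format)."""

    def __init__(self, text):
        self.settings = json.loads(text)
        self.md5 = hashlib.md5(text.encode()).hexdigest()
        self.reload_count = 0

    def reload(self, new_text):
        """Load new_text; on parsing failure keep the current configuration."""
        try:
            new_settings = json.loads(new_text)
        except ValueError:
            return False                       # failure case: nothing changes
        self.settings = new_settings
        self.md5 = hashlib.md5(new_text.encode()).hexdigest()
        self.reload_count += 1                 # incremented at the end
        return True

    def changed_since(self, client_md5):
        """What a client would check to detect a configuration change."""
        return client_md5 != self.md5


cfg = ExportConfig('{"volumes": 1}')
client_md5 = cfg.md5
print(cfg.reload('{"volumes": 2}'), cfg.changed_since(client_md5))   # True True
print(cfg.reload('{broken'), cfg.settings)    # False {'volumes': 2}
```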

Page 89: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Exportd reload nominal sequence

[Figure: reload sequence between the Export Master Process, rozolauncher (x8, with Exit()) and the Export Slave Process. Commands and steps shown in the diagram:]
kill -1 <exportd_master_pid>
rozolauncher reload /var/run/launcher_exportd_slave_<slave_id>.pid exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
kill -1 <exportd_slave_pid>
Validate the new configuration file
Parse and load the new configuration file
Increment the reload count at the end of the processing

Page 90: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

• RozoFS data path
  Mojette Transform performances
  Mojette Transform use cases

Page 91: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Mojette Transform performances: encoding/decoding performances with 2 redundancy projections (4+2)

[Figure: four bar charts, throughput in MBytes/s:]
• Encoding performances (4+2) (best case): throughput versus user data block size (8192, 4096, 2048 and 1024 bytes), for Cauchy-good, Reed-sol-van, Reed-sol-r6 and Mojette
• Encoding performances (4+2) (worst case): same axes and same four codes
• Decoding performances 4+2, user data block 4K: throughput versus number of failures (1 and 2), for Cauchy-good, Reed-sol-van, Reed-sol-r6 and Mojette
• Decoding performances 4+2, user data block 8K: throughput versus number of failures (1 and 2), for Cauchy-good, Reed-sol-van and Mojette

1. Mojette decoding/encoding is not CPU intensive and fits well on the client side
2. Mojette decoding time does not depend on the number of failures

Page 92: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Mojette Transform performances: encoding/decoding performances with 4 redundancy projections (8+4)

[Figure: four bar charts, throughput in MBytes/s:]
• Encoding performances 8+4 (best case): throughput versus user data block size (8192, 4096, 2048 and 1024 bytes), for Cauchy-good, Cauchy-orig, Reed-sol-van and Mojette
• Encoding performances 8+4 (worst case): same axes and same four codes
• Decoding performances 8+4, user data block 4K: throughput versus number of failures (1 to 4), for Cauchy-good, Reed-sol-van and Mojette
• Decoding performances 8+4, user data block 8K: same axes and same three codes

Page 93: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path write service: file system block forward transformation (nominal use case)

[Figure: the user payload is split into UDB 1 to UDB 5. The "Mojette Transform Forward + Write" process turns each UDB k into three projections (proj 1.k, proj 2.k, proj 3.k) and, following the RozoFS layout distribution over OSD nodes (1,2,3,4), writes them to the nodes of the optimal distribution; the remaining node(s) are spare.]

• The set of OSDs is provided within the metadata associated with the file
• The user payload is split into User Data Blocks (UDB) of 4K or 8K
• The Mojette transform is applied to each UDB
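
The first step of the write path, splitting the user payload into User Data Blocks, can be sketched as follows. The zero padding of the last block is an assumption of this sketch (the slides do not say how a partial block is handled), and the function name is illustrative. The Mojette transform itself is not shown.

```python
def split_into_udbs(payload: bytes, udb_size: int = 4096):
    """Split a payload into fixed-size User Data Blocks (4K or 8K)."""
    if udb_size not in (4096, 8192):
        raise ValueError("UDB size must be 4096 or 8192 bytes")
    blocks = []
    for offset in range(0, len(payload), udb_size):
        block = payload[offset:offset + udb_size]
        # Zero padding of a short last block is an assumption of this sketch.
        blocks.append(block.ljust(udb_size, b"\0"))
    return blocks


udbs = split_into_udbs(b"x" * 10000)
print(len(udbs), [len(b) for b in udbs])   # 3 [4096, 4096, 4096]
```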

Page 94: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path write service: nominal use case sequence diagram

[Figure: sequence diagram between application, vfs/fuse, rozofsmount/storcli and storage nodes 1 to 3:]
application -> vfs/fuse: pwrite(fd,buf,offset,length)
vfs/fuse -> rozofsmount/storcli: pwrite(fd_rozofs,buf,offset,length)
1. rozofsmount/storcli: mojette_transform_forward(buf)
2. rozofsmount/storcli -> storage nodes: write_req(fid,projection1), write_req(id,projection2), write_req(id,projection3), each answered by write_rsp(ok)
3. status returned to the application

• Write transactions are performed in parallel
• The write service ends upon receiving all the responses from the OSD nodes

Page 95: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path write service: failure use case

[Figure: same layout as the nominal case (user payload split into UDB 1 to UDB 5, three projections per UDB, OSD nodes (1,2,3,4)). The projections proj 3.1 to proj 3.5 are shown a second time, on the spare node.]

• A spare OSD is used in case of failure of an OSD belonging to the optimal distribution
• The write operation is successful when n+m projections are successfully written
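
The write rules of the nominal and failure cases (parallel writes, spare OSD on failure, success only when every projection is written) can be sketched as below. The `send` callback is a caller-supplied stub standing in for the write_req/write_rsp exchange, and all names are hypothetical; the real storcli logic is certainly more involved.

```python
from concurrent.futures import ThreadPoolExecutor


def write_block(projections, optimal_nodes, spare_nodes, send):
    """Write one projection per node, retrying failed ones on spare nodes.

    send(node, projection) -> bool is a transport stub.
    Returns {node: projection index}, or None when the write fails.
    """
    # Writes towards the optimal distribution are issued in parallel.
    with ThreadPoolExecutor(max_workers=len(projections)) as pool:
        results = list(pool.map(send, optimal_nodes, projections))
    placement = {}
    spares = list(spare_nodes)
    for idx, (node, ok) in enumerate(zip(optimal_nodes, results)):
        while not ok and spares:
            node = spares.pop(0)               # fall back to a spare OSD
            ok = send(node, projections[idx])
        if not ok:
            return None                        # a projection could not be written
        placement[node] = idx
    return placement


down = {"node3"}                               # simulated failed OSD


def demo_send(node, projection):
    return node not in down


placement = write_block(["p1", "p2", "p3"],
                        ["node1", "node2", "node3"], ["node4"], demo_send)
print(placement)   # {'node1': 0, 'node2': 1, 'node4': 2}
```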

Page 96: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path write service: failure sequence diagram

[Figure: sequence diagram between application, vfs/fuse, rozofsmount/storcli and storage nodes 1 to 4, with steps numbered 1 to 4. Messages shown:]
application -> vfs/fuse: pwrite(fd,buf,offset,length)
vfs/fuse -> rozofsmount/storcli: pwrite(fd_rozofs,buf,offset,length)
mojette_transform_forward(buf)
write_req(fid,projection1), write_req(id,projection2), write_req(id,projection3)
write_req(id,projection3), sent a second time
write_rsp(ok), received three times
status

Page 97: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path read service: file system block Mojette inverse transformation (nominal use case)

[Figure: the "Read + Inverse Mojette Transform" process reads projections (projection 1, projection 2, projection 3) from the OSD nodes of the optimal distribution, following the RozoFS layout distribution over OSD nodes (1,2,3,4), and rebuilds the UDB (4K or 8K).]

• The read process selects n projections among the n+m projections to rebuild a User Data Block
• It can be any subset of n projections in the n+m projection set
• Read transactions towards the OSDs are performed in parallel, which minimizes the data transfer delay over the network
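
To make the "any n among n+m" property concrete, the sketch below enumerates the subsets of projections that are sufficient to rebuild a block. The values n = 2 and n+m = 3 mirror the three projections and two read replies drawn in the diagrams of this section; the function name is illustrative.

```python
from itertools import combinations


def readable_subsets(projection_ids, n):
    """All subsets of n projections; each one is enough to rebuild the UDB."""
    return list(combinations(projection_ids, n))


print(readable_subsets([1, 2, 3], 2))   # [(1, 2), (1, 3), (2, 3)]
```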

Page 98: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path read service: sequence diagram (nominal use case)

[Figure: sequence diagram between application, vfs/fuse, rozofsmount/storcli and storage nodes 1 and 2, with steps numbered 1 to 3:]
application -> vfs/fuse: pread(fd,buf,offset,length)
vfs/fuse -> rozofsmount/storcli: read(fd_rozofs,buf,offset,length)
rozofsmount/storcli -> storage node 1: read_req(fid,offset,len), answered by read_rsp(projection1)
rozofsmount/storcli -> storage node 2: read_req(fid,offset,len), answered by read_rsp(projection2)
rozofsmount/storcli: mojette_transform_inverse(buf,projection1,projection2)
buf,length returned to the application

Page 99: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path read service: failure use case

[Figure: same layout as the nominal read case (projections 1 to 3 on OSD nodes (1,2,3,4), Read + Inverse Mojette Transform, UDB of 4K or 8K), with an additional Read issued towards another OSD node.]

• When reading a projection fails, the read is attempted on the remaining OSDs. Possible causes:
  Disk failure
  Network failure
  Out of date projection

Page 100: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path read service: failure sequence diagram

[Figure: sequence diagram between application, vfs/fuse, rozofsmount/storcli and storage nodes 1 to 3, with steps numbered 1 to 4:]
application -> vfs/fuse: pread(fd,buf,offset,length)
vfs/fuse -> rozofsmount/storcli: read(fd_rozofs,buf,offset,length)
read_req(fid,offset,len) sent to storage node 1 and storage node 2
read_rsp(projection1)
timer_expiration
read_req(projection3) sent to storage node 3, answered by read_rsp(projection3)
mojette_transform_inverse(buf,projection1,projection3)
buf,length returned to the application

• Fast projection recovery time:
  A guard timer is started on the first projection read reply
  At timer expiration, read requests are propagated towards the remaining OSDs
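
The guard timer behaviour can be approximated with threads and a timeout, as in the sketch below. It is a simplified model, not the storcli code: `read` is a caller-supplied stub that returns a projection or raises, the 50 ms guard time is arbitrary, and all names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def read_block(nodes, n, read, guard_time=0.05):
    """Collect n projections, asking the remaining nodes when needed."""
    got = {}
    spare = list(nodes[n:])
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        pending = {pool.submit(read, node): node for node in nodes[:n]}
        while len(got) < n and (pending or spare):
            # The guard timer only runs once a first projection has arrived.
            timeout = guard_time if got else None
            done, _ = wait(list(pending), timeout=timeout,
                           return_when=FIRST_COMPLETED)
            failed = False
            for fut in done:
                node = pending.pop(fut)
                if fut.exception() is None:
                    got[node] = fut.result()
                else:
                    failed = True
            if (failed or not done) and spare:
                # Timer expired or a read failed: ask the remaining nodes.
                for node in spare:
                    pending[pool.submit(read, node)] = node
                spare = []
    return got if len(got) >= n else None


def demo_read(node):
    if node == "node2":
        time.sleep(0.5)          # slow node: its answer arrives too late
    return f"projection from {node}"


result = read_block(["node1", "node2", "node3"], 2, demo_read)
print(sorted(result))            # ['node1', 'node3']
```

As in the sequence diagram, the block is rebuilt from projection 1 and projection 3 without waiting for the slow node.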

Page 101: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS data-path read service: failure sequence diagram, case of a CRC 32 error

[Figure: sequence diagram between application, vfs/fuse, rozofsmount/storcli and storage nodes 1 to 3, with steps numbered 1 to 6:]
application -> vfs/fuse: pread(fd,buf,offset,length)
vfs/fuse -> rozofsmount/storcli: read(fd_rozofs,buf,offset,length)
read_req(fid,offset,len) sent to storage node 1 and storage node 2
read_rsp(projection1)
read_rsp(projection2,crc_error): crc error!!
read_req(projection3) sent to storage node 3, answered by read_rsp(projection3)
mojette_transform_inverse(buf,projection1,projection3)
buf,length returned to the application
Self healing sequence: mojette_transform_forward_one(buf,projection 2), write_req(projection_2), write_rsp(OK); the block in error is now fixed.

• The CRC error is detected on the storage node
• The storage node reports that the read failure is due to a CRC error
• After rebuilding the initial data, the storcli process triggers a transform forward
• The transform forward concerns only the faulty projection
• There might be more than one block to regenerate (depending on the number of CRC errors)
• Once the projection has been regenerated, it is sent back to the associated storage node
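
The detection and repair steps can be mimicked with the standard CRC32 routine of zlib. In this sketch the stored projection is a plain dictionary, the regenerated projection is passed in directly (standing in for the output of mojette_transform_forward_one), and the function names are illustrative; how RozoFS actually stores and checks its CRC is not described in these slides.

```python
import zlib


def store(projection: bytes):
    """Storage-node side: keep a CRC32 next to the projection."""
    return {"data": bytearray(projection), "crc": zlib.crc32(projection)}


def read(stored):
    """Return (data, crc_error) as the storage node would report it."""
    data = bytes(stored["data"])
    return data, zlib.crc32(data) != stored["crc"]


def heal(stored, regenerated: bytes):
    """Write back a regenerated projection together with its new CRC."""
    stored["data"] = bytearray(regenerated)
    stored["crc"] = zlib.crc32(regenerated)


node2 = store(b"projection 2")
node2["data"][0] ^= 0xFF                 # simulate on-disk corruption
_, crc_error = read(node2)
print(crc_error)                         # True
if crc_error:
    heal(node2, b"projection 2")         # regenerated by the client
print(read(node2))                       # (b'projection 2', False)
```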

Page 102: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

• RozoFS user and group quotas

Page 103: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS disk quota: Introduction

• The quota subsystem allows the system administrator to set limits on used space and on the number of used i-nodes (an i-node is a file system structure associated with each file or directory) for users and/or groups.
• For both used space and number of used i-nodes there are actually two limits: the first one is called the softlimit and the second one the hardlimit.
• A user can never exceed a hardlimit for any resource.
• When a user exceeds a softlimit, (s)he is warned that (s)he uses more space than (s)he should, but the space/i-node is still allocated (only if the user does not also exceed the hardlimit, of course).
• If a user exceeds the softlimit for a specified period of time (this period is called the grace time), (s)he is not allowed to allocate more resources (so (s)he must free some space/i-nodes to get under the softlimit).
• Quota limits are set independently for each file system (eid)
• By default RozoFS quotas are always on (accounting), but quota enforcement can be controlled with rozo_quotaon.
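
The soft limit, hard limit and grace time rules can be condensed into a single decision function. This is a sketch of the rules as stated above, not of the RozoFS code: treating a limit of 0 as "no limit" is an assumption, and the parameter names are illustrative. The same function applies to space and to i-nodes.

```python
def may_allocate(used, requested, soft, hard, over_soft_since, now, grace):
    """Return True when the allocation is allowed.

    A limit of 0 means "no limit" here (assumption of this sketch).
    over_soft_since is the time the soft limit was first exceeded, or None.
    """
    new_used = used + requested
    if hard and new_used > hard:
        return False                      # the hard limit is never exceeded
    if soft and new_used > soft:
        if over_soft_since is not None and now - over_soft_since > grace:
            return False                  # grace time elapsed
    return True


print(may_allocate(90, 5, 100, 120, None, 0, 7))     # True: under the soft limit
print(may_allocate(110, 20, 100, 120, None, 0, 7))   # False: hard limit exceeded
print(may_allocate(105, 5, 100, 120, 0, 3, 7))       # True: within the grace time
print(may_allocate(105, 5, 100, 120, 0, 10, 7))      # False: grace time elapsed
```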

Page 104: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS disk quota: Introduction

• There are separate quota files for users and groups.
• The quota files are specific to an exported file system
• The structure of the files that contain the quotas is specific to RozoFS: the contents of the RozoFS quota files can be accessed through the quota services provided with RozoFS.
• The structure of a quota record within a file is the following:

#include <stdint.h> /* int64_t */
#include <time.h>   /* time_t */

typedef struct _rozo_mem_dqblk {
    int64_t dqb_bhardlimit; /* absolute limit on disk blks alloc */
    int64_t dqb_bsoftlimit; /* preferred limit on disk blks */
    int64_t dqb_curspace;   /* current used space */
    int64_t dqb_ihardlimit; /* absolute limit on allocated inodes */
    int64_t dqb_isoftlimit; /* preferred inode limit */
    int64_t dqb_curinodes;  /* current # allocated inodes */
    time_t dqb_btime;       /* time limit for excessive disk use */
    time_t dqb_itime;       /* time limit for excessive inode use */
} rozo_mem_dqblk;

Page 105: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS quota files

[Figure: quota files under the file system metadata root path. The group side holds quotainfo_grp, group_btmap and group_0 ... group_n; the user side holds quotainfo_usr, user_btmap and user_0 ... user_n. The bitmap file contains a bitmap of 8192 bytes. The other file drawn has a header (labelled "bitmap file header") with entry_idx[0] = 0x8000, ..., entry_idx[usr1] = 0, ..., entry_idx[2047], and a payload (labelled "bitmap file payload") holding quota_info(usr1), quota_info(usr2), ...]

• The bitmap file covers 64K quota files
• Each quota file can contain up to 2048 quota records
• The uid/gid is used as a relative index within a quota file to find the effective entry in the payload
• The structure can cover up to 133 million users or groups
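
One way to read the figures above is that a uid or gid selects a quota file and a relative entry inside it. The sketch below assumes a plain division of the identifier by 2048, which is not stated on the slide; the function name is illustrative. With an 8192-byte bitmap (65536 bits) and 2048 records per file, the product is 134,217,728, the same order of magnitude as the figure of 133 million quoted above.

```python
ENTRIES_PER_FILE = 2048
BITMAP_BYTES = 8192
MAX_QUOTA_FILES = BITMAP_BYTES * 8      # 65536 files, the "64K" of the slide


def locate_quota(identifier):
    """Map a uid/gid to (quota file index, relative entry index).

    The divmod-based mapping is an assumption of this sketch.
    """
    file_idx, entry_idx = divmod(identifier, ENTRIES_PER_FILE)
    if not 0 <= file_idx < MAX_QUOTA_FILES:
        raise ValueError("identifier outside the covered range")
    return file_idx, entry_idx


print(locate_quota(1000))                    # (0, 1000)
print(locate_quota(5000))                    # (2, 904)
print(MAX_QUOTA_FILES * ENTRIES_PER_FILE)    # 134217728
```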

Page 106: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

Turning on/off quota on a file system

NAME
    rozo_quotaon, rozo_quotaoff - turn filesystem quotas on and off

SYNOPSIS
    /usr/sbin/rozo_quotaon [ -vugfp ] [ -e exportconf-name ] filesystem-id...
    /usr/sbin/rozo_quotaon [ -avugfp ] [ -e exportconf-name ]

    /usr/sbin/rozo_quotaoff [ -vugp ] filesystem-id...
    /usr/sbin/rozo_quotaoff [ -avugp ]

DESCRIPTION
    rozo_quotaon
        rozo_quotaon announces to the system that disk quotas should be enabled on one or more file systems. There are two components to the RozoFS disk quota system: accounting and limit enforcement. RozoFS file systems require that quota accounting be turned on at mount time. It is possible to enable and disable limit enforcement on a RozoFS file system after quota accounting is already turned on. The default is to turn on both accounting and enforcement.

        The RozoFS quota implementation does not maintain quota information in user-visible files, but rather stores this information internally.

    rozo_quotaoff
        rozo_quotaoff announces to the system that the specified file systems should have any disk quotas turned off.

Page 107: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS disk quota components

[Figure: RozoFS disk quota services around the Exportd host. Components shown: Export M, Export S 1 ... Export S 8, conf, metadata disks (dentry files, i-node files). Tools shown: rozo_quotaon and rozo_setquota (each labelled "Get/set quota"), rozo_repquota and rozo_warnquota. Other labels: "Direct read quota files", Name Service Switch (reached through getpwuid() and getgrgid()), sendmail, exim4.]

Page 108: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

The Name Service Switch

• Name Service Switch configuration file /etc/nsswitch.conf, for example:

    hosts: files dns

• On the right-hand side of the colon are the data sources that NSS consults to retrieve the system database.

• NSS progresses left to right, checking each source in turn until the data is found.

• On the left-hand side of the colon is the grouping of data, the database itself, which we are calling a "map". For instance, a line reading "passwd: compat files" maps the passwd database API functions to the "compat" and "files" data sources.

• When an NSS function is called, the NSS implementation reads its configuration file /etc/nsswitch.conf, which names the library that implements the data retrieval. NSS dynamically loads this library, in this example libnss_files.so.

• The correct function within this library is then called, for example _nss_files_getpwuid_r().

• libnss_files then opens and parses /etc/passwd and returns the result (typically a struct).
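The lookup chain above can be observed from a script. The sketch below is illustrative only: parse_nsswitch and user_name are hypothetical helper names, the sample configuration text is made up, and the parser is deliberately simplified (it keeps source names and drops bracketed action items such as [NOTFOUND=return]). The call to pwd.getpwuid() is the standard Python wrapper around the C library getpwuid(), so it goes through whatever sources the passwd line of the local /etc/nsswitch.conf names.

```python
import os
import pwd
import re


def parse_nsswitch(text):
    """Return {database: [source, ...]} from nsswitch.conf-style text.

    Simplified: comments and bracketed action items are ignored.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # strip comments
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        sources = re.sub(r"\[[^\]]*\]", " ", sources)  # drop [STATUS=action]
        mapping[database.strip()] = sources.split()
    return mapping


def user_name(uid):
    """Resolve a uid to a login name through NSS, else return the number."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:                             # no source knows this uid
        return str(uid)


if __name__ == "__main__":
    sample = """
    # hypothetical sample, not taken from a real host
    passwd: compat files
    hosts:  files dns
    """
    config = parse_nsswitch(sample)
    print("hosts sources, in lookup order:", config["hosts"])
    print("current user:", user_name(os.getuid()))
```

The fallback to the numeric uid mirrors what a reporting tool would plausibly do when no data source knows the identifier.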

Page 109: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

NSS + RFC 2307 LDAP

• Add in a directory service, and you get a situation familiar to many sysadmins. /etc/nsswitch.conf would now also list ldap in addition to files in this example.

• If NSS were to load libnss_files.so and find nothing, it would then load libnss_ldap.so. libnss_ldap.so would make a network connection to the LDAP server, perform a query, and convert the LDAP results into the right return structure.

• This means that every query will translate into a TCP connection with handshake overhead, possibly over SSL with its crypto overhead, and then do various ASN.1 and BER encoding and decoding within the LDAP protocol itself...

Page 110: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

• Miscellaneous

Page 111: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

RozoFS Core files

• A core file is generated when a RozoFS process encounters a fatal error during its execution.

• By default, the system supports up to 2 core files per process.

• The core files generated for a RozoFS process are found under the /var/run/rozofs_core directory.

• There is one directory per RozoFS process:

    Exportd:       Master metadata server
    Export_slave:  Slave metadata server
    Geomgr:        Geo-replication process
    Rozofsmount:   RozoFS client process
    Storaged:      Master storaged process
    Storio:        Slave storaged process
    Storcli:       RozoFS erasure coding read/write client process
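A quick way to check which processes have left core files is to walk the directory layout described above. This is a minimal sketch under stated assumptions: it takes the root path /var/run/rozofs_core from the slide, treats every subdirectory as one process directory, and makes no assumption about how the core files themselves are named. The function name list_core_files is hypothetical.

```python
from pathlib import Path

CORE_ROOT = Path("/var/run/rozofs_core")   # root directory named on the slide


def list_core_files(root=CORE_ROOT):
    """Return {process_directory_name: [file names]} for each subdirectory."""
    root = Path(root)
    if not root.is_dir():                  # nothing generated yet, or wrong host
        return {}
    result = {}
    for process_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result[process_dir.name] = sorted(
            f.name for f in process_dir.iterdir() if f.is_file()
        )
    return result


if __name__ == "__main__":
    for process, cores in list_core_files().items():
        print(f"{process}: {len(cores)} core file(s)")
        for name in cores:
            print("   ", name)
```

On a host where no RozoFS process has crashed, the function returns an empty mapping and the script prints nothing.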

Page 112: RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015.

List of the reserved ports for RozoFS usage

[127.0.0.1:50004] rzdbg> reserved_ports
____[storcli 1 of rozofsmount 0]__[ reserved_ports]____
_______._____._______.___________________________.____________________________________
 Value | Nb  | Const | /etc/services             | Role
_______|_____|_______|___________________________|____________________________________
 52000 |   9 | 52000 | rozofs_export_diag        | Export master and slave diagnostic
 53000 |   9 | 53000 | rozofs_export_eproto      | Export master and slave eproto
 53010 |   1 | 53010 | rozofs_export_geo_replica | Export master geo-replication
 50003 |  24 | 50003 | rozofs_mount_diag         | rozofsmount & storcli diagnostic
 50027 | 256 | 50027 | rozofs_storaged_diag      | Storaged & storio diagnostic
 51000 |   1 | 51000 | rozofs_storaged_mproto    | Storaged mproto
 54000 |  91 | 54000 | rozofs_geomgr_diag        | Geo-replication manager, clients & storcli diagnostic
_______|_____|_______|___________________________|____________________________________

echo net.ipv4.ip_local_reserved_ports="52000-52008,53000-53008,53010-53010,50003-50026,50027-50282,51000-51000,54000-54090" >> /etc/sysctl.conf

echo "52000-52008,53000-53008,53010-53010,50003-50026,50027-50282,51000-51000,54000-54090" > /proc/sys/net/ipv4/ip_local_reserved_ports

grep ip_local_reserved_ports /etc/sysctl.conf

cat /proc/sys/net/ipv4/ip_local_reserved_ports
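The range list written to ip_local_reserved_ports follows directly from the Value and Nb columns of the table: each entry reserves Nb consecutive ports starting at Value. The sketch below recomputes the string from those two columns; the names RESERVED and reserved_ports_value are hypothetical, and the script only prints the value rather than writing to /etc/sysctl.conf or /proc.

```python
# (first port, number of ports) pairs copied from the Value and Nb columns.
RESERVED = [
    (52000, 9),     # rozofs_export_diag
    (53000, 9),     # rozofs_export_eproto
    (53010, 1),     # rozofs_export_geo_replica
    (50003, 24),    # rozofs_mount_diag
    (50027, 256),   # rozofs_storaged_diag
    (51000, 1),     # rozofs_storaged_mproto
    (54000, 91),    # rozofs_geomgr_diag
]


def reserved_ports_value(entries):
    """Build the comma-separated 'first-last' list expected by the sysctl."""
    return ",".join(f"{base}-{base + count - 1}" for base, count in entries)


if __name__ == "__main__":
    print("net.ipv4.ip_local_reserved_ports=" + reserved_ports_value(RESERVED))
```

Generating the value from the table rather than typing it by hand makes it easier to keep the sysctl setting consistent if a port count changes.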