FileSystemDesignforDistributedPersistentMemory
�
%49<49��9#703./9(�$30:,6708<
19<49<49�8703./9(�,+9�*3/885���7846(.,�*7�8703./9(�,+9�*3�=19
������� �� ��������� ����� �
9.56 million tickets/day
11. 6 millionpayments/day
1.48 billiontxs/day
0.256 milliontxs/second
������������������� ����������
• �(8(�+60:,3��3-462(8043�#,*/3414.<• �4259803.��38,370:,� �4259803.�à �(8(��38,370:,� �4259803.• � ����0.��(8(�����
• �4;�1(8,3*<�+(8(�7846(.,�(3+�564*,7703.
�7��.68:A��*<*��7*5A<2,;��7�6.68:A��*<*+*;.;� �� $A;<.6
NVMM&RDMA
• '�� (3D XPoint,etc)• Datapersistency• Byte-addressable• Lowlatency
•#���• Remotedirectaccess• Bypassremotekernel• Lowlatencyandhighthroughput
Client ServerRegisteredMemory RegisteredMemory
HCA HCACPU CPU
A
B
C
D
E
�
Outline•Octopus:aRDMA-enabledDistributedPersistentMemoryFileSystem•Motivation•OctopusDesign•Evaluation•Conclusion
•ScalableandReliableRDMA
�
Modular-DesignedDistributedFileSystem
• DiskGluster• Diskfordatastorage• GigEforcommunication
•MemGluster• Memory fordatastorage• RDMAforcommunication
18ms
Latency(1KBwrite+sync)
98%
2%
Overall
HDD
Software
Network
324us
99.7%
Overall
MEM
Software
RDMA
Modular-DesignedDistributedFileSystem
• DiskGluster• Diskfordatastorage• GigEforcommunication
•MemGluster• Memory fordatastorage• RDMAforcommunication
Bandwidth (1MBwrite)
88MB/s
83MB/s
HDD
Software
118MB/sNetwork
6509MB/sMEM
Software
6350MB/sRDMA
94%
323MB/s6.9%
�
RDMA-enabledDistributedFileSystem•Cannotsimplyreplacethenetwork/storagemodule•Morethanfasthardware…•NewfeaturesofNVM• Byte-addressability• Datapersistency
•RDMAverbs•Write,Read,Atomics (memorysemantics)
• Send,Recv (channel semantics)
•Write-with-imm�
RDMA-enabledDistributedFileSystem
•Wechoosetoredesign theDFS!�
Byte-addressabilityofNVMOne-sidedRDMAverbs
Shareddatamanagements
CPUisthenewbottleneckNewdataflowstrategies
Write-with-Imm EfficientRPCdesign
RDMAAtomics Concurrent control
Opportunity Approaches
Outline•Octopus:aRDMA-enabledDistributedPersistentMemoryFileSystem•Motivation•OctopusDesign• High-ThroughputDataI/O• LowLatencyMetadataAccess
•Evaluation•Conclusion
•ScalableandReliableRDMA��
OctopusArchitecture
N2
NVMM
N2
NVMM
N3
NVMM
HCA HCA HCA
...
SharedPersistentMemoryPool
Client1 Client2 Self-IdentifiedRPC
RDMA-basedDataIO
create(“/home/cym”) Read(“/home/lyy”)
�<�9.:/8:6;�:.68<.�-2:.,<�-*<*�*,,.;;�3=;<�524.�*7�!,<89=;�=;.;�2<;�.201<�5.0;
��
1.SharedPersistentMemoryPool
•ExistingDFSs• Redundantdatacopy
Client Server
FSImage
UserSpaceBuffer
GlusterFS
PageCache
UserSpaceBuffer
mbuf
NICNIC
mbuf
�
7copy
1.SharedPersistentMemoryPool
•ExistingDFSs• Redundantdatacopy
Client Server
FSImage
UserSpaceBuffer
GlusterFS+DAX
PageCache
UserSpaceBuffer
mbuf
NICNIC
mbuf
�
6 copy
•OctopuswithSPMP• Introducesthesharedpersistentmemorypool• Globalviewofdatalayout
1.SharedPersistentMemoryPool
•ExistingDFSs• Redundantdatacopy
Client Server
FSImage
UserSpaceBuffer UserSpaceBuffer
mbuf
NICNIC
mbuf
��
MessagePool
4 copy
2.Client-ActiveDataI/O
• Server-Active• ServerthreadsprocessthedataI/O• WorkswellforslowEthernet• CPUscaneasilybecomethebottleneckwithfasthardware
• Client-Active• Letclientsread/writedatadirectlyfrom/totheSPMP
C1 timeNIC MEMCPUC2
Lookupfiledata Senddata
Lookupfiledata Sendaddress��
3.Self-IdentifiedMetadataRPC• Message-basedRPC
• easytoimplement, lowerthroughput• DaRPC,FaSST
• Memory-basedRPC• CPUcoresscanthemessagebuffer• FaRM
• Usingrdma_write_with_imm?• Scanbypolling• Imm dataforself-identification
�
MessagePoolHCA
Thead1 Theadn
HCA DATAID
MessagePoolHCA
Thead1 Theadn
HCA DATA
4.Collect-DispatchDistributedTransaction• mkdir,mknod operationsneeddistributedtransactions• Two-PhaseLocking&Commit
• Distributedlogging• Two-phase coordination
2*RPC��
Coordinator
Participant
Begin LogBegin
LocalLock wait
LogCommit/Abort
WriteData End
EndBegin
LogBegin
TxExec
LogContext
LogCommit/Abort
LocalLock
LogBegin
TxExec
LogContext
LogCommit/Abort
wait
LocalUnlock
WriteData
LocalUnlock
wait LogEnd
prepare commit
• mkdir,mknod operationsneeddistributedtransactions• Collect-DispatchTransaction
• Locallogging withremotein-placeupdate• Distributedlockservice
Coordinator
Participant
Begin LogBegin
LocalLock
wait LocalExec Log&Update
LocalUnlock
wait LocalLock
Collect
Commit/Abort End
UpdateUnlock
End
1*RPC+2*One-sidedcollect dispatch
EvaluationSetup•EvaluationPlatform
• ConnectedwithMellanoxSX1012switch•EvaluatedDistributedFileSystems•memGluster,runsonmemory,withRDMAconnection• NVFS[OSU],Crail[IBM],optimizedtorunonRDMA•memHDFS,Alluxio,forbigdatacomparison
�5=;<.: �"& �.68:A �877.,<)� ��# =6+.:
� ��� �� �� ����� %,7 ���
� ��� � � ��� %,7 ���
��
OverallEfficiency
75%
80%
85%
90%
95%
100%
0.<*<<: :.*--2:
�*<.7,A��:.*4-8?7
;8/<?*:. 6.6 7.<?8:4
�
����
���
���
����
����
���
����
(:2<. #.*-
�*7-?2-<1�&<252B*<287
;8/<?*:. 6.6 7.<?8:4
• Softwarelatencyisreducedto6us(85%ofthetotallatency)• Achievesread/writebandwidththatapproachestherawstorageandnetworkbandwidth
��
MetadataOperationPerformance
• OctopusprovidesmetadataIOPSintheorderof10#~10%• Octopuscanscaleslinearly
6���
6����
6����
� � � �
�� !�
05=;<.:/; 7>/;,:*25 -6/;,:*25�9855
8����
8����
8����
� � � �
��%�%%#
05=;<.:/; 7>/;,:*25 -6/;,:*25�9855
3����
3����
� � � �
#� !�
05=;<.:/; 7>/;,:*25 -6/;,:*25�9855
�
Read/WritePerformance
• Octopuscaneasilyreachthemaximumbandwidthofhardwarewithasingleclient• OctopuscanachievethesamebandwidthasCrailevenaddanextradatacopy[notshown]
6���
6����
6����
6���
�� �� ��� ��� ���� ���
(#�%�
05=;<.:/;
7>/;
,:*25�79
,:*25�9855
-6/;
6���
6����
6����
6���
�� �� ��� ��� ���� ���
#���
�
5629MB/s 6088MB/s
BigDataEvaluation
•Octopuscanalsoprovidebetterperformanceforbigdataapplicationsthanexistingfilesystems.
�
�
��
�
��
?:2<. :.*-
%.;<��$�! ���;�
6.6���$ �55=@28 '�$ �:*25 !,<89=;
���������������������
���
%.:*0.7 (8:-,8=7<
8:6*52B.-��@.,=<287�%26.
6.6���$ �55=@28 '�$ �:*25 !,<89=;
Conclusion•Octopusprovideshighefficiencybyredesigningthesoftware•Octopus’sinternalmechanisms• Simplifiesdatamanagementlayerby reducingdatacopies• RebalancesnetworkandserverloadswithClient-ActiveI/O• Redesigns themetadataRPCanddistributedtransactionwithRDMAprimitives
•EvaluationsshowthatOctopussignificantlyoutperformsexistingfilesystems
Outline•Octopus:aRDMA-enabledDistributedPersistentMemoryFileSystem•BackgroundandMotivation•OctopusDesign•Evaluation•Conclusion
•ScalableandReliableRDMA
�
RCishardtoscale
�
0
5
10
15
20
25
30
10 100 200 300 400 500 600 700 800
RCWrite UDSend
Throughp
ut(M
ops/s)
#.ofclients
p MCX353AConnectX-3 FDRHCA(singleport)p 1 servernodesendverbsto11 clientnodes
RC vsUDReliableConnection(RC)
UnreliableDatagram(UD)One-to-oneparadigm
QPQPQP
QPQPQP
QPQPQPQP
One-to-manyparadigm
p Offloadingwithone-sidedverbsp Higherperformancep Reliablep Flexible-sizedtransferringpHardtoscale
p Unreliable(riskofpacketloss,out-of-order,etc.)
p Cannotsupportone-sidedverbsp MTUisonly4KBp Goodscalability
WhyisRChardtoscale?
CPU1
CPU CPU … MEM
LastLevel Cache
2
Client
3
1 Memory-MappedI/O 2 PCIeDMARead 3 PacketSending
4 PCIeDMAWrite(DDIO enabled) 5 CPUPollsMessage
4
5
NICCacheNIC
CPU CPU CPU…MEM
LastLevel Cache
Server
NICCacheNIC
�
WhyisRChardtoscale?
n NICCache[1]p Mappingtablep QPstatesp Workqueueelements
n CPUCachep DDIOwritesdatatoLLCp Only10%reservedforDDIO
TwotypesofResourceContention:
With RC, the size of cached data is proportional to thenumber of clients!
…
…Server
Clients
MessagePool
�
Ourgoal:howtomakeRCscalablen FocusonRPC primitivewithRCwrite
p RPCisagoodabstraction,widelyusedp RCwrite(one-sided)hashigherthroughput(FaRM)
n Targetatone-to-manydatatransferringparadigmp e.g.,MDS,KVstore,parameterserver,etc.
n System-levelsolutionp Withoutanymodificationstothehardware
n Deploymentsp MetadataserverinOctopusp Distributedtransactionalsystem
�
1.Groupingtheconnections
S
C
C
CC
C
C
C
C
C
C
C
C
n NaïveApproachpNICcachethrashingwhenthenumberofclientsincreases
p Frequentswapin/outp CausinghigherPCIe traffic
0
20
40
0 50 100 150 200
RCWriteVerb
Throughp
ut(M
ops/s)
#.ofclients�
1.Groupingtheconnections
S
C
C
CC
C
C
C
C
C
C
C
C
n ConnectionGroupingp Serveonegroupatatimeslice
Time
�
1.Groupingtheconnections
S
C
C
CC
C
C
C
C
C
C
C
CTime
n ConnectionGroupingp Serveonegroupatatimeslicep Bettercachelocality:recentlyaccessedmetadataismorelikelybeusedagain
2.VirtualizedMappingn AlleviatethecontentionintheCPUcache
p Reducememoryfootprintinthemessagepooln Observations:
p Whengroupingtheclients,onlypartofthemessagepoolisused
Server
Clients …
…MessagePool
Currentgroup Unused messagebuffers
2.VirtualizedMappingnWedon’tneedtoassignamessagebufferforeachclient
p Virtualize asinglephysicalmessagepooltobeshared amongmultiplegroups
p Withoutextraoverheadforloading/savingthecontext
Server
Clients …
…
MessagePool
�
2.VirtualizedMapping
Server
Clients
…
MessagePoolSwitchingtothenextgroup
…
nWedon’tneedtoassignamessagebufferforeachclientp Virtualize asinglephysicalmessagepooltobeshared amongmultiplegroups
p Withoutextraoverheadforloading/savingthecontext
�
Other Challenges&solutionsn Staticgroupingis suboptimalwhen clientshave
p Varyingrequirementsforthetaillatencyp VaryingfrequenciesofthepostedRPCsp Varyingpayloadsizesp VaryingexecutiontimesfordifferenthandlersPriority-basedscheduler:monitorstheperformanceofeachclientsanddynamicallyadjustthegroupsizeandtimeslicelength.
n SwitchingbetweenthegroupsshouldbeefficientWarmuppool:beforebeingserved,clientsfromthenextgroupputtheirnewrequestsinthewarmuppoolfirst
�,8(017 03 &'�"*(1()1,�!����! ��43�!,10()1,��433,*8043�;08/��--0*0,38�!,7496*,�"/(603.� %49203 �/,3� %49<49��9� �0;9 "/9� 03 �=:8;A;���
Evaluationn Platform
p 2× 2.2GHzIntelXeonE5-2650v4CPUs(24coresintotal)p 128GBDRAMp MCX353ACX-3FDRHCAs(56Gbps IBand40GbE)p 12-nodeclusterconnectedwithMellanoxSX-1012switch
nComparedSystemsRPC Description
RawWrite RPC AbaselineRPCwith alltheoptimizationsinScaleRPC disabled
HERDRPC AscalableRPCwithahybrid ofUCwriteandUDsendverbs
FaSST RPC AscalableRPCbasedonUDsendverbs�
Evaluationn Throughput
0246810
60 160 260 360
RawWrite HERDFaSST ScaleRPC
Throughp
ut(M
ops/s)
#.ofclients(Batch=1)
05
10152025
60 160 260 360
RawWrite HERDFaSST ScaleRPC
#.ofclients(Batch=8)Th
roughp
ut(M
ops/s)
�
0
0.5
1
1.5
2
MKNOD RMNOD STAT READDIR
RawWrite ScaleRPC
EvaluationnMetadataServerinOctopus(DistributedFileSystem)
Throughp
ut(ops/s)
80clients
�
Thanks
��
&�'��*84597��(3�!����,3()1,+��07860)98,+� ,67078,38��,246<��01,�"<78,2� %49<49��9� �0;9 "/9� %49203 �/,3� #(4��0��03�&$� �)��%����&'�"*(1()1,�!����! ��43�!,10()1,��433,*8043�;08/��--0*0,38�!,7496*,�"/(603.� %49203 �/,3� %49<49��9� �0;9"/9� 03 �=:8;A;���
Top Related