
Linux Multipathing

Edward Goggin
EMC Corporation
[email protected]

Alasdair Kergon
Red Hat

[email protected]

Christophe Varoqui

[email protected]

David Olien
OSDL

[email protected]

Abstract

Linux multipathing provides io failover and path load sharing for multipathed block devices. In this paper, we provide an overview of the current device mapper based multipathing capability and describe Enterprise level requirements for future multipathing enhancements. We describe the interaction amongst kernel multipathing modules, user mode multipathing tools, hotplug, udev, and kpartx components when considering use cases. Use cases include path and logical unit re-configuration, partition management, and path failover for both active-active and active-passive generic storage systems. We also describe lessons learned during testing the MD scheme on high-end storage.

1 Introduction

Multipathing provides the host-side logic to transparently utilize the multiple paths of a redundant network to provide highly available and higher bandwidth connectivity between hosts and block level devices. Similar to how TCP/IP re-routes network traffic between two hosts, multipathing re-routes block io to an alternate path in the event of one or more path connectivity failures. Multipathing also transparently shares io load across the multiple paths to the same block device.

The history of Linux multipathing goes back at least three years and offers a variety of different architectures. The multipathing personality of the multidisk driver first provided block device multipathing in the 2.4.17 kernel. The Qlogic FC HBA driver has provided multipathing across Qlogic FC initiators since 2002. Storage system OEM vendors like IBM, Hitachi, HP, Fujitsu, and EMC have provided multipathing solutions for Linux for several years. Strictly software vendors like Veritas also provide Linux multipathing solutions.

In this paper, we describe the current 2.6 Linux kernel multipathing solution built around the kernel’s device mapper block io framework and consider possible enhancements. We first describe the high level architecture, focusing on both control and data flows. We then describe the key components of the new architecture residing in both user and kernel space. This is followed by a description of the interaction amongst these components and other user/kernel components when considering several key use cases. We then describe several outstanding architectural issues related to the multipathing architecture.

2 Architecture Overview

This chapter describes the overall architecture of Linux multipathing, focusing on the control and data paths spanning both user and kernel space multipathing components. Figure 1 is a block diagram of the kernel and user components that support volume management and multipath management.

Linux multipathing provides path failover and path load sharing amongst the set of redundant physical paths between a Linux host and a block device. Linux multipathing services are applicable to all block type devices (e.g., SCSI, IDE, USB, LOOP, NBD). While the notion of what constitutes a path may vary significantly across block device types, for the purpose of this paper we consider only the SCSI upper level protocol or session layer definition of a path—that is, one defined solely by its endpoints and thereby indifferent to the actual transport and network routing utilized between endpoints. A SCSI physical path is defined solely by the unique combination of a SCSI initiator and a SCSI target, whether using iSCSI, Fibre Channel transport, RDMA, or serial/parallel SCSI. Furthermore, since SCSI targets typically support multiple devices, a logical path is defined as the physical path to a particular logical device managed by a particular SCSI target. For SCSI, multiple logical paths, one for each different SCSI logical unit, may share the same physical path.

For the remainder of this paper, “physical path” refers to the unique combination of a SCSI initiator and a SCSI target, “device” refers to a SCSI logical unit, and a “path” or logical path refers to the logical connection along a physical path to a particular device.

The multipath architecture provides a clean separation of policy and mechanism, a highly modular design, a framework to accommodate extension to new storage systems, and well defined interfaces abstracting implementation details.

An overall philosophy of isolating mechanism in kernel resident code has led to the creation of several kernel resident frameworks utilized by many products including multipathing. A direct result of this approach has been the placement of a significant portion of the multipathing code in user space and the avoidance of a monolithic kernel resident multipathing implementation. For example, while actual path failover and path load sharing take place within kernel resident components, path discovery, path configuration, and path testing are done in user space.

Key multipathing components utilize frameworks in order to benefit from code sharing and to facilitate extensibility to new hardware. Both kernel and user space multipathing frameworks facilitate the extension of multipathing services to new storage system types, storage systems of currently supported types from new vendors, and new storage system models for currently supported vendor storage systems.

The device mapper is the foundation of the multipathing architecture. The device mapper provides a highly modular framework for stacking block device filter drivers in the kernel and communicating with these drivers from user space through a well defined libdevmapper API. Automated user space resident device/path discovery and monitoring components continually push configuration and policy information into the device mapper’s multipathing filter driver and pull configuration and path state information from this driver.
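
To illustrate the push side of this interface, the sketch below creates a multipath mapped device through libdevmapper. It is a minimal sketch: the libdevmapper calls shown are the library's standard task interface, but the device name and the multipath parameter string are hypothetical and their exact syntax is version dependent.

    #include <libdevmapper.h>

    /* Push a one-segment multipath map into the kernel: the segment
     * covers sectors 0..2097151 and names two paths (8:16 and 8:32)
     * in a single round-robin path group. */
    int create_mpath_device(void)
    {
        struct dm_task *dmt = dm_task_create(DM_DEVICE_CREATE);

        if (!dmt)
            return -1;
        if (!dm_task_set_name(dmt, "mpath0") ||
            !dm_task_add_target(dmt, 0, 2097152, "multipath",
                                "0 0 1 1 round-robin 0 2 1 "
                                "8:16 1000 8:32 1000") ||
            !dm_task_run(dmt)) {
            dm_task_destroy(dmt);
            return -1;
        }
        dm_task_destroy(dmt);
        return 0;
    }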


Figure 1: Overall architecture

The primary goals of the multipathing driver are to retry a failed block read/write io on an alternate path in the event of an io failure and to distribute io load across a set of paths. How each goal is achieved is controllable from user space by associating path failover and load sharing policy information on a per device basis.

It should also be understood that the multipath device mapper target driver and several multipathing sub-components are the only multipath cognizant kernel resident components in the Linux kernel.

3 Component Modules

The following sections describe the kernel and user mode components of the Linux multipathing implementation, and how those components interact.

3.1 Kernel Modules

Figure 2 is a block diagram of the kernel device mapper. Included in the diagram are components used to support volume management as well as the multipath system. The primary kernel components of the multipathing subsystem are

• the device mapper pseudo driver

• the multipathing device mapper target driver

• multipathing storage system Device Specific Modules (DSMs)

• a multipathing subsystem responsible for run time path selection.

3.1.1 Device Mapper

Figure 2: device mapper kernel architecture

The device mapper provides a highly modular kernel framework for stacking block device filter drivers. These filter drivers are referred to as target drivers and are comparable to multidisk personality drivers. Target drivers interact with the device mapper framework through a well defined kernel interface. Target drivers add value by filtering and/or redirecting read and write block io requests directed to a mapped device to one or more target devices. Numerous target drivers already exist, among them ones for logical volume striping, linear concatenation, and mirroring; software encryption; software raid; and various other debug and test oriented drivers.

The device mapper framework promotes a clean separation of policy and mechanism between user and kernel space respectively. Taking this concept even further, this framework supports the creation of a variety of services based on adding value to the dispatching and/or completion handling of block io requests, where the bulk of the policy and control logic can reside in user space and only the code actually required to effectively filter or redirect a block io request must be kernel resident.

The interaction between user and kernel device mapper components takes place through device mapper library interfaces. While the device mapper library currently utilizes a variety of synchronous ioctl(2) interfaces for this purpose, fully backward compatible migration to using Sysfs or Configfs instead is certainly possible.

The device mapper provides the kernel resident mechanisms which support the creation of different combinations of stacked target drivers for different block devices. Each io stack is represented at the top by a single mapped device. Mapped device configuration is initiated from user space via device mapper library interfaces. Configuration information for each mapped device is passed into the kernel within a map or table containing one or more targets or segments. Each map segment consists of a start sector and length and a target driver specific number of target driver parameters. Each map segment identifies one or more target devices. Since all sectors of a mapped device must be mapped, there are no mapping holes in a mapped device.
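
For example, a hypothetical two-segment map built from two linear targets could be expressed in the standard start/length/target-type/parameters layout (device names illustrative):

    0    1024 linear /dev/sda 0
    1024 1024 linear /dev/sdb 0

Each line maps a range of the mapped device's sectors onto a target device and a starting offset within it.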

Device mapper io stacks are configured in bottom-up fashion. Target driver devices are stacked by referencing a lower level mapped device as a target device of a higher level mapped device. Since a single mapped device may map to one or more target devices, each of which may themselves be a mapped device, a device mapper io stack may be more accurately viewed as an inverted device tree with a single mapped device as the top or root node of the inverted tree. The leaf nodes of the tree are the only target devices which are not device mapper managed devices. The root node is only a mapped device. Every non-root, non-leaf node is both a mapped and target device. The minimum device tree consists of a single mapped device and a single target device. A device tree need not be balanced as there may be device branches which are deeper than others. The depth of the tree may be viewed as the tree branch which has the maximum number of transitions from the root mapped device to a leaf node target device. There are no design limits on either the depth or breadth of a device tree.

Although each target device at each level of a device mapper tree is visible and accessible outside the scope of the device mapper framework, concurrent open of a target device for other purposes requiring its exclusive use, such as is required for partition management and file system mounting, is prohibited. Target devices are exclusively recognized or claimed by a mapped device by being referenced as a target of a mapped device. That is, a target device may only be used as a target of a single mapped device. This restriction prohibits both the inclusion of the same target device within multiple device trees and multiple references to the same target device within the same device tree; that is, loops within a device tree are not allowed.

It is strictly the responsibility of user space components associated with each target driver to

• discover the set of target devices associated with each mapped device managed by that driver

• create the mapping tables containing this configuration information

• pass the mapping table information into the kernel

• possibly save this mapping information in persistent storage for later retrieval.

The multipath path configurator fulfills this role for the multipathing target driver. The lvm(8), dmraid(8), and dmsetup(8) commands perform these tasks for the logical volume management, software raid, and device encryption target drivers respectively.

While the device mapper registers with the kernel as a block device driver, target drivers in turn register callbacks with the device mapper for initializing and terminating target device metadata; suspending and resuming io on a mapped device; filtering io dispatch and io completion; and retrieving mapped device configuration and status information. The device mapper also provides key services (e.g., io suspension/resumption, bio cloning, and the propagation of io resource restrictions) for use by all target drivers to facilitate the flow of io dispatch and io completion events through the device mapper framework.
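
The shape of that registration interface can be sketched as below. The structure fields mirror the callbacks just described; the exact signatures drifted across 2.6 kernel versions, and every "mypath" identifier here is hypothetical rather than the actual multipath driver's code.

    #include <linux/module.h>
    #include <linux/device-mapper.h>

    /* Callback stubs; bodies omitted.  ctr parses the map segment
     * parameters, map filters/redirects bio dispatch, end_io filters
     * bio completion, and dtr tears the target metadata down. */
    static int  mypath_ctr(struct dm_target *ti, unsigned int argc, char **argv);
    static void mypath_dtr(struct dm_target *ti);
    static int  mypath_map(struct dm_target *ti, struct bio *bio,
                           union map_info *map_context);
    static int  mypath_end_io(struct dm_target *ti, struct bio *bio,
                              int error, union map_info *map_context);

    static struct target_type mypath_target = {
        .name    = "mypath",       /* referenced by name from map tables */
        .version = {1, 0, 0},
        .module  = THIS_MODULE,
        .ctr     = mypath_ctr,
        .dtr     = mypath_dtr,
        .map     = mypath_map,
        .end_io  = mypath_end_io,
        /* .status and .message serve the pull/message interfaces */
    };

    static int __init mypath_init(void)
    {
        return dm_register_target(&mypath_target);
    }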

The device mapper framework is itself a component driver within the outermost generic_make_request framework for block devices. The generic_make_request framework also provides for stacking block device filter drivers. Therefore, given this architecture, it should be at least architecturally possible to stack device mapper drivers both above and below multidisk drivers for the same target device.

The device mapper processes all read and write block io requests which pass through the block io subsystem’s generic_make_request and/or submit_bio interfaces and are directed to a mapped device. Architectural symmetry is achieved for io dispatch and io completion handling since io completion handling within the device mapper framework is done in the inverse order of io dispatch. All read/write bios are treated as asynchronous io within all portions of the block io subsystem. This design results in separate, asynchronous and inversely ordered code paths through both the generic_make_request and the device mapper frameworks for both io dispatch and completion processing. A major impact of this design is that it is not necessary to process either an io dispatch or an io completion immediately or in the same context in which it is first seen.

Bio movement through a device mapper device tree may involve fan-out on bio dispatch and fan-in on bio completion. As a bio is dispatched down the device tree, at each mapped device one or more cloned copies of the bio are created and sent to target devices. The same process is repeated at each level of the device tree where a target device is also a mapped device. Therefore, assuming a very wide and deep device tree, a single bio dispatched to a mapped device can branch out to spawn a practically unbounded number of bios to be sent to a practically unbounded number of target devices. Since bios are potentially coalesced at the device at the bottom of the generic_make_request framework, the io request(s) actually queued to one or more target devices at the bottom may bear little relationship to the single bio initially sent to a mapped device at the top. For bio completion, at each level of the device tree, the target driver managing the set of target devices at that level consumes the completion for each bio dispatched to one of its devices, and passes up a single bio completion for the single bio dispatched to the mapped device. This process repeats until the original bio submitted to the root mapped device is completed.

The device mapper dispatches bios recursively from top (root node) to bottom (leaf node) through the tree of device mapper mapped and target devices in process context. Each level of recursion moves down one level of the device tree from the root mapped device towards one or more leaf target nodes. At each level, the device mapper clones a single bio to one or more bios, depending on target mapping information previously pushed into the kernel for each mapped device in the io stack, since a bio is not permitted to span multiple map targets/segments. Also at each level, each cloned bio is passed to the map callout of the target driver managing a mapped device. The target driver has the option of

1. queuing the io internal to that driver to be serviced at a later time by that driver,

2. redirecting the io to one or more different target devices and possibly a different sector on each of those target devices, or

3. returning an error status for the bio to the device mapper.

Both the first and third options stop the recursion through the device tree and, for that matter, through the generic_make_request framework. Otherwise, a bio being directed to the first target device which is not managed by the device mapper causes the bio to exit the device mapper framework, although the bio continues recursing through the generic_make_request framework until the bottom device is reached.
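
A skeletal map callout makes the three options concrete. This sketch assumes the early-2.6 convention that returning 0 means the bio was consumed (queued internally), returning 1 means it was remapped and should be redispatched, and a negative value reports an error; all "mypath" names are hypothetical.

    #include <linux/device-mapper.h>
    #include <linux/bio.h>
    #include <linux/errno.h>

    struct mypath_ctx;  /* hypothetical per-target state */
    static struct block_device *mypath_select_path(struct mypath_ctx *ctx);
    static int mypath_queue_internally(struct mypath_ctx *ctx, struct bio *bio);

    static int mypath_map(struct dm_target *ti, struct bio *bio,
                          union map_info *map_context)
    {
        struct mypath_ctx *ctx = ti->private;
        struct block_device *path = mypath_select_path(ctx);

        if (!path) {
            /* Option 1: hold the bio inside the driver for later. */
            if (mypath_queue_internally(ctx, bio))
                return 0;
            /* Option 3: report an error back to the device mapper. */
            return -EIO;
        }

        /* Option 2: redirect the bio to the chosen target device. */
        bio->bi_bdev = path;
        return 1;
    }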

The device mapper processes bio completions recursively from a leaf device to the root mapped device in soft interrupt context. At each level in a device tree, bio completions are filtered by the device mapper as a result of redirecting the bio completion callback at that level during bio dispatch. The device mapper callout to the target driver responsible for servicing a mapped device is enabled by associating a target_io structure with the bi_private field of a bio, also during bio dispatch. In this fashion, each bio completion is serviced by the target driver which dispatched the bio.

The device mapper supports a variety of push/pull interfaces to enhance communication across the system call boundary. Each of these interfaces is accessed from user space via the device mapper library, which currently issues ioctls to the device mapper character interface. The occurrence of target driver derived io related events can be passed to user space via the device mapper event mechanism. Target driver specific map contents and mapped device status can be pulled from the kernel using device mapper messages. Typed messages and status information are encoded as ASCII strings and decoded back to their original form as dictated by their type.

3.1.2 Multipath Target Driver

A multipath target driver is a component driver of the device mapper framework. Currently, the multipath driver is position dependent within a stack of device mapper target drivers: it must be at the bottom of the stack. Furthermore, there may not be other filter drivers (e.g., multidisk) stacked underneath it. It must be stacked immediately atop a driver which services a block request queue, for example, /dev/sda.

The multipath target receives configuration information for multipath mapped devices in the form of messages sent from user space through device mapper library interfaces. Each message is typed and may contain parameters in a position dependent format according to message type. The information is transferred as a single ASCII string which must be encoded by the sender and decoded by the receiver.

The multipath target driver provides path failover and path load sharing. Io failure on one path to a device is captured and retried down an alternate path to the same device. Only after all paths to the same device have been tried and failed is an io error actually returned to the io initiator. Path load sharing enables the distribution of bios amongst the paths to the same device according to a path load sharing policy.

Abstractions are utilized to represent key entities. A multipath corresponds to a device. A logical path to a device is represented by a path. A path group provides a way to associate paths to the same device which have similar attributes. There may be multiple path groups associated with the same device. A path selector represents a path load sharing algorithm and can be viewed as an attribute of a path group. Round robin based path selection amongst the set of paths in the same path group is currently the only available path selector. Storage system specific behavior can be localized within a multipath hardware handler.
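
A simplified sketch of how these abstractions nest (all field and type names here are hypothetical, not the driver's actual definitions):

    struct block_device;
    struct path_selector;   /* load sharing policy instance */
    struct hw_handler;      /* optional storage system specific DSM */

    /* One entry per logical path to the device. */
    struct path {
        struct block_device *bdev;   /* the underlying target device */
        int is_active;               /* active or failed */
    };

    /* Paths to the same device which share similar attributes. */
    struct path_group {
        struct path_selector *ps;    /* picks the next path in the group */
        unsigned nr_paths;
        struct path *paths;
    };

    /* One multipath per device: a prioritized set of path groups. */
    struct multipath {
        struct hw_handler *hwh;
        unsigned nr_groups;
        struct path_group *groups;
    };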

The multipath target driver utilizes two sub-component frameworks to enable both storage system specific behavior and path selection algorithms to be localized in separate modules which may be loaded and managed separately from the multipath target driver itself.


3.1.3 Device Specific Module

A storage system specific component can be associated with each target device type and is referred to as a hardware handler or Device Specific Module (DSM). A DSM allows for the specification of kernel resident storage system specific path group initialization, io completion filtering, and message handling. Path group initialization uses storage system specific actions to activate the passive interface of an active-passive storage system. Storage system specific io completion filtering enables storage system specific error handling. Storage system specific message handling enables storage system specific configuration.

The DSM type is specified by name in the multipath target driver map configuration string and must refer to a DSM pre-loaded into the kernel. A DSM may be passed parameters in the configuration string. A hardware context structure passed to each DSM enables a DSM to track state associated with a particular device.

Associating a DSM with a block device type is optional. The EMC CLARiion DSM is currently the only DSM.

3.1.4 Path Selection Subsystem

A path selector enables the distribution of io amongst the set of paths in a single path group.

The path selector type is specified by name in the multipath target driver map configuration string and must refer to a path selector pre-loaded into the kernel. A path selector may be passed parameters in the configuration string. The path selector context structure enables a path selector type to track state across multiple ios to the paths of a path group.

Each path group must be associated with a path selector. A single round robin path selector exists today.

3.2 User Modules

Figure 3 outlines the architecture of the user mode multipath tools. Multipath user space components perform path discovery, path policy management and configuration, and path health testing. The multipath configurator is responsible for discovering the network topology for multipathed block devices and for updating kernel resident multipath target driver configuration and state information. The multipath daemon monitors the usability of paths, both in response to actual errors occurring in the kernel and proactively via periodic path health testing. Both components share path discovery and path health testing services. Furthermore, these services are implemented using an extensible framework to facilitate multipath support for new block device types, block devices from new vendors, and new models. The kpartx tool creates mapped devices for partitions of multipath managed block devices.

Figure 3: multipath tools architecture

3.2.1 Multipath Configurator

Path discovery involves determining the set of routes from a host to a particular block device which is configured for multipathing. Path discovery is implemented by scanning Sysfs, looking for block device names from a multipath configuration file which designate block device types eligible for multipathing. Each entry in /sys/block corresponds to the gendisk for a different block device. As such, path discovery is independent of whatever path transport is used between host and device. Since devices are assumed to have an identifier attribute which is unique in both time and space, the cumulative set of paths found from Sysfs is coalesced based on device UID. Configuration driven multipath attributes are set up for each of these paths.
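
A minimal user space sketch of this Sysfs scan, assuming only that each block device appears as a directory entry under /sys/block:

    #include <stdio.h>
    #include <dirent.h>

    /* Enumerate candidate paths: each entry under /sys/block is the
     * gendisk of one block device, i.e. one logical path. */
    int scan_sys_block(void)
    {
        DIR *dir = opendir("/sys/block");
        struct dirent *de;

        if (!dir)
            return -1;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_name[0] == '.')
                continue;
            /* A real configurator would match the name against the
             * configuration file and then derive the device UID. */
            printf("candidate path: %s\n", de->d_name);
        }
        closedir(dir);
        return 0;
    }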

The multipath configurator synchronizes path configuration and path state information across both user and kernel multipath components. The current configuration path state is compared with the path state pulled from the multipath target driver. Most discrepancies are dealt with by pushing the current configuration and state information into the multipath target driver. This includes creating a new multipath map for a newly discovered device; changing the contents of an existing multipath map for a newly discovered path to a known device, for a path to a known device which is no longer visible, and for configuration driven multipath attributes which may have changed; and updating the state of a path.

Configuration and state information is passed between user and kernel space multipath components as position dependent information in a single string. The entire map for a mapped device is transferred as a single string and must be encoded before and decoded after the transfer.

The multipath configurator can be invoked manually at any time or automatically in reaction to a hotplug event generated for a configuration change for a block device type managed by the multipathing subsystem. Configuration changes involve either the creation of a new path or the removal of an existing path.

3.2.2 Multipath Daemon

The multipath daemon actively tests paths and reacts to changes in the multipath configuration.

Periodic path testing performed by the multipath daemon is responsible for both restoring failed paths to an active state and proactively failing active paths which fail a path test. Currently, while the default is to test all active and failed paths for all devices every 5 seconds, this interval can be changed via a configuration directive in the multipath configuration file. The current non-optimized design could be enhanced to reduce path testing overhead by

• testing the physical transport components instead of the logical ones

• varying the periodic testing interval.

An example of the former for SCSI block devices is to

• associate across all devices those paths which utilize common SCSI initiators and targets, and

• for each test interval, test only one path for every unique combination of initiator and target.

An example of the latter is to vary the periodic test interval in relationship to the recent past history of the path or physical components; that is, paths which fail often get tested more frequently.
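
The testing interval described above is controlled by a directive in the multipath configuration file; a minimal illustrative snippet (the directive name matches the multipath-tools of this era, but treat it as an assumption):

    defaults {
        polling_interval 5    # seconds between path tests
    }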

The multipath daemon learns of and reacts to changes in both the current block device configuration and the kernel resident multipathing configuration. The addition of a new path or the removal of an already existing path to an already managed block device is detected over a netlink socket as a uevent triggered callback which adds or removes the path to or from the set of paths which will be actively tested. Changes to the kernel resident multipathing state are detected as device-mapper generated event callbacks. Events of this kind involve block io errors, path state changes, and changes in the highest priority path group for a mapped device.


3.2.3 Multipath Framework

The multipath framework enables the use of block device vendor specific algorithms for

1. deriving a UID for identifying the physical device associated with a logical path

2. testing the health of a logical path

3. determining how to organize the logical paths to the same device into separate sets,

4. assigning a priority to each path

5. determining how to select the next path within the same path set

6. specifying any kernel resident device specific multipathing capabilities.

While the last two capabilities must be kernel resident, the remaining user resident capabilities are invoked either as functions or executables. All but items four and six are mandatory. A built-in table specifying each of these capabilities for each supported block device vendor and type may, but need not, be overridden by configuration directives in a multipath configuration file. Block device vendor and type are derived from attributes associated with the Sysfs device file probed during device discovery. Configuration file directives may also be used to configure these capabilities for a specific storage system instance.

A new storage system is plugged into this user space multipath framework by specifying a configuration table or configuration file entry for the storage system and providing any necessary, currently missing mechanisms needed to satisfy the six services mentioned above for the storage system. The service selections are specified as string or integer constants. In some cases, the selection is made from a restricted domain of options. In other cases a new mechanism can be utilized to provide the required service. Services which are invoked as functions must be integrated into the multipath component libraries while those invoked as executables are not so restricted. Default options provided for each service may also be overridden in the multipath configuration file.

Since the service which derives a UID for a multipath device is currently invoked from the multipath framework as an executable, the service may be, and in fact is now, external to the multipath software. All supported storage systems (keep in mind they are all SCSI at the moment) utilize scsi_id(8) to derive a UID for a SCSI logical unit. Almost all of these cases obtain the UID directly from the Vendor Specified Identifier field of an extended SCSI inquiry command using vital product data page 0x83. This is indeed the default option. Although scsi_id is invoked as an executable today, a scsi_id service library appears to be in plan, thereby allowing in-context UID generation from this framework in the near future.
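
For example, the framework of this era invoked scsi_id against a path's Sysfs name along these lines (option spellings varied across scsi_id versions, so treat the exact flags as an assumption):

    scsi_id -g -u -s /block/sda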

Path health testing is invoked as a service function built into the libcheckers multipath library. While SCSI specific path testing functions already exist in this library based on reading sector 0 (this is the default) and issuing a TUR (Test Unit Ready command), path health testing can be specified to be storage system specific but must be included within libcheckers.

The selection of how to divide up the paths to the same device into groups is restricted to a set of five options:

• failover

• multibus

• group-by-priority

• group-by-serial


• group-by-node-name.

Failover, the default policy, implies one path per path group and can be used to disallow path load sharing while still providing path failover. Multibus, by far the most commonly selected option, implies one path group for all paths and is used in most cases when access is symmetric across all paths, e.g., active-active storage systems. Group-by-priority implies a grouping of paths with the same priority. This option is currently used only by the active-passive EMC CLARiion storage array and provides the capability to assign a higher priority to paths connecting to the portion of the storage system which has previously been assigned to be the default owner of the SCSI logical unit. Group-by-serial implies a grouping based on the Vendor Specified Identifier returned by a VPD page 0x80 extended SCSI inquiry command. This is a good way to group paths for an active-passive storage system based on which paths are currently connected to the active portion of the storage system for the SCSI logical unit. Group-by-node-name currently implies a grouping by SCSI target.
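
A grouping policy is normally selected per storage system type through the built-in table, but it can be overridden in the multipath configuration file; a hypothetical device entry:

    devices {
        device {
            vendor                "ACME"
            product               "SuperArray"
            path_grouping_policy  multibus
        }
    }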

Paths to the same device can be assigned priorities in order both to enable the group-by-priority path grouping policy and to affect path load sharing. Path group priority is a summation of the priority for each active path in the group. An io is always directed to a path in the highest priority path group. The get_priority service is currently invoked as an executable. The default option is to not assign a priority to any path, which leads to all path groups being treated equally. The pp_balance_paths executable assigns path priority in order to attempt to balance path usage for all multipath devices across the SCSI targets to the same storage system. Several storage system specific path priority services are also provided.

Path selectors and hardware contexts are specified by name and must refer to specific kernel resident services. A path selector is mandatory, and currently the only option is round-robin. A hardware context is by definition storage system specific. Selection of a hardware context is optional, and only the EMC CLARiion storage system currently utilizes one. Each may be passed parameters, specified as a count followed by each parameter.

3.2.4 Kpartx

The kpartx utility creates device-mapper mapped devices for the partitions of multipath managed block devices. Doing so allows a block device partition to be managed within the device mapper framework as would be any whole device. This is accomplished by reading and parsing a target device’s partition table and setting up the device-mapper table for the mapped device from the start address and length fields of the partition table entry for the partition in question. Kpartx uses the same device mapper library interfaces as does the multipath configurator in order to create and initialize the mapped device.
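
Typical kpartx usage, assuming a multipath mapped device already exists under /dev/mapper (the device name is hypothetical):

    kpartx -a /dev/mapper/mpath0    # create a mapped device per partition
    kpartx -l /dev/mapper/mpath0    # list the discovered partition mappings
    kpartx -d /dev/mapper/mpath0    # remove the partition mappings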

4 Interaction Amongst Key Kernel and User Components

The interaction between key user and kernel multipath components will be examined while considering several use cases. Device and path configuration will be considered first. Io scheduling and io failover will then be examined in detail.


4.1 Block Device and Path Discovery

Device discovery consists of obtaining information about both the current and previous multipath device configurations, resolving any differences, and pushing the resultant updates into the multipath target driver. While these tasks are primarily the responsibility of the multipath configurator, many of the device discovery services are in fact shared with the multipath daemon.

The device discovery process utilizes the common services of the user space multipath framework. Framework components to identify, test, and prioritize paths are selected from pre-established table or config driven policy options based on device attributes obtained from probing the device’s Sysfs device file.

The discovery of the current configuration is done by probing block device nodes created in Sysfs. A block device node is created by udev in reaction to a hotplug event generated when a block device’s request queue is registered with the kernel’s block subsystem. Each device node corresponds to a logical path to a block device, since no kernel resident component other than the multipath target driver is multipath cognizant.

The set of paths for the current configuration is coalesced amongst the set of multipath managed block devices. Current path and device configuration attributes are retrieved from configuration file and/or table entries.

The previous device configuration, stored in the collective set of multipath mapped device maps, is pulled from the multipath target driver using target driver specific message ioctls issued by the device-mapper library.

Discrepancies between the old and new device configurations are settled and the updated device configuration and state information is pushed into the multipath target driver one device at a time. Several use cases are enumerated below.

• A new mapped device is created for a multipath managed device from the new configuration which does not exist in the old configuration.

• The contents of an existing multipath map are updated for a newly discovered path to a known device, for a path to a known device which is no longer visible, and for multipath attributes which may have changed. Examples of multipath attributes which can initiate an update of the kernel multipath device configuration are enumerated below.

– device size

– hardware handler

– path selector

– multipath feature parameters

– number of path groups

– assignment of paths to path groups

– highest priority path group

• Path state is updated based on path testing done during device discovery.

Configuration updates to an existing multipath mapped device involve the suspension and subsequent resumption of io around the complete replacement of the mapped device’s device-mapper map. Io suspension both blocks all new io to the mapped device and flushes all io from the mapped device’s device tree. Path state updates are done without requiring map replacement.

Hotplug initiated invocation of the multipath configurator leads to semi-automated multipath response to post-boot time changes in the block device configuration. For SCSI target devices, a hotplug event is generated for a SCSI target device when the device’s gendisk is registered after the host attach of a SCSI logical unit and unregistered after the host detach of a SCSI logical unit.

4.2 Io Scheduling

The scheduling of bios amongst the multiple multipath target devices for the same multipath mapped device is controlled by both a path grouping policy and a path selection policy. While both path group membership and path selection policy assignment tasks are performed in user space, actual io scheduling is implemented via kernel resident mechanism.

Paths to the same device can be separated into path groups, where all paths in the same group have similar path attributes. Both the number of path groups and path membership within a group are controlled by the multipath configurator based on one of five possible path grouping policies. Each path grouping policy uses different means to assign a path to a path group in order to model the different behavior in the physical configuration. Each path is assigned a priority via a designated path priority callout. Path group priority is the summation of the path priorities for each path in the group. Each path group is assigned a path selection policy governing the selection of the next path to use when scheduling io to a path within that group.

Path group membership and path selection information are pushed into the kernel, where they are then utilized by multipath kernel resident components to schedule each bio on one of the multipath paths. This information consists of the number of path groups, the highest priority path group, the path membership for each group (target devices specified by dev_t), the name of the path selection policy for each group, a count of optional path selection policy parameters, and the actual path selection policy parameters if the count value is not zero. As is the case for all device mapper map contents passed between user and kernel space, the collective contents are encoded and passed as a single string, and decoded on the other side according to their position dependent context.
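
Laid out in that order, a hypothetical map for a device with two path groups of two paths each might be encoded as follows (shown wrapped for readability; the exact field syntax is version dependent and the device numbers are illustrative):

    0 2097152 multipath 0 0 2 1
        round-robin 0 2 1 8:16 1000 8:32 1000
        round-robin 0 2 1 8:48 1000 8:64 1000

Reading left to right: the segment start and length; the multipath target type; zero feature parameters and zero hardware handler parameters; two path groups with group one initially highest priority; then, per group, a round-robin selector with zero selector parameters, two paths with one parameter each, and each path's dev_t followed by its repeat count.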

Path group membership and path selection information is pushed into the kernel both when a multipath mapped device is first discovered and configured, and later when the multipath configurator detects that any of this information has changed. Both cases involve pushing the information into the multipath target driver within a device mapper map or table. The latter case also involves suspending and resuming io to the mapped device during the time the map is updated.

Path group and path state are also pushed into the kernel by the multipath configurator independently of a multipath mapped device’s map. A path’s state can be either active or failed. Io is only directed by the multipath target driver to a path with an active path state. Currently a path’s state is set to failed either by the multipath target driver after a single io failure on the path or by the multipath configurator after a path test failure. A path’s state is restored to active only in user space, after a multipath configurator initiated path test succeeds for that path. A path group can be placed into bypass mode, removed from bypass mode, or made the highest priority path group for a mapped device. When searching for the next path group to use when there are no active paths in the highest priority path group, unless a new path group has been designated as the highest priority group, all path groups are searched. Otherwise, path groups in bypass mode are first skipped over and selected only if there are no path groups for the mapped device which are not in bypass mode.
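
From user space, these per-path and per-group state changes are pushed through target driver messages, for example via dmsetup (the message verbs shown reflect the multipath target of this era and should be treated as an assumption; device and path names are illustrative):

    dmsetup message mpath0 0 "fail_path 8:16"
    dmsetup message mpath0 0 "reinstate_path 8:16"
    dmsetup message mpath0 0 "switch_group 2"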


The path selection policy name must refer to an already kernel resident path selection policy module. Path selection policy modules register a half dozen callbacks with the multipath target driver’s path selection framework, the most important of which is invoked in the dispatch path of a bio by the multipath target driver to select the next path to use for the bio.
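
The registration side of that framework can be sketched as follows; this is a simplified sketch of the kernel's path selector interface of this era (the real header declares more callbacks and the signatures vary), with a round-robin style repeat count on the select callback.

    struct path;            /* a logical path, as managed by the driver */
    struct path_selector;   /* per path group selector instance */

    struct path_selector_type {
        const char *name;
        /* build and tear down per group selector state */
        int  (*create)(struct path_selector *ps, unsigned argc, char **argv);
        void (*destroy)(struct path_selector *ps);
        /* the key callback: invoked on bio dispatch to pick the next
         * path; repeat_count says how many bios may reuse that path */
        struct path *(*select_path)(struct path_selector *ps,
                                    unsigned *repeat_count);
    };

    int dm_register_path_selector(struct path_selector_type *pst);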

Io scheduling triggered during the multipath target driver’s bio dispatch callout from the device mapper framework consists of first selecting a path group for the mapped device in question, then selecting the active path to use within that group, followed by redirecting the bio to the selected path. A cached value of the path group to use is saved with each multipath mapped device in order to avoid its recalculation for each bio redirection to that device. This cached value is initially set to the highest priority path group and is recalculated if either

• the highest priority path group for a mapped device is changed from user space, or

• the highest priority path group is put into bypassed mode by either kernel or user space multipathing components.

A cached value of the path to use within the highest priority group is recalculated by invoking the path selection callout of a path selection policy whenever

• a configurable number of bios have already been redirected on the current path,

• a failure occurs on the current path,

• any other path gets restored to a usable state, or

• the highest priority path group is changed via either of the two methods discussed earlier.

Due to architectural restrictions and the relatively high positioning (compared with physical drivers) of the multipath target driver in the block io stack, it is difficult to implement path selection policies which take into account the state of shared physical path resources without implementing significant new kernel resident mechanism. Path selection policies are limited in scope to the path members of a particular path group for a particular multipath mapped device. This multipath architectural restriction, together with the difficulty of tracking resource utilization for physical path resources from a block level filter driver, makes it difficult to implement path selection policies which could attempt to minimize the depth of target device request queues or the utilization of SCSI initiators. Path selectors tracking physical resources possibly shared amongst multiple hosts (e.g., SCSI targets) face even more difficulties.

The path selection algorithms are also impacted architecturally by being positioned above the point at the bottom of the block io layer where bios are coalesced into io requests. To help deal with this impact, path reselection within a priority group is done only for every n bios, where n is a configurable repeat count value associated with each use of a path selection policy for a priority group. Currently the repeat count value is set to 1000 for all cases in order to limit the adverse throughput effects of dispersing bios amongst multiple paths to the same device, which would otherwise negate the ability of the block io layer to coalesce these bios into larger io requests submitted to the request queue of bottom level target devices.

A single round-robin path selection policy exists today. This policy selects the least recently used active path in the current path group for the particular mapped device.


4.3 Io Failover

While actual failover of io to alternate paths is performed in the kernel, path failover is controlled via configuration and policy information pushed into the kernel multipath components from user space multipath components.

While the multipath target driver filters both io dispatch and completion for all bios sent to a multipath mapped device, io failover is triggered when an error is detected while filtering io completion. An understanding of the error handling taking place underneath the multipath target driver is useful at this point. Assuming SCSI target devices as leaf nodes of the device mapper device tree, the SCSI mid-layer followed by the SCSI disk class driver each parse the result field of the scsi_cmd structure set by the SCSI LLDD. While the SCSI mid-layer and class driver filter code filter out some error states as being benign, all other cases lead to failing all bios associated with the io request corresponding to the SCSI command with -EIO. For those SCSI errors which provide sense information, the SCSI sense key, Additional Sense Code (ASC), and Additional Sense Code Qualifier (ASCQ) byte values are set in the bi_error field of each bio. The -EIO, SCSI sense key, ASC, and ASCQ are propagated to all parent cloned bios and are available for access by any target driver managing target devices as the bio completions recurse back up to the top of the device tree.

Io failures are first seen as a non-zero error status (i.e., -EIO) in the error parameter passed to the multipath target driver’s io completion filter. This filter is called as a callout from the device mapper’s bio completion callback associated with the leaf node bios. Assuming one exists, all io failures are first parsed by the storage system’s hardware context’s error handler. Error parsing drives what happens next for the path, path group, and bio associated with the io failure. The path can be put into a failed state or left unaffected. The path group can be placed into a bypassed state or left unaffected. The bio can be queued for retry internally within the multipath target driver or failed. The actions on the path, the path group, and the bio are independent of each other. A failed path is unusable until restored to a usable state from the user space multipath configurator. A bypassed path group is skipped over when searching for a usable path, unless there are no usable paths found in other non-bypassed path groups. A failed bio leads to the failure of all parent cloned bios at higher levels in the device tree.

Io retry is done exclusively in a dedicated multipath worker thread context. Using a worker thread context allows for blocking in the code path of an io retry which requires a path group initialization or which gets dispatched back to generic_make_request—either of which may block. This is necessary since the bio completion code path through the device mapper is usually done within a soft interrupt context. Using a dedicated multipath worker thread avoids delaying the servicing of non-multipath related work queue requests as would occur by using the kernel’s default work queue.
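
The pattern amounts to a dedicated single-threaded workqueue; a minimal sketch using the kernel workqueue interface in its current two-argument INIT_WORK form (the queue and function names are hypothetical, and the real driver differs in detail):

    #include <linux/workqueue.h>
    #include <linux/errno.h>

    static struct workqueue_struct *kmypathd;   /* dedicated retry thread */
    static struct work_struct retry_work;

    /* Runs in process context, so it may block on path group
     * initialization or on redispatch via generic_make_request(). */
    static void mypath_retry_fn(struct work_struct *work)
    {
        /* requeue previously failed bios down alternate paths here */
    }

    static int mypath_retry_init(void)
    {
        kmypathd = create_singlethread_workqueue("kmypathd");
        if (!kmypathd)
            return -ENOMEM;
        INIT_WORK(&retry_work, mypath_retry_fn);
        queue_work(kmypathd, &retry_work);
        return 0;
    }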

Io scheduling for path failover follows basically the same path selection algorithm as that for an initial io dispatch which has exhausted its path repeat count and must select an alternate path. The path selector for the current path group selects the best alternative path within that path group. If none are available, the next highest priority path group is made current and its path selector selects the best available path. This algorithm iterates until all paths of all path groups have been tried.

The device mapper’s kernel resident event mechanism enables user space applications to determine when io related events occur in the kernel for a mapped device. Events are generated by the target driver managing a particular mapped device. The event mechanism is accessed via a synchronous device mapper library interface which blocks a thread in the kernel in order to wait for an event associated with a particular mapped device. Only the event occurrence is passed to user space. No other attribute information of the event is communicated.

The occurrence of a path failure event (along with path reinstatement and a change in the highest priority path group) is communicated from the multipath target driver to the multipath daemon via this event mechanism. A separate multipath daemon thread is allocated to wait for all multipath events associated with each multipath mapped device. The detection of any multipath event causes the multipath daemon to rediscover its path configuration and synchronize its path configuration, path state, and path group state information with the multipath target driver’s view of the same.

A previously failed path is restored to an active state only as a result of passing a periodic path health test issued by the multipath daemon for all paths, failed or active. This path state transition is currently enacted by the multipath daemon invoking the multipath configurator as an executable.

An io failure is visible above the multipathing mapped device only when all paths to the same device have been tried once. Even then, it is possible to configure a mapped device to queue such bios for an indefinite amount of time on a queue specific to the multipath mapped device. This feature is useful for those storage systems which can possibly enter a transient all-paths-down state which must be ridden through by the multipath software. These bios will remain where they are until the mapped device is suspended, possibly done when the mapped device’s map is updated, or until a previously failed path is reinstated. There are no practical limits on either the number of bios which may be queued in this manner or on the amount of time which these bios remain queued. Furthermore, there is no congestion control mechanism which will limit the number of bios actually sent to any device. These facts can lead to a significant amount of dirty pages being stranded in the page cache, thereby setting the stage for potential system deadlock if memory resources must be dynamically allocated from the kernel heap anywhere in the code path of reinstating either the map or a usable path for the mapped device.

5 Future Enhancements

This section enumerates some possible enhancements to the multipath implementation.

5.1 Persistent Device Naming

The cryptic name used for the device file associated with a device mapper mapped device is often renamed by a user space component associated with the device mapper target driver managing the mapped device. The multipathing subsystem sets up udev configuration directives to automatically rename this name when a device mapper device file is first created. The dm-<minor #> name is changed to the ASCII representation of the hexadecimal values for each 4-bit nibble of the device’s UID utilized by multipath. Yet the resultant multipath device names are still cryptic, unwieldy, and their use is prone to error. Although an alias name may be linked to each multipath device, the setup requires manipulation of the multipath configuration file for each device. The automated management of multipath alias names by both udev and multipath components seems a reasonable next step.
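
Today such an alias is declared manually, one entry per device, in the multipath configuration file; a hypothetical entry (the WWID value is elided):

    multipaths {
        multipath {
            wwid   <device uid>
            alias  mpath_home
        }
    }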


It should be noted that the Persistent Storage Device Naming specification from the Storage Networking SIG of OSDL is attempting to achieve consistent naming across all block devices.

5.2 Event Mechanism

The device mapper's event mechanism enables user space applications to determine when io related events occur in the kernel for a mapped device. Events are generated by the target driver managing a particular mapped device. The event mechanism is currently accessed via a synchronous device mapper library interface which blocks a thread in the kernel in order to wait for an event associated with a particular mapped device. Only the event occurrence is passed back to user space; no other attribute information of the event is communicated.

Potential enhancements to the device mapper event mechanism are enumerated below.

1. Associating attributes with an event and providing an interface for communicating these attributes to user space will improve the effectiveness of the event mechanism. Possible attributes for multipath events include (1) the cause of the event (e.g., path failure or other), (2) error or status information associated with the event (e.g., SCSI sense key/ASC/ASCQ for a SCSI error), and (3) an indication of the target device on which the error occurred.

2. Providing a multi-event wait synchronous interface similar to select(2) or poll(2) will significantly reduce the thread and memory resources required to use the event mechanism. This enhancement will allow a single user thread to wait on events for multiple mapped devices.

3. A more radical change would be to integrate the device mapper's event mechanism with the kernel's kobject subsystem. Events could be sent as uevents to be received over an AF_NETLINK socket, as sketched below.
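
To make the third item concrete, here is a minimal sketch of a user space listener for such events, assuming they would be delivered over the same NETLINK_KOBJECT_UEVENT channel the kernel already uses for kobject uevents:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void)
    {
        struct sockaddr_nl addr = {
            .nl_family = AF_NETLINK,
            .nl_groups = 1,          /* kernel uevent multicast group */
        };
        char buf[4096];
        ssize_t n;
        int fd;

        /* The channel over which kobject uevents are broadcast;
         * device mapper events could be routed here as well. */
        fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);
        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        while ((n = recv(fd, buf, sizeof(buf) - 1, 0)) > 0) {
            buf[n] = '\0';
            /* Each message is "ACTION@devpath" followed by
             * NUL-separated KEY=VALUE attribute strings. */
            printf("uevent: %s\n", buf);
        }
        return 0;
    }

A single such socket could replace one blocked kernel thread per mapped device, which is exactly the resource saving item 2 aims for.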

5.3 Monitoring of io via Iostat(1)

Block io to device mapper mapped devices cannot currently be monitored via iostat(1) or /proc/diskstats. Although an io to a mapped device is tracked on the actual target device(s) at the bottom of the generic_make_request device tree, io statistics are not tracked for any device mapper mapped devices positioned within the device tree.

Io statistics should be tracked for each device mapper mapped device positioned on an io stack. For multipathing, this accounting must allow for possibly multiple io failures and subsequent io retries.

5.4 IO Load Sharing

Additional path selectors will be implemented. These will likely include state-based ones which select a path based on the minimum number of outstanding bios or the minimum round trip latency. While the domain for these criteria is likely a path group for one mapped device, it may be worth looking at sharing io load across actual physical components (e.g., a SCSI initiator or target) instead. A sketch of the first of these selectors follows.
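
The selection logic for a least-outstanding-bios selector might look like the following; the path structure and its in-flight counter are invented for illustration and do not reflect the actual device mapper path selector interface:

    #include <stddef.h>

    /* Invented per-path bookkeeping for illustration only. */
    struct mp_path {
        struct mp_path *next;
        unsigned int    inflight;  /* bios dispatched, not yet completed */
        int             usable;    /* passed its last health test        */
    };

    /* Pick the usable path in the group with the fewest
     * outstanding bios; NULL if every path has failed. */
    static struct mp_path *select_least_pending(struct mp_path *group)
    {
        struct mp_path *best = NULL, *p;

        for (p = group; p; p = p->next) {
            if (!p->usable)
                continue;
            if (!best || p->inflight < best->inflight)
                best = p;
        }
        return best;  /* caller increments best->inflight on dispatch */
    }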

5.5 Protocol Agnostic Multipathing

Achieving protocol agnostic multipathing will require the removal of some SCSI-specific affinity from both the kernel multipath components (e.g., SCSI-specific error information in the bio) and the user space ones (e.g., path discovery).


5.6 Scalable Path Testing

Proactive path testing could be enhanced to support multiple path testing policies, and new policies could be created which provide improved resource scalability and improve the predictability of path failures. Path testing could emphasize the testing of the physical components utilized by paths instead of simply exhaustively testing every logical path. For example, the availability through sysfs of path transport specific attributes for SCSI paths will make it easier to group paths which utilize common physical components. Additionally, the frequency of path testing can be based on the recent reliability of a path; that is, frequently and recently failed paths are tested more often. One simple form of such a policy is sketched below.
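
A hedged sketch of reliability-weighted test scheduling; all names and constants are invented for illustration:

    #include <time.h>

    #define BASE_INTERVAL_SECS 30U   /* illustrative default cadence */
    #define MIN_INTERVAL_SECS   5U   /* floor bounding test traffic  */

    struct path_history {
        time_t       last_failure;   /* when the path last failed     */
        unsigned int recent_fails;   /* failures within the last hour */
    };

    /* Halve the base test interval once per recent failure, so
     * recently unreliable paths are tested more often. */
    static unsigned int next_test_interval(const struct path_history *h,
                                           time_t now)
    {
        unsigned int ivl = BASE_INTERVAL_SECS;

        if (h->recent_fails && now - h->last_failure < 3600) {
            unsigned int shift = h->recent_fails < 5 ? h->recent_fails : 5;
            ivl >>= shift;
            if (ivl < MIN_INTERVAL_SECS)
                ivl = MIN_INTERVAL_SECS;
        }
        return ivl;
    }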

6 Architectural Issues

This section describes several critical architectural issues.

6.1 Elevator Function Location

The Linux block layer performs the sorting and merging of IO requests (the elevator modules) in a layer just above the device driver. The dm device mapper supports the modular stacking of multipath and RAID functionality above this layer.

At least for the device mapper multipath module, it is desirable to either relocate the elevator functionality to a layer above the device mapper in the IO stack, or at least to add an elevator at that level.

An example of this need can be seen with a multipath configuration where there are four equivalent paths between the host and each target. Assume also there is no penalty for switching paths. In this case, the multipath module wants to spread IO evenly across the four paths. For each IO, it may choose a path based on which path is most lightly loaded.

With the current placement of the elevator, then, IO requests for a given target tend to be spread evenly across each of the four paths to that target. This reduces the chances for request sorting and merging.

If an elevator were placed in the IO stack above the multipath layer, the IO requests coming into the multipath would already be sorted and merged. IO requests on each path would at least have been merged. When IO requests on different paths reach their common target, the IOs may no longer be in perfectly sorted order, but they will tend to be near each other. This should reduce seeking at the target.

At this point, there doesn't seem to be any advantage to retaining the elevator above the device driver, on each path in the multipath. Aside from the additional overhead (more memory occupied by the queue, more plug/unplug delay, additional cpu cycles), there doesn't seem to be any harm from invoking the elevator at this level either. So it may be satisfactory to just allow multiple elevators in the IO stack.

Regarding other device mapper targets, it is not yet clear whether software RAID would benefit from having elevators higher in the IO stack, interspersed between RAID levels. So, it may be sufficient to just adapt the multipath layer to incorporate an elevator interface.

Further investigation is needed to determine what elevator algorithms are best for multipath. At first glance, the Anticipatory scheduler seems inappropriate. It is less clear how the deadline scheduler or CFQ scheduler would perform in conjunction with multipath. Consideration should be given to whether a new IO scheduler type could produce benefits for multipath IO performance.

6.2 Memory Pressure

There are scenarios where all paths to a logical unit on a SCSI storage system will appear to be failed for a transient period of time. One such expected and transient all-paths-down use case involves an application transparent upgrade of the micro-code of a SCSI storage system. During this operation, it is expected that for a reasonably short period of time, likely bounded by a few minutes, all paths to a logical unit on the storage system in question will appear to a host to be failed. A multipathing product is expected to ride through this scenario without failing ios back to applications. The multipathing software must detect when one or more of the paths to such a device become physically usable again, do what it takes to make the paths usable, and retry ios which failed during the all-paths-down time period.

If this period coincides with a period of extreme physical memory congestion, it must still be possible for multipath components to enable the use of these paths as they become physically usable. While a kernel resident congestion control mechanism based on block request allocation exists to ward off the overcommittal of page cache memory to any one target device, there are no congestion control mechanisms that take into account either the use of multiple target devices for the same mapped device or the internal queuing of bios within device mapper target drivers.

The multipath configuration for several storage systems must include the multipath feature queue_if_no_path in order not to immediately return to an application an io request whose transfer has failed on every path to its device. Yet the use of this configuration directive can result in the queuing of an indefinite number of bios, each for an indefinite period of time, when there are no usable paths to a device. When coincident with a period of heavy asynchronous write-behind in the page cache, this can lead to a large number of dirty page cache pages for the duration of the transient all-paths-down period.
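For concreteness, a sketch of where this directive appears: in the feature list at the front of a multipath map, togglable at run time via dmsetup messages. The device numbers, map size, and repeat counts below are illustrative.

    # multipath map: one feature (queue_if_no_path), no hardware
    # handler, one path group of two paths served round-robin
    0 2097152 multipath 1 queue_if_no_path 0 1 1 round-robin 0 2 1 8:16 1000 8:32 1000

    # switch queueing off and back on for an active map
    dmsetup message mpath0 0 fail_if_no_path
    dmsetup message mpath0 0 queue_if_no_path

The fail_if_no_path message is also the escape hatch an administrator can use when queued bios must be errored out rather than held indefinitely.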

Since memory congestion states like this cannot be detected accurately, the kernel and user code paths involved with restoring a path to a device must not ever execute code which could result in blocking while an io is issued to this device. A blockable (i.e., __GFP_WAIT) memory allocation request in this code path could block for write-out of dirty pages to this device from the synchronous page reclaim algorithm of __alloc_pages. Any modification to file system metadata or data could block flushing modified pages to this device. Any of these actions have the potential of deadlocking the multipathing software.

These requirements are difficult to satisfy for multipathing software since user space intervention is required to restore a path to a usable state. These requirements apply to all user and kernel space multipathing code (and code invoked by this code) which is involved in testing a path and restoring it to a usable state. This precludes the use of fork, clone, or exec in the user portion of this code path. Path testing initiated from user space and performed via ioctl entry to the block scsi ioctl code must also conform to these requirements.

The pre-allocation of memory resources in order to make progress for a single device at a time is a common solution to this problem. This approach may require special case code for tasks such as the kernel resident path testing. Furthermore, in addition to being "locked to core," the user space components must only invoke system calls and library functions which also abide by these requirements. Possibly combining these approaches with a bit of congestion control applied against bios (to account for the ones internally queued in device mapper target drivers) instead of or in addition to block io requests, and/or a mechanism for timing out bios queued within the multipath target driver as a result of the queue_if_no_path multipath feature, is a reasonable starting point. A sketch of the pre-allocation pattern follows.
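
On the kernel side, the pre-allocation pattern is commonly expressed with the mempool interface; the sketch below is illustrative (cache name, object size, and reserve depth are invented, and the slab creation signature follows current kernels rather than the 2.6 interfaces of the time):

    #include <linux/init.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    #define PATH_TEST_RESERVE 4   /* illustrative reserve depth */

    static struct kmem_cache *path_test_cache;
    static mempool_t *path_test_pool;

    /* Runs once, from a context that may safely block, well
     * before any all-paths-down event can occur. */
    static int __init path_test_pool_init(void)
    {
        path_test_cache = kmem_cache_create("path_test", 256, 0, 0, NULL);
        if (!path_test_cache)
            return -ENOMEM;
        path_test_pool = mempool_create(PATH_TEST_RESERVE,
                                        mempool_alloc_slab,
                                        mempool_free_slab,
                                        path_test_cache);
        return path_test_pool ? 0 : -ENOMEM;
    }

    /* In the path restoration code itself: GFP_NOIO forbids the
     * allocator from initiating io (and thus from writing dirty
     * pages to the very device being restored); if the free lists
     * are empty, the request is satisfied from the reserve. */
    static void *path_test_alloc(void)
    {
        return mempool_alloc(path_test_pool, GFP_NOIO);
    }

On the user space side, one way to stay "locked to core" is mlockall(MCL_CURRENT | MCL_FUTURE) at daemon startup, combined with avoiding any library call that might allocate unlocked pages on the restoration path.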

7 Conclusion

This paper has analyzed both the architecture and design of the block device multipathing indigenous to Linux. Several architectural issues and potential enhancements have been discussed.

The multipathing architecture described in this paper is actually implemented in several Linux distributions to be released around the time this paper is being written. For example, SuSE SLES 9 Service Pack 2 and Red Hat AS 4 Update 1 each support Linux multipathing. Furthermore, several enhancements described in this paper are actively being pursued.

Please reference http://christophe.varoqui.free.fr and http://sources.redhat.com/dm for the most up-to-date development versions of the user- and kernel-space resident multipathing software respectively. The first web site listed also provides a detailed description of the syntax for a multipathing device-mapper map.
