Brief Introduction of DRBD in SLE12SP2

Post on 22-Jan-2018


Introduction of DRBD

Nick Wang, HA Team

nwang@suse.com


Overview

• What is DRBD

• Development status

• How to use DRBD

• Key features of DRBD

• Packages & Environment

• State of DRBD

• Basic structure

• Metadata (MD)

• What happens when a resource starts

What is DRBD?

Distributed Replicated Block Device

Dual primary (needs shared-filesystem support: OCFS2/GFS)

Development status

DRBD & Kernel

• drbd.ko – already built into the mainline kernel, but falls behind our distribution kernel:
Kernel 2.6.33 → 8.3.7
Kernel 3.12 → 8.4.6 (SLE12 SP1 as KMP)
Kernel 4.2 → 8.4.x
Kernel 4.4 → 9.0.1 (SLE12 SP2 as KMP)

• DRBD – developed and maintained by LINBIT. Versions 8.0~8.3.x, 8.4.x, 9.0.x – Related tools: drbd-utils, drbd-doc, drbd-test, drbdmanage

How to use DRBD

Demo time!
- DRBD8 (147.2.207.59/154)
- DRBD9 (147.2.212.220/144/107)
- DRBD with HA cluster

Preparation

• 1) Create or provide a block device for DRBD.

2) Distribute the DRBD config files to all nodes.

3) Open the ports DRBD needs.

4) Create the meta-data.

5) Trigger the initial synchronization.
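The steps above can be sketched as a shell walkthrough. This is a sketch, not a definitive procedure: the resource name "test", device /dev/vdb, and port 7792 are taken from the later config examples, the peer hostname node-2 is illustrative, and the commands require a working DRBD installation on every node.

```
# 1) backing device: here an existing block device /dev/vdb is used (see test.res)
# 2) copy the same resource file to every node
scp /etc/drbd.d/test.res node-2:/etc/drbd.d/
# 3) open the replication port used in test.res (7792/tcp); firewall tooling varies
# 4) create the on-disk meta-data, then bring the resource up (run on all nodes)
drbdadm create-md test
drbdadm up test
# 5) trigger the initial full sync from the node that holds the good data
drbdadm primary --force test
```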

Configuration in DRBD8

• “test.res” in /etc/drbd.d/

resource test {
    protocol C;
    disk {
        on-io-error pass_on;
    }
    on node-1 {
        address 147.2.207.187:7792;
        device /dev/drbd0;
        disk /dev/vdb;
        meta-disk internal;
    }
    on node-2 {
        address 147.2.207.199:7792;
        device /dev/drbd0;
        disk /dev/vdb;
        meta-disk internal;
    }
}

Configuration in DRBD9

• “test.res” in /etc/drbd.d/

resource test {
    net {
        protocol C;
    }
    connection-mesh {
        hosts node-1 node-2 node-3;
    }
    on node-1 {
        address 10.161.155.151:7788;
        device /dev/drbd0;
        disk /dev/sdb1;
        meta-disk internal;
        node-id 0;
    }
    on node-2 {
        address 10.161.155.158:7788;
        device /dev/drbd0;
        disk /dev/sdb1;
        meta-disk internal;
        node-id 1;
    }
    on node-3 {
        address 10.161.155.159:7788;
        device /dev/drbd0;
        disk /dev/sdb1;
        meta-disk /dev/sdc1;
        node-id 2;
    }
}

Crm configuration

• crm configure
crm(live)configure# primitive drbd_test ocf:linbit:drbd \
    params drbd_resource="test" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
crm(live)configure# ms ms_drbd_test drbd_test \
    meta master-max="1" master-node-max="1" \
    clone-max="2" clone-node-max="1" \
    notify="true"
crm(live)configure# commit
crm(live)configure# exit

Key features of DRBD

Replication modes

• …net { protocol C; }…

Fully synchronous mode (LAN): Protocol C
Asynchronous mode (WAN): Protocol A and Protocol B (normally used in Geo scenarios)

Online device verification

• DRBD permits the verification of local and peer devices in an online fashion.

DRBD doesn't move data between nodes to validate but instead moves cryptographic digests of the data (hash). In this way, a node computes a hash of a block; transfers the much smaller signature to the peer node, which also calculates the hash; and then compares them. If the hashes are the same, the blocks are properly replicated. But if the hashes differ, the out-of-date block is marked as out of sync, and subsequent synchronization ensures that the block is properly synchronized.
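The digest-comparison idea can be demonstrated locally with two image files standing in for the local and peer devices. This is only a toy model: the paths, the 4 KiB block size, and the use of md5sum are illustrative, not what DRBD itself does internally (DRBD uses its configured verify algorithm on the real devices).

```shell
# Toy demonstration of digest-based verification: two files stand in for
# the local and peer backing devices; one block of the "peer" is corrupted.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/local.img" bs=4096 count=4 2>/dev/null
cp "$tmp/local.img" "$tmp/peer.img"
# flip one byte at offset 8192, i.e. inside block 2
printf 'X' | dd of="$tmp/peer.img" bs=1 seek=8192 conv=notrunc 2>/dev/null

out_of_sync=0
for blk in 0 1 2 3; do
  h1=$(dd if="$tmp/local.img" bs=4096 skip=$blk count=1 2>/dev/null | md5sum | cut -d' ' -f1)
  h2=$(dd if="$tmp/peer.img"  bs=4096 skip=$blk count=1 2>/dev/null | md5sum | cut -d' ' -f1)
  # only the small digests would cross the network, never the blocks themselves
  if [ "$h1" != "$h2" ]; then
    echo "block $blk is out of sync"
    out_of_sync=$((out_of_sync + 1))
  fi
done
echo "$out_of_sync block(s) would be marked for resync"
rm -rf "$tmp"
```

Only the corrupted block is reported; the three identical blocks produce matching digests and transfer no data.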

Automatic recovery

• Automatic resync after a node or connectivity failure; DRBD determines the direction and the amount of data to sync. DRBD can also recover from a wide variety of errors, but one of the most insidious is the so-called "split brain" situation.

1) Discarding modifications made on the "younger" primary.
2) Discarding modifications made on the "older" primary.
3) Discarding modifications on the primary with fewer changes.
4) Graceful recovery from split brain if one host has had no intermediate changes. (Recommended)

…
handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    …
}
net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
    …
}
…
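When the configured after-sb policies cannot resolve a split brain automatically, the usual manual procedure is to pick one node as the "victim" whose changes are discarded. A sketch, assuming the resource name "test" from the earlier examples and a running DRBD installation:

```
# on the node whose changes will be discarded (the split-brain "victim"):
drbdadm disconnect test
drbdadm secondary test
drbdadm connect --discard-my-data test

# on the surviving node (only needed if it is in StandAlone state):
drbdadm connect test
```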

Optimizing synchronization

• Two of the schemes that DRBD uses are activity logs and the quick-sync bitmap.

The activity log stores blocks that were recently written to and defines which blocks need to be synchronized after a failure is resolved. The quick-sync bitmap records which blocks are in sync (or out of sync) during a period of disconnection. When the nodes reconnect, synchronization can use this bitmap to quickly make the nodes exact replicas of one another.

New features of DRBD9

• 1) Multi-Node replication.

2) Up to 31 connections per resource, i.e. clusters of up to 32 nodes are supported.

3) Auto promote.

4) Transport abstraction layer, e.g. drbd_transport_tcp.ko; allows for RDMA transports on Ethernet/InfiniBand.

5) New management tool: drbdmanage

Packages & Environment

DRBD Packages in SLE12SP2

• Project drbd: drbd (COPYING, ChangeLog), drbd-kmp-default (drbd.ko, drbd_transport_tcp.ko)

Project drbd-utils: drbd-utils (drbdadm, drbdmeta, drbdsetup, etc.)

Project yast2-drbd: yast2-drbd

Threads of DRBD

• After the module is loaded: kthread drbd_reissue, PR: 0

Per resource, started after connect:
drbd<minor>_submit, PR: 0
drbd_w(orker)_<res>, PR: 20
drbd_r(eceiver)_<res>, PR: 20
drbd_a(ck_receiver)_<res>, PR: -3
drbd_s(ender)_<res>, PR: 20
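On a running system these threads and their real-time priorities can be inspected with ps (the grep only matches something when a DRBD resource is actually active):

```
# list DRBD kernel threads with their real-time priority
ps -eLo pid,rtprio,comm | grep drbd
```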

State

Resource roles

• Primary: may be read from and written to

Secondary: normally receives updates from its peer, but may neither be read from nor written to

Unknown: It is only displayed for the peer’s resource role, and only in disconnected mode

Disk states

• Diskless: No local block device has been assigned to the DRBD driver

Attaching: Reading meta data. Next → Consistent/Inconsistent/…

Failed: I/O failure reported by local block device. Next → Diskless

Consistent/Inconsistent: the node's data is consistent / needs a sync

UpToDate/Outdated: decided when the connection is established.

DUnknown: used for the peer's disk when there is no network connection.

Connection states

• StandAlone: The resource has not yet been connected.

Disconnecting: Temporary state, Next → StandAlone.

Unconnected: Temporary state, Next → WFConnection.

Timeout/NetworkFailure/ProtocolError: Connection Errors.

Teardown: Temporary state, Next → Unconnected.

WFConnection: waiting until the peer node becomes visible.

Connected: connection has been established.

Others: StartingSyncS/StartingSyncT, WFBitMapS/WFBitMapT, SyncSource/SyncTarget, PausedSyncS/PausedSyncT, VerifyS/VerifyT
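The three state categories above map directly onto drbdadm query commands. The resource name "test" is illustrative, and the commands need a running DRBD resource:

```
drbdadm role test     # resource role, e.g. Primary/Secondary
drbdadm dstate test   # disk state, e.g. UpToDate/UpToDate
drbdadm cstate test   # connection state, e.g. Connected
# DRBD9 also offers a combined status view:
drbdadm status test
```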

Basic data structure

DRBD resources

• A node has a number of DRBD resources. Each such resource has a number of devices (volumes) and connections to other nodes. Each device has a unique minor device number.

This relationship is represented by the global variable drbd_resources, the drbd_resource, drbd_connection, drbd_device, and drbd_peer_device objects, and their interconnections.

| resource   | device      | … | device      |
| connection | peer_device | … | peer_device |
| …          | …           | … | …           |
| connection | peer_device | … | peer_device |

All are maintained in an RCU-safe way, protected by the resource->conf_update mutex.

Metadata

Metadata includes:

• Information like size of the DRBD device

Generation Identifier

Activity Log

Quick-sync bitmap

Activity log

• Consider a write going to the local backing device while the same data block is being sent over the network. If the primary node fails at that moment and fail-over is initiated, this data block is out of sync.

The activity log keeps track of those blocks that have "recently" been written to.

So only the blocks in the activity log need to be synchronized after the connection resumes.
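The mechanism can be sketched as a toy model: a bounded list of recently written extents, where writing an extent makes it the most recent entry and the oldest entry is evicted when the log is full. The capacity of 4 extents and the write sequence are arbitrary illustrations, far smaller than the real on-disk activity log.

```shell
# Toy model of the activity log: a bounded list of recently written
# extents; after a crash, only these extents would need to be resynced.
al_size=4                 # illustrative AL capacity, in extents
al=""                     # space-separated extent numbers, oldest first

write_extent() {
  # drop the extent if already present, then append it as most recent
  al=$(echo "$al" | tr ' ' '\n' | grep -vx "$1" | tr '\n' ' ')
  al="$al$1 "
  # evict the oldest entry when over capacity
  set -- $al
  if [ "$#" -gt "$al_size" ]; then
    shift
    al="$* "
  fi
}

for e in 7 3 7 9 1 4; do write_extent "$e"; done
echo "extents to resync after a crash: $al"
```

Writing extent 7 twice keeps only one entry for it, and the sixth write evicts the oldest survivor, so the log ends up holding extents 7 9 1 4.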

Quick sync bitmap (per node)

• Kept on a per-resource, per-peer basis, to track blocks that are out of sync.

One bit represents a 4-KiB chunk of on-disk data.

The bitmap is changed in memory only; it is written to disk when changes fall out of the activity log or when the resource is being shut down.
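The 4-KiB-per-bit rule makes the bitmap overhead easy to estimate. A small arithmetic sketch, using an arbitrary 1 GiB backing device as the example:

```shell
# Quick-sync bitmap sizing: one bit per 4 KiB chunk (per the slide above).
dev_bytes=$((1 * 1024 * 1024 * 1024))   # example: 1 GiB backing device
chunk=4096                              # 4 KiB of on-disk data per bit
bits=$((dev_bytes / chunk))
bitmap_bytes=$((bits / 8))
echo "bitmap: $bits bits = $bitmap_bytes bytes ($((bitmap_bytes / 1024)) KiB)"
```

So a 1 GiB device needs only a 32 KiB bitmap, which is why it is cheap to keep per peer.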

Generation Identifier

• Determining whether the two nodes are in the same cluster

Determining whether a sync is needed, and its direction

Identifying split brain

A list consisting of:
Current UUID
Bitmap UUID(s)
Historical UUIDs * 2

Three main ways a new GI is generated:
1) An initial sync happens: both sides use the GI of the SyncSource.
2) A Secondary is promoted to Primary while the connection is disconnected.
3) The original Primary generates a new GI when disconnecting; the Secondary stays unchanged.

Other cases exist as well, e.g. disconnecting during a state change…

$ drbdadm up <res> — what happens?

Stages:

• CFG_PREREQ
CFG_RESOURCE
CFG_DISK_PREP_DOWN / CFG_DISK_PREP_UP
CFG_NET_DISCONNECT / CFG_NET_CONNECT
CFG_NET_PREP_DOWN / CFG_NET_PREP_UP
CFG_NET_PATH
CFG_NET
…

For drbdadm up <res>, the scheduled stages are:
CFG_NET_PREP_UP
CFG_NET_PATH
CFG_NET_CONNECT
CFG_PEER_DEVICE
CFG_DISK_PREP_UP
CFG_DISK

Appendices

Links

• Linbit homepage: http://www.drbd.org/en/

Source code in tarball: http://www.drbd.org/en/community/download

Git repos: http://git.linbit.com/
