Swapping Pacemaker Corosync with repmgr


Transcript of Swapping Pacemaker Corosync with repmgr

Page 1: Swapping Pacemaker Corosync with repmgr

Swapping Pacemaker/Corosync for

repmgr

pgDay Asia 2016

Ang Wei Shan
17th March 2016

Disclaimer: I don’t work for 2ndQuadrant

Page 2: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 3: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 4: Swapping Pacemaker Corosync with repmgr

● Database Administrator

● > 4 years of experience in databases

● Worked with most of the major RDBMSs

● ≈ 350 days with PostgreSQL

Page 5: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 6: Swapping Pacemaker Corosync with repmgr

● Open-source alternative to Red Hat Cluster Suite

● Extremely popular choice in the open-source world

● Made up of 2 different software stacks

○ Pacemaker

○ Corosync/Heartbeat

● Complicated to get the configuration correct

Page 7: Swapping Pacemaker Corosync with repmgr

Online: [ node1 node2 ]

Full list of resources:

stonith_node1 (stonith:fence_ipmilan): Stopped
stonith_node2 (stonith:fence_ipmilan): Stopped
vip-slave (ocf::heartbeat:IPaddr2): Started node1
Master/Slave Set: msPostgresql [pgsql]
    Masters: [ node1 ]
    Stopped: [ pgsql:1 ]
Resource Group: master-group
    vip-master (ocf::heartbeat:IPaddr2): Started node1
    vip-rep (ocf::heartbeat:IPaddr2): Started node1

Node Attributes:
* Node node1:
    + master-pgsql : 1000
    + pgsql-data-status : LATEST
    + pgsql-master-baseline : 00000070C6FCF9F0
    + pgsql-status : PRI
* Node node2:
    + master-pgsql : -INFINITY
    + pgsql-data-status : DISCONNECT
    + pgsql-status : STOP

Migration summary:
* Node node2:
    stonith_node1: migration-threshold=1000000 fail-count=1000000 last-failure='Thu Feb 4 13:43:49 2016'
    pgsql:0: migration-threshold=1 last-failure='Thu Feb 4 13:46:14 2016'
* Node node1:
    stonith_node2: migration-threshold=1000000 fail-count=1000000 last-failure='Thu Feb 4 13:38:45 2016'

pgsql_start_0 (node=node2, call=84, rc=1, status=complete): unknown error

Page 8: Swapping Pacemaker Corosync with repmgr

Feb 4 17:08:12 node1 attrd[3149]: notice: attrd_perform_update: Sent delete 46: node=node1, attr=last-failure-stonith_node2, id=<n/a>, set=(null), section=status
Feb 4 17:08:12 node1 stonith-ng[3147]: notice: stonith_device_register: Device 'stonith_node2' already existed in devicelist (2 active devices)
Feb 4 17:08:14 node1 stonith-ng[3147]: notice: log_operation: Operation 'monitor' [8201] for device 'stonith_node2' returned: -1001 (Generic Pacemaker error)
Feb 4 17:08:14 node1 stonith-ng[3147]: warning: log_operation: stonith_node2:8201 [ ERROR: Failed to authenticate to https://cathy.rocketwork.com.sg:4000 as node1 with key /etc/chef/client.pem ]
Feb 4 17:08:14 node1 stonith-ng[3147]: warning: log_operation: stonith_node2:8201 [ Getting status of IPMI:10.51.113.22...Spawning: '/usr/bin/ipmitool -I lanplus -H '10.51.113.22' -U 'pacemaker' -L 'OPERATOR' -P '' -v chassis power status'... ]
Feb 4 17:08:14 node1 stonith-ng[3147]: warning: log_operation: stonith_node2:8201 [ Failed ]
Feb 4 17:08:15 node1 crmd[3151]: error: process_lrm_event: LRM operation stonith_node2_start_0 (call=52, status=4, cib-update=48, confirmed=true) Error
Feb 4 17:08:15 node1 crmd[3151]: warning: status_from_rc: Action 5 (stonith_node2_start_0) on node1 failed (target: 0 vs. rc: 1): Error
Feb 4 17:08:15 node1 crmd[3151]: warning: update_failcount: Updating failcount for stonith_node2 on node1 after failed start: rc=1 (update=INFINITY, time=1454576895)
Feb 4 17:08:15 node1 attrd[3149]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-stonith_daina2 (INFINITY)
Feb 4 17:08:15 node1 crmd[3151]: warning: update_failcount: Updating failcount for stonith_node2 on node1 after failed start: rc=1 (update=INFINITY, time=1454576895)
Feb 4 17:08:15 node1 crmd[3151]: notice: run_graph: Transition 12 (Complete=2, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=unknown): Stopped
Feb 4 17:08:15 node1 attrd[3149]: notice: attrd_perform_update: Sent update 51: fail-count-stonith_node2=INFINITY
Feb 4 17:08:15 node1 attrd[3149]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-stonith_da

Page 9: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 10: Swapping Pacemaker Corosync with repmgr

● Lightweight connection pooler for PostgreSQL

● Open-source

● Acts as the single point of entry to the database

● Useful for managing a huge number of incoming connections to the database

● Latest version - v1.7.2
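
A minimal pgbouncer.ini sketch of such a single-point-of-entry setup (the database name, port and file paths are illustrative, not from the slides):

```ini
[databases]
; route the logical name "appdb" to the local PostgreSQL instance
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session
max_client_conn = 1000
default_pool_size = 20
```

Clients connect to port 6432 on the pooler instead of directly to PostgreSQL, so the backing host can change underneath them during a failover.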

Page 11: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 12: Swapping Pacemaker Corosync with repmgr

● Common Address Redundancy Protocol (CARP)

● Linux’s implementation of CARP from FreeBSD

● Allows multiple hosts to share a single IP address

● Management of the virtual IP for failover purposes

● For client connectivity to Pg_bouncer

● Latest version - v1.5.2
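
A sketch of running ucarp on each pgbouncer host so they share one virtual IP (the addresses, interface name, vhid and script paths are illustrative, not from the slides):

```shell
# Advertise the shared VIP 10.0.0.100; the host with the highest
# advertisement priority holds it, a peer takes over if it stops.
ucarp --interface=eth0 --srcip=10.0.0.11 --vhid=42 --pass=secret \
      --addr=10.0.0.100 \
      --upscript=/etc/ucarp/vip-up.sh \
      --downscript=/etc/ucarp/vip-down.sh &

# vip-up.sh / vip-down.sh typically just add or remove the address:
#   ip addr add 10.0.0.100/24 dev eth0   # on becoming master
#   ip addr del 10.0.0.100/24 dev eth0   # on losing master status
```

Clients then point at 10.0.0.100, which always resolves to whichever pgbouncer host is currently alive.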

Page 13: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 14: Swapping Pacemaker Corosync with repmgr

● Developed by 2ndQuadrant

● Open-source

● Manages replication and failover for your

PostgreSQL HA cluster

● Latest version - v3.1.1

Page 15: Swapping Pacemaker Corosync with repmgr

● Linux or Unix only

● repmgr 2.0 is for PostgreSQL 9.0 to 9.4

● repmgr 3.0 is for PostgreSQL 9.3 or higher

● Does not take care of client failover!!

Page 16: Swapping Pacemaker Corosync with repmgr

● Automatic failover capabilities

● Provisioning of standby servers

● 2 main tools

○ repmgr => performs administrative tasks

○ repmgrd => performs monitoring, automatic failover and notification events
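
The two tools are typically driven like this (a sketch based on the repmgr 3.x command set; paths, hostnames and the repmgr database/user names are illustrative):

```shell
# On the primary: register it in the repmgr metadata database
repmgr -f /etc/repmgr.conf master register

# On a new standby: clone the primary, then register the standby
repmgr -h node1 -U repmgr -d repmgr -D /var/lib/pgsql/9.4/data standby clone
repmgr -f /etc/repmgr.conf standby register

# From any node: inspect the cluster state
repmgr -f /etc/repmgr.conf cluster show

# repmgrd runs on every node for monitoring and automatic failover
repmgrd -f /etc/repmgr.conf --verbose &
```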

Page 17: Swapping Pacemaker Corosync with repmgr

● Requires a database to store cluster metadata

● Runs as postgres user

● Password-less SSH connectivity between all

hosts

● Recommended to run with an odd number of nodes in the cluster
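
These requirements translate into a small repmgr.conf on each node; a sketch for node1, using repmgr 3.x configuration keys (the cluster name and conninfo are illustrative):

```ini
cluster=pgday_cluster
node=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr'
failover=automatic
promote_command='repmgr standby promote -f /etc/repmgr.conf'
follow_command='repmgr standby follow -f /etc/repmgr.conf'
```

Password-less SSH is usually arranged by generating a key as the postgres user (`ssh-keygen`) and distributing it to every other host with `ssh-copy-id`.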

Page 18: Swapping Pacemaker Corosync with repmgr

The decision whether a server can be promoted depends on whether the majority of servers are "visible". If you have three servers - primary and standby in one location, and a second standby in another location - and the network to the second standby goes down, the second standby will see it's in the minority (its location represents 1/3 of the servers) and won't promote itself.

If you have two servers in each location, you'd need an additional witness server so one location still has a "majority" - otherwise in the event of a network disconnection you might end up with one standby in each location promoting itself.
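
The majority rule above can be sketched as a tiny decision function (a simplified illustration of the idea, not repmgr's actual implementation):

```python
def may_promote(visible: int, total: int) -> bool:
    """A standby may promote only if it can see a strict majority
    of all servers in the cluster (counting itself)."""
    return visible > total / 2

# Three servers, network split isolates the second standby:
# it sees only itself (1 of 3) and must not promote.
print(may_promote(1, 3))   # False
# The side with primary + standby still sees 2 of 3.
print(may_promote(2, 3))   # True
# Two servers per site, no witness: each side sees exactly half (2 of 4),
# so neither has a majority - hence the witness, making 5 total.
print(may_promote(2, 4))   # False
print(may_promote(3, 5))   # True
```

This is why an odd number of nodes (or an added witness) is recommended: a strict majority can then always be formed on exactly one side of a split.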

Page 19: Swapping Pacemaker Corosync with repmgr

Agenda

● Introduction

● Challenges with Pacemaker/Corosync

● Pg_bouncer

● Linux’s UCARP

● 2ndQuadrant’s repmgr

● Demo

Page 20: Swapping Pacemaker Corosync with repmgr

Thank you!
[email protected]

newbiedba.wordpress.com
sg.linkedin.com/in/weishan