www.data61.csiro.au
Adventures in High(ish) Availability
Peter Chubb | Principal Research Engineer
January 21, 2019
CSIRO Data61 Copyright © 2019 CC BY-SA
services
• DNS, DHCP/BOOTP, LDAP, NFS, TFTP, Postgres, kitty, web services, CI services (Bamboo™), login, hg, git, machine-queue, Bitbucket™ ...
• around 40 desktops using DHCP, NFS and LDAP
• around 30 dev boards and test machines using BOOTP, TFTP, and NFS
– rebooting every few minutes; different MAC address every reboot
The Situation
• Ancient server hardware (donated to us in 2000 or thereabouts)
• Only some services replicated (DNS, LDAP both master/slave)
• Growing group — downtime costs more
• Desire for planned downtime (kernel upgrades, hardware changes, etc.)
• Applied for capex funding for a new server
– Huge corporate discount
→ Buy Two!
High(ish) availability
99.99999999999999999999999999999999999999999999999999999999%
A few minutes here and there don’t matter
Manual failover for a new kernel, replacing a network card, etc. is OK
Two Servers!
• 24 cores
• 300 GB RAM
• 16 TB spinning disk with 1.2 TB RAID-1 NVMe cache
• 2 × 10 Gb/s fibre, 8 × 1 Gb/s copper
Replication and/or failover possible.
[Figure: the two hosts, Cellar and Brewer, each hold the same set of containers (DNS, LDAP, TFTP, web, login, NFS); a legend marks containers as stopped or running, and lsyncd is shown in the diagram.]
Testing
7:00am Came into work; turned coffee machine on; checked logwatch
7:15am Attempted failover: shut down one host
7:40am Looking good: services all transferred and running
7:45am get coffee
7:50am Notice login xterms have frozen: can’t log back in. Attempt to get into host’s consoles; can’t do it as me; manage to remember root password. Very slow response.
8:00am Get warning (to phone) that webservers are down
8:10am On console, NFS server not responding; can’t connect to nfshomes: no DNS entry.
8:15am (People start arriving at work; can’t work: no local DNS)
8:20am Reboot original server; restart original services one at a time; fail back
11:00am Everything seems normal again; get another coffee
PROBLEMS
• DHCP can’t update names on slave server
• DNS entries time out if master is down.
– Timeouts are short to cope with devboard short lease lifetimes
– Everything stops if DNS stops
• NFS after failover fails
– Handle based on inode number and File-System ID; inode numbers different
– NFSv4 is stateful
• Run out of watch slots for lsyncd
• Postgres failover (sort-of) OK; fail-back difficult
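The lsyncd problem comes from inotify: lsyncd places a watch on each directory it mirrors, and the Linux kernel caps the number of watches per user through fs.inotify.max_user_watches. A rough pre-flight check, offered as a sketch only (the function names and the example path are assumptions, not from the talk), is to compare the directory count of the tree with that limit:

```sh
#!/bin/sh
# Count the directories under a tree: lsyncd needs roughly one inotify
# watch per directory it mirrors.
count_dirs() {
    find "$1" -type d | wc -l | tr -d ' '
}

# Read the per-user inotify watch limit (Linux only).
watch_limit() {
    cat /proc/sys/fs/inotify/max_user_watches
}

# Succeed only if the tree fits within the current limit.
fits_in_watch_limit() {
    [ "$(count_dirs "$1")" -le "$(watch_limit)" ]
}
```

For example, `fits_in_watch_limit /export/home || echo "raise fs.inotify.max_user_watches"`, where /export/home is a placeholder path. The limit can be raised with sysctl, but on a large home-directory tree the watch count grows with every new directory.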
Second attempt
• Stateless services as before
• Per-service solutions for the rest
LDAP
• Not hard to make OpenLDAP replicate master-master.
• Round-robin DNS allows load sharing
• SSSD on clients means short outages don’t matter (much).
Works!
DNS
• LDAP replication working ...
– So use LDAP as backend.
∗ bind9-dyndb-ldap already packaged for Debian
– Works well with BIND 9.11
– Multi-master DNS ‘tricky’, but seems to work.
– Running in containers on both hosts as masters; watchdog ensures containers are running
Works!
DHCP
• Still have bootp clients — can’t use native replication
• Server runs in same container as one of the DNS servers, to allow name update
• Watchdog in each DNS container starts dhcpd if it is not running on the DNS replica
• /etc/dhcpd.conf held in Git; git pull on start.
Works
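The watchdog logic described on this slide might look roughly like the following. This is a sketch under stated assumptions: the slides do not say how the replica is probed or how dhcpd is started, so the three commands are passed in as parameters, and the function name ensure_dhcpd is hypothetical.

```sh
#!/bin/sh
# ensure_dhcpd PEER_CHECK LOCAL_CHECK START_CMD
# Each argument is a command string (all three are placeholders):
#   PEER_CHECK  succeeds if dhcpd is already running on the DNS replica
#   LOCAL_CHECK succeeds if dhcpd is already running in this container
#   START_CMD   refreshes the configuration and starts dhcpd locally
ensure_dhcpd() {
    peer_check=$1
    local_check=$2
    start_cmd=$3
    if sh -c "$peer_check" > /dev/null 2>&1; then
        return 0    # the replica is serving DHCP; nothing to do
    fi
    if sh -c "$local_check" > /dev/null 2>&1; then
        return 0    # already running here
    fi
    sh -c "$start_cmd"
}
```

A call could look like `ensure_dhcpd 'ssh dns2 pgrep -x dhcpd' 'pgrep -x dhcpd' 'git -C /etc/dhcp pull --ff-only && service isc-dhcp-server start'`, where the host name dns2, the repository location and the service name are all assumptions. Two watchdogs running this at the same moment could both start dhcpd, so a real version needs some tie-break.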
NFS
• DRBD for underlying FS
• NFSv4 state on one of the replicated volumes
1. Check switches are up. Abort if not.
2. Check if DRBD is up-to-date. Abort if not.
3. If remote is up, shut it down:
• stop nfs-kernel-server and rpcbind
• unmount exported volumes
• delete the HA address
• check that the HA address is gone; if not, destroy the container
4. Switch the local DRBD to primary.
5. Start the local container if necessary.
6. (in container) mount the filesystems, add the HA address, start nfs-kernel-server
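The host-side part of the sequence above (steps 1 to 5) can be sketched as a chain of small functions that abort on the first failure. This is an illustration, not the author's script: the resource name, host names and the peer-side release script are hypothetical, and only the container name nfshomes comes from the slides.

```sh
#!/bin/sh
# All names below are placeholders for illustration.
RES=${RES:-r0}                    # DRBD resource name (assumed)
PEER=${PEER:-peer-host}           # the other server (assumed)
SWITCH=${SWITCH:-switch1}         # a switch to probe (assumed)
CONTAINER=${CONTAINER:-nfshomes}  # container name taken from the slides
RELEASE_CMD=${RELEASE_CMD:-/usr/local/sbin/nfs-release}  # hypothetical peer-side script

# Step 1: the network must be usable.
switches_up() { ping -c 1 "$SWITCH" > /dev/null 2>&1; }

# Step 2: every volume must report a local disk state of UpToDate.
drbd_uptodate() {
    state=$(drbdadm dstate "$RES") || return 1
    [ -n "$state" ] || return 1
    ! printf '%s\n' "$state" | grep -qv '^UpToDate'
}

# Step 3: if the peer answers, ask it to stop NFS, unmount, and drop the HA address.
release_remote() {
    ping -c 1 "$PEER" > /dev/null 2>&1 || return 0   # peer is down: nothing to stop
    ssh "$PEER" "$RELEASE_CMD"
}

# Step 4: take over the replicated block devices.
promote_local() { drbdadm primary "$RES"; }

# Step 5: start the container only if it is not already running.
start_container() {
    virsh -c lxc:/// domstate "$CONTAINER" 2>/dev/null | grep -q '^running' ||
        virsh -c lxc:/// start "$CONTAINER"
}

take_over_nfs() {
    switches_up   || { echo "switches down, aborting" >&2; return 1; }
    drbd_uptodate || { echo "DRBD not up to date, aborting" >&2; return 1; }
    release_remote  || return 1
    promote_local   || return 1
    start_container
}
```

Step 6 runs inside the container and is left out here. Keeping each step in its own function makes it easy to replace one of them, for example to probe more than one switch.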
NFS
Sort-of works.
Planned failovers work
Often see partial failover (DRBD switches rôles for some discs)
Still investigating — packet loss?
Also DAD (Duplicate Address Detection) races for IPv6.
Postgres
• Write-Ahead Log shipping for replication supported
– With ‘just a bit’ of configuration
• Easy to trigger failover
BUT
• Clients don’t know of failover
• No load balancing between active instances
• Fail-back is hard
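One small part of the "clients don't know of failover" problem is finding out which instance is currently the primary. PostgreSQL's pg_is_in_recovery() returns true on a standby and false on a primary, so a monitoring script can ask each server. The sketch below assumes a reachable host name and a postgres login role; both are placeholders, and the function names are hypothetical.

```sh
#!/bin/sh
# Map the output of "SELECT pg_is_in_recovery()" (t or f) to a role name.
role_from_recovery_flag() {
    case "$1" in
        t) echo standby ;;
        f) echo primary ;;
        *) echo unknown; return 1 ;;
    esac
}

# Ask one server for its role; -A and -t make psql print the bare value.
pg_role() {
    role_from_recovery_flag \
        "$(psql -h "$1" -U postgres -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null)"
}
```

For example, `pg_role db1.example.org` prints primary, standby, or unknown when the server cannot be reached. This only reports state; it does not redirect clients or help with fail-back.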
Investigating Patroni as a solution.
Remaining Issues
is_up() {
    ping -c 1 "$1" > /dev/null 2>&1
}
Packet loss or congestion causes false down indications.
is_up() {
    for t in 5 10 30
    do
        ping -c 1 "$1" > /dev/null 2>&1 && return 0
        sleep $t
    done
    ping -c 1 "$1" > /dev/null 2>&1
}
Where possible, check the service, not the container:
is_up() {
    pg_isready -h "$1" > /dev/null 2>&1
}
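The two ideas above (retry with growing delays, and probe the service itself) combine naturally if the retry loop takes the probe as a command. A minimal sketch, in which retry_probe and the RETRY_DELAYS variable are hypothetical names rather than anything from the talk:

```sh
#!/bin/sh
# Run an arbitrary probe command, retrying with increasing delays.
# RETRY_DELAYS is a space-separated list of sleep times in seconds.
retry_probe() {
    for delay in ${RETRY_DELAYS:-5 10 30}; do
        "$@" > /dev/null 2>&1 && return 0
        sleep "$delay"
    done
    # Final attempt after the last delay; its status is the result.
    "$@" > /dev/null 2>&1
}
```

Usage could be `retry_probe ping -c 1 somehost` or `retry_probe pg_isready -h somehost`, with somehost as a placeholder. With the default delays a genuinely dead service is declared down only after about 45 seconds, which trades failover speed for fewer false alarms.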
Orphan Zombies
$ ps axf
...
26313 ? Sl 0:00 /usr/lib/libvirt/libvirt_lxc --name nfshomes ...
26355 ? Ss 0:19 \_ /sbin/init
26455 ? Ss 1:49 \_ /lib/systemd/systemd-journald
26468 ? Ss 0:00 \_ /usr/sbin/blkmapd
...
Orphan Zombies
$ ps axf
...
25234 ? Ss 2:16 [init]
32097 ? Zl 2:11 \_ [apache2] <defunct>
...
• Orphan Zombies
– Kill them all!
∗ every 30 min: /usr/local/bin/kill-orphans
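The slides do not show what kill-orphans does, so the following is not that script. As a purely illustrative starting point, this sketch lists zombie processes together with their parent PIDs. A zombie cannot be killed directly; it goes away only when its parent reaps it or exits, so any cleanup has to act on the parent. The function names are hypothetical.

```sh
#!/bin/sh
# Filter "pid ppid stat comm" lines, keeping only zombies (state starts with Z).
zombies_only() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# List zombie processes on this host as "pid ppid command".
list_zombies() {
    ps -eo pid=,ppid=,stat=,comm= | zombies_only
}
```

Run from cron, the output of list_zombies could feed a report, or a decision about which container init to stop.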
But why not use . . .
• corosync and pacemaker
• piranha
• Etc
scripts
Available at: https://bitbucket.csiro.au/projects/TRUSTWORTHYSYSTEMS/repos/hiavail/browse