PostgreSQL migration from AWS RDS to EC2

● Technology lover
● Worked as Software Engineer, Team Lead, DevOps, DBA, Data Analyst
● Sr. Tech Architect at Coverfox
● Email me at [email protected]
● Tweet me at @hitul007

“Everything is possible, but it takes time” - Hitul Mistry

● Database evolution
● Self-hosted database
● AWS RDS
● Why we migrated from Database as a Service to a self-hosted database
● Challenges in migration
● How we planned the migration, and points to note
● Current architecture
● Demo

● Simple GUI
● HA, Multi-AZ, encryption, backup, recovery, disaster recovery, security, compliance
● SLAs
● Performance optimizations by yourself
● GUI for version upgrades

AWS RDS

● PostgreSQL functionality is similar to vanilla PostgreSQL

● Cannot install extensions other than those provided
● Cannot replicate to a self-hosted server
● Cannot install custom logical decoding plugins
● Cannot install custom data types
● Upgrades can be done in a few clicks on the GUI

Self-hosted PostgreSQL

● PostgreSQL functionality follows the original PostgreSQL behaviour
● 100% control over functionality
● Upgrades must be done manually

AWS RDS

● HA, fault tolerance, disaster recovery, and backups can be set up in a few GUI clicks

● You still have to monitor the common conditions under which PostgreSQL can crash, such as a full disk or high CPU usage. PostgreSQL is auto-restarted, for example when the disk fills up.

● SLA provided by AWS

Self-hosted PostgreSQL

● HA, fault tolerance, disaster recovery, and backups must be implemented by configuring PostgreSQL yourself

● You have to monitor every threat that can occur and be ready to fix it yourself

● Upgrades are done by yourself
● SLA is your own responsibility

AWS RDS

● Everything is available in the GUI
● Requires PostgreSQL usage knowledge and some architectural knowledge
● Performance tuning is controlled by AWS and is not tailored to your use case

Self-hosted PostgreSQL

● Requires expert-level PostgreSQL knowledge
● Everything needs to be done by setting up configs yourself

AWS RDS

● Only a few performance parameters can be tuned via the GUI

● Cannot use custom hardware for performance

Self-hosted PostgreSQL

● Performance can be tuned via the PostgreSQL config as well as other parameters: kernel, OS, disk, etc.
● Many performance parameters are available to tune to fit the application

AWS RDS

● To identify faults in PostgreSQL, you are given a GUI where all the PostgreSQL logs are printed
● New version upgrades can be done with a few clicks
● Cannot go deeper than what the DBaaS provides

Self-hosted PostgreSQL

● Can see PostgreSQL logs directly
● Can go as deep as you want

AWS RDS

● Operating environment cannot be changed

Self-hosted PostgreSQL

● It can be moved to any operating environment

● The cost to scale vertically or horizontally on AWS is high
● Many open-source plugins required by the application cannot be installed on RDS
● New logical decoding plugins for replication and other use cases are not supported by RDS
● AWS takes time to upgrade to the latest PostgreSQL versions
● Almost-zero-downtime server upgrades are possible with self-hosting
● Database auto-scaling
● Performance tuning per application needs

AWS RDS Cost (On Demand)

Instance Type   CPU   RAM (GiB)   Pricing/Yr
m4.2xlarge      8     32          $9,014.04
m4.4xlarge      16    64          $18,045.60
m4.10xlarge     40    160         $45,122.76

AWS EC2 Cost (On Demand)

Instance Type   CPU   RAM (GiB)   Pricing/Yr
m4.2xlarge      8     32          $3,679.20
m4.4xlarge      16    64          $7,358.40
m4.10xlarge     40    160         $18,396.00

● If you buy a multi-AZ setup, the cost doubles
● Reserved instances can cut cost by 12-64%
● Rack servers and other cloud infrastructure can be used for cost cutting
● Zero-downtime upgrades

Migration required extra hands, but self-hosted maintenance has not increased the load on the DevOps team!

● RDS supports a limited set of plugins; wal2json was added only recently.

Database Operation

INSERT INTO data(data) VALUES('1');

INSERT INTO data(data) VALUES('2');

Logical decoding output for the inserts

table public.data: INSERT: id[integer]:1 data[text]:'1'

table public.data: INSERT: id[integer]:2 data[text]:'2'

BEGIN 89283

table public.core_tracker: UPDATE: id[bigint]:63899671 session_key[text]:'w84fhz6c8b5jpc1ufesnegbxrfmnehh8' user_id[bigint]:23573

extra[text]:'{"h":100,"no_show":true}' created[timestamp with time zone]:'2018-01-05 16:03:23.654652+05:30' fd_id[integer]:null

COMMIT 89283

BEGIN 89285

table public.core_tracker: UPDATE: id[bigint]:63899671 session_key[text]:'w84fhz6c8b5jpc1ufesnegbxrfmnehh8' user_id[bigint]:23573 extra[text]:'{"h":100,"no_show":true,"hello-world":{"sfs":"sdf\"2''3"}}'

created[timestamp with time zone]:'2018-01-05 16:03:23.654652+05:30' fd_id[integer]:null

COMMIT 89285
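The output above is what PostgreSQL's logical decoding emits through the test_decoding plugin. A minimal sketch of how to produce it on a self-hosted server (the slot name demo_slot is illustrative; wal_level must be set to logical):

-- Create a logical replication slot using the test_decoding output plugin
SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');
-- Generate some changes
INSERT INTO data(data) VALUES ('1');
-- Read (and consume) the decoded changes, producing output like the above
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
-- Drop the slot afterwards so it does not retain WAL forever
SELECT pg_drop_replication_slot('demo_slot');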

● 5M unique quotes a month
● 45M unique quotes from insurance companies
● 5 GB written to DB and logs combined

Downtime allowed

● Daily: 8.6s
● Weekly: 1m 0.5s
● Monthly: 4m 23.0s
● Yearly: 52m 35.7s
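These budgets follow directly from the 99.99% SLA below: 0.01% of a day is 0.0001 × 86,400 s ≈ 8.6 s, and 0.01% of a year is 0.0001 × 365.2425 × 86,400 s ≈ 52 m 35.7 s.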

● Almost zero downtime
● Stable database after migration that works like the old one
● 99.99% SLA

● Shared disk failover
● File-system-level replication
● Transaction log (WAL) shipping
● Trigger-based replication
● Statement-based replication

● Most reliable option
● We cannot access pg_hba.conf on RDS
● You don't have enough permissions to execute pg_start_backup on RDS
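For illustration, this is the 9.x-era call that fails on RDS because the master user lacks superuser/replication privileges (it was renamed pg_backup_start in PostgreSQL 15):

-- Fails on RDS with a permission error; works on a self-hosted 9.x server as superuser
SELECT pg_start_backup('base_backup_label');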

● pgBadger
● CURRENT_TIMESTAMP, random(), and sequences would be affected
● Huge changes to the codebase

● PostgreSQL's own utilities to create a dump and restore it (pg_dump/pg_restore)
● Time consuming

● Functions, indexes, and constraints are not migrated
● JSON is treated as CLOB and values get truncated
● varchar / character varying values get truncated
● DDL is ignored
● Does not replicate partitioned tables

● Does not replicate DDL
● Output is tough to parse and is not standardized

● Documentation and support were really limited
● If the tool fails, the whole replication fails

● All tables must have a modified-date column
● All tables must have a primary key, but some tables had non-numeric primary keys

● MongoDB + Go + trigger-based replication + LISTEN/NOTIFY (pg_notify)

● Disable foreign key validations on the self-hosted PostgreSQL DB
● Create triggers on the AWS RDS database (a sketch follows the schema below)
● Take a backup of the PostgreSQL RDS instance

MongoDB schema:

{
  "table_name": "schema_name.table_name",
  "primary_key": "{primary_key_value}",
  "created_at": "timestamp",
  "operation": "Insert/Update/Delete"
}
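A minimal sketch of the trigger-based capture, assuming a table public.data with an id primary key; the function and trigger names are illustrative. A Go worker LISTENs on the channel and writes each payload into the MongoDB collection above:

-- Emit one NOTIFY per row change, matching the MongoDB document schema
CREATE OR REPLACE FUNCTION replicate_change() RETURNS trigger AS $$
DECLARE
  rec record;
BEGIN
  -- OLD is the only row available on DELETE; NEW otherwise
  IF TG_OP = 'DELETE' THEN rec := OLD; ELSE rec := NEW; END IF;
  PERFORM pg_notify('replication', json_build_object(
    'table_name',  TG_TABLE_SCHEMA || '.' || TG_TABLE_NAME,
    'primary_key', rec.id::text,
    'created_at',  now(),
    'operation',   TG_OP
  )::text);
  RETURN rec;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_replicate
AFTER INSERT OR UPDATE OR DELETE ON public.data
FOR EACH ROW EXECUTE PROCEDURE replicate_change();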

● Reset sequences (sketch below)
● Enable foreign key validations
● Stop the AWS RDS instance
● Run basic sanity scripts that verify data on a sample
● Stop the website; open it to internal users only
● Run QA tests
● Take a backup of AWS RDS
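"Reset sequences" here means realigning each sequence with the copied data; a sketch for one table (public.data and its id column are illustrative):

-- Point the serial sequence at the current max id so new inserts do not collide
SELECT setval(pg_get_serial_sequence('public.data', 'id'),
              COALESCE((SELECT max(id) FROM public.data), 1));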

● How much downtime can be accepted?
● What SLA do we need?
● What is the worst thing that can happen?
● Which services can be affected?
● How soon can we recover?
● How much data will be lost, and can it be recovered?
● What will be the long-term ROI?
● What data will be affected?

● The plan should read as steps to execute
● Execute the plan once on a sandbox environment before going live
● The plan should include a rollback strategy

● CPU
● Disk type
● Disk size
● Connections
● Future plans and traffic
● RAID
● Power usage
● Network

● Buffer size
● Kernel parameters
● Work memory
● Checkpoint settings (sketch below)
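A sketch of the kind of PostgreSQL settings involved; the values are purely illustrative, not recommendations, and kernel parameters (e.g. vm.swappiness) are set separately via sysctl:

-- Writes postgresql.auto.conf (PostgreSQL 9.4+); shared_buffers needs a restart
ALTER SYSTEM SET shared_buffers = '8GB';        -- buffer size
ALTER SYSTEM SET work_mem = '64MB';             -- per-sort/hash work memory
ALTER SYSTEM SET checkpoint_timeout = '15min';  -- checkpoint spacing
ALTER SYSTEM SET max_wal_size = '4GB';          -- checkpoint pressure
SELECT pg_reload_conf();                        -- apply reloadable settings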

● Our rollback strategy was similar
● Logical decoding replication for 2 weeks

● Write deployment scripts to set up the database
● Script as many things as you can

● Sanity scripts
● QA team scripts and approval

● High availability
● Fault tolerance
● Disaster recovery
● Backup & recovery
● Hardware & software updates
● Security
● Monitoring
● Testing
● Rollback
● Compliance

● Promote a standby to master

pg_ctl promote -D /data-dir-path/

● Rebuild a node and add it back to the cluster

rm -rf /data-dir-path/ && pg_basebackup ….

● Run pg_rewind after promote

pg_rewind -D /data-dir-path --source-server="host=..."
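Two hedged notes on the commands above: pg_rewind only works if the target had wal_log_hints = on (or data checksums) enabled before the divergence and has been cleanly shut down, and an illustrative completion of the elided pg_basebackup command would stream WAL and show progress, e.g. pg_basebackup -h <new-master> -D /data-dir-path/ -X stream -P.

-- Enable ahead of time on every node (requires a restart)
ALTER SYSTEM SET wal_log_hints = 'on';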

● Service discovery is the automatic detection of devices and services offered by these devices on a computer network. - Wikipedia

Consul

● Service discovery
● Health checking
● KV store
● Multi data center

Patroni

● Fork of Governor
● Developed at Zalando
● Used with Consul, ZooKeeper, etcd

Image: http://aisaac.io

● service/coverfox/optime/leader
  ○ Leader node name
● service/coverfox/members/master-a
  ○ Member of the cluster
● service/coverfox/members/master-b
  ○ Member of the cluster
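These are keys Patroni maintains in Consul's KV store; for illustration, they can be inspected with the Consul CLI (assuming a local agent):

consul kv get service/coverfox/optime/leader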

● Errors
  ○ data corruption
  ○ system failure (including hardware failure)
  ○ human error
  ○ natural disaster

● Tool for disaster recovery, backup, and recovery by 2ndQuadrant
● Remote backup
● Remote restore
● WAL log recovery

Barman Backup

/usr/bin/barman backup --jobs 6 mumbai-master-a

Barman restore db

barman recover --target-time "2017-12-15 22:22:00" --remote-ssh-command "ssh [email protected]" mumbai-master-c 20171214T190201 /pg-data-dir-path/ -j 10
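Before trusting a restore, Barman can verify the server setup and list the available backups (both are standard Barman subcommands; the server name matches the commands above):

barman check mumbai-master-a
barman list-backup mumbai-master-a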

What to monitor?

● RAM/CPU usage
● Disk I/O
● Process info
● Bandwidth
● Vacuum running
● DB space
● Active connections
● Active transactions
● Open files
● Replication lag (query sketch below)
● PgBouncer client connections
● PgBouncer stats
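A few illustrative queries behind those checks, using standard catalog views (9.x-era function names; in PostgreSQL 10+ use pg_wal_lsn_diff, pg_current_wal_lsn, and replay_lsn):

-- Active connections
SELECT count(*) FROM pg_stat_activity;
-- Replication lag in bytes, run on the primary
SELECT client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS lag_bytes
FROM pg_stat_replication;
-- Database size
SELECT pg_size_pretty(pg_database_size(current_database()));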

● Grafana
● InfluxDB
● Twilio
● Slack