Honey, I Shrunk the Database

For Test and Development Environments

Postgres Open, September 2011

Vanessa Hurst, Paperless Post

User Data

Why Shrink?

Accuracy
You don’t truly know how your app will behave in production unless you use real data.
Production data is the ultimate in accuracy.

Freshness
New data should be available regularly.
Full database refreshes should be timely.

Resource Limitations
Staging and developer machines cannot handle production load.

Data Protection
Limit spread of sensitive user or client data.

Case Study: Paperless Post

Requirements
Freshness – Daily; on command for non-developers
Shrinkage – Slices, Mutations

Resources
Source – extra disk space, RAM, and CPUs
Destination – limited, often entirely unoptimized
Development – constrained DBA resources

Shrink Strategies

Copies
Restored backups or live replicas of entire production database

Slices
Select portions of exact data

Mutations
Sanitized, anonymized, or otherwise changed data

Assumptions
Seed databases, fixtures, test data

Slices

Vertical Slice
Difficult to obtain a valid, useful subset of data.
Example: Include some entire tables, exclude others

Horizontal Slice
Difficult to write and maintain.
Example: SQL or application code to determine subset of data

PG Tools – Vertical Slice

Flexibility at Source (Production)

pg_dump
Include data only [-a, --data-only]
Include table schema only [-s, --schema-only]
Select tables, one switch per table [-t table1 -t table2, --table=table1 --table=table2]
Select schemas [-n schema, --schema=schema]
Exclude schemas [-N schema, --exclude-schema=schema]

PG Tools – Vertical Slice

Flexibility at Destination (Staging, Development)

pg_restore
Include data only [-a, --data-only]
Select indexes [-I index, --index=index]
Tune processing [-j number-of-jobs, --jobs=number-of-jobs]
Select schemas [-n schema, --schema=schema]
Select triggers [-T trigger, --trigger=trigger]
Exclude privileges [-x, --no-privileges, --no-acl]
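The options on the two slides above can be combined roughly as follows. This is a minimal sketch, not the talk's actual script: the function names and the archive file name are hypothetical, and it assumes a custom-format archive (-Fc), which pg_restore needs for parallel jobs.

```shell
#!/bin/sh
# Sketch only: dump_slice and restore_slice are hypothetical helper names.

# Dump the public schema of a source database into a custom-format archive.
# $1 = source database, $2 = archive file
dump_slice() {
    pg_dump -Fc --schema=public --file="$2" "$1"
}

# Restore the archive into a destination database with parallel jobs,
# skipping GRANT/REVOKE statements so destination privileges stay untouched.
# $1 = destination database, $2 = archive file, $3 = parallel jobs (default 2)
restore_slice() {
    pg_restore --jobs="${3:-2}" --no-privileges --schema=public --dbname="$1" "$2"
}
```

With the database names used elsewhere in the deck, a run might look like `dump_slice db-01 slice.dump` on the source and `restore_slice db-new slice.dump 4` on the destination.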

Mutations

External Data Protection
HIPAA Regulations
PCI Compliance
API Terms of Use

Internal Data Protection
Protecting your users’ personal data
Protecting your users from accidents, e.g. staging emails
Your Terms of Service

User Data


Composite Slice including

Vertical Slice – All application object schemas

Vertical Slice – Entire tables of static content

Horizontal Slice – Subset of users and their data

Mutation – Changed user email addresses




pg_dump --clean --schema-only --schema public db-01 > slice.sql






pg_dump --data-only --schema public -t cards db-01 >> slice.sql







Horizontal Slice – Subset of users and their data

Mutation – Changed user email addresses


CREATE SCHEMA staging;


Horizontal Slice

Custom SQL

SELECT * INTO staging.users FROM users WHERE EXISTS (subset of users);

Dynamic relative to full data set or newly created slice

SELECT * INTO staging.stuff FROM stuff WHERE EXISTS (stuff per staging.users);
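To make the two placeholders concrete, here is one possible sketch. The selection rule (every Nth user by id) and the stuff.user_id column are assumptions made for illustration; the talk does not say how the subset of users was chosen. The function name is hypothetical, and the staging schema is assumed to exist already.

```shell
#!/bin/sh
# Sketch only: prints the slice SQL so it can be reviewed or piped to psql.
# $1 = keep one user out of every N (hypothetical sampling rule, default 100)
horizontal_slice_sql() {
    cat <<EOF
-- Subset of users: an assumed rule that keeps every ${1:-100}th id
SELECT * INTO staging.users
FROM users u
WHERE u.id % ${1:-100} = 0;

-- Dependent rows, selected relative to the newly created slice
-- (assumes stuff.user_id references users.id)
SELECT * INTO staging.stuff
FROM stuff s
WHERE EXISTS (SELECT 1 FROM staging.users su WHERE su.id = s.user_id);
EOF
}
```

Run against the source database it would be used as `horizontal_slice_sql 100 | psql db-01`. Selecting the dependent rows from staging.users rather than repeating the user rule keeps the second query valid when the rule changes.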


Horizontal Slice
Custom SQL
Dynamic relative to full data set or newly created slice

Mutations
Email Addresses: use regular expressions to clean non-admin addresses, e.g. [email protected] => [email protected]
Cached Data: clear cached short link from link-shortening API
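The email mutation can be illustrated with a small regular-expression rewrite. This is a sketch under stated assumptions: the admin domain (example.com) and the target pattern (staging+local_domain@example.com) are hypothetical, since the example addresses on the slide did not survive extraction. The function name is also hypothetical.

```shell
#!/bin/sh
# Sketch only: rewrites one address so staging mail cannot reach a real user.
# Assumption: addresses at example.com belong to admins and are left alone.
sanitize_email() {
    case "$1" in
        *@example.com)
            # Admin address: keep as-is
            printf '%s\n' "$1"
            ;;
        *)
            # Fold the original local part and domain into a safe local part
            printf '%s\n' "$1" | sed -E 's/^([^@]+)@(.+)$/staging+\1_\2@example.com/'
            ;;
    esac
}
```

Under these assumptions, `sanitize_email jane@gmail.com` prints `staging+jane_gmail.com@example.com`. Inside the database the same pattern could be applied with PostgreSQL's regexp_replace in an UPDATE on the sliced users table.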








Horizontal Slice – Subset of users and their data

Mutation – Changed user email addresses

pg_dump --data-only --schema staging db-01 >> slice.sql


Rebuild
Prepare new database as standby
Gracefully close connections
Rotate by renaming databases

Security
Dedicated database build user
Membership in application user role
Application user role & privileges remain
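The rotation step might look something like the sketch below. The talk does not show how connections were closed, so pg_terminate_backend here is an assumption, and it is abrupt rather than graceful. The database names staging and staging_old are hypothetical. The pid column of pg_stat_activity applies to PostgreSQL 9.2 and later; older releases call it procpid.

```shell
#!/bin/sh
# Sketch only: prints SQL that swaps a freshly built database into place.
# $1 = live database name, $2 = newly built database, $3 = name for the old copy
rotate_sql() {
    cat <<EOF
-- Close remaining connections to the live database (assumed approach)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = '$1' AND pid <> pg_backend_pid();

-- Rotate by renaming: retire the live database, promote the new one
ALTER DATABASE "$1" RENAME TO "$3";
ALTER DATABASE "$2" RENAME TO "$1";
EOF
}
```

A database cannot be renamed while connected to it, so the output would be run from a maintenance database, for example `rotate_sql staging db-new staging_old | psql postgres`. The double quotes matter for names containing a hyphen, such as db-new.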


Rebuild
$ bzcat slice.sql.bz2 | psql db-new
Staging schema has not been created, so all data loads to default schema


We hacked our rebuild by importing across schemas!

Now our sequences are wrong, causing duplicate data errors every time we try to insert into tables.

Secret Weapon

-- Updates all serial sequences for ID columns only
CREATE OR REPLACE FUNCTION update_id_sequences() RETURNS void AS $$
DECLARE
  table_record RECORD;
  table_name text;
BEGIN
  FOR table_record IN
    SELECT pc.relname FROM pg_class pc
    WHERE pc.relkind = 'r'
      AND EXISTS (SELECT 1 FROM pg_attribute pa
                  WHERE pa.attname = 'id' AND pa.attrelid = pc.oid)
  LOOP
    table_name := table_record.relname::text;
    -- Move the table's id sequence up to its current MAX(id)
    EXECUTE 'SELECT setval(pg_get_serial_sequence('
      || quote_literal(quote_ident(table_name)) || ', ' || quote_literal('id')
      || '), MAX(id)) FROM ' || quote_ident(table_name)
      || ' WHERE EXISTS (SELECT 1 FROM ' || quote_ident(table_name) || ')';
  END LOOP;
END;
$$ LANGUAGE plpgsql;


Rebuild
$ bzcat slice.sql.bz2 | psql db-new
Staging schema has not been created, so all data loads to default schema
$ echo "select 1 from update_id_sequences();" >> slice.sql
Vacuum
Reindex


Security

Database build user
CREATEDB privilege
Member of application user role

Application user remains database owner
Application user privileges remain limited
Build only works in predetermined environments





Questions?

Postgres Open, September 2011

Vanessa Hurst, Paperless Post

@DBNess

More Tools

Copies -- LVM Snapshots
See talk by Jon Erdman at PG Conf EU
Great for all reads
Data stays virtualized & doesn’t take up space until changed
Ideal for DDL changes without actual data changes

More Tools

Copies, Slices -- pg_staging by dimitri
http://github.com/dimitri/pg_staging
Simple -- pauses pgbouncer & restores backup
Efficient -- leverages bulk loading
Flexible -- supports varying psql files
Custom -- limited

Slices -- replicate by rtomayko of GitHub
http://github.com/rtomayko/replicate
Simple -- preserves object relations via ActiveRecord
Inefficient -- creates text-based .dump
Inflexible -- corrupts id sequences on data insert
Custom -- highly