Helsinki Cassandra Meetup #2: From Postgres to Cassandra
From Postgres to Cassandra
In four easy steps
Axel Eirola <[email protected]>
Jarrod Creado <[email protected]>
Agenda
1. Postgres
2. Cassandra
3. ???
4. Profit
0. Context
Categorizing the internet
Hundreds of millions of URLs
Data size in the terabytes
Reputation metadata:
Categories: adult, gambling, …
Safety: malicious, safe, …
Automatic processing
(Re)processing hundreds of thousands of URLs per day
Computation divided among multiple services, each with multiple instances
Downtime not an option
Manual research
Data mining capabilities
Researching (aimlessly poking around)
Reporting
1. Postgres
BCNF up in this
Planned for storage, not queries
Highly normalized
Stiff schema, hard to add more fields
Sharding like a boss
Segmenting the URL keyspace
One (or more) box for each segment
Difficult to add more capacity
We got eight single points of failure
Upgrading means downtime
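The capacity problem above can be sketched in a few lines. This is a toy model, not the talk's actual sharding code: it assumes URLs are mapped to one of eight boxes by hashing into the segmented keyspace, and shows why adding a box forces a bulk re-shuffle.

```python
import hashlib

# Assumption for illustration: eight segments, matching the
# "eight single points of failure" above.
NUM_SHARDS = 8

def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard by hashing it into the segmented keyspace."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Why adding capacity is difficult: going from 8 boxes to 9 reassigns
# almost every URL, so nearly the whole dataset has to move.
urls = ["http://example-%d.com/" % i for i in range(1000)]
moved = sum(1 for u in urls if shard_for(u, 8) != shard_for(u, 9))
print("URLs that change shard when adding one box: %d / 1000" % moved)
```

With modulo hashing roughly 8 out of 9 keys land on a different box after the change, which is exactly the "difficult to add more capacity" pain.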
Index all the things
Building queries is hard due to the structure of the schema
Managing indices for those queries is hard
The mess needs to be abstracted away from the user; this is also hard
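A toy version of the normalized schema shows why queries are awkward (SQLite stands in for Postgres here, and the sample row is made up): fetching the categories for a single URL already takes a three-way join, and every lookup path wants its own index.

```python
import sqlite3

# In-memory stand-in for the normalized Postgres schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url (key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE category (key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE url_category (url_key INTEGER, category_key INTEGER, timestamp INTEGER);
-- Each lookup path needs its own index to stay fast:
CREATE INDEX idx_url ON url (url);
CREATE INDEX idx_uc ON url_category (url_key);
""")
conn.execute("INSERT INTO url VALUES (1, 'http://example.com/')")
conn.execute("INSERT INTO category VALUES (10, 'gambling')")
conn.execute("INSERT INTO url_category VALUES (1, 10, 1365000000)")

# "Which categories does this URL have?" is a three-way join:
rows = conn.execute("""
SELECT c.name, uc.timestamp
FROM url u
JOIN url_category uc ON uc.url_key = u.key
JOIN category c ON c.key = uc.category_key
WHERE u.url = ?
""", ("http://example.com/",)).fetchall()
print(rows)  # [('gambling', 1365000000)]
```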
2. Cassandra
Easy management
Easy scaling up as more data is stored
Out of the box:
Replication
Pagination
Load balancing
Less downtime during upgrades
TTL
Mapping data
Structure of our data is suitable for NoSQL
Mostly based around single URLs
Given a URL, fetch metadata
Got queries?
Cassandra schema designed for fixed-pattern access performed by automation
Human free-form searches offloaded to Elasticsearch
Load on one doesn't affect the other
Denormalize
Provide fixed-pattern access for automation
Relations become ranges in the column namespace
This is pre-CQL, so no collections; we do it the old-school way
Minimize the amount of read-then-write scenarios
Postgres tables:
Url: key, url
Category: key, name
Url_Category: url_key, category_key, timestamp

Cassandra column families:
Url: row key <url_key>; columns: url = <url>, (c)_<category_name> = <timestamp>
Category: row key <category_name>; columns: <url_key> = <empty>
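The mapping can be sketched with plain dicts standing in for the two column families; the keys and data below are invented for illustration. The point is that the join table becomes ranges in each row's column namespace, and attaching a category is a blind write with no read-then-write.

```python
# Dicts stand in for the two Cassandra column families.
url_cf = {}       # row key <url_key>; columns: 'url' plus '(c)_<category_name>'
category_cf = {}  # row key <category_name>; columns: <url_key> -> empty value

def add_url(url_key, url):
    url_cf.setdefault(url_key, {})["url"] = url

def add_category(url_key, category_name, timestamp):
    # Blind write: no read needed to attach one more category.
    url_cf.setdefault(url_key, {})["(c)_%s" % category_name] = timestamp
    category_cf.setdefault(category_name, {})[url_key] = ""

add_url("u1", "http://example.com/")
add_category("u1", "gambling", 1365000000)

# Given a URL key, its categories are a contiguous "(c)_" column range:
categories = sorted(k[4:] for k in url_cf["u1"] if k.startswith("(c)_"))
print(categories)  # ['gambling']
```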
3. ???
Going into production
DAL (data access layer) abstracts away the split databases
Implement new features in Cassandra only
Get a feel of Cassandra before taking it into full use
A tale of two databases
Run both databases in parallel
Writes:
New data, and updates, go into both databases
Blind writes make it easy to do partial updates
Reads:
Read from both databases, cross-validate responses
Easy to move responsibilities from one database to another
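The parallel-run DAL might look something like the sketch below (dicts stand in for the two database clients; the class and method names are invented): every write goes to both stores, and every read cross-validates the two answers.

```python
import logging

class DualDAL:
    """Sketch of a data access layer running Postgres and Cassandra in parallel."""

    def __init__(self):
        self.postgres = {}   # stand-in for the Postgres client
        self.cassandra = {}  # stand-in for the Cassandra client

    def write(self, key, value):
        # Blind writes go into both databases.
        self.postgres[key] = value
        self.cassandra[key] = value

    def read(self, key):
        pg = self.postgres.get(key)
        cs = self.cassandra.get(key)
        if pg != cs:
            logging.warning("cross-validation mismatch for %r: %r != %r", key, pg, cs)
        # One database stays authoritative until the migration is verified.
        return pg

dal = DualDAL()
dal.write("u1", {"url": "http://example.com/", "safety": "safe"})
print(dal.read("u1"))
```

Moving a responsibility between the databases then only means changing which side `read` treats as authoritative.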
Migration boiled down to this
1. Dump URL keys from Postgres into batches
2. Custom migration script to chew a batch; for each URL in the batch:
2.1. Read data from Postgres
2.2. Delete the Cassandra row key for the URL
2.3. Write fresh data from Postgres into Cassandra
3. Log failing URLs
4. Cross-validate on reads for a while to ensure a successful migration
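The steps above can be sketched as a small loop (dicts stand in for the two databases; the function name and batch size are assumptions): each batch reads from Postgres, deletes the Cassandra row, rewrites it fresh, and logs failures so the batch can be re-run.

```python
def migrate(postgres, cassandra, url_keys, batch_size=100):
    """Toy version of the batch migration script described above."""
    failed = []
    for start in range(0, len(url_keys), batch_size):
        batch = url_keys[start:start + batch_size]
        for url_key in batch:
            try:
                data = postgres[url_key]          # 2.1 read data from Postgres
                cassandra.pop(url_key, None)      # 2.2 delete the Cassandra row
                cassandra[url_key] = dict(data)   # 2.3 write fresh data
            except KeyError:
                failed.append(url_key)            # 3. log failing URLs
    return failed

postgres = {"u%d" % i: {"url": "http://example-%d.com/" % i} for i in range(250)}
cassandra = {"u0": {"url": "stale"}}  # stale row gets deleted and rewritten
failed = migrate(postgres, cassandra, list(postgres) + ["missing"], batch_size=100)
print(len(cassandra), failed)  # 250 ['missing']
```

Because each pass deletes before rewriting, re-running a batch after a failure is safe, which matches the "make it easy to repeat migration" advice below.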
4. Profit
Bro-tips
Decide what you don't want to migrate
Dry run while testing, and keep an eye on the performance
Start in small batches, and verify the results before proceeding
Parallelize the batches if you need to speed it up
Keep an eye on performance; throttle if necessary
Not everything goes as planned, so make it easy to repeat the migration
Make sure the cluster is prepared for the migration; reserve time to tweak if not
Kiitos (thanks!)