Helsinki Cassandra Meetup #2: From Postgres to Cassandra
From Postgres to Cassandra
In four easy steps
Axel Eirola <[email protected]>
Jarrod Creado <[email protected]>
Agenda
1. Postgres
2. Cassandra
3. ???
4. Profit
0. Context
Categorizing the internet
Hundreds of millions of URLs
Data size in the terabytes
Reputation metadata:
Categories: adult, gambling, …
Safety: malicious, safe, …
Automatic processing
(Re)processing hundreds of thousands of URLs per day
Computation divided among multiple services, each with multiple instances
Downtime not an option
Manual research
Data mining capabilities
Researching (aimlessly poking around)
Reporting
1. Postgres
BCNF up in this
Planned for storage, not queries
Highly normalized
Stiff schema, hard to add more fields
Sharding like a boss
Segmenting the URL keyspace
One (or more) box for each segment
Difficult to add more capacity
We got eight single points of failure
Upgrading means downtime
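The capacity problem above can be sketched in a few lines. This is a toy model, not the talk's actual sharding code: it assumes URLs are mapped to one of eight boxes by hashing into the segmented keyspace, and shows why adding a box forces a bulk re-shuffle.

```python
import hashlib

# Assumption for illustration: eight segments, matching the
# "eight single points of failure" above.
NUM_SHARDS = 8

def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard by hashing it into the segmented keyspace."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Why adding capacity is difficult: going from 8 boxes to 9 reassigns
# almost every URL, so nearly the whole dataset has to move.
urls = ["http://example-%d.com/" % i for i in range(1000)]
moved = sum(1 for u in urls if shard_for(u, 8) != shard_for(u, 9))
print("URLs that change shard when adding one box: %d / 1000" % moved)
```

With modulo hashing roughly 8 out of 9 keys land on a different box after the change, which is exactly the "difficult to add more capacity" pain.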
Index all the things
Building queries is hard due to the structure of the schema
Managing indices for those queries is hard
The mess needs to be abstracted away from the user; this is also hard
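A toy version of the normalized schema shows why queries are awkward (SQLite stands in for Postgres here, and the sample row is made up): fetching the categories for a single URL already takes a three-way join, and every lookup path wants its own index.

```python
import sqlite3

# In-memory stand-in for the normalized Postgres schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE url (key INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE category (key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE url_category (url_key INTEGER, category_key INTEGER, timestamp INTEGER);
-- Each lookup path needs its own index to stay fast:
CREATE INDEX idx_url ON url (url);
CREATE INDEX idx_uc ON url_category (url_key);
""")
conn.execute("INSERT INTO url VALUES (1, 'http://example.com/')")
conn.execute("INSERT INTO category VALUES (10, 'gambling')")
conn.execute("INSERT INTO url_category VALUES (1, 10, 1365000000)")

# "Which categories does this URL have?" is a three-way join:
rows = conn.execute("""
SELECT c.name, uc.timestamp
FROM url u
JOIN url_category uc ON uc.url_key = u.key
JOIN category c ON c.key = uc.category_key
WHERE u.url = ?
""", ("http://example.com/",)).fetchall()
print(rows)  # [('gambling', 1365000000)]
```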
2. Cassandra
Easy management
Easy scaling up as more data is stored
Out of the box:
Replication
Pagination
Load balancing
Less downtime during upgrades
TTL
Mapping data
Structure of our data is suitable for NoSQL
Mostly based around single URLs
Given a URL, fetch metadata
Got queries?
Cassandra schema designed for fixed-pattern access performed by automation
Human free-form searches offloaded to Elasticsearch
Load on one doesn't affect the other
Denormalize
Provide fixed-pattern access for automation
Relations become ranges in the column namespace
This is pre-CQL, so no collections; we do it the old-school way
Minimize the amount of read-then-write scenarios
Postgres tables:
Url: key, url
Category: key, name
Url_Category: url_key, category_key, timestamp

Cassandra column families:
Url: row key <url_key>; columns: url = <url>, (c)_<category_name> = <timestamp>
Category: row key <category_name>; columns: <url_key> = <empty>
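The mapping can be sketched with plain dicts standing in for the two column families; the keys and data below are invented for illustration. The point is that the join table becomes ranges in each row's column namespace, and attaching a category is a blind write with no read-then-write.

```python
# Dicts stand in for the two Cassandra column families.
url_cf = {}       # row key <url_key>; columns: 'url' plus '(c)_<category_name>'
category_cf = {}  # row key <category_name>; columns: <url_key> -> empty value

def add_url(url_key, url):
    url_cf.setdefault(url_key, {})["url"] = url

def add_category(url_key, category_name, timestamp):
    # Blind write: no read needed to attach one more category.
    url_cf.setdefault(url_key, {})["(c)_%s" % category_name] = timestamp
    category_cf.setdefault(category_name, {})[url_key] = ""

add_url("u1", "http://example.com/")
add_category("u1", "gambling", 1365000000)

# Given a URL key, its categories are a contiguous "(c)_" column range:
categories = sorted(k[4:] for k in url_cf["u1"] if k.startswith("(c)_"))
print(categories)  # ['gambling']
```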
3. ???
Going into production
DAL (data access layer) abstracts away the split databases
Implement new features in Cassandra only
Get a feel of Cassandra before taking it into full use
A tale of two databases
Run both databases in parallel
Writes:
New data, and updates, go into both databases
Blind writes make it easy to do partial updates
Reads:
Read from both databases, cross-validate responses
Easy to move responsibilities from one database to another
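The parallel-run DAL might look something like the sketch below (dicts stand in for the two database clients; the class and method names are invented): every write goes to both stores, and every read cross-validates the two answers.

```python
import logging

class DualDAL:
    """Sketch of a data access layer running Postgres and Cassandra in parallel."""

    def __init__(self):
        self.postgres = {}   # stand-in for the Postgres client
        self.cassandra = {}  # stand-in for the Cassandra client

    def write(self, key, value):
        # Blind writes go into both databases.
        self.postgres[key] = value
        self.cassandra[key] = value

    def read(self, key):
        pg = self.postgres.get(key)
        cs = self.cassandra.get(key)
        if pg != cs:
            logging.warning("cross-validation mismatch for %r: %r != %r", key, pg, cs)
        # One database stays authoritative until the migration is verified.
        return pg

dal = DualDAL()
dal.write("u1", {"url": "http://example.com/", "safety": "safe"})
print(dal.read("u1"))
```

Moving a responsibility between the databases then only means changing which side `read` treats as authoritative.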
Migration boiled down to this
1. Dump URL keys from Postgres into batches
2. Custom migration script to chew a batch; for each URL in the batch:
2.1. Read data from Postgres
2.2. Delete the Cassandra row key for the URL
2.3. Write fresh data from Postgres into Cassandra
3. Log failing URLs
4. Cross-validate on reads for a while to ensure a successful migration
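The steps above can be sketched as a small loop (dicts stand in for the two databases; the function name and batch size are assumptions): each batch reads from Postgres, deletes the Cassandra row, rewrites it fresh, and logs failures so the batch can be re-run.

```python
def migrate(postgres, cassandra, url_keys, batch_size=100):
    """Toy version of the batch migration script described above."""
    failed = []
    for start in range(0, len(url_keys), batch_size):
        batch = url_keys[start:start + batch_size]
        for url_key in batch:
            try:
                data = postgres[url_key]          # 2.1 read data from Postgres
                cassandra.pop(url_key, None)      # 2.2 delete the Cassandra row
                cassandra[url_key] = dict(data)   # 2.3 write fresh data
            except KeyError:
                failed.append(url_key)            # 3. log failing URLs
    return failed

postgres = {"u%d" % i: {"url": "http://example-%d.com/" % i} for i in range(250)}
cassandra = {"u0": {"url": "stale"}}  # stale row gets deleted and rewritten
failed = migrate(postgres, cassandra, list(postgres) + ["missing"], batch_size=100)
print(len(cassandra), failed)  # 250 ['missing']
```

Because each pass deletes before rewriting, re-running a batch after a failure is safe, which matches the "make it easy to repeat migration" advice below.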
4. Profit
Bro-tips
Decide what you don't want to migrate
Dry run while testing, and keep an eye on the performance
Start in small batches, and verify the results before proceeding
Parallelize the batches if you need to speed it up
Keep an eye on performance; throttle if necessary
Not everything goes as planned, so make it easy to repeat the migration
Make sure the cluster is prepared for the migration; reserve time to tweak if not
Kiitos (thanks!)