Sharding @ Instagram
SFPUG, April 2012
Mike Krieger, Instagram
me
- Co-founder, Instagram
- Previously: UX & Front-end @ Meebo
- Stanford HCI BS/MS
- mikeyk on everything
pug

communicating and sharing in the real world

30+ million users in less than 2 years

at its heart: Postgres-driven

a glimpse at how a startup with a small eng team scaled with PG

a brief tangent

the beginning
2 product guys

no real back-end experience

(you should have seen my first time finding my way around psql)

analytics & Python @ Meebo

CouchDB

CrimeDesk SF

early mix: PG, Redis, Memcached

but we were hosted on a single machine somewhere in LA

less powerful than my MacBook Pro

okay, we launched. now what?
25k signups in the first day

everything is on fire

best & worst day of our lives so far

load was through the roof

friday rolls around

not slowing down

let's move to EC2

PG upgrade to 9.0

scaling = replacing all components of a car while driving it at 100mph

this is the story of how our usage of PG has evolved
Phase 1: All ORM, all the time

why PG at first? PostGIS

manage.py syncdb

ORM made it too easy to not really think through primary keys

pretty good for getting off the ground
Media.objects.get(pk = 4)

first version of our feed (pre-launch)

friends = Relationships.objects.filter(source_user = user)
recent_photos = Media.objects.filter(user_id__in = friends).order_by('-pk')[0:20]
main feed at launch

Redis: user 33 posts:
friends = SMEMBERS followers:33
for user in friends:
    LPUSH feed:<user_id> <media_id>

for reading:
LRANGE feed:4 0 20
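(a minimal redis-py sketch of that fan-out-on-write, following the slide's key layout; the function names and the synchronous loop are illustrative, not Instagram's actual code:)

```python
import redis

r = redis.Redis()

def fan_out_post(author_id, media_id):
    # push the new media id onto every follower's precomputed feed list
    for follower_id in r.smembers('followers:%d' % author_id):
        r.lpush('feed:%s' % follower_id.decode(), media_id)

def read_feed(user_id, limit=20):
    # newest-first page of the feed, as in LRANGE feed:4 0 20
    return r.lrange('feed:%d' % user_id, 0, limit)
```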
canonical data: PG
feeds/lists/sets: Redis
object cache: memcache
post-launch

moved db to its own machine

at the time, largest table: photo metadata

ran master-slave from the beginning, with streaming replication

backups: stop the replica, xfs_freeze drives, and take EBS snapshot

AWS tradeoff

3 early problems we hit with PG
1. oh, that setting was the problem

work_mem

shared_buffers

cost_delay

2. Django-specific: <idle in transaction>

3. connection pooling

(we use PGBouncer)

somewhere in this crazy couple of months, Christophe to the rescue
photos kept growing and growing

and only 68GB of RAM on biggest machine in EC2

so, what now?
Phase 2: Vertical Partitioning

django db routers make it pretty easy
def db_for_read(self, model):
    if app_label == 'photos':
        return 'photodb'
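(a fuller sketch of that router, assuming Django's multi-db support with a 'photodb' alias in settings.DATABASES and this class listed in DATABASE_ROUTERS; names are illustrative:)

```python
class PhotoRouter:
    """Send the photos app to its own database."""

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None  # fall through to the default database

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'photos':
            return 'photodb'
        return None
```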
once you untangle all your foreign key relationships

(all of those user/user_id interchangeable calls bite you now)

plenty of time spent in PGFouine

read slaves (using streaming replicas) where we need to reduce contention
a few months later

photosdb > 60GB

precipitated by being on cloud hardware, but likely to have hit the limit eventually either way

what now?

horizontal partitioning

Phase 3: sharding

"surely we'll have hired someone experienced before we actually need to shard"

never true about scaling
1. choosing a method
2. adapting the application
3. expanding capacity
evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

Range/date-based partitioning

did it in Postgres itself
requirements

1. low operational & code complexity

2. easy expanding of capacity

3. low performance impact on application

schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones
8 logical shards on 2 machines

user_id % 8 = logical shard
logical shards -> physical shard map
{0: A, 1: A, 2: A, 3: A,
 4: B, 5: B, 6: B, 7: B}

8 logical shards on 2 -> 4 machines

user_id % 8 = logical shard
logical shards -> physical shard map
{0: A, 1: A, 2: C, 3: C,
 4: B, 5: B, 6: D, 7: D}
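(a minimal Python sketch of that two-level lookup, using the post-expansion map above; names are illustrative:)

```python
NUM_LOGICAL_SHARDS = 8

# logical shard -> physical machine, per the second map above
LOGICAL_TO_PHYSICAL = {0: 'A', 1: 'A', 2: 'C', 3: 'C',
                       4: 'B', 5: 'B', 6: 'D', 7: 'D'}

def machine_for_user(user_id):
    logical_shard = user_id % NUM_LOGICAL_SHARDS
    return LOGICAL_TO_PHYSICAL[logical_shard]

# user 33 -> logical shard 1 -> machine 'A'
assert machine_for_user(33) == 'A'
```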
¡schemas!

all that 'public' stuff I'd been glossing over for 2 years

- database
  - schema
    - table
      - columns
spun up set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user

(fabric or similar parallel task executor is essential)
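(a sketch of that schema creation in 2012-era Fabric 1.x; host names, db name, and shard ranges are illustrative:)

```python
from fabric.api import env, parallel, run

env.hosts = ['machineA', 'machineB']
SHARD_RANGES = {'machineA': range(0, 4), 'machineB': range(4, 8)}

@parallel
def create_shard_schemas():
    # one schema per logical shard, same table names inside each
    for shard in SHARD_RANGES[env.host_string]:
        run('psql -d shards -c "CREATE SCHEMA shard%d"' % shard)
```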
application-side logic

SHARD_TO_DB = {}
SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1
instead of Django ORM, wrote really simple db abstraction layer

select/update/insert/delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

in most cases, user_id for us
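(a minimal sketch of how such a select() could route, assuming psycopg2 and the SHARD_TO_DB map above; connection details and query shape are illustrative, not Instagram's actual layer:)

```python
import psycopg2

NUM_LOGICAL_SHARDS = 8
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
DB_CONN_INFO = {0: 'host=machineA dbname=shards',
                1: 'host=machineB dbname=shards'}

def select(fields, table_name, shard_key, where_statement, where_parameters):
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = psycopg2.connect(DB_CONN_INFO[SHARD_TO_DB[shard_id]])
    # the table lives inside the shard's schema: shardN.table_name
    query = 'SELECT %s FROM shard%d.%s WHERE %s' % (
        ', '.join(fields), shard_id, table_name, where_statement)
    with conn.cursor() as cur:
        cur.execute(query, where_parameters)
        return cur.fetchall()
```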
custom Django test runner to create/tear-down sharded DBs

most queries involve visiting a handful of shards over one or two machines
if mapping across shards on a single DB, UNION ALL to aggregate
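(a hedged example of that single-machine aggregation; the column names are assumptions, the schema names follow the slides:)

```sql
-- two logical shards on the same machine, one query, one sort
SELECT id, user_id FROM shard0.photos_by_user WHERE user_id = %(uid0)s
UNION ALL
SELECT id, user_id FROM shard4.photos_by_user WHERE user_id = %(uid4)s
ORDER BY id DESC
LIMIT 30;
```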
clients to library pass in ((shard_key, id), (shard_key, id)) etc

library maps sub-selects to each shard and each machine

parallel execution (per-machine, at least)
-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
  -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
    -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
  -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
    -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)
eventually would be nice to parallelize across machines

next challenge: unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
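(usage sketch, assuming the insta5 schema exists; the table and column names are illustrative:)

```sql
CREATE SEQUENCE insta5.table_id_seq;

CREATE TABLE insta5.photos_by_user (
    id bigint NOT NULL DEFAULT insta5.next_id(),
    user_id bigint NOT NULL,
    PRIMARY KEY (id)
);
```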
pulling shard ID from ID:
shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
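(the same decode in Python, also splitting the middle 13 shard bits from the low 10 sequence bits; epoch value taken from next_id() above:)

```python
OUR_EPOCH_MS = 1314220021721  # custom epoch used by next_id()

def decode_id(the_id):
    timestamp_ms = OUR_EPOCH_MS + (the_id >> 23)
    shard_id = (the_id >> 10) & ((1 << 13) - 1)  # middle 13 bits
    seq_id = the_id & ((1 << 10) - 1)            # low 10 bits
    return timestamp_ms, shard_id, seq_id
```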
pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues
well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2
but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>
machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
PGBouncer abstracts moving DBs from the app logic
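(a hedged pgbouncer.ini sketch of that indirection; the aliases and hosts are illustrative — after a reshard you repoint an alias in config and reload, with no app change:)

```ini
[databases]
sharddb0 = host=machineA dbname=shards
sharddb1 = host=machineB dbname=shards
; after promoting the streaming replica, repoint here:
; sharddb1 = host=machineC dbname=shards
```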
can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards
latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
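(an illustrative v1 schema and the four lookups; the column types and indexes are assumptions:)

```sql
CREATE TABLE follows (
    source_id bigint NOT NULL,   -- the follower
    target_id bigint NOT NULL,   -- the followed
    status    smallint NOT NULL,
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX follows_target ON follows (target_id);

SELECT target_id FROM follows WHERE source_id = 33;             -- who do I follow?
SELECT source_id FROM follows WHERE target_id = 33;             -- who follows me?
SELECT 1 FROM follows WHERE source_id = 33 AND target_id = 44;  -- do I follow X?
SELECT 1 FROM follows WHERE source_id = 44 AND target_id = 33;  -- does X follow me?
```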
DB was busy, so we started storing parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching
next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
![Page 2: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/2.jpg)
me
- Co-founder Instagram
- Previously UX amp Front-end Meebo
- Stanford HCI BSMS
- mikeyk on everything
Monday April 23 12
pug
Monday April 23 12
communicating and sharing in the real world
Monday April 23 12
Monday April 23 12
Monday April 23 12
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 3: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/3.jpg)
pug
Monday April 23 12
communicating and sharing in the real world
Monday April 23 12
Monday April 23 12
Monday April 23 12
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 4: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/4.jpg)
communicating and sharing in the real world
Monday April 23 12
Monday April 23 12
Monday April 23 12
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 5: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/5.jpg)
Monday April 23 12
Monday April 23 12
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements

1. should be time-sortable without requiring a lookup
2. should be 64-bit
3. low operational complexity
surveyed the options

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
pulling shard ID from ID:
shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
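The same layout decoded in Python, as a sketch (EPOCH is the custom epoch from next_id above; note the slide's XOR expression yields the low 23 bits, shard ID plus sequence together, so the shard alone is those bits shifted down by 10):

EPOCH = 1314220021721  # custom epoch used in next_id above

def decode_id(photo_id):
    timestamp_ms = (photo_id >> 23) + EPOCH        # top 41 bits
    shard_id = (photo_id >> 10) & ((1 << 13) - 1)  # middle 13 bits
    seq_id = photo_id & ((1 << 10) - 1)            # low 10 bits
    return timestamp_ms, shard_id, seq_id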
pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go
hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs
especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>
machineA:  shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineA': shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user

machineA:  shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
machineC:  shard0.photos_by_user, shard1.photos_by_user, shard2.photos_by_user, shard3.photos_by_user
PGBouncer abstracts moving DBs from the app logic
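For instance, a hypothetical pgbouncer.ini fragment (entry and database names invented): the app always connects to the pooler's entry, so repointing it at the promoted replica is a config edit plus reload, not a code change.

[databases]
; shards 0-3 used to point at machineA; after promotion, machineC
shards_0_3 = host=machineC port=5432 dbname=shards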
can do this as long as you have more logical shards than physical ones
beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out new schema mapping

(could be solved by having concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards
latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?

DB was busy, so we started storing parallel version in Redis

follow_all(300-item list)
inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation
redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching

next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce need to re-shard
Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
![Page 6: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/6.jpg)
Monday April 23 12
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 7: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/7.jpg)
Monday April 23 12
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 8: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/8.jpg)
30+ million users in less than 2 years
Monday April 23 12
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 9: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/9.jpg)
at its heart Postgres-driven
Monday April 23 12
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 10: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/10.jpg)
a glimpse at how a startup with a small eng team scaled with PG
Monday April 23 12
a brief tangent
Monday April 23 12
the beginning
Monday April 23 12
Text
Monday April 23 12
2 product guys
Monday April 23 12
no real back-end experience
Monday April 23 12
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions

at the time, none were up to the task of being our primary DB

NoSQL alternatives

Skype's sharding proxy

Range/date-based partitioning

done in Postgres itself

requirements:

1. low operational & code complexity
2. easy expanding of capacity
3. low performance impact on application
schema-based logical sharding

many, many, many (thousands) of logical shards

that map to fewer physical ones

8 logical shards on 2 machines:

user_id % 8 = logical shard

logical shards -> physical shard map:
{0: A, 1: A, 2: A, 3: A, 4: B, 5: B, 6: B, 7: B}

8 logical shards on 2 → 4 machines:

user_id % 8 = logical shard

logical shards -> physical shard map:
{0: A, 1: A, 2: C, 3: C, 4: B, 5: B, 6: D, 7: D}
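the payoff: a user's logical shard never changes, only the map does. a tiny illustrative sketch (machine labels and user IDs are made up):

NUM_LOGICAL_SHARDS = 8

BEFORE = {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B'}
AFTER  = {0: 'A', 1: 'A', 2: 'C', 3: 'C', 4: 'B', 5: 'B', 6: 'D', 7: 'D'}

def machine_for(user_id, shard_map):
    return shard_map[user_id % NUM_LOGICAL_SHARDS]

assert machine_for(33, BEFORE) == 'A'   # logical shard 1, on machine A
assert machine_for(33, AFTER)  == 'A'   # still machine A after the split
assert machine_for(34, BEFORE) == 'A'   # logical shard 2 starts on A...
assert machine_for(34, AFTER)  == 'C'   # ...and moves to C with its data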
¡schemas!

all that 'public' stuff I'd been glossing over for 2 years

- database
  - schema
    - table
      - columns
spun up a set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user

(fabric or similar parallel task executor is essential)
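a sketch of what that fabric run could look like; host names, shard counts, and the table DDL here are assumptions, not the real deployment:

# fabfile.py sketch (Fabric 1.x API); illustrative names only
from fabric.api import env, parallel, run

env.hosts = ['machineA', 'machineB']   # assumed host names
SHARDS_PER_MACHINE = 4                 # thousands, in the real setup

@parallel
def create_schemas():
    first = env.hosts.index(env.host) * SHARDS_PER_MACHINE
    for n in range(first, first + SHARDS_PER_MACHINE):
        run('psql -d photodb -c "CREATE SCHEMA shard%d"' % n)
        run('psql -d photodb -c "CREATE TABLE shard%d.photos_by_user '
            '(id bigint PRIMARY KEY, user_id bigint NOT NULL)"' % n)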
application-side logic

SHARD_TO_DB = {}

SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1
instead of Django ORM, wrote a really simple db abstraction layer

select/update/insert/delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id

in most cases, user_id for us
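roughly what the routing inside that layer has to do — a sketch, not Instagram's actual library (SHARD_TO_DB is the map above; run_query is an assumed helper standing in for connection handling):

NUM_LOGICAL_SHARDS = 8

def select(fields, table_name, shard_key, where_statement, where_parameters):
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    db = SHARD_TO_DB[shard_id]
    # schema-qualify the table so the query hits the right logical shard
    query = 'SELECT %s FROM shard%d.%s WHERE %s' % (
        ', '.join(fields), shard_id, table_name, where_statement)
    return run_query(db, query, where_parameters)  # assumed helper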
custom Django test runner to create/tear-down sharded DBs
most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate

clients to library pass in ((shard_key, id), (shard_key, id)), etc

library maps sub-selects to each shard and each machine

parallel execution (per-machine at least)
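a sketch of that fan-out, reusing the names above: group the requested logical shards by the machine they live on, and glue same-machine sub-selects together with UNION ALL:

from collections import defaultdict

def select_many(fields, table_name, shard_keys, where_statement):
    # group requested keys by the physical DB their shard lives on
    shards_by_db = defaultdict(set)
    for key in shard_keys:
        shard_id = key % NUM_LOGICAL_SHARDS
        shards_by_db[SHARD_TO_DB[shard_id]].add(shard_id)
    # one UNION ALL query per machine; these can run in parallel
    queries = {}
    for db, shard_ids in shards_by_db.items():
        queries[db] = ' UNION ALL '.join(
            'SELECT %s FROM shard%d.%s WHERE %s'
            % (', '.join(fields), s, table_name, where_statement)
            for s in sorted(shard_ids))
    return queries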
-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
  -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
    -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
  -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
    -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)
eventually would be nice to parallelize across machines

next challenge: unique IDs

requirements:

1. should be time-sortable without requiring a lookup
2. should be 64-bit
3. low operational complexity
surveyed the options

ticket servers

UUID

twitter snowflake

application-level IDs a la Mongo

hey, the db is already pretty good about incrementing sequences

[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
pulling shard ID from ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
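a quick Python sketch of the same bit-packing, round-tripped; the epoch matches the function above, everything else is illustrative:

EPOCH_MS = 1314220021721  # custom epoch from the function above

def make_id(now_millis, shard_id, seq_id):
    # [ 41 bits millis ][ 13 bits shard ID ][ 10 bits sequence ID ]
    return ((now_millis - EPOCH_MS) << 23) | (shard_id << 10) | (seq_id % 1024)

def unpack_id(id_):
    millis = (id_ >> 23) + EPOCH_MS
    shard_id = (id_ >> 10) & ((1 << 13) - 1)  # middle 13 bits
    seq_id = id_ & ((1 << 10) - 1)            # low 10 bits
    return millis, shard_id, seq_id

assert unpack_id(make_id(1335225600000, 5, 42)) == (1335225600000, 5, 42)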
pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues

well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs
especially useful on EC2

but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
PGBouncer abstracts moving DBs from the app logic
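concretely, a PgBouncer [databases] sketch (host and database names assumed): repointing an entry and reloading is all the application ever sees:

; pgbouncer.ini sketch -- illustrative names only
[databases]
shard0 = host=machineA dbname=photodb
shard1 = host=machineA dbname=photodb
shard2 = host=machineA dbname=photodb
shard3 = host=machineA dbname=photodb
; after promoting the replica as machineC, repoint and RELOAD:
; shard2 = host=machineC dbname=photodb
; shard3 = host=machineC dbname=photodb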
can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards
latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
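those four questions map directly onto that table; a sketch with DB-API-style queries (column names from the slide; the table name and status values are assumptions):

FOLLOWING = 'SELECT target_id FROM followers WHERE source_id = %s AND status = %s'
FOLLOWERS = 'SELECT source_id FROM followers WHERE target_id = %s AND status = %s'
EDGE = 'SELECT 1 FROM followers WHERE source_id = %s AND target_id = %s'

ACTIVE = 1  # assumed status value

def do_i_follow(cursor, me, x):
    cursor.execute(EDGE, (me, x))
    return cursor.fetchone() is not None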
DB was busy, so we started storing a parallel version in Redis

follow_all(300-item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, very light memcached caching
next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 16: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/16.jpg)
(you should have seen my first time finding my
way around psql)
Monday April 23 12
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 17: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/17.jpg)
analytics amp python meebo
Monday April 23 12
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 18: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/18.jpg)
CouchDB
Monday April 23 12
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 19: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/19.jpg)
CrimeDesk SF
Monday April 23 12
Monday April 23 12
early mix PG Redis Memcached
Monday April 23 12
but were hosted on a single machine
somewhere in LA
Monday April 23 12
Monday April 23 12
less powerful than my MacBook Pro
Monday April 23 12
okay we launchednow what
Monday April 23 12
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 26: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/26.jpg)
25k signups in the first day
Monday April 23 12
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 27: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/27.jpg)
everything is on fire
Monday April 23 12
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 28: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/28.jpg)
best amp worst day of our lives so far
Monday April 23 12
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 29: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/29.jpg)
load was through the roof
Monday April 23 12
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 30: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/30.jpg)
friday rolls around
Monday April 23 12
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 31: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/31.jpg)
not slowing down
Monday April 23 12
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 32: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/32.jpg)
letrsquos move to EC2
Monday April 23 12
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 33: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/33.jpg)
Monday April 23 12
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 34: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/34.jpg)
PG upgrade to 90
Monday April 23 12
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 35: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/35.jpg)
Monday April 23 12
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines

using fabric, created thousands of schemas

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineB:
  shard4.photos_by_user
  shard5.photos_by_user
  shard6.photos_by_user
  shard7.photos_by_user
(fabric, or a similar parallel task executor, is essential)
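a sketch of how those schemas might get stamped out (not Instagram's actual script); the schema-per-shard layout is from the slides, while the connection details and photos_by_user columns are assumptions:

import psycopg2

def create_logical_shards(host, shard_ids):
    # one schema per logical shard, each holding its own copy of the tables
    conn = psycopg2.connect(host=host, dbname='insta')
    cur = conn.cursor()
    for n in shard_ids:
        cur.execute("CREATE SCHEMA shard%d" % n)
        # hypothetical columns; the real table holds photo metadata
        cur.execute("CREATE TABLE shard%d.photos_by_user ("
                    " id bigint PRIMARY KEY,"
                    " user_id bigint NOT NULL,"
                    " media_id bigint NOT NULL)" % n)
    conn.commit()
    conn.close()

# machineA gets logical shards 0-3, machineB gets 4-7
create_logical_shards('machineA', range(0, 4))
create_logical_shards('machineB', range(4, 8))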
application-side logic:

SHARD_TO_DB = {}
SHARD_TO_DB[0] = 0
SHARD_TO_DB[1] = 0
SHARD_TO_DB[2] = 0
SHARD_TO_DB[3] = 0
SHARD_TO_DB[4] = 1
SHARD_TO_DB[5] = 1
SHARD_TO_DB[6] = 1
SHARD_TO_DB[7] = 1
instead of the Django ORM, wrote a really simple db abstraction layer:

select/update/insert/delete

select(fields, table_name, shard_key, where_statement, where_parameters)

shard_key % num_logical_shards = shard_id
in most cases, user_id for us
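one plausible shape for the routing step inside select(); the signature and SHARD_TO_DB are from the slides, while get_connection and the shard count are illustrative assumptions:

NUM_LOGICAL_SHARDS = 8  # illustrative; in practice "thousands"

def select(fields, table_name, shard_key, where_statement, where_parameters):
    # map the shard key (usually a user_id) to a logical shard,
    # then to the physical database currently hosting that shard
    shard_id = shard_key % NUM_LOGICAL_SHARDS
    conn = get_connection(SHARD_TO_DB[shard_id])  # assumed helper
    sql = "SELECT %s FROM shard%d.%s WHERE %s" % (
        ", ".join(fields), shard_id, table_name, where_statement)
    cur = conn.cursor()
    cur.execute(sql, where_parameters)
    return cur.fetchall()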
custom Django test runner to create/tear-down sharded DBs
most queries involve visiting a handful of shards over one or two machines

if mapping across shards on a single DB, UNION ALL to aggregate
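for example, pulling two users' photos out of different schemas on the same box might aggregate like this (table and column names assumed from earlier slides); this is what produces the Append plan shown below:

(SELECT id, user_id FROM shard0.photos_by_user
 WHERE user_id = 40 ORDER BY id DESC LIMIT 30)
UNION ALL
(SELECT id, user_id FROM shard1.photos_by_user
 WHERE user_id = 33 ORDER BY id DESC LIMIT 30);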
clients to the library pass in ((shard_key, id), (shard_key, id)), etc

library maps sub-selects to each shard and each machine

parallel execution (per-machine, at least)
-> Append (cost=0.00..973.72 rows=100 width=12) (actual time=0.290..160.035 rows=30 loops=1)
   -> Limit (cost=0.00..806.24 rows=30 width=12) (actual time=0.288..159.913 rows=14 loops=1)
      -> Index Scan Backward using index on table (cost=0.00..18651.04 rows=694 width=12) (actual time=0.286..159.885 rows=14 loops=1)
   -> Limit (cost=0.00..71.15 rows=30 width=12) (actual time=0.015..0.018 rows=1 loops=1)
      -> Index Scan using index on table (cost=0.00..101.99 rows=43 width=12) (actual time=0.013..0.014 rows=1 loops=1)
(etc)
eventually, would be nice to parallelize across machines

next challenge: unique IDs
requirements:

1. should be time-sortable without requiring a lookup

2. should be 64-bit

3. low operational complexity
surveyed the options:

ticket servers

UUID

twitter snowflake

application-level IDs à la Mongo
hey, the db is already pretty good about incrementing sequences
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
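presumably wired up as the column default, something like the following (an assumption based on the function above; table and column names are illustrative):

CREATE TABLE insta5.our_table (
    id bigint NOT NULL DEFAULT insta5.next_id(),
    user_id bigint NOT NULL
);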
pulling the shard ID from an ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + (id >> 23)
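the same unpacking in Python, assuming the 41/13/10 bit layout above (function and variable names are illustrative):

EPOCH_MS = 1314220021721  # the custom epoch from next_id()

def decode_id(generated_id):
    # high 41 bits: milliseconds since the custom epoch
    timestamp_ms = EPOCH_MS + (generated_id >> 23)
    # middle 13 bits: logical shard ID
    shard_id = (generated_id >> 10) & ((1 << 13) - 1)
    # low 10 bits: per-shard sequence number
    seq_id = generated_id & ((1 << 10) - 1)
    return timestamp_ms, shard_id, seq_id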
pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues
well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2
but sometimes, you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>
machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

(the clone then comes up as its own machine, machineC:)

machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user

machineC:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
PGBouncer abstracts moving DBs from the app logic
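a hypothetical pgbouncer.ini fragment; the app keeps connecting to stable logical names while the host= entries move (database names and hosts are illustrative):

[databases]
; before the reshard, both entries pointed at machineA
sharddb0 = host=machineA dbname=insta
sharddb1 = host=machineC dbname=insta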
can do this as long as you have more logical shards than physical ones
beauty of schemas is that they are physically different files

(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards
latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
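v1 plausibly looked something like this (column types and index are assumptions; note that both query directions need an index):

CREATE TABLE followers (
    source_id bigint NOT NULL,  -- the follower
    target_id bigint NOT NULL,  -- the followed
    status    int    NOT NULL,  -- e.g. pending / approved / blocked
    PRIMARY KEY (source_id, target_id)
);

-- "who do I follow?" is served by the primary key;
-- "who follows me?" needs the reverse index:
CREATE INDEX followers_by_target ON followers (target_id, source_id);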
DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation
redesign: took a page from twitter's book

PG can handle tens of thousands of requests, with very light memcached caching
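i.e., a plain read-through cache in front of the follow-graph queries; a sketch using python-memcached, where every name is an illustrative assumption:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_follower_count(user_id, query_db):
    # read-through: try memcached first, fall back to PG, cache briefly
    key = 'follower_count:%d' % user_id
    count = mc.get(key)
    if count is None:
        count = query_db(user_id)    # the real PG query, assumed
        mc.set(key, count, time=30)  # "very light" caching: short TTL
    return count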
next steps

isolating services to minimize open conns

investigate physical hardware, etc, to reduce the need to re-shard
Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any questions?
![Page 36: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/36.jpg)
scaling = replacing all components of a car
while driving it at 100mph
Monday April 23 12
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 37: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/37.jpg)
this is the story of how our usage of PG has
evolved
Monday April 23 12
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 38: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/38.jpg)
Phase 1 All ORM all the time
Monday April 23 12
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 39: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/39.jpg)
why pg at first postgis
Monday April 23 12
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 40: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/40.jpg)
managepy syncdb
Monday April 23 12
ORM made it too easy to not really think through
primary keys
Monday April 23 12
pretty good for getting off the ground
Monday April 23 12
Mediaobjectsget(pk = 4)
Monday April 23 12
first version of our feed (pre-launch)
Monday April 23 12
friends = Relationshipsobjectsfilter(source_user = user)
recent_photos = Mediaobjectsfilter(user_id__in = friends)order_by(lsquo-pkrsquo)[020]
Monday April 23 12
main feed at launch
Monday April 23 12
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros: guaranteed unique in 64 bits, not much of a CPU overhead
Monday April 23 12
cons: large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this scheme, no issues
Monday April 23 12
well, what about "re-sharding"?
Monday April 23 12
first recourse: pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone <master>
Monday April 23 12
machineA:
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
machineA':
  shard0.photos_by_user
  shard1.photos_by_user
  shard2.photos_by_user
  shard3.photos_by_user
Monday April 23 12
machineA:
  shard0.photos_by_user
  shard1.photos_by_user
machineC:
  shard2.photos_by_user
  shard3.photos_by_user
Monday April 23 12
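in code, the cut-over is just a remap of logical shards to the new machine (an illustrative sketch matching the 8-shard example from earlier, not the production config):

# before: shards 0-3 on machineA (db 0), 4-7 on machineB (db 1)
SHARD_TO_DB = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

# machineC starts as a streaming replica of machineA; once promoted,
# point half of machineA's logical shards at it (db 2) -- no data rewrite
SHARD_TO_DB.update({2: 2, 3: 2})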
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no I/O hit when deleting, no 'swiss cheese')
Monday April 23 12
downside: requires ~30 seconds of maintenance to roll out a new schema mapping
Monday April 23 12
(could be solved by having a concept of "read-only" mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project: follow graph
Monday April 23 12
v1: simple DB table (source_id, target_id, status)
Monday April 23 12
who do I follow? who follows me?
do I follow X? does X follow me?
Monday April 23 12
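a sketch of the v1 design, with assumed (not production) DDL and the four lookups kept as query constants:

FOLLOWS_DDL = """
CREATE TABLE follows (
    source_id bigint NOT NULL,
    target_id bigint NOT NULL,
    status    smallint NOT NULL,
    PRIMARY KEY (source_id, target_id)
);
CREATE INDEX follows_target_idx ON follows (target_id);
"""

WHO_DO_I_FOLLOW = "SELECT target_id FROM follows WHERE source_id = %s"
WHO_FOLLOWS_ME  = "SELECT source_id FROM follows WHERE target_id = %s"
# do I follow X / does X follow me: same shape, swapped parameters
FOLLOW_EDGE     = "SELECT status FROM follows WHERE source_id = %s AND target_id = %s"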
DB was busy, so we started storing a parallel version in Redis
Monday April 23 12
follow_all(300-item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitter's book
Monday April 23 12
PG can handle tens of thousands of requests, with very light memcached caching
Monday April 23 12
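a minimal cache-aside sketch of that pattern (helper and key names are assumptions; the cache client just needs memcached-style get/set):

def follower_count(cache, db_conn, user_id, ttl=60):
    key = "follower_count:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return int(cached)
    cur = db_conn.cursor()
    cur.execute("SELECT count(*) FROM follows WHERE target_id = %s", (user_id,))
    count = cur.fetchone()[0]
    cache.set(key, str(count), ttl)   # short TTL keeps the caching "very light"
    return count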
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware, etc., to reduce the need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you don't need to give up PG's durability & features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
"don't shard until you have to"
Monday April 23 12
(but don't over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(we're really excited about 9.2)
Monday April 23 12
thanks! any qs?
Monday April 23 12
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 47: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/47.jpg)
Redis user 33 postsfriends = SMEMBERS followers33for user in friends LPUSH feedltuser_idgt ltmedia_idgt
for readingLRANGE feed4 0 20
Monday April 23 12
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 48: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/48.jpg)
canonical data PGfeedslistssets Redis
object cache memcache
Monday April 23 12
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 49: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/49.jpg)
post-launch
Monday April 23 12
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 50: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/50.jpg)
moved db to its own machine
Monday April 23 12
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 51: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/51.jpg)
at time largest table photo metadata
Monday April 23 12
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy, so we started storing a parallel version in Redis
follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation
redesign took a page from twitter's book
PG can handle tens of thousands of requests with very light memcached caching
next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce need to re-shard

Wrap up
you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
![Page 52: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/52.jpg)
ran master-slave from the beginning with
streaming replication
Monday April 23 12
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 53: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/53.jpg)
backups stop the replica xfs_freeze drives and take EBS snapshot
Monday April 23 12
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 54: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/54.jpg)
AWS tradeoff
Monday April 23 12
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 55: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/55.jpg)
3 early problems we hit with PG
Monday April 23 12
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 56: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/56.jpg)
1 oh that setting was the problem
Monday April 23 12
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 57: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/57.jpg)
work_mem
Monday April 23 12
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically different files
(no I/O hit when deleting, no 'swiss cheese')
downside: requires ~30 seconds of maintenance to roll out a new schema mapping
(could be solved by having a concept of "read-only" mode for some DBs)
not great for range-scans that would span across shards

latest project: follow graph
v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)
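a sketch of that parallel Redis copy (key names are assumptions, not from the talk): one set per direction, so the membership questions above become O(1) SISMEMBER calls.

import redis

r = redis.StrictRedis()

def follow_all(source_id, target_ids):
    # e.g. a 300-item list; one pipelined round-trip writes both directions
    pipe = r.pipeline()
    for t in target_ids:
        pipe.sadd('following:%d' % source_id, t)   # who do I follow?
        pipe.sadd('followers:%d' % t, source_id)   # who follows t?
    pipe.execute()

def do_i_follow(source_id, target_id):
    return r.sismember('following:%d' % source_id, target_id)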
but: inconsistency, extra logic, so much extra logic,
and exposing your support team to the idea of cache invalidation
redesign took a page from Twitter's book:
PG can handle tens of thousands of requests with very light memcached caching
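a sketch of that read path, with PG canonical and memcached as a thin read-through layer; the followers table name, sharding the graph by source_id % 8, and the 30-second TTL are all assumptions for illustration.

import memcache
import psycopg2

mc = memcache.Client(['127.0.0.1:11211'])
conn = psycopg2.connect('host=machineA dbname=shards')  # assumed DSN

def follows(source_id, target_id):
    key = 'follows:%d:%d' % (source_id, target_id)
    cached = mc.get(key)
    if cached is not None:
        return cached == '1'
    # miss: PG is the source of truth
    cur = conn.cursor()
    cur.execute('SELECT 1 FROM shard%d.followers'
                ' WHERE source_id = %%s AND target_id = %%s' % (source_id % 8),
                (source_id, target_id))
    found = cur.fetchone() is not None
    cur.close()
    mc.set(key, '1' if found else '0', time=30)  # short TTL keeps invalidation trivial
    return found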
next steps:
- isolating services to minimize open conns
- investigate physical hardware etc. to reduce the need to re-shard
Wrap up

- you don't need to give up PG's durability & features to shard
- continue to let the DB do what the DB is great at
- "don't shard until you have to" (but don't over-estimate how hard it will be, either)
- scaled within the constraints of the cloud
- a PG success story (we're really excited about 9.2)

thanks! any Qs?
![Page 58: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/58.jpg)
shared_buffers
Monday April 23 12
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 59: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/59.jpg)
cost_delay
Monday April 23 12
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 60: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/60.jpg)
2 Django-specific ltidle in transactiongt
Monday April 23 12
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 61: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/61.jpg)
3 connection pooling
Monday April 23 12
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 62: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/62.jpg)
(we use PGBouncer)
Monday April 23 12
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 63: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/63.jpg)
somewhere in this crazy couple of months
Christophe to the rescue
Monday April 23 12
photos kept growing and growing
Monday April 23 12
and only 68GB of RAM on biggest machine in EC2
Monday April 23 12
so what now
Monday April 23 12
Phase 2 Vertical Partitioning
Monday April 23 12
django db routers make it pretty easy
Monday April 23 12
def db_for_read(self model) if app_label == photos return photodb
Monday April 23 12
once you untangle all your foreign key
relationships
Monday April 23 12
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12

next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard

Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 71: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/71.jpg)
(all of those useruser_id interchangeable calls bite
you now)
Monday April 23 12
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 72: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/72.jpg)
plenty of time spent in PGFouine
Monday April 23 12
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 73: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/73.jpg)
read slaves (using streaming replicas) where
we need to reduce contention
Monday April 23 12
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 74: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/74.jpg)
a few months later
Monday April 23 12
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 75: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/75.jpg)
photosdb gt 60GB
Monday April 23 12
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 76: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/76.jpg)
precipitated by being on cloud hardware but likely to have hit limit eventually
either way
Monday April 23 12
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
extra logic
so much extra logic
exposing your support team to the idea of cache invalidation
redesign took a page from Twitter's book:
PG can handle tens of thousands of requests, with very light memcached caching
next steps:
- isolating services to minimize open conns
- investigate physical hardware, etc., to reduce the need to re-shard
Wrap up

you don't need to give up PG's durability & features to shard
continue to let the DB do what the DB is great at
"don't shard until you have to"
(but don't over-estimate how hard it will be either)
scaled within constraints of the cloud
PG success story
(we're really excited about 9.2)

thanks! any questions?
![Page 77: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/77.jpg)
what now
Monday April 23 12
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 78: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/78.jpg)
horizontal partitioning
Monday April 23 12
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 79: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/79.jpg)
Phase 3 sharding
Monday April 23 12
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 80: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/80.jpg)
ldquosurely wersquoll have hired someone experienced before we actually need
to shardrdquo
Monday April 23 12
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 81: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/81.jpg)
never true about scaling
Monday April 23 12
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 82: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/82.jpg)
1 choosing a method2 adapting the application
3 expanding capacity
Monday April 23 12
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 83: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/83.jpg)
evaluated solutions
Monday April 23 12
at the time none were up to task of being our
primary DB
Monday April 23 12
NoSQL alternatives
Monday April 23 12
Skypersquos sharding proxy
Monday April 23 12
Rangedate-based partitioning
Monday April 23 12
did in Postgres itself
Monday April 23 12
requirements
Monday April 23 12
1 low operational amp code complexity
Monday April 23 12
2 easy expanding of capacity
Monday April 23 12
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)

inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation
redesign took a page from twitter's book

PG can handle tens of thousands of requests, with very light memcached caching
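A minimal cache-aside sketch of that shape, assuming the python-memcached client and the select() wrapper from earlier; key format and TTL are illustrative:

import memcache  # python-memcached client (assumed available)

mc = memcache.Client(['127.0.0.1:11211'])

def does_x_follow_y(x_id, y_id):
    # PG stays the source of truth; memcached absorbs the repeat reads
    key = 'follows:%d:%d' % (x_id, y_id)
    cached = mc.get(key)
    if cached is not None:
        return cached == '1'
    row = select(fields='1', table_name='follows', shard_key=x_id,
                 where_statement='source_id = %s AND target_id = %s',
                 where_parameters=(x_id, y_id))
    mc.set(key, '1' if row else '0', time=600)  # short TTL; invalidate on write
    return bool(row)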
next steps

isolating services to minimize open conns

investigate physical hardware, etc., to reduce the need to re-shard
Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 92: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/92.jpg)
3 low performance impact on application
Monday April 23 12
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 93: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/93.jpg)
schema-based logical sharding
Monday April 23 12
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 94: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/94.jpg)
many many many (thousands) of logical
shards
Monday April 23 12
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 95: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/95.jpg)
that map to fewer physical ones
Monday April 23 12
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 96: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/96.jpg)
8 logical shards on 2 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 A 3 A 4 B 5 B 6 B 7 B
Monday April 23 12
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 97: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/97.jpg)
8 logical shards on 2 4 machines
user_id 8 = logical shard
logical shards -gt physical shard map
0 A 1 A 2 C 3 C 4 B 5 B 6 D 7 D
Monday April 23 12
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 98: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/98.jpg)
iexclschemas
Monday April 23 12
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
"don't shard until you have to"
Monday April 23 12
(but don't over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(we're really excited about 9.2)
Monday April 23 12
thanks! any qs?
Monday April 23 12
![Page 99: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/99.jpg)
all that lsquopublicrsquo stuff Irsquod been glossing over for 2
years
Monday April 23 12
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 100: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/100.jpg)
- database - schema - table - columns
Monday April 23 12
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 101: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/101.jpg)
spun up set of machines
Monday April 23 12
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 102: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/102.jpg)
using fabric created thousands of schemas
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 103: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/103.jpg)
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineB shard4 photos_by_user shard5 photos_by_user shard6 photos_by_user shard7 photos_by_user
Monday April 23 12
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 104: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/104.jpg)
(fabric or similar parallel task executor is
essential)
Monday April 23 12
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 105: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/105.jpg)
application-side logic
Monday April 23 12
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 106: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/106.jpg)
SHARD_TO_DB =
SHARD_TO_DB[0] = 0SHARD_TO_DB[1] = 0SHARD_TO_DB[2] = 0SHARD_TO_DB[3] = 0SHARD_TO_DB[4] = 1SHARD_TO_DB[5] = 1SHARD_TO_DB[6] = 1SHARD_TO_DB[7] = 1
Monday April 23 12
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 107: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/107.jpg)
instead of Django ORM wrote really simple db
abstraction layer
Monday April 23 12
selectupdateinsertdelete
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
Monday April 23 12
select(fields table_name shard_key where_statement where_parameters)
shard_key num_logical_shards = shard_id
Monday April 23 12
in most cases user_id for us
Monday April 23 12
custom Django test runner to createtear-down sharded DBs
Monday April 23 12
most queries involve visiting handful of shards
over one or two machines
Monday April 23 12
if mapping across shards on single DB UNION
ALL to aggregate
Monday April 23 12
clients to library pass in((shard_key id)
(shard_keyid)) etc
Monday April 23 12
library maps sub-selects to each shard and each
machine
Monday April 23 12
parallel execution (per-machine at least)
Monday April 23 12
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
```sql
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
    our_epoch bigint := 1314220021721;
    seq_id bigint;
    now_millis bigint;
    shard_id int := 5;
BEGIN
    SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
    SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
    result := (now_millis - our_epoch) << 23;
    result := result | (shard_id << 10);
    result := result | (seq_id);
END;
$$ LANGUAGE PLPGSQL;
```
pulling shard ID from an ID:

shard_id = id ^ ((id >> 23) << 23)
timestamp = EPOCH + id >> 23
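and the inverse in Python, using masks instead of the xor trick (same result), reusing EPOCH_MS from the sketch above:

```python
def split_id(id64):
    seq = id64 & 0x3FF                 # low 10 bits
    shard_id = (id64 >> 10) & 0x1FFF   # next 13 bits
    ms = (id64 >> 23) + EPOCH_MS       # top 41 bits, back to wall-clock millis
    return ms, shard_id, seq
```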
pros: guaranteed unique in 64 bits, not much of a CPU overhead

cons: large IDs from the get-go

hundreds of millions of IDs generated with this scheme, no issues
well, what about "re-sharding"?

first recourse: pg_reorg

rewrites tables in index order

only requires brief locks for atomic table renames

20+GB savings on some of our dbs

especially useful on EC2
but sometimes you just have to reshard

streaming replication to the rescue

(btw, repmgr is awesome)

repmgr standby clone <master>
machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
machineA′: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user

machineA: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user
machineC: shard0 photos_by_user, shard1 photos_by_user, shard2 photos_by_user, shard3 photos_by_user

(clone machineA to a new box, then have each machine serve only its half of the logical shards)
PgBouncer abstracts moving DBs from the app logic
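for illustration only, treating each shard as a database name in pgbouncer.ini (hostnames hypothetical): repointing a shard is an edit here plus a reload, and the app keeps connecting to the same names:

```ini
[databases]
; hypothetical mapping after the split: shard2/shard3 now live on machineC
shard0 = host=machineA dbname=shard0
shard1 = host=machineA dbname=shard1
shard2 = host=machineC dbname=shard2
shard3 = host=machineC dbname=shard3
```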
can do this as long as you have more logical shards than physical ones

beauty of schemas is that they are physically different files
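a sketch of what bootstrapping those schemas can look like; the columns are made up, only photos_by_user is from the slides:

```python
def shard_bootstrap_ddl(num_logical_shards=8):
    # one schema per logical shard; moving a shard later means moving
    # whole schema files, never rewriting rows in place
    for i in range(num_logical_shards):
        yield "CREATE SCHEMA shard%d;" % i
        yield ("CREATE TABLE shard%d.photos_by_user ("
               " user_id bigint NOT NULL,"
               " media_id bigint NOT NULL,"
               " PRIMARY KEY (user_id, media_id));" % i)
```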
(no I/O hit when deleting, no 'swiss cheese')

downside: requires ~30 seconds of maintenance to roll out a new schema mapping

(could be solved by having a concept of "read-only" mode for some DBs)

not great for range-scans that would span across shards
latest project: follow graph

v1: simple DB table (source_id, target_id, status)

who do I follow? who follows me?
do I follow X? does X follow me?
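one plausible shape for that table and its four questions; everything beyond (source_id, target_id, status) is an assumption:

```python
FOLLOWS_DDL = """
CREATE TABLE follows (
    source_id bigint   NOT NULL,
    target_id bigint   NOT NULL,
    status    smallint NOT NULL,
    PRIMARY KEY (source_id, target_id)  -- covers the source_id lookups
);
CREATE INDEX follows_by_target ON follows (target_id);  -- covers the reverse
"""

WHO_DO_I_FOLLOW = "SELECT target_id FROM follows WHERE source_id = %s"
WHO_FOLLOWS_ME  = "SELECT source_id FROM follows WHERE target_id = %s"
# the two point lookups are the same query with the ids swapped
DO_I_FOLLOW_X   = "SELECT status FROM follows WHERE source_id = %s AND target_id = %s"
```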
DB was busy, so we started storing a parallel version in Redis

follow_all(300 item list)
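a sketch of why the parallel copy hurts, with a hypothetical insert_follow_row standing in for the PG write:

```python
import redis

r = redis.StrictRedis()

def insert_follow_row(source_id, target_id):
    pass  # stand-in for the real PG insert

def follow_all(source_id, target_ids):
    # the 300-item case from the slide: two writes per edge
    for target_id in target_ids:
        insert_follow_row(source_id, target_id)
        r.sadd("following:%d" % source_id, target_id)
    # die between the two writes and PG and Redis now disagree --
    # hence the inconsistency and extra logic below
```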
inconsistency

extra logic

so much extra logic

exposing your support team to the idea of cache invalidation

redesign took a page from twitter's book

PG can handle tens of thousands of requests, with very light memcached caching
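a cache-aside sketch of that "very light caching", assuming python-memcached and a hypothetical query_pg_for_count; PG stays the source of truth:

```python
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def query_pg_for_count(user_id):
    return 0  # stand-in for the real PG count query

def follower_count(user_id):
    key = "followers:%d" % user_id
    count = mc.get(key)
    if count is None:
        count = query_pg_for_count(user_id)
        mc.set(key, count, time=30)  # short TTL: stale for at most 30s
    return count
```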
next steps

isolating services to minimize open conns

investigate physical hardware etc. to reduce the need to re-shard
Wrap up

you don't need to give up PG's durability & features to shard

continue to let the DB do what the DB is great at

"don't shard until you have to"

(but don't over-estimate how hard it will be, either)

scaled within constraints of the cloud

PG success story

(we're really excited about 9.2)

thanks! any qs?
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 118: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/118.jpg)
-gt Append (cost=00097372 rows=100 width=12) (actual time=0290160035 rows=30 loops=1) -gt Limit (cost=00080624 rows=30 width=12) (actual time=0288159913 rows=14 loops=1) -gt Index Scan Backward using index on table (cost=0001865104 rows=694 width=12) (actual time=0286159885 rows=14 loops=1) -gt Limit (cost=0007115 rows=30 width=12) (actual time=00150018 rows=1 loops=1) -gt Index Scan using index on table (cost=00010199 rows=43 width=12) (actual time=00130014 rows=1 loops=1) (etc)
Monday April 23 12
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 119: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/119.jpg)
eventually would be nice to parallelize across
machines
Monday April 23 12
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 120: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/120.jpg)
next challenge unique IDs
Monday April 23 12
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 121: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/121.jpg)
requirements
Monday April 23 12
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 122: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/122.jpg)
1 should be time sortable without requiring
a lookup
Monday April 23 12
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 123: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/123.jpg)
2 should be 64-bit
Monday April 23 12
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 124: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/124.jpg)
3 low operational complexity
Monday April 23 12
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 125: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/125.jpg)
surveyed the options
Monday April 23 12
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 126: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/126.jpg)
ticket servers
Monday April 23 12
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 127: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/127.jpg)
UUID
Monday April 23 12
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 128: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/128.jpg)
twitter snowflake
Monday April 23 12
application-level IDs ala Mongo
Monday April 23 12
hey the db is already pretty good about
incrementing sequences
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
[ 41 bits of time in millis ][ 13 bits for shard ID ][ 10 bits sequence ID ]
Monday April 23 12
CREATE OR REPLACE FUNCTION insta5next_id(OUT result bigint) AS $$DECLARE our_epoch bigint = 1314220021721 seq_id bigint now_millis bigint shard_id int = 5BEGIN SELECT nextval(insta5table_id_seq) 1024 INTO seq_id
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) 1000) INTO now_millis result = (now_millis - our_epoch) ltlt 23 result = result | (shard_id ltlt 10) result = result | (seq_id)END$$ LANGUAGE PLPGSQL
Monday April 23 12
pulling shard ID from ID
shard_id = id ^ ((id gtgt 23) ltlt 23)timestamp = EPOCH + id gtgt 23
Monday April 23 12
pros guaranteed unique in 64-bits not much of a
CPU overhead
Monday April 23 12
cons large IDs from the get-go
Monday April 23 12
hundreds of millions of IDs generated with this
scheme no issues
Monday April 23 12
well what about ldquore-shardingrdquo
Monday April 23 12
first recourse pg_reorg
Monday April 23 12
rewrites tables in index order
Monday April 23 12
only requires brief locks for atomic table renames
Monday April 23 12
20+GB savings on some of our dbs
Monday April 23 12
Monday April 23 12
especially useful on EC2
Monday April 23 12
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 147: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/147.jpg)
but sometimes you just have to reshard
Monday April 23 12
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 148: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/148.jpg)
streaming replication to the rescue
Monday April 23 12
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 149: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/149.jpg)
(btw repmgr is awesome)
Monday April 23 12
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 150: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/150.jpg)
repmgr standby clone ltmastergt
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 151: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/151.jpg)
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineArsquo shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 152: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/152.jpg)
machineA shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
machineC shard0 photos_by_user shard1 photos_by_user shard2 photos_by_user shard3 photos_by_user
Monday April 23 12
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 153: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/153.jpg)
PGBouncer abstracts moving DBs from the
app logic
Monday April 23 12
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 154: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/154.jpg)
can do this as long as you have more logical shards than physical
ones
Monday April 23 12
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 155: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/155.jpg)
beauty of schemas is that they are physically
different files
Monday April 23 12
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 156: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/156.jpg)
(no IO hit when deleting no lsquoswiss cheesersquo)
Monday April 23 12
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 157: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/157.jpg)
downside requires ~30 seconds of maintenance to roll out new schema
mapping
Monday April 23 12
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 158: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/158.jpg)
(could be solved by having concept of ldquoread-
onlyrdquo mode for some DBs)
Monday April 23 12
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 159: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/159.jpg)
not great for range-scans that would span across
shards
Monday April 23 12
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 160: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/160.jpg)
latest project follow graph
Monday April 23 12
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 161: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/161.jpg)
v1 simple DB table(source_id target_id
status)
Monday April 23 12
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 162: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/162.jpg)
who do I followwho follows me
do I follow Xdoes X follow me
Monday April 23 12
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 163: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/163.jpg)
DB was busy so we started storing parallel
version in Redis
Monday April 23 12
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 164: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/164.jpg)
follow_all(300 item list)
Monday April 23 12
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 165: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/165.jpg)
inconsistency
Monday April 23 12
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 166: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/166.jpg)
extra logic
Monday April 23 12
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 167: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/167.jpg)
so much extra logic
Monday April 23 12
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 168: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/168.jpg)
exposing your support team to the idea of cache invalidation
Monday April 23 12
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 169: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/169.jpg)
Monday April 23 12
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 170: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/170.jpg)
redesign took a page from twitterrsquos book
Monday April 23 12
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 171: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/171.jpg)
PG can handle tens of thousands of requests very light memcached
caching
Monday April 23 12
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 172: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/172.jpg)
next steps
Monday April 23 12
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 173: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/173.jpg)
isolating services to minimize open conns
Monday April 23 12
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 174: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/174.jpg)
investigate physical hardware etc to reduce
need to re-shard
Monday April 23 12
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 175: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/175.jpg)
Wrap up
Monday April 23 12
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 176: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/176.jpg)
you donrsquot need to give up PGrsquos durability amp features to shard
Monday April 23 12
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 177: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/177.jpg)
continue to let the DB do what the DB is great at
Monday April 23 12
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 178: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/178.jpg)
ldquodonrsquot shard until you have tordquo
Monday April 23 12
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 179: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/179.jpg)
(but donrsquot over-estimate how hard it will be either)
Monday April 23 12
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 180: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/180.jpg)
scaled within constraints of the cloud
Monday April 23 12
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 181: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/181.jpg)
PG success story
Monday April 23 12
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 182: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/182.jpg)
(wersquore really excited about 92)
Monday April 23 12
thanks any qs
Monday April 23 12
![Page 183: Sharding @ Instagram - PostgreSQL · Sharding @ Instagram SFPUG April 2012 Mike Krieger Instagram Monday, April 23, 12](https://reader031.fdocuments.us/reader031/viewer/2022020104/5b9c603709d3f2f6368c8234/html5/thumbnails/183.jpg)
thanks any qs
Monday April 23 12