Working with Dimensional Data in Distributed Hash Tables


Description

Recently a new class of database technologies has emerged, offering massively scalable distributed hash table functionality. Relative to more traditional relational database systems, these systems are simple to operate and capable of managing massive data sets. These characteristics come at a cost, though: an impoverished query language that, in practice, can handle little more than exact-match lookups at scale.

This talk will explore the real-world technical challenges we faced at SimpleGeo while building a web-scale spatial database on top of Apache Cassandra. Cassandra is a distributed database that falls into the broad category of second-generation systems described above. We chose Cassandra after carefully considering desirable database characteristics based on our prior experiences building large-scale web applications. Cassandra offers operational simplicity, decentralized operation, no single points of failure, online load balancing and rebalancing, and linear horizontal scalability. Unfortunately, Cassandra fell far short of providing the sort of sophisticated spatial queries we needed. We developed a short-term solution that was good enough for most use cases, but far from optimal. Long term, our challenge was to bridge the gap without compromising any of the desirable qualities that led us to choose Cassandra in the first place.

The result is a robust, general-purpose mechanism for overlaying sophisticated data structures on top of distributed hash tables. By overlaying a spatial tree, for example, we're able to durably persist massive amounts of spatial data and serve complex nearest-neighbor and multidimensional range queries across billions of rows, fast enough for an online, consumer-facing application. We continue to improve and evolve the system, but we're eager to share what we've learned so far.

Transcript of Working with Dimensional Data in Distributed Hash Tables

1. DIMENSIONAL DATA IN DISTRIBUTED HASH TABLES, featuring Mike Malone. Strange Loop 2010, Monday, October 18, 2010.

2. SIMPLEGEO: We originally began as a mobile gaming startup, but quickly discovered that the location services and infrastructure needed to support our ideas didn't exist. So we took matters into our own hands and began building it ourselves. Matt Galligan, CEO & co-founder; Joe Stump, CTO & co-founder.

3. ABOUT ME: MIKE MALONE, INFRASTRUCTURE, [email protected], @mjmalone. For the last 10 months I've been developing a spatial database at the core of SimpleGeo's infrastructure.

4. REQUIREMENTS / GOALS (word cloud): multidimensional queries, efficiency, query speed, complex queries, decentralization, locality, spatial performance, fault tolerance, availability, distributedness, replication, reliability, simplicity, resilience, conceptual integrity, scalability, operational simplicity, consistency, concurrency, data durability, query volume, linear scalability.

5. CONFLICT: Integrity vs. Availability. Locality vs. Distributedness. Reductionism vs. Emergence.

6. DATABASE CANDIDATES. THE TRANSACTIONAL RELATIONAL DATABASES: They're theoretically pure, well understood, and mostly standardized behind a relatively clean abstraction. They provide robust contracts that make it easy to reason about the structure and nature of the data they contain. They're battle hardened, robust, durable, etc. OTHER STRUCTURED STORAGE OPTIONS: "Plain I see you, western youths, see you tramping with the foremost, Pioneers! O pioneers!"

7. INTEGRITY vs. AVAILABILITY.

8. ACID: These terms are not formally defined; they're a framework, not mathematical axioms. ATOMICITY: Either all of a transaction's actions are visible to another transaction, or none are. CONSISTENCY: Application-specific constraints must be met for the transaction to succeed. ISOLATION: Two concurrent transactions will not see one another's transactions while in flight. DURABILITY: The updates made to the database in a committed transaction will be visible to future transactions.

9. ACID HELPS: ACID is a sort-of-formal contract that makes it easy to reason about your data, and that's good. IT DOES SOMETHING HARD FOR YOU: With ACID, you're guaranteed to maintain a persistent global state as long as you've defined proper constraints and your logical transactions result in a valid system state.

10. CAP THEOREM: At PODC 2000 Eric Brewer told us there are three desirable database characteristics, but we can only have two. CONSISTENCY: Every node in the system contains the same data (e.g., replicas are never out of date). AVAILABILITY: Every request to a non-failing node in the system returns a response. PARTITION TOLERANCE: System properties (consistency and/or availability) hold even when the system is partitioned and data is lost.

11-16. CAP THEOREM IN 30 SECONDS (sequence diagram built up across six slides): a CLIENT sends a write to a SERVER, which replicates it to a REPLICA; the replica acks, and the server accepts the write. If the replica fails, the server can deny the write, making the system UNAVAILABLE!
17. CAP THEOREM IN 30 SECONDS (continued): or the server can accept the write without the replica, making the system INCONSISTENT!

18. ACID HURTS: Certain aspects of ACID encourage (require?) implementors to do bad things. "Unfortunately, ANSI SQL's definition of isolation... relies in subtle ways on an assumption that a locking scheme is used for concurrency control, as opposed to an optimistic or multi-version concurrency scheme. This implies that the proposed semantics are ill-defined." (Joseph M. Hellerstein and Michael Stonebraker, Anatomy of a Database System.)

19. BALANCE: IT'S A QUESTION OF VALUES. For traditional databases, CAP consistency is the holy grail: it's maximized at the expense of availability and partition tolerance. At scale, failures happen: when you're doing something a million times a second, a one-in-a-million failure happens every second. We're witnessing the birth of a new religion... CAP consistency is a luxury that must be sacrificed at scale in order to maintain availability when faced with failures.

20. APACHE CASSANDRA: A DISTRIBUTED HASH TABLE WITH SOME TRICKS. Peer-to-peer gossip supports fault tolerance and decentralization with a simple operational model. Random hash-based partitioning provides auto-scaling and efficient online rebalancing: read and write throughput increases linearly when new nodes are added. A pluggable replication strategy supports multi-datacenter replication, making the system resilient to failure even at the datacenter level. Tunable consistency allows us to adjust durability to the value of the data being written.

21. CASSANDRA DATA MODEL: { users: { alice: { city: ["St. Louis", 1287040737182], name: ["Alice", 1287080340940], ... }, ... }, locations: { ... }, ... }. Here "users" is a column family, "alice" is a row key, and each column is a (name, value, timestamp) triple.

22. DISTRIBUTED HASH TABLE: DESTROY LOCALITY (BY HASHING) TO ACHIEVE GOOD BALANCE AND SIMPLIFY LINEAR SCALABILITY. Row keys such as "alice" and "bob" are hashed to positions on the ring, so neighboring keys end up on unrelated nodes.

23. HASH TABLE SUPPORTED QUERIES: EXACT MATCH: yes. RANGE, PROXIMITY, ANYTHING THAT'S NOT EXACT MATCH: no.

24. LOCALITY vs. DISTRIBUTEDNESS.

25. THE ORDER PRESERVING PARTITIONER: CASSANDRA'S PARTITIONING STRATEGY IS PLUGGABLE. The partitioner maps keys to nodes. The random partitioner destroys locality by hashing. The order preserving partitioner retains locality, storing keys in natural lexicographical order around the ring.

26. ORDER PRESERVING PARTITIONER: EXACT MATCH: yes. RANGE: yes, on a single dimension. PROXIMITY: ?

27. SPATIAL DATA: IT'S INHERENTLY MULTIDIMENSIONAL (diagram: a point (x1, x2) plotted against two axes).

28. DIMENSIONALITY REDUCTION WITH SPACE-FILLING CURVES (diagram: the first iteration of a Z-curve visiting cells 1, 2, 3, 4).

29. Z-CURVE: SECOND ITERATION (diagram).

30. Z-VALUE (diagram: the point x on the second-iteration curve has z-value 14).

31. GEOHASH: SIMPLE TO COMPUTE. Interleave the bits of decimal coordinates (equivalent to a binary encoding of a pre-order traversal!). Base32 encode the result. AWESOME CHARACTERISTICS: arbitrary precision, human readable, sorts lexicographically. (Example: the bits 01101 encode to the character "e".)
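Slide 31's recipe is compact enough to show end to end. Below is a minimal Python sketch of the interleave-and-base32 construction it describes; the length parameter and helper names are illustrative choices, and the example coordinates are the Moonrise Hotel values from the later data-model slides, which index that record under the geohash 9yzgcjn0.

    # Minimal geohash encoder: interleave coordinate bits, then base32 encode.
    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

    def geohash(lat, lon, length=8):
        lat_range = [-90.0, 90.0]
        lon_range = [-180.0, 180.0]
        bits, out = [], []
        even = True  # even bit positions come from longitude
        while len(out) < length:
            rng, val = (lon_range, lon) if even else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            if val >= mid:
                bits.append(1)
                rng[0] = mid  # point lies in the upper half of the range
            else:
                bits.append(0)
                rng[1] = mid  # point lies in the lower half of the range
            even = not even
            if len(bits) == 5:  # every 5 bits become one base32 character
                out.append(BASE32[int("".join(map(str, bits)), 2)])
                bits = []
        return "".join(out)

    print(geohash(38.6554420, -90.2992910))  # 9yzgcjn0, per the data-model slides

Because geohashes sort lexicographically, nearby points tend to share a prefix, which is what lets a one-dimensional ordered key space stand in for two spatial dimensions.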
32. DATA MODEL: { record-index: { "9yzgcjn0:moonrise hotel": { "": ["", 1287040737182] }, ... }, records: { "moonrise hotel": { latitude: [38.6554420, 1287040737182], longitude: [-90.2992910, 1287040737182], ... } } }. The index row key concatenates the record's geohash with its key, and the index entry itself carries no value; the record's actual columns live in the separate records column family.

33. BOUNDING BOX, I.E., MULTIDIMENSIONAL RANGE: "Give us a bounding box!" (diagram: the box is covered by several runs of the curve, so it is decomposed into multiple one-dimensional range queries, roughly "give me 2 through 3" and "give me 4 through 5").

34. SPATIAL DATA: STILL MULTIDIMENSIONAL. DIMENSIONALITY REDUCTION ISN'T PERFECT: clients must pre-process to compose multiple queries and post-process to filter and merge results, and degenerate cases can be bad, particularly for nearest-neighbor queries.

35-38. Z-CURVE LOCALITY (diagrams built across four slides: a query point x and nearby points o, showing that neighbors on the curve are usually, but not always, neighbors in space).

39. THE WORLD IS NOT BALANCED (satellite image of the earth). Credit: C. Mayhew & R. Simmon (NASA/GSFC), NOAA/NGDC, DMSP Digital Archive.

40-43. TOO MUCH LOCALITY (diagram built across four slides: with an order preserving partitioner, San Francisco's data lands almost entirely on one node, which is overloaded: "I'm sad." Meanwhile its idle peers chat: "I'm bored." "Me too." "Let's play Xbox.").

44. A TURNING POINT.

45. HELLO, DRAWING BOARD: SURVEY OF DISTRIBUTED P2P INDEXING. An overlay-dependent index works directly with nodes of the peer-to-peer network, defining its own overlay. An over-DHT index overlays a more sophisticated data structure on top of a peer-to-peer distributed hash table.

46. ANOTHER LOOK AT POSTGIS: MIGHT WORK, BUT... The relational transaction management system (which we'd want to change) and access methods (which we'd have to change) are tightly coupled (necessarily?) to other parts of the system. We could work at a higher level and treat PostGIS as a black box, but then we're back to implementing a peer-to-peer network with failure recovery, fault detection, etc., and Cassandra already had all that. It's probably clear by now that I think these problems are more difficult than actually storing structured data on disk.

47. LET'S TAKE A STEP BACK.

48-51. EARTH (image sequence built across four slides).

52. EARTH, TREE, RING.

53-55. DATA MODEL (the same slide, built up three times): { record-index: { "layer-name:37.875,-90:40.25,-101.25": { "38.6554420,-90.2992910:moonrise hotel": ["", 1287040737182], ... }, ... }, record-index-meta: { "layer-name:37.875,-90:40.25,-101.25": { split: [false, 1287040737182] }, "layer-name:37.875,-90:42.265,-101.25": { split: [true, 1287040737182], child-left: ["layer-name:37.875,-90:40.25,-101.25", 1287040737182], child-right: ["layer-name:40.25,-90:42.265,-101.25", 1287040737182] } } }. Each tree node is a row whose key names a layer and a bounding box; its columns index the records inside that box, and a parallel meta row records whether the node has split and, if so, the keys of its children.
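To make that key scheme concrete, here is a small Python sketch of how an index entry might be assembled. The helper functions are hypothetical (the talk doesn't show client code); only the row-key and column-name formats are taken from the slide.

    # Hypothetical helpers mirroring the key formats on slides 53-55.
    def node_row_key(layer, lat1, lon1, lat2, lon2):
        """Row key for a tree node covering a bounding box, e.g.
        'layer-name:37.875,-90:40.25,-101.25'."""
        return f"{layer}:{lat1},{lon1}:{lat2},{lon2}"

    def index_column_name(lat, lon, record_key):
        """Column name for a record indexed under a node, e.g.
        '38.6554420,-90.2992910:moonrise hotel'."""
        return f"{lat},{lon}:{record_key}"

    # Coordinates are passed as strings to preserve the precision shown on the slide.
    node = node_row_key("layer-name", "37.875", "-90", "40.25", "-101.25")
    column = index_column_name("38.6554420", "-90.2992910", "moonrise hotel")
    print(node)
    print(column)

    # Conceptually, inserting a record then touches two column families:
    #   record-index[node][column] = ""            # the index entry carries no value
    #   record-index-meta[node]["split"] = "false"  # a leaf, until it grows too large

The meta row is what turns flat rows into a tree: a node whose split column is true points to its two children by key.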
56. SPLITTING: IT'S PRETTY MUCH JUST A CONCURRENT TREE. Splitting shouldn't lock the tree for reads or writes, and failures shouldn't cause corruption. Splits are optimistic, idempotent, and fail-forward: instead of locking, writes are replicated to the splitting node and the relevant child(ren) while a split operation is taking place, and cleanup occurs after the split is completed and all interested nodes are aware that it has occurred. Cassandra writes are idempotent, so splits are too; if a split fails, it is simply retried. Split size is a tunable knob for balancing locality and distributedness. The other hard problem with concurrent trees is rebalancing, and we just don't do it! (More on this later.)

57. THE ROOT IS HOT: MIGHT BE A DEAL BREAKER. For a tree to be useful it has to be traversed, traversal typically starts at the root, and the root is the only discoverable node in our tree. Traversing through the root meant reading the root for every read or write below it: unacceptable. There are lots of academic solutions; the most promising was a skip graph, but that requires O(n log(n)) data, which is also unacceptable. A minimum tree depth was proposed, but then you just get multiple hot spots at your minimum-depth nodes.

58. BACK TO THE BOOKS: LOTS OF ACADEMIC WORK ON THIS TOPIC. But academia is obsessed with provable, deterministic, asymptotically optimal algorithms, and we only need something that is probably fast enough most of the time (for some value of "probably" and "most of the time"). And if the probably-good-enough algorithm is, you know... tractable... one might even consider it qualitatively better!

59. REDUCTIONISM vs. EMERGENCE.

60. We have / We want (diagram).

61. THINKING HOLISTICALLY: WE OBSERVED THAT once a node in the tree exists, it doesn't go away. Node state may change, but that state only really matters locally: thinking a node is a leaf when it really has children is not fatal. SO... WHAT IF WE JUST CACHED NODES THAT WERE OBSERVED IN THE SYSTEM!?

62. CACHE IT: STUPID SIMPLE SOLUTION. Keep an LRU cache of nodes that have been traversed. Start traversals at the most selective relevant node. If that node doesn't satisfy you, traverse up the tree. Along with your result set, return a list of nodes that were traversed so the caller can add them to its cache. (A sketch of this traversal follows slide 65 below.)

63-65. TRAVERSAL: NEAREST NEIGHBOR (diagram built across three slides: a query point x and candidate points o as the traversal widens).
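As promised under slide 62, here is a rough Python sketch of that client-side caching traversal. The node and storage interfaces (key, covers, area, parent, search) are hypothetical stand-ins for whatever the real index exposes; only the strategy itself, an LRU cache of previously traversed nodes plus a walk up the tree when the starting node is too small, comes from the talk.

    from collections import OrderedDict

    class NodeCache:
        """LRU cache of tree nodes this client has already traversed."""
        def __init__(self, capacity=10000):
            self.capacity = capacity
            self.nodes = OrderedDict()  # node key -> cached node

        def add(self, node):
            self.nodes[node.key] = node
            self.nodes.move_to_end(node.key)
            if len(self.nodes) > self.capacity:
                self.nodes.popitem(last=False)  # evict the least recently used node

        def most_selective(self, query_box):
            """Smallest cached node whose bounding box covers the query."""
            covering = [n for n in self.nodes.values() if n.covers(query_box)]
            return min(covering, key=lambda n: n.area(), default=None)

    def spatial_query(cache, root, query_box):
        # Start at the most selective cached node; fall back to the root.
        node = cache.most_selective(query_box) or root
        # If the starting node doesn't cover the query, traverse up the tree.
        while not node.covers(query_box) and node.parent is not None:
            node = node.parent
        # search() returns results plus every node it touched on the way down,
        # so the caller can warm its cache for the next query.
        results, traversed = node.search(query_box)
        for n in traversed:
            cache.add(n)
        return results

A stale cache entry only means starting the traversal at a node that has since split, which, per slide 61, is not fatal; the traversal simply continues from there.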
66. KEY CHARACTERISTICS: PERFORMANCE. The best case, on the happy path with everything cached, has zero read overhead; the worst case, with nothing cached, has O(log(n)) read overhead. RE-BALANCING SEEMS UNNECESSARY! Skipping it makes the worst case worse, but so far so good.

67. DISTRIBUTED TREE SUPPORTED QUERIES: EXACT MATCH, RANGE, PROXIMITY, SOMETHING ELSE I HAVEN'T EVEN HEARD OF.

68. DISTRIBUTED TREE SUPPORTED QUERIES: the same list, now stamped MULTIPLE DIMENSIONS!

69. PEACE: Integrity and Availability. Locality and Distributedness. Reductionism and Emergence.

70. QUESTIONS? MIKE MALONE, INFRASTRUCTURE, [email protected], @mjmalone.

71. (Closing slide: Strange Loop 2010, Monday, October 18, 2010.)