Scalable, Consistent, and Elastic Database Systems for Cloud Platforms

Sudipto Das
Computer Science, UC Santa Barbara
[email protected]

Sponsors:







Thanks for coming to my talk. In this talk, I will cover a sliver of my work in the area of data management in the cloud, focusing on scalable, consistent, and elastic database systems for cloud platforms.

Web replacing Desktop


In the last few years, we have witnessed a trend where web applications are replacing desktop applications, and large numbers of applications are now accessed via the browser.

Paradigm shift in Infrastructure


This shift from the desktop to the web has also caused a paradigm shift in application deployment infrastructure, resulting in what is popularly known as cloud computing.

Cloud computing
- Computing infrastructure and solutions delivered as a service
- Industry worth USD 150 billion by 2014*
- Contributors to success
  - Economies of scale
  - Elasticity and pay-per-use pricing
- Popular paradigms
  - Infrastructure as a Service (IaaS)
  - Platform as a Service (PaaS)
  - Software as a Service (SaaS)

*http://www.crn.com/news/channel-programs/225700984/cloud-computing-services-market-to-near-150-billion-in-2014.htm


In its simplest form, cloud computing is essentially computing infrastructure and solutions delivered as a service. Analysts predict that this industry will be worth 150 billion dollars by 2014.

Even though almost every aspect of computing can be provided as a service, there have been three popular cloud paradigms:

Infrastructure as a Service, the lowest level of abstraction, provides raw CPU, storage, and network as a service. Popular examples include Amazon Web Services and Rackspace.

The next higher level of abstraction is Platform as a Service, which provides a platform or containers to deploy applications; the platform provider abstracts away data management, fault tolerance, elastic scaling, and so on, thus simplifying application deployment. Popular examples include Google AppEngine and Windows Azure.

The highest level of abstraction is Software as a Service, which exposes a simple interface to customize pre-designed application logic. A popular example is Salesforce.com.

Major factors that have contributed to the success of cloud platforms are advances on the technology front, such as virtualization and pervasive broadband Internet connectivity, as well as business and economic factors, such as economies of scale and the transfer of risk.

In this talk, we focus on cloud application platforms, in particular the database systems that serve these platforms.

Databases for cloud platforms
- Data is central to applications
- DBMSs are a mission-critical component in the cloud software stack
  - Manage petabytes of data, drive revenue
  - Serve a variety of applications (multitenancy)
- Data needs for cloud applications
  - OLTP systems: store and serve data
  - Data analysis systems: decision support, intelligence


Data is central to all modern applications, and most modern enterprises manage petabytes of data. Hence, DBMSs form a mission-critical component of the cloud software stack and are key to success as well as to generating revenue.

Considering the data needs of web applications, there are two broad categories of systems:

On one hand are OLTP systems that store and serve data. On the other hand are OLAP systems that provide intelligence and decision support.

In this talk, we will focus on OLTP systems.

Bring in the concept of the service provider and the service user, and whose problem we are solving (NEC discussion).

Application landscape
- Social gaming
- Rich content and mash-ups
- Managed applications
Cloud application platforms


Challenges for OLTP systems
- Scalability: while ensuring efficient transaction execution!
- Lightweight elasticity: scale on-demand!
- Self-manageability: intelligence without a human controller!


Therefore, in summary, the major challenges for an OLTP database in the cloud are:

Supporting transactions and scale-out while minimizing the number of distributed transactions,

Supporting lightweight elastic scaling in a live system, and

Providing autonomic control with intelligence similar to a human controller.

Two approaches to scalability
- Scale-up
  - Preferred in the classical enterprise setting (RDBMS)
  - Flexible ACID transactions
  - Transactions access a single node
- Scale-out
  - Cloud friendly (key-value stores)
  - Execution at a single server
  - Limited functionality & guarantees
  - No multi-row or multi-step transactions


Why care about transactions?

confirm_friend_request(user1, user2) {
  begin_transaction();
  update_friend_list(user1, user2, status.confirmed);
  update_friend_list(user2, user1, status.confirmed);
  end_transaction();
}

Simplicity in application design with ACID transactions


Stress the ACID properties of transactions and how applications benefit from them by simplifying their design.

confirm_friend_request_A(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch (exception e) {
    report_error(e);
    return;
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch (exception e) {
    revert_friend_list(user1, user2);
    report_error(e);
    return;
  }
}

confirm_friend_request_B(user1, user2) {
  try {
    update_friend_list(user1, user2, status.confirmed);
  } catch (exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user1, user2, current_time());
  }
  try {
    update_friend_list(user2, user1, status.confirmed);
  } catch (exception e) {
    report_error(e);
    add_to_retry_queue(operation.updatefriendlist, user2, user1, current_time());
  }
}

It gets too complicated with reduced consistency guarantees
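To make the contrast concrete, the ACID version maps directly onto a standard SQL/JDBC API. The sketch below is illustrative only: the friend_list table, its columns, and the surrounding class are hypothetical, not taken from any system in this talk.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FriendRequests {
    // confirm_friend_request over JDBC. The friend_list table and its
    // columns are hypothetical. Atomicity comes from the DBMS: either
    // both updates commit, or neither does.
    public static void confirmFriendRequest(Connection conn, long user1, long user2)
            throws SQLException {
        conn.setAutoCommit(false); // begin_transaction()
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE friend_list SET status = 'confirmed' "
                + "WHERE user_id = ? AND friend_id = ?")) {
            ps.setLong(1, user1);
            ps.setLong(2, user2);
            ps.executeUpdate();
            ps.setLong(1, user2);
            ps.setLong(2, user1);
            ps.executeUpdate();
            conn.commit();         // end_transaction()
        } catch (SQLException e) {
            conn.rollback();       // all-or-nothing: no manual revert logic
            throw e;
        }
    }
}

The error handling collapses to a single rollback because the DBMS guarantees atomicity; the retry queues and manual revert logic of versions A and B disappear.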


Challenge: Transactions at Scale
[Figure: scale-out on the vertical axis and ACID transactions on the horizontal axis; key-value stores occupy the high-scale-out, limited-functionality corner, while RDBMSs occupy the rich-functionality, limited-scale-out corner]


Therefore, if we consider scale-out as the vertical axis and functionality (or support for transactions) as the horizontal axis, at one extreme are the RDBMSs, which support rich functionality but are hard to scale out; at the other extreme are key-value stores, which scale out to thousands of servers but support limited functionality.

There exists a big chasm between the two types of systems and the challenge is to bridge this divide by efficiently supporting transactions while scaling out.

Cloud platforms are multitenant and must support a variety of applications with varying needs. Therefore, bridging this chasm is important to support a variety of applications.

Functionality, of which transactions are a subset.

Challenge: Lightweight Elasticity
- Provisioning on-demand and not for peak
- Optimize operating cost!
[Figure: resources vs. time under traditional infrastructures (capacity statically provisioned above a varying demand curve, leaving unused resources) vs. deployment in the cloud (capacity tracks demand). Slide credits: Berkeley RAD Lab]


In addition, when such a database is deployed on an elastic, pay-per-use cloud infrastructure that allows on-demand provisioning rather than static provisioning for peak load, the challenge is to make the database tier as elastic as the underlying infrastructure without introducing significant overhead.

Scale vs. elasticity.

Challenge: Self-Manageability
- Managing a large distributed system
  - Detecting failures and recovering
  - Coordination and synchronization
  - Provisioning
  - Capacity planning
- A large distributed system is a Zoo
- Cloud platforms are inherently multitenant
- Balance conflicting goals: minimize operating cost while ensuring good performance


To top it off, another major challenge is managing such large distributed database management systems.

For instance: detecting and recovering from node failures, which become the norm in large systems; loose coordination and synchronization between nodes; lease and load management; performance modeling; and the laundry list continues.

A famous quote says that a large distributed system is a Zoo, and the challenge for us is to automate the management of such systems through the design of autonomic system controllers that minimize the need for human intervention.

Contributions for OLTP systems
- Transactions at Scale: ElasTraS [HotCloud 2009, UCSB TR 2010], G-Store [SoCC 2010]

- Lightweight Elasticity: Albatross [VLDB 2011], Zephyr [SIGMOD 2011]
- Self-Manageability: Pythia [in progress]


To this end, my dissertation makes the following contributions to address these challenges:

We propose two solutions to support transactions at scale in two different application scenarios: ElasTraS allows elastically scalable transaction execution in databases where partitions are statically defined, while G-Store allows efficient transaction processing where database partitions are dynamically defined.

Supporting lightweight elasticity essentially boils down to lightweight migration of database partitions in a live system. To this end, we propose two techniques for two common database architectures: Albatross is a live-migration technique for databases that use a decoupled storage abstraction, while Zephyr is a live-migration technique for shared-nothing database architectures.

Finally, we are currently working on the design of Pythia, an autonomic controller.

In the interest of time, in this talk I will only get into the details of G-Store and Zephyr, while providing a very high-level overview of ElasTraS.

Contributions
- Transaction Processing (this talk)
  - Static partitioning: ElasTraS [HotCloud 09], [TR 10]
  - Dynamic partitioning: G-Store [SoCC 10]
  - Live migration: Albatross [VLDB 11], Zephyr [SIGMOD 11]
  - Pythia [in progress]
- Analytics: Ricardo [SIGMOD 10]; MD-HBase [MDM 11], Best Paper Runner-up; CoTS [ICDE 09], [VLDB 09]; Anonimos [ICDE 10], [TKDE]
- Novel Architectures: Hyder [CIDR 11], Best Paper; TCAM [DaMoN 08]


But before we delve into the details, I would like to spend a couple of minutes giving an overview of my research in the broader area of data management.

The current talk, and my thesis, focuses on the OLTP aspect.

On the data analysis front, I have worked on multiple projects. As an intern at IBM Almaden, I worked on a project called Ricardo that enables deep statistical analysis and modeling over large amounts of data. This paper was published in SIGMOD 2010, and parts of the framework ship in IBM InfoSphere BigInsights Enterprise edition. Recently, I worked on a project called MD-HBase that presents the design and implementation of a scalable multi-dimensional indexing mechanism supporting efficient, high-throughput location updates and multi-dimensional analysis queries on top of a key-value store. Earlier, I also worked on data stream processing systems, providing intra-operator parallelism for common data stream operators, such as frequent elements or top-k elements, to efficiently exploit multicore processors.

I have also worked on designing systems to exploit novel hardware architectures.

Transactions at Scale
[Figure: the scale-out vs. ACID transactions chasm between key-value stores and RDBMSs, repeated from slide 11]


Scale-out with static partitioning
- Table-level partitioning (range, hash): distributed transactions
- Partitioning the database schema
  - Co-locate data items accessed together
  - Goal: minimize distributed transactions


The goal of partitioning the schema is to leverage application semantics and access patterns to minimize the number of distributed transactions; a sketch of the resulting single-node routing follows the list below.

Scale-out with static partitioning (continued)
- Systems that scale out with static partitioning: ElasTraS [HotCloud 2009, TR 2010], Cloud SQL Server [ICDE 2011], Megastore [CIDR 2011], Relational Cloud [CIDR 2011]
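As a concrete illustration of this idea, here is a minimal sketch of hash partitioning on a root key, assuming a hypothetical node map; all names are illustrative.

import java.util.List;

public class PartitionRouter {
    private final List<String> nodes; // hypothetical: one address per partition server

    public PartitionRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Hash partitioning on a root key. Because the schema co-locates all
    // rows reachable from the same root key (for example, a user and her
    // friend list) in one partition, any transaction scoped to one root
    // key runs on a single node and never needs a distributed commit.
    public String nodeFor(long rootKey) {
        return nodes.get(Math.floorMod(Long.hashCode(rootKey), nodes.size()));
    }
}

A transaction that spans two root keys mapped to different nodes would still be distributed; the point of schema-level partitioning is to choose root keys so that such transactions are rare.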


Dynamically formed partitions
- Access patterns change, often rapidly
  - Online multi-player gaming applications
  - Collaboration-based applications
  - Scientific computing applications
- Not amenable to static partitioning
- How to get the benefit of partitioning when accesses do not statically partition?
- Ours is the first solution to allow that


Now we know how to scale out when the partitions are statically defined. So let's make it a bit more interesting: how do we scale out with transactions on dynamically formed partitions?

Recall that our concept of a partition is the set of data items frequently accessed within the same transaction. For certain applications, that set might change with time. For instance, in online multi-player games, the application needs transactional access to the player profiles that are part of the same game instance, and this set changes with time. Similar behavior is observed in a number of collaboration-based applications (examples?).

Online Multi-player Games

[Figure: a player profile record with fields ID, Name, $$$, and Score]

Online Multi-player Games

Execute transactions on player profiles while the game is in progress

If the player profiles are part of the same database partition, then transactions on this group of players can be executed efficiently.

Online Multi-player Games


Partitions/groups are dynamic


However, this group of players changes with time, giving rise to the concept of dynamically defined database partitions.

Online Multi-player Games


Hundreds of thousands of concurrent groups


Scale.

Data Fusion for dynamic partitions [G-Store, SoCC 2010]
- Transactional access to a group of data items formed on-demand
- Challenge: avoid distributed transactions!
- Key Group abstraction
  - Groups are small
  - Groups execute a non-trivial number of transactions
  - Groups are dynamic and on-demand
  - Groups are dynamically formed tenant databases


Transactions on Groups (without distributed transactions)
- Ownership of keys at a single node
- Key Group: one key is selected as the leader; followers transfer ownership of their keys to the leader (the Grouping Protocol)

Why is group formation hard?
- Guarantee the contract between leaders and followers in the presence of:
  - Leader and follower failures
  - Lost, duplicated, or re-ordered messages
  - Dynamics of the underlying system
- How to ensure efficient and ACID execution of transactions?


Grouping protocol
- Conceptually akin to locking
- Locks held by groups
[Figure: timeline of the grouping protocol between follower(s) and leader, from the create request to the delete request; the leader moves through Creating, Joined, Deleting, and Deleted states, followers move through Free, Joining, and Joined states, and they exchange join (J), join-ack (JA), delete (D), and delete-ack (DA) messages, each recorded as log entries]
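Reconstructing from the diagram, the follower side of the protocol might look like the sketch below. State and message names follow the slide; the Log and Message types and all signatures are assumptions, and the published protocol (G-Store, SoCC 2010) additionally handles the failure and message-loss cases listed on the previous slide.

import java.util.ArrayList;
import java.util.List;

public class FollowerKey {
    enum State { FREE, JOINING, JOINED }

    // Minimal stand-ins: a real system uses a durable write-ahead log
    // and a messaging layer.
    static final class Log {
        final List<String> entries = new ArrayList<>();
        synchronized void append(String e) { entries.add(e); } // assume forced to stable storage
    }

    static final class Message {
        final String type;    // "JA" (join ack), "DA" (delete ack), "REJECT"
        final String groupId;
        Message(String type, String groupId) { this.type = type; this.groupId = groupId; }
    }

    private State state = State.FREE;
    private final Log log;

    FollowerKey(Log log) { this.log = log; }

    // Join request (J) from a group leader: log the ownership transfer
    // durably, then acknowledge (JA). After this, the leader owns the key.
    synchronized Message onJoinRequest(String groupId) {
        if (state != State.FREE) {
            return new Message("REJECT", groupId); // key already owned by a group
        }
        state = State.JOINING;
        log.append("JOINING " + groupId);
        state = State.JOINED;
        log.append("JOINED " + groupId);
        return new Message("JA", groupId);
    }

    // Delete request (D) when the group is disbanded: log the release of
    // ownership and acknowledge (DA); the key becomes free again.
    synchronized Message onDeleteRequest(String groupId) {
        log.append("DELETED " + groupId);
        state = State.FREE;
        return new Message("DA", groupId);
    }
}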


Efficient transaction processing
- How does the leader execute transactions?
  - Caches data for group members; the underlying data store is equivalent to a disk
  - Transaction logging for durability
  - Cache asynchronously flushed to propagate updates
  - Guaranteed update propagation

[Figure: the leader runs a transaction manager, a cache manager, and a log; updates propagate asynchronously from the leader to the followers]
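As a hedged illustration of this write path, here is a minimal sketch of a leader that caches group data, logs before committing, and asynchronously flushes dirty entries back to the key-value store. The interfaces and names are assumptions for the sketch, not G-Store's actual classes.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GroupLeader {
    // The underlying key-value store plays the role of a disk.
    interface KeyValueStore {
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    private final KeyValueStore store;
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Map<String, byte[]> dirty = new ConcurrentHashMap<>();
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    GroupLeader(KeyValueStore store) {
        this.store = store;
        // Asynchronous update propagation: dirty cache entries are pushed
        // back to the store in the background, off the commit path.
        flusher.scheduleAtFixedRate(this::flushDirty, 1, 1, TimeUnit.SECONDS);
    }

    // Read-through cache: the store is consulted only on a miss.
    byte[] read(String key) {
        return cache.computeIfAbsent(key, store::get);
    }

    // Commit path: log first (durability), then update the cache and mark
    // the entry dirty so the flusher eventually propagates it.
    void commitWrite(String key, byte[] value) {
        log.add(key + " := " + Arrays.toString(value)); // assume forced to stable storage
        cache.put(key, value);
        dirty.put(key, value);
    }

    // Guaranteed propagation: an entry leaves the dirty set only after the
    // store write succeeds; a concurrent overwrite keeps it dirty.
    private void flushDirty() {
        dirty.forEach((k, v) -> {
            store.put(k, v);
            dirty.remove(k, v);
        });
    }
}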


Prototype: G-Store [SoCC 2010]
An implementation over key-value stores
- Grouping middleware layer resident on top of a key-value store
- Each node runs the key-value store logic, a grouping layer, and a transaction manager, backed by distributed storage
- Application clients get transactional multi-key access
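The client-facing surface of such a grouping layer might look like the interface below. This is a hypothetical sketch of the "transactional multi-key access" contract, not G-Store's published API.

import java.util.List;

public interface GroupingLayer {
    // Grouping protocol: transfer ownership of the keys to one leader.
    String createGroup(List<String> keys);

    // Executes atomically at the group's leader: no distributed transaction.
    <T> T runTransaction(String groupId, Transaction<T> txn);

    // Disband the group and return key ownership to the followers.
    void deleteGroup(String groupId);

    interface Transaction<T> {
        T apply(GroupSnapshot snapshot);
    }

    interface GroupSnapshot {
        byte[] read(String key);
        void write(String key, byte[] value);
    }
}

A client would call createGroup with the profile keys of the players in a game instance, run transactions for the duration of the game, and call deleteGroup when the game ends.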


G-Store Evaluation
- Implemented using HBase; added the middleware layer (~10,000 LOC)
- Experiments in Amazon EC2
- Benchmark: an online multi-player game
- Cluster size: 10 nodes
- Data size: ~1 billion rows (>1 TB)
- For groups with 100 keys:
  - Group creation latency: ~10-100 ms
  - More than 10,000 groups concurrently created


The paper has a more detailed evaluation.

G-Store Evaluation
[Figure: group creation latency and group creation throughput]

Lightweight Elasticity
- Provisioning on-demand and not for peak
- Optimize operating cost!
[Figure: traditional infrastructures with unused resources vs. deployment in the cloud, as in slide 12. Slide credits: Berkeley RAD Lab]


Elasticity in the Database tier
[Figure: a load balancer in front of the application/web/caching tier, backed by the database tier]


So what does elasticity in the database tier mean?

Mention the cost/performance trade-off, and repeat that it is the cloud infrastructure that allows us to optimize operating cost, something that was not considered important in classical infrastructures.

Live database migration
- Migrate a database partition (or tenant) in a live system
  - Optimize operating cost
  - Resource orchestration in multitenant systems
- Different from:
  - Migration between software versions
  - Migration in case of schema evolution


VM migration for DB elasticity
- One DB partition per VM
  - Pros: allows fine-grained load balancing
  - Cons: performance overhead; poor consolidation ratio [Curino et al., CIDR 2011]
- Multiple DB partitions in a VM
  - Pros: good performance
  - Cons: must migrate all partitions; coarse-grained load balancing


Live database migration
- Multiple partitions share the same database process
  - Shared process multitenancy
- Migrate individual partitions on-demand in a live system
  - Virtualization in the database tier
- Straightforward solution (sketched below):
  - Stop serving the partition at the source
  - Copy to the destination
  - Start serving at the destination
  - Expensive!
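In sketch form, the straightforward solution and its cost look like this; the Node interface and all names are hypothetical.

// Stop-and-copy migration: the partition is offline for the entire copy,
// which for large partitions means seconds of downtime. This is the cost
// that the techniques in this talk are designed to avoid.
public final class StopAndCopy {
    public static void migrate(Node source, Node destination, String partitionId) {
        source.stopServing(partitionId);                // downtime begins
        byte[] snapshot = source.snapshot(partitionId); // copy all persistent data
        destination.load(partitionId, snapshot);
        destination.startServing(partitionId);         // downtime ends
    }

    // Hypothetical node interface for the sketch.
    public interface Node {
        void stopServing(String partitionId);
        byte[] snapshot(String partitionId);
        void load(String partitionId, byte[] snapshot);
        void startServing(String partitionId);
    }
}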


Migration cost measures
- Service unavailability: the time the partition is unavailable
- Number of failed requests: the number of operations failing / transactions aborting
- Performance overhead: impact on response times
- Additional data transferred


Two common DBMS architectures
- Decoupled storage architectures
  - ElasTraS, G-Store, Deuteronomy, MegaStore
  - Persistent data is not migrated
  - Albatross [VLDB 2011]
- Shared nothing architectures
  - SQL Azure, Relational Cloud, MySQL Cluster
  - Migrate persistent data
  - Zephyr [SIGMOD 2011]


Why is live DB migration hard?
- Persistent data must be migrated (GBs)
  - How to ensure no downtime?
- Nodes can fail during migration
  - How to guarantee correctness during failures?
  - Transaction atomicity and durability
  - Recover migration state after failure
- Transactions execute during migration
  - How to guarantee serializability?
  - Transaction correctness equivalent to normal operation


Our approach: Zephyr [SIGMOD 2011]
- Migration executed in phases
  - Starts with the transfer of minimal information to the destination (the wireframe)
- Database pages used as the granule of migration
  - Unique page ownership
- Source and destination concurrently execute transactions in one migration phase
  - Minimal transaction synchronization
- Guaranteed serializability
- Logging and handshaking protocols


Define wireframe in this slide. Defer the index wireframe definition to a later slide.

Simplifying assumptions (for this talk)
- Transactions access a single partition
- No replication
- No structural changes to indices
- Extensions in the paper [SIGMOD 2011] relax these assumptions


Design overview
[Figure: the source owns pages P1, P2, P3, ..., Pn and runs the active transactions TS1, ..., TSk; the destination owns no pages yet]


Init mode
Freeze indices and migrate the wireframe
[Figure: the source still owns pages P1..Pn and continues running transactions TS1, ..., TSk; the destination receives the un-owned pages P1..Pn as a wireframe]


Freeze: no structural modifications to the indices. Wireframe: the minimal information needed to start executing transactions at the destination: schema information, user authentication, the index wireframes, etc.

What is an index wireframe?
[Figure: the index structure at the source and its wireframe at the destination]


Just to give a concrete example of a wireframe: if we consider a B+ tree index, only the internal nodes of the index are migrated as part of the wireframe.

Dual mode
- Old, still-active transactions TSk+1, ..., TSl run at the source; new transactions TD1, ..., TDm run at the destination
- Requests for un-owned pages can block; e.g., when P3 is accessed by a destination transaction TDi, P3 is pulled from the source
- Index wireframes remain frozen

Once the destination is initialized with this minimal information, it can start executing transactions. At this point, migration enters the dual mode, where both the source and the destination execute transactions: new transactions arrive at the destination while the source continues executing the transactions that were active at the start of migration.

Finish mode
- Pages can still be pulled by the destination if needed
- Remaining pages (e.g., P1, P2) are pushed from the source; new transactions TDm+1, ..., TDn run at the destination
- Migration completes once all pages have been transferred
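A minimal sketch of the destination's page-access path during dual mode, assuming a pull-based RPC to the source; the types are illustrative, not Zephyr's code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Destination node during dual mode: the index wireframe is frozen, every
// page has a unique owner, and un-owned pages are pulled from the source
// on first access, blocking the requesting transaction.
public class DualModeDestination {
    interface SourceRpc {
        // Transfers ownership: the source afterwards aborts its own
        // transactions that touch this page and never pulls it back.
        byte[] pullPage(int pageId);
    }

    private final SourceRpc source;
    private final Map<Integer, byte[]> ownedPages = new ConcurrentHashMap<>();

    DualModeDestination(SourceRpc source) {
        this.source = source;
    }

    byte[] accessPage(int pageId) {
        // First access to an un-owned page blocks while it is fetched,
        // which is why destination transactions see higher latency in
        // dual mode than in normal operation.
        return ownedPages.computeIfAbsent(pageId, source::pullPage);
    }
}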

Normal operation
- All pages are owned by the destination, which executes transactions TDn+1, ..., TDp
- Index wireframe un-frozen

Artifacts of this design
- Once migrated, pages are never pulled back by the source
  - Abort transactions at the source that access migrated pages
- No structural changes to indices during migration
  - Abort transactions (at both nodes) that make structural changes to indices
- Destination pulls pages on-demand
  - Transactions at the destination experience higher latency compared to normal operation


Serializability
- The only concern is dual mode; in Init and Finish, only one node is executing transactions
- Local predicate locking of the internal index, plus exclusive page ownership: no phantoms
- Strict 2PL: transactions are locally serializable
- Pages transferred only once: no Tdest-to-Tsource conflict dependency
- Guaranteed serializability


Recovery
- Transaction recovery
  - For every database page, Tsrc precede Tdst in the conflict order
  - Recovery: transactions replayed in conflict order
- Migration recovery
  - Atomic transitions between migration modes
  - Developed logging and handshake protocols
  - Every page has exactly one owner
  - Bookkeeping at the index level
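A hedged sketch of how an atomic mode transition could combine write-ahead logging with a handshake, in the spirit of the slide; the actual Zephyr protocols in the paper also handle repeated failures and message loss, which this sketch glosses over.

// Each side logs its intent durably, and only after both sides have
// acknowledged does either act in the new mode. On recovery, the log
// tells a restarted node which mode it was in, so the source and
// destination converge to the same migration mode.
public class MigrationModeHandshake {
    enum Mode { NORMAL, INIT, DUAL, FINISH }

    interface DurableLog { void force(String record); }
    interface Peer { boolean prepare(Mode next); void commit(Mode next); }

    private final DurableLog log;
    private final Peer peer;
    private Mode mode = Mode.NORMAL;

    MigrationModeHandshake(DurableLog log, Peer peer) {
        this.log = log;
        this.peer = peer;
    }

    boolean transitionTo(Mode next) {
        if (mode == next) {
            return true; // idempotent re-request, e.g., a retry after recovery
        }
        log.force("PREPARE " + next); // survives a crash mid-transition
        if (!peer.prepare(next)) {
            log.force("ABORT " + next);
            return false;
        }
        log.force("COMMIT " + next);
        peer.commit(next);
        mode = next;
        return true;
    }
}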


Correctness
In the presence of arbitrary repeated failures, Zephyr ensures:
- Updates made to database pages are consistent
- A failure does not leave a page without an owner
- Both source and destination are in the same migration mode
- Guaranteed termination and starvation freedom


Implementation
- Prototyped using H2, an open-source OLTP database
  - Supports a standard SQL/JDBC API, a serializable isolation level, tree indices, and a relational data model
- Modified the database engine (~6,000 LOC)
  - Added support for freezing indices
  - Page migration status maintained using the index
- Tungsten SQL Router migrates JDBC connections during migration


Results Overview
- Downtime (partition unavailability)
  - S&C (stop and copy): 3-8 seconds (needed to migrate; unavailable for updates)
  - Zephyr: no downtime; either the source or the destination is available
- Service interruption (failed operations)
  - S&C: ~100s-1,000s of failed operations; all transactions with updates are aborted
  - Zephyr: ~10s-100s; an order of magnitude less interruption
- Minimal operational and data transfer overhead


Failed Operations
[Figure: an order of magnitude fewer failed operations with Zephyr than with stop and copy]

Concluding Remarks
Major enabling technologies:
- Scalable distributed database infrastructure: ElasTraS
- Dynamically formed data partitions: G-Store
- Live database migration: Albatross, Zephyr


Future Directions
- Self-managing controller for large multitenant database infrastructures
- Novel data management architectures
  - Leveraging advances in novel hardware
  - Convergence of transactional and analytics systems for real-time intelligence
- Putting the human in the loop: leveraging crowd-sourcing


Make the future more specific.

Thank you!

Collaborators
- UCSB: Divy Agrawal, Amr El Abbadi, Ömer Eğecioğlu, Shashank Agarwal, Shyam Antony, Aaron Elmore, Shoji Nishimura (NEC Japan)
- Microsoft Research Redmond: Phil Bernstein, Colin Reid
- IBM Almaden: Yannis Sismanis, Kevin Beyer, Rainer Gemulla, Peter Haas, John McPherson
