Amazon Dynamo DB

30
Amazon Dynamo DB - a term project presentation CSc 244 | CSU Sacramento Spring 2014 Instructor : Dr. Ying Jin 8 th May, 2014 Team 1 : Chetan Nagarkar Saransh

description

A class presentation on Amazon Dynamo DB - the NoSQL database service from Amazon Web Services(AWS)

Transcript of Amazon Dynamo DB

Page 1: Amazon Dynamo DB

Amazon Dynamo DB- a term project presentation

CSc 244 | CSU SacramentoSpring 2014

Instructor : Dr. Ying Jin 8th May, 2014

Team 1 :

Chetan Nagarkar

Saransh

Page 2: Amazon Dynamo DB

Agenda

• NoSQL background

• CAP theorem

• AWS offerings

• Dynamo DB

• Dynamo DB Architecture

• How to use

• Pricing

• Advantages

• Limitations

• References

• Questions 2

Page 3: Amazon Dynamo DB

3

NoSQL background• Traditional databases have a concrete schema.

Page 4: Amazon Dynamo DB

4

NoSQL systems are designed to be schema-less.

• Entry 1

• {“name”:“emp1”}

• Entry 2

• {“name”:“emp2”,“e_id”:“1”,“e_addr”:“Cupertino”}

• Entry 3

• {“name”:“emp3”,“e_id”:“3”}

• Entry 4

• {“name”:“emp4”,“e_id”:“6”, “dob”:“03-Sep-1964”}

Page 5: Amazon Dynamo DB

5

Types of NoSQL systems

1. Key-Value Pair : • Every item is a set of key-value pair

Voldemort(LinkedIn), DynamoDB(Amazon), Redis(VMWare), etc.

2. Document Oriented :

• Each key is paired with a complex data structure called ‘document’. Documents may contain multiple key-value pairs

CouchDB, MongoDB, etc.

3. Column Family Stores :

• Key value is mapped to set of columns optimized for query over large datasets.Cassandra(Facebook), HyperTable, Hbase, BigTable(Google) etc.

4. Graph Databases :

• Contains information about network, social connectionsNeo4j, FlockDB(Twitter).

Page 6: Amazon Dynamo DB

6

ACID properties in traditional RDBMs

• Atomicity

• Requires each transaction to be all or nothing

• Consistency

• Brings databases from one valid state to another

• Isolation

• Takes care of concurrent execution and avoid overwritting

• Durability

• Commited transactions will remain so in the event of power loss, crashes etc

Page 7: Amazon Dynamo DB

7

CAP theorem

• It states : A distributed system possibly cannot guarantee all of the following simultaneously –

• Consistency: All nodes see the same data at the same time

• Availability: The system is ‘always on’, i.e., every request receives a response, whether it serves the request or not.

• Partition Tolerance: System continues to operate in case of system failure or message loss.

Page 8: Amazon Dynamo DB

8

Page 9: Amazon Dynamo DB

9

Amazon Web Services(AWS) offerings• Launched in 2006, AWS is used by thousands of companies

today for cloud based computing services.

• Includes an array of services: Remote Computing(EC2), Identity(IAM), Database services(SimpleDB, RDS, ElastiCache, DynamoDB, etc.)

• 8 Regions and Availability zones :Code Name

ap-northeast-1 Asia Pacific (Tokyo) Region

ap-southeast-1 Asia Pacific (Singapore) Region

ap-southeast-2 Asia Pacific (Sydney) Region

eu-west-1 EU (Ireland) Region

sa-east-1 South America (Sao Paulo) Region

us-east-1 US East (Northern Virginia) Region

us-west-1 US West (Northern California) Region

us-west-2 US West (Oregon) Region

Page 10: Amazon Dynamo DB

10

Dynamo DB• Dynamo DB is a managed and eventually consistent NoSQL

database service to support highly available data access.

• A flexible data model with key/attribute pairs. No schema required.

• Optimized for availability to maximize :

• Data consistency

• Durability

• Performance

ALL THIS WITHOUT THE OPERATIONAL BURDEN

Page 11: Amazon Dynamo DB

11

What is ‘Fully Managed’?

Never worry about:

• Hardware provisioning

• Cross-availability zone replication

• Hardware and Software updates

• Monitoring and handling of hardware failures

• Replicas automatically generated.

Page 12: Amazon Dynamo DB

DynamoDB Architecture

• True distributed architecture

• Data is spread across hundreds of servers called storage nodes

• Hundreds of servers form a cluster in the form of a “ring”

• Client application can connect using one of the two approaches

• Routing using a load balancer

• Client-library that reflects Dynamo’s partitioning scheme and can determine the storage host to connect

• DynamoDB is designed to be “always writable” storage solution

• Allows multiple versions of data on multiple storage nodes

• Conflict resolution happens while reads and NOT during writes

• Syntactic conflict resolution

• Symantec conflict resolution

Page 13: Amazon Dynamo DB

• Put(key,context object)

• Determines where replicas of the object should be placed based on key,& write replicas to disk

• Context Info is stored along with object so as to verify validity of context object

• If at least W-1 nodes respond, write is successful

• Get(key)-

• Operation locates all object replicas associated with key in storage system

• Returns a single or list of object with conflict version along with context

• Client reconcile divergent versions and supersede current version with based on context

• Node handling read and write operation is called “coordinator”

• For reads and writes dynamo uses consistency protocol

• Consistency protocol has 3 variables

• N – Number of replicas of data to be read or written

• W – Number of nodes that must participate in successful write operation

• R – Number of machines contacted in read operation

• R+W > N Latency determined by slowest of R or W replicas. Thus R&W less than N

DynamoDB System Interface – get() and put()

Page 14: Amazon Dynamo DB

DynamoDB Architecture - Partitioning• Data is partitioned over multiple hosts called storage nodes (ring)

• Uses consistent hashing to dynamically partition data across storage hosts

• Two problems associated with Consistent hashing

• Hashing of storage hosts can cause imbalance of data and load

• Consistent hashing treats every storage host as same capacity

A [3,4]

B [1]

C[2]

1

2

3

4

3

4

A[4]

B[1]

1

2D[2,3]

Initial Situation

Situation after C left and D joined

Page 15: Amazon Dynamo DB

DynamoDB Architecture - Partitioning• Solution – Virtual hosts mapped to physical hosts (tokens)

• Number of virtual hosts for a physical host depends on capacity of physical host

• Virtual nodes are function-mapped to physical nodes

AB

C

D

E

F

G

H

I

J

K

L

C

B Physical HostVirtual Host

Page 16: Amazon Dynamo DB

DynamoDB Architecture – Scaling• Gossip protocol used for node membership

• Every second, storage node randomly contact a peer to bilaterally reconcile persisted membership history

• Doing so, membership changes are spread and eventually consistent membership view is formed

• When a new members joins, adjacent nodes adjust their object and replica ownership

• When a member leaves adjacent nodes adjust their object and replica and distributes keys owned by leaving member

A

B

C

D

E

F

G

H

Replication factor = 3

Page 17: Amazon Dynamo DB

DynamoDB Architecture – Data Versioning• Eventual Consistency – data propagates asynchronously

• Its possible to have multiple version on different storage nodes

• If latest data is not available on storage node, client will update the old version of data

• Conflict resolution is done using “vector clock”

• Vector clock is a metadata information added to data when using get() or put()

• Vector clock effectively a list of (node, counter) pairs

• One vector clock is associated with every version of every object

• One can determine two version has causal ordering or parallel branches by examining vector clocks

• Client specify vector clock information while reading or writing data in the form of “context”

• Dynamo tries to resolve conflict among multiple version using syntactic reconciliation

• If syntactic reconciliation does not work, client has to resolve conflict using semantic reconciliation

Page 18: Amazon Dynamo DB

18

Page 19: Amazon Dynamo DB

Strongly consistent vs Eventually Consistent

• Multiple copies of same item to ensure durability

• Takes time for update to propagate multiple copies

• Eventually consistent

• After write happens, immediate read may not give latest value

• Strongly consistent

• Requires additional read capacity unit

• Gets most up-to-date version of item value

• Eventually consistent read consumes half the read capacity unit as strongly consistent read.

Page 20: Amazon Dynamo DB

20

How to use

• AWS provides SDKs to interact with DynamoDB for the following platforms/languages:

• Android

• iOS

• Java

• .Net

• In Java, there are two means to interact with your DynamoDB table:

• AWS Management Console

• AWS Toolkit for Eclipse

• Python

• PHP

• Node.js

• Ruby

Page 21: Amazon Dynamo DB

21

Management Console

Page 22: Amazon Dynamo DB

22

Console – create table

Page 23: Amazon Dynamo DB

Inside Amazon DynamoDB

• Primary key (mandatory for every table)

• Hash or Hash + Range

• Data model in the form of tables

• Data stored in the form of items (name – value attributes)

• Secondary Indexes for improved performance

• Local secondary index

• Global secondary index

• Scalar data type (number, string etc) or multi-valued data type (sets)

key=value key=value key=value key=value Table

Item (64KB max)

Attributes

Page 24: Amazon Dynamo DB

24

Create Instance in Java

Page 25: Amazon Dynamo DB

25

Using AWS SDK: AWS Toolkit for Eclipse

Page 26: Amazon Dynamo DB

26

Pricing

• In the free tier, AWS offers 100 MB of free data storage per month.

• Read capacity: 1 read = 4 KB data

• Write capacity: 1 write = 1 KB data

• With eventually consistant reads, you get twice reads/sec

• In free tier you get 5 writes/sec & 10 reads/sec

• Or 432,000 writes and 864,000 writes per day or 40 million data operations per month.

Page 27: Amazon Dynamo DB

27

Advantages

• Easy Administration: No h/w or s/w provisioning, setup or configuration

• Flexible: Secondary Indexes: query on whichever attribute

• Fast, predictable performance: SSDs, single digit latency

• Built-in fault tolerance

• Stong consistency and atomic counters

• Automatic Data Replication: Data stored across 3 availability zones

• Security: AWS Identity Access Management(IAM)

• CloudWatch Alarms: Alarms are raised if you are near to crossing provisioned throughput or storage.

• Provisioned throughput/Cost effective: Pay how you use

Page 28: Amazon Dynamo DB

28

Limitations

• 64KB limit on item size (row size)

• 1 MB limit on fetching data

• Pay more if you want strongly consistent data

• Size is multiple of 4KB (provisioning throughput wastage)

• Cannot join tables

• Indexes should be created during table creation only

• No triggers or server side scripts

• Limited comparison capability (no not_null, contains etc)

Page 30: Amazon Dynamo DB

30

Questions ?

Thank You !