1. Big Data A broad term for data sets so large or complex that traditional data processing...


1

Big Data

A broad term for data sets so large or complex that traditional data processing applications are inadequate.

2

Examples
• Walmart: 10^6 transactions per hour, all in databases totaling more than 2.5 petabytes
• Large Hadron Collider: 150 million sensors deliver data 40 million times per second
• Amazon: millions of sales per day; its three largest Linux databases hold 7.8 TB, 18.5 TB, and 24.7 TB

3

Database Systems and Big Data
• RDBMSs have trouble handling big data
• Generally, software running on tens, hundreds, or thousands of servers is required
• We are generally talking about data accumulation and analysis, not transaction processing

4

Exabytes! (10^18 bytes)

5

MapReduce
• 2004 paper from Google
• Map function: processes a key/value pair to generate a set of intermediate key/value pairs
• Reduce function: merges all intermediate values associated with the same key
• Hadoop is an open-source implementation of MapReduce
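To make the two functions concrete, here is a minimal word-count sketch in Python, the canonical example from the Google paper. The names map_fn and reduce_fn are illustrative only, not part of Hadoop or any framework:

    # Map: for each word in a document, emit an intermediate (word, 1) pair.
    def map_fn(key, value):          # key: document name, value: document contents
        for word in value.split():
            yield (word, 1)

    # Reduce: merge (sum) all counts emitted for the same word.
    def reduce_fn(key, values):      # key: a word, values: all counts for that word
        yield (key, sum(values))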

6

MapReduce Example
• Map
• Reduce

7

MapReduce Execution
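The execution diagram on this slide is not reproduced in the transcript. As a rough stand-in, the single-process sketch below walks the same phases an actual run goes through: apply the map function to every input, group ("shuffle") the intermediate pairs by key, then apply the reduce function per key. All names are illustrative:

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every (key, value) input pair.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                intermediate[ikey].append(ivalue)   # shuffle: group by intermediate key
        # Reduce phase: merge all values associated with the same key.
        results = []
        for ikey, ivalues in intermediate.items():
            results.extend(reduce_fn(ikey, ivalues))
        return results

    # Word-count usage, with map/reduce as on the previous slide:
    docs = [("d1", "big data big ideas"), ("d2", "data beats ideas")]
    wc_map = lambda k, v: ((w, 1) for w in v.split())
    wc_reduce = lambda k, vs: [(k, sum(vs))]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # [('big', 2), ('data', 2), ('ideas', 2), ('beats', 1)]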

8

Example Applications

9

BASE: an alternative to ACID?
• BASE is a new approach
  • Basically Available
  • Soft state
  • Eventually consistent
• Changes the fundamental approach
  • The ACID approach frees applications from concern about partial transaction completion: a transaction happens completely or not at all
  • BASE returns control to the application quickly, but not all operations may be complete
• BASE is used in the Big Data realm
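A minimal sketch of that control-flow difference, assuming a toy in-memory store (nothing here is a real database API): the write returns as soon as the primary copy is updated, replication happens in the background, so a read from the replica may briefly see stale ("soft") state before it converges:

    import threading, queue, time

    primary, replica = {}, {}
    pending = queue.Queue()            # queue of not-yet-replicated writes

    def replicator():                  # background propagation to the replica
        while True:
            key, value = pending.get()
            time.sleep(0.1)            # simulated network delay
            replica[key] = value       # replica eventually converges

    threading.Thread(target=replicator, daemon=True).start()

    def base_write(key, value):
        primary[key] = value           # control returns to the application now...
        pending.put((key, value))      # ...replication completes later

    base_write("x", 1)
    print(replica.get("x"))            # likely None: not yet consistent
    time.sleep(0.2)
    print(replica.get("x"))            # 1: eventually consistent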

10

The CAP Theorem
• It is impossible for a distributed computer system to simultaneously guarantee all three of:
  • Consistency: the client perceives that a set of operations has occurred all at once
  • Availability: every operation must terminate in an intended response
  • Partition tolerance: operations will complete, even if individual components are unavailable
• At first this was Brewer's Conjecture; it has since been proved, so it is now called the CAP Theorem

11

Forfeit Availability
[Figure: CAP triangle (Consistency, Availability, Tolerance to Partitions) with Availability crossed out]
Examples: distributed databases, distributed locking, majority protocols

12

Forfeit Consistency
[Figure: CAP triangle with Consistency crossed out]
Examples: DNS, web caches

13

Forfeit Partition Tolerance
[Figure: CAP triangle with Tolerance to Partitions crossed out]
Examples: single-site databases, LDAP

14

Optimistic Replication
• Underlies the BASE approach: replicas are allowed to diverge but ultimately converge
• Operations (a toy sketch follows this slide):
  1. Operation submission: users submit operations from independent sites
  2. Propagation: each site shares the operations it knows about with other sites
  3. Scheduling: each site decides on an order for the operations it knows about
  4. Conflict resolution: if there are conflicts among operations at a site, the sequence is modified
  5. Commitment: sites agree on a final schedule and conflict-resolution result, and changes are made permanent
• Note that a reliable message queue for transactions, and idempotent transactions, are often employed
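A toy sketch of steps 1 through 5 for a single replicated value, assuming last-writer-wins conflict resolution (one common policy; the slide does not prescribe one). Each site logs operations locally, sites exchange logs, and every site sorts the merged log the same deterministic way, so all converge on the same final value:

    # Each operation is (timestamp, site_id, value). Sorting by this tuple
    # gives every site the same total order, so resolution is deterministic.

    class Site:
        def __init__(self, site_id):
            self.site_id = site_id
            self.log = []                          # operations this site knows about

        def submit(self, timestamp, value):        # 1. operation submission
            self.log.append((timestamp, self.site_id, value))

        def propagate_to(self, other):             # 2. propagation
            merged = set(self.log) | set(other.log)
            self.log = other.log = list(merged)

        def value(self):                           # 3-5. schedule, resolve, commit
            schedule = sorted(self.log)            # same order at every site
            return schedule[-1][2] if schedule else None   # last writer wins

    a, b = Site("A"), Site("B")
    a.submit(1, "draft")          # concurrent edits at independent sites
    b.submit(2, "final")
    a.propagate_to(b)             # after the exchange, both sites converge
    print(a.value(), b.value())   # final final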

15

Example: CVS

CVS does version control
• Users edit local versions of files
• They can pull updates from the server, or push out updates they think are ready
• Changes are made in the order received
• If conflicts are detected, they are flagged for manual repair by users

16

Implications
• Applications must ensure that delayed updates don't impair the user's view of correctness
• Testing in a more limited environment can mask problems that surface in a larger production environment
• Validity constraints can become sensitive to the order of change operations, causing reconciliation problems (see the sketch after this slide)
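A small illustration of that last point, assuming a hypothetical "balance must stay non-negative" constraint: two replicas receive the same two operations in different orders, and because the constraint is checked against each replica's current state, they make different accept/reject decisions and end up needing reconciliation:

    def apply_ops(ops):
        """Apply (label, delta) operations, rejecting any that would go negative."""
        balance, accepted = 0, []
        for label, delta in ops:
            if balance + delta >= 0:        # order-sensitive validity check
                balance += delta
                accepted.append(label)
        return balance, accepted

    deposit, withdraw = ("deposit", 100), ("withdraw", -50)
    print(apply_ops([deposit, withdraw]))   # (50, ['deposit', 'withdraw'])
    print(apply_ops([withdraw, deposit]))   # (100, ['deposit']): withdraw rejected

Same operations, different schedules, divergent states: exactly the reconciliation problem the slide warns about.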

17

A Thought
• There is little work on the continuum between ACID and BASE
• There could be a PhD dissertation in this area

18
