Big Data
A broad term for data sets so large or complex that traditional data processing applications are inadequate.
Examples
• Walmart: 10^6 transactions per hour, all in databases > 2.5 petabytes
• Large Hadron Collider: 150 million sensors deliver data 40 million times/sec
• Amazon: millions of sales per day; three largest Linux databases: 7.8 TB, 18.5 TB, 24.7 TB
Database Systems and Big Data
• RDBMS have trouble handling big data
• Generally, software running on tens, hundreds or thousands of servers is required
• We are generally talking about data accumulation and analysis, not transaction processing
MapReduce
• 2004 paper from Google
• Map function
  • Processes a key/value pair to generate a set of intermediate key/value pairs
• Reduce function
  • Merges all intermediate values associated with the same intermediate key
• Hadoop is an open-source implementation of MapReduce
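The map and reduce roles can be illustrated with a single-process word count. This is a minimal sketch, not Hadoop's API: the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical, and the grouping step that a real framework performs across many servers is done here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: take one (key, value) pair and emit intermediate (key, value) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: merge all intermediate values that share one key."""
    return word, sum(counts)


def run_mapreduce(documents: dict[str, str]) -> dict[str, int]:
    # Group intermediate values by key (a framework does this between phases).
    grouped: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    # Call the reduce function once per intermediate key.
    return dict(reduce_fn(key, values) for key, values in grouped.items())


docs = {"d1": "big data is big", "d2": "data about data"}
word_counts = run_mapreduce(docs)
```

Because each call to `map_fn` looks at one document and each call to `reduce_fn` looks at one key, both phases could in principle be spread over many machines, which is the point of the model.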
BASE: an alternative to ACID?
• BASE is a new approach
  • Basically Available
  • Soft state
  • Eventually consistent
• Changes the fundamental approach
  • The ACID approach frees applications from concern about partial transaction completion: a transaction is done completely or not at all
  • BASE returns control to the application quickly, but not all operations may be complete
• BASE is used in the big data realm
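The "returns control quickly" behaviour can be sketched with a toy two-replica store. Everything here is an assumption made for illustration (the class `EventuallyConsistentStore`, its two dictionaries and the explicit `propagate` call); a real system would propagate in the background over a network.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy key/value store with two replicas (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.secondary: dict[str, str] = {}
        self.pending: deque = deque()  # soft state: updates still in flight

    def write(self, key: str, value: str) -> str:
        # Basically available: acknowledge once a single replica has the write.
        self.primary[key] = value
        self.pending.append((key, value))
        return "ack"

    def read_secondary(self, key: str):
        # May return stale data until propagation catches up.
        return self.secondary.get(key)

    def propagate(self) -> None:
        # Eventually consistent: drain queued updates to the other replica.
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value


store = EventuallyConsistentStore()
ack = store.write("cart", "book")
stale = store.read_secondary("cart")  # propagation has not run yet
store.propagate()
fresh = store.read_secondary("cart")
```

The application gets its acknowledgement before the second replica has the update, so a read in that window sees the old state. Under ACID that window would not be visible to the application.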
The CAP Theorem
• It is impossible for a distributed computer system to simultaneously guarantee all three of:
  • Consistency. The client perceives that a set of operations has occurred all at once.
  • Availability. Every operation must terminate in an intended response.
  • Partition tolerance. Operations will complete, even if individual components are unavailable.
• This began as Brewer’s Conjecture; it has since been proved and is now called the CAP theorem.
Forfeit Availability
[Diagram: Consistency, Availability, Tolerance to Partitions, with Availability crossed out]
• Distributed databases
• Distributed locking
• Majority protocols
Forfeit Partition Tolerance
[Diagram: Consistency, Availability, Tolerance to Partitions, with Tolerance to Partitions crossed out]
• Single-site databases
• LDAP
Optimistic Replication
• Underlies the BASE approach: replicas are allowed to diverge but eventually converge
• Operations:
  1. Operation submission: users submit operations from independent sites
  2. Propagation: each site shares operations it knows about with other sites
  3. Scheduling: each site decides on an order for the operations it knows about
  4. Conflict resolution: if there are conflicts among operations at a site, the sequence is modified
  5. Commitment: sites agree on a final schedule and conflict resolution result, and changes are made permanent
• Note that a reliable message queue for transactions, and idempotent transactions, are often employed
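The steps above can be sketched with two in-memory sites. This is a simplified illustration under stated assumptions: the `Op` and `Site` classes are hypothetical, scheduling is a sort on (logical clock, site name), conflicts are resolved by letting the last writer in that order win, and commitment is reduced to both sites computing the same schedule rather than running an agreement protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Op:
    op_id: str  # unique id, so re-delivering an operation is harmless
    clock: int  # logical timestamp assigned at submission
    site: str
    key: str
    value: str


@dataclass
class Site:
    name: str
    log: dict = field(default_factory=dict)  # op_id -> Op
    clock: int = 0

    def submit(self, key: str, value: str) -> Op:
        # 1. Operation submission at an independent site.
        self.clock += 1
        op = Op(f"{self.name}-{self.clock}", self.clock, self.name, key, value)
        self.log[op.op_id] = op
        return op

    def receive(self, ops) -> None:
        # 2. Propagation: merge operations learned from another site.
        for op in ops:
            self.log.setdefault(op.op_id, op)  # duplicates are ignored
            self.clock = max(self.clock, op.clock)

    def schedule(self) -> list:
        # 3. Scheduling: a deterministic order every site can compute.
        return sorted(self.log.values(), key=lambda op: (op.clock, op.site))

    def state(self) -> dict:
        # 4. Conflict resolution: the last writer in the schedule wins per key.
        result = {}
        for op in self.schedule():
            result[op.key] = op.value
        return result


a, b = Site("A"), Site("B")
a.submit("title", "Big Data")
b.submit("title", "Large Data")
diverged = a.state() != b.state()  # replicas disagree before propagation

a.receive(b.log.values())
b.receive(a.log.values())
b.receive(a.log.values())  # re-delivery changes nothing (idempotent)
converged = a.state() == b.state()
```

Keying the log by `op_id` is what makes repeated delivery safe, which is one reason idempotent operations pair well with a message queue that may deliver more than once.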
Example: CVS
• CVS does version control
• Users edit local versions of files
• Can pull updates from server or can push out updates they think are ready
• Changes made in order received
• If conflicts are detected they are flagged for manual repair by users
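Conflict flagging of this kind can be illustrated with a line-by-line three-way comparison. This is a sketch of the general idea, not CVS's actual merge algorithm: `merge_files` is a hypothetical helper, and it assumes all three versions have the same number of lines.

```python
def merge_files(base, mine, theirs):
    """Line-by-line three-way merge for files of equal length (a simplification)."""
    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t or t == b:
            merged.append(m)  # same edit on both sides, or only my side changed
        elif m == b:
            merged.append(t)  # only their side changed
        else:
            merged.append(b)  # both changed differently: keep base and flag it
            conflicts.append(i)
    return merged, conflicts


base = ["title", "intro", "body"]
mine = ["title", "intro v2", "body mine"]
theirs = ["Title", "intro", "body theirs"]
merged, conflicts = merge_files(base, mine, theirs)
```

Lines changed on only one side merge automatically; the line changed differently on both sides is reported by index so a user can repair it by hand.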
Implications
• Applications must ensure that delayed updates don’t impair the user’s view of correctness
• Testing in a more limited environment can mask problems in the larger production environment
• Validity constraints can become sensitive to the order of change operations, causing reconciliation problems
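The order-sensitivity point can be seen with a small example. It is a sketch under assumed rules: an account balance that must never go below zero, and a hypothetical `apply_ops` helper that rejects any operation that would violate that constraint at the moment it is applied.

```python
def apply_ops(balance: int, ops: list) -> tuple:
    """Apply signed amounts in order; reject any op that would make balance < 0."""
    rejected = []
    for amount in ops:
        if balance + amount < 0:
            rejected.append(amount)  # constraint violated at this point in the order
        else:
            balance += amount
    return balance, rejected


deposit, withdrawal = 50, -80
site_a = apply_ops(40, [deposit, withdrawal])  # deposit arrives first
site_b = apply_ops(40, [withdrawal, deposit])  # withdrawal arrives first
```

Both sites saw the same two operations, yet one ends with a balance of 10 and the other with 90 and a rejected withdrawal. Reconciling them requires agreeing on an order, not just on the set of operations.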
A Thought
• There is little work on the continuum between ACID and BASE
• There could be a PhD dissertation in this area