Non-Relational Databases
Jeff Allen
Overview
  Background
  Benefits
  Pitfalls
  Available Platforms
  Summary
Background
“A database is a collection of data.”
Ramakrishnan and Gehrke, Database Management Systems, p. 4
Non-Relational Database
Many names
  Key/Value Store (Amazon)
  Document-Oriented (CouchDB)
  Attribute-Oriented
  Distributed Hash Table
  Sharded Sorted Arrays
  Distributed Database (non-unique)
Contains a collection of data which cannot (necessarily) be described using predetermined relations
  No schema defined
Scales to petabytes of data across thousands of servers
How It Works
Relational Model
  CarID  Make        Model   Interior  Color  Mileage
  13     Honda       Accord  NULL      Green  NULL
  18     Mitsubishi  NULL    Grey      NULL   NULL
Non-Relational Model
  Key  Value
  13   Make=“Honda”, Model=“Accord”, Color=“Green”
  18   Make=“Mitsubishi”, Interior=“Grey”
  Lookups, insertions, and updates are all handled through the key
  (Generally, the key would be hashed, not the raw integer)
Indexing
Store the value in the hash table under the primary key
Redundantly store the “entity model” with each row
  Make=“Honda”…
This hash table is the only “index” in the database
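The key/value layout above can be sketched in a few lines of JavaScript. This is only an illustration, assuming an in-memory Map stands in for the distributed hash table; put and get are hypothetical helper names, not the API of any product named in these slides.

```javascript
// Minimal in-memory sketch of the key/value model (not a real database client).
// The Map plays the role of the single hash-table "index".
const store = new Map();

// Insert or update: the whole record is stored under its primary key.
function put(key, record) {
  store.set(String(key), { ...record });
}

// Lookup: the only supported access path is the key itself.
function get(key) {
  return store.get(String(key));
}

// Each value carries its own field names (the "entity model"),
// so rows do not need to share a schema.
put(13, { Make: "Honda", Model: "Accord", Color: "Green" });
put(18, { Make: "Mitsubishi", Interior: "Grey" });

console.log(get(13).Model); // Accord
console.log(get(18).Color); // undefined: the field was never stored
```

Because the field names travel with every value, nothing has to be declared up front, which is also why short field names save space.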
Benefits
“Even though RDBMS have provided database users with the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility, their performance in each of these areas is not necessarily better than that of an alternate solution pursuing one of these benefits in isolation.”
Tony Bain, ReadWrite Enterprise
Environments
Non-Relational Data
  Attribute table
  Different data may be available about different objects
  Difficult to search or index

  CarID  Make        Model    Color  Year  License
  1      Honda       Accord   NULL   2002  NULL
  2      Mitsubishi  NULL     Green  NULL  NULL
  3      NULL        Escape   Black  2003  NULL
  4      NULL        NULL     Blue   NULL  YC3-XYZ
  5      Ford        Mustang  Red    1998  NULL
Environments
Non-Relational Data
  Attribute table
  Different data may be available about different objects
  Difficult to search or index
  Recording a new attribute (Interior) for one car adds a mostly-NULL column to every row

  CarID  Make        Model    Color  Year  License  Interior
  1      Honda       Accord   NULL   2002  NULL     NULL
  2      Mitsubishi  NULL     Green  NULL  NULL     NULL
  3      NULL        Escape   Black  2003  NULL     NULL
  4      NULL        NULL     Blue   NULL  YC3-XYZ  NULL
  5      Ford        Mustang  Red    1998  NULL     Black
Distributed DBMS
Divide up workload across multiple servers
  In the interest of time/computation
  Also for space – don’t want to redundantly store the entire DB
Attainable with RDBMS
  Difficult to program and coordinate
  Makes a well-implemented distributed RDBMS expensive
  May have to interrogate multiple servers to get an answer to a single query
In a Non-Relational DBMS
  Easy to partition data
  Unlikely that a query could span >1 slave node
  Overlap partitions for redundancy
  Seamless failure recovery
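One way to picture "easy to partition" with overlapping partitions is the sketch below. It assumes a fixed list of nodes and plain hash-modulo placement; the node names, the replica count, and the helper names are made up for illustration and are not how any specific product assigns data.

```javascript
// Sketch: assign each key to a primary node plus overlapping replicas.
// Node names and the replica count are hypothetical.
const nodes = ["node-a", "node-b", "node-c", "node-d"];
const REPLICAS = 2;

// FNV-1a: a simple, deterministic 32-bit string hash.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The key alone decides which nodes hold the record, so a lookup
// by key never has to interrogate the other servers.
function nodesForKey(key) {
  const primary = fnv1a(String(key)) % nodes.length;
  const owners = [];
  for (let r = 0; r < REPLICAS; r++) {
    owners.push(nodes[(primary + r) % nodes.length]);
  }
  return owners;
}

// Failure recovery: if the primary is down, the next replica answers.
function readableNode(key, downNodes) {
  return nodesForKey(key).find((n) => !downNodes.has(n));
}

console.log(nodesForKey(13));
```

Plain modulo placement reshuffles most keys when a node is added, so this only illustrates the routing idea, not a production scheme.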
Storage
Can tailor data storage to its use
  May reduce the need to join multiple normalized tables
  Could make re-assembly/display of database data easier
Storing model redundantly
  Beneficial to have short column names
Performance
Minimize time spent “plumbing” the data
Much lighter overhead on distributed systems
Faster queries (10-20k operations/second)
  Due to limitations on what queries are allowed
  A valid claim if the data can be stored to work within these limitations
Pitfalls
“The responsibility for ensuring data integrity falls entirely to the application. But application code often carries bugs. Bugs in a properly designed relational database usually don't lead to data integrity issues; bugs in a key/value database, however, quite easily lead to data integrity issues.”
Tony Bain, ReadWrite Enterprise
Pitfalls – Foreign Keys
No integrity checks
Relies on the Model-View-Controller (MVC) principle
  Assumes the integrity will be handled elsewhere
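"Handled elsewhere" means the application has to do the check a foreign key would have done. The sketch below assumes two in-memory Maps standing in for two key/value stores; the owners/cars split and the function names are hypothetical and only illustrate where the check ends up.

```javascript
// Sketch: the application, not the database, enforces the "foreign key".
// Both Maps are stand-ins for key/value stores; all names are hypothetical.
const owners = new Map();
const cars = new Map();

function addOwner(ownerId, name) {
  owners.set(ownerId, { name });
}

// Reject a car whose ownerId points at nothing. The store itself
// would happily accept the dangling reference.
function addCar(carId, ownerId, fields) {
  if (!owners.has(ownerId)) {
    throw new Error("unknown owner: " + ownerId);
  }
  cars.set(carId, { ownerId, ...fields });
}

addOwner("o1", "Alice");
addCar(13, "o1", { Make: "Honda", Model: "Accord" });
```

Any code path that writes to the store without going through addCar skips the check, which is the integrity risk the quote above describes.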
Pitfalls – Proprietary
RDBMSs rely on SQL as a standard interface
No such interface between non-relational DBMSs
Locked in to a single system
  Many of which have only been deployed for a handful of years
Pitfalls – Proprietary
Voldemort
String bootstrapUrl = "tcp://localhost:6666";
StoreClientFactory factory = new SocketStoreClientFactory(
    new ClientConfig().setBootstrapUrls(bootstrapUrl));

// Create a client that executes operations on a single store
StoreClient<String, String> client = factory.getStoreClient("my_store_name");

// Read the versioned value, change it, and write it back
Versioned<String> value = client.get("some_key");
value.setObject("some_value");
client.put("some_key", value);
Pitfalls – Proprietary
CouchDB
Session s = new Session("localhost", 5984);
Database db = s.getDatabase("foodb");

// Fetch an existing document, change a field, and save it
Document doc = db.getDocument("documentid1234");
doc.put("foo", "bar");
db.saveDocument(doc);

// Create and save a new document
Document newdoc = new Document();
newdoc.put("foo", "baz");
db.saveDocument(newdoc);

// List all documents, then fetch each one in full by its ID
ViewResults result = db.getAllDocuments();
for (Document d : result.getResults()) {
    System.out.println(d.getId());
    Document full = db.getDocument(d.getId());
}

// Ad hoc view: the query is a JavaScript function sent as a string
ViewResults resultAdHoc = db.adhoc("function (doc) { if (doc.foo=='bar') { return doc; }}");
Pitfalls – Analytics
If only able to query a few rows at a time, could be very difficult to extract large chunks of data
  Difficult to mine or analyze
  Difficult to export to another system
Amazon’s offering can’t run a query that takes longer than 5 seconds
Google’s AppEngine Datastore can’t retrieve more than 1,000 items for a query
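With a per-query cap, a full export turns into a loop of small queries. The sketch below assumes a store that can return rows in key order starting after a given key; queryPage and makeFakeStore are hypothetical stand-ins written for this example, not the SimpleDB or Datastore API.

```javascript
// Sketch: export a large data set in pages when each query is capped.
// queryPage is a stand-in for a store that returns at most `limit` keys
// greater than `afterKey`; it is not a real client API.
const PAGE_LIMIT = 1000;

function makeFakeStore(rowCount) {
  const keys = [];
  for (let i = 0; i < rowCount; i++) keys.push(i);
  return {
    queryPage(afterKey, limit) {
      return keys.filter((k) => k > afterKey).slice(0, limit);
    },
  };
}

// Keep asking for the next page until an empty page comes back.
function exportAll(store) {
  const rows = [];
  let afterKey = -1;
  let queries = 0;
  for (;;) {
    const page = store.queryPage(afterKey, PAGE_LIMIT);
    queries++;
    if (page.length === 0) break;
    rows.push(...page);
    afterKey = page[page.length - 1];
  }
  return { rows, queries };
}

const exported = exportAll(makeFakeStore(2500));
console.log(exported.rows.length, exported.queries); // 2500 4
```

Every page is a separate round trip, and the data can change between pages, so the export is not a snapshot of one consistent state.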
Pitfalls – Consistency
The problem exists in any distributed system
All data is versioned
  Each key has an implicit value field called “time”
  Can be used to detect collisions or out-of-date updates across redundant data
  Garbage collected automatically (implementation-dependent)
Many will “eventually” get to a consistent state
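The role of the implicit “time” field can be sketched as follows. This assumes a single in-memory Map and a simple counter in place of a real timestamp or vector clock; read and write are hypothetical names, and real systems differ in how they version and reconcile.

```javascript
// Sketch: each stored value carries a "time" version; a write based on
// an out-of-date read is detected instead of silently overwriting.
// A counter stands in for the timestamp (an assumption of this sketch).
const versioned = new Map();

function read(key) {
  const entry = versioned.get(key);
  return entry ? { value: entry.value, time: entry.time }
               : { value: undefined, time: 0 };
}

// The writer passes back the version it read. A mismatch means
// someone else updated the key in between.
function write(key, value, timeRead) {
  const current = versioned.has(key) ? versioned.get(key).time : 0;
  if (timeRead !== current) {
    return { ok: false, currentTime: current };
  }
  versioned.set(key, { value, time: current + 1 });
  return { ok: true, currentTime: current + 1 };
}

// Two clients read the same version, then both try to write.
const clientA = read("car:13");
const clientB = read("car:13");
console.log(write("car:13", { Color: "Green" }, clientA.time).ok); // true
console.log(write("car:13", { Color: "Blue" }, clientB.time).ok);  // false
```

The rejected writer has to re-read and retry, or the application has to merge the two versions itself.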
Pitfalls – Column-Based Querying
Difficult to index non-primary key columns
  Any “WHERE” clause on non-primary key columns would have to scan through the entire database
  Will be very expensive on large/distributed databases
Forces one, exclusive view of the data
  (Could combine DBMSs to get the benefits of both)
Can’t guarantee that every row has any given field
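The cost difference between a key lookup and a “WHERE” on another field can be sketched with the car data from the earlier slides. This assumes an in-memory Map; byKey and where are hypothetical helpers, and the visited counter exists only to make the scan visible.

```javascript
// Sketch: lookup by key versus a WHERE-style filter on another field.
const rows = new Map([
  [1, { Make: "Honda", Model: "Accord", Year: 2002 }],
  [2, { Make: "Mitsubishi", Color: "Green" }],
  [3, { Model: "Escape", Color: "Black", Year: 2003 }],
  [4, { Color: "Blue", License: "YC3-XYZ" }],
  [5, { Make: "Ford", Model: "Mustang", Color: "Red", Year: 1998 }],
]);

// By key: one hash lookup, no matter how many rows exist.
function byKey(id) {
  return rows.get(id);
}

// By any other field: every row must be visited, and rows that
// simply lack the field have to be skipped.
function where(field, predicate) {
  const hits = [];
  let visited = 0;
  for (const [id, row] of rows) {
    visited++;
    if (field in row && predicate(row[field])) hits.push(id);
  }
  return { hits, visited };
}

console.log(where("Year", (y) => y > 2000)); // hits: [1, 3], visited: 5
```

On a partitioned store the same scan has to run on every node, which is why it gets expensive quickly.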
CouchDB’s Views
Cache the results of these very expensive queries
  Extract only those entries that have the value in question
  Apply filter as we progress through the database (height>18)
Build a temporary table containing the results
  Will not, necessarily, be the result of a consistent state
  May be outdated

// Get all rows from a particular user id
map: function(doc) {
  if (doc.user_id) {
    emit(doc.user_id, null);
  }
}
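What a map function like the one above produces can be sketched outside CouchDB. This assumes a plain array of documents and passes emit in as a second argument (in CouchDB it is a global); buildView is a hypothetical harness written for this example, not CouchDB's implementation.

```javascript
// Sketch of how a map function builds a view: run it over every document,
// collect what it emits, and keep the rows sorted by key.
function buildView(docs, mapFn) {
  const viewRows = [];
  for (const doc of docs) {
    const emit = (key, value) => viewRows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  viewRows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return viewRows;
}

// Same logic as the slide's map function, with emit passed explicitly.
function mapByUser(doc, emit) {
  if (doc.user_id) {
    emit(doc.user_id, null);
  }
}

const docs = [
  { _id: "a", user_id: 7 },
  { _id: "b" },               // no user_id, so nothing is emitted
  { _id: "c", user_id: 3 },
];

const cachedView = buildView(docs, mapByUser);
console.log(cachedView.map((r) => r.id)); // [ 'c', 'a' ]
```

The cached rows are a copy: documents added after the view was built do not show up until it is rebuilt, which is the "may be outdated" caveat.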
CouchDB’s Views
SELECT min(field) FROM table;
SELECT max(field) FROM table;
Min/Max of a Column in SQL
CouchDB’s Views
// Map function
function(doc) {
  var risk_exponent = -3.194
    + doc.CV_VOLOCC_1 * 1.080
    + doc.CV_VOLOCC_M * 0.627
    + doc.CV_VOLOCC_R * 0.553
    + doc.CORR_VOLOCC_1M * 1.439
    + doc.CORR_VOLOCC_MR * 0.658
    + doc.LAG1_OCC_M * 0.412
    + doc.LAG1_OCC_R * 1.424
    + doc.MU_VOL_1 * 0.038
    + doc.MU_VOL_M * 0.100
    + doc["CORR_OCC_1M X MU_VOL_M"] * -0.168
    + doc["CORR_OCC_1M X SD_VOL_R"] * 0.479
    + doc["CORR_OCC_1M X LAG1_OCC_R"] * -1.462;
  var risk = Math.exp(risk_exponent);

  // parse the date and "chunk" it up
  var pattern = new RegExp("(.*)-0?(.*)-0?(.*)T0?(.*):0?(.*):0?(.*)(-0800)");
  var result = pattern.exec(doc.EstimateTime);
  var day;
  if (result) {
    // new Date(year, month, day, hours, minutes, seconds, ms)
    // force rounding to 5 minutes, 0 seconds, for aggregation of 5 minute chunks
    var fivemin = 5 * Math.floor(result[5] / 5);
    day = new Date(result[1], result[2] - 1, result[3], result[4], fivemin, 0);
  }
  var weekdays = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
  emit([weekdays[day.getDay()], day.toLocaleTimeString()], {'risk': risk});
}

// Reduce function
function (keys, values, rereduce) {

  // algorithm for on-line computation of moments from
  //
  // Tony F. Chan, Gene H. Golub, and Randall J. LeVeque: "Updating
  // Formulae and a Pairwise Algorithm for Computing Sample
  // Variances." Technical Report STAN-CS-79-773, Department of
  // Computer Science, Stanford University, November 1979. url:
  // ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/79/773/CS-TR-79-773.pdf

  // so there is some weirdness in that the original was Fortran, index from 1,
  // and lots of arrays (no lists, no hash tables)

  // also consulted http://people.xiph.org/~tterribe/notes/homs.html
  // and http://www.jstor.org/stable/2683386
  // and (ick!) the wikipedia description of Knuth's algorithm
  // to clarify what was going on with
  // http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java

  /*
    combine the variance estimates for two partitions, A and B.
    partitionA and partitionB both should contain
      { S   : the current estimate of the second moment
        Sum : the sum of observed values
        M   : the number of observations used in the partition to
              calculate S and Sum
      }

    The output will be an identical object, containing the S, Sum and
    M for the combination of partitions A and B

    This routine is derived from original fortran code in Chan et al,
    (1979)

    But it is easily derived by recognizing that all you're doing is
    multiplying each partition's S and Sum by its respective count M,
    and then dividing by the new count Ma + Mb.  The arrangement of
    the diff etc is just rearranging terms to make it look nice.

    And then summing up the sums, and summing up the counts
  */
  function combine_S(partitionA, partitionB) {
    var NewS = partitionA.S;
    var NewSum = partitionA.Sum;
    var min = partitionA.min;
    var max = partitionA.max;
    var M = partitionB.M;
    if (!M) { M = 0; }
    if (M) {
      var diff = ((partitionA.M * partitionB.Sum / partitionB.M) - partitionA.Sum);
      NewS += partitionB.S
        + partitionB.M * diff * diff / (partitionA.M * (partitionA.M + partitionB.M));
      NewSum += partitionB.Sum;
      min = Math.min(partitionB.min, min);
      max = Math.max(partitionB.max, max);
    }
    return {'S': NewS, 'Sum': NewSum, 'M': partitionA.M + M, 'min': min, 'max': max};
  }

  /*
    This routine is derived from original fortran code in Chan et al,
    (1979), with the combination step split out above to allow that to
    be called independently in the rereduce step.

    Arguments:

    The first argument (values) is an array of objects.  The
    assumption is that the key to the variable of interest is 'risk'.
    If this is not the case, the seventh argument should be the
    correct key to use.  More complicated data structures are not
    supported.

    The second, third, and fourth arguments are in case this is a
    running tally.  You can pass in existing values for M (the number
    of observations already processed), Sum (the running sum of those
    M observations) and S (the current estimate of variance for those
    M observations).  Totally optional, defaulting to zero.  The fifth
    parameter is for the running min, and the sixth for the max.

    Pass "null" for parameters 2 through 6 if you need to pass a key
    in the seventh slot.

    Some notes on the algorithm.  There is a precious bit of trickery
    with stack pointers, etc that make for a minimal amount of
    temporary storage.  All this was included in the original
    algorithm.  I can't see that it makes much sense to include all
    that effort given that I've got gobs of RAM and am instead most
    likely processor bound, but it reminded me of programming in
    assembly so I kept it in.

    If you watch the progress of this algorithm in a debugger or
    firebug, you'll see that the size of the stack stays pretty small,
    with the bottom (0) entry staying at zero, then the [1] entry
    containing a power of two (2,4,8,16, etc), and the [2] entry
    containing the next power of two down from [1] and so on.  As the
    slots of the stack get filled up, they get cascaded together by
    the inner loop.

    You could skip all that, and just pairwise process repeatedly
    until the list of intermediate values is empty, but whatever.  And
    there seems to be some super small gain in efficiency in using
    identical support for two groups being combined, in that you don't
    have to consider different Ma and Mb in the computation.  One less
    divide I guess)
  */
  function pairwise_update(values, M, Sum, S, min, max, key) {
    if (!key) { key = 'risk'; }
    if (!Sum) { Sum = 0; S = 0; M = 0; }
    if (!S) { Sum = 0; S = 0; M = 0; }
    if (!M) { Sum = 0; S = 0; M = 0; }
    if (!min) { min = Infinity; }
    if (!max) { max = -Infinity; }
    var T;
    var stack_ptr = 1;
    var N = values.length;
    var half = Math.floor(N / 2);
    var NewSum;
    var NewS;
    var SumA = [];
    var SA = [];
    var Terms = [];
    Terms[0] = 0;
    if (N == 1) {
      // a single observation: its sum is itself and its variance is zero
      NewSum = values[0][key];
      NewS = 0;
      T = 1;
      min = Math.min(values[0][key], min);
      max = Math.max(values[0][key], max);
    } else if (N > 1) {
      // loop over the data pairwise
      for (var i = 0; i < half; i++) {
        // check min max
        if (values[2*i+1][key] < values[2*i][key]) {
          min = Math.min(values[2*i+1][key], min);
          max = Math.max(values[2*i][key], max);
        } else {
          min = Math.min(values[2*i][key], min);
          max = Math.max(values[2*i+1][key], max);
        }
        SumA[stack_ptr] = values[2*i+1][key] + values[2*i][key];
        var diff = values[2*i+1][key] - values[2*i][key];
        SA[stack_ptr] = (diff * diff) / 2;
        Terms[stack_ptr] = 2;
        while (Terms[stack_ptr] == Terms[stack_ptr-1]) {
          // combine the top two elements in storage, as
          // they have equal numbers of support terms.  this
          // should happen for powers of two (2, 4, 8, etc).
          // Everything else gets cleaned up below
          stack_ptr--;
          Terms[stack_ptr] *= 2;
          // compare this diff with the below diff.  Here
          // there is no multiplication and division of the
          // first sum (SumA[stack_ptr]) because it is the
          // same size as the other.
          var diff = SumA[stack_ptr] - SumA[stack_ptr+1];
          SA[stack_ptr] = SA[stack_ptr] + SA[stack_ptr+1]
            + (diff * diff) / Terms[stack_ptr];
          SumA[stack_ptr] += SumA[stack_ptr+1];
        } // repeat as needed
        stack_ptr++;
      }
      stack_ptr--;
      // check if N is odd
      if (N % 2 != 0) {
        // handle that dangling entry
        stack_ptr++;
        Terms[stack_ptr] = 1;
        SumA[stack_ptr] = values[N-1][key];
        SA[stack_ptr] = 0; // the variance of a single observation is zero!
        min = Math.min(values[N-1][key], min);
        max = Math.max(values[N-1][key], max);
      }
      T = Terms[stack_ptr];
      NewSum = SumA[stack_ptr];
      NewS = SA[stack_ptr];
      if (stack_ptr > 1) {
        // values.length is not power of two, so not
        // everything has been scooped up in the inner loop
        // above.  Here handle the remainders
        for (var i = stack_ptr-1; i >= 1; i--) {
          // compare this diff with the above diff---one
          // more multiply and divide on the current sum,
          // because the size of the sets (SumA[i] and NewSum)
          // are different.
          var diff = Terms[i] * NewSum / T - SumA[i];
          NewS = NewS + SA[i] + (T * diff * diff) / (Terms[i] * (Terms[i] + T));
          NewSum += SumA[i];
          T += Terms[i];
        }
      }
    }
    // finally, combine NewS and NewSum with S and Sum
    return combine_S(
      {'S': NewS, 'Sum': NewSum, 'M': T, 'min': min, 'max': max},
      {'S': S, 'Sum': Sum, 'M': M, 'min': min, 'max': max});
  }

  /*
    This function is attributed to Knuth, the Art of Computer
    Programming.  Donald Knuth is a math god, so I am sure that it is
    numerically stable, but I haven't read the source so who knows.

    The first parameter is again values, a list of objects with the
    expectation that the variable of interest is contained under the
    key 'risk'.  If this is not the case, pass the correct variable in
    the 7th field.

    Parameters 2 through 6 are all optional.  Pass nulls if you need
    to pass a key in slot 7.

    In order they are

    mean: the current mean value estimate
    M2:   the current estimate of the second moment (variance)
    n:    the count of observations used in the current estimate
    min:  the current min value observed
    max:  the current max value observed
  */
  function KnuthianOnLineVariance(values, M2, n, mean, min, max, key) {
    if (!M2) { M2 = 0; }
    if (!n) { n = 0; }
    if (!mean) { mean = 0; }
    if (!min) { min = Infinity; }
    if (!max) { max = -Infinity; }
    if (!key) { key = 'risk'; }

    // this algorithm is apparently a special case of the above
    // pairwise algorithm, in which you just apply one more value
    // to the running total.  I don't know why but Chan et al
    // (1979) and again in their later paper claim that using M
    // greater than 1 is always better than not.

    // but this code is certainly cleaner!  code based on Scott
    // Lamb's Java found at
    // http://www.slamb.org/svn/repos/trunk/projects/common/src/java/org/slamb/common/stats/Sample.java
    // but modified a bit

    for (var i = 0; i < values.length; i++) {
      var diff = (values[i][key] - mean);
      var newmean = mean + diff / (n + i + 1);
      M2 += diff * (values[i][key] - newmean);
      mean = newmean;
      min = Math.min(values[i][key], min);
      max = Math.max(values[i][key], max);
    }
    return {'M2': M2, 'n': n + values.length, 'mean': mean, 'min': min, 'max': max};
  }

  function KnuthCombine(partitionA, partitionB) {
    if (partitionB.n) {
      var newn = partitionA.n + partitionB.n;
      var diff = partitionB.mean - partitionA.mean;
      var newmean = partitionA.mean + diff * (partitionB.n / newn);
      var M2 = partitionA.M2 + partitionB.M2
        + (diff * diff * partitionA.n * partitionB.n / newn);
      var min = Math.min(partitionB.min, partitionA.min);
      var max = Math.max(partitionB.max, partitionA.max);
      return {'M2': M2, 'n': newn, 'mean': newmean, 'min': min, 'max': max};
    } else {
      return partitionA;
    }
  }

  var output = {};
  var knuthOutput = {};

  // two cases in the application of reduce.  In the first reduce
  // case the rereduce flag is false, and we have raw values.  We
  // also have keys, but that isn't applicable here.
  //
  // In the rereduce case, rereduce is true, and we are being passed
  // output for identical keys that needs to be combined further.

  if (!rereduce) {
    output = pairwise_update(values);
    output.variance_n = output.S / output.M;
    output.mean = output.Sum / output.M;
    knuthOutput = KnuthianOnLineVariance(values);
    knuthOutput.variance_n = knuthOutput.M2 / knuthOutput.n;
    output.knuthOutput = knuthOutput;
  } else {
    /*
      we have an existing pass, so should have multiple outputs to combine
    */
    for (var v in values) {
      output = combine_S(values[v], output);
      knuthOutput = KnuthCombine(values[v].knuthOutput, knuthOutput);
    }
    output.variance_n = output.S / output.M;
    output.mean = output.Sum / output.M;
    knuthOutput.variance_n = knuthOutput.M2 / knuthOutput.n;
    output.knuthOutput = knuthOutput;
  }
  // and done
  return output;
}
Min/Max of a Column – 334 lines
Available Platforms
Applications
Very popular in the Web 2.0 community
Appeal to web startups concerned with enormous scalability issues (traffic triples overnight)
  Need dynamic and easy scalability
  Not concerned with relationships between data
  Maybe no budget for Oracle or other systems
Server Solutions
Apache’s CouchDB
  Simple API, Views
Project Voldemort - http://project-voldemort.com/
  Automatic replication and scaling across multiple servers
Mongo - http://www.mongodb.org/
  Supports indexing other columns
Drizzle - https://launchpad.net/drizzle
  Hybrid Non-relational/relational
  Based on a MySQL core
“Cloud” Solutions
Amazon’s SimpleDB
  http://aws.amazon.com/simpledb/
Google’s AppEngine Datastore
  http://code.google.com/appengine/docs/python/datastore/
Summary
“Non-Relational Databases”
  Also called “Document-Oriented” or “Key-Value Databases”
Useful when
  Data is almost exclusively retrieved by its unique ID
  Queries can be expressed without needing complex joins
  There is a non-trivial amount of data and no capability to manage a distributed RDBMS
Bibliography
Amazon. Amazon SimpleDB. Accessed October 15, 2009. http://aws.amazon.com/simpledb/
Apache. CouchDB: Technical Overview. Accessed October 15, 2009. http://couchdb.apache.org/docs/overview.html
Bain, Tony. Is the Relational Database Doomed? ReadWrite Enterprise. February 12, 2009. http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php?p=3
Chang et al. Bigtable: A Distributed Storage System for Structured Data. Google Inc., 2007.
CouchDB. View_Snippets. September 20, 2009. http://wiki.apache.org/couchdb/View_Snippets
DeCandia et al. Dynamo: Amazon's Highly Available Key-value Store. Amazon.com, 2007.
Faler, Wille. Don't store non-relational data in a relational database. May 20, 2009. http://faler.wordpress.com/2009/05/20/dont-store-non-relational-data-in-a-relational-database/
Jones, Richard. Anti-RDBMS: A list of distributed key-value stores. January 19, 2009. http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
Kleppmann, Martin. Should you go Beyond Relational Databases? June 24, 2009. http://carsonified.com/blog/dev/should-you-go-beyond-relational-databases/
Project Voldemort. Accessed October 15, 2009. http://project-voldemort.com/quickstart.php
Taylor, Bret. How FriendFeed uses MySQL to store schema-less data. February 27, 2009. http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Questions?