Post on 05-Apr-2018
7/31/2019 Seeing Past SQL
Looking at data in terms of common usage patterns provides context for the SQL-NoSql-big data debate
and a basis for professionals building applications to choose between the available alternatives.
Seeing Past SQL
SQL has been an increasingly prominent part of the Information Technology landscape for over 30
years [4,3]. Current SQL implementations offer a veritable smorgasbord of features seemingly capable of
dealing with any data management scenario. While there's little doubt that relational algebra can be
used to describe any data representation and its manipulation, SQL perversions of the algebra aside, it
seems reasonable to ask: is SQL always the right answer? Two developments, map-reduce and NoSQL
data stores, reflect the fact that, for some, the answer is no. There has been lively debate around both
developments arguing the merits and demerits of the approaches [2,5,6]; it is useful to frame the
discussion in terms of the types of data involved.
[Figure: the four data types and their typical access patterns. Objects: simple key lookup; no joins, no scans. Events: high volume in, low volume out; unpredictable content; scan, scrub & grep. Tables: joins, inequalities; complex, fixed structure. Reports: cubes & spreadsheets.]

The picture is not intended as a data architecture but rather as a reflection of the way that data is
commonly stored and manipulated within an organization. For example, a single application may exhibit
all four types of data; events might be handled by several different implementations across an
organization. Using ecommerce as a background:

Objects are used to deliver web pages, serve ads, handle messages, user profiles and so on.
Objects are geographically distributed, (mostly) requiring low consistency, high availability,
medium volume, and medium schema stability
Events represent web page interactions (by-products of navigation, searching, and URL query
content). They have a low consistency requirement, low availability, high volume, and a volatile
schema
Tables support data analysis, for example product search optimization, or user profiling. They
require high consistency, low availability, medium volume, and a very stable schema
Reports provide actionable data principally used to control event handling and analysis. Reports
require high consistency, high availability, low volume, and have a very volatile schema
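The contrast between the object and event access patterns can be sketched in code. This is an illustrative sketch, not an implementation from the article: a hypothetical object store exposes only single-key lookup, while a hypothetical event log is append-only on the write side and a sequential scan on the read side.

```python
from typing import Any, Dict, Iterator, List

class ObjectStore:
    """Objects: simple key lookup -- no joins, no scans."""
    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Any:
        return self._data[key]  # single-key access is the only read path

class EventLog:
    """Events: high volume in, low volume out -- append, then scan/scrub/grep."""
    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> None:
        self._events.append(event)  # write side: append only, no index maintenance

    def scan(self, predicate) -> Iterator[dict]:
        # read side: full sequential scan, filtered by an arbitrary predicate
        return (e for e in self._events if predicate(e))

profiles = ObjectStore()
profiles.put("user:42", {"name": "Ada"})

log = EventLog()
log.append({"page": "/home", "user": "user:42"})
log.append({"page": "/cart", "user": "user:42"})
cart_hits = list(log.scan(lambda e: e["page"] == "/cart"))
```

Nothing about the `ObjectStore` interface helps with the `EventLog` workload, or vice versa, which is the point: the access requirements differ at the level of the interface, not just the implementation.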
The precise characterization of the four data types will vary depending on the industry. The distinction
between the types is important mainly because the access requirements differ significantly across the
four types and, if one data store is used for all four, it is very likely that some of the requirements will be
poorly met.
For example, tables require complex indexing, and the set-based operations typical of table accesses
require significant amounts of memory relative to the size of the data being manipulated. By contrast,
events must optimize for I/O throughput, not I/O access. What matters is transfer rate, not access time.
Indexes are just overhead, as they have to be built up and then torn down. Memory utilization for data
processing is bounded by buffer size. If the application is reading a buffer, processing it and then writing
it back, all it needs is enough memory to accommodate whatever parallelism exists in the I/O system.
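A minimal sketch of that event-processing shape, with hypothetical names: stream records from a source to a sink, touching each one once. Memory use is bounded by the I/O buffer (here, one line at a time), and no index is ever built or torn down.

```python
import io

def scrub_events(src, dst, needle: bytes) -> int:
    """Stream events from src to dst, counting lines that contain needle.

    Sequential transfer only: what matters is throughput, not access time.
    """
    matches = 0
    for line in src:          # one buffered line in memory at a time
        if needle in line:
            matches += 1
        dst.write(line)       # a real scrubber would rewrite the line here
    return matches

# illustrative run over an in-memory "file"
raw = b"GET /home\nGET /cart\nGET /cart\n"
src, dst = io.BytesIO(raw), io.BytesIO()
cart_count = scrub_events(src, dst, b"/cart")
```

The same function works unchanged on a multi-gigabyte log file opened with `open(path, "rb")`, because nothing in it depends on the size of the input.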
There are similar collisions between the requirements typical of object stores versus events, and object
stores versus tables and again between events and reports and tables and reports. It is possible to
accommodate all four in a single data store but, particularly once a system starts to scale up, collisions
between the requirements are likely to become a critical problem.
Any given organization will have what might be regarded as an ideal profile with respect to the four data
types. For example, the ecommerce realm is heavily biased towards events leading to a picture like this:
[Bar chart: relative size of objects, events, tables, and reports in ecommerce; events dominate, with the other three types far smaller.]
In practice, events, objects and reports tend to get mixed in with tables, departing from the ideal and
leading to issues with report generation times, event load times, object access times, data distribution
and so on. Obviously, keeping to the ideal profile will not solve these problems, but it will mitigate
them.
The sheer success of SQL data stores is, perversely, a part of the problem. There are adequate SQL
implementations of all four data types, providing an insidious path of least resistance for the developer
to follow. SQL expertise is the norm. Most developers faced with a data manipulation problem will come
up with a SQL answer, regardless of whether they are dealing with events, objects, reports or
(appropriately) tables.
Breaking the data realm up into function specific stores is no harder than dealing with the issues around
the various data types within a single store (just ask anyone who has had to deal with a distributed
object store built on SQL databases [6]). Making such a break at the tail end of the development cycle is
practically impossible. Once an application has been built around a unitary SQL database, introducing
separate stores for events, objects, reports and tables will amount to starting again from scratch.
It seems reasonable to look carefully at the application's data architecture in these terms before any
commitment is made to a specific store. The application should be designed with an appropriate
allocation of data types to data stores, rather than confronting the decision when it is too late.
Organizations need to recognize that, particularly so far as events and objects are concerned, it is
unlikely that individual development projects will come up with optimal solutions that can meet any
requirements beyond those of the project at hand. Common infrastructure for dealing with objects and
events is essential if server proliferation and consequent scaling limits are to be avoided.
So far as application design is concerned, it is important to be able to recognize as soon as possible how
the application's data will be allocated across the data types, while at the same time being able to defer
for as long as possible a commitment to a specific underlying store.
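One way to get both properties at once is to code against the access patterns rather than against a store. The sketch below, with illustrative names of my own, defines the object and event patterns as interfaces and binds concrete stores late, so the allocation of data to types is explicit from day one while the choice of store stays open.

```python
from typing import Any, Protocol

class ObjectAccess(Protocol):   # access pattern: key lookup only
    def get(self, key: str) -> Any: ...
    def put(self, key: str, value: Any) -> None: ...

class EventSink(Protocol):      # access pattern: append only
    def append(self, event: dict) -> None: ...

class InMemoryObjects:
    """Stand-in store; swap for a distributed store without touching callers."""
    def __init__(self) -> None:
        self._data: dict = {}
    def get(self, key: str) -> Any:
        return self._data[key]
    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

class InMemoryEvents:
    def __init__(self) -> None:
        self.events: list = []
    def append(self, event: dict) -> None:
        self.events.append(event)

class Application:
    """Depends only on the access patterns, not on any particular store."""
    def __init__(self, objects: ObjectAccess, events: EventSink) -> None:
        self.objects = objects
        self.events = events

    def view_page(self, user: str, page: str) -> None:
        self.events.append({"user": user, "page": page})

# the binding to concrete stores happens in one place, as late as possible
app = Application(InMemoryObjects(), InMemoryEvents())
app.objects.put("user:7", {"name": "Lin"})
app.view_page("user:7", "/home")
```

Replacing `InMemoryEvents` with, say, a log-shipping implementation later touches only the construction site, not the application logic.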
References
1. Pavlo, A., et al., "A Comparison of Approaches to Large-Scale Data Analysis," Proceedings of the 35th SIGMOD International Conference on Management of Data, 2009.
2. Dean, J., Ghemawat, S., "MapReduce: A Flexible Data Processing Tool," Communications of the ACM, Volume 53, Issue 1, January 2010.
3. Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM (Association for Computing Machinery) 13 (6): 377-387, June 1970.
4. Chamberlin, Donald D., Boyce, Raymond F., "SEQUEL: A Structured English Query Language," Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control (Association for Computing Machinery): 249-264, 1974.
5. Stonebraker, M., "SQL Databases v. NoSQL Databases," Communications of the ACM, Volume 53, Issue 4, April 2010.
6. Pujol, Josep M., et al., "The Little Engine(s) That Could: Scaling Online Social Networks," SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 Conference, 2010.