Seeing Past SQL

7/31/2019 Seeing Past SQL

Looking at data in terms of common usage patterns provides context for the SQL, NoSQL and big data debate and a basis for professionals building applications to choose between the available alternatives.

SQL has been an increasingly prominent part of the Information Technology landscape for over 30 years [3, 4]. Current SQL implementations offer a veritable smorgasbord of features seemingly capable of dealing with any data management scenario. While there's little doubt that relational algebra can be used to describe any data representation and its manipulation (SQL's perversions of the algebra aside), it seems reasonable to ask: is SQL always the right answer? Two developments, map-reduce and NoSQL data stores, reflect the fact that, for some, the answer is No! There has been lively debate around both developments, arguing the merits and demerits of the approaches [2, 5, 6]; it is useful to frame the discussion in terms of the types of data involved.

[Figure: the four types of data and their typical access patterns. Objects: simple key lookup, no joins, no scans. Events: high volume in, low volume out, unpredictable content; scan, scrub & grep. Tables: joins, inequalities, complex fixed structure. Reports: cubes & spreadsheets.]

The picture is not intended as a data architecture but rather as a reflection of the way that data is commonly stored and manipulated within an organization. For example, a single application may exhibit all four types of data; events might be handled by several different implementations across an organization. Using ecommerce as a background:

Objects are used to deliver web pages, serve ads, handle messages, user profiles and so on. Objects are geographically distributed, (mostly) requiring low consistency, high availability, medium volume, and medium schema stability.

Events represent web page interactions (by-products of navigation, searching, and URL query content). They have a low consistency requirement, low availability, high volume, and volatile schema.

Tables support data analysis, for example product search optimization, or user profiling. They require high consistency, low availability, medium volume, and a very stable schema.

Reports provide actionable data principally used to control event handling and analysis. Reports require high consistency, high availability, low volume, and have very volatile schema.

The precise characterization of the four data types will vary depending on the industry. The distinction between the types is important mainly because the access requirements differ significantly across the four types and, if one data store is used for all four, it is very likely that some of the requirements will be poorly met.
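As a rough illustration of how the requirements diverge, the ecommerce characterization above can be written down as data and compared pairwise. This is only a sketch: the `Profile` class, the `PROFILES` table and the `differing_requirements` helper are hypothetical names introduced here, and the level strings are simply transcribed from the descriptions of the four types.

```python
from dataclasses import dataclass, fields
from itertools import combinations


@dataclass(frozen=True)
class Profile:
    """Requirement levels for one data type (hypothetical structure)."""
    consistency: str
    availability: str
    volume: str
    schema: str


# Levels transcribed from the ecommerce characterization in the text.
PROFILES = {
    "objects": Profile("low", "high", "medium", "medium stability"),
    "events": Profile("low", "low", "high", "volatile"),
    "tables": Profile("high", "low", "medium", "very stable"),
    "reports": Profile("high", "high", "low", "very volatile"),
}


def differing_requirements(a: str, b: str) -> list[str]:
    """Return the requirement names on which two data types disagree."""
    pa, pb = PROFILES[a], PROFILES[b]
    return [f.name for f in fields(Profile)
            if getattr(pa, f.name) != getattr(pb, f.name)]


if __name__ == "__main__":
    # Print the points of disagreement for every pair of data types.
    for a, b in combinations(PROFILES, 2):
        print(f"{a} vs {b}: {', '.join(differing_requirements(a, b))}")
```

With these levels, every pair of types disagrees on at least three of the four requirements, which is one way of seeing why a single store is likely to serve some of them poorly.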

For example, tables require complex indexing, and the set-based operations typical of table accesses require significant amounts of memory relative to the size of the data being manipulated. By contrast, events must optimize for I/O throughput, not I/O access: what matters is transfer rate, not access time. Indexes are just overhead, as they have to be built up and then torn down. Memory utilization for data processing is bounded by buffer size. If the application is reading a buffer, processing it and then writing it back, all it needs is enough memory to accommodate whatever parallelism exists in the I/O system.
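The read-a-buffer, process, write-back pattern for events might look something like the following minimal sketch. It assumes newline-delimited event records, which the article does not specify; `scan_events`, `keep` and `BUFFER_SIZE` are hypothetical names. No index is built, and memory use is bounded by the buffer size plus the longest single record.

```python
import io
from typing import BinaryIO, Callable

BUFFER_SIZE = 64 * 1024  # assumed value; tune to the I/O system in use


def scan_events(src: BinaryIO, dst: BinaryIO,
                keep: Callable[[bytes], bool],
                buffer_size: int = BUFFER_SIZE) -> int:
    """Stream newline-delimited records from src to dst, keeping matches.

    Returns the number of records written.
    """
    written = 0
    carry = b""  # partial record left over from the previous buffer
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            break
        lines = (carry + chunk).split(b"\n")
        carry = lines.pop()  # the last piece may be an incomplete record
        for line in lines:
            if keep(line):
                dst.write(line + b"\n")
                written += 1
    if carry and keep(carry):  # final record with no trailing newline
        dst.write(carry + b"\n")
        written += 1
    return written


if __name__ == "__main__":
    source = io.BytesIO(b"GET /a\nPOST /b\nGET /c")
    sink = io.BytesIO()
    count = scan_events(source, sink, lambda rec: rec.startswith(b"GET"))
    print(count, sink.getvalue())
```

Running several such scans in parallel multiplies the memory needed by the degree of parallelism rather than by the size of the data, in line with the point made above.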

There are similar collisions between the requirements typical of object stores versus events, object stores versus tables, events versus reports, and tables versus reports. It is possible to accommodate all four in a single data store but, particularly once a system starts to scale up, collisions between the requirements are likely to become a critical problem.

Any given organization will have what might be regarded as an ideal profile with respect to the four data types. For example, the ecommerce realm is heavily biased towards events, leading to a picture like this:

[Figure: bar chart of relative size (0% to 100%) for objects, events, tables and reports.]

In practice, events, objects and reports tend to get mixed in with tables, resulting in a departure from the ideal and in issues with report generation times, event load times, object access times, data distribution and so on. Obviously, keeping to the ideal profile will not solve these problems, but it will mitigate them.

The sheer success of SQL data stores is, perversely, a part of the problem. There are adequate SQL implementations of all four data types, providing an insidious path of least resistance for the developer to follow. SQL expertise is the norm. Most developers faced with a data manipulation problem will come up with a SQL answer, regardless of whether they are dealing with events, objects, reports or (appropriately) tables.

Breaking the data realm up into function-specific stores is no harder than dealing with the issues around the various data types within a single store (just ask anyone who has had to deal with a distributed object store built on SQL databases [6]). Making such a break in the tail end of the development cycle, however, is practically impossible. Once an application has been built around a unitary SQL database, introducing separate stores for events, objects, reports and tables will amount to starting again from scratch.

It seems reasonable to look carefully at the application's data architecture in these terms before any commitment is made to a specific store. The application should be designed with an appropriate allocation of data types to data stores, rather than confronting the decision when it is too late.

Organizations need to recognize that, particularly so far as events and objects are concerned, it is unlikely that individual development projects will come up with optimal solutions that can meet any requirements beyond those of the project at hand. Common infrastructure for dealing with objects and events is essential if server proliferation and consequent scaling limits are to be avoided.

So far as application design is concerned, it is important to be able to recognize as soon as possible how the application's data will be allocated across the data types, while at the same time being able to defer for as long as possible a commitment to a specific underlying store.
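One common way to allocate data to a type early while deferring the choice of store is to have application code depend on a narrow interface per data type. The sketch below does this for objects only; `ObjectStore`, `InMemoryObjectStore`, `save_profile` and `load_profile` are hypothetical names, and the in-memory class is merely a stand-in, not a recommendation of any particular store.

```python
from typing import Optional, Protocol


class ObjectStore(Protocol):
    """Hypothetical interface for objects: key lookup, no joins, no scans."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...


class InMemoryObjectStore:
    """Stand-in implementation, usable until a real store is chosen."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


def save_profile(store: ObjectStore, user_id: str, profile: bytes) -> None:
    """Application code sees only the interface, never the concrete store."""
    store.put(f"profile:{user_id}", profile)


def load_profile(store: ObjectStore, user_id: str) -> Optional[bytes]:
    """Fetch a stored profile, or None if the user has none."""
    return store.get(f"profile:{user_id}")


if __name__ == "__main__":
    store = InMemoryObjectStore()
    save_profile(store, "42", b'{"name": "example"}')
    print(load_profile(store, "42"))
```

Because the interface deliberately offers nothing beyond key lookup, code written against it cannot drift into join or scan behaviour that would tie it to a table store; analogous interfaces could be drawn up for events, tables and reports.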

References

1. Pavlo, A., et al., A comparison of approaches to large-scale data analysis, Proceedings of the 35th SIGMOD International Conference on Management of Data, 2009

2. Dean, J., Ghemawat, S., MapReduce: a flexible data processing tool, Communications of the ACM, Volume 53 Issue 1, January 2010

3. Codd, E.F., A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, Volume 13 Issue 6, pp. 377-387, June 1970

4. Chamberlin, Donald D., Boyce, Raymond F., SEQUEL: A Structured English Query Language, Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control, pp. 249-264, 1974

5. Stonebraker, M., SQL databases v. NoSQL databases, Communications of the ACM, Volume 53 Issue 4, April 2010

6. Pujol, Josep M., et al., The little engine(s) that could: scaling online social networks, SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 conference
