Seeing Past SQL

  • 7/31/2019 Seeing Past SQL

    Looking at data in terms of common usage patterns provides context for the SQL-NoSQL-big data debate
    and a basis for professionals building applications to choose between the available alternatives.

    Seeing Past SQL

    SQL has been an increasingly prominent part of the Information Technology landscape for over 30
    years [3,4]. Current SQL implementations offer a veritable smorgasbord of features seemingly capable of
    dealing with any data management scenario. While there's little doubt that relational algebra can be
    used to describe any data representation and its manipulation, SQL perversions of the algebra aside, it
    seems reasonable to ask: is SQL always the right answer? Two developments, map-reduce and NoSQL
    data stores, reflect the fact that, for some, the answer is "No!" There has been lively debate around
    both developments, arguing the merits and demerits of the approaches [2,5,6]; it is useful to frame
    the discussion in terms of the types of data involved.

    The picture is not intended as a data architecture but rather as a reflection of the way that data is
    commonly stored and manipulated within an organization. For example, a single application may exhibit
    all four types of data; events might be handled by several different implementations across an
    organization. Using ecommerce as a background:

    Objects are used to deliver web pages, serve ads, handle messages, user profiles and so on.
    Objects are geographically distributed, (mostly) requiring low consistency, high availability,
    medium volume, and medium schema stability.

    [Figure: the four data types. Objects: simple key lookup, no joins, no scans. Events: high volume in,
    low volume out, unpredictable content; scan, scrub & grep. Reports: cubes & spreadsheets.
    Tables: joins, inequalities, complex fixed structure.]

    Events represent web page interactions (by-products of navigation, searching, and URL query
    content). They have a low consistency requirement, low availability, high volume, and volatile
    schema.

    Tables support data analysis, for example product search optimization, or user profiling. They
    require high consistency, low availability, medium volume, and a very stable schema.

    Reports provide actionable data principally used to control event handling and analysis. Reports
    require high consistency, high availability, low volume, and have very volatile schema.
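The four profiles just described can be written down as plain data, which makes the mismatches between them easy to see. The following is a minimal sketch in Python; the names `Profile`, `PROFILES` and `conflicts` are illustrative assumptions, and the values are copied directly from the ecommerce descriptions above rather than measured from any system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Requirement levels for one data type, as characterized in the text."""
    consistency: str
    availability: str
    volume: str
    schema: str


# Values transcribed from the ecommerce example; other industries will differ.
PROFILES = {
    "objects": Profile("low", "high", "medium", "medium stability"),
    "events": Profile("low", "low", "high", "volatile"),
    "tables": Profile("high", "low", "medium", "very stable"),
    "reports": Profile("high", "high", "low", "very volatile"),
}

FIELDS = ("consistency", "availability", "volume", "schema")


def conflicts(a: str, b: str) -> list:
    """Return the requirement names on which two data types differ."""
    pa, pb = PROFILES[a], PROFILES[b]
    return [f for f in FIELDS if getattr(pa, f) != getattr(pb, f)]


if __name__ == "__main__":
    for a, b in [("events", "tables"), ("objects", "events")]:
        print(a, "vs", b, "->", conflicts(a, b))
```

Every pair of types differs on at least three of the four requirements, which is one way of seeing why a single store tuned for one type tends to serve the others poorly.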

    The precise characterization of the four data types will vary depending on the industry. The distinction

    between the types is important mainly because the access requirements differ significantly across the

    four types and, if one data store is used for all four, it is very likely that some of the requirements will be

    poorly met.

    For example, tables require complex indexing, and the set-based operations typical of table accesses
    require significant amounts of memory relative to the size of the data being manipulated. By contrast,
    events must optimize for I/O throughput, not I/O access. What matters is transfer rate, not access time.
    Indexes are just overhead as they have to be built up and then torn down. Memory utilization for data
    processing is bounded by buffer size. If the application is reading a buffer, processing it and then writing
    it back, all it needs is enough memory to accommodate whatever parallelism exists in the I/O system.
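The read-a-buffer, process, write-back pattern for events can be sketched as a streaming filter. This is a minimal single-threaded illustration in Python, assuming newline-delimited event records; the function name `scan_events`, the `keep` predicate and the default buffer size are hypothetical choices, not taken from any particular system.

```python
import io


def scan_events(source, sink, keep, buffer_size=64 * 1024):
    """Copy the lines for which keep(line) is true from source to sink.

    Data is read in fixed-size chunks, so memory use is bounded by the
    buffer size plus the longest single line. No index is built.
    """
    carry = b""  # trailing partial line left over from the previous chunk
    kept = 0
    while True:
        chunk = source.read(buffer_size)
        if not chunk:
            break
        lines = (carry + chunk).split(b"\n")
        carry = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            if keep(line):
                sink.write(line + b"\n")
                kept += 1
    if carry and keep(carry):  # final line with no trailing newline
        sink.write(carry + b"\n")
        kept += 1
    return kept


if __name__ == "__main__":
    log = io.BytesIO(b"GET /a\nPOST /b\nGET /c")
    out = io.BytesIO()
    count = scan_events(log, out, lambda line: line.startswith(b"GET"))
    print(count, out.getvalue())
```

A real event pipeline would run several such scans in parallel to match the I/O system, but the memory bound per scan stays the same.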

    There are similar collisions between the requirements typical of object stores versus events, and object

    stores versus tables and again between events and reports and tables and reports. It is possible to

    accommodate all four in a single data store but, particularly once a system starts to scale up, collisions

    between the requirements are likely to become a critical problem.

    Any given organization will have what might be regarded as an ideal profile with respect to the four data

    types. For example, the ecommerce realm is heavily biased towards events leading to a picture like this:

    [Figure: bar chart of relative size (0% to 100%) for objects, events, tables and reports.]

    In practice, events, objects and reports tend to get mixed in with tables. The resulting departure from
    the ideal causes issues with report generation times, event load times, object access times, data
    distribution and so on. Keeping to the ideal profile will not solve these problems, but it will
    mitigate them.

    The sheer success of SQL data stores is, perversely, a part of the problem. There are adequate SQL
    implementations of all four data types providing an insidious path of least resistance for the developer
    to follow. SQL expertise is the norm. Most developers faced with a data manipulation problem will come
    up with a SQL answer, regardless of whether they are dealing with events, objects, reports or
    (appropriately) tables.

    Breaking the data realm up into function-specific stores is no harder than dealing with the issues around
    the various data types within a single store (just ask anyone who has had to deal with a distributed
    object store built on SQL databases [6]). Making such a break at the tail end of the development cycle
    is, however, practically impossible. Once an application has been built around a unitary SQL database,
    introducing separate stores for events, objects, reports and tables will amount to starting again from
    scratch.

    It seems reasonable to look carefully at the application's data architecture in these terms before any
    commitment is made to a specific store. The application should be designed with an appropriate
    allocation of data types to data stores, rather than confronting the decision when it is too late.

    Organizations need to recognize that, particularly so far as events and objects are concerned, it is

    unlikely that individual development projects will come up with optimal solutions that can meet any

    requirements beyond those of the project at hand. Common infrastructure for dealing with objects and

    events is essential if server proliferation and consequent scaling limits are to be avoided.

    So far as application design is concerned, it is important to be able to recognize as soon as possible how
    the application's data will be allocated across the data types, while at the same time being able to defer
    for as long as possible a commitment to a specific underlying store.
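One way to allocate data to a type early while deferring the choice of store is to have application code depend only on a narrow interface for that type. The following Python sketch assumes an object store limited to key lookup (no joins, no scans); `ObjectStore`, `InMemoryObjectStore` and `render_profile` are hypothetical names used for illustration only.

```python
from typing import Dict, Optional, Protocol


class ObjectStore(Protocol):
    """What the application needs from an object store: key lookup only."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...


class InMemoryObjectStore:
    """Stand-in implementation; a real deployment would swap in another store."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._items.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value


def render_profile(store: ObjectStore, user_id: str) -> str:
    """Application code sees only the interface, never the concrete store."""
    raw = store.get(f"profile:{user_id}")
    return raw.decode("utf-8") if raw is not None else "unknown user"


if __name__ == "__main__":
    store = InMemoryObjectStore()
    store.put("profile:42", b"Ada")
    print(render_profile(store, "42"))
    print(render_profile(store, "7"))
```

Because the interface offers no query language, code written against it cannot quietly start treating objects as tables, which keeps the later choice of store open.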

    References

    1. Pavlo, A., et al., A comparison of approaches to large-scale data analysis, Proceedings of the
       35th SIGMOD International Conference on Management of Data, 2009.

    2. Dean, J., Ghemawat, S., MapReduce: a flexible data processing tool, Communications of the
       ACM, Volume 53, Issue 1, January 2010.

    3. Codd, E.F., A Relational Model of Data for Large Shared Data Banks, Communications of the
       ACM, Volume 13, Issue 6, June 1970, 377-387.

    4. Chamberlin, Donald D., Boyce, Raymond F., SEQUEL: A Structured English Query Language,
       Proceedings of the 1974 ACM SIGFIDET Workshop on Data Description, Access and Control,
       1974, 249-264.

    5. Stonebraker, M., SQL databases v. NoSQL databases, Communications of the ACM, Volume 53,
       Issue 4, April 2010.

    6. Pujol, Josep M., et al., The little engine(s) that could: scaling online social networks,
       SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 conference.