Martin Willcox - What is a Data Lake, Anyway?

Martin Willcox

What is a Data Lake, Anyway?

DEBUNKING THE MYTHS Speaker 10 of 17

Followed by

Anthony Miller

@Willcoxmnk

2 © 2014 Teradata

 One of the Big Data labels that we risk over-loading to complete abstraction is the idea of a “Data Lake”…

“…store all data present and future and create a centralised data archive location.”

“A large object-based repository that holds data in its native format”

“Sometimes called the bit bucket or the landing zone”

“All Water and Little Substance”

“As more and more applications are created that derive value from… new types of data… the Data Lake forms”

 …and some of the discussions sound eerily familiar

“Data lakes can help resolve the nagging problem of accessibility and data integration”

Data accessibility and integration? Isn’t that what the Data Warehouse is for?

 So is the Data Lake a new architectural construct? Or are we just re-platforming Data Marts?

Simple, single subject area Dimensional Data Marts – with all of the dimensions pre-joined to the fact table? One-per-workload / application?

Is this really the future of Enterprise Analytics? Or circa 1995 silo, departmental Decision Support Systems warmed-over?

 Take the merits of the different technologies out of the equation – and this is what some of us are thinking…

 …but there are no free lunches in Information Management – merely more and different options

Explicit, or implicit, there is always, always, always (at least one) schema

Agile application development, versus agile data acquisition

None of the information management strategies / technologies are magic - “pay me now, or pay me later”
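
The point about implicit schemas can be made concrete with a minimal sketch. It assumes a handful of hypothetical JSON events that were landed without any declared schema; the field names (`user`, `amount` and so on) and the function `read_amounts` are invented for illustration and are not taken from the talk. No schema was paid for at write time, so the reader has to supply one, and records that do not fit it fall out at read time.

```python
import json

# Hypothetical raw events, landed "as is": no schema was declared on write.
RAW_EVENTS = [
    '{"user": "a17", "amount": "12.50", "ts": "2014-03-01"}',
    '{"user": "b42", "amount": 7, "timestamp": "2014-03-02"}',
    '{"user_id": "c99", "total": "3.20"}',
]


def read_amounts(lines):
    """Apply an implicit schema at read time: expects 'user' and 'amount'."""
    parsed, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            # The schema lives here, in the reading code, not in the store.
            parsed.append((record["user"], float(record["amount"])))
        except KeyError:
            # "Pay me later": the mismatch only surfaces when the data is read.
            rejected.append(record)
    return parsed, rejected


parsed, rejected = read_amounts(RAW_EVENTS)
print(parsed)
print(len(rejected), "record(s) did not match the reader's schema")
```

Every application that reads the same raw events has to repeat, or at least agree on, the assumptions buried in `read_amounts`, which is the cost that an explicit, pre-defined schema pays up front.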

 Big Data Are Plural

For the foreseeable future, we will need multiple Information Management strategies - and multiple Information Management technologies

Integration becomes a critical concern

DISCOVERY PLATFORM

DATA WAREHOUSE

DATA PLATFORM

– Gartner – Logical Data Warehouse

– Forrester – Enterprise Data Hub

– Teradata – Unified Data Architecture

 A definition of the Data Lake (Data Reservoir)

A centralised, consolidated, persistent store of raw, un-modelled and un-transformed data from multiple sources / silos (without an explicit, pre-defined schema, without externally defined metadata – and without guarantees about the quality, provenance and security of the data)

Agile data acquisition – a haystack to go looking for needles…

Now that is new, interesting and (potentially) very, very useful…

…with a natural storage model for complex, multi-structured data…

…support for efficient non-relational computation…

…and provision for cost-effective storage of large and noisy data-sets.

 Data. Science

 STOP PRESS: Laws of Physics* Unchanged!  (* More specifically, the 2nd Law of Thermodynamics)

None of the new information management strategies and technologies is by itself a cure for information entropy – data silos form naturally, just like lakes form naturally

Left to its own devices, does nature tend to give us a single, beautiful lake? Or a messy patchwork of lakes, plural?

 Summary and conclusions