Can Data Warehousing Survive Big Data?

Sponsored by HP Vertica

Speaker: Dr. Barry Devlin, Founder and Principal of 9sight Consulting

Moderated by Ron Powell

Ron Powell: Welcome, everyone, to our web event, 'Can Data Warehousing Survive Big Data', sponsored by HP Vertica. I'm Ron Powell, Associate Publisher and Editorial Director of the BeyeNETWORK, a part of TechTarget, and I will be the moderator for this event.

Copyright © 2011 9sight Consulting, All Rights Reserved

Dr Barry Devlin

Founder & Principal

9sight Consulting

Can Data Warehousing Survive Big Data?

B-Eye-Network Webinar

August 2011

This presentation features Dr. Barry Devlin. Barry is one of the BeyeNETWORK's expert channel leaders and is founder and principal of 9sight Consulting. He is also a founder of the data warehousing industry and among the foremost authorities worldwide on business intelligence and beyond. He is a widely respected consultant, lecturer and author, with more than 30 years of experience in the IT industry. For more than 25 years, data warehousing has been the accepted architecture for providing information to support decision makers. However, big data challenges many of the underlying principles behind data warehousing. In this web seminar, Barry will explain how big data challenges the traditional data warehouse architecture and how that architecture must evolve. Barry will also provide key considerations for using big data alongside an existing data warehouse. And now, without further ado, here is Barry Devlin. Welcome, Barry!

Barry Devlin


Founder and Principal

9sight Consulting, www.9sight.com

Dr. Barry Devlin is a founder of the data warehousing industry and among the foremost authorities worldwide on business intelligence (BI) and beyond. He is a widely respected consultant, lecturer and author of "Data Warehouse: From Architecture to Implementation". Barry has 30 years of experience in the IT industry, previously with IBM, as an architect, consultant, manager and software evangelist.

As founder and principal of 9sight Consulting (www.9sight.com), Barry provides strategic consulting and thought-leadership to buyers and vendors of BI solutions. He is currently developing a new architectural model for fully consistent business support, from informational to operational and collaborative: Business Integrated Insight (BI2). Based in Cape Town, South Africa, Barry's knowledge and expertise are in demand both locally and internationally.

Email: [email protected] (preferred contact method)

Phone: (S. Africa) +27 71 557 7479

(Ireland) +353 86 237 5128

Barry Devlin: Thanks, Ron! It is a pleasure to be here with you and to be discussing this really interesting topic of big data. It has become so interesting and so important in the marketplace recently that it's hard to avoid. If you were to listen to its proponents, you would think that you could use it for everything from selling TVs to solving world hunger. And, from a data warehousing point of view, it seems to be causing quite a bit of angst among data warehouse implementers, who were the original big data people after all. So, perhaps the first question that we would like to pose during this webinar is: what is big data, and why is there all of this buzz about it? Let's start from there and try to figure out where we go with big data and data warehousing.


The world of big data is actually big. That's for sure. I have here a couple of charts taken from IDC's Expanding Digital Universe study, which has been running for four years now, since 2007, sponsored by EMC. They have been tracking the amount of digital information that is stored in the world. I have taken a subset of that information: the information that they attribute to enterprises, i.e. that is managed and controlled by enterprises. Clearly there is further growth beyond this as well, which is the stuff that people gather on their smart phones, on their cameras, on their video recorders and so on, and that seldom gets seen in the enterprise world. We're just focusing for the moment on enterprise data.

Down in the bottom left of this screen, you will see a chart which shows the growth of enterprise hard information since 2005. I'm going to use the terms hard and soft information to represent two different types of information. Hard information is the stuff that we're used to. We store it in relational databases and spreadsheets. It's structured in a way that works well for computers. It allows us to do things like summarize it. It is the typical data that we have loved and used for years in the computer industry, and particularly in IT shops. Soft information, which is often called unstructured information, is the other stuff. It's textual. It's image. It's data that has a much less useful structure in terms of its friendliness for a computer. I tend not to like the term unstructured information because, being a bit of a purist, unstructured information to me would be noise. So, I call it soft information. Just bear with me as we go through this and talk about hard and soft information.

Information volumes are growing exponentially… much of it from the world of “Big Data”.

Volumes
– Hard Info: CAGR 22%
– Soft Info: CAGR 60%

Type
– Hard Info: 15% in 2005
– As low as 5% in 2010

Source
– Internal → Largely external

[Figure: two bar charts of enterprise information volumes in exabytes, 2005 to 2012. Left: Enterprise Hard Info (vertical axis 0 to 16). Right: Soft Info and Hard Info stacked (vertical axis 0 to 600). 1 Exabyte = 1,000,000 Terabytes.]

Source: IDC “Expanding Digital Universe” 2007-2011, sponsored by EMC
– Note 1: Categories loosely defined
– Note 2: Some level of “guesstimation”!
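The growth rates on this slide can be sanity-checked with a little arithmetic. The sketch below is only an illustration: the function names are invented here, and the starting figure of about 4 exabytes of hard information in 2005 is the one quoted in the talk, not a value taken from the IDC study itself.

```python
TERABYTES_PER_EXABYTE = 1_000_000  # as stated on the slide


def project(start_value: float, cagr: float, years: int) -> float:
    """Compound a starting value forward at a fixed annual growth rate."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Work backwards from two end points to the annual growth rate."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Hard information: about 4 EB in 2005, growing at a 22% CAGR until 2012.
hard_2012 = project(4.0, 0.22, 2012 - 2005)
print(f"Projected hard information in 2012: {hard_2012:.1f} EB")

# The same end points, read the other way round.
rate = implied_cagr(4.0, 16.0, 2012 - 2005)
print(f"CAGR implied by 4 EB -> 16 EB over 7 years: {rate:.1%}")

# Soft information: how much does any volume multiply in 5 years at 60%?
soft_multiple = project(1.0, 0.60, 5)
print(f"Five years at 60% CAGR multiplies volume by {soft_multiple:.1f}")
```

At 22% a year, 4 exabytes grows to roughly 16 exabytes over seven years, which agrees with the chart. At 60% a year, any volume grows more than tenfold in five years, which is why the hard-information share shrinks so quickly.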

Hard information, as I said, is the stuff that we've been using in our enterprises, in our data warehouses, our operational systems, etc., for years. If you look at that chart, you will see that it is growing worldwide at a compound annual growth rate of 22%, from about 4 exabytes back in 2005 to somewhere around 16 exabytes in 2012. An exabyte, by the way, is 1,000,000 terabytes, so we're getting into big figures here. But when we talk about soft information, we're really getting into very big figures and very big growth rates. In the graph on the right-hand side of this slide, the red bars are soft information. Just look at that. Wow, a 60% compound annual growth rate! That is the sort of growth rate that is beyond our capacity to really understand. The other interesting thing is the proportions: right down at the bottom of those pillars you will see a tiny piece of blue. That tiny piece of blue represents the proportion of hard information in the whole pillar. So, hard information, as a percentage, has decreased from 15% in 2005 to as low as 5% last year. This is a very interesting change, both technologically and conceptually. And there is one thing which doesn't show on this graph but has been pointed out in the study: the source of this information has moved. It's gone from internal to largely external to the enterprise.

So, clearly it's big, as we've just seen. But does "big" tell us enough about what it is? Elephants are big. Whales are big. But they're very different beasts. We need to look a little deeper, and when we do, we see that there is a different view of what it means. But, first of all, let me just go back to the source of all information, which is Wikipedia of course these days. What is the definition of big data? It's a term applied to data sets whose size is beyond the ability of commonly used software to capture, manage and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set. My observation on reading that is that it's rather vague about the data size and type, and it leaves open the possibility of being ever-changing. It includes the sort of thing that we call, in my world, "whatever you'd like yourself". And that's very useful for vendors, because they can basically say that whatever tool they have and whatever piece of technology they want to sell, it's good for big data. My view, in some sense, is that big data is a largely useless name, except for marketing.

But let's go back to our elephants and whales. What is going on here in big data? How can we characterize it? I'd like to characterize it first simply in terms of where it's coming from. There are two broad sources of the buzz, the excitement around big data, and just to be a little confusing I'm going to start with the second one.


Big data, first of all, came to prominence in the world of science and engineering. People in astronomical measurement, physics and biological data capture have been talking about big data for probably 10 years. For example, as I've put here, the CERN accelerator, under the ground in Switzerland and France, generates 40 TB of data per second when it's running, which is not very often. So, when you have an experiment, you've got a lot of data flowing very fast. That was the original source of the interest in big data. The other one, which is of a similar type if you think about the actual structure of the data, is growing fast and is of great interest to commercial organizations: data coming from sensors, from machines, from measuring devices, whether mechanical, geolocational or RFID. There is huge growth in that sort of machine-generated data. Just to give you an example, over 2 billion RFID tags were sold in 2010, and every time one passes by a sensor you're going to get a squirt of data. So, we're talking about a lot of data, but reasonably structured or highly structured data, from these types of sources.

But the real buzz of the last couple of years comes from what you might call the denizens of the web. Here, we've got the Facebooks, the Twitters, the Yahoos, the eBays, the LinkedIns, the Googles, you name it, of this world, where an enormous volume of data is being tracked, gathered, managed and analyzed: web pages, links to and from them, web click logs, server logs, links in social networks, all used for things like behavioral analysis, which we'll come back to later. This is a very different sort of data, if you think about it. It's very soft. It's very changeable. It has a different set of characteristics than the data that we're used to.

Indeed, both of these types of data, the science and engineering data and the web data that I'm mentioning on this slide, have characteristics that are quite different from the sort of data that we've managed, known and loved in data warehousing for many years. And that, really, more than the volume, is the thing that we have to worry about and look at when we think about big data.

The sources of the buzz

Web denizens
– Web pages and links
– Web click logs, server logs
– Social links
– Behaviour analysis

Science and engineering
– Astronomical, physics and biological data capture
– Sensors – mechanical, geo-locational, RFIDs

CERN – generating 40 TB/sec

Over 2 billion RFID tags sold in 2010

Barry Devlin: So, what I'd like to do is to briefly look at some of the tools that people talk about, and indeed use, in the area of big data. They are a very different set of tools from those we're used to in the data warehouse world. The first phrase you will hear, no doubt, is MapReduce. MapReduce is, in a way, simply a software framework: a framework for processing huge data sets in parallel, in a distributed fashion, over a large number of computers. It was introduced by Google in 2004 and is proprietary to Google. If you think about it as a software framework, it is simply a way of writing programs, and that is, in many ways, what this slide talks about, first of all MapReduce as a framework.

Then we had this thought: "Well, as I've got to write programs and code to process this big data in a distributed manner, perhaps I could do with some tools to manage it, some powerful pieces of software that we could use." We have here a list of wonderfully named products and tools from the open source community: Hadoop, HBase, Hive, Pig, Oozie, Sqoop, Flume and (Inaudible). You name it, and we've got the most wonderful set of names. I am not going to go through them in great detail to tell you what each one of them does. What I want to do is summarize this really quickly. MapReduce, as I've said, is a framework in which we write programs. The rest of these tools are actually pieces of a data system. They're pieces of a database system that enable you to manage the process of handling the data, storing the data, looking after it, making sure that jobs can restart, making sure that you've got consistency, all of the good things that databases do.

So, the thought was that, as you go down this road of handling large sets of data in a distributed fashion, you need a set of tools. Hadoop and all of the tools that are mentioned on this page are essentially pieces of tooling that enable you to manage this data in a better way. It's worth thinking about this: when this big data world started, people were saying, "Well, we're never going to get databases to do this stuff. We have to do it by hand. We have to do it manually." Then, as we got used to using this stuff and got to see the sort of management and control issues that surround big data, we had to start introducing the sort of tooling that you've got in a database anyway. Maybe it's not all of it, but it is certainly a part of it.
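To make the idea of MapReduce as "a way of writing programs" concrete, here is a minimal single-machine sketch of the programming model. It is not Google's implementation or the Hadoop API; the function names and the sample input are invented for illustration, and a real framework would run the map and reduce steps on many machines and handle restarts and data placement.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    # On a cluster each line (or file split) would be mapped on a different
    # machine; here the "workers" are just iterations of a generator.
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical input standing in for a large, distributed log.
log_lines = ["big data big buzz", "data warehouse meets big data"]
word_counts = map_reduce(log_lines)
print(word_counts)
```

The programmer supplies only the map and reduce functions. Everything else, which here is a few lines of plain Python, is what the surrounding tooling has to provide at scale.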

The MapReduce framework & Hadoop are key technologies

Web denizens and science/engineering favour DIY and Open Source!

• MapReduce – Powerful, parallel data processing framework
• HDFS – Self-healing distributed file system
• Hadoop Common – a set of utilities that support the Hadoop subprojects
• HBase – Hadoop database for random read/write access
• Hive – SQL-like queries and tables on large datasets
• Pig – Dataflow language and compiler
• Oozie – Workflow for interdependent Hadoop jobs
• Sqoop – Integrates databases with Hadoop
• Flume – Highly configurable streaming data collection
• Zookeeper – Coordination service for distributed apps
• Hue – User interface framework / SDK for visual Hadoop apps

And you will no doubt also have heard about NoSQL, which is another flavor of big data thinking and which brings some additional technologies. I have here a categorization, thanks to Rick van der Lans, of the various types of NoSQL data stores. Up on the top right, he has identified wide column stores, which include the Hadoops and the Cassandras, and so on, of this world. But there are various other types of stores and technologies, which again I'm not going to go through. I've given you some examples here on the screen of the tools and the products involved. But what's going on here? We're trying to do what we did in the early days of data processing. We're trying to optimize the technology to deal with the sizes, the quantities, the speeds and the varieties of the data that we need to handle. So, there are various ways of looking at these different tools.

NoSQL – another flavour of Big Data thinking with some additional technologies


Thanks to Rick van der Lans for this classification

So, what this says to me, at least as an old data warehousing hack, is that, in comparison to where we were when we started introducing data warehousing, with this new stuff, big data, with its huge volumes and its different varieties, we have just come through the phase of looking at different tools and beginning to standardize around a particular tool set and particular ways of doing things, and now we're going to start productionalizing that in a big way. You will see that this has already happened, of course, in the large Amazons and eBays of this world, who have had to deal with big data from the get-go. However, for many other companies, as the data that we're talking about, this less structured data, this external data, this machine-generated data, becomes more important in all aspects of business, we will see that we, in all other areas of the IT world, have to begin to deal with it too. And the question that probably arises, as we posed at the beginning of the presentation, is: will big data kill the data warehouse? Here, we're back to our (Inaudible) metaphor. If our data warehouse was originally all about big data, big data from the operational world, from the transaction world, is this new set of big data, coming from machines, from the web, from server logs and so on, going to become more important? Is that going to kill the data warehouse? Let's try to answer that question.

WILL BIG DATA KILL THE DATA WAREHOUSE?


So, as I've said, big data exists in a wide variety of shapes and sizes, but let's try to do a little bit of in-depth study of what that means. I've been looking at this in a more scientific way, trying to come up with some sort of data classification framework that would enable us to look at data in general, but specifically at the types of data that we talk about as big data, and to understand how that might look. Here, I've shown one particular grouping, with two axes, one called anatomy and the other called temporality. There are other groupings which I've been working with, but let's just look at this one for this particular webinar.

The anatomy of data is really about its internal shape. What does it look like if I delve into it? What structure is in it? What do I need to consider about what it does, and what do I need to know about it before I use it? The anatomy axis here is divided into five, so let's run down it quickly. One class of anatomy is schematic data. It is used for generic programming work and is what we do mostly in relational databases, the sort of generic database that has a schema defined in advance. In the past that might have been hierarchical or network databases, but these days it's mostly relational. That schematic set of data is probably the one that you are most familiar with. Compound data is a combination of schematic and textual and, as I say here, it's mostly XML these days. Compound is a very useful type of data because it covers the large world of metadata. It works very well for metadata, but it also works for other things. So, as we begin to have more flexibility in the world, compound data becomes more interesting. Programmatic anatomy is used for specific computer processing tasks. In comparison to the schematic, what it's saying is, "You know what? I've got a particular piece of work I need to do. I need to write a program. When I write that program, I need to know what sort of data I'm going to use and what shape it's going to be. And I can decide, as I design, develop and deliver that program, what the shape of the data is going to look like." It can change if I need to change it, because only my program uses it. So, it's much more flexible, but it still has a structure to it. As we move on to textual, we begin to get less structure. Here we get emails and documents, and if you look at their structure you can imagine some fields, like subject or receiver, and so on, but also large chunks of text wherein the meaning and the value is not (Inaudible) by field. It has to be understood as part of the context of what's written there. And, finally, you get into the multiplex category, which covers image, audio and video.

So, that's one axis. That's the Y axis. The X axis is temporality. Exactly what sort of timeframe are we looking at? This is an old one that you know pretty well as a data warehouse person. It runs from in-flight, where the data is on the network, to live, which is operational or machine-generated data that gets over-written whenever we need to. Stable is live data that is never over-written, and you may think of this as stuff that you would put into your data warehouse or data mart.

Big Data exists in a wide range of shapes (and sizes)

Anatomy
– Schematic: generic work, mostly relational
– Compound: mostly XML
– Programmatic: specific computer processing
– Textual: e-mails and docs
– Multiplex: image, audio, video

Temporality
– In-flight: on the network
– Live: operational and machine-generated data
– Stable: live data that is never over-written
– Historical: an agreed record of past data values
– Archived: historical or stable data that is seldom used

Relational DBs address only a (small) segment

[Figure: grid with Temporality on the horizontal axis (In-flight, Live, Stable, Historical, Archived) and Anatomy on the vertical axis (Schematic, Compound, Programmatic, Textual, Multiplex), with a box marked "Relational DB".]
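The grid on this slide can be written down as two enumerations, one per axis, which makes it easy to see how few of the 25 cells a relational database covers. This is a sketch under stated assumptions: the class and function names are invented here, the exact cells inside the "Relational DB" box cannot be recovered from the slide text and are guessed, and the example data items and their classifications are hypothetical.

```python
from enum import Enum
from itertools import product


class Anatomy(Enum):
    SCHEMATIC = "generic work, mostly relational"
    COMPOUND = "mostly XML"
    PROGRAMMATIC = "specific computer processing"
    TEXTUAL = "e-mails and docs"
    MULTIPLEX = "image, audio, video"


class Temporality(Enum):
    IN_FLIGHT = "on the network"
    LIVE = "operational and machine-generated data"
    STABLE = "live data that is never over-written"
    HISTORICAL = "an agreed record of past data values"
    ARCHIVED = "historical or stable data that is seldom used"


# Every cell of the grid is one (anatomy, temporality) combination.
GRID = list(product(Anatomy, Temporality))

# ASSUMPTION: the slide only says relational databases address a small
# segment of the grid. The three cells below are a guess for illustration.
RELATIONAL_CELLS = {
    (Anatomy.SCHEMATIC, temporality)
    for temporality in (Temporality.LIVE, Temporality.STABLE, Temporality.HISTORICAL)
}


def fits_relational(anatomy, temporality):
    """Return True if this kind of data falls inside the assumed relational box."""
    return (anatomy, temporality) in RELATIONAL_CELLS


# Hypothetical data items and where they might sit on the grid.
examples = {
    "sales transactions table": (Anatomy.SCHEMATIC, Temporality.STABLE),
    "customer e-mail archive": (Anatomy.TEXTUAL, Temporality.ARCHIVED),
    "RFID reads on the wire": (Anatomy.PROGRAMMATIC, Temporality.IN_FLIGHT),
}
for name, (anatomy, temporality) in examples.items():
    verdict = "relational" if fits_relational(anatomy, temporality) else "other store"
    print(f"{name}: {anatomy.name} / {temporality.name} -> {verdict}")

coverage = len(RELATIONAL_CELLS) / len(GRID)
print(f"Assumed relational coverage: {coverage:.0%} of the grid")
```

Whatever the exact boundaries of the box, the point of the exercise is the same: most combinations of anatomy and temporality fall outside it.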

Historical is when we have really decided that it's an agreed record of the past. It is stable, but we're actually going to keep it. And, finally, archived is a special case of historical or stable data that is seldom used. Looking at that grid, you can instantly see, I think, that what we do in relational databases today sits up at the top of that chart. It's more or less in the middle, in the green box that I've drawn on the chart, and it addresses a relatively small segment.

However, just in passing, if you think about the whole area of the big data warehouse, there is a long and growing history. Data warehouses have been big. I've mentioned this before, but do you know how big they are? Way back in the dim and distant past, Wal-Mart was always talked about as one of the biggest data warehouses. Back in the early '90s, we would talk about hundreds of gigabytes. Today, the Wal-Mart data warehouse is, as far as I understand, larger than 1.5 petabytes. eBay, at the Strata Conference in February of this year, talked about a 4 PB data warehouse on Teradata. Vertica apparently has seven customers with data warehouses much larger than a petabyte. So, although most data warehouses are considerably smaller, there are large data warehouses out there today and they're running on relational technology. It has scaled, and it has scaled by doing things like columnar databases, by doing compression, and by using the technology, the processes and the new memory advances to do things that we couldn't do in the past. So, clearly, relational databases have not run out of steam. They have grown. They continue to grow, and they continue to support the growing amount of that highly structured schematic data that we've known and loved in the past.

But note: Big Data Warehouses also have a long and growing history…

Wal-Mart has grown from 100s of GB in the early '90s to >1.5 petabytes today

eBay – >4 petabyte Data Warehouse on Teradata

Vertica has 7 customers with data warehouses/marts >1 petabyte

Most data warehouses are considerably smaller

But, relational technology does scale to significantly big


But, as we move on and look at the whole area of the data warehouse, the question that really does arise is: does it make sense to funnel all of these other data types into the data warehouse? I've given you a picture here. The top part of the picture is the traditional data warehouse architecture, which dates from the early 1990s. In fact, this is a picture that I developed for my book. This picture shows the generic structure of the data warehouse: we have the operational systems that feed data through ETL into the enterprise data warehouse, and out through ETL into the data marts. Metadata is sitting in the box on the side. If you look at that, what you will notice instantly is that it is optimized for hard data. That's structured data coming from operational and transactional systems, and also hard data in the sense that we're going to store it in an enterprise data warehouse. The data comes from internally managed systems, internally managed sources, so you've got a lot of control over it.

The big blue blob at the bottom is big data, and the suggestion is that perhaps we should bring all of that into the data warehouse. I think the question is: is it really so? Does that make sense? If we look at some of the non-traditional data types that are involved in big data, are they amenable to being stored in a predefined schema? Do we know them well enough in advance that we could say, "Okay, this is how I would structure a set of enterprise data warehouse tables around it"? Are they suited to relational? There is a lot of textual, video and image information involved there. And, yes, of course, we've got the ability to store (Inaudible) in relational databases. But do we really want to do that for everything? Is this big data perhaps oversized for some of the data warehouse and ETL tools, although, as I've said, some of the tooling has grown? Is it stable enough for the warehouse? Do we really want to put all of this stuff that's changing all of the time, from web logs and server logs, into a warehouse and store it there? Is it well enough governed and managed, given that we are taking stuff from the web and the internet, perhaps from unreliable sources? Is it useful to put it into the data warehouse, into the relational environment? In my view, the answer is no. Much of that data needs to stay in different storage technologies, in different places. I've mapped this out on the grid that I showed you earlier. So, you can see that there are various different types of stores, content stores, Hadoop types of databases, other NoSQL databases, offline storage, that suit different parts of this data classification.

Does it make sense to funnel all other data types into the data warehouse?

Traditional DW architecture dates from early 1990s

Optimized for hard data from internally managed operational sources

Non-traditional data types
– Amenable to pre-defined schema?
– Suited to relational?
– Over-sized for the data warehouse and ETL tools?
– Stable enough for the warehouse?
– Well-enough governed and managed for the warehouse?

[Figure: traditional data warehouse architecture (Operational systems, Enterprise data warehouse, Data marts, Metadata) with Big Data shown beneath it.]
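The traditional flow in the top half of this picture, operational systems to enterprise data warehouse to data marts with an ETL step at each hop, can be sketched in a few lines. Everything specific here is an assumption made for illustration: the order records, the field names and the two function names are invented, and real ETL tools do far more than this.

```python
from collections import defaultdict

# Hypothetical rows from an operational order system. Note the
# inconsistent region codes and the amounts held as text.
operational_orders = [
    {"order_id": 1, "region": "eu", "amount": "120.50"},
    {"order_id": 2, "region": "US", "amount": "80.00"},
    {"order_id": 3, "region": "EU ", "amount": "29.50"},
]


def etl_to_warehouse(rows):
    """First ETL hop: cleanse operational rows and conform them to one schema."""
    return [
        {
            "order_id": row["order_id"],
            "region": row["region"].strip().upper(),
            "amount": float(row["amount"]),
        }
        for row in rows
    ]


def etl_to_mart(warehouse_rows):
    """Second ETL hop: summarize warehouse rows for one reporting need."""
    totals = defaultdict(float)
    for row in warehouse_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


enterprise_dw = etl_to_warehouse(operational_orders)
sales_mart = etl_to_mart(enterprise_dw)
print(sales_mart)
```

The sketch works because every input row is known in advance to have the same three fields. A web click log, an e-mail or an image has no obvious place in that predefined schema, which is exactly the set of questions the slide raises.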

So, the idea I'm really coming to at this stage is that no, it doesn't make sense to put all of this big data into the data warehouse because, in terms of its volume and its variety, it is far too different from what we've typically done in relational databases. Some of it overlaps, and you can see the overlaps on this slide, but other parts of it simply do not belong there. So, what's going on here? Well, we have a new set of data where we need varying levels of management and control. When we talked about the enterprise data warehouse and the data marts (Inaudible) from it, it was about getting consistency, integration and integrity of our information for management information purposes. And, certainly, we do want to get some level of consistency and integrity on the big data that we're feeding into our organization and into our IT systems. But do we need to bring it all into the data warehouse? I think the answer is no. We need to bring some of it into the data warehouse in order to create a set of consistent, integrated data that can link to all the different places.

Big Data exists in many forms...

Multiple storage options for different types of data

Varying levels of management and control

...BUT, still a need for consistent, integrated access

[Figure: the Anatomy / Temporality grid with regions marked Relational DB, Hadoop and similar, Content Store, Other NoSQL and Offline Store.]

Moreover, I think vendors in this space really are beginning to recognize this too. The big data warehouse vendors have all begun playing in the big data space. They have partnerships with Cloudera; the Teradata-Cloudera partnership, for example, is built on the idea that we want to bring the big data back and make it available through the relational database. Teradata also recently acquired Aster Data, which enables fast data transfer between Hadoop and the relational database. IBM has been involved with InfoSphere BigInsights, which is Hadoop-based. There is the HP Vertica Hadoop connector, the EMC Greenplum MapReduce extension, and many, many more, all saying, "You know what? We need to link these two things together." So, what we're getting here is a picture that is much bigger than the old picture of data warehousing, where everything got funneled through the enterprise data warehouse. In this picture, the enterprise data warehouse becomes what I'm calling core business information. This is the stuff that has to be consistent and has to be right, and it sits at the heart of the decision making that needs to be consistent across the enterprise.

RDBMS vendors recognizing importance of non-relational storage.

Big DW vendors playing in big data space
– Teradata / Cloudera partnership
– Teradata – Aster Data provide fast data transfer between Hadoop and RDBMS
– IBM InfoSphere BigInsights (Hadoop-based)

New RDBMS entrants link / embed MapReduce functionality
– (HP) Vertica Hadoop connector
– (EMC) Greenplum MapReduce extension
– And more...


On either side of that, you'll see the idea that there may be content stores, and there may be Hadoop, NoSQL or other stores, which contain data of those different structures and are optimized for using them. I have turned the metadata box the other way, to span across all of these different types of data, because it's the metadata that describes how we use those things and how we get at them. And data virtualization becomes the key technology that enables us to link these together. So, we have an enterprise data warehouse, perhaps plus the operational data store or MDM, with a focus on the core business information. That is getting smaller relative to the total amount of data that we're passing through our management information systems, because most analytical data and content flows alongside the enterprise data warehouse, and that is growing rapidly in volume and in variety. The virtualization then bridges across them. And, just in passing, note that the links from data virtualization also go down to the operational systems, which is saying that this is how, in the future, we deal with getting close to real time, or getting to real time. Rather than talking about moving this stuff into operational data stores, we start talking about accessing it directly where we need to.

So, this is a new architecture. It's an expansion of the data warehousing architecture that I call Business Integrated Insight, or BI to the power of 2. This is a piece of work that I have been developing over the last three years, really beginning to describe a new layered architecture, a new way of looking at information and its use within the entire organization.

Metadata

Data virtualization will be a key technology

EDW (plus ODS, MDM) focus on Core Business Information

– Shrink (in relative terms)

Most analytical data and content flows beside EDW, with metadata links

– Grow rapidly in volume and variety

Data virtualization becomes mandatory technology to bridge different information resources

– Metadata is key


[Diagram: Data Virtualization linking Operational systems, Core Business Info, Content Stores, and Hadoop, NoSQL and other Stores]
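The bridging role described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the `catalog` dictionary, the `virtual_join` function, the store names and the sample records are all hypothetical and are not taken from any product mentioned in the talk. The idea shown is that metadata records where each logical data set lives and how to reach it, and a virtualization layer uses that metadata to join warehouse data with data from a non-relational store at query time, without copying either into the other.

```python
import sqlite3

# Hypothetical core business information held in a relational warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
warehouse.executemany("INSERT INTO customer VALUES (?, ?)",
                      [(1, "Acme"), (2, "Globex")])

# Hypothetical click events held outside the warehouse (a stand-in for a
# Hadoop or NoSQL store): loosely structured records.
click_store = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/buy"},
    {"customer_id": 2, "page": "/pricing"},
]

def read_customers():
    query = "SELECT id, name FROM customer ORDER BY id"
    return [{"customer_id": cid, "name": name}
            for cid, name in warehouse.execute(query)]

def read_clicks():
    return list(click_store)

# The metadata: which logical data set lives where, and how to reach it.
catalog = {
    "customer": {"store": "warehouse", "reader": read_customers},
    "clicks": {"store": "nosql", "reader": read_clicks},
}

def virtual_join(left, right, key):
    """Join two logical data sets at query time, wherever they live."""
    index = {row[key]: row for row in catalog[left]["reader"]()}
    return [{**index[row[key]], **row}
            for row in catalog[right]["reader"]()
            if row[key] in index]

result = virtual_join("customer", "clicks", "customer_id")
```

In this sketch the caller names only logical data sets. Which physical store answers is decided by the metadata, which is the sense in which metadata is key to the virtualization layer.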

I'm going to do this reasonably quickly. The bottom layer here is called the business information resource. It is enterprise-wide and it is all of the information of the business. Well, you say, "I'm overwhelmed just doing the management information," and that may be so, but what we're seeing, and big data shows us this and operational BI shows us this, is that we can't avoid looking at all of the information, all of the data within the organization, if we want to go forward. So, we have to have a concept which spans all of that information, all of the storage types, all locations, operational, informational, collaborative, you name it. Above that, we have a process layer which I call business function assembly, which is all of the processes of the business. And, by the way, I mean both business and IT. In the past, we've talked about IT processes being different from business processes. This doesn't make sense anymore. There are business people out there mashing up data and creating processes themselves on the fly. We have to have a way that maps all of those together and that allows us to manage them within IT, as well as allowing the business people to do what they want to do. Service-oriented architecture is part of this, but it's all about creating, managing, accessing and using the information, all of the information of the business. And we need a new layer at the top, which I call the personal action domain, which addresses all of the users' intent and actions, hiding entirely the technical and physical aspects of process and information. This is a new way of looking at IT. It's a new way of looking at how we do IT within the business. But I'm sure most of you who have experience and background in data warehousing will recognize that the sorts of problems that you deal with today, within the management information sphere, are the same problems that this new architecture is going to have to deal with. So, you're going to have the opportunity to look at that and to do that in a different way, and to play a key role in moving the focus from integrating management information to integrating all information.

A new layered architecture… including BI, operational, social & IT aspects – Business Integrated Insight (BI2)

Personal Action Domain

– Addressing all users' intent & actions

– Hiding technical and physical aspects of process and information

Business Function Assembly

– All the processes of the business

– Business and IT

– Creating, managing & accessing information

Business Information Resource

– All the information of the business

– Across all storage types & locations

– Operational, informational, collaborative

Enterprise-wide

See: Devlin, B. "Business Integrated Insight (BI2): Reinventing enterprise information management", (2009), http://bit.ly/BI2_White_Paper

Let's step back from the future a little bit and ask, "Well, what should I do about big data?" Isn't he gorgeous? Let's have a look at what he says.

WHAT SHOULD YOU DO ABOUT BIG DATA?


The first point is that there is significant value to be found in big data. My friend, Mark Madsen, put this slide up at Strata, in February, and I just couldn't resist it. Here, we have these guys in the gold fields of California, panning for gold. And you see that they have the stream coming out through a pipe and they're working with one pan. Now, what they need is a fatter pipe and lots of pans working in parallel. And this is essentially what we're doing with big data. And why? Because there is gold to be found in that water. It may be a small amount, but when you find it, it's of large value.

There is significant value to be found...


All we need is a fatter pipe and lots of pans working in parallel...

Thanks to Mark Madsen for this!
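The "fatter pipe and lots of pans" image maps fairly directly onto splitting a large data set into chunks and scanning the chunks concurrently. The sketch below is a small-scale illustration only, assuming a hypothetical in-memory `stream` and a hypothetical `pan` function. It uses Python threads to show the shape of the idea, whereas a real big data platform would spread the work across many machines.

```python
from concurrent.futures import ThreadPoolExecutor

def pan(chunk):
    """One 'pan': scan a chunk of records and keep only the rare valuable ones."""
    return [record for record in chunk if "gold" in record]

def pan_in_parallel(stream, pans=4, chunk_size=1000):
    """Split the stream into chunks and work them with several pans at once."""
    chunks = [stream[i:i + chunk_size]
              for i in range(0, len(stream), chunk_size)]
    with ThreadPoolExecutor(max_workers=pans) as pool:
        # map() returns results in chunk order, whichever pan finishes first.
        found = list(pool.map(pan, chunks))
    return [nugget for part in found for nugget in part]

# Hypothetical stream: mostly water, with a little gold.
stream = ["water %d" % i for i in range(10_000)]
stream[123] = "gold nugget A"
stream[9_876] = "gold nugget B"

nuggets = pan_in_parallel(stream)
```

Two records out of ten thousand survive the scan, which is the point of the analogy: the yield is small relative to the volume processed, so throughput and parallelism matter more than per-record sophistication.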

What are the sorts of data, what are the sorts of uses that we can make of it, and where should we focus? Well, I think the focus has to be on data, at the moment, where innovative use could make a difference. In other words, we're not trying to take all of this big data and use it for ongoing management dashboards or make it part of the operational processes, at least not just yet. Of course, some of the eBays and the Googles are doing some of that and they have machines that would blow your socks off. But, for most of us, when we look at big data today, we want to try and find where it makes a difference at a small scale. This is going to start with an experimental view of the world. So, for example, we might be analyzing and indexing textual information from call logs, trying to figure out what is the common set of problems and what is the common set of solutions that could be found in there. We might be looking into behavioral analysis, looking at our own website and trying to find out what's going on there, analyzing users' actions and click flows, looking at where people go from and where they go to, and what leads to changes in buyer behavior. We might be looking at some of the social networks that gather around our particular products or our particular brand in order to discover who the influencers of behavior are and to identify emerging trends. And there are other aspects here on the right-hand side of the slide, which, in the interest of time, I'm going to leave you to read.

Focus on data where innovative use could make a difference

Analyze and index textual information

Behaviour Analysis

– Recommender system for behavioural targeting

– Analyze user's actions, click flow, and links

– Session analysis and report generation

– Process data relating to people on the web

Analyse social network relationships and activities

– Discover influencers of behaviour

– Identify emerging trends

Web Log Analysis

– Crawling, processing, and log analysis

– Process clickstream and demographic data to create web analytic reports

Blog crawling

– Crawl and later process posts

– Filter and index listings, remove duplicates and group similar ones

Search Indexes

– Gather WWW DNS data to discover content distribution networks & configuration issues

– Parse and index mail logs
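As one concrete instance of the click-flow analysis listed above, the following sketch counts how often visitors move from one page to the next within a session. The event layout `(session_id, timestamp, page)`, the sample data and the function name `page_transitions` are assumptions made for illustration. A real clickstream would need sessionization rules and far more volume than fits in memory.

```python
from collections import Counter, defaultdict

# Hypothetical click events: (session_id, timestamp, page).
events = [
    ("s1", 1, "/home"), ("s1", 2, "/product"), ("s1", 3, "/buy"),
    ("s2", 1, "/home"), ("s2", 2, "/product"),
    ("s3", 1, "/search"), ("s3", 2, "/product"), ("s3", 3, "/buy"),
]

def page_transitions(events):
    """Count moves from one page to the next within each session."""
    sessions = defaultdict(list)
    for session_id, timestamp, page in events:
        sessions[session_id].append((timestamp, page))
    transitions = Counter()
    for clicks in sessions.values():
        # Order each session's clicks by time, then pair consecutive pages.
        pages = [page for _, page in sorted(clicks)]
        transitions.update(zip(pages, pages[1:]))
    return transitions

transitions = page_transitions(events)
```

Here `transitions[("/product", "/buy")]` is 2 while three sessions reached `/product`, so one session left without buying. Small experimental findings of that kind are what this stage of working with big data is meant to surface.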


So, the idea, first of all, is to focus on data where innovation can be found, to find those small grains of gold, and then to think about what they mean in process terms: how can I move it into the process? But first I want to give you a little warning and I want to show you a picture of perhaps the most handsome guy in the world. Isn't he just! Big data, to me, is a bit like spreadsheets on steroids. In other words, many of the same issues that we have in data warehousing around spreadsheets come to the fore when we start talking about big data. Big data in the Hadoop environment, or in the MapReduce or the NoSQL environment, often has very little oversight or control. It's highly distributed and it's used by multiple users, a bit like spreadsheets, only worse, because the people who are using this are programmers rather than ordinary users. Many of us have been programmers in the past, so we know what we're like: we know how to do more damage. There are enormous data volumes, and the bigger the data volumes, the more that can go wrong. This data is often flowing in real time, so we don't necessarily have the ability to go back and redo it. It's a mixture of soft data as well as hard data, so it's not stuff that is as easily understood in a traditional data warehousing or data analysis view. And one of the things I just want to mention in passing: the Hadoop environment, the big data environment, as it's currently structured, is largely procedural, as opposed to the database view, which is more mapped in advance and more schematic. And there are mixed blessings in this. Procedural approaches can do more detailed work, but they require much closer attention to the structure and the semantics, and to making sure that the data works. So, this is, in many ways, spreadsheets on steroids and we need to take care about it. So, two points here: there is value to be had, but we need to be careful around it.

But, big data = spreadsheets on steroids

Many of the same issues...

– No data oversight / control

– Highly distributed

– Multiple users

... only worse!

– Programmers rather than “ordinary” users

– Enormous data volumes

– Often “real-time”

– Soft data as well as hard

– Procedural – mixed blessing
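The "procedural – mixed blessing" point can be made concrete by computing the same total two ways. The sketch below is illustrative only: the sales records, the `map_phase` and `reduce_phase` functions and the `sales` table are hypothetical, and the procedural half only imitates the MapReduce style in plain Python rather than using Hadoop itself.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sales records: (region, amount).
rows = [("north", 10), ("south", 5), ("north", 7), ("south", 1)]

# Procedural, MapReduce-style: the program spells out how to get the answer
# and must itself know the position and meaning of every field.
def map_phase(records):
    for region, amount in records:
        yield region, amount

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

procedural = reduce_phase(map_phase(rows))

# Declarative: the schema is defined in advance, and the query states what
# is wanted, leaving the how to the database engine.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
declarative = dict(
    db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

Both routes give the same totals. The difference is where the knowledge of structure lives: in the procedural version it is buried in the code, so every program that touches the data has to get it right, whereas in the declarative version it sits in a schema that can be inspected and governed.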


As we come towards the wrap-up, I have a couple of summary slides. There are four key architectural differences between big data and the traditional data warehouse.

The first is that big data today is fully distributed, versus the largely centralized approach that we're used to in data warehousing. Now, I'm not suggesting that there are no distributed MPPs within the data warehousing world. For sure, there are, but they are distributed within a centralized environment and largely controlled, whereas the big data world is fully distributed and therefore has no obvious control point for information management and governance. So, we have to be able to split out these two pieces: the piece that needs to be governed and managed, and the piece that we can allow to flow freely in order to get the innovation from it.

The second point is that when we look at big data today, it really has a much more transient focus, versus the long-term view that the traditional data warehouse has taken. By transient, I mean much of this data is used more or less when we see it, when we gather it or soon thereafter. We may want to store it for future reference, but we don't go back to it in the same way that we go back to enterprise data warehouse data. And when we look at that transient focus, we have to distinguish between whether we're going to be driven by business needs or by the data volume. In the past, when we've talked about data warehousing, we've focused on the business needs and used that to drive the behavior, the technology, and so on. At the moment, I suspect that big data, with its transient focus, is probably going to be forced to be driven by the data volume.

The third point is about how we get to understand what's going on in here. In the data warehouse, and indeed in relational databases in general, what we've got is a metadata-driven world. We've got a schema. We've got a definition of what the data means. We've got a place where we can go and understand it, and know what the relationships are. In a lot of this big data, the meaning is tacit. In other words, it's not fully described. If you go looking at large quantities of text, the meaning and the value are embedded within the text, without knowing what field they are necessarily in. So, it's difficult to create core business information from that, but we have to be able to do it. And how we will do it in the long term is by text mining, by using textual analysis to extract meaning from it on an ongoing basis, and indeed to extract metadata from it. This is related to the procedural versus declarative point that I mentioned on the previous slide, where procedural is more of the big data world and declarative is the database and data warehousing world.

The fourth point is that big data is very much structure-agnostic, as opposed to the highly structured data that we typically use in data warehousing. So, it's a very different view to the relational database world. All of those things say that when you are looking at big data, in technology terms we're going to sit most of it in different data stores, in a different environment, beside our data warehouses. But it doesn't mean that we're going to take our data warehouse away, because the data warehouse is, and becomes more and more, the place for the centralized, core business information that we use to tie all of the world together.

In summary, there are four key architectural differences between Big Data and traditional DW

1. Fully distributed vs. (largely) centralised

– No obvious control point for information management / governance

2. Transient focus vs. long-term view

– Driven by business needs or data volumes?

3. Tacit meaning vs. metadata

– Difficult to create core business information for ongoing use

– (Related to procedural vs. declarative approach)

4. Structure-agnostic vs. highly structured data

– Very different to relational DB world view
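The third difference, tacit meaning versus metadata, is the one most open to a small worked example. The sketch below pulls structured fields out of free-text call log entries with regular expressions and a keyword table. It is a deliberately crude stand-in for real text mining, and the log entries, patterns, topic table and function name `extract_metadata` are all hypothetical.

```python
import re

# Hypothetical free-text call log entries: the meaning is in the text,
# not in named fields.
call_logs = [
    "Customer 1042 called about order 77812, router will not power on.",
    "Cust. 2210 reports billing error on order 90017; refund requested.",
    "Caller asked about opening hours.",
]

ORDER = re.compile(r"\border (\d+)")
CUSTOMER = re.compile(r"\bCust(?:omer|\.) (\d+)")
# Assumed keyword-to-topic table; real text analysis would learn or curate this.
TOPICS = {"billing": "billing", "refund": "billing",
          "router": "hardware", "power": "hardware"}

def extract_metadata(text):
    """Pull whatever structured fields can be recognised out of one entry."""
    order = ORDER.search(text)
    customer = CUSTOMER.search(text)
    words = re.findall(r"[a-z]+", text.lower())
    return {
        "customer_id": int(customer.group(1)) if customer else None,
        "order_id": int(order.group(1)) if order else None,
        "topics": sorted({TOPICS[w] for w in words if w in TOPICS}),
    }

records = [extract_metadata(entry) for entry in call_logs]
```

The first two entries yield customer and order identifiers that could be linked to core business information in the warehouse. The third yields nothing usable, which illustrates why creating core business information from this kind of data is difficult and has to be an ongoing extraction effort rather than a one-time schema definition.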


Based on those four differences, we might, I think, ask a final question, which really is: what should you do now? And I would say to you that the thing to do now is, first, to explore the business opportunities in big data. In other words, where can I get some value? That means going out and playing with it. It means maybe employing a data scientist, and allowing a few of these data scientists to play with the big data. We need to plan integration and differentiation points with our existing data warehouse. Integration means: where are the pieces of information that I pick up from the big data going to be funneled into the data warehouse, because they have become part of the ongoing and long-term process? Differentiation means: what is the stuff that I don't want to put in the data warehouse? I certainly don't want to overwhelm my data warehouse with vast volumes of information that are changing and that are never going to be of long-term value. The data warehouse is about creating the history and the memory of the business. And I need to separate and control the production and exploration use. Production, to me, is something that goes with the data warehouse. Exploration goes with the Hadoops and the NoSQLs of this world. You need to separate them and ensure that you know what you're controlling and how you make them work together. And, as I said earlier, allow a few data scientists to play with the tools.

So... What should you do?


This is where I think we are at the moment. It's early days yet. There is no doubt in my mind that not only will data warehousing survive big data, but in fact it will thrive in the big data world, because what's going to happen as big data becomes more common and more widespread is that we're going to find more interesting pieces of information to bring into the data warehouse and to track on a longer-term basis. On this second-to-last slide, I just have some further resources that you could go look at and read. And, indeed, I star in a short, 10-minute YouTube video on the heat death of the data warehouse, which I presented at Strata 2011. And, by the way, the answer was that it's not the heat death of the data warehouse, as I've just shown here. With that, I would like to thank you for being here for this meeting. I hope you got value from it. And I'm more than happy to hear from you, with your comments and thoughts, as we go forward into this exciting world of big data. Thanks very much! Back to you, Ron.

Further resources from Dr. Barry Devlin

"Beyond Business Intelligence", Business Intelligence Journal, Vol. 15, No. 2, June 2010, http://bit.ly/Beyond_BI

"From Business Intelligence to Enterprise IT Architecture", B-eye-Network, February 2011, http://bit.ly/BI2EIA

"The Heat Death of the Data Warehouse", Strata 2011 Keynote (10 min YouTube), February 2011, http://bit.ly/HeatDeathDW

Blog at BeyeNetwork: www.b-eye-network.com/blogs/devlin

And more at: www.9sight.com/resources.htm


Ron Powell: Well, Barry, excellent presentation! You really put big data in perspective with regard to the data warehouse. This is probably one of the best presentations that I have seen on big data to date. So, I appreciate all of your time and effort today, I want to thank everyone for viewing the web seminar, and I look forward to hosting Barry on several web seminars going forward. Thank you!
