INDEX [usatrainings.files.wordpress.com] › 2011 › 06 › ... · On the other hand, business...



INDEX

1. Getting Started with Learning About Data Warehousing
2. A Definition of Data Warehousing
3. A Definition of Decision Support
4. The Case for Data Warehousing
5. The Case Against Data Warehousing
6. Actions for Data Warehouse Success
7. Data Warehousing Gotchas
8. Performing Data Warehousing Software Evaluations
9. An (Informal) Taxonomy of Data Warehouse Data Errors
10. Data Warehousing Political Issues
11. Different Aspects of Data Warehouse Architecture
12. What to Learn About in Order to Speed Up Data Warehouse Querying
13. What to Learn About in Order to Speed Up Data Warehouse Loading
14. How to Save Money on Your Data Warehousing Efforts
15. Using Data Warehousing in Strategic Decision Making
16. Maintenance Issues for Data Warehousing Systems
17. What Decision Support Tools are Used For
18. Is Web Data Analysis (i.e., Web Data Mining) Different?

Getting Started with Learning About Data Warehousing

If you are new to this field and the way you like to get into a new field is by getting an overview, I suggest that you:

Read the books "Building the Data Warehouse" by W. H. Inmon, "The Data Warehouse Toolkit" by Ralph Kimball, "Data Warehouse from Architecture to Implementation" by Barry Devlin, and "Data Warehousing in the Real World" by Sam Anahory and Dennis Murray

With due respect to all the other fine books on data warehousing and decision support, when read in combination I believe these four books provide a great introduction to and overview of the strategic and tactical issues system developers face (even though the books are several years old - despite what you read in the trade media, data warehousing does not change that much). Especially valuable are Inmon's overview and description of the iterative nature of data warehouse development, Kimball's description of data modeling principles and query/report tools, Devlin's descriptions of data extraction, cleaning, and loading issues and metadata, and Anahory/Murray's description of what can be done so a system can run efficiently and their description of the main tasks in a data warehouse project. If you are a really ambitious reader, consider a couple of other titles. "The Data Warehouse Lifecycle Toolkit" by Ralph Kimball, et al., is a 700+ page, clearly written description of a methodology for constructing data warehouses. If you use Oracle, "Oracle8i Data Warehousing" by Gary Dodge and Tim Gorman provides practical technical advice that even a non-DBA can understand and appreciate. Finally, "Data Warehouse Design Solutions" by Christopher Adamson and Michael Venerable provides insight on model design for specific business problems. (By the way, the above material contains the only recommendations of commercial products in this site. There is no commercial connection between this site and the authors or publishers of the books just cited.)

Visit a couple of organizations that have had warehousing systems in production for over a year

You will get an excellent education if you can ask an organization that 'has done it' what the biggest issues were that it faced in developing systems and what the biggest issues are that it faces in maintaining systems. Also, ask what the organization felt it did right and what it felt it could have done differently. I believe that if you do this you will learn a great deal about aspects of data warehousing that do not get discussed much in the literature - specifically the politics of data warehousing projects, the maintenance burdens data warehousing imposes, and how to deal with data warehousing software/hardware vendors and consultants.

Read up on some fundamental technical topics

You may find you will be greatly helped by reading up on SQL queries (especially multi-table and summary queries and subqueries), database indexing, join processing, and how query optimization works. Also helpful would be some knowledge about how logical structures can be created and how database partitioning can be used in conjunction with logical structures. - There are many fine books on SQL. The latter knowledge will most likely be found in books aimed at DBAs for specific commercial databases.
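As a rough illustration of the query types just mentioned, here is a minimal sketch using Python's built-in sqlite3 module. The customer and sale tables, their columns, and the sample rows are all invented for this example; they do not come from any of the books cited above.

```python
import sqlite3

# Hypothetical two-table schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id),
                       amount REAL);
    -- An index on the join column is the kind of structure an optimizer can use.
    CREATE INDEX idx_sale_customer ON sale (customer_id);
    INSERT INTO customer VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO sale VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 300.0), (4, 3, 25.0);
""")

# Multi-table summary query: total sales per region.
region_totals = conn.execute("""
    SELECT c.region, SUM(s.amount) AS total
    FROM sale AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Subquery: customers whose total sales exceed the average single sale amount.
big_customers = conn.execute("""
    SELECT customer_id
    FROM sale
    GROUP BY customer_id
    HAVING SUM(amount) > (SELECT AVG(amount) FROM sale)
    ORDER BY customer_id
""").fetchall()

print(region_totals)
print(big_customers)
```

Even a toy like this gives you something concrete to experiment with when you read about indexing, join processing, and query optimization.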

Build something!

Computer texts love to cite a (supposedly) Confucian quote "What I hear I forget. What I see I remember. What I do I understand." Well, this quote is apt in the case of learning about data warehousing. After you build something, no matter how modest, you will gain a more profound appreciation of the topic.

A Definition of Data Warehousing

My favored definition of a data warehouse is a slightly modified version of Ralph Kimball's definition on page 310 of The Data Warehouse Toolkit:

A data warehouse is a copy of transaction data specifically structured for querying and reporting.

Ralph states that a data warehouse is "a copy of transaction data specifically structured for query and analysis". Two quibbles I have with Ralph's definition are: 1) Sometimes non-transaction data are stored in a data warehouse - though probably 95-99% of the data usually are transaction data. 2) I say "querying and reporting" rather than "query and analysis" because the main outputs from data warehouse systems are either tabular listings (queries) with minimal formatting or highly formatted "formal" reports. Queries and reports generated from data stored in a data warehouse may or may not be used for analysis. - For some more information about why the transaction data are copied, you may want to see my essay The Case for Data Warehousing.

What I especially like about Ralph's definition is what he does not say.

The form of the stored data has nothing to do with whether something is a data warehouse.

A data warehouse can be normalized or denormalized. It can be a relational database, multidimensional database, flat file, hierarchical database, object database, etc. Data warehouse data often get changed. And data warehouses often focus on a specific activity or entity.

Data warehousing is not necessarily for the needs of "decision makers" or used in the process of decision making.

Of course if you want to define every user as a decision maker and all activities as decision making processes, then my assertion is false. But in my experience, the overwhelming uses of data warehouses are for quite mundane, non-decision making purposes rather than for grist for making decisions with wide ranging effects (so-called "strategic" decisions). In fact, I would assert that most data warehouses are used for post-decision monitoring of the effects of decisions (or as some people might say, for "operational" issues). By the way, this is not saying that using data warehousing in the decision making process is not a wonderful, potentially high return effort. But my caution is that though the trade press, vendors, and many industry experts trumpet the role of data warehousing vis-à-vis decision making, this is an area that in reality we do not have a clear understanding of. (See the writing of Peter Keen for more on this perspective.)

A Definition of Decision Support

The term decision support, if my knowledge of history of this area is correct, goes back to the 1970s when it was coined by some academics associated with the Massachusetts Institute of Technology. Since then, many academic definitions have been offered. - My purpose in this essay is to provide a definition that may lend clarity to practitioners.

A decision support system or tool is one specifically designed to allow business end users to perform computer generated analyses of data on their own.

I believe the essence of decision support is, in the language of the 1960s, to allow end users to do their own thing. I note that this definition is still fuzzy because what constitutes analyses and "on their own" are debatable points.

We cannot say that decision support systems or tools necessarily support the making of decisions.

What's in a name? - As far as I know, cognitive researchers do not agree on how decisions are made. Therefore, saying that these tools support making decisions is not a provable statement. Nor is it, in my opinion, an insightful way of defining these tools.

These tools do not analyze by themselves - rather they help a person analyze

In other words, the tools facilitate analyses rather than perform analyses. If you want to learn more about how the tools facilitate analyses, see my essay on What Decision Support Tools are Used For.

Data warehousing and decision support systems and tools do not necessarily go hand in hand.

Many data warehouses are not used as decision support systems. And decision support systems or tools do not necessarily require the use of a data warehouse as a source for data. I assert that, by far, the most used decision support tools are spreadsheets not connected in any automated way with a data warehouse.

Business intelligence seems to have become the vendors' preferred synonym for decision support

My guess is that this is because decision support has an academic connotation and, as just mentioned, decision support systems do not necessarily support decisions. On the other hand, business intelligence systems do not necessarily make a business more intelligent. By the way, the consultant-coined term business intelligence goes back to the late 1980s, fell out of use, and then was revived by the DW/DSS world in the late 1990s. Confusingly, business intelligence is also used as a synonym for competitive intelligence (and is probably a more apt term for that area). By the way, "analytics" seems to be an up and coming name for this area - despite the mid-1990s consultant-coined term "analytical applications" never taking hold.

The Case for Data Warehousing

The following is a list of the basic reasons why organizations implement data warehousing. This list was put together because too much of the data warehousing literature confuses "next order" benefits with these basic reasons. For example, spend a little time reading data warehouse trade material and you will read about using a data warehouse to "convert data into business intelligence", "make management decision making based on facts not intuition", "get closer to the customers", and the seemingly ubiquitously used phrase "gain competitive advantage". In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives.

The basic reasons organizations implement data warehouses are:

To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems

Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Reports and queries, which can require a much greater range of limited server/disk resources than transaction processing, can lower the probability that transactions complete in an acceptable amount of time when they run on the servers/disks used by transaction processing systems. Or, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage servers/disks so there is a high enough probability that acceptable response time can be achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.

To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing

There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) and may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and report processing but may slow down transaction processing (e.g., bit-mapped indexing) and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). - Do note that whether and by how much a modeling technique or server technology is a help or hindrance to querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.
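For readers who have not seen a star schema, the following is a minimal sketch in Python with the built-in sqlite3 module. The table names (fact_sales, dim_date, dim_product), the columns, and the sample rows are hypothetical, chosen only to show the shape: one central fact table whose keys point at small descriptive dimension tables.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_date VALUES (19970101, 1997, 1), (19970201, 1997, 2);
    INSERT INTO dim_product VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO fact_sales VALUES
        (19970101, 1, 2, 200.0),
        (19970101, 2, 1, 80.0),
        (19970201, 1, 1, 100.0),
        (19970201, 2, 3, 240.0);
""")

# A typical reporting query: constrain and group by dimension attributes,
# then sum the numeric measures held in the fact table.
revenue_by_month_category = conn.execute("""
    SELECT d.month, p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE d.year = 1997
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()

for row in revenue_by_month_category:
    print(row)
```

Note that every report query joins the fact table to its dimensions in the same simple pattern, which is convenient for querying but is not how you would normally lay out tables that must absorb many small updates.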

To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel

Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, less technically knowledgeable personnel often "hit a complexity wall" and need IS help. IS, however, may also be able to more quickly write and maintain queries and reports written against data warehouse data. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.

To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems

Please read my essay on An informal taxonomy of data warehouse data errors for an explanation of the type of "errors" that need cleaning up. The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.

To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only

For a long time firms that need reports with data from multiple systems have been writing data extracts and then running sort/merge logic to combine the extracted data and then running reports against the sort/merged data. In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sort/merged frequently, if data purged from transaction processing systems needs to be reported upon, and most importantly, if the data need to be "cleaned", data warehousing may be appropriate.
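The extract-then-sort/merge approach described above can be sketched in a few lines of Python. The two extracts, their customer-id key, and the function name sort_merge are hypothetical, and the sketch assumes each key appears at most once per extract.

```python
from operator import itemgetter

# Hypothetical extracts from two transaction systems, keyed by customer id.
billing_extract = [("C002", 300.0), ("C001", 150.0)]
support_extract = [("C001", 4), ("C003", 1), ("C002", 2)]


def sort_merge(left, right):
    """Sort both extracts by key, then walk them in step to combine matching rows.

    Assumes each key occurs at most once in each extract.
    """
    left = sorted(left, key=itemgetter(0))
    right = sorted(right, key=itemgetter(0))
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            # Same key in both extracts: emit one combined row.
            merged.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # key only in the left extract; skip it
        else:
            j += 1  # key only in the right extract; skip it
    return merged


combined = sort_merge(billing_extract, support_extract)
print(combined)
```

This works well enough for an occasional report. The case for a warehouse gets stronger when the same merge has to be repeated often, over large volumes, and with cleaning steps in between.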

To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time

Older data are often purged from transaction processing systems so the expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse where there presumably is less of a need to control expected response time or the expected response time is at a much higher level. - As for "as was" reporting, sometimes it is difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to do this because you only have a record of current employee grade level. To be able to handle this type of reporting problem, firms may implement data warehouses that handle what is called the "slowly changing dimension" issue.
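One common way to handle the grade Level 3 example is to keep a history row for each period in which an employee's attributes were unchanged. The sketch below, in Python with the built-in sqlite3 module, is only an illustration under that assumption; the employee_history table, its valid_from/valid_to columns, and the sample data are invented here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per employee per period during which grade and salary were unchanged.
    CREATE TABLE employee_history (
        employee_id INTEGER,
        grade_level INTEGER,
        salary REAL,
        valid_from TEXT,   -- ISO date, inclusive
        valid_to TEXT      -- ISO date, exclusive; '9999-12-31' marks the current row
    );
    INSERT INTO employee_history VALUES
        (1, 2, 40000, '1996-01-01', '1997-03-01'),
        (1, 3, 45000, '1997-03-01', '9999-12-31'),
        (2, 3, 47000, '1996-06-01', '1997-02-01'),
        (2, 4, 52000, '1997-02-01', '9999-12-31');
""")


def grade_3_salaries_as_of(as_of_date):
    """Return (employee_id, salary) for employees at grade 3 on the given ISO date."""
    return conn.execute("""
        SELECT employee_id, salary
        FROM employee_history
        WHERE grade_level = 3 AND valid_from <= ? AND ? < valid_to
        ORDER BY employee_id
    """, (as_of_date, as_of_date)).fetchall()


# "As was" report for the beginning of the first three months of 1997.
for month_start in ("1997-01-01", "1997-02-01", "1997-03-01"):
    print(month_start, grade_3_salaries_as_of(month_start))
```

A transaction system that overwrites the grade level in place cannot answer this question, because the earlier rows in the history table would never have been kept.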

To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be interesting to firms that want to allow reporting and querying only over the Internet.

Some firms implement data warehousing for all the reasons cited. Some firms implement data warehousing for only one of the reasons cited.

By the way, I am not saying that a data warehouse has no "business" objectives. (I grit my teeth when I say that because I am not one to assume that an IT objective is not a business objective. We IT people are businesspeople too.) I do believe that the achievement of a "business" objective for a data warehouse necessarily comes about because of the achievement of one or many of the above objectives.

If you examine the list you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations of transaction processing systems are not, however, inherent. That is, the limitations will not be in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are.

Finally, to repeat the point I made initially, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how to change business practices to best use the data warehouse and then to change their business practices. And that can be harder than implementing a data warehouse.

The Case Against Data Warehousing

The literature is full of testimonials for data warehousing. There is almost nothing about the arguments against data warehousing. In this paper I attempt to slightly fill that void by shedding light on business and cultural factors that greatly lessen the value of data warehousing for certain organizations. By the way, when I refer to data warehousing, I refer to both centralized data warehousing systems and data marts.

Some of the reasons data warehousing efforts may not be appropriate for certain organizations are:

Data warehousing systems, for the most part, store historical data that have been generated in internal transaction processing systems. This is a small part of the universe of data available to manage a business. Sometimes this part has limited value.

That is, sometimes the business end user community does not have a strong interest in old transaction processing system data beyond what are available in basic reports generated in transaction processing systems. This lack of interest often stems from the fact that the markets in which a business competes are in great flux or that the internal structure of the organization is in perpetual transition. If these conditions exist, there may not be a solid historical base to compare current performance with. Also, sometimes there is a lack of interest in looking at this data in any in-depth way because a business is so simple that a data warehouse is overkill.

Data warehousing systems can complicate business processes significantly.

Though the interest in business process reengineering seems to have waned, some of the appreciation of how complicated processes can slowly strangle a business has remained. Data warehousing, if unchecked, can foster the "institutionalization" of easily created reports whose reason for being is quickly forgotten while people still toil to process these reports. If your organization does not know how to throw out processes (pardon my calling producing, distributing, and reading a report a "process"), data warehousing can quickly add clutter to the business environment.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, data warehousing may not be for your business.

Whew! You can say that again. - Anyway, you may find that the more of these conditions are met, the less value data warehousing may add to your firm. And once you get away from the big "Fortune 500, centralized IS" type shops most of the data warehousing vendors slant their marketing to, these conditions describe the reporting needs of many firms.

Data warehousing can have a learning curve that may be too long for impatient firms.

Despite the speed of the data warehousing development effort, it takes time for an organization to figure out how it can change its business practices to get a substantial return on its data warehousing investment. I speculate that rigorous analysis of the return on most of the major data warehousing implementers' investments would find a much longer average payback period than you would surmise from reading the trade press.

Data warehousing can become an exercise in data for the sake of the data.

Organizations find that there are unlimited opportunities to add data to their data warehouse. Data warehouses, like most other complex systems, take on a life of their own. Unfortunately, adding data without questioning the business value of the data can lessen the business value of the data warehouse and quickly increase the cost of maintaining the data warehouse.

In certain organizations ad hoc end user query/reporting tools do not "take".

This is of concern to organizations that believe they can get their return on investment by having users write many of their own queries and reports. In some firms there are profound cultural barriers in the business organization to the acceptance of a tool that allows a person to ask questions on his own. Trying to promote the use of such a tool in these organizations is setting yourself up for failure. Or, sometimes these tools do not take because a business is so complicated that only relatively simple reports with little business value can be written by end users.

Many "strategic applications" of data warehousing have a short life span and require the developers to put together a technically inelegant system quickly. Some developers are reluctant to work this way.

Again, the importance of the culture should not be underestimated. This time, though, the issue is in the IS organization. If your sell of the data warehousing project is the ability to do this strategic work (which is probably now being done by your users with large and complex spreadsheets) as opposed to the usual development of canned and semi-canned reports and queries, ask yourself if the IS culture can accept this mode of working. For many organizations this approach to systems work is much harder to accept than most people realize.

There is a limited number of people available who have worked with the full data warehousing system project "life cycle".

I refer to availability of both employees and consultants. Systems of some depth require a considerable amount of time to develop fully. In other words, it takes a long time to gain experience with the usual problems that develop at different phases of a data warehousing effort. You should be wary of a consultant who says he has experience implementing scores of data warehouses in a couple of years. Usually this experience will be with a well-defined part of a data warehousing project that was amenable to outsourcing or with minor projects.

Data warehousing systems can require a great deal of "maintenance" which many organizations cannot or will not support.

Despite the best efforts to architect a system so "maintenance" (in quotation marks because it seems often there is never the closure to the initial data warehousing effort that the term "maintenance" implies) demands are minimized, many systems by their very nature require a great deal of care and feeding once they are in "production". It is important to note that the more successful a warehouse is with the users, the more maintenance it may require. Organizations that cannot or will not staff to meet these maintenance demands should think twice before they jump into the data warehousing business. By the way, it's very easy for the users to quickly go sour on a system they were enthusiastic about at roll-out time if the system personnel do not support the maturing of the system.

Sometimes the cost to capture data, clean it up, and deliver it in a format and time frame that is useful for the end users is too much of a cost to bear.

The percentage of time that must be devoted to extracting, cleaning, and loading data has been well discussed in the literature. It should be pointed out that there are some potential "show-stoppers" in these efforts. Loading data from previous years can require the knowledge of transaction processing system developers who have long since moved on. Cleaning data so they are in a form that is acceptable to users from different functional areas may require arbitration skills the typical data warehousing developer may not possess. Finally, data may have to be loaded into a data warehousing system in a processing window that just isn't big enough. Sometimes compromises are acceptable get-arounds. Often, though, compromises end up substantially compromising the value of the information in the data warehouse.

You may have gotten the impression from reading the trade press that data warehousing is only for large organizations because it requires huge staffs and huge budgets. Well, most of the trade press is dominated by vendors/consultants/publications trying to market to large organizations with huge staffs and huge budgets. - Though I have no way to prove this, in terms of numbers, I think most data warehousing efforts are done by small staffs with modest budgets. In fact, smaller organizations are probably much more "into" data warehousing than larger organizations. It is only recently that practical technology for huge organizations who lust for multi-terabyte databases has become available. The technology for more modestly sized data warehouses, on the other hand, has been available for many years.

Finally, you may have seen articles that state that data warehousing failure rates are between 10% and 90%. Though how these failure rates are determined is suspect, there is no denying that data warehousing is risky. Now the fact that these efforts are risky does not bolster the case against data warehousing. Data warehousing has not repealed the positive relationship between risk and expected return in capital projects. However, if your organization does not know how to manage risky projects, then data warehousing may not be for you.

Actions for Data Warehouse Success

The following are some suggestions for the warehouse builder. These are points I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing.

From day one establish that warehousing is a joint user/builder project

Warehouse projects will fail if the builders get specs from the users, go off for 6 months, and then come back with the 'finished' project. Warehouses are iterative! (I think the word iterative means there are lots of mistakes in the projects.) Builders and users working with each other will not reduce the number of iterations, but it will reduce the size of them. By the way, see Peter Block's Flawless Consulting for a great discussion of how to bring about 'joint' projects.

Establish that maintaining data quality will be an ONGOING joint user/builder responsibility

Organizations undertaking warehousing efforts almost continually discover data problems. Best to establish right up front that this project is going to entail some additional ongoing responsibility.

Train the users one step at a time

Typically users are trained once. In several days they learn both the basics and intermediate and sometimes advanced aspects of using a tool. Slow down! Consider providing training initially in the minimum needed for the user to get something useful from the tool. Then let the user use the tool for a while (meaning several days, weeks, or months). Having basic training and some hands on experience, the user will have a much better context with which to grasp the next level. Also, once the basics and the next level are learned, keep training the users! After a year using the tool, schedule advanced training.

Train the users about the data stored in the data warehouse

Users often need more training about the stored data than about the tools used to access the data.Do not assume the data are self-explanatory or that any metadata you may provide will answer anyquestions. Note that users are often used to seeing data in canned reports and seeing data in its "raw"form can be confusing.

Consider doing a high level corporate data model / data warehouse architecture "exercise" in three weeks

Actually, the key point regarding time is to "time-box" the exercise into a relatively short time. After about three weeks, the marginal benefits from additional time devoted to these types of exercises rapidly decrease. - The corporate model is going to identify, at a high level, subjects and relationships and, most importantly, the chunks of information that it makes sense to deliver in different projects. The architecture part of the exercise is to determine the dimensions, definitions of derived data, attribute names, and information sources that you will attempt to use consistently in your data warehousing efforts. The exercise also consists of coming to an agreement as to how to keep the corporate model up-to-date and how to make sure future data warehousing efforts pay attention to the architectural principles.

Implement a user accessible automated directory to information stored in the warehouse

The majority of successful warehousing efforts I have seen included providing some means for the warehouse user to locate stored information. Most of the time this involved building a separate database with directory information. And most of the time, a pretty simple database sufficed for initial use.

Once you know what raw data you want to feed into the data warehouse, request that data

If you have done some reading on data warehouse development you probably have read that figuring out the process of extracting, transforming, and loading (ETL) usually takes the majority of the time in initial data warehouse development. In project management lingo, figuring out ETL is usually on the critical path. - If you know what raw data you need, request it as soon as you know it. You are probably going to have to ask one of the programmers of the legacy feeder systems to initially get this data for you. For reasons of politics, overwork, and just plain lack of knowledge of how data are physically stored in a system, the feeder system programmer often can take a while to get you that data.

Determine a plan to test the integrity of the data in the warehouse

Do not underestimate the importance of user faith in the integrity of the warehouse data. Huge warehouse efforts quickly go sour if after system roll-out users find multiple mistakes. A good investment of time in the initial stages of a warehouse project is for the builder and user to jointly determine what checks will be made on the warehouse data during development and what checks need to be made on an ongoing basis. The checks include tying warehouse data controls back to controls in feeder systems, checking the correctness of aggregation logic, and testing whether classification codes were assigned correctly.
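The first of these checks, tying warehouse control totals back to the feeder systems, could be sketched roughly as follows. This is only an illustration: the function name, the dictionary layout keyed by period, and the tolerance parameter are assumptions, not part of any particular tool.

```python
from decimal import Decimal


def reconcile_control_totals(feeder_totals, warehouse_totals, tolerance=Decimal("0.00")):
    """Return (key, feeder, warehouse) for every control total that does not tie.

    Both arguments are hypothetical dicts mapping a control key
    (for example a period or batch id) to a Decimal total.
    """
    mismatches = []
    for key in sorted(set(feeder_totals) | set(warehouse_totals)):
        feeder = feeder_totals.get(key)        # None means the key is missing
        warehouse = warehouse_totals.get(key)
        if feeder is None or warehouse is None or abs(feeder - warehouse) > tolerance:
            mismatches.append((key, feeder, warehouse))
    return mismatches


feeder = {"1999-01": Decimal("1500.00"), "1999-02": Decimal("980.50")}
warehouse = {"1999-01": Decimal("1500.00"), "1999-02": Decimal("975.50")}
print(reconcile_control_totals(feeder, warehouse))
```

A check like this is cheap enough to run after every load, which fits the idea of agreeing up front on the ongoing checks as well as the development-time ones.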

From the start get warehouse users in the habit of 'testing' complex queries

Many people will assume that the query result is correct. At the very least, get the user in the habit of eyeballing the query or report to check if several records that should be included are, in fact, included and that several records that should not be included are, in fact, not included.

Coordinate system roll-out with network administration personnel

Use of data warehousing systems can bring about some strange spikes in network activity. If you keep network administration people informed of the roll-out schedule, chances are they will monitor network activity for you and be ready to make adjustments to the network as necessary.

Have a good grasp of desktop databases and spreadsheets

Even if you are dealing with a 100 TB database, there are so many little tasks to be done in a data warehousing project where knowledge of these tools will be helpful. Skillful use of these tools during development can be a huge productivity enhancer.

Be prepared to support beginning users immediately and at any time

We developers often greatly underestimate users' hesitation to begin using the data warehouse. This hesitation could be because of user fear of technology or user fear that they will not get IS support. So, the first point is to be available to help when the user wants to try to use the data warehouse the first time. Users also may want to use the data warehouse for the first time during the weekend or at 6:00 in the morning or 8:00 at night. The distractions are less at those times. If you want to make that beginning user a committed customer of your data warehouse, you had better be available to support the user when he starts out, whatever the day or the hour.

Maintain the audit trail to the feeder systems

That is, make it as easy as possible to tie the data in the data warehouse to the feeder systems. Your users have to trust the numbers in the data warehouse. You owe this to the users in order to maintain their trust.

Market and sell your data warehousing systems

For the most part, use of data warehousing systems is optional. This means you have to identify the potential users of the systems, help them understand what are the benefits of the system, and then make them want to keep coming back to use the system.

Data Warehousing Gotchas

Here are some points for the warehouse builder I rarely see discussed or I do not see discussed enough in the barrage of articles about data warehousing. Forewarned is forearmed!

You are going to spend much time extracting, cleaning, and loading data

The usual figure quoted is that 80% of the time building a data warehouse will be spent on this type of work. (No one has ever explained how this percentage was obtained though.) Suffice it to say, though, the amount of time on these tasks is often grossly underestimated. Note that this point is about extracting and cleaning and loading. Though by now many people are aware that cleaning the data is complex, extracting data and loading data are equally, if not more, complex.

Despite best efforts at project management, data warehousing project scope will increase

To paraphrase data warehousing author W. H. Inmon, traditional projects start with requirements and end with data. Data warehousing projects start with data and end with requirements. Once warehouse users see what they can do with 2000's technology, they will want much more. (Which is fine!) One piece of advice for the warehouse builder is never to ask the warehouse user what information he wants. Rather, ask what information he wants next.

You are going to find problems with systems feeding the data warehouse

Problems that have gone undetected for years will pop up. You are going to have to make a decision on whether to fix the problem in what you thought was the 'read-only' data warehouse or fix the transaction processing system.

You will find the need to store data not being captured by any existing system

A very common problem is to find the need to store data that are not kept in any transaction processing system. For example, when building sales reporting data warehouses, there is often a need to include information on off-invoice adjustments not recorded in an order entry system. In this case the data warehouse developer faces the possibility of modifying the transaction processing system or building a system dedicated to capturing the missing information.

You will need to validate data not being validated by transaction processing systems

Typically once data are in the warehouse many inconsistencies are found with fields containing 'descriptive' information. For example, many times no controls are put on customer names. Therefore, you could have 'DEC', 'Digital', and 'Digital Equipment' in your database. This is going to cause problems for a warehouse user who expects to perform an ad hoc query selecting on customer name. The warehouse developer, again, may have to modify the transaction processing systems or develop (or buy) some data scrubbing technology.
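As a rough illustration of the simplest kind of home-grown scrubbing, the sketch below maps known variants of a customer name to one standard spelling. The alias table and function name are hypothetical; commercial scrubbing tools do far more than this.

```python
# Hypothetical alias table; in practice it would be maintained jointly
# by users and builders as new variants are discovered.
CUSTOMER_ALIASES = {
    "dec": "Digital Equipment",
    "digital": "Digital Equipment",
    "digital equipment": "Digital Equipment",
}


def standardize_customer_name(raw_name):
    """Map a free-text customer name to one standard spelling.

    Names with no known alias are returned with only whitespace cleaned up,
    so they can be reviewed rather than silently changed.
    """
    cleaned = " ".join(raw_name.split())       # collapse stray spaces
    return CUSTOMER_ALIASES.get(cleaned.lower(), cleaned)


for name in ["DEC", "Digital", " Digital  Equipment ", "Acme Tires"]:
    print(repr(name), "->", standardize_customer_name(name))
```

A lookup table only handles the variants someone has already noticed, which is one reason this kind of cleaning tends to be ongoing work.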

Some transaction processing systems feeding the warehousing system will not contain detail

This problem is often encountered in customer or product oriented warehousing systems. Often it is found that a system which contains information that the designer would like to feed into the warehousing system does not contain information down to the product or customer level. By the way, this is what some people label a 'granularity' problem.

You will underbudget for the resources skilled in the feeder system platforms

In addition to understanding the feeder system data, you may find it advantageous to build some of the "cleaning" logic on the feeder system platform if that platform is a mainframe. Often cleaning involves a great deal of sort/merging - tasks at which mainframe utilities often excel. Also, you may find that you want to build aggregates on the mainframe because aggregation also involves substantial sorting.

Many warehouse end users will be trained and never or seldom apply their training

I once read a study that claimed that only one quarter of the people who get training in a query tool actually become heavy users of the tool.

After end users receive query and report tools, requests for IS written reports may increase

This phenomenon was seen with many of the information centers of the 1980s. It comes about because the query and report tools allow the users to gain a much better appreciation of what technology could do. However, for many reasons the users are unable to use the new tools themselves to realize the potential. By the way, if this happens do some honest research on why. Granted there are many reports that are so complex that IS expertise is going to be required no matter what tool the end user has. However, many times this phenomenon points to training needs.

Your warehouse users will develop conflicting business rules

Many warehouse tools allow users to perform calculations. The tools will allow users to perform the same calculation differently. For instance, suppose you are summarizing beverage sales by flavor category. Also suppose that the flavor category includes cherry and cola. If you have a cherry cola brand there is a chance that two users will classify the brand in different categories. You will find that there are means to incorporate some of the business rules in your warehouse. However, the number of possible business rules is so large that you will not be able to incorporate all rules.

Your warehouse users may not know how to use data

After many years of using whatever reports have been thrown in their faces, the users may not know what data to use their newfangled decision support tools to retrieve. To use a phrase from pop sociology, the users have been "culturally conditioned" to use what they are given and to never ask for more.

Large scale data warehousing can become an exercise in data homogenizing

Data have quirks! Sometimes when we developers combine detailed data for different subjects, in our efforts to make everything 'fit' we can take the life out of the data. For instance, if your company sells dog food and auto tires, you want to be careful if you are building a sales data warehouse for both lines of business. You have to make a judgment call as to whether these businesses fit the same logical and/or physical model.

'Overhead' can eat up great amounts of disk space

A popular way to design decision support relational databases is with star or snowflake schemas. Persons taking this approach usually also build aggregate fact tables. If there are many dimensions to the data, be aware that the combination of the aggregate tables and indexes to the fact tables and aggregate fact tables can eat up many times more space than the raw data. If you are using multidimensional databases, be aware that certain products pre-calculate and store summarized data. As with star/snowflake schemas, storage of this calculated data can eat up far more storage than the raw data.

The time it takes to load the warehouse will expand to the amount of the time in the available window... and then some

You'll do yourself well by understanding the different ways to approach updating the warehouse. Before you decide that you can do complete refreshes, be aware that "There's all day Sunday to load the database!" have been famous last words of more than a handful of warehouse developers.

You are going to have a tough problem with security - especially if you make your data warehouse Web-accessible

You are going to face a paradox - the more accessible you make your data warehouse (and by accessible, I don't just mean making it Web accessible - I mean architecting it in a way that people want to use it), the greater the security risk you are exposing yourself to. Frankly, restricting people to "need to know" does not cut it in the organization of the 2000s. But, on the other hand, exposing information to theft from anyplace in the globe is not too great for job security either.

The data warehouse data you do not reconcile with the feeder systems will cause the problems

For certain data warehouse data you are going to think that there is no logical way that data in the feeder systems can be reconciled with what is in the warehouse. Then, when a user looks at a report and tells you "I think there is a problem", it will be with the unreconciled data. Unfortunately, you will then discover there is a way, albeit roundabout, to reconcile the data.

You are building a HIGH maintenance system

Reorganizations, product introductions, new pricing schemes, new customers, changes in production systems, etc. are going to affect the warehouse. If the warehouse is going to stay 'current' (and being current will be a big selling point of the warehouse), changes to the warehouse have to be made fast.

You will fail if you concentrate on resource optimization to the neglect of project, data, and customer management issues and an understanding of what adds value to the customer

If you provide a system that is fast and technically elegant but adds little value or has suspect data, you will probably lose your customer from day one and will have a tough time getting him back. For the most part, use of data warehousing systems is optional. The customer has to want to use the system.

Performing Data Warehouse Software Evaluations

Here are some ideas that may make the process of evaluating data warehousing software more effective. This is not a comprehensive list of tasks to follow in a technology evaluation. Rather, these are points that seem to be rarely discussed or followed in this wave of interest in data warehousing. An excellent paper to read along with this essay is Nigel Pendse's How not to buy an OLAP product - which has advice that, for the most part, is applicable to buying any sort of data warehousing/decision support technology.

Do the evaluation yourself

That is, do not rely solely (or even in large part) on the ideas of someone outside your organization. There is no "metaphysically" best technology out there. All technologies have to be evaluated in the context of your organization's needs, expectations, limitations, and resources - which you know better than any outsider. Also, you can never be sure of the outsider's biases. Outsiders' main worth really comes from their knowledge of criteria you can use in the evaluation - though you have to decide the weight of each criterion.

Always first ask whether technology already in-house can do the job

Successful data warehousing/decision support systems can often be built without the specialized tools you see listed in this site. Taking on additional technology in your organization always imposes some burdens that should always be recognized before you hand over your organization's money.

Get references

Talking to reference sites is one of the most effective means of getting practical information. You would be surprised how important operational issues surface while doing evaluations. Some hints on reference gathering practices that have worked for me are:

Ask the software vendor for a complete list of referenceable sites - Try to have options as to which organizations you will call.

If this is a major decision for your company, call 5-6 sites - You need a minimum number of sites to help you detect patterns.

Make a telephone appointment to talk with the reference - The reference will appreciate this.

Plan on 20 minutes with the reference - Again the reference will appreciate this.

Ask open-ended questions - You will find some interesting information with skillful questions.

Send your questions to the reference in advance - Some of the references will be more comfortable if they know what you'll be asking.

Send a thank you note to your references asking if it would be okay to make a quick follow-up call if necessary - This will lay the groundwork if you have to call about another issue.

If you are going to see multiple vendor demos, build a test case that each vendor will follow

This will allow you to compare apples to apples and peaches to peaches. Leave some open time at the end of the demo so the vendors can show features that were not covered well in the test case. One more point. Because departing from the standard vendor dog and pony show takes time on the part of the vendor, many will be unwilling to do this unless you are talking about a major purchase.

Be skeptical of data warehousing pundits' endorsements or reviews of technology

Often these pundits get compensated handsomely for these objective-appearing endorsements or reviews.

Read stock analyst reports on publicly held vendors and the industry outlook

Though these reports are intended mainly to get people to buy stocks, many times these reports can be an excellent source of background information on a vendor. Many libraries will have a large collection of these reports stored on CD.

Check how well the software handles maintenance

Most of the time spent with a software tool will be with maintenance. See how well the tool handles changes. For instance, most tools work with something like a data dictionary. See what are the consequences of changing the name of a field in the data dictionary. See how the dictionary helps you locate and change queries, reports, forms, macros, etc. that may be affected by the name change.

Understand the tradeoffs the software makes

Usually there is not a free lunch! Designers of tools trade off speed, capacity, computer resource consumption, ease of development, ease of use, and ease of maintenance. For example, several report and query tools can be made quite accessible to end users if you are willing to maintain extensive data dictionaries. Several OLAP tools attain quick retrieval times by requiring the storage of huge amounts of pre-calculated numbers. To prevent some nasty surprises once the tool has been purchased, make sure the persons making the buying decision understand these tradeoffs.

Go to the vendor road shows to talk with other attendees

Sometimes I think that the audience at the vendor road shows is the best source of information. If you make a point of talking with several other attendees, chances are you will come across a person who is at the same stage in evaluating warehousing tools. You will find that you and that person can exchange information that is mutually beneficial.

Check the financial stability of the vendor

If you work for an organization with an accounts receivable department, the people in that department can help you with this. A simple check could save you some major potential grief.

Have a representative team perform the evaluation

Often technology acquisitions fail or go awry because a group within an organization felt it did not get its views heard during the evaluation. One of the first steps in a technology evaluation is to identify all 'interested parties' in the acquisition. Make sure these parties are asked how they want to be represented in the evaluation. If parties that are in conflict with each other will actively participate, and you do not have the skills and/or patience to be a mediator, seek the services of an outside facilitator. Facilitation skills can be especially helpful if you have sessions dedicated to setting criteria, making your short list, and making the final decision.

If you're evaluating an end user tool, let an end user lead the evaluation effort

It seems odd but some organizations buy end user tools with little input from the end users of these tools.

An (Informal) Taxonomy of Data Warehouse Data Errors

You may have seen publications that tell you that you may have to spend the majority of your data warehouse development time building the means for both the initial and recurring extraction, transforming, and loading of data. What I have not seen, though, is much in-depth discussion of what exactly are those errors in the dirty data that you will spend your time cleaning up. Forewarned is forearmed. If you know the possibility that certain errors exist, you will be more prone to spot them and to plan your project to attack the errors in a manageable way. Perhaps the material in this paper can help you formulate a checklist of errors you will be checking for. What follows is a list of common errors. Also, if you are a relational database expert, bear with my imprecise use of some terminology. Finally, note that when I refer to a data warehouse, I refer to the database that is directly fed with data from the source systems - not the data marts (or whatever you want to call them) that are fed with cleansed data.

The categories of "errors"

I place "errors" into four categories. Quotations are around the word errors because some errors are not, in the metaphysical sense, erroneous. So, with some awkwardness, let me suggest that errors involve data that are either:

Incomplete

Incorrect

Incomprehensible

Inconsistent.

Incomplete errors

These consist of:

Missing records

This means a record that should be in a source system is not there. Usually this is caused by a programmer who diddled with a file and did not clean up completely. (I read a white paper about how users have to "fess up" about bad data. Actually, usually system personnel cause MANY more headaches than users.) Note you may not spot this type of error unless you have another system or old reports to tie to.

Missing fields

These are fields that should be there but are not. There is often a mistaken belief that a source system requires entry of a field.

Records or fields that, by design, are not being recorded

That is, by intelligent or careless design, data you want to store in the data warehouse are not being recorded anywhere. I further divide this situation into three categories. First, there may be dimension table attributes you will want to record but which are not in any system feeding the data warehouse. For example, the marketing user may have a personal classification scheme for products indicating the degree to which items are being promoted. Second, if you are feeding the same type of data in from multiple systems you may find that one of the source systems does not record a field your user wants to store in the data warehouse. Third, there may be "transactions" you need to store in the data warehouse that are not recorded in an explicit manner. For example, updating the source system may not necessarily cause the recording of a transaction. Or, sometimes adjustments to source system data are made downstream from the source system. Off-invoice adjustments made in general ledger systems are a big offender. In this case you may find that the grain of the information to be stored in the warehouse may be lost in the downstream system.

Incorrect errors

You can say that again! That is, the data really are incorrect.

Wrong (but sometimes right) codes

This usually occurs when an old transaction processing system is assigning a code that the transaction processing system users do not care about. Now if the code is not valid, you are going to catch it. The "gotcha" comes when the code is wrong but it is still a valid code. For example, you may have to extract data from an ancient repair parts ordering system that was programmed in 1968 to assign a product code of 100 to all transactions. Now, however, product code 100 stands for something other than repair parts.

Wrong calculations, aggregations

This situation refers to when you decide to or have to load data that have already been calculated or aggregated outside the data warehouse environment. You will have to make a judgment call on whether to check the data. You may find it necessary to bring data into the warehouse environment solely to allow you to check the calculation.

Duplicate records

There usually are two situations to be dealt with. First, there are duplicate records within one system whose data are feeding the warehouse. Second, there is information that is duplicated in multiple systems that feed in the same type of information. For example, maybe you are feeding in data from an order entry system for products and an order entry system for services. Unbeknownst to you, your branch in West Wauwatosa is booking services in both the product and service order entry systems. (The possibility of a situation like this may sound crazy until you encounter the quirks in real world systems.) In both cases, note that you may miss the duplicates if you feed already aggregated data into the warehouse.
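Both situations can be approached the same way once the detail records from the feeding systems are stacked together. The sketch below is a minimal illustration; the field names and the choice of which fields identify a duplicate are assumptions that builder and user would have to agree on.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Group records sharing the same values in key_fields.

    records: iterable of dicts; key_fields: tuple of field names that
    together should identify a record uniquely (an assumption).
    Returns {key: [records]} for keys that occur more than once.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}


# Hypothetical extracts from the product and service order entry systems.
product_orders = [
    {"source": "product_oe", "branch": "West Wauwatosa", "order_no": "S-17", "amount": 250},
]
service_orders = [
    {"source": "service_oe", "branch": "West Wauwatosa", "order_no": "S-17", "amount": 250},
    {"source": "service_oe", "branch": "West Wauwatosa", "order_no": "S-18", "amount": 90},
]
dupes = find_duplicates(product_orders + service_orders, ("branch", "order_no", "amount"))
print(dupes)
```

Note that the check works on detail records; once the data have been aggregated, the two copies of order S-17 would simply inflate a total.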

Wrong information entered into source system

Sometimes a source system contains data that were simply incorrectly entered into the system. For instance, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable "system of record".

Incorrect pairing of codes

This is best described by an example. Sometimes there are supposed to be rules that state that if a part number suffix is XXX, then the category code should be either A, B, or C. In more technical terms, there is a non-arithmetic relationship between attributes whose rules have been broken.
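A rule of this kind can be checked mechanically once it is written down. The sketch below is illustrative only: the rule table, the field names, and the assumption that the suffix is the last three characters of the part number are all hypothetical.

```python
# Hypothetical rule table: part-number suffix -> category codes allowed with it.
ALLOWED_CATEGORIES = {"XXX": {"A", "B", "C"}}


def pairing_violations(records, suffix_length=3):
    """Return records whose category code is not allowed for their suffix.

    Suffixes with no rule in ALLOWED_CATEGORIES are not checked.
    """
    violations = []
    for record in records:
        suffix = record["part_number"][-suffix_length:]
        allowed = ALLOWED_CATEGORIES.get(suffix)
        if allowed is not None and record["category"] not in allowed:
            violations.append(record)
    return violations


parts = [
    {"part_number": "1234XXX", "category": "B"},   # follows the rule
    {"part_number": "5678XXX", "category": "D"},   # breaks the rule
    {"part_number": "9012YYY", "category": "D"},   # no rule for this suffix
]
print(pairing_violations(parts))
```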

Incomprehensibility errors

These are the types of conditions that make source data difficult to read.

Multiple fields within one field

This is the situation where a source system has one field which contains information that the data warehouse will carry in multiple fields. By far the most common occurrence of this problem is when a whole name, e.g., "Joe E. Brown", is kept in one field in the source system and it is necessary to parse this into three fields in the warehouse.
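For the name example, a deliberately naive parse might look like the sketch below. It assumes space-separated names with no titles or suffixes, which real data will not always honor; the function name is hypothetical.

```python
def split_full_name(full_name):
    """Split 'First Middle Last' into (first, middle, last).

    A naive sketch: one-word names go in the first-name slot, and
    everything between the first and last words is treated as the middle part.
    """
    parts = full_name.split()
    if not parts:
        return ("", "", "")
    if len(parts) == 1:
        return (parts[0], "", "")
    return (parts[0], " ".join(parts[1:-1]), parts[-1])


print(split_full_name("Joe E. Brown"))
```

Names such as "Mary Ann Van Buren Jr." show quickly why this rule is only a starting point and why parsed names usually need review.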

Weird formatting to conserve disk space

This occurs when the programmer of the source system resorted to some out of the ordinary scheme to save disk space. In addition to singular fields being formatted strangely, the programmer may also have instituted a record layout that varies.

Unknown codes

Many times you can figure out what 99% of the codes mean. However, you usually find that there will be a handful of records with unknown codes and usually these records contain huge or minuscule dollar amounts and are several years old.

Spreadsheets and word processing files

Often in order to perform the initial load of a data warehouse it is necessary to extract critical data being held in spreadsheet files and/or "merge list" files. However, often anything goes in these files. They may contain a semblance of a structure with data that are half validated.

Many-to-many relationships and hierarchical files that allow multiple parents

Watch out for this architecture in source systems. It is easy to incorrectly transfer data organized in such a manner.

Inconsistency errors

The category of inconsistency errors encompasses the widest range of problems. Obviously similar data from different systems can easily be inconsistent. However, data within one system can be inconsistent across locations, reporting units, and time.

Inconsistent use of different codes

Much of the data warehousing literature gives the example of one system that uses "M" and "F" and another system that uses "1" or "2" to distinguish gender. May I suggest that you will wish this were the toughest data cleaning problem you will face.

Inconsistent meaning of a code

This is usually an issue when the definition of an organizational entity changes over time. For example, say in 1995 you have customers A, B, C, and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C. In 1998, customer A sells off part of what was A and C to customer D. When you build your warehouse in 1999, based on the type of business analysis you perform, you may face the dilemma of how to identify the sales to customers A, B, C, and D in previous years.

Overlapping codes

This is a situation where one source system records, say, all its sales to Customer A with three customer numbers and another source system records its sales to customer A with two different customer numbers. Now, the obvious solution is to use one customer number here. The problem is that there is usually some good business reason why there are five customer numbers.

Different codes with the same meaning

For example, some records may indicate a color of violet and some may indicate a color of purple. The data warehouse users may want to see these as one color. More annoyingly, sometimes spaces and other extraneous information have been inconsistently embedded in codes.

Inconsistent names and addresses

Strictly speaking this is a case of different codes with the same meaning. My unscientific impression of this type of problem is that decent knowledge of string searching will allow you to relatively easily make name and address information 80% consistent. Going for 90% consistency requires a huge jump in the level of effort. Going for 95% consistency requires another incremental huge jump in effort. As for 100% consistency in a database of substantial size, you may want to decide if sending a person to Mars is easier.

Inconsistent business rules

This, for the most part, is a fancy way of saying that calculated numbers are calculated differently. Normally, you will probably avoid loading calculated numbers into the warehouse but there sometimes is the situation where this must be done. As noted before, you may have to feed data into the warehouse solely to check calculations. - This can also mean that a non-arithmetic relationship between two fields (e.g., if a part number suffix is XXX, then the category code should be either A, B, or C) is not consistently followed.

Inconsistent aggregating

Strictly speaking this is a case of inconsistent business rules. In a nutshell, this refers to when you need to compare multiple sets of aggregated data and the data are aggregated differently in the source systems. I believe the most common instance of this type of problem is where data are aggregated by customer.

Inconsistent grain of the most atomic information

Certain times you need to compare multiple sets of information that are not available at the same grain. For example, customer and product profitability systems compare sales and expenses by product and customer. Often sales are recorded by product and customer but expenses are recorded by account and profit center. The problem occurs when there is not necessarily a relation between the customer or product grain of the sales data and the account - profit center grain of the expense data.

Inconsistent timing

Strictly speaking this is a case of inconsistent grain of the most atomic information. This problem especially comes into play when you buy data. For example, if you work for a pickle company you might want to analyze purchased scanner data for grocery store sales of gherkins. Perhaps you purchase weekly numbers. When someone comes up with the idea to produce a monthly report that incorporates monthly expense data from internal systems, you'll find that you are, well, in a pickle.
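One common workaround, and only a workaround, is to spread each weekly number evenly across its seven days and then total the days by month. The sketch below illustrates that idea; the even-spread assumption, the function name, and the sample figures are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def allocate_weeks_to_months(weekly_totals):
    """Spread each weekly total evenly over its seven days, then sum by month.

    weekly_totals: dict mapping the week's first day (a date) to a number.
    Returns {(year, month): allocated amount}. Even daily spreading is an
    assumption; real sales are rarely flat across a week.
    """
    monthly = defaultdict(float)
    for week_start, total in weekly_totals.items():
        for offset in range(7):
            day = week_start + timedelta(days=offset)
            monthly[(day.year, day.month)] += total / 7
    return dict(monthly)


# The week starting Monday 1999-03-29 has 3 days in March and 4 in April.
gherkin_sales = {date(1999, 3, 29): 700.0}
print(allocate_weeks_to_months(gherkin_sales))
```

Whatever allocation rule is chosen, it is itself a business rule, so it should be agreed with the users and applied consistently.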

Inconsistent use of an attribute

For example, an order entry system may have a field labeled shipping instructions. You may find that this field contains the name of the customer purchasing agent, the e-mail address of the customer, etc. A more difficult situation is when different business policies are used to populate a field. For example, perhaps you have a fact table with ledger account numbers. You may find that entity A uses account '1000' for administrative expenses while entity B uses '1500' for administrative expenses. (This problem gets more interesting if entity A uses '1500' and entity B uses '1000' for something other than administrative expenses.)

Inconsistent date cut-offs

Strictly speaking this is a case of inconsistent use of an attribute. This is when you are merging data from two systems that follow different policies as to dating transactions. As you can imagine, the issue comes up most with dating sales and sales returns.

Inconsistent use of nulls, spaces, empty values, etc.

Now this is not the hardest problem to correct in a warehouse. It is easy, though, to forget about this until it is discovered at the worst possible time.

Lack of referential integrity

It is surprising how many source systems have been built without this basic check.
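As a rough illustration of what checking for this looks like in practice, the sketch below uses Python's built-in sqlite3 module to list "orphan" rows, that is, child rows whose foreign key has no matching parent row. The table and column names (customer, sales, customer_id) are hypothetical, and the helper assumes the identifiers it is given are trusted, since it places them directly into the SQL text.

```python
import sqlite3

def find_orphans(conn, child, fk, parent, pk):
    """Return child rows whose foreign key value has no matching parent row.

    Identifiers are interpolated into the SQL, so pass only trusted names.
    Rows with a NULL foreign key are also reported.
    """
    sql = (
        f"SELECT c.* FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk} = p.{pk} "
        f"WHERE p.{pk} IS NULL"
    )
    return conn.execute(sql).fetchall()

# Hypothetical source tables: one sale points at a customer that does not exist.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO sales VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0);
""")

orphans = find_orphans(conn, "sales", "customer_id", "customer", "customer_id")
print(orphans)
```

Running a check like this against each feed before loading is one way to find the problem before the users do.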

Out of synch fact data

Certain summary information may be derived independently from data in different fact tables. For example, a total sales number may be derived from adding up either transactions in a ledger debit/credit fact table or transactions in a sales invoice fact table. Obviously there may be differences because one table is updated later than another table. Often, however, the differences are symptoms of deeper problems.

Some ending thoughts

I hope this paper adds to the understanding of what takes up the majority of time in a data warehouse. Let me offer the following ending thoughts:

Be prepared for a lot of tedious work.

Probably the most important "tools" for solving these problems are a sharp eye and endurance for checking an abundance of detail information.

You may spend much more time checking for errors than cleaning up errors.

Most of these errors do not jump out at you.

The errors of inconsistency are the most difficult to handle.

At least that is my experience.

The complexity of a data warehouse increases geometrically with the number of sources of data fed into it.

Having to reconcile inconsistent systems is the reason. For example, if it takes 100 hours to reconcile data from two source systems, you can expect that it will take on the order of 400, not 200, hours to reconcile data from four source systems.

The complexity of a data warehouse increases geometrically with the span of time of data to be fed into it.

My previous comment applies. Note, however, that reconciling inconsistencies over time may be even harder because the people who know what happened in previous years may not be around to answer your questions.

You will be faced with an economic and political question as to how erroneous the data in your system will be.

Completely fixing some of these problems can be quite expensive. More vexingly, often what constitutes "correct" data is debatable. What you do, more often than not, boils down to a question of money and politics.

Data Warehousing Political Issues

This paper is a list of political issues that frequently come up in data warehousing projects. People often get blind sided by politics. My hope is that this paper might give readers some advance warning of these issues. Though what is done about these issues varies by organization, I believe the best advice to data warehouse implementers is to do your best to spot these issues early and then pick your battles wisely.

I recommend that you read Marc Demarest's The Politics of Data Warehousing in conjunction with this paper. In his June 1997 paper, Marc comments on how little extended discussion of politics there is in the data warehousing literature. As of the writing of this paper, to the best of my knowledge, that situation still has not changed. This is unfortunate because ambitious data warehousing projects are rife with political issues.

My working definition of a data warehousing "political issue" is a situation where the equally valid and reasonable goals and interests of two or more parties collide with each other. That is, these are situations where there is great potential for conflict. Though these issues can appear minor and even petty, they can account for a good portion of the mental wear and tear experienced by data warehouse developers.

In this paper, I have classified the political issues into those that are within the IS organization (IS to IS), those that are between IS and the users (IS to Users), and those that are between users (User to User).

Finally, in this paper I try to list the political issues that are peculiar to data warehousing. Data warehousing experiences all the usual political problems (i.e., resources, deadlines, etc.) that occur in complex technology projects. Just check into literature about IS project management and you will find a wealth of material on these issues.

IS to IS issues

Internecine conflicts in IS projects can be the most difficult to deal with. Data warehousing projects probably are typical in this respect.

Where does the data warehousing development group report to

The issue is whether the data warehousing development group should be a free standing development organization or whether it should be part of a group that traditionally has concentrated its efforts on transaction processing development. Often transaction processing development organizations have been driven by their work order backlogs and the need to react to whatever is the crisis on hand. Some persons believe that data warehousing, however, best flourishes when done with an entrepreneurial orientation rather than with a reactive orientation. On the other hand, many organizations quickly come to depend on data warehousing systems for day-to-day work. These data warehousing systems need to be as "industrial safe" as some of the transaction processing systems. Placing the data warehousing effort in a separate development group can lessen knowledge transfer and appreciation of how to make data warehouses industrial safe.

Who should administer the data warehousing databases - the DBA group or the data warehousing development group

The need to make data warehouse database structure changes can be relatively frequent. Proliferating data marts, uncertainty about usage patterns, and the "I'll know what I want when I see it" nature of data warehouse development can necessitate table and index changes. Data warehouse developers, concerned about losing the favor and interest of data warehouse users, want changes made quickly and get quite frustrated being put on the DBA backlog. On the other hand, DBAs often have knowledge about how to make database processing industrial safe. Cutting the DBA organization out of the data warehousing support loop can deprive the data warehousing effort of some valuable wisdom.

How to gain the cooperation of feeder system developers who appear to have much more to lose than to gain in the data warehouse development effort

Data warehousing efforts often bring to light problems in feeder transaction processing systems that may have been "hidden" for years. The developers of these systems, whose knowledge is often crucial to the data warehousing effort, may be reluctant to help if they feel that the data warehousing effort is going to be an audit of their work.

Should feeder system problems be corrected in the data warehouse or in the feeder system

Actually, the question often becomes whether: 1) The feeder system should be fixed or 2) The feeder system should be left alone and the data in the warehouse should be fixed or 3) Data should be fixed in the data warehouse with the fixes fed back to the feeder system. And to further complicate matters, usually there are multiple problems with different groups suggesting different combinations of actions.

Against what data should reports be written

Often an organization quickly discovers that quite a few reports can be written against data in the data warehouse or against data in the transaction processing systems. This can be quite perplexing to organizations where there is not agreement as to what the data warehouse is for.

How big is the data warehousing batch processing window

Often there is need for a time period where transaction processing systems are kept stable so changes made to the systems can be captured and fed into the data warehouse. When changes cannot be easily identified, a typical course of action is to compare a previous copy of the transaction system database with the current database. After the changes are identified, a copy of the current database is made for comparison in the next processing cycle. In some firms, the need to "freeze" transaction processing system databases can cause inconveniences to other processing. How much time should be allotted to the window in which transaction processing system databases are frozen can be a source of contention.
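The compare-the-copies approach described above can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions: each snapshot is held in memory as a dictionary keyed by primary key, and the customer rows shown are made up. A real feed would compare far larger extracts, usually inside the database or with a sort/merge utility.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key.

    Returns three dicts: rows inserted since the previous snapshot, rows whose
    contents changed (with their new values), and rows that were deleted.
    """
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted

# Hypothetical customer table: key -> (name, state)
previous = {1: ("Acme", "NY"), 2: ("Globex", "TX"), 3: ("Initech", "CA")}
current = {1: ("Acme", "NY"), 2: ("Globex", "FL"), 4: ("Umbrella", "WA")}

inserted, updated, deleted = diff_snapshots(previous, current)
print("inserted:", inserted)
print("updated: ", updated)
print("deleted: ", deleted)

# After the changes are fed to the warehouse, the current snapshot becomes
# the "previous" copy for the next processing cycle.
previous = current
```

The comparison needs a consistent copy of the source tables, which is why the length of the freeze window becomes a point of negotiation.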

Who has ongoing responsibility for data quality monitoring

Data quality is not a one time concern to many firms that implement data warehouses. In a firm with complex feeder systems, it is not uncommon for previously undiscovered data quality problems to occur after the big push to clean data for the initial load of the data warehouse is done. Firms find it necessary to install procedures to regularly audit data quality. And in most firms it is unclear who should have responsibility for executing these procedures.

How are requests to make feeder transaction processing system changes approved and how is knowledge about the changes communicated

Small changes in feeder transaction processing systems can have major impacts on the feed to a data warehouse. Conflicts arise when transaction processing system developers, under pressure from their users to make changes, now have to work with data warehouse developers to assess the impact on downstream systems. Even more vexing situations come when a change is made in the feeder transaction processing system and is not communicated to the data warehouse developers.

IS to User issues

User issues can be especially thorny with data warehouses because, unlike with transaction processing systems, use of data warehousing systems is often optional. Unless data warehouses are tailored to their preferences, users may quickly decide not to use the data warehouse.

Why should users give up control of user managed databases

Many user departments have, on their own, developed databases that meet some of their key reporting needs. Often these systems were built by user organizations on their own because the IS organization was unwilling or unable to help the users or the users were skeptical about the level of support they would receive if they were to work with IS. It is highly likely that, when a data warehouse that will subsume the functions of these user managed databases is proposed, these users will be skeptical about whether the IS organization can do as good a job supporting the user reporting needs as the users did on their own.

How to gain the cooperation of a user whose spreadsheet is being automated

Often part of the goal of a data warehouse is to automate the production of a spreadsheet or series of spreadsheets that have been manually created by a user. Sometimes the user's corporate identity is tied to the spreadsheets and he or she feels (rightfully) threatened by the prospect of automation. This user's cooperation will be needed in the data warehouse development. Though dealing with this sensitive personnel issue probably should be the responsibility of user management, often the IS organization has the burden of figuring out how to gain cooperation.

Should design be for the needs of the masses or for the needs of the most demanding user

In many data warehousing projects it is not uncommon for the IS organization to find one to a handful of users whose "needs" go way beyond those of most of the data warehouse users. Usually, the need is for a far greater level of detail and/or for far more history and/or for a series of reports of a high degree of both technical and business complexity. It can be quite expensive and time consuming to satisfy the needs of these far more demanding users. On the other hand, these users can have a peculiar need that is especially beneficial to the business and/or can be people whose support is vital to the success of the project.

What requirements should be frozen; When should requirements be frozen (and unfrozen)

Data warehousing development is iterative. This does not mean that requirements never get frozen. Rather, there can be many start-stop cycles in data warehousing requirements definition. Also, some requirements may be frozen while some are always loose. Managing requirements definition in a data warehouse effort can require a deft political touch.

How many data marts should there be

Users want their own data marts for a variety of reasons. Some of the reasons are: 1) The desire to put their data on different hardware platforms so their reporting needs are less impacted by other people's processing 2) The desire to modify data at their own discretion (though this may strike terror in a data warehousing purist) 3) The desire not to have to work with other groups on resolving data definition issues. - Some reasons sometimes do make good business sense. Unfortunately, it can get quite expensive to support a proliferating number of data marts.

In how timely a manner are data corrected

Sometimes users are used to being able to make a correction to data and then immediately run reports against corrected data. Perhaps the users have been running reports against a transaction system database which could immediately be adjusted. Perhaps the users had their own database or spreadsheets which they could adjust at their will and then generate reports. Problems come if data warehouse developers design systems so corrections are now incorporated into the data warehouse during a batch feed at the end of the day or at the end of the week or at the end of the month.

Who should have responsibility for maintaining data warehouse data not fed by transaction processing systems

Often as part of a data warehouse it is necessary to manually maintain dimension tables and conversion tables that contain data not in any transaction processing system. Also, sometimes budget, forecast, or quota data must be manually maintained. This maintenance can be quite involved. Determining whether users and/or IS should bear the maintenance burden can be a major issue.

Who is in charge of ongoing audit of data quality

As mentioned before, data errors pop up after the data warehouse is implemented. For example, problems occur because sometimes data is not fed from the transaction processing systems or fed multiple times. Many times it is necessary to make someone explicitly responsible for regularly auditing data. However, it often is not clear who this person should be.

How to pass responsibility for running and maintaining a report from the users to IS

Users write reports that the business comes to depend on for day-to-day functioning. Here is what often happens: 1) The reports become too technically difficult for the users to change and/or 2) The report "code" becomes lost or corrupted and/or 3) The user leaves the organization (usually without documenting the report). In these cases, IS usually gets called in. This need to obtain IS involvement can create great consternation in an IS organization that thought that building a data warehouse was going to get it out of the report writing business.

User to User issues

These are issues that involve potential conflicts among the users of a data warehouse. This does not mean that IS is not involved. Rather, IS can be right in the middle between users.

Who has access to what data

As can be imagined, one business group may not want another business group to see its data and one location may not want another location to see its data. Also common is for division personnel not to want corporate personnel to see detail division data. Perhaps more complicated to deal with are concerns of one user group that another user group may misinterpret data. Often one functional area thinks another won't understand certain data, e.g., Sales say Finance won't understand "its" numbers and Finance says Sales won't understand "its" numbers. Often people whose formal job it is to analyze information question whether people whose formal job is not to analyze information will misinterpret data, e.g., financial and market analysts question whether line accountants and sales people can understand certain data.

What dimensions, attributes, calculations should be defined similarly

You may have seen some data warehousing literature that talks about how the data warehouse should create a "common view" (or some similar term) of all the data. To put this in what I believe are more concrete terms, I believe that this refers to making sure that dimensions conform, that attributes are used consistently, and that calculations are always calculated the same way. Though this is a nice ideal, I believe that most firms do not have the patience to do this. Rather, through a great deal of give and take, firms implementing a data warehouse decide on a subset of dimensions, attributes, and calculations that are worth the effort of defining consistently.

How to define a customer; How is profitability calculated

Most firms end up wanting to determine similar definitions of customers and profitability. It is my opinion that these definition tasks probably cause more political issues than any other definition tasks. - Note that a common use of a data warehouse is to report profitability for internal purposes in a way more meaningful than profitability as calculated per generally accepted accounting principles. It is very common to want to report profitability by customer and/or by product. If so, the firm may have issues as to what a customer is. A customer may be a legal entity, it may be a location, or it may be the people performing a function for a legal entity or a location, etc. To determine profitability, it may be necessary to include expense allocations, the determination of which can be politically contentious. Finally, another common major issue regarding profitability is when a sale should be recognized.

Who has final say over the correctness of data

If multiple user organizations are going to be accessing the same data, there will be ongoing disagreements about the "correctness" of data added to the data warehouse. These debates about correctness will not be about which items are in error. Rather, these will be debates regarding interpretation of data. Note that an unexpected consequence of data warehousing is that while before users might be able to reconcile their differences by making adjustments to summarized numbers, data warehousing may force them to agree on how the detail should be interpreted.

Conclusion

If you go through these issues I believe you will see three common threads regarding why data warehousing projects engender political issues: 1) Data warehousing imposes new obligations whose responsibilities are unclear 2) Data warehousing requires changes in processes that an organization is comfortable with 3) Data warehousing requires agreement on some, but not all, definitions of data.

Different Aspects of Data Warehouse Architecture

This page is a list of the different aspects of data warehouse architecture. Architecture is a pretty nebulous term. I think of architecture as a system design decision that is usually not easily changed. The decision is not easily changed because of the amount of work, money, and politics involved in doing so. This is a list of aspects of architecture that the data warehouse decision maker will have to deal with themselves. There are many other architecture issues that affect the data warehouse, e.g., network topology, but these have to be made with all of an organization's systems in mind (and with people other than the data warehouse team being the main decision makers.)

This list will not attempt to provide detailed explanations of the different types of architecture. Rather, I am presenting this list because the data warehousing literature usually muddles the subject of architecture by lumping different types of decisions together or by forgetting certain types of decisions.

Also, the literature makes these decisions seem much more black and white than they are. For example, in the area of what I call reporting and staging data store architecture, much of the literature discusses only the "enterprise" data warehouse, the dependent data mart, and the independent data mart options. In reality, there are many more variations being used that cannot easily be given a snappy label.

Data consistency architecture

Doug Hackney's excellent but confusingly titled article on what he calls incremental data mart enterprise architecture is the most succinct statement of what this means. This is the choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. (Though the article does not say it explicitly, it is also the equally important choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses not to put into common usage.) This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision. In my opinion, the decisions involved in determining this architecture should drive all other architectural decisions. Unfortunately, the determination of this architecture often seems to be backed into rather than consciously made.

Reporting data store and staging data store architecture

The main reasons we store data in a data warehousing system are so they can be: 1) reported against, 2) cleaned up, and (sometimes) 3) transported to another data store where they can be reported against and/or cleaned up. Determining where we hold data to report against is what I call the reporting data store architecture. All other decisions are what I call staging data store architecture.

As mentioned before, there are infinite variations of this architecture. Many writings on this aspect of architecture take on a religious overtone. That is, rather than discussing what will make most sense for the organization implementing the data warehouse, the discussion is often one of architectural purity and beauty or of the writer's conception of rightness and wrongness.

Data modeling architecture

This is the choice of whether you wish to use denormalized, normalized, object-oriented, proprietary multidimensional, etc. data models. As you may guess, it makes perfect sense for an organization to use a variety of models.

Tool architecture

This is your choice of the tools you are going to use for reporting and for what I call infrastructure.

Processing tiers architecture

This is your choice of what physical platforms will do what pieces of the concurrent processing that takes place when using a data warehouse. This can range from an architecture as simple as host-based reporting to one as complicated as the diagram on page 32 of Ralph Kimball's "The Data Webhouse Toolkit".

Security architecture

If you need to restrict access down to the row or field level, you will probably have to use some means other than the usual security mechanisms at your organization. Note that while security may not be technically difficult to implement, it can cause political consternation.

As a final comment, let me assert that in the long run, decisions on data consistency architecture will probably have much more influence on the return on investment in the data warehouse than any other architectural decisions. To get the most return from a data warehouse (or any other system), business practices have to change in conjunction with or as a result of the system implementation. Conscious determination of data consistency architecture is almost always a prerequisite to using a data warehouse to effect business practice change.

What to Learn About in Order to Speed Up Data Warehouse Querying

This paper is a laundry list of items data warehouse implementers may wish to learn more about in order to speed up their data warehouse queries or to make the data warehouse "environment" more responsive to the bulk of the data warehouse query users. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up querying. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up queries. This list includes topics that are relevant to many of the relational database and data access tool technologies. Some topics that apply, to the best of my knowledge, to one or two vendors' technologies are not listed.

SQL SELECT statements

This is bedrock knowledge. It is quite worthwhile to get a book on SQL (there are quite a few good ones) and review (or learn) this topic. Though you may think that your query tool's SQL generation capabilities lessen the need for this knowledge, you will eventually find the SQL knowledge quite helpful.

How your database joins tables, unions tables, uses indexes, and chooses access paths

This is some more bedrock knowledge. Unfortunately, this information may not be that accessible. If the information exists, it may be poorly written, written for an academic audience, and/or scattered among many manuals. Nevertheless, it is worth making a determined effort to understand these topics. - The vendor/consultant community would do itself well if it tried much harder to communicate this information in coherent and comprehensible terms.
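Most relational databases can show you the access path they choose for a query, though the command and the output format differ by vendor. As one small, self-contained illustration (not a statement about any particular warehouse product), the sketch below uses SQLite through Python's sqlite3 module and its EXPLAIN QUERY PLAN statement to show the same query before and after an index is added. The sales table and the idx_sales_customer index are hypothetical.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable detail text of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the description of the plan step.
    return [row[-1] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

sql = "SELECT amount FROM sales WHERE customer_id = ?"

plan_before = query_plan(conn, sql, (42,))   # expect a full scan of the table
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
plan_after = query_plan(conn, sql, (42,))    # expect a search using the index

print("before index:", plan_before)
print("after index: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so treat the printed output as something to read rather than something to parse.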

What statistics your database provides on query execution

Sometimes those of us building stores of information for users to analyze forget about our own information needs. You need this information to identify which queries are especially resource consumptive. You probably will be concerned with a clump of queries that are far more consumptive than average. Sometimes the resolution of consumption issues is a simple rewrite of the query. Sometimes resolution is more technically involved and requires doing many things listed in this paper. And sometimes the solution is to do nothing - you just have to accept that your data warehouse has to support these demanding queries.

Aggregate tables

This is probably the most used method of speeding up queries. There are many discussions of this in the literature. The books "The Data Warehouse Lifecycle Toolkit", "The Data Warehouse Toolkit", and "Data Warehousing in the Real World" have especially good non-technology specific discussions of this topic.
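For readers who have not seen one, a minimal sketch of an aggregate table follows, again using SQLite through Python's sqlite3 module. The sales_fact and sales_by_month tables and their columns are hypothetical. The idea is simply that the summary is computed once at load time so that monthly queries read a few summary rows instead of scanning every detail row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Detail-level fact table (hypothetical).
    CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO sales_fact VALUES
        ('2024-01-05', 'gherkins', 10.0),
        ('2024-01-20', 'gherkins', 15.0),
        ('2024-02-03', 'gherkins', 7.0),
        ('2024-02-03', 'relish', 4.0);

    -- Aggregate table: one row per month and product, built at load time.
    CREATE TABLE sales_by_month AS
        SELECT substr(sale_date, 1, 7) AS month,
               product,
               SUM(amount) AS total_amount,
               COUNT(*) AS row_count
        FROM sales_fact
        GROUP BY substr(sale_date, 1, 7), product;
""")

# A monthly report now reads the small aggregate table, not the detail rows.
monthly_gherkins = conn.execute(
    "SELECT month, total_amount FROM sales_by_month "
    "WHERE product = ? ORDER BY month",
    ("gherkins",),
).fetchall()
print(monthly_gherkins)
```

The price of the faster query is that the aggregate has to be rebuilt or adjusted every time the detail table is loaded.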

Aggregate navigators/query redirectors

This is the technology that automatically directs a query to aggregated data if such data are available and appropriate for the query.
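The core decision such a tool makes can be sketched very simply. The Python below is only an illustration of the idea, not how any particular product works: it assumes a hypothetical catalog (TableInfo) that records which dimensions and additive measures each table carries, and it picks the smallest table that covers the request, falling back to the detail fact table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TableInfo:
    """Hypothetical catalog entry describing a fact or aggregate table."""
    name: str
    dimensions: frozenset
    measures: frozenset
    row_count: int

def choose_table(base, aggregates, dimensions, measures):
    """Pick the smallest aggregate that covers the requested dimensions and
    measures; fall back to the base fact table when none qualifies."""
    needed_dims = frozenset(dimensions)
    needed_measures = frozenset(measures)
    candidates = [
        t for t in aggregates
        if needed_dims <= t.dimensions and needed_measures <= t.measures
    ]
    return min(candidates, key=lambda t: t.row_count, default=base)

base = TableInfo("sales_fact", frozenset({"day", "month", "product", "customer"}),
                 frozenset({"amount"}), 50_000_000)
aggregates = [
    TableInfo("sales_by_month_product", frozenset({"month", "product"}),
              frozenset({"amount"}), 20_000),
    TableInfo("sales_by_month", frozenset({"month"}),
              frozenset({"amount"}), 60),
]

print(choose_table(base, aggregates, ["month"], ["amount"]).name)
print(choose_table(base, aggregates, ["month", "product"], ["amount"]).name)
print(choose_table(base, aggregates, ["customer"], ["amount"]).name)
```

This only works for measures that can be safely re-summed from the aggregate, which is part of what "appropriate for the query" means.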

Partitioning

This is probably the second most common method of speeding up queries. Note that partitioning comes in many ways, shapes, and forms. At the very least, it is dividing one table into several tables usually based on the time the table data represent. Note that both tables and indexes may be partitioned.

B-tree indexing

Adding numerous indexes is another common method for speeding up queries. Note that persons with a transaction processing mindset may have a hard time accepting as much use of these indexes as is usually helpful in a data warehouse.

Dimensional modeling

With certain database technologies, this modeling can reduce the amount of sort/merging that goes on when joining tables. And, some query tools may generate more efficient SQL if data are modeled dimensionally. Also, if you use surrogate keys in conjunction with dimensional modeling, joins may be more efficient.

Parallelizing query execution

Developments in database technology have made doing this much easier. Note, however, the number of users running queries and the amount of data to be returned in a query can sometimes limit this technique's effectiveness.

Archiving/purging data

Sometimes the cost of having to scan through older data exceeds the benefit of having it available in the unlikely possibility someone wants to examine it.

Reducing the width of large tables that get scanned

There are also many ways to do this. Before getting fancy with this it is worth taking the time to understand what actually takes up space in your database tables.

Completely denormalizing aggregate tables

If these tables can be heavily indexed and can be maintained by complete refreshing, the requirements of join processing can be eliminated.

Loading tables completely in memory

Presuming the memory is available to do this and you have researched other topics in this paper, this may be an interesting strategy.

Bit mapped indexing

This technique can work well when a field takes on a low number of distinct values (i.e., low cardinality) and tends to be in WHERE clauses often.
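To see why low cardinality matters, here is a toy sketch of the idea in Python. It is not how any database stores its bitmaps (real implementations compress them), and the region and status columns are made up. Each distinct value gets one bitmap with a bit per row, so a WHERE clause that combines conditions becomes a cheap bitwise AND.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)

def matching_rows(bitmap):
    """Expand a bitmap back into an ascending list of row numbers."""
    rows = []
    row_number = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_number)
        bitmap >>= 1
        row_number += 1
    return rows

# Two hypothetical low-cardinality columns of a six-row table.
region = ["east", "west", "east", "north", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'open'
hits = matching_rows(region_index["east"] & status_index["open"])
print(hits)
```

With only a few distinct values there are only a few bitmaps to keep; a column with millions of distinct values would need millions of them, which is where this approach stops paying off.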

Striping files

This means spreading a file over several physical disks. Look into the topic of RAID for more details.

Locating different files used concurrently on different disks

This is basic stuff but it can be helpful.

Defragmentation of table and index files

This is more basic stuff.

Solid State Disk

Supposedly prices have come down in the last few years.

Disk controllers

Too few can be a query bottleneck.

What your query tool attempts to do via SQL and what it does internally

The book "The Data Warehouse Toolkit" has a good discussion of where query tools may fall short. The reason you need to learn about this is to prevent using the query tool where it is inefficient or to know when you might build some "get arounds".

Query scheduling capabilities

This does not necessarily speed up a given query. However, scheduling resource consumptive queries for off-hours times may free up resources for other queries during prime time.

Query queuing

As with scheduling, this does not speed a given query up. However, this facility gives you a means so priority queries (such as a query needed to gain information for the monthly close of the financial books) can execute faster.

Query accelerators

These help you generate more efficient SQL. Note that they are probably more helpful to those who report off of highly normalized databases.

Query governors

These stop queries usually after a specified number of rows have been returned and/or a specified time has elapsed.
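A client-side approximation of a governor can be sketched as follows, using SQLite through Python's sqlite3 module. The function name governed_query, the QueryLimitExceeded exception, and the default limits are all hypothetical. Note the simplification: the limits are checked only as rows come back, so a query that grinds for a long time before returning its first row is not interrupted. Governors built into a database or query tool work inside the engine and do not have that gap.

```python
import sqlite3
import time

class QueryLimitExceeded(Exception):
    """Raised when a query exceeds its row or time allowance."""

def governed_query(conn, sql, params=(), max_rows=1000, max_seconds=5.0):
    """Run a query, stopping it once it returns more than max_rows rows
    or has been running longer than max_seconds (checked between rows)."""
    start = time.monotonic()
    rows = []
    for row in conn.execute(sql, params):
        if len(rows) >= max_rows:
            raise QueryLimitExceeded(f"query returned more than {max_rows} rows")
        if time.monotonic() - start > max_seconds:
            raise QueryLimitExceeded(f"query ran longer than {max_seconds} seconds")
        rows.append(row)
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")
conn.executemany("INSERT INTO numbers VALUES (?)", [(i,) for i in range(10)])

print(len(governed_query(conn, "SELECT n FROM numbers", max_rows=100)))

try:
    governed_query(conn, "SELECT n FROM numbers", max_rows=5)
except QueryLimitExceeded as exc:
    print("stopped:", exc)
```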

Query nannies

This is my term for technologies that warn (scold?) the user if he submits an inefficient query. Some of these provide hints about how to make the query more efficient and some (I have heard) actually try to fix up the queries.

"Productionizing" regularly used, highly resource consumptive queries

Certain queries probably should be written by someone with a great deal of knowledge of how to make queries efficient.

Storing the image of the report

If a report based on a query is used by many people and on-line retrieval of the report is needed, the image of the report may be stored. The query then need be run only once and perhaps at a less busy time. There are tools that allow intelligent retrieval of stored report data.

Query tool caching of results

Some tools store the results of some queries. If the same query is run again, the tool may check to see if the results are stored. Or, if a subset of a previously retrieved result set is desired, the tool will read the previously retrieved query result set rather than the data warehouse.
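The simplest form of this, reusing the result when exactly the same query is run again, can be sketched as below. The CachingQueryRunner class is hypothetical and deliberately naive: it keys the cache on the SQL text plus parameters and must be told when the warehouse has been reloaded, otherwise it will happily serve stale results. Answering a subset request from a stored result set, as some tools do, takes considerably more machinery than this.

```python
import sqlite3

class CachingQueryRunner:
    """Hypothetical wrapper that remembers result sets for repeated queries."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}
        self.hits = 0

    def run(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self.cache:
            self.hits += 1                 # served without touching the database
            return self.cache[key]
        rows = self.conn.execute(sql, params).fetchall()
        self.cache[key] = rows
        return rows

    def invalidate(self):
        """Call after each warehouse load so stale results are not served."""
        self.cache.clear()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("gherkins", 10.0), ("gherkins", 15.0), ("relish", 4.0)])

runner = CachingQueryRunner(conn)
sql = "SELECT SUM(amount) FROM sales WHERE product = ?"
first = runner.run(sql, ("gherkins",))    # goes to the database
second = runner.run(sql, ("gherkins",))   # served from the cache
print(first, second, "cache hits:", runner.hits)
```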

Query tool preview of a subset of records

When a query is being developed, some tools make it easy to retrieve a small subset of records that meet the query criteria. This makes it quicker to test the query and cuts down the number of potentially expensive test queries.

Making two copies of the data warehouse - one for "operational" users and one for "analytical" users

It actually is hard to draw a line between what is operational use and what is analytical use of a data warehouse. However, in a typical data warehouse most of the users (usually with more "operational" needs) are running IS written, parameterized queries. A relatively small number of users (usually with more "analytical" needs) are running potentially highly resource consumptive ad hoc queries. - Though it is not necessarily pretty, sometimes the best way to handle this mixed use of the data warehouse is to create a separate copy of the data warehouse for each user group.

Multi-tiered architectures/Application partitioning

Some query tools allow you to run different components (i.e., "tiers" or "partitions") of the tool on different hardware servers.

Network bottlenecks

Though you do not have to become an expert at network topologies, if some of your users will run queries that generate large result sets (and do not assume that only lengthy reports bring back large result sets to the query tool), it pays to trace the flow of data from the server to the user's workstation in order to see if there are any mismatched network components. For example, Fast Ethernet may be in your new facility but your user may have a 10Mbps network interface card. Or, your user may have a card that was advertised to perform at 100Mbps which in actuality performs at 30Mbps. Also, find out how your network people load balance. They are more used to dealing with predictable transaction processing than extremely variable data warehousing demands. And if necessary, find out the costs of dropping more cable so you can put your users that run large result set producing queries on dedicated network segments. If you have invested millions in the data warehouse, the cost of an electrician and wire may be worth it.

Database technology designed specifically for data warehousing and third party indexing technology designed to speed up queries

Look at my Database page and Query and Load Accelerators page for more information.

The cost of installing more/faster CPU, memory, disk

Sometimes buying metal is (by far) the least expensive way to speed up your queries.

Some final thoughts about speeding up queries:

You best expect that many of your queries are going to run a "long" time. You will prevent some problems if you spend some time teaching your users about what, in general, will take a long time.

In line with what I just said, you can spend plenty of time tuning queries. Though many IS people like to spend their time tuning queries, this tuning time can take IS away from other data warehouse problems whose solution is more meaningful to the business.

In reality the area of speeding up queries involves plenty of guesswork, doing things by intuition, trial and error, and making uncomfortable trade-offs.

What to Learn About in Order to Speed Up Data Warehouse Loading

This paper is another laundry list of items data warehouse implementers may wish to learn more about in order to speed up the process of extracting, transforming and loading data (henceforth simply referred to as loading) or to make these processes less prone to errors. This paper will not attempt to provide detailed explanations of these topics. Nor is including a topic in this list a declaration that knowledge of the topic will definitely speed up loading. Rather, data warehouse implementers may use this paper as a starting point in their search for ways to speed up loading. This list does not include points relevant to a specific vendor's technology. Your DBA should know some ways of speeding up the load that apply only to the technology of your DBMS vendor.

How often the users really need updated data

Oftentimes data warehouse developers unquestioningly give in to the most extreme demands for freshness of data or they automatically assume data need to be updated far more often than makes business sense. Though you read sometimes ridiculous articles in the trade press and from industry analysts (who have coined the awful term "information latency") about how the business world wants to know everything immediately, the reality is quite different. If your data warehouse is not there to support day-to-day monitoring and analysis, question why it should be updated daily. If your data warehouse is not there for week-to-week monitoring and analysis, question why it should be updated weekly. By the way, though, if you do decide to update weekly or monthly, try to design your loading process so you are not tied to loading at a specific interval. There may be certain "crunch" times when you have to load more frequently.

How to drop and re-establish indices and how to set index fill factors

If you update a large portion of the database (I've heard estimates from 10 - 25% up), you may want to learn about dropping indices before a database load and then re-establishing them after the load. If you do not drop indices, you want to make sure you set the index fill factors so your server's disk drives do not waste time looking for space in which to write index updates.
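The drop-load-rebuild pattern can be sketched as follows in Python with SQLite. The function name load_with_index_rebuild, the sales table, and the index name idx_sales_customer are assumptions for illustration. Whether this actually beats leaving the index in place depends on your DBMS and on how much of the table you touch, so treat it as something to measure rather than a rule.

```python
import sqlite3


def load_with_index_rebuild(conn, rows):
    """Drop the index, bulk insert the rows, then rebuild the index once."""
    conn.execute("DROP INDEX IF EXISTS idx_sales_customer")
    conn.executemany(
        "INSERT INTO sales (customer_id, amount) VALUES (?, ?)", rows)
    # One index build over the finished table instead of one update per row.
    conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")
    conn.commit()


# Illustrative schema and data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

load_with_index_rebuild(conn, [(i % 100, float(i)) for i in range(10_000)])

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list(sales)")]
```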

What facilities does the database have for bulk loading data and which of those facilities does it make sense to use

Many databases have ways of speeding up loading at the expense of data integrity checking. Note that certain bulk loaders do more than load - they will reformat data and sometimes aggregate data.

What input file formatting will speed up bulk loading

Oftentimes operations done on the input data on the feeder system platform (e.g., sorting, eliminating packed and signed fields) can speed up loading.

How to parallelize table load and index maintenance or re-creation

Dropping indices and bulk loading in parallel can drastically improve loading time. By the way, learn the differences between pipeline, component, and data parallelism. Given the circumstances, these different types of parallelism can have widely varying amounts of effect.

How to load databases via a stream

Certain ETL tools will allow you to extract, transform, and load in one process. That is, it is not necessary to create intermediate files. You do, though, have to be careful about data source, platform, size, scalability restrictions and limitations on how sophisticated your transformations can efficiently be.

How indices are used by your database optimizer

You need to learn this so you can figure out whether your indices are actually going to get used. In more recent versions of DBMS software, you may be able to get away with fewer indices than in older versions.

What integrity checks should be done in the loading process

After you perform the initial load of data warehouse tables, you may want to start a "discussion" of how all the errors you found should be trapped in the feeder systems (preferably at data entry time).

Where does it make sense to transform the data

There may be faster places to do it than in your data warehouse database system. You may want to work with flat files and a dedicated sort/merge utility either on the data warehouse platform or, if the source data are on another platform, you may want to do it on that platform. The problem with doing this on the source system platform, though, is that you then will need people skilled in that platform and you may be invading someone else's fiefdom.

Where processes can be done in memory

If you have got the available memory, learn how to use it. Sorts especially can be sped up by doing them in memory.

What domain integrity checks should be in the data warehouse database

Depending on how you resolve the above two issues, you have to investigate the sensibility of incorporating referential integrity or any other type of domain integrity checking in your database.

Where does it make sense to aggregate the data

Sometimes if you do the aggregating outside the data warehouse database environment you can create multiple aggregate output files in one "pass" of the input data. You will probably have to learn how to use memory very carefully if you do this (and have a lot of memory on the server on which you are doing the aggregating).
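As a minimal sketch of the one-pass idea, the Python below reads each input record once and updates two aggregates at the same time. The function name aggregate_in_one_pass and the (region, product, amount) record layout are assumptions for illustration. All running totals are held in memory, which is exactly why memory becomes the constraint as the number of aggregates and distinct keys grows.

```python
from collections import defaultdict


def aggregate_in_one_pass(records):
    """Build a by-region and a by-product total while reading each record once."""
    by_region = defaultdict(float)
    by_product = defaultdict(float)
    for region, product, amount in records:
        by_region[region] += amount
        by_product[product] += amount
    return dict(by_region), dict(by_product)


# Illustrative input; in practice this would be streamed from an extract file.
records = [
    ("East", "Widget", 10.0),
    ("West", "Widget", 5.0),
    ("East", "Gadget", 2.5),
]
region_totals, product_totals = aggregate_in_one_pass(records)
```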

What statistics are available on aggregate table usage

As you might have read ad nauseam, building a data warehouse is an iterative undertaking. You will probably create aggregates that seldom get used. You need these statistics for making the case for deleting the aggregates (though be forewarned this can get you into a quirky political aspect of data warehouse management.)

At what level it makes sense to aggregate data and what non-additive measures are sensible to include in your aggregate tables

Say you have region, territory, customer, product, and salesperson dimensions. You may find that you get the most benefit by creating a region, territory, customer, product, and salesperson aggregate and, say, that an additional region, territory, customer, product aggregate adds little to the performance of your queries. A complicating factor, though, is use of non-additive measures in your aggregates because they will force you to re-aggregate. Suffice it to say that you should think twice before adding these measures to your aggregates.
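A small worked example in Python shows why a non-additive measure forces re-aggregation. An average stored at the territory level cannot be rolled up to the region level by averaging the averages, so you have to go back to additive components (here a total and a count). The row layout and numbers are invented for illustration.

```python
# Territory-level aggregate rows: (region, territory, total_amount, order_count)
territory_rows = [
    ("East", "T1", 100.0, 1),
    ("East", "T2", 200.0, 4),
]

# The non-additive measure as it might be stored in the aggregate.
territory_averages = [amount / count for _, _, amount, count in territory_rows]

# Rolling up by averaging the stored averages gives a misleading number...
average_of_averages = sum(territory_averages) / len(territory_averages)

# ...while recomputing from the additive components gives the real one.
total_amount = sum(row[2] for row in territory_rows)
total_orders = sum(row[3] for row in territory_rows)
true_region_average = total_amount / total_orders
```

Here the average of averages is 75.0 while the true average order amount for the region is 60.0. Storing the sum and the count, and deriving the average at query time, is one common way around the problem.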

What are non-FTP ways of transferring data

FTP-ing can be slow. There are a number of high speed transfer technologies to investigate. Also, don't forget about tape. Even if you have to send a tape overnight for early delivery, tape is sometimes the fastest way to transfer data. Also, don't forget about using compression technology in conjunction with transferring.

Whether you should incrementally update or rebuild a table

Sometimes you have the option to either incrementally update a table or rebuild a table. You may find that after a certain level of update activity it is faster to rebuild than to update. A rule of thumb sometimes stated is that if 20% of the records will be updated, it is faster to rebuild. This is a rough rule and the actual threshold will vary. Nevertheless, if you have options, it may be worth experimenting with them.

What are alternate methods for changed data capture

Presuming you must incrementally update your data warehouse database and you are not extracting from date stamped transaction records in the feeder system, you may find you have a technically daunting task in capturing changed information. Be aware that you may have options in how you do this and the options will differ in speed.
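One of those options, comparing today's extract against yesterday's, can be sketched in Python as below. This is only one approach among several (others include database triggers and reading transaction logs), and the function name diff_snapshots and the keyed-dictionary layout are assumptions for illustration.

```python
def diff_snapshots(previous, current):
    """Compare two {key: row} snapshots and classify the changes."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {k: current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]}
    return inserted, updated, deleted


# Illustrative customer snapshots keyed by customer id.
yesterday = {1: ("Acme", "NY"), 2: ("Baker", "TX"), 3: ("Cole", "CA")}
today = {1: ("Acme", "NY"), 2: ("Baker", "FL"), 4: ("Dunn", "WA")}

inserted, updated, deleted = diff_snapshots(yesterday, today)
```

Snapshot comparison needs no changes to the feeder system, but it reads both full extracts on every run, which is why it tends to be one of the slower options on large tables.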

How to modify feeder systems so changes to records are written to flat files

Though this usually is not worth it, if this is done it can eliminate the time needed to go through sometimes time consuming, convoluted processing to determine what feeder system data has changed.

How to use report scraping software

If a report that has the data you need to extract is available, sometimes it makes sense to put the report image in a file and use software specially designed to extract data from report image files. You do run a risk if the report format changes. But this technique often makes sense for extracting data from systems whose code hasn't been touched in the last ten years.
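To illustrate what report scraping involves, here is a minimal hand-rolled sketch in Python rather than a commercial scraping product. The report layout, the DETAIL_LINE pattern, and the scrape function are all invented for illustration. The pattern keeps only detail lines and skips titles, column headings, rules, and totals.

```python
import re

# A made-up report image, as it might be captured to a file.
REPORT = """\
MONTHLY SALES REPORT                 PAGE 1
REGION     TERRITORY      AMOUNT
---------- ---------- ----------
East       Boston       1,250.00
East       Albany         310.50
West       Reno           975.25
                      ----------
TOTAL                   2,535.75
"""

# Detail lines: two text columns followed by an amount with two decimals.
DETAIL_LINE = re.compile(r"^(\S+)\s+(\S+)\s+([\d,]+\.\d{2})$")


def scrape(report_text):
    """Return (region, territory, amount) tuples for each detail line."""
    rows = []
    for line in report_text.splitlines():
        match = DETAIL_LINE.match(line.rstrip())
        if match:
            region, territory, amount = match.groups()
            rows.append((region, territory, float(amount.replace(",", ""))))
    return rows


rows = scrape(REPORT)
```

Checking the scraped detail rows against the report's own total line is a cheap way to notice when the report format has changed underneath you.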

How to perform disk mirroring and hot backups

Disk mirroring and hot backups will not speed up loading the data warehouse database (in fact, if a disk is mirrored while being bulk loaded, loading time can greatly increase) but they can give you some greatly desired flexibility and breathing room. With mirrored disks, you can "break" the mirror, update the copy, and restore the mirror with the updated copy. This means that you can still have your data warehouse available while loading it. (Though be careful that you understand how mirroring can be handled by both hardware and software). Similarly, hot backups allow you to have your data warehouse database available when backing it up. By the way, a cycle of partial backups followed by a full backup is also worth looking into.

How to schedule loading processes

Loading a data warehouse usually requires quite a few processes. Obviously, you want to understand where there are and are not dependencies so you can "multi-task" these processes as much as possible. Where there are dependencies, you want to do risk analyses so you can find out whether it is worth the effort to build in restart capabilities in the intermediate processes. And you want to make sure you have the human and automated support for scheduling the way you want to.
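The dependency analysis can be sketched in Python with the standard library's graphlib module (Python 3.9 or later). The process names and the plan_waves helper are hypothetical. Each "wave" is a set of processes with no unmet dependencies, so everything within one wave could in principle run at the same time.

```python
from graphlib import TopologicalSorter

# Hypothetical load processes mapped to the processes they depend on.
dependencies = {
    "extract_customers": set(),
    "extract_orders": set(),
    "load_customer_dim": {"extract_customers"},
    "load_order_facts": {"extract_orders", "load_customer_dim"},
    "build_aggregates": {"load_order_facts"},
}


def plan_waves(graph):
    """Group processes into waves; processes in the same wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(dependencies)
```

In this example the two extracts form the first wave and everything after that is strictly sequential, which shows at a glance where multi-tasking will and will not help.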

How to set a restartable checkpoint

Again, checkpoints will not by themselves speed up the loading process. However, if you have a tight window for loading the data warehouse and that loading takes considerable time, availability of a checkpoint can be a lifesaver when the load crashes (which it does at the worst times).
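A minimal sketch of a restartable load in Python follows. It records the number of the next batch to load in a small file after each batch succeeds, so a rerun picks up where the crashed run stopped. The run_load function, the checkpoint file format, and the simulated crash are all assumptions for illustration. Many database loaders have their own checkpoint or restart facilities that would be preferable in practice.

```python
import json
import os
import tempfile


def run_load(batches, checkpoint_path, load_batch):
    """Load batches in order, resuming after the last batch recorded as done."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]
    for index in range(start, len(batches)):
        load_batch(batches[index])
        # Record progress only after the batch has loaded successfully.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_batch": index + 1}, f)
    os.remove(checkpoint_path)  # clean finish: the next run starts from scratch


loaded = []
attempts = {"count": 0}


def flaky_load(batch):
    """Stand-in loader that crashes once, on its third call."""
    attempts["count"] += 1
    if attempts["count"] == 3:
        raise RuntimeError("simulated crash")
    loaded.append(batch)


checkpoint = os.path.join(tempfile.mkdtemp(), "load.checkpoint")
batches = ["batch-1", "batch-2", "batch-3", "batch-4"]

try:
    run_load(batches, checkpoint, flaky_load)
except RuntimeError:
    pass  # the first run dies while loading batch-3

run_load(batches, checkpoint, flaky_load)  # the rerun resumes at batch-3
```

The sketch assumes a batch either loads completely or not at all. If a crash can leave a batch half loaded, the load and the checkpoint update need to be committed together or the batch must be safe to reload.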

How certain forms of RAID technology can both speed and slow loading

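RAID technology can both help and harm loading speed. For example, striping spreads writes across several disks, while parity-based levels such as RAID 5 add parity calculations and extra writes to every update.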

Partial updating of multidimensional (MOLAP) databases

Many of these tools allow you to only recalculate some of the calculated numbers stored in the "cube". Most of these tools that have the capability will warn you that you do so at the risk of possibly getting data out of synch.

How to distribute data on multiple physical disks

If you can afford multiple disks, you may want to make sure input data, data warehouse tables, indexes, and logs (if you do not disable logging) are on different physical disks. In fact, you may want to learn about striping to spread a file over multiple disks and partitioning to divide a logical file into many physical files spread over different disks.

How to defragment table and index files

This is basic knowledge it will probably do you well to know.

How to make a copy of your transaction system database

If you really want to use your data warehouse only for production reporting, you may be better off just copying the transaction database periodically as is. Architectural purists hate this solution but sometimes it just makes sense to handle your reporting needs this way.

How to use multiple disk controllers

You will want high-speed interconnects to these controllers.

What is the cost of installing more/faster CPU, memory, disk

Sometimes buying metal is (by far) the least expensive way to speed up loading.

Some final comments - In the long run long loading times usually will cause bigger problems than long query times. It is not completely uncommon that data warehouse development teams find themselves with systems they have promised to update daily but then they find the update time stretches to 12, 14, 16, and maybe even 20 hours. You can throw more and more technology at this but ultimately your best tactics are the ability to understand what really is most important to the business and good user expectation management. And, unless it is done by design, do not let your data warehouse be the main source for operational-oriented query and report functionality that, in the big picture, ought to be in the feeder transaction processing systems.

How to Save Money on Your Data Warehousing Efforts

This essay is not a list of tactics to be used in deploying the technology of your choice. Rather this is a list of pointers that may prompt a data warehouse developer to think twice before making those project management, political, and technical design decisions whose cumulative effect is to force far more resources to be committed to a data warehousing effort than what was expected.

First, though, note how much more discretion there usually is in the design and implementation of data warehousing systems as opposed to transaction processing systems. In a transaction processing system, the data to be stored in the system, the users of the system, the service level provided to the users, the technology to be used, and, in many cases, the functionality of the system are usually subject to relatively little discretion. In a data warehousing effort, there is generally far greater discretion over these factors. However, for lack of time, political pressure, or unquestioning acceptance of mainstream industry thinking, data warehousing developers often fail to understand the range of choices they have.

That being said, I hope these pointers will give you a little pause....

Have a reason besides expediency for building a report or query in the data warehouse as opposed to the feeder transaction processing system

You probably won't be far into your data warehousing efforts when you see a report or query that could be done in the data warehousing system or in the feeder transaction processing system. And since you're the data warehouse developer you'll probably decide that the report or query is easier to do in the data warehouse. Welcome to the slippery slope! You're going to find more reports and queries that could go "both ways". Before you know it, you can end up with a data warehousing system that is in effect your "production" report and query generation system and which requires the same service level as the feeder transaction processing system. You may even end up doing transaction processing in your data warehouse (some data warehousing analysts politely call this "a feedback mechanism") to send corrected data back to the transaction processing system. Now, using a data warehouse for unbundling the querying and reporting functionality from a transaction processing system may be a good investment if you do it by design. If this unbundling is done insidiously, you can quickly back yourself into supporting, at great cost, two production systems that provide duplicate functionality.

Set expectations about response time before the users use the data warehouse

These "obvious" points never get mentioned enough: 1) Data warehousing performance can fluctuate far more than transaction processing system performance (e.g., for some reason every user will want to do a five year trend analysis at the same time) 2) Not everyone starts using the data warehouse at the same rate. As more users start using the system, average performance tends to drop 3) If your data warehouse is being used for ad hoc end user work, you most likely won't be able to "tune" your data warehouse system for everything your users are going to throw at it. You best discuss performance issues with your users at the very start of your data warehouse investigations. Else they may expect response time to be the same as moving a cell in an Excel worksheet. If you do not discuss expected performance issues with your users, you are setting yourself up for costly (and possibly perpetual) rework of your design when the data warehouse performance does not meet the initial expectations of the users.

Do the work to determine the economics of different service levels

Get an appreciation of how much increments to the data warehouse service level cost. This type of analysis is an "art" but an art that your database/hardware vendor/consultant (with your questioning every assumption they make) should be able to help you with. By the way, the important knowledge is how making adjustments with a given set of technologies will change cost and expected performance. Be skeptical about comparing this type of analysis between different sets of technologies.

Do the analysis of whether platforms your organization has been using for a long time are appropriate for your data warehousing efforts

Mainframe, proprietary midrange, and file server network operating systems are legitimate platforms for data warehousing. Before data warehousing was called data warehousing, these platforms were being used quite successfully for data warehousing systems. In fact, though you will not read about it in the trade media, these platforms still are being used successfully for data warehousing. The platforms are not always appropriate but if you have a substantial investment in these platforms and the "keepers" of those platforms are not overly resistant, it is worthwhile to do the analysis.

Do the analysis of whether your users should directly report/query against data stored in the transaction processing systems

In the 1970s, the mainstream industry wisdom was that data should be extracted and reported against. In the 1980s the mainstream wisdom did a "180" and said that "data shall not be duplicated" and that you should go against the real stuff. In the 1990s, the mainstream wisdom did another "180". Reporting against transaction processing system data is not always appropriate, but unless you automatically want to accept mainstream wisdom which never seems to consider the varieties of situations people face, you may find doing the analysis worthwhile. (And then in the 2000s you will be considered in the avant garde and you will be a source for mainstream wisdom.)

Bargain with the database and hardware vendors

Chances are you are going to buy your database and your hardware from some well known, historically profitable vendors. If you do your homework, you will find written material (not specifically about data warehousing though) and consultants available to advise you how to deal with specific vendors.

If you will have large numbers of users who only run canned reports, consider the alternatives to providing these users with "full blown" client based report and query, OLAP tools

In the typical data warehouse, the majority of users will strictly be running canned reports. (Estimates that 75% - 98% of data warehouse users are strictly report users have appeared in the trade press.) A great deal of money can be spent licensing and supporting functionality that the users will rarely use. Alternatives to providing canned report users with full blown tools vary based on the technology you are using and the politics of the situation. But the alternatives are usually there if you look.

Implement query efficiency enhancing design techniques that do not require special hardware or software

Specifically learn about using aggregate tables and partitioning. These techniques can be used with any type of database or file access methods. Though these techniques can be overused, they generally are the simplest, most effective, and least expensive ways to speed up retrieval of information.
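For readers who have not built one, an aggregate table is simply a pre-summarized copy of detail data that queries can read instead of the detail. The Python sketch below builds one in SQLite; the table names sales_detail and sales_by_region_month and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Detail-level data (illustrative rows only).
conn.execute("CREATE TABLE sales_detail (sale_date TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", [
    ("2000-01-05", "East", 100.0),
    ("2000-01-20", "East", 50.0),
    ("2000-01-11", "West", 75.0),
    ("2000-02-02", "East", 25.0),
])

# The aggregate table: one row per month and region instead of one per sale.
conn.execute("""
    CREATE TABLE sales_by_region_month AS
    SELECT substr(sale_date, 1, 7) AS sale_month,
           region,
           SUM(amount) AS total_amount,
           COUNT(*) AS sale_count
    FROM sales_detail
    GROUP BY substr(sale_date, 1, 7), region
""")

# Monthly reports now read the small aggregate rather than the detail table.
monthly = conn.execute(
    "SELECT sale_month, region, total_amount FROM sales_by_region_month "
    "ORDER BY sale_month, region").fetchall()
```

The aggregate has to be rebuilt or updated on every load, which is where overuse starts to cost you.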

Itemize possible data cleaning tasks and, with the data warehouse users, examine if each of the major tasks is worth the effort

You will probably come up with a long list of data problems many of which are not worth the effort to clean up. Note that "worth" is a judgment that the data warehouse developers and the users have to agree upon.

Think twice before building the means to perform complex calculations that few business users understand

It is not that uncommon for one business user to decide that he or she needs the data warehouse to store or report a set of numbers that are extremely difficult to determine and, more importantly, that most business users have a hard time understanding. In this case, the data warehouse developer has to diplomatically discuss whether it is worth calculating a set of numbers that perhaps only one business user will understand. Sometimes it is, most times it is not.

If the main reason you are considering a data warehouse is to get around the difficulties caused by a dysfunctional transaction processing system, do the work of costing how much it will cost to fix the transaction processing system before you make the data warehouse decision

It may not be surprising that the primary motivation for the construction of many data warehouses is to get around the difficulties caused by a problematic transaction processing system. Immediately deciding upon a data warehouse as a "fix" can be an expensive mistake. If you don't do the work of costing how much it will cost to fix the transaction processing systems, you may never understand what is really causing the problems. And then you're setting yourself up for a situation where the same problems recur in the data warehouse and you end up supporting both a dysfunctional transaction processing system and a dysfunctional data warehouse.

If most of your business needs are to report on data in one transaction processing system and/or all the historical data you need are in that system and/or the data in the system are clean and/or your hardware can support reporting against the live system data and/or the structure of the system data is relatively simple and/or your firm does not have much interest in end user ad hoc query/report tools, you may not NEED a data warehouse

Sometimes a good report generator will do just fine.

Question whether you really will benefit from certain categories of tools

For some data warehouse implementations, certain types of tools just do not make good business sense. For example, if you have no need for the slice-and-dice or modeling capabilities of OLAP tools, a report and query tool may meet your reporting needs more than adequately. If you have to perform fairly complex data transformations and/or you have relatively few data sources and targets, you may be better off coding by hand than using a so called "data mart" tool. The database you use for transaction processing may do just fine based on the number of users, amount of data, and time you have to load the database. Before buying data mining tools do your best to assess whether they will yield "actionable" insights worth the effort in making the data mining tool work.

Accept that data warehousing is going to be technically messy

If someone were ever to write "The Zen Of Data Warehousing" (perish the thought - please), one of the concepts would probably be that at some point, the more technically elegant you try to make these systems, the messier (and more costly and less beneficial) they end up being. There are no rules for determining where this point is. Use your judgment and intuition to make the determination.

Using Data Warehousing in Strategic Decision Making

Though you can read many definitions of data warehouses that say that these systems are designed for "strategic decision makers" (or some other similar term) there is little written about actually using data warehouses in strategic decision making processes. In this essay, I would like to offer some insight into using data warehouses in such decision making exercises.

First, let me define strategic decision making. There probably are thousands of published definitions. For working purposes let me say that a strategic decision is one that involves spending a lot of money and/or firing/re-assigning/hiring a lot of people and/or that is going to cause a lot of pain/joy until the next strategic decision is made. (Of course "a lot of" is a relative term.)

I assert that most of the uses of data warehouses are not for strategic decision making. Probably the most important reason for this is that strategic decision making usually is not done that often. Rather, I believe that most data warehouses are used primarily for post decision monitoring of the effects of decisions. Nevertheless, some data warehouses do get used in strategic decision making and are used very profitably.

What follows are some personal observations on how you may actually use a data warehouse in a strategic decision making exercise.

Creating "special" databases, modeling (not in the IS sense of the word), and formal reporting are the most time consuming tasks when using data warehouses in strategic decision making.

Later I will go into more detail regarding these topics.

Systems for strategic decision making tend to be relatively short-lived.

The amount of time spent using these systems sometimes can be measured in days counted on one hand. Those couple of days using the system, though, can bring more payoff than some canned reporting system used for years.

Usually the work must be done quickly and is requested with little advance notice.

This work usually has to be done in anything from a long afternoon to several weeks. This is "figure it out as you go along" work where IS often must take the part of the business analyst. There is usually no time for formal interviewing and extended data modeling exercises. The "requirements" are usually gleaned from "business" meetings which IS may have a little struggle to get into or are related secondhand from attendees of these meetings. These requirements are usually ambiguous. IS usually has to put on its business hat and figure out what is really needed by the business.

You will probably have to aggregate data differently, use different calculations for derived numbers, and combine data that never have before been combined.

The work you are doing allows the business to see a point of view that is not the common view of the business. (In other words, a part of many effective strategic decision making exercises is to see the business in a different perspective.) You are doing this work because when you built the data warehouse, you built it according to what then was the common view of the business.

You may need to create special databases.

Often you need to run repeated queries against a subset of the data warehouse. The subset may be one created by an extract query with quite complex constraints. Or, as I just mentioned, you may need to repeatedly access new aggregates and calculations or you may have to repeatedly concurrently access data that are not in the production data warehouse or that are in the production database but are not easily combined. For the sake of simplicity and efficiency, your best course is to create a special database. You may be thinking you created a data warehouse so you would not have to build special "extracts" but, perhaps to no surprise, often there just is no way of avoiding these extracts. (For more on somewhat similar ideas about these special databases, see Thomas Davenport's description of a "data deli" and Ralph Kimball's discussion of "behavioral studies".)

You may have to "feed" data into user maintained spreadsheet models.

Much of the use of data warehousing for strategic decision making ultimately involves "feeding" user maintained spreadsheets. These "feeds" are either links to data stored in a data warehouse or the actual loading of data into spreadsheets. The spreadsheets are used because the user needs to change complex calculations - maybe as part of a scenario analysis but usually because there is continual doubt about how certain calculations should be made - and the user is most knowledgeable about doing these changes in the spreadsheet environment. (To put this in a little more technical terms, many of these calculations are inter-record, cross dimensional calculations). Many OLAP tools allow a great deal of flexibility in making calculations but these capabilities tend to be too difficult for the user who is in a hurry in the strategic decision making exercise. Note also that oftentimes it is necessary to, in turn, feed spreadsheet data into the special databases you have created.

Sometimes data cleanliness is much less of a concern in strategic decision making.

Sometimes the fact that the analysis is being done with highly summarized data and/or the need for speed lessens the need for extremely clean data. I do suggest, however, that whatever the data expectations are, you keep an audit trail that lets you trace how data were derived from feeder systems.

You may have to create some highly formatted reports.

The information from the data warehouse has to be communicated to people who do not have and/or want direct access to the data warehouse. In a strategic decision making exercise, despite the rush, your users may want to communicate the information in printed reports that look just "so". These reports are usually being created to persuade someone. Many of your users will want a polished look to the reports in order to convey credibility. Also, graphs are usually created for these exercises. By the way, there is usually some give and take as to whether these reports and graphs should be created manually (i.e., with a word processor, presentation tool, spreadsheet) or generated directly from the database.

Now some advice:

Probably the most important determinant of the benefit you will get from technology is your ability to figure out the most insightful questions that the technology enables you to ask.

Do not assume that your users have full appreciation of the power of the technology. Unless you have some users with good gut instincts about technology, IS has to take the part of the business analyst to spur the imagination of the users.

Try to get in "the loop" early.

Users will tend to either grossly underestimate or overestimate the power of the data warehouses in these strategic decision making exercises. This means that either IS can miss an opportunity or be faced with an impossible task that must be done quickly. Note that there are usually politics in getting in the loop early. However, having previously built up a relationship of trust with a "decision maker" helps greatly.

When you are initially designing the warehouse, do not try to design for every contingency that could occur in a strategic decision making exercise.

You are not going to be able to foresee everything that will be needed in these exercises. Do not put everything you can possibly think of in the data warehouse. Do, though, try to keep atomic data in some electronically retrievable format. Do your best to conform the main dimensions of data used in your business. (That means customer, product, financial account, and internal "entity", i.e., people and department, identification.) Do address the slowly changing dimension issue. And do not make yourself completely dependent on outside resources whose availability you cannot control. These exercises come up unexpectedly.

Do not let the knowledge of the systems stay in the minds of the outside technical consultants

This trite and obvious piece of advice needs to be repeated. The technical consultants are gone and not available when these opportunities come up. If the key knowledge of your systems is in the heads of consultants, you may be up the creek when these exercises come up.

Learn spreadsheets and how your data warehouse can interact with them.

We in the data warehouse world often forget that the spreadsheet is by far the most used decision support tool. Persons supporting data warehouses that really will be used for decision support should be encouraged to learn the scripting language of the spreadsheet (which for most people is Visual Basic for Applications) so they have flexibility in coming up with solutions in these strategic decision making exercises.

Don't "production-ize" your work.

The technical work done in these exercises is usually not "industrial strength" and it is probably not worth the effort to make it so. You may learn, though, that you need to modify your production data warehouse database. Also, do keep your work around so you can cannibalize code for the next strategic decision making exercise.

Do not claim that data warehousing alone will necessarily improve strategic decision making

It needs to be oft-repeated that if a person is a mediocre decision maker, technology alone will not make that person a better decision maker - especially in the realm of strategic decision making where, despite our 100 TB databases, much more remains unknown than known.

Don't miss these opportunities.

It is hard to calculate the expected ROI of a data warehouse project. Most businesses have to go on faith that the effort somehow will be worth it. Well, success (or, sometimes, just participation) in a strategic decision making exercise, despite the messiness of the work, can strongly bolster the belief that the data warehouse was worth the effort. If you do not justify a data warehouse before building it, it is smart, perhaps imperative, to justify the data warehouse after the fact. And the best way you are going to do this is "anecdotally" with successful war stories like a strategic decision making exercise.

Maintenance Issues for Data Warehousing Systems

Another important aspect of data warehousing and decision support systems (hereafter referred to as DW/DSS systems and I know that is redundant) where I see little public discussion is maintenance of these systems. Here I present some of the issues that you may face when your systems are "in production", as if these systems ever achieve the stability implied by that term. How you will deal with the issues will depend on your environment. This list is presented because, just as mentioned in my gotchas page, forewarned is forearmed!

You will be challenged to learn about business and feeder system changes that will affect the DW/DSS systems

You as the system developer would like to know of developments that will affect the DW/DSS systems early enough to allow adequate time to assess what is impacted, make changes, test changes, etc. Of course this is no new concern to anyone doing systems maintenance. If you are responsible for a system being fed from, say, 10 sources, you may have much more exposure than you have with the typical transaction processing system. And though intelligent use of the data extraction, cleaning, and loading tools and the information catalogs can greatly ease the burden here, many changes will require a fair amount of effort. By the way, keeping informed and assessing the impact of technically driven changes to the feeder systems may be more difficult than keeping track of the business driven changes. If your IS organization has change control meetings, it is a major mistake for a DW/DSS developer not to attend those meetings regularly.

You will have to figure out if, when, and how to purge data

There comes a point when it does not make business sense to hold certain data in the warehousing system. This usually comes sooner than you expect. Either you are at some type of capacity limit or, more likely, you are restructuring data and it is not worth the effort to restructure certain data. When you are at this point you may realize that the DW/DSS system has become a breeding ground for corporate information pack rats ("Why just last week ______ asked for an analysis going back to 1956!"). Before you get into a discussion about purging data, one piece of advice is to learn about less expensive, alternative means of storage.

You will have to determine which queries and reports should be IS written and which should be user written

Probably when you got started in this area you had an idea about who would be doing what. And if you are like most DW/DSS developers, after you have been in production a while you have seen how reality has differed from your expectations. A very common IS expectation is that the end users will take over the overwhelming majority of query and report writing duties. And an all too common reality is that IS ends up taking over almost all the query and report writing, or IS writes some semi-canned queries and the potential of the system for answering ad hoc questions never gets fully realized. You may have a challenge on two fronts. You may have to push the end users into "deep water". You may also have to convince your IS staff that the report and query building tools are not "toys".

You will be motivated to store data in the data warehouse "for data's sake"

You and/or the users of the system will see "holes" in the data you store in the data warehouse. Mainly for the sake of completeness, you will be tempted to add this data. Unfortunately, when you have yielded to this temptation several times, you will find you have exploded the size and complexity of your data warehouse without proper consideration of whether the incremental size and complexity had business worth.

You will find endless opportunities to tune DW/DSS system databases

I once saw a quote from the director of IS of a well-known retailing business who said that the biggest data warehousing lesson he learned is "there aren't many data warehousing experts out there". If you are allowing a fair degree of end user developed access to systems and your systems are large and complex, you will discover that there are myriad ways to drag the systems down to a crawl. It is unlikely that an "expert" can foresee all the problems. And many of the problems are so crazy that the only way you are going to solve them is on a trial-and-error basis. By the way, you may have sold the DW concept as a way that "killer queries" will not drag down your "production" systems. Now that you've put in a data warehousing system, you will find out that the users are just as dependent on the data warehousing systems for recurring needs as they are on the so-called production systems and killer queries hurt wherever they occur.

You will have to balance the need for building aggregate structures for processing efficiency with the desire not to build a maintenance nightmare

Many DW/DSS systems involve building structures to contain aggregated information. These "structures" can be many things - separate tables in relational systems, dimensions in the OLAP world, etc. Anyway, after a while you will see countless ways to add or refine these aggregate structures, usually in the name of reducing end user retrieval time. The issue you face is balancing your desire to speed things up with the need to be careful about how much of a maintenance burden you want to take on. There are two aspects of this burden. First, you have to consider developer time. Secondly, you have to consider the amount of time it takes to update your systems on a recurring basis.

You will be uncertain whether to create certain reports/queries in the data warehousing system or in the "feeder" transaction processing system

You are best advised to have some guidelines as to what goes where. If not, you may eventually find that you have almost a clone of your transaction processing system in your data warehousing system.

You will be pressured to implement a means to interactively correct data in the data warehouse (and perhaps send back corrections to the transaction processing system)

And you thought your data warehouse was read-only! I am not saying this is necessarily bad. Though, as in the last point, you have to be careful you are not setting yourself up to build a clone of a dysfunctional transaction processing system.

You will be uncertain which tools are most appropriate for a certain task

DW/DSS systems present IS with yet another set of tools with overlapping uses. You will find that it is not clear what is the best tool for many applications. For instance, if you have invested in relational and multidimensional database technology, you will find that for many applications, at a technical level, it is a toss-up as to which database technology will do the job better. Many organizations also have a heavy duty tool and a more lightweight tool that have similar ends. You will come across many situations where it is not clear whether to go heavy duty or lightweight.

You will have to figure out how to test the effect of structure changes on end user written queries and reports

After a while you are going to make some database structure changes that may affect the reports and queries that your end users have written. In order that the need to re-test their work does not come as too bad a surprise to your end users, may I suggest that you get them into good housekeeping habits early on. This means, for example, not scattering their work across 10 different directories, and storing descriptions of their work.

You will have to determine how problems with feeder system update processing affect DW/DSS system update processing

Again, if you have 10 systems feeding your data warehouse, you are going to have to develop an appreciation of what to do when there is a processing problem with one or several of those feeder systems. At the simplest level, this means determining if and when you will process updates to the data warehousing system. At a more difficult level, this means determining if and how to process partial updates to the warehousing system. The dependencies in DW/DSS update processing can get quite complex. Do take the time to understand these dependencies especially if you do not have the most well-behaved feeder systems.
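As a minimal sketch of the "partial update" decision described above, the following Python records which feeder systems each warehouse table depends on and, given a set of feeders that failed, splits the tables into those that can still be loaded and those that should be held. The table names, feeder names, and the plan_load function are all hypothetical, and the sketch ignores dependencies between warehouse tables themselves, which a real load schedule would also have to respect.

```python
# Hypothetical map: warehouse table -> feeder systems it is loaded from.
DEPENDENCIES = {
    "dim_customer": {"crm"},
    "dim_product": {"inventory"},
    "fact_sales": {"orders", "crm", "inventory"},
    "fact_shipments": {"shipping", "inventory"},
}


def plan_load(dependencies, failed_feeders):
    """Split tables into (loadable, held) given the feeders that failed."""
    loadable, held = [], []
    for table, feeders in sorted(dependencies.items()):
        # A table is held if any feeder it depends on had a problem.
        if feeders & failed_feeders:
            held.append(table)
        else:
            loadable.append(table)
    return loadable, held


if __name__ == "__main__":
    loadable, held = plan_load(DEPENDENCIES, {"shipping"})
    print("load tonight:", loadable)
    print("hold:", held)
```

Even a simple map like this makes it visible that a problem in one widely used feeder (here, the assumed "inventory" system) would hold up most of the warehouse.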

You will find that maintaining a data warehouse architecture may be much harder than establishing the architecture

By architecture, I refer to consistent use of dimensions, definitions of derived data, attribute names, and data sources for specific information. Unless there is someone with responsibility to keep his eye on subsequent data warehouse development, it is easy to quickly lose the benefits of the hard work it usually takes to establish the architecture. By the way, the person keeping his eye on this development must: 1) Have some judgment - your expectations of what should remain consistent will change over time 2) Be able to work in a persuasive, not coercive manner - data warehouse developers especially resent "architecture police".

You will find that the business changes the meanings of attributes over time and that these changes can be overlooked

For example, say that you work for a fruit distribution company. Perhaps it has a policy of using category code "100" for sales of apples and oranges. If the company suddenly starts using code "150" for oranges, though your dimension table change capture mechanism may handle the change (I hope you know about slowly changing dimensions), there now is a question of how, well, apples to apples and oranges to oranges comparisons should be made for historical purposes. Often there is no "right" way to handle these issues that come up in comparing historical data. You do, though, have to do your best so you know there is an issue.
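The fruit example above can be sketched in a few lines of Python. This is only one of several possible ways to handle the comparison, not a "right" one: it rolls both codes up to a combined group so that periods before and after the change are comparable, at the cost of losing the apples/oranges split. The sample sales rows, the change date, and the names COMPARABLE_GROUP and totals_by_period are assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales facts: (sale_date, category_code, amount).
SALES = [
    (date(2000, 1, 15), "100", 500.0),  # before the change: apples and oranges
    (date(2000, 6, 15), "100", 300.0),  # after the change: apples only
    (date(2000, 6, 20), "150", 200.0),  # after the change: oranges
]

# Assumed date on which code "150" came into use for oranges.
CODE_CHANGE_DATE = date(2000, 4, 1)

# Both codes roll up to one group that means the same thing in every period.
COMPARABLE_GROUP = {"100": "apples+oranges", "150": "apples+oranges"}


def totals_by_period(rows):
    """Sum amounts by (period, comparable group)."""
    totals = defaultdict(float)
    for sale_date, code, amount in rows:
        period = "before" if sale_date < CODE_CHANGE_DATE else "after"
        totals[(period, COMPARABLE_GROUP[code])] += amount
    return dict(totals)


if __name__ == "__main__":
    for key, total in sorted(totals_by_period(SALES).items()):
        print(key, total)
```

Comparing raw code "100" across the two periods would show sales dropping from 500 to 300, when in this made-up data the comparable total is unchanged.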

You will have to rework how you have implemented security

Most firms, if their data warehousing systems are used for ad hoc reporting, will find their security schemes are either too loose or too tight. You will find that assigning security is a balancing act. You want to minimize security breaches but on the other hand you do not want to minimize the chance of a user discovering some useful business insight as a result of his examining something that someone else might have thought was beyond the scope of his everyday concerns.

You will have to keep reconciling feeder systems with the DW/DSS systems

After things are going smoothly for a while, sometimes there is a tendency to be slack in whatever process you have implemented to reconcile systems. Also, if you have end users reconcile information, you may find that it is an ongoing discussion as to how to handle responsibility for regular reconciliation.
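One common form of reconciliation is comparing control totals between a feeder system and the warehouse. The sketch below, in Python, assumes you can obtain a total per period from each side; the reconcile function, the tolerance value, and the sample figures are hypothetical and stand in for whatever your own process produces.

```python
def reconcile(feeder_totals, warehouse_totals, tolerance=0.01):
    """Return {key: (feeder_total, warehouse_total)} for every key that is
    missing on one side or differs by more than the tolerance."""
    discrepancies = {}
    for key in sorted(set(feeder_totals) | set(warehouse_totals)):
        feeder = feeder_totals.get(key)
        warehouse = warehouse_totals.get(key)
        if feeder is None or warehouse is None or abs(feeder - warehouse) > tolerance:
            discrepancies[key] = (feeder, warehouse)
    return discrepancies


if __name__ == "__main__":
    # Made-up monthly control totals from a feeder system and the warehouse.
    feeder = {"2000-01": 1000.0, "2000-02": 1500.0, "2000-03": 900.0}
    warehouse = {"2000-01": 1000.0, "2000-02": 1450.0}
    for period, (f, w) in reconcile(feeder, warehouse).items():
        print(period, "feeder:", f, "warehouse:", w)
```

Running a check like this on a schedule, rather than by hand, is one way to keep the process from going slack once things have been quiet for a while.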

You will have to perform euthanasia on some DW/DSS systems

DW/DSS systems tend to be changed frequently. They experience entropy much more quickly than, say, general ledger systems. If your firm is used to keeping and patching a system for as long as you keep a refrigerator (and these days there are firms like that dipping their feet in DW/DSS for the first time), you may be in for a surprise.

You will find it is far more expensive (and complex) to maintain a data warehouse than to build one

Hope you got that point by now!

What Decision Support Tools are Used For

In the section on the "dirty little secrets of data warehousing" in her fascinating book "e-Data", Jill Dyché notes many IT departments don't really know how the business is using its data warehouse. It is not necessarily bad, though, if IT does not know all the specific uses. Sometimes the sign of a great warehouse is that the users "run with it" on their own.

Nevertheless, it is possible to get a general idea just what the decision support (a.k.a., business intelligence) tools used to access a data warehouse are being used for. In this essay, I will attempt to make a general statement about use of these tools. Perhaps data warehouse support people can do a better job if they have a better feel for what the tools are really being used for.

The main uses of decision support tools are:

To check that "everything" is okay

Surprise! Nothing will be done with many, perhaps most, of the queries and reports created with decision support tools. They are run to confirm a person's usually not crisply defined but intuitively felt notion of "okayness". If I were able to write the essay on "The Zen of Data Warehousing" (which I will not), I would say a primary function of decision support tools is to support non-action.

To confirm the "obvious"

Most end users the reports and queries are ultimately being produced for have a pretty good gut feel for what is going on in their area of concern. Decision support tools do not tell these people anything amazing that the people don't already suspect. But the information produced with the tools gives them confidence their gut feel is okay.

To figure out how something "works"

Most people are not looking for some grand Unified Theory of how firm XYZ works. Rather, they want to understand some small aspect of an operation, like Customer A always pays on time, Customer B usually pays late and still takes the early payment discount, etc.

To convey information in a more digestible manner

These tools are often used to convey what a person or persons already know. These knowing people use the tools simply to present information to other people in a way that it is more easily read.

To compare information about customers, products, cost/profit centers, financial accounts

Sometimes this is side by side comparisons of a series of measures. Sometimes this is identification of the most, the least, the earliest, the latest, etc.

To compare the same type of information in different time periods

This is simply the usual daily, weekly, monthly, quarterly, yearly comparisons.

To check performance versus formal and informal goals or constraints

That is, measures of what actually occurred are compared with budgets, forecasts, quotas, or some other types of goals.

To identify the out of the ordinary

Usually the ultimate consumer of the tool's output has somewhat vague criteria of what is out of the ordinary. The decision support tools kind of do double duty in that they help refine the criteria of what is out of the ordinary and identify what fits the refined criteria of out of ordinariness.

To grab a little piece of information out of a large volume of information

These tools make picking that virtual needle out of that virtual haystack a lot simpler.

To get around an Information Technology department that does not have the time or the resources to write reports

Often end users use these tools out of impatience with the IT department. Or, the IT department gives the user these tools to relieve the pressure on itself. The end users in these cases often write reports that could hardly be called analyses.

To provide a report "of record"

For all kinds of reasons it is often necessary for people to agree that "these are the numbers". Note they do not have to agree on all the data - just some data whose credibility must be accepted for actions to be taken. Decision support tools often are used to produce this "official" information.

To confirm and sometimes to discover trends and relationships

With all respect to the people working hard on data mining, I think that most good businesspeople have an intuitive feeling of the most important trends and relationships between factors that are affecting their business. The decision support tools perform the function of confirming their intuition. Yes, the tools also can help discover trends and relationships but it is difficult (though potentially profitable) to sift out the meaningless and spurious trends.

To help advocate a position

These tools are not just for "objective" presentation of the facts. Often they are cleverly used to help bolster the case for doing (or not doing) something.

To provide data for a what if analysis or a forecast

That is, the tools are used to feed data into a spreadsheet where the actual what-if analysis or forecast will be done. The tools can do some of the what-if-ing and forecasting themselves but most business users are more comfortable doing this work in spreadsheets.

To repeat points I have made in other essays, despite their name most of these tools are not used as the sole input into making a non-trivial decision. Nor do they directly supply what I would consider to be business intelligence. Decisions are made and business intelligence is garnered only with the combination of the output of the decision support tools, human judgment and intuition, and the ability to put the information spit out by tools into a context of information that is much wider than any data warehouse, transaction processing system, or knowledge repository can handle.

Is Web Data Analysis (i.e., Web Mining) Different?

The topic of analyzing web data (also referred to as clickstream data) is one of the more discussed topics in the niche of data warehousing/decision support. Though there has been some intelligent writing on the topic, most of what is written seems to be the same unquestioning praise of supposedly revolutionary changes that analyzing this data is going to bring about.

This essay is not meant to be a how-to primer but rather to raise some questions in the mind of the reader. In this essay I would like to challenge some of the usual industry hyperbole.

Web data are the record of what actions a user takes with his mouse and keyboard while visiting a site

That is all they are. They are not that mysterious. In fact, if data could be characterized as mundane, web data would have to rank among the most mundane.

Web data are just another source of data - with their own quirks and with the limitations that come with all other sources of data

If you have worked with a variety of other data sources, you probably know much of what you need to know about working with web data. Yes, web data have quirks but what data (especially data as detailed as raw web data) do not have quirks?

The primary beneficiaries of web data analysis are web designers

Not many bet-your-company (and bet-your-career) decisions are going to be made with the results of web data analysis. Mostly it will be used for making many little decisions about how to modify the design of a web site. On the other hand, if your company is betting its continuance on smart use of its web site (and, except for the dot-coms, not many companies fall into that category), the cumulative effect of these little decisions may be company and career endangering.

The businesspeople will want and benefit most from highly aggregated web data that are usually combined with non-web data

Most web data has far more detail than the usual marketing or financial person wants to see. And these people think in terms of relative performance of "channels", most of which, for non dot-com companies, are not web based.

The person who is going to get the most insight from web data is the person who understands designing web sites so they are used profitably and who understands the power of data analysis

These people are hard to find! Sorry about the stereotypes but, at least in my limited exposure to good web designers and people who may not be hands-on designers but do have a good feel for the power of a web site, they are very different people from the financial and marketing analysts that data warehousing/decision support developers are used to working with. Most students of effective web design do not strike me as people who want to sit down with a query/report tool or OLAP tool and refine some analysis for three hours.

Often web data analysis yields conclusions that would be immediately obvious to a good web designer

Web data analysis can serve as a very expensive substitute for a good web designer. On the other hand, though, sometimes web data analysis can be an inexpensive substitute for a very expensive web designer.

The value of detailed web data declines pretty fast over time

Though many data warehousing implementers won't admit it, most data loses value over time. (If you want to be a little more academic, the expected value of the data declines over time.) Because web sites change so much, the value of the web data declines quickly. Imagine doing a traditional cost center spending analysis. Now imagine what would happen if the cost centers and their reporting hierarchy changed every day. This is kind of what it is like to analyze some web data.

In the same vein, the value of old detailed web data is dubious

I have read the publications predicting petabyte sized warehouses of months and even years of web data. What I have not read, though, is what people will do with older web data. Probably any web site that generates that much detailed data changes so often that, except at a very aggregated level, it is hard and perhaps meaningless to compare older data with newer data.

You can deliver "real-time" access to web data but your users will not be able to analyze it in real time

I read the pundits who say now you have got to go out and build usually expensive means to let users analyze web data generated up to the last millisecond. I don't know who the pundits work with but most people I have encountered who analyze data are not polymaths who can, on a recurring hourly basis, disgorge meaningful analyses.

Web data is far "dirtier" than the usual data warehouse data

Web data often present problems with identifying web site users, identifying what was viewed, identifying the sequence of user activity on a web site, and identifying when the user started and stopped looking at a web site. Data may have gaps or data may be suspect. Many of these problems are not solvable given the design goals of a web site.

Web data relies on some pretty fuzzy categorization

All you may know about the web site user is (what you think is) the sequence of his clicks. To make this data sensible, you may have to categorize users by their clicking sequences. Also, you may have to categorize the pages on the web site. These categorizations can get pretty fuzzy. By that, I mean there may be many, many ways to categorize with no compelling reason to use one categorization method over another. And, though it is not exactly categorization, you also have to define a "session" - when a user started and stopped accessing a web site. The definition of a session can be arbitrary.
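To illustrate how arbitrary a session definition can be, here is a minimal Python sketch that groups one user's clicks into sessions using an inactivity timeout. The timeout rule itself, the sessionize function, and the sample click times are assumptions for illustration; the essay does not prescribe any particular definition.

```python
from datetime import datetime, timedelta


def sessionize(click_times, timeout=timedelta(minutes=30)):
    """Group click timestamps into sessions; a gap longer than the
    timeout between consecutive clicks starts a new session."""
    sessions = []
    for clicked_at in sorted(click_times):
        if sessions and clicked_at - sessions[-1][-1] <= timeout:
            sessions[-1].append(clicked_at)
        else:
            sessions.append([clicked_at])
    return sessions


if __name__ == "__main__":
    # Made-up clicks by one user on one morning.
    clicks = [
        datetime(2000, 1, 1, 9, 0),
        datetime(2000, 1, 1, 9, 10),
        datetime(2000, 1, 1, 9, 50),
        datetime(2000, 1, 1, 10, 5),
    ]
    for minutes in (10, 30, 45):
        count = len(sessionize(clicks, timedelta(minutes=minutes)))
        print(f"timeout {minutes} min -> {count} session(s)")
```

With these four clicks, a 10-minute timeout gives three sessions, a 30-minute timeout gives two, and a 45-minute timeout gives one, with no compelling reason to prefer any of them.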

If session data are culled from multiple servers, you probably have a unique problem

If the servers' clocks are not exactly (!!) in sync, you are going to have a hard time tracing user activity.
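A small Python sketch shows why this matters. It merges log entries from two servers and orders them by timestamp, first as recorded and then after applying a per-server clock correction. The server names, the log entries, and the assumption that you somehow know each server's offset (here, that "web2" runs five seconds slow) are all hypothetical; in practice the offsets may not be known at all.

```python
from datetime import datetime, timedelta

# Made-up log entries: (server, recorded_time, page).
LOG = [
    ("web1", datetime(2000, 1, 1, 12, 0, 0), "/home"),
    ("web2", datetime(2000, 1, 1, 12, 0, 3), "/checkout"),
    ("web1", datetime(2000, 1, 1, 12, 0, 5), "/cart"),
]

# Assumed correction to add to each server's recorded times.
CLOCK_OFFSET = {"web1": timedelta(0), "web2": timedelta(seconds=5)}


def ordered_pages(entries, offsets=None):
    """Return pages in timestamp order, optionally correcting each
    server's recorded times by a known offset first."""
    offsets = offsets or {}
    corrected = [
        (recorded + offsets.get(server, timedelta(0)), page)
        for server, recorded, page in entries
    ]
    return [page for _, page in sorted(corrected)]


if __name__ == "__main__":
    print("as recorded:", ordered_pages(LOG))
    print("corrected:  ", ordered_pages(LOG, CLOCK_OFFSET))
```

As recorded, the user appears to check out before visiting the cart; with the correction applied the sequence becomes home, cart, checkout.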

If your site generates pages dynamically, you may have to write your own system to track the dynamic content

This information also has to be correlated with the log file analysis. If a page consists of multiple dynamically generated areas, then you have a more complicated problem.

Web data issues make it harder to do the manual judgment tasks needed to use data mining tools to separate useful information from gibberish

By now there is awareness that a great deal of judgment that can only be provided by a human being is needed for most data mining work. As you can imagine, all the problems with web data make it harder to do these judgment tasks that no software can do.

Often cursory analysis of web data produces most of the value that can be gained from analyzing the data

Or, in more academic terms, the marginal value of additional analysis may drop pretty rapidly. The data may be so dirty and so fuzzy that analyzing it further may not be worth it.

Web data by themselves do not give you much information about the web site user

Unless the web site user has bought something from the site, you know very little about the site user. (I read that most registration information, if given, is false.) And even if a site user has bought something, you need to combine the web data with non-web data from internal and external sources (like Equifax, etc.) to learn something about the web site user.

Web data do not give you that much information about why a person does not become a customer

When you read that web data is supposed to help you find why a person did not become a customer, you find you do this by analyzing the clicks of a person who left the site without buying. Also, the last page a person clicked on is supposed to be important to analyze. In actuality, you get a little information that is usually not great. Remember, usually the only thing you know about the non-customer is his clicking pattern. Analysis of clicking patterns, as mentioned before, can be quite moot.

Some marketing writers have questioned the effectiveness of the extremely targeted marketing some firms attempt via web data analysis

Though I make no claim to be a marketing expert, some of the supposed experts whose publications I have read have questioned the effectiveness of finely segmenting markets (which at its most extreme is segmenting markets to one person). They say that at some point in segmenting a market it is actually possible to get negative marginal returns. I interpret their writings to mean that marketers have to be humble about their understanding of consumer behavior. Though it seems counterintuitive, much more can be effectively acted upon by observation of group behavior rather than by observation of individual behavior.

This essay is not meant to dissuade anyone from analyzing web data. Web data analysis can be extremely profitable. But like all other applications of data warehousing/decision support, web data analysis has to be done intelligently. That is, we have to know who our real users are, honestly acknowledge the data problems we cannot solve or can only partially solve, and make our decisions on how much we want to analyze with an eye to expected marginal benefits versus expected marginal costs.