Post on 04-Apr-2018
7/29/2019 Data Warehouse Project Implementation
This Data Warehousing site aims to help people get a good high-level understanding of what it takes to
implement a successful data warehouse project. A lot of the information is from my personal experience
as a business intelligence professional, both as a client and as a vendor.
This site is divided into five main areas.
- Tools: The selection of business intelligence tools and the selection of the data warehousing team.
Tools covered are:
Database, Hardware
ETL (Extraction, Transformation, and Loading)
OLAP
Reporting
Metadata
- Steps: This section contains the typical milestones for a data warehousing project, from requirement
gathering and query optimization to production rollout and beyond. I also offer my observations on the data
warehousing field.
- Business Intelligence: Business intelligence is closely related to data warehousing. This section
discusses business intelligence, as well as the relationship between business intelligence and data
warehousing.
- Concepts: This section discusses several concepts particular to the data warehousing field. Topics
include:
Dimensional Data Model
Star Schema
Snowflake Schema
Slowly Changing Dimension
Conceptual Data Model
Logical Data Model
Physical Data Model
Conceptual, Logical, and Physical Data Model
Data Integrity
What is OLAP
MOLAP, ROLAP, and HOLAP
Bill Inmon vs. Ralph Kimball
- Business Intelligence Conferences: Lists upcoming conferences in the business intelligence / data
warehousing industry.
- Glossary: A glossary of common data warehousing terms.
********************
As the old Chinese saying goes, "To accomplish a goal, make sure the proper tools are selected." This is
especially true when the goal is to achieve business intelligence. Given the complexity of the data
warehousing system and the cross-departmental implications of the project, it is easy to see why the
proper selection of business intelligence software and personnel is very important. This section will talk
about such selections. They are grouped into the following:
General Considerations
Database/Hardware
ETL Tools
OLAP Tools
Reporting Tools
Metadata Tools
Data Warehouse Team Personnel
Please note that this site is vendor neutral. Some business intelligence vendor names will be
mentioned, but this should not be considered an endorsement from this site.
Buy vs. Build
The only choices here are what type of hardware and database to purchase, as there is basically no way
that one can build hardware/database systems from scratch.
Database/Hardware Selections
In making a selection for the database/hardware platform, there are several items that need to be carefully considered:
Scalability: How can the system grow as your data storage needs grow? Which RDBMS and hardware
platform can handle large sets of data most efficiently? To get an idea of this, one needs to determine the
approximate amount of data that is to be kept in the data warehouse system once it's mature, and base
any testing numbers from there.
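As a sketch of such a sizing estimate, the back-of-the-envelope calculation below uses purely illustrative volume figures (daily row count, row width, retention period, index overhead); the point is the method, not the numbers.

```python
# Back-of-the-envelope sizing for a data warehouse at maturity.
# All figures below are illustrative assumptions, not recommendations.

ROWS_PER_DAY = 2_000_000   # assumed daily fact-table volume
AVG_ROW_BYTES = 200        # assumed average row width
RETENTION_YEARS = 3        # assumed retention period
INDEX_OVERHEAD = 0.5       # assume indexes add ~50% on top of raw data

raw_bytes = ROWS_PER_DAY * AVG_ROW_BYTES * 365 * RETENTION_YEARS
total_bytes = raw_bytes * (1 + INDEX_OVERHEAD)

print(f"Raw data:   {raw_bytes / 1e12:.2f} TB")
print(f"With index: {total_bytes / 1e12:.2f} TB")
```

A number like this, grown to the warehouse's mature size, is what the scalability testing should be based on.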
Parallel Processing Support: The days of multi-million dollar supercomputers with a single CPU are
gone; nowadays the most powerful computers all use multiple CPUs, where each processor can
perform a part of the task, all at the same time. When I first started working with massively parallel
computers in 1993, I thought that within 5 years it would be the standard way for any large computation
to be done. Indeed, parallel computing is gaining popularity now, although a little more slowly than I had
originally thought.
RDBMS/Hardware Combination: Because the RDBMS physically sits on the hardware platform, there
are going to be certain parts of the code that are hardware platform-dependent. As a result, bugs and bug
fixes are often hardware dependent.
True Case: One of the projects I worked on paired a major RDBMS provider with a
hardware platform that was not so popular (at least not in the data warehousing world). The DBA
constantly complained about bugs not being fixed because the support level for the particular type of
hardware the client had chosen was Level 3, which basically meant that no one in the RDBMS support
organization would fix any bug particular to that hardware platform.
Popular Relational Databases
Oracle
Microsoft SQL Server
IBM DB2
Teradata
Sybase
MySQL
Popular OS Platforms
Linux
FreeBSD
Microsoft Windows
Buy vs. Build
When it comes to ETL tool selection, it is not always necessary to purchase a third-party tool. This
determination largely depends on three things:
Complexity of the data transformation: The more complex the data transformation is, the more
suitable it is to purchase an ETL tool.
Data cleansing needs: Does the data need to go through a thorough cleansing exercise before it
is suitable to be stored in the data warehouse? If so, it is best to purchase a tool with strong data
cleansing functionalities. Otherwise, it may be sufficient to simply build the ETL routine from
scratch.
Data volume: Available commercial tools typically have features that can speed up data
movement. Therefore, buying a commercial product is a better approach if the volume of data
transferred is large.
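When the transformations are simple, the "build" option can be as small as a script. Below is a minimal hand-rolled ETL sketch (all file, table, and column names are invented for illustration): it extracts rows from a CSV source, applies light cleansing, and loads a SQLite table.

```python
# A minimal hand-built ETL routine: extract from a CSV source, apply a
# simple cleansing/transformation step, and load into a SQLite table.
# The column names and sample data are made up for illustration.
import csv
import io
import sqlite3

source = io.StringIO("id,amount,region\n1, 100 ,east\n2,,west\n3,250,EAST\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL, region TEXT)")

for row in csv.DictReader(source):
    # Cleansing: trim whitespace, default missing amounts to 0,
    # and standardize region codes to upper case.
    amount = float(row["amount"].strip() or 0)
    region = row["region"].strip().upper()
    conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 (int(row["id"]), amount, region))

conn.commit()
print(conn.execute("SELECT region, SUM(amount) FROM fact_sales "
                   "GROUP BY region ORDER BY region").fetchall())
# → [('EAST', 350.0), ('WEST', 0.0)]
```

Once the cleansing rules multiply (fuzzy matching, address standardization, survivorship), this approach stops scaling, which is exactly when a purchased tool earns its keep.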
ETL Tool Functionalities
While the selection of a database and a hardware platform is a must, the selection of an ETL tool is highly
recommended, but it's not a must. When you evaluate ETL tools, it pays to look for the following
characteristics:
Functional capability: This includes both the 'transformation' piece and the 'cleansing' piece. In general,
the typical ETL tools are either geared towards having strong transformation capabilities or having strong
cleansing capabilities, but they are seldom very strong in both. As a result, if you know your data is going
to be dirty coming in, make sure your ETL tool has strong cleansing capabilities. If you know there are
going to be a lot of different data transformations, it then makes sense to pick a tool that is strong in
transformation.
Ability to read directly from your data source: For each organization, there is a different set of data
sources. Make sure the ETL tool you select can connect directly to your source data.
Metadata support: The ETL tool plays a key role in your metadata because it maps the source data to
the destination, which is an important piece of the metadata. In fact, some organizations have come to
rely on the documentation of their ETL tool as their metadata source. As a result, it is very important to
select an ETL tool that works with your overall metadata strategy.
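The source-to-target mapping an ETL tool records is itself metadata. A minimal sketch, with hypothetical table and column names, of how such a mapping answers the classic lineage question:

```python
# Sketch: the source-to-target mapping an ETL tool maintains is metadata.
# Table and column names below are hypothetical.
MAPPING = [
    {"source": "orders.order_total", "target": "fact_sales.sales_amount",
     "rule": "cast to DECIMAL(12,2)"},
    {"source": "orders.store_id", "target": "fact_sales.store_key",
     "rule": "lookup surrogate key in dim_store"},
]

def lineage(target_column):
    """Answer the classic metadata question: where does this column come from?"""
    return [m for m in MAPPING if m["target"] == target_column]

print(lineage("fact_sales.sales_amount"))
```

Whatever form this mapping takes in your chosen ETL tool, the key question is whether it can be extracted and fed into the rest of your metadata strategy.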
Popular Tools
IBM WebSphere Information Integration (Ascential DataStage)
Ab Initio
Informatica
Talend
Buy vs. Build
OLAP tools are geared towards slicing and dicing of the data. As such, they require a strong metadata layer, as well as front-end flexibility. Those are typically difficult features for any home-built systems to achieve. Therefore, my recommendation is that if OLAP analysis is part of your charter for building a data warehouse, it is best to purchase an existing OLAP tool rather than creating one from scratch.
OLAP Tool Functionalities
Before we speak about OLAP tool selection criteria, we must first distinguish between the two
types of OLAP tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).
1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can produce the result quickly because all data is already pre-aggregated within the cube.
2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationship between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.
Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that it is often necessary to drill down to the most detailed level of information, levels that the traditional cubes do not reach for performance and size reasons.
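The "smart SQL generator" idea behind ROLAP can be sketched in a few lines. The star-schema tables, columns, and join keys below are hypothetical; in a real tool, the `STAR` dictionary is what the 'Designer' piece captures.

```python
# Sketch of the ROLAP idea: given a mapping of dimensions and measures to
# star-schema tables, generate the SQL for a report request on the fly.
# All table and column names are hypothetical.

STAR = {
    "dimensions": {"region": ("dim_store s", "s.region"),
                   "year":   ("dim_date d",  "d.year")},
    "measures":   {"sales":  "SUM(f.sales_amount)"},
    "fact": "fact_sales f",
    "joins": {"dim_store s": "f.store_key = s.store_key",
              "dim_date d":  "f.date_key = d.date_key"},
}

def generate_sql(dims, measures):
    """Build a GROUP BY query for the requested dimensions and measures."""
    cols = [STAR["dimensions"][d][1] for d in dims]
    aggs = [f'{STAR["measures"][m]} AS {m}' for m in measures]
    tables = [STAR["fact"]] + [STAR["dimensions"][d][0] for d in dims]
    where = " AND ".join(STAR["joins"][STAR["dimensions"][d][0]] for d in dims)
    return (f'SELECT {", ".join(cols + aggs)} FROM {", ".join(tables)} '
            f'WHERE {where} GROUP BY {", ".join(cols)}')

print(generate_sql(["region"], ["sales"]))
# → SELECT s.region, SUM(f.sales_amount) AS sales FROM fact_sales f, dim_store s WHERE f.store_key = s.store_key GROUP BY s.region
```

A MOLAP engine, by contrast, would have answered the same request from a pre-aggregated cube rather than generating SQL at query time.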
So what are the criteria for evaluating OLAP vendors? Here they are:
Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool's performance, and help load the data into the cubes as quickly as possible.
Performance: In addition to leveraging parallelism, the tool itself should be quick, both in terms of loading the data into the cube and reading the data from the cube.
Customization efforts: More and more, OLAP tools are used as an advanced reporting tool. This is because in many cases, especially for ROLAP implementations, OLAP tools can often be used as a reporting tool. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.
Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.
Popular Tools
Business Objects
Cognos
Hyperion
Microsoft Analysis Services
MicroStrategy
Pentaho
Palo OLAP Server
The OLAP Tool Market Share page shows the market share of the above vendors.
Buy vs. Build
There is a wide variety of reporting requirements, and whether to buy or build a reporting tool for your business intelligence needs is also heavily dependent on the type of requirements. Typically, the determination is based on the following:
Number of reports: The higher the number of reports, the more likely that buying a reporting tool is a good idea. This is not only because reporting tools typically make creating new reports easier (by offering re-usable components), but also because they already have report management systems to make maintenance and support functions easier.
Desired report distribution mode: If the reports will only be distributed in a single mode (for example, email only, or over the browser only), we should then strongly consider the possibility of building the reporting tool from scratch. However, if users will access the reports through a variety of different channels, it would make sense to invest in a third-party reporting tool that already comes packaged with these distribution modes.
Ad hoc report creation: Will the users be able to create their own ad hoc reports? If so, it is a good idea to purchase a reporting tool. These tool vendors have accumulated extensive experience and know the features that are important to users who are creating ad hoc reports. A second reason is that the ability to allow for ad hoc report creation necessarily relies on a strong metadata layer, and it is simply difficult to come up with a metadata model when building a reporting tool from scratch.
Reporting Tool Functionalities
Data is useless if all it does is sit in the data warehouse. As a result, the presentation layer is of very high importance.
Most of the OLAP vendors already have a front-end presentation layer that allows users to call up pre-defined reports or create ad hoc reports. There are also several report tool vendors. Either way, pay attention to the following points when evaluating reporting tools:
Data source connection capabilities
In general there are two types of data sources: one is the relational database, the other is the OLAP multidimensional data source. Nowadays, chances are good that you will want to have both. Many tool vendors will tell you that they offer both options, but upon closer inspection, it is possible that the tool vendor is especially good for one type, while connecting to the other type of data source becomes a difficult exercise in programming.
Scheduling and distribution capabilities
In a realistic data warehousing usage scenario, all that senior executives have time for is to come in on Monday morning, look at the most important weekly numbers from the previous week (say, the sales numbers), and that is how they satisfy their business intelligence needs. All the fancy ad hoc and drilling capabilities will not interest them, because they do not touch these features.
Based on the above scenario, the reporting tool must have scheduling and distribution capabilities. Weekly reports are scheduled to run on Monday morning, and the resulting reports are distributed to the senior executives either by email or web publishing. There are claims by various vendors that they can distribute reports through various interfaces, but based on my experience, the only ones that really matter are delivery via email and publishing over the intranet.
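The scheduling/distribution pattern above can be sketched as follows. The addresses and report data are placeholders; the actual send and the Monday-morning trigger are left to an SMTP server and a scheduler such as cron.

```python
# Sketch: build the weekly report e-mail that a scheduler (cron, or the
# reporting tool's own scheduler) would send every Monday morning.
# Addresses and report contents below are placeholders.
from email.message import EmailMessage

def build_weekly_report_email(report_csv: str) -> EmailMessage:
    msg = EmailMessage()
    msg["From"] = "dw-reports@example.com"    # placeholder sender
    msg["To"] = "executives@example.com"      # placeholder distribution list
    msg["Subject"] = "Weekly sales report"
    msg.set_content("Attached are last week's sales numbers.")
    msg.add_attachment(report_csv.encode(), maintype="text",
                       subtype="csv", filename="weekly_sales.csv")
    return msg

msg = build_weekly_report_email("region,sales\nEAST,350\n")
# In production this would be handed to smtplib.SMTP(...).send_message(msg),
# triggered by a Monday-morning cron entry such as: 0 7 * * 1
print(msg["Subject"])
```

Commercial reporting tools package exactly this loop (run the report, render it, deliver it) behind a management interface, which is much of what you pay for.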
Security Features: Because reporting tools, similar to OLAP tools, are geared towards a number of users, making sure people see only what they are supposed to see is important. Security can reside at the report level, folder level, column level, row level, or even the individual cell level. By and large, all established reporting tools have these capabilities. Furthermore, they have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a "single sign-on" policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.
Customization
Every one of us has felt the frustration of spending an inordinate amount of time tinkering with some office productivity tool just to make a report or presentation look good. This is definitely a waste of time, but unfortunately it is a necessary evil. In fact, a lot of times, analysts will wish to take a report directly out of the reporting tool and place it in their presentations or reports to their bosses. If the reporting tool offers them an easy way to pre-set the reports to look exactly the way that adheres to the corporate standard, it makes the analysts' jobs much easier, and the time savings are tremendous.
Export capabilities
The most common export needs are to Excel, to a flat file, and to PDF, and a good reporting tool must be able to export to all three formats. For Excel, if the situation warrants it, you will want to verify that the reporting format, not just the data itself, will be exported out to Excel. This can often be a time-saver.
Integration with the Microsoft Office environment
Most people are used to working with Microsoft Office products, especially Excel, for manipulating data. Previously, people would export the reports into Excel and then perform additional formatting and calculation tasks. Some reporting tools now offer a Microsoft Office-like editing environment for users, so all formatting can be done within the reporting tool itself, with no need to export the report into Excel. This is a nice convenience for the users.
Popular Tools
Business Objects (Crystal Reports)
Cognos
Actuate
Buy vs. Build
Only in the rarest of cases does it make sense to build a metadata tool from scratch. This is because
doing so requires resources that are intimately familiar with the operational, technical, and business
aspects of the data warehouse system, and such resources are difficult to come by. Even when such
resources are available, there are often other tasks that can provide more value to the organization than
to build a metadata tool from scratch.
In fact, the question is often whether any type of metadata tool is needed at all. Although metadata plays
an extremely important role in a successful data warehousing implementation, this does not always mean
that a tool is needed to keep all the "data about data." It is possible to, say, keep such information in the
repository of other tools used, in text documentation, or even in a presentation or a spreadsheet.
Having said the above, though, it is the author's belief that having a solid metadata foundation is one of the
keys to the success of a data warehousing project. Therefore, even if a metadata tool is not selected at
the beginning of the project, it is essential to have a metadata strategy; that is, a plan for how metadata in
the data warehousing system will be stored.
Metadata Tool Functionalities
This is the most difficult tool to choose, because there is clearly no standard. In fact, it might be better to
call this a selection of the metadata strategy. Traditionally, people have put the data modeling information
into tools such as ERwin and Oracle Designer, but it is difficult to extract information out of such data
modeling tools. For example, one of the goals for your metadata selection is to provide information to the
end users. Clearly this is a difficult task with a data modeling tool.
So typically what is likely to happen is that additional effort is spent to create a layer of metadata that is
aimed at the end users. While this allows the end users to gain the required insight into what the data and
reports they are looking at mean, it is clearly inefficient, because all that information already resides
somewhere in the data warehouse system, whether it be the ETL tool, the data modeling tool, the OLAP
tool, or the reporting tool.
There are efforts among data warehousing tool vendors to unify on a metadata model. In June of 2000,
the OMG released a metadata standard called CWM (Common Warehouse Metamodel), and some of the
vendors such as Oracle have claimed to have implemented it. This standard incorporates the latest
technology such as XML, UML, and SOAP, and, if accepted widely, is truly the best thing that can happen
to the data warehousing industry. As of right now, though, the author has not really seen that many tools
leveraging this standard, so clearly it has not quite caught on yet.
So what does this mean for your metadata efforts? In the absence of everything else, I would
recommend that whatever tool you choose for your metadata support supports XML, and that whatever
other tools need to leverage the metadata also support XML. Then it is a matter of defining your
DTD across your data warehousing system. At the same time, there is no need to worry about criteria that
are typically important for the other tools, such as performance and support for parallelism, because the size
of the metadata is typically small relative to the size of the data warehouse.
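As a sketch of this XML-based approach, the fragment below invents a tiny metadata schema for one warehouse column and reads it back with Python's standard library; any other XML-capable tool in the chain could consume the same document.

```python
# Sketch: describe one warehouse column in XML and parse it back, so any
# XML-capable tool in the chain can read it. The schema (element and
# attribute names) is invented for illustration.
import xml.etree.ElementTree as ET

doc = """
<metadata>
  <column table="fact_sales" name="sales_amount">
    <source>orders.order_total</source>
    <description>Net sales amount in USD</description>
  </column>
</metadata>
"""

root = ET.fromstring(doc)
for col in root.iter("column"):
    print(col.get("table"), col.get("name"), "<-", col.findtext("source"))
# → fact_sales sales_amount <- orders.order_total
```

The DTD (or schema) you define plays the role of the contract: every tool that produces or consumes metadata agrees on these element and attribute names.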
Data Warehouse Team Personnel Selection
There are two areas of discussion: The first is whether to use external consultants or hire permanent employees. The second is what type of personnel is recommended for a data warehousing project.
The pros of hiring external consultants are:
1. They are usually more experienced in data warehousing implementations. The fact of the matter is, even today, people with extensive data warehousing backgrounds are difficult to find. With that, when there is a need to ramp up a team quickly, the easiest route to go is to hire external consultants.
The pros of hiring permanent employees are:
1. They are less expensive. With hourly rates for experienced data warehousing professionals running from $100/hr and up, and even more for Big-5 or vendor consultants, hiring permanent employees is a much more economical option.
2. They are less likely to leave. With consultants, whether they are on contract, via a Big-5 firm, or one of the tool vendor firms, they are likely to leave at a moment's notice. This makes knowledge transfer very important. Of course, the flip side is that these consultants are much easier to get rid of, too.
The following roles are typical for a data warehouse project:
Project Manager: This person will oversee the progress and be responsible for the success of the data warehousing project.
DBA: This role is responsible for keeping the database running smoothly. Additional tasks for this role may be to plan and execute a backup/recovery plan, as well as performance tuning.
Technical Architect: This role is responsible for developing and implementing the overall technical architecture of the data warehouse, from the backend hardware/software to the client desktop configurations.
ETL Developer: This role is responsible for planning, developing, and deploying the extraction, transformation, and loading routine for the data warehouse.
Front End Developer: This person is responsible for developing the front-end, whether it be client-server or over the web.
OLAP Developer: This role is responsible for the development of OLAP cubes.
Trainer: A significant role is the trainer. After the data warehouse is implemented, a person on the data warehouse team needs to work with the end users to get them familiar with how the front end is set up so that the end users can get the most benefit out of the data warehouse system.
Data Modeler: This role is responsible for taking the data structure that exists in the enterprise and modeling it into a schema that is suitable for OLAP analysis.
QA Group: This role is responsible for ensuring the correctness of the data in the data warehouse. This role is more important than it appears, because bad data quality turns away users more than any other reason, and is often the start of the downfall of a data warehousing project.
The above list is of roles, and one person does not necessarily correspond to only one role. In fact, it is very common on a data warehousing team for one person to take on multiple roles. For a typical project, it is common to see teams of 5-8 people. Any data warehousing team that contains more than 10 people is definitely bloated.
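The kind of automated check a QA group might run can be sketched as follows, using an in-memory SQLite database with hypothetical table names: it flags fact rows that reference a missing dimension member, one of the classic data quality failures.

```python
# Sketch of a QA data-quality check: no fact row should reference a
# dimension member that does not exist. Table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY);
  CREATE TABLE fact_sales (store_key INTEGER, amount REAL);
  INSERT INTO dim_store VALUES (1), (2);
  INSERT INTO fact_sales VALUES (1, 100), (2, 50), (3, 75);  -- 3 is orphaned
""")

orphans = conn.execute("""
  SELECT COUNT(*) FROM fact_sales f
  LEFT JOIN dim_store d ON f.store_key = d.store_key
  WHERE d.store_key IS NULL
""").fetchone()[0]

print(f"Orphaned fact rows: {orphans}")
# → Orphaned fact rows: 1
```

Checks like this, run after every load, catch the bad data before the end users do, which is the whole point of the QA role.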
Data Warehouse Design
After the tools and team personnel selections are made, the data warehouse design can begin. The following are the typical steps involved in the data warehousing project cycle.
Requirement Gathering
Physical Environment Setup
Data Modeling
ETL
OLAP Cube Design
Front End Development
Report Development
Performance Tuning
Query Optimization
Quality Assurance
Rolling out to Production
Production Maintenance
Incremental Enhancements
Each page listed above represents a typical data warehouse design phase, and has several sections:
Task Description: This section describes what typically needs to be accomplished during this particular data warehouse design phase.
Time Requirement: A rough estimate of the amount of time this particular data warehouse task takes.
Deliverables: Typically at the end of each data warehouse task, one or more documents are produced that fully describe the steps and results of that particular task. This is especially important for consultants to communicate their results to the clients.
Possible Pitfalls: Things to watch out for. Some of them are obvious, some of them not so obvious. All of them are real.
The Additional Observations section contains my own observations on data warehouse processes not included in any of the design steps.
Requirement Gathering
Task Description
The first thing that the project team should engage in is gathering requirements from end users. Because end users are typically not familiar with the data warehousing process or concept, the help of the business sponsor is essential. Requirement gathering can happen as one-to-one meetings or as Joint Application Development (JAD) sessions, where multiple people talk about the project scope in the same meeting.
The primary goal of this phase is to identify what constitutes success for this particular phase of the data warehouse project. In particular, end user reporting / analysis requirements are identified, and the project team will spend the remaining period of time trying to satisfy these requirements.
Associated with the identification of user requirements is a more concrete definition of other details such as hardware sizing information, training requirements, data source identification, and most importantly, a concrete project plan indicating the finishing date of the data warehousing project.
Based on the information gathered above, a disaster recovery plan needs to be developed so thatthe data warehousing system can recover from accidents that disable the system. Without aneffective backup and restore strategy, the system will only last until the first major disaster, and,as many data warehousing DBA's will attest, this can happen very quickly after the project goeslive.
Time Requirement
2 - 8 weeks.
Deliverables
A list of reports / cubes to be delivered to the end users by the end of this current phase.
An updated project plan that clearly identifies resource loads and milestone delivery dates.
Possible Pitfalls
This phase often turns out to be the trickiest phase of the data warehousing implementation. The reason is that, because data warehousing by definition includes data from multiple sources spanning many different departments within the enterprise, there are often political battles that center on the willingness to share information. Even though a successful data warehouse benefits the enterprise, there are occasions where departments may not feel the same way. As a result of the unwillingness of certain groups to release data or to participate in the data warehousing requirement definition, the data warehouse effort either never gets off the ground, or cannot start in the direction originally defined.
When this happens, it would be ideal to have a strong business sponsor. If the sponsor is at theCXO level, she can often exert enough influence to make sure everyone cooperates.
Physical Environment Setup
Task Description
Once the requirements are somewhat clear, it is necessary to set up the physical servers and databases. At a minimum, it is necessary to set up a development environment and a production environment. There are also many data warehousing projects where there are three environments: Development, Testing, and Production.
It is not enough to simply have different physical environments set up. The different processes (such as ETL, OLAP Cube, and reporting) also need to be set up properly for each environment.
It is best for the different environments to use distinct application and database servers. In other words, the development environment will have its own application server and database servers, and the production environment will have its own set of application and database servers.
Having different environments is very important for the following reasons:
All changes can be tested and QA'd first without affecting the production environment.
Development and QA can occur during the time users are accessing the data warehouse.
When there is any question about the data, having separate environment(s) will allow the data warehousing team to examine the data without impacting the production environment.
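The environment separation described above can be made explicit in configuration. Below is a minimal sketch of per-environment settings; all host names and keys are illustrative assumptions, not real systems or any particular tool's configuration format.

```python
# Hypothetical per-environment configuration: each environment gets its own
# application and database servers, so work in dev/test never touches production.
# All host names here are invented for illustration.
ENVIRONMENTS = {
    "development": {"app_server": "dw-app-dev", "db_server": "dw-db-dev"},
    "testing":     {"app_server": "dw-app-test", "db_server": "dw-db-test"},
    "production":  {"app_server": "dw-app-prod", "db_server": "dw-db-prod"},
}

def connection_target(env: str) -> str:
    """Return the database server an ETL or reporting job should connect to."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env}")
    return ENVIRONMENTS[env]["db_server"]

print(connection_target("development"))  # dw-db-dev
```

Keeping every job's connection routed through one lookup like this makes it harder to accidentally point a development ETL run at the production database.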
Time Requirement
Getting the servers and databases ready should take less than 1 week.
Deliverables
Hardware / Software setup document for all of the environments, including hardware specifications, and scripts / settings for the software.
Possible Pitfalls
To save on capital, data warehousing teams will often decide to use only a single database and a single server for the different environments. Environment separation is achieved by either a directory structure or by setting up distinct instances of the database. This is problematic for the following reasons:
1. Sometimes the server needs to be rebooted for the development environment. Having a separate development environment will prevent the production environment from being impacted by this.
2. There may be interference when having different database environments on a single box. For example, having multiple long queries running on the development database could affect the performance of the production database.
Data Modeling
Task Description
This is a very important step in the data warehousing project. Indeed, it is fair to say that the foundation of the data warehousing system is the data model. A good data model will allow the data warehousing system to grow easily, as well as allowing for good performance.
In a data warehousing project, the logical data model is built based on user requirements, and then it is translated into the physical data model. The detailed steps can be found in the Conceptual, Logical, and Physical Data Modeling section.
Part of the data modeling exercise is often the identification of data sources. Sometimes this step is deferred until the ETL step. However, my feeling is that it is better to find out where the data exists, or, better yet, whether it even exists anywhere in the enterprise at all. Should the data not be available, this is a good time to raise the alarm. If this is delayed until the ETL phase, rectifying it will become a much tougher and more complex process.
Time Requirement
2 - 6 weeks.
Deliverables
Identification of data sources.
Logical data model.
Physical data model.
Possible Pitfalls
It is essential to have a subject-matter expert as part of the data modeling team. This person can be an outside consultant or someone in-house who has extensive experience in the industry. Without this person, it becomes difficult to get a definitive answer on many of the questions, and the entire project gets dragged out.
ETL
Task Description
The ETL (Extraction, Transformation, Loading) process typically takes the longest to develop, and this can easily take up to 50% of the data warehouse implementation cycle or longer. The reason for this is that it takes time to get the source data, understand the necessary columns, understand the business rules, and understand the logical and physical data models.
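To make the three ETL stages concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and the business rule (drop negative or cancelled orders) are all invented for the example; a real ETL process would of course be far larger.

```python
import sqlite3

# Minimal ETL sketch (illustrative only): extract rows from a "source" table,
# apply a simple business rule during transformation, and load the survivors
# into a warehouse fact table. All names are assumptions for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (order_id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                 [(1, 100.0, "OK"), (2, -5.0, "OK"), (3, 40.0, "CANCELLED")])
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL)")

# Extract: pull the raw rows from the source system
rows = conn.execute("SELECT order_id, amount, status FROM src_orders").fetchall()
# Transform: apply the business rule -- keep non-negative, non-cancelled orders
clean = [(oid, amt) for oid, amt, status in rows if amt >= 0 and status == "OK"]
# Load: write the cleansed rows into the warehouse
conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(loaded)  # 1
```

Even this toy version shows where the time goes: most of the effort in a real project is spent discovering and encoding business rules like the one in the transform step.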
Time Requirement
1 - 6 weeks.
Deliverables
Data Mapping Document
ETL Script / ETL Package in the ETL tool
Possible Pitfalls
There is a tendency to give this particular phase too little development time. This can prove fatal to the project, because end users will usually tolerate less formatting, longer times to run reports, less functionality (slicing and dicing), or fewer delivered reports; the one thing that they will not tolerate is wrong information.
A second common problem is that some people make the ETL process more complicated than necessary. In ETL design, the primary goal should be to optimize load speed without sacrificing quality. This is, however, sometimes not followed. There are cases where the design goal is to cover all possible future uses, whether they are practical or just a figment of someone's imagination. When this happens, ETL performance suffers, and often so does the performance of the entire data warehousing system.
OLAP Cube Design
Task Description
Usually the design of the OLAP cube can be derived from the Requirement Gathering phase. More often than not, however, users have some idea of what they want, but it is difficult for them to specify the exact report / analysis they want to see. When this is the case, it is usually a good idea to include enough information so that they feel like they have gained something through the data warehouse, but not so much that it stretches the data warehouse scope by a mile. Remember that
data warehousing is an iterative process - no one can ever meet all the requirements all at once.
Time Requirement
1 - 2 weeks.
Deliverables
Documentation specifying the OLAP cube dimensions and measures.
Actual OLAP cube / report.
Possible Pitfalls
Make sure your OLAP cube-building process is optimized. It is common for the data warehouse to be at the bottom of the nightly batch load, and after the loading of the data warehouse, there usually isn't much time remaining for the OLAP cube to be refreshed. As a result, it is worthwhile to experiment with the OLAP cube generation paths to ensure optimal performance.
Front End Development
Task Description
Regardless of the strength of the OLAP engine and the integrity of the data, if the users cannot visualize the reports, the data warehouse brings zero value to them. Hence front end development is an important part of a data warehousing initiative.
So what are the things to look out for in selecting a front-end deployment methodology? The most important is that the reports should be deliverable over the web, so that the only thing the user needs is a standard browser. These days it is neither desirable nor feasible to have the IT department install programs on end users' desktops just so that they can view reports. So, whatever strategy one pursues, make sure delivery over the web is supported.
The front-end options range from internal front-end development using scripting languages such as ASP, PHP, or Perl, to off-the-shelf products such as Seagate Crystal Reports, to higher-level products such as Actuate. In addition, many OLAP vendors offer a front-end of their own. When choosing vendor tools, make sure they can be easily customized to suit the enterprise, especially against possible changes to the reporting requirements of the enterprise. Possible changes include not just differences in report layout and report content, but also possible changes in the back-end structure. For example, if the enterprise decides to change from Solaris/Oracle to Microsoft 2000/SQL Server, will the front-end tool be flexible enough to adjust to the changes without much modification?
Another area to be concerned with is the complexity of the reporting tool. For example, do the reports need to be published on a regular interval? Are there very specific formatting requirements? Is there a need for a GUI interface so that each user can customize her reports?
Time Requirement
1 - 4 weeks.
Deliverables
Front End Deployment Documentation
Possible Pitfalls
Just remember that the end users do not care how complex or how technologically advanced your front end infrastructure is. All they care about is that they receive their information in a timely manner and in the way they specified.
Report Development
Task Description
Report specification typically comes directly from the requirements phase. To the end user, the only direct touchpoint he or she has with the data warehousing system is the reports they see. So report development, although not as time consuming as some of the other steps such as ETL and data modeling, nevertheless plays a very important role in determining the success of the data warehousing project.
One would think that report development is an easy task. How hard can it be to just follow instructions to build the report? Unfortunately, this is not true. There are several points the data warehousing team needs to pay attention to before releasing a report.
User customization: Do users need to be able to select their own metrics? And how do users need to be able to filter the information? The report development process needs to take those factors into consideration so that users can get the information they need in the shortest amount of time possible.
Report delivery: What report delivery methods are needed? In addition to delivering the report to the web front end, other possibilities include delivery via email, via text messaging, or in some form of spreadsheet. There are reporting solutions in the marketplace that support report delivery as a flash file. Such a flash file essentially acts as a mini-cube, and allows end users to slice and dice the data on the report without having to pull data from an external source.
Access privileges: Special attention needs to be paid to who has what access to what information. A sales report can show 8 metrics covering the entire company to the company CEO, while the same report may only show 5 of the metrics covering only a single district to a District Sales Director.
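The CEO/District Sales Director example above can be sketched as a simple role-based filter. The role names, metric lists, and default-deny behavior are illustrative assumptions, not a description of any real reporting tool's security model.

```python
# Hedged sketch of report-level access control: the same sales report shows
# different metrics (and a different scope) depending on the viewer's role.
# All role and metric names are invented for illustration.
ALL_METRICS = ["revenue", "units", "margin", "returns", "discounts",
               "pipeline", "forecast", "headcount"]   # 8 metrics total

ROLE_RULES = {
    "CEO":      {"metrics": ALL_METRICS,     "scope": "company"},
    "district": {"metrics": ALL_METRICS[:5], "scope": "district"},
}

def visible_report(role: str) -> dict:
    """Return the metrics and scope this role may see; unknown roles see nothing."""
    rules = ROLE_RULES.get(role)
    if rules is None:
        return {"metrics": [], "scope": None}  # default deny
    return rules

print(len(visible_report("CEO")["metrics"]))       # 8
print(len(visible_report("district")["metrics"]))  # 5
```

The important design choice is the default-deny branch: a role the system does not recognize sees no data at all, rather than falling back to some broad default.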
Report development does not happen only during the implementation phase. After the system goes into production, there will certainly be requests for additional reports. These types of requests generally fall into two broad categories:
1. Data is already available in the data warehouse. In this case, it should be fairly straightforward to develop the new report in the front end. There is no need to wait for a major production push before making new reports available.
2. Data is not yet available in the data warehouse. This means that the request needs to be prioritized and put into a future data warehousing development cycle.
Time Requirement
1 - 2 weeks.
Deliverables
Report Specification Documentation.
Reports set up in the front end / reports delivered to user's preferred channel.
Possible Pitfalls
Make sure the exact definitions of the report are communicated to the users. Otherwise, user interpretation of the report can be erroneous.
Performance Tuning
Task Description
There are three major areas where a data warehousing system can use a little performance tuning:
ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system that has its ETL process finishing right on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. My experience has been that ROLAP reports or reports that run directly against the RDBMS often exceed this time limit, and it is hence worthwhile for the data warehousing team to invest some time to tune the queries, especially the most popular ones. We present a number of query optimization ideas.
Report Delivery - It is also possible that end users are experiencing significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way that the front-end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
Time Requirement
3 - 5 days.
Deliverables
Performance tuning document - Goal and Result
Possible Pitfalls
Make sure the development environment mimics the production environment as much as possible - performance enhancements seen on less powerful machines sometimes do not materialize on the larger, production-level machines.
Query Performance
For any production database, SQL query performance becomes an issue sooner or later. Long-running queries not only consume system resources that make the server and application run slowly, but may also lead to table locking and data corruption issues. So, query optimization becomes an important task.
First, we offer some guiding principles for query optimization:
1. Understand how your database is executing your query
Nowadays all databases have their own query optimizer, and offer a way for users to understand how a query is executed. For example, which index from which table is being used to execute the query? The first step to query optimization is understanding what the database is doing. Different databases have different commands for this. For example, in MySQL, one can use "EXPLAIN [SQL Query]" to see the query plan. In Oracle, one can use "EXPLAIN PLAN FOR [SQL Query]" to see the query plan.
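As a runnable illustration, SQLite (bundled with Python) uses "EXPLAIN QUERY PLAN" for the same purpose. The table and column names below are invented for the example.

```python
import sqlite3

# Reading a query plan in SQLite: the command is "EXPLAIN QUERY PLAN <sql>"
# (the analogue of MySQL's EXPLAIN and Oracle's EXPLAIN PLAN FOR).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM sales WHERE region = 'West'"
).fetchall()
# The last column of each plan row is the human-readable detail text.
plan_text = " ".join(str(row[-1]) for row in plan_rows)
print(plan_text)  # reports a full table scan of sales, since no index exists yet
```

Reading this output before and after a schema change is the quickest way to confirm whether the optimizer is actually doing what you expect.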
2. Retrieve as little data as possible
The more data returned from the query, the more resources the database needs to expend to process and store this data. For example, if you only need to retrieve one column from a table, do not use 'SELECT *'.
3. Store intermediate results
Sometimes the logic for a query can be quite complex. Often, it is possible to achieve the desired result through the use of subqueries, inline views, and UNION-type statements. In those cases, the intermediate results are not stored in the database, but are immediately used within the query. This can lead to performance issues, especially when the intermediate results have a large number of rows.
The way to increase query performance in those cases is to store the intermediate results in a temporary table, and break up the initial SQL statement into several SQL statements. In many cases, you can even build an index on the temporary table to speed up the query performance even more. Granted, this adds a little complexity in query management (i.e., the need to manage temporary tables), but the speedup in query performance is often worth the trouble.
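A minimal sketch of the temp-table strategy, again using SQLite with invented table names: materialize the intermediate result once, index it, and run the final query against the small table.

```python
import sqlite3

# Temp-table strategy: store the intermediate result in a temporary table,
# optionally index it, then query the small pre-computed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 10.0), ("East", 20.0), ("West", 5.0)])

# Step 1: break the complex query up -- materialize the intermediate result
conn.execute("""CREATE TEMP TABLE region_totals AS
                SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")
# Step 2: index the temporary table to speed up subsequent lookups
conn.execute("CREATE INDEX temp.idx_region_totals ON region_totals (region)")
# Step 3: the final query reads the small pre-computed table
east_total = conn.execute(
    "SELECT total FROM region_totals WHERE region = 'East'").fetchone()[0]
print(east_total)  # 30.0
```

The trade-off is exactly as described above: an extra object to manage, in exchange for the heavy aggregation being paid once instead of inside every downstream query.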
Below are several specific query optimization strategies.
Use Index
Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization is also discussed separately.
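The effect of an index shows up directly in the query plan. This SQLite sketch (table and index names invented) compares the plan before and after creating the index:

```python
import sqlite3

# Adding an index turns a full table scan into an index search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
query = "SELECT total FROM orders WHERE customer_id = 42"

def plan(sql: str) -> str:
    """Return the plan detail text for a SQL statement."""
    return " ".join(str(r[-1]) for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)   # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # with the index: a search using idx_orders_customer
print(before)
print(after)
```

The same before/after comparison technique applies to any of the strategies below: always verify with the plan, not just with intuition.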
Aggregate Table
Pre-populating tables at higher levels so less data needs to be parsed.
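A sketch of the aggregate-table idea, with invented table names: daily detail rows are rolled up to month during the load, so month-level reports scan far fewer rows.

```python
import sqlite3

# Aggregate table: pre-populate a month-level summary from daily detail so
# reports at the month level never touch the large detail table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_daily (day TEXT, amount REAL)")
conn.executemany("INSERT INTO fact_daily VALUES (?, ?)",
                 [("2019-01-01", 1.0), ("2019-01-02", 2.0), ("2019-02-01", 7.0)])

# Built once, e.g. at the end of the nightly load
conn.execute("""CREATE TABLE fact_monthly AS
                SELECT substr(day, 1, 7) AS month, SUM(amount) AS amount
                FROM fact_daily GROUP BY month""")

# Reports hit the small aggregate table instead of the detail table
jan = conn.execute(
    "SELECT amount FROM fact_monthly WHERE month = '2019-01'").fetchone()[0]
print(jan)  # 3.0
```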
Vertical Partitioning
Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.
Horizontal Partitioning
Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
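Since SQLite has no native partitioning, here is a manual illustration of horizontal partitioning by year (names invented): one physical table per year, plus a UNION ALL view for the rare queries that need everything.

```python
import sqlite3

# Manual horizontal partitioning by time: one table per year. Queries scoped
# to a year touch only that year's (smaller) table; a view covers the rest.
conn = sqlite3.connect(":memory:")
for year in (2018, 2019):
    conn.execute(f"CREATE TABLE sales_{year} (day TEXT, amount REAL)")
conn.execute("INSERT INTO sales_2018 VALUES ('2018-06-01', 10.0)")
conn.execute("INSERT INTO sales_2019 VALUES ('2019-06-01', 25.0)")
conn.execute("""CREATE VIEW sales_all AS
                SELECT * FROM sales_2018 UNION ALL SELECT * FROM sales_2019""")

total_2019 = conn.execute("SELECT SUM(amount) FROM sales_2019").fetchone()[0]
grand_total = conn.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
print(total_2019, grand_total)  # 25.0 35.0
```

Production databases such as Oracle or SQL Server offer native partitioning that achieves this transparently, without separate table names.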
Denormalization
The process of denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.
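A sketch of denormalization in a star-schema setting (table names invented): the dimension attributes are folded into the fact table at load time, so reports need no join.

```python
import sqlite3

# Denormalization: pay the join once at load time by folding the product
# dimension's attributes into the fact table; reports then need no join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_id INTEGER, category TEXT)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Books')")
conn.execute("INSERT INTO fact_sales VALUES (1, 12.5)")

conn.execute("""CREATE TABLE fact_sales_denorm AS
                SELECT f.product_id, d.category, f.amount
                FROM fact_sales f JOIN dim_product d USING (product_id)""")

row = conn.execute("SELECT category, amount FROM fact_sales_denorm").fetchone()
print(row)  # ('Books', 12.5)
```

The cost is redundancy: if a category name changes, the denormalized copy must be rebuilt or updated, which is why this is usually handled inside the ETL process.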
Server Tuning
Each server has its own parameters, and often tuning server parameters so that the server can fully take advantage of the hardware resources can significantly speed up query performance.
Quality Assurance
Task Description
Once the development team declares that everything is ready for further testing, the QA team takes over. The QA team is always from the client. Usually the QA team members will know little about data warehousing, and some of them may even resent the need to learn another tool or tools. This makes the QA process a tricky one.
Sometimes the QA process is overlooked. On my very first data warehousing project, the project team worked very hard to get everything ready for Phase 1, and everyone thought that we had met the deadline. There was one mistake, though: the project managers failed to recognize that it is necessary to go through the client QA process before the project can go into production. As a result, it took five extra months to bring the project to production (the original development time had been only 2 1/2 months).
Time Requirement
1 - 4 weeks.
Deliverables
QA Test Plan
QA verification that the data warehousing system is ready to go to production
Possible Pitfalls
As mentioned above, usually the QA team members know little about data warehousing, and some of them may even resent the need to learn another tool or tools. Make sure the QA team members get enough education so that they can complete the testing themselves.
Rollout To Production
Task Description
Once the QA team gives the thumbs up, it is time for the data warehouse system to go live. Some may think this is as easy as flipping on a switch, but that is usually not true. Depending on the number of end users, it sometimes takes up to a full week to bring everyone online! Fortunately, nowadays most end users access the data warehouse over the web, making going into production sometimes as easy as sending out a URL via email.
Time Requirement
1 - 3 days.
Deliverables
Delivery of the data warehousing system to the end users.
Possible Pitfalls
Take care to address the user education needs. There is nothing more frustrating than spending several months developing and QA'ing the data warehousing system, only to see little usage because the users are not properly trained. Regardless of how intuitive or easy the interface may be, it is always a good idea to send the users to at least a one-day course to let them understand what they can achieve by properly using the data warehouse.
Production Maintenance
Task Description
Once the data warehouse goes into production, it needs to be maintained. Tasks such as regular backup and crisis management become important and should be planned out. In addition, it is very important to consistently monitor end user usage. This serves two purposes: 1. To capture any runaway requests so that they can be fixed before slowing the entire system down, and 2. To understand how much users are utilizing the data warehouse for return-on-investment calculations and future enhancement considerations.
Time Requirement
Ongoing.
Deliverables
Consistent availability of the data warehousing system to the end users.
Possible Pitfalls
Usually by this time most, if not all, of the developers will have left the project, so it is essential that proper documentation is left for those who are handling production maintenance. There is nothing more frustrating than staring at something another person did, yet being unable to figure it out due to the lack of proper documentation.
Another pitfall is that the maintenance phase is usually boring. So, if there is another phase of thedata warehouse planned, start on that as soon as possible.
Incremental Enhancements
Task Description
Once the data warehousing system goes live, there are often needs for incremental enhancements. I am not talking about new data warehousing phases, but simply small changes that follow the business itself. For example, the geographical designations may change: the company may originally have had 4 sales regions, but because sales are going so well, it now has 10 sales regions.
Deliverables
Change management documentation
Actual change to the data warehousing system
Possible Pitfalls
Because a lot of times the changes are simple to make, it is very tempting to just go ahead and make the change in production. This is a definite no-no. Many unexpected problems will pop up if this is done. I would very strongly recommend that the typical cycle of development --> QA --> Production be followed, regardless of how simple the change may seem.
Additional Observations
This section lists the trends I have seen based on my experience in the data warehousing field:
Quick implementation time
Lack of collaboration with data mining efforts
Industry consolidation
How to measure success
Recipes for data warehousing project failure
Quick Implementation Time
If you add up the total time required to complete the tasks from Requirement Gathering to
Rollout to Production, you'll find it takes about 9 - 29 weeks to complete each phase of the data warehousing effort. The 9 weeks may sound too quick, but I have been personally involved in a turnkey data warehousing implementation that took 40 business days, so that is entirely possible. Furthermore, some of the tasks may proceed in parallel, so as a rule of thumb it is reasonable to say that it generally takes 2 - 6 months for each phase of a data warehousing implementation.
Why is this important? The main reason is that in today's business world, the business environment changes quickly, which means that what is important now may not be important 6 months from now. For example, even the traditionally static financial industry is coming up with new products and new ways to generate revenue at a rapid pace. Therefore, a time-consuming data warehousing effort will very likely become obsolete by the time it is in production. It is best to finish a project quickly. The focus on quick delivery time does mean, however, that the scope for each phase of the data warehousing project will necessarily be limited. In this case, the 80-20 rule applies, and our goal is to do the 20% of the work that will satisfy 80% of the user needs. The rest can come later.
Lack Of Collaboration With Data Mining Efforts
Usually data mining is viewed as the final manifestation of the data warehouse. The ideal is that, with information from all over the enterprise conformed and stored in a central location, data mining techniques can be applied to find relationships that are otherwise not possible to find. Unfortunately, this has not quite happened, for the following reasons:
1. Few enterprises have an enterprise data warehouse infrastructure. In fact, currently they are more likely to have isolated data marts. At the data mart level, it is difficult to come up with relationships that cannot be answered by a good OLAP tool.
2. The ROI for data mining companies is inherently lower because, by definition, data mining will only be performed by a few users (generally no more than 5) in the entire enterprise. As a result, it is hard to charge a lot of money due to the low number of users. In addition, developing data mining algorithms is an inherently complex process and requires a lot of up-front investment. Finally, it is difficult for the vendor to put a value proposition in front of the client because quantifying the returns on a data mining project is next to impossible.
This is not to say, however, that data mining is not being utilized by enterprises. In fact, many enterprises have made excellent discoveries using data mining techniques. What I am saying, though, is that data mining is typically not associated with a data warehousing initiative. It seems like successful data mining projects are usually stand-alone projects.
Industry Consolidation
In the last several years, we have seen rapid industry consolidation, as the weaker competitors are gobbled up by stronger players. The most significant transactions are below (note that the dollar amount quoted is the value of the deal when initially announced):
IBM purchased Cognos for $5 billion in 2007.
SAP purchased Business Objects for $6.8 billion in 2007.
Oracle purchased Hyperion for $3.3 billion in 2007.
Business Objects (OLAP/ETL) purchased FirstLogic (data cleansing) for $69 million in 2006.
Informatica (ETL) purchased Similarity Systems (data cleansing) for $55 million in 2006.
IBM (database) purchased Ascential Software (ETL) for $1.1 billion in cash in 2005.
Business Objects (OLAP) purchased Crystal Decisions (Reporting) for $820 million in 2003.
Hyperion (OLAP) purchased Brio (OLAP) for $142 million in 2003.
GEAC (ERP) purchased Comshare (OLAP) for $52 million in 2003.
For the majority of the deals, the purchase represents an effort by the buyer to expand into other areas of data warehousing (Hyperion's purchase of Brio also falls into this category because, even though both are OLAP vendors, their product lines do not overlap). This clearly shows vendors' strong push to be the one-stop shop, from reporting and OLAP to ETL.
There are two levels of one-stop shop. The first level is at the corporate level. In this case, the vendor is essentially still selling two entirely separate products, but instead of dealing with two sets of sales and technology support groups, the customers only interact with one such group. The second level is at the product level. In this case, different products are integrated. In data warehousing, this essentially means that they share the same metadata layer. This is actually a rather difficult task, and therefore not commonly accomplished. When there is metadata integration, the customers not only get the benefit of dealing with one vendor instead of two (or more), but will also be using a single product rather than multiple products. This is where the real value of industry consolidation is shown.
How To Measure Success
Given the significant amount of resources usually invested in a data warehousing project, a very important question is how success can be measured. This is a question that many project managers do not think about, and for good reason: many project managers are brought in to build the data warehousing system, and then turn it over to in-house staff for ongoing maintenance. The job of the project manager is to build the system, not to justify its existence.
Just because this is often not done does not mean it is not important. Just like a data warehousing system aims to measure the pulse of the company, the success of the data warehousing system itself needs to be measured. Without some type of measure of the return on investment (ROI), how does the company know whether it made the right choice, and whether it should continue with the data warehousing investment?
There are a number of papers out there that provide formulas for calculating the return on a data warehousing investment. Some of the calculations become quite cumbersome, with a number of assumptions and even more variables. Although they are all valid methods, I believe
the success of the data warehousing system can simply be measured by looking at one criterion:
How often the system is being used.
If the system is satisfying user needs, users will naturally use the system. If not, users will abandon the system, and a data warehousing system with no users is actually a detriment to the company (since resources that could be deployed elsewhere are required to maintain the system). Therefore, it is very important to have a tracking mechanism to figure out how much the users are accessing the data warehouse. This should not be a problem if third-party reporting/OLAP tools are used, since they all contain this component. If the reporting tool is built from scratch, this feature needs to be included in the tool. Once the system goes into production, the data warehousing team needs to periodically check to make sure users are using the system. If usage starts to dip, find out why and address the reason as soon as possible. Is the data quality lacking? Are the reports not satisfying current needs? Is the response time slow? Whatever the reason, take steps to address it as soon as possible, so that the data warehousing system serves its purpose successfully.
Business Intelligence

Business intelligence is a term commonly associated with data warehousing. In fact, many of the tool vendors position their products as business intelligence software rather than data warehousing software. There are other occasions where the two terms are used interchangeably. So, exactly what is business intelligence?
Business intelligence usually refers to the information that is available for the enterprise to make decisions on. A data warehousing (or data mart) system is the backend, or infrastructural, component for achieving business intelligence. Business intelligence also includes the insight gained from doing data mining analysis, as well as from unstructured data (thus the need for content management systems). For our purposes here, we will discuss business intelligence in the context of using a data warehouse infrastructure.
This section includes the following:
Business intelligence tools: Tools commonly used for business intelligence.
Business intelligence uses: Different forms of business intelligence.
Business intelligence news: News in the business intelligence area.
Business Intelligence > Tools
The most common tools used for business intelligence are listed below, in order of increasing cost, increasing functionality, increasing business intelligence complexity, and decreasing number of total users.
Excel
Take a guess: what's the most common business intelligence tool? You might be surprised to find out it's Microsoft Excel. There are several reasons for this:
1. It's relatively cheap.
2. It's commonly used. You can easily send an Excel sheet to another person without worrying whether the recipient knows how to read the numbers.
3. It has most of the functionalities users need to display data.
In fact, it is still so popular that all third-party reporting / OLAP tools have an "export to Excel" functionality. Even for home-built solutions, the ability to export numbers to Excel usually needs to be built.
Excel is best used for business operations reporting and goals tracking.
Reporting tool
In this discussion, I am including both custom-built reporting tools and commercial reporting tools. They provide some flexibility in terms of the ability for each user to create, schedule, and run their own reports. The Reporting Tool Selection section discusses how one should select a reporting tool.
Business operations reporting and dashboards are the most common applications for a reporting tool.
OLAP tool
OLAP tools are usually used by advanced users. They make it easy for users to look at the data from multiple dimensions. The OLAP Tool Selection section discusses how one should select an OLAP tool.
OLAP tools are used for multidimensional analysis.
Data mining tool
Data mining tools are usually used only by very specialized users, and in an organization, even a large one, there are usually only a handful of users using data mining tools.
Data mining tools are used for finding correlation among different factors.
Business Intelligence Uses
Business intelligence usage can be grouped into the following categories:
1. Business operations reporting
The most common form of business intelligence is business operations reporting. This includes the actuals and how the actuals stack up against the goals. This type of business intelligence often manifests itself in the standard weekly or monthly reports that need to be produced.
2. Forecasting
Many of you have no doubt run into the need for forecasting, and all of you would agree that forecasting is both a science and an art. It is an art because one can never be sure what the future
holds. What if competitors decide to spend a large amount of money on advertising? What if the price of oil shoots up to $80 a barrel? At the same time, it is also a science because one can extrapolate from historical data, so it's not a total guess.
3. Dashboard
The primary purpose of a dashboard is to convey information at a glance. For this audience, there is little, if any, need for drilling down on the data. At the same time, presentation and ease of use are very important for a dashboard to be useful.
4. Multidimensional analysis
Multidimensional analysis is the "slicing and dicing" of the data. It offers good insight into the numbers at a more granular level. This requires a solid data warehousing / data mart backend, as well as business-savvy analysts to get to the necessary data.
5. Finding correlation among different factors
This is diving very deep into business intelligence. Questions asked are like, "How do different factors correlate to one another?" and "Are there significant time trends that can be leveraged/anticipated?"
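To make the first of those questions concrete, here is a minimal sketch of measuring how two factors move together using a plain Pearson correlation. The figures (ad spend vs. sales) are made up purely for illustration:

```python
from math import sqrt

# Hypothetical factor data: weekly ad spend and weekly sales.
ad_spend = [10, 20, 30, 40, 50]
sales    = [110, 135, 160, 170, 210]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(ad_spend, sales)
print(round(r, 3))  # a value near +1 means the two factors move together
```

Real data mining tools go far beyond pairwise correlation, of course, but this is the kind of relationship they are hunting for.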
Data Warehousing > Concepts
Several concepts are of particular importance to data warehousing. They are discussed in detail in this section.
Dimensional Data Model: The dimensional data model is commonly used in data warehousing systems. This section describes this modeling technique, and the two common schema types, star schema and snowflake schema.
Slowly Changing Dimension: This is a common issue facing data warehousing practitioners. This section explains the problem, and describes the three ways of handling it with examples.
Conceptual Data Model: What is a conceptual data model, its features, and an example of thistype of data model.
Logical Data Model: What is a logical data model, its features, and an example of this type ofdata model.
Physical Data Model: What is a physical data model, its features, and an example of this type of data model.
Conceptual, Logical, and Physical Data Model: Different levels of abstraction for a data model. This section compares and contrasts the three different types of data models.
Data Integrity: What is data integrity and how it is enforced in data warehousing.
What is OLAP: Definition of OLAP.
MOLAP, ROLAP, and HOLAP: What are these different types of OLAP technology? This section discusses how they differ from one another, and the advantages and disadvantages of each.
Bill Inmon vs. Ralph Kimball: These two data warehousing heavyweights have different views of the relationship between the data warehouse and the data mart.
Dimensional Data Modeling
The dimensional data model is most often used in data warehousing systems. This is different from third normal form, commonly used for transactional (OLTP) systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a third normal form model.
To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:
Dimension: A category of information. For example, the time dimension.
Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.
Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.
Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: a date column, a store column, and a sales amount column.
Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, the first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").
A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are the Star Schema and the Snowflake Schema.
Whether one uses a star or a snowflake largely depends on personal preference and business needs. Personally, I am partial to snowflakes, when there is a business case to analyze the information at that particular level.
Fact Table Granularity
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
Determine which dimensions will be included.
Determine where along the hierarchy of each dimension the information will be kept.
The determining factors usually go back to the requirements.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward process, because business processes will often dictate clearly what the relevant dimensions are.
For example, in an off-line retail world, the dimensions for a sales fact table are usually time, geography, and product. This list, however, is by no means complete for all off-line retailers. Consider a supermarket with a rewards card program, where customers provide some personal information in exchange for a rewards card, and the supermarket offers lower prices on certain items to customers who present a rewards card at checkout. Such a supermarket will also have the ability to track the customer dimension. Whether the data warehousing system includes the customer dimension then becomes a decision that needs to be made.
What Level Within Each Dimension To Include
Determining where along the hierarchy of each dimension the information is stored is a bit more tricky. This is where user requirements (both stated and possibly future) play a major role.

In the above example, will the supermarket want to do analysis at the hourly level? (That is, looking at how certain products may sell during different hours of the day.) If so, it makes sense to use 'hour' as the lowest level of granularity in the time dimension. If daily analysis is sufficient, then 'day' can be used as the lowest level of granularity. Since the lower the level of detail, the larger the amount of data in the fact table, the granularity exercise is in essence figuring out the sweet spot in the tradeoff between detailed level of analysis and data storage.
Note that sometimes the users will not specify certain requirements, but based on industry knowledge, the data warehousing team may foresee that certain requirements will be forthcoming that may result in the need for additional detail. In such cases, it is prudent for the data warehousing team to design the fact table such that the lower-level information is included. This will avoid possibly needing to re-design the fact table in the future. On the other hand, trying to anticipate all future requirements is an impossible and hence futile exercise, and the data warehousing team needs to fight the urge to dump the lowest level of detail into the data warehouse, and only include what is practically needed. Sometimes this can be more of an art than a science, and prior experience will become invaluable here.
Fact Table Types
Types of Facts
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
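The additive property can be sketched in a few lines: the same sum is meaningful no matter which dimension we slice by. The rows below are made-up sample data:

```python
# Each dict is one row of the retailer fact table described above.
rows = [
    {"date": "2019-07-01", "store": "S1", "product": "P1", "sales_amount": 100.0},
    {"date": "2019-07-01", "store": "S2", "product": "P1", "sales_amount": 50.0},
    {"date": "2019-07-02", "store": "S1", "product": "P2", "sales_amount": 25.0},
]

def total_sales(rows, **filters):
    """Sum the fact over any slice of the dimensions -- valid here
    because Sales_Amount is additive along all of them."""
    return sum(
        r["sales_amount"] for r in rows
        if all(r[k] == v for k, v in filters.items())
    )

print(total_sales(rows))                     # grand total: 175.0
print(total_sales(rows, store="S1"))         # one store:   125.0
print(total_sales(rows, date="2019-07-01"))  # one day:     150.0
```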
Say we are a bank with the following fact table:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add balances up across all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
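The semi-additive case can be sketched the same way. Summing balances across accounts on one day is meaningful; "summing" one account across days just double-counts the same money, so a trend over time should use the latest (or average) balance instead. The figures are made up:

```python
# Daily balance snapshots for two hypothetical accounts.
snapshots = [
    {"date": "2019-07-01", "account": "A", "current_balance": 1000.0},
    {"date": "2019-07-01", "account": "B", "current_balance": 500.0},
    {"date": "2019-07-02", "account": "A", "current_balance": 1100.0},
    {"date": "2019-07-02", "account": "B", "current_balance": 400.0},
]

# Valid: total balance held across all accounts on a given day.
bank_total = sum(
    s["current_balance"] for s in snapshots if s["date"] == "2019-07-02"
)
print(bank_total)  # 1500.0

# Not meaningful: summing account A across days. Take the latest instead.
latest_a = max(
    (s for s in snapshots if s["account"] == "A"), key=lambda s: s["date"]
)
print(latest_a["current_balance"])  # 1100.0
```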
Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.