Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer...

23
Object_Oriented Databases , by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Dr. Awad Khalil Computer Science Department Computer Science Department AUC AUC

Transcript of Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer...

Page 1: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

1

Data Warehousing Concepts

Dr. Awad KhalilDr. Awad Khalil

Computer Science DepartmentComputer Science Department

AUCAUC

Page 2: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

2

Content

Why Data Warehousing?Why Data Warehousing? What Data Warehousing?What Data Warehousing? Benefits of Data WarehousingBenefits of Data Warehousing Problems of Data WarehousingProblems of Data Warehousing

Page 3: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

3

Why Data Warehousing? Corporate decision-makers require access to all the organization’s Corporate decision-makers require access to all the organization’s

data, wherever it is located.data, wherever it is located. To provide comprehensive analysis of the organization, its To provide comprehensive analysis of the organization, its

business, its requirements, and any trends, requires access to not business, its requirements, and any trends, requires access to not only the current values in the database, but also to historical data.only the current values in the database, but also to historical data.

To facilitate this type of analysis, the data warehouse has been To facilitate this type of analysis, the data warehouse has been created to hold data drawn from several data sources, maintained created to hold data drawn from several data sources, maintained by different operating units, together with historical and summary by different operating units, together with historical and summary transformations.transformations.

However, decision-makers also require powerful analysis tools. However, decision-makers also require powerful analysis tools. Two main types of analysis tools have emerged over the last few Two main types of analysis tools have emerged over the last few years:years: Online Analytical Processing (OLAP)Online Analytical Processing (OLAP) Data MiningData Mining

Page 4: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

4

What Data Warehousing? Data Warehousing: : A subject-oriented, integrated, time-A subject-oriented, integrated, time-

variant, and non-volatile collection of data in support of variant, and non-volatile collection of data in support of management’s decision-making processmanagement’s decision-making process..

Subject-oriented: : as the warehouse is organized around the as the warehouse is organized around the major subjects of the enterprise (such as customers, products, major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales).customer invoicing, stock control, and product sales).

Integrated: : because of the coming together of source data because of the coming together of source data from different enterprise-wide applications systems.from different enterprise-wide applications systems.

Time-variant:: because data in the warehouse is only accurate because data in the warehouse is only accurate and valid at some point in time or over some time interval.and valid at some point in time or over some time interval.

Non-volatile: : as the data is not updated in real time but as the data is not updated in real time but refreshed from operational systems on a regular basis.refreshed from operational systems on a regular basis.

Page 5: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

5

Benefits of Data Warehousing

Potential high returns on investmentPotential high returns on investmentCompetitive advantageCompetitive advantageIncrease productivity of corporate Increase productivity of corporate

decision-makersdecision-makers

Page 6: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

6

OLTP versus Data WarehousingOLTP Systems

Holds current data Stores detailed data Data is dynamic Repetitive processing High level of transaction

throughput Predictable pattern of usage Transaction-driven Application-oriented Supports day-to-day

decisions Serve large number of users

Data Warehousing Holds historical data 

Stores detailed, lightly, and highly summarized data

 

Data is largely static 

Ad hoc, unstructured and heuristic processing

 

Medium to low level of transaction throughput

 

Unpredictable pattern of usage 

Analysis-driven 

Subject-oriented Supports strategic decisions Serves relatively low number

of managerial users

Page 7: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

7

Problems of Data Warehousing Underestimation of resources for data loadingUnderestimation of resources for data loading Hidden problems with source systemsHidden problems with source systems Required data not capturedRequired data not captured Increased end-user demandsIncreased end-user demands Data homogenizationData homogenization High demand for resourcesHigh demand for resources Data ownershipData ownership High maintenanceHigh maintenance Long-duration projectsLong-duration projects Complexity of integrationComplexity of integration

Page 8: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

8

Data Warehouse Architecture

Operational Data Operational Data

Store (ODS) Load Manager Warehouse

Manager Query Manager Detailed Data Lightly and Highly

Summarized Data Archive/Backup

Data Metadata End-User Access

Tools

Page 9: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

9

Data Warehouse Architecture (Cont’d)

Operational Data Mainframe operational data held in first generation hierarchical Mainframe operational data held in first generation hierarchical

and network databases.and network databases. Departmental data held in proprietary file systems such as Departmental data held in proprietary file systems such as

VSAM, RMS, and relational DBMSs such as Oracle and DB2.VSAM, RMS, and relational DBMSs such as Oracle and DB2. Private data held on workstations and private servers.Private data held on workstations and private servers. External systems such as the Internet, commercially available External systems such as the Internet, commercially available

databases, or databases associated with an organization’s databases, or databases associated with an organization’s suppliers or customers.suppliers or customers.

Operational Data Store (ODS) ODS is a repository of current and integrated operational data ODS is a repository of current and integrated operational data

used for analysis.used for analysis. It is often structured and supplied with data in the same way as It is often structured and supplied with data in the same way as

the data warehouse , but may in fact act simply as a staging area the data warehouse , but may in fact act simply as a staging area for data to be moved into the data warehouse.for data to be moved into the data warehouse.

Page 10: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

10

Data Warehouse Architecture (Cont’d)

Load Manager The load manager (also called the The load manager (also called the frontend frontend component) component)

performs all the operations associated with the performs all the operations associated with the extraction and loading of data into the warehouse.extraction and loading of data into the warehouse.

The operations performed by the load manager may The operations performed by the load manager may include simple transformations of the data to prepare include simple transformations of the data to prepare the data for entry into the warehouse.the data for entry into the warehouse.

The size and complexity of this component will vary The size and complexity of this component will vary between data warehouses and may be constructed using between data warehouses and may be constructed using a combination of vendor data loading tools and custom-a combination of vendor data loading tools and custom-built programs.built programs.

Page 11: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

11

Data Warehouse Architecture (Cont’d)Warehouse Manager The warehouse manager performs all the operations associated with the management The warehouse manager performs all the operations associated with the management

of the data in the warehouse.of the data in the warehouse. This component is constructed using vendor data management tools and custom-built This component is constructed using vendor data management tools and custom-built

programs.programs. The operations performed by the warehouse manager include:The operations performed by the warehouse manager include:

Analysis of data to ensure consistency.Analysis of data to ensure consistency. Transformation and merging of source data from temporary storage into data Transformation and merging of source data from temporary storage into data

warehouse tables.warehouse tables. Creation of indexes and views on base tables.Creation of indexes and views on base tables. Generation of denormalizations (if necessary).Generation of denormalizations (if necessary). Generation of aggregations (if necessary).Generation of aggregations (if necessary). Backing-up and archiving data.Backing-up and archiving data.

Query Manager The query manager (also called the The query manager (also called the backendbackend component) performs all the operations component) performs all the operations

associated with the management of user queries.associated with the management of user queries. This component is typically constructed using vendor end-user data access tools, data This component is typically constructed using vendor end-user data access tools, data

warehouse monitoring tools, database facilities, and custom-built programs.warehouse monitoring tools, database facilities, and custom-built programs.

Page 12: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

12

Data Warehouse Architecture (Cont’d)

Detailed Data This area of the warehouse stores all the detailed data in the This area of the warehouse stores all the detailed data in the

database schema.database schema. In most cases, the detailed data is not stored online but is made In most cases, the detailed data is not stored online but is made

available by aggregating the data to the next level of detail.available by aggregating the data to the next level of detail. However, on a regular basis, detailed data is added to the However, on a regular basis, detailed data is added to the

warehouse to supplement the aggregated data.warehouse to supplement the aggregated data.

Lightly and Highly Summarized Data This area stores all the predefined lightly and highly summarized This area stores all the predefined lightly and highly summarized

(aggregated) data generated by the warehouse manager.(aggregated) data generated by the warehouse manager. It is transient area as it will be subject to change on an ongoing It is transient area as it will be subject to change on an ongoing

basis in order to respond to changing query profiles.basis in order to respond to changing query profiles. The purpose of summary information is to speed up the The purpose of summary information is to speed up the

performance of the queries.performance of the queries.

Page 13: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

13

Data Warehouse Architecture (Cont’d)

Archive/Backup Data This area of the warehouse stores the detailed and This area of the warehouse stores the detailed and

summarized data for the purpose of archiving and summarized data for the purpose of archiving and backup.backup.

Metadata This area stores all the metadata (data about data) This area stores all the metadata (data about data)

definitions used by all the processes in the warehouse. definitions used by all the processes in the warehouse. Metadata is used for a variety of purposes including:Metadata is used for a variety of purposes including: The extraction and loading processes.The extraction and loading processes. The warehouse management process.The warehouse management process. As part of the query management process.As part of the query management process.

Page 14: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

14

Data Warehouse Architecture (Cont’d)

End-User Access Tools The principal purpose of data warehousing is to provide The principal purpose of data warehousing is to provide

information to business users for strategic decision-making.information to business users for strategic decision-making. These users interact with the warehouse using end-user access These users interact with the warehouse using end-user access

tools.tools. The data warehouse must efficiently support ad hoc and routine The data warehouse must efficiently support ad hoc and routine

analysis.analysis. Although the definition of end-user access tools can overlap, Although the definition of end-user access tools can overlap,

they can be categorized into five main groups:they can be categorized into five main groups: Reporting and query tools;Reporting and query tools; Application development tools;Application development tools; Executive Information Systems (EIS) tools;Executive Information Systems (EIS) tools; Online Analytical Processing (OLAP) tools;Online Analytical Processing (OLAP) tools; Data mining tools.Data mining tools.

Page 15: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

15

Data Warehouse Data Flows

Data warehousing focuses on the Data warehousing focuses on the management of five primary data management of five primary data flows, namely the inflow, upflow, flows, namely the inflow, upflow, downflow, outflow, and downflow, outflow, and metaflow:metaflow:

InflowInflow: Extracting, cleansing, : Extracting, cleansing, and loading of the source and loading of the source data.data.

UpflowUpflow: Adding value to the : Adding value to the data in the warehouse data in the warehouse through summarizing, through summarizing, packaging, and distribution packaging, and distribution of the data.of the data.

DownflowDownflow: Archiving and : Archiving and backing-up the data in the backing-up the data in the warehouse.warehouse.

OutflowOutflow: Making the data : Making the data available to end-usersavailable to end-users

MetaflowMetaflow: Managing the : Managing the metadata.metadata.

Page 16: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

16

Data Warehouse Tools and Technologies

Extraction, Cleansing, and Transformation Tools Selecting the correct extraction, cleansing, and Selecting the correct extraction, cleansing, and

transformation tools are critical steps in the transformation tools are critical steps in the construction of a data warehouse.construction of a data warehouse.

The tasks of capturing data from a source system, The tasks of capturing data from a source system, cleansing and transforming it, and then loading the cleansing and transforming it, and then loading the results into target system can be carried out either by results into target system can be carried out either by separate products, or by single integrated solution.separate products, or by single integrated solution.

Integrated solutions fall into one of the following Integrated solutions fall into one of the following categories:categories: Code generators;Code generators; Database replication tools;Database replication tools; Dynamic transformation engines.Dynamic transformation engines.

Page 17: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

17

Data Warehouse Tools and Technologies (Cont’d)

Data Warehouse DBMS Requirements: The specialized requirements for a relational DBMS suitable for data warehousing : The specialized requirements for a relational DBMS suitable for data warehousing

are as follows:are as follows: Load performance and load processingLoad performance and load processing Data quality managementData quality management Query performanceQuery performance Terabyte scalability and mass user scalabilityTerabyte scalability and mass user scalability Networked data warehouseNetworked data warehouse Warehouse administrationWarehouse administration Integrated dimensional analysisIntegrated dimensional analysis Advanced query functionalityAdvanced query functionality

Parallel DBMSs: Data warehousing requires the processing of enormous amounts of data and : Data warehousing requires the processing of enormous amounts of data and parallel database technology offers a solution to providing the necessary growth in performance.parallel database technology offers a solution to providing the necessary growth in performance.

The success of parallel DBMSs depends on the efficient operation of many resources including The success of parallel DBMSs depends on the efficient operation of many resources including processors, memory, disks and network connections.processors, memory, disks and network connections.

The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes The aim behind using parallel DBMS is to solve decision-support problems using multiple nodes working on the same problem.working on the same problem.

The major characteristics of parallel DBMSs are scalability, operability, and availability.The major characteristics of parallel DBMSs are scalability, operability, and availability. The parallel DBMS performs many database operations simultaneously, splitting individual tasks The parallel DBMS performs many database operations simultaneously, splitting individual tasks

into smaller parts so that tasks can be spread across multiple processors.into smaller parts so that tasks can be spread across multiple processors. Parallel DBMSs must be capable of running parallel queries.Parallel DBMSs must be capable of running parallel queries. Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and Parallel DBMSs must be capable of parallel data loading, table scaling, and data archiving and

backup.backup.

Page 18: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

18

Data Warehouse Tools and Technologies (Cont’d)

Data Warehouse Metadata The major purpose of metadata is to show the pathway back to where the data begun, so that The major purpose of metadata is to show the pathway back to where the data begun, so that

the warehouse administrators know the history of any item in the warehouse.the warehouse administrators know the history of any item in the warehouse. Metadata has several functions within the warehouse that relates to the processes associated Metadata has several functions within the warehouse that relates to the processes associated

with data transformation and loading, data warehouse management, and query generation.with data transformation and loading, data warehouse management, and query generation. The metadata associated with data transformation and loading must describe the source data The metadata associated with data transformation and loading must describe the source data

and any changes that were made to the data.and any changes that were made to the data. The metadata associated with data management describes the data as it is stored in the The metadata associated with data management describes the data as it is stored in the

warehouse.warehouse. The metadata is also required by the query manager to generate appropriate queries. In turn, The metadata is also required by the query manager to generate appropriate queries. In turn,

the query manager generates additional metadata about the queries that are run, which can the query manager generates additional metadata about the queries that are run, which can be used to generate a history on all the queries and a query profile for each user, group of be used to generate a history on all the queries and a query profile for each user, group of users, or the data warehouse.users, or the data warehouse.

There is also metadata associated with the users of queries that includes, for example, There is also metadata associated with the users of queries that includes, for example, information describing what the term ‘price’ or ‘customer’ means in a particular database information describing what the term ‘price’ or ‘customer’ means in a particular database and whether the meaning has changed over time.and whether the meaning has changed over time.

Synchronizing Metadata The major integration issue is how to synchronize the various types of metadata used The major integration issue is how to synchronize the various types of metadata used

throughout the data warehouse. throughout the data warehouse. The various tools of a data warehouse generate and use their own metadata, and to achieve The various tools of a data warehouse generate and use their own metadata, and to achieve

integration, we require that these tools are capable of sharing their metadata.integration, we require that these tools are capable of sharing their metadata. The challenge is to synchronize metadata between different products from different vendors The challenge is to synchronize metadata between different products from different vendors

using different metadata stores.using different metadata stores.

Page 19: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

19

Data Warehouse Tools and Technologies (Cont’d)

Administration and Management Tools A data warehouse requires tools to support the administration and A data warehouse requires tools to support the administration and

management of such a complex environment.management of such a complex environment. The data warehouse administration and management tools must be capable The data warehouse administration and management tools must be capable

of supporting the following tasks:of supporting the following tasks: Monitoring data loading from multiple sources;Monitoring data loading from multiple sources; Data quality and integrity checks;Data quality and integrity checks; Managing and updating metadata;Managing and updating metadata; Monitoring database performance to ensure efficient query response Monitoring database performance to ensure efficient query response

times and resources utilization;times and resources utilization; Auditing data warehouse usage to provide user chargeback information;Auditing data warehouse usage to provide user chargeback information; Replicating, subsetting, and distributing data;Replicating, subsetting, and distributing data; Maintaining efficient data storage management;Maintaining efficient data storage management; Purging data;Purging data; Archiving and backing-up data;Archiving and backing-up data; Implementing recovery following failures;Implementing recovery following failures; Security management.Security management.

Page 20: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

20

Data Marts

Data Mart is a subset of a data Data Mart is a subset of a data warehouse that supports the warehouse that supports the requirements of a particular requirements of a particular department or business department or business function.function.

The characteristics that The characteristics that differentiate data marts and data differentiate data marts and data warehouses include:warehouses include: A data mart focuses on only A data mart focuses on only

the requirements of users the requirements of users associated with one associated with one department or business department or business function;function;

Data marts do not normally Data marts do not normally contain detailed operational contain detailed operational data, unlike data data, unlike data warehouse;warehouse;

As data marts contain less As data marts contain less data compared with data data compared with data warehouse, data marts are warehouse, data marts are more easily understood and more easily understood and navigated.navigated.

Page 21: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

21

Reasons for Creating a Data Mart

To give users access to the data they need to analyze most often;To give users access to the data they need to analyze most often; To provide data in a form that matches the collective view of the data by a To provide data in a form that matches the collective view of the data by a

group of users in a department or business function;group of users in a department or business function; To improve end-user response time due to the reduction in the volume of To improve end-user response time due to the reduction in the volume of

data to be accessed;data to be accessed; To provide appropriately structured data as dictated by the requirements of To provide appropriately structured data as dictated by the requirements of

end-user access tools such as OLAP and data mining tools, which may end-user access tools such as OLAP and data mining tools, which may require their own internal database structures.require their own internal database structures.

Data marts normally use less data so tasks such as data cleansing, loading, Data marts normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data setting up a data mart is simpler than establishing a corporate data warehouse.warehouse.

The cost of implementing data marts is normally less than that required to The cost of implementing data marts is normally less than that required to establish a data warehouse.establish a data warehouse.

The potential users of a data mart are more clearly defined and can be more The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.corporate data warehouse project.

Page 22: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

22

Data Marts Issues

Data mart functionality: The capabilities of data marts have increased with the growth in : The capabilities of data marts have increased with the growth in their popularity. Rather than being simply small, easy-to-access databases, some data marts their popularity. Rather than being simply small, easy-to-access databases, some data marts must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP must now be scalable to hundreds of gigabytes, and provide sophisticated analysis using OLAP and/or data mining tools.and/or data mining tools.

Data mart size: Users expect faster response times from data marts than from data : Users expect faster response times from data marts than from data warehouse, however, performance deteriorates as data marts grow in size. warehouse, however, performance deteriorates as data marts grow in size.

Data mart load performance: A data mart has to balance two critical components: end-: A data mart has to balance two critical components: end-user response time and data loading performance. user response time and data loading performance.

User’s access to data in multiple data marts: One approach is to replicate data : One approach is to replicate data between different data marts or, alternatively, build virtual data marts. Virtual data marts are between different data marts or, alternatively, build virtual data marts. Virtual data marts are views of several physical data marts or the corporate data warehouse tailored to meet the views of several physical data marts or the corporate data warehouse tailored to meet the requirements of specific group of users.requirements of specific group of users.

Data mart Internet/Intranet access: Internet/Intranet technology offers users low-cost : Internet/Intranet technology offers users low-cost access to data marts and the data warehouse using Web browsers.access to data marts and the data warehouse using Web browsers.

Data mart administration: As the number of data marts in an organization increases, so : As the number of data marts in an organization increases, so does the need to centrally manage and coordinate data mart activities.does the need to centrally manage and coordinate data mart activities.

Data mart installation: Data marts are becoming increasingly complex to build. Vendors : Data marts are becoming increasingly complex to build. Vendors are offering products referred to as “Data marts in a box” that provide a low-cost source of data are offering products referred to as “Data marts in a box” that provide a low-cost source of data mart tools.mart tools.

Page 23: Object_Oriented Databases, by Dr. Khalil 1 Data Warehousing Concepts Dr. Awad Khalil Computer Science Department AUC.

Object_Oriented Databases, by Dr. Khalil

23

Thank you