8/12/2019 Sri Sharada Institute of Indian Management
Sri Sharada Institute Of Indian Management -Research
Approved by AICTE
Plot No. 7, Phase-II, Institutional Area, Behind the Grand Hotel, Vasant Kunj,
New Delhi 110070 Website: www.srisiim.org
Project Report on Management Information System (208)
On
Data Warehousing and Data Mining
Submitted To: Prof. N Venkatesan
Submitted By:
Vikram Singh Tomar (160)
Udit Kumar (155)
Vijay Krishna (158)
(PGDM 2013-2015)
Declaration
We hereby declare that the following project report on Data Warehousing and Data
Mining is an authentic work done by us. All the work involved in completing this
report, such as research and analysis of the activities of an organization, is our
own honest work.
Place: New Delhi Vikram Singh Tomar
Udit Kumar
Vijay Krishna
(PGDM 2013-2015)
ACKNOWLEDGEMENT
We would like to express our heartfelt gratitude to our faculty guide, Prof. N Venkatesan, for
giving us the opportunity to prepare a project report on Data Warehousing
and Data Mining, and for his valuable guidance and sincere cooperation, which helped us in
completing this project.
Vikram Singh Tomar
Udit Kumar
Vijay Krishna
PGDM Batch (2013-2015), SRI SIIM
INDEX
1. ABSTRACT
2. DATA WAREHOUSING
   Introduction
   Need of Data Warehousing
   Purpose of Data Warehousing
   Characteristics
   Life cycle
   Components of a data warehouse
   Define Online Analytical Processing
   Tools and technologies
   Applications
3. Understanding Data Marts
   Introduction
   Implementation of a Data Mart
   Maintenance of a Data Mart
   Development approaches in a Data Mart
4. Describing OLAP
   Introduction
   The benefits of OLAP
   The features of OLAP
5. Data Mining
   Introduction
   Types of Data Mining
   Major elements of Data Mining
   Data Mining: A KDD process
   Steps in KDD process
   Methods of Data Mining
6. Conclusion
7. Bibliography
DATA WAREHOUSING AND DATA MINING
ABSTRACT:
Fast, accurate and scalable data analysis techniques are needed to extract useful information
from huge volumes of data. A data warehouse is a single, integrated source of decision support
information formed by collecting data from multiple sources, internal to the organization as
well as external, and transforming and summarizing this information to enable improved
decision making. A data warehouse is designed for easy access by users to large amounts of
information, and data access is typically supported by specialized analytical tools and
applications. Typical applications include decision support systems and executive
information systems.
Data mining is the exploration and analysis of large quantities of data in order to discover
valid, novel, potentially useful, and ultimately understandable patterns in data. It is an
information extraction activity whose goal is to discover hidden facts contained in databases:
the process of extracting valid, previously unknown, comprehensible and actionable
information from large databases and using it to make crucial business decisions. The
project entitled Website Data Mining is an application of data mining which is built
for website developers to help them create websites effectively on the internet.
Data mining finds patterns and subtle relationships in data and infers rules that allow the
prediction of future results. It produces output values for an assigned set of input values.
Typical applications include market segmentation, customer profiling, fraud detection,
evaluation of retail promotions, and credit risk analysis.
Data Warehousing - An Overview
Organizations are increasingly analyzing current and historical data to identify
useful patterns and support business strategies.
A large amount of the right information is the key to survival in today's competitive
environment, and this kind of information can be made available only if there is a totally
integrated enterprise data warehouse.
What is data warehousing?
According to W.H. Inmon, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making
process."
A data warehouse can be defined as a large central repository of data, which helps in the decision-making process of an enterprise. It comprises integrated databases, which can
be any DBMS, a text file, or a flat file. A data warehouse is one of the key components of a BI system.
Need for data warehousing:
Prior to business analytical tools, such as OLAP, organizations handled decision support by accessing data directly from OLTP systems for both transaction and analysis
purposes.
Large organizations generally use the following three different kinds of processes to extract data from the OLTP systems:
Access the OLTP database directly for all types of transactions and analysis.
Create an offline replicated database from the OLTP database at a pre-defined regular interval. While the source OLTP system is used for managing daily transactional activities, the replicated database is used for analysis purposes only.
Create small data warehouses that satisfy the individual needs of the business users, from the OLTP systems where all the past transactions also get stored.
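The second process above, an offline replica refreshed at a regular interval, can be sketched in Python as follows. This is only an illustration and not part of the original project: it assumes in-memory SQLite databases stand in for the OLTP system and its replica, and the table name `orders` and the function `refresh_replica` are hypothetical.

```python
import sqlite3

# Stand-in for the live OLTP database (assumption: SQLite in memory).
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
oltp.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])
oltp.commit()

# Stand-in for the offline replica that analysts query.
replica = sqlite3.connect(":memory:")


def refresh_replica(source, target):
    """Copy the whole source database into the target (one scheduled refresh)."""
    source.backup(target)


# The scheduled refresh runs once.
refresh_replica(oltp, replica)

# A new transaction arrives in the OLTP system after the refresh.
oltp.execute("INSERT INTO orders (amount) VALUES (?)", (99.0,))
oltp.commit()

# Analysis runs against the replica, so it sees the state as of the last refresh.
replica_total = replica.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
oltp_total = oltp.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(replica_total, oltp_total)
```

The replica lags the OLTP system until the next refresh, which is the price paid for keeping analytical queries off the transactional database.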
Purpose of Data Warehousing:
Better business intelligence for end users.
Reduction in time to access and analyze information.
Consolidation of disparate information sources.
Replacement of older, less-responsive decision support systems
Faster time to market for products and services
The following figure shows access of an OLTP database directly by an OLTP transactions application and analysis process together.
The following figure shows the process of accessing offline replicated database by
analysis users.
The following figure shows the creation of smaller data warehouses or data marts that
an application analysis user uses to make a decision.
Multiple data mart architecture leads to creation of an Enterprise Data Warehouse
(EDW) that accumulates data from more than one OLTP system and provides
cumulated and clean data for creation of any kind of a data mart.
The following figure shows an EDW that has been created from OLTP database, which
in turn further creates clean, cumulated, and specific objective data marts.
The following figure shows the schematic representation of the functional parts of an
OLTP system and data warehouse.
Characteristics intrinsic to a data warehouse are:
Consolidated and consistent data
Subject-oriented data
Historical data
Non-volatile data
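The "historical" and "non-volatile" characteristics can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the report's own design: `operational_balance`, `warehouse_history`, `record_balance`, and `balance_as_of` are hypothetical names, and plain Python containers stand in for real databases.

```python
from datetime import date

# Operational (RDBMS-style) store: one current value per key, updated in place.
operational_balance = {}

# Warehouse-style store: append-only history of dated snapshots.
warehouse_history = []


def record_balance(customer, balance, as_of):
    """Update the operational value and append a snapshot to the warehouse."""
    operational_balance[customer] = balance  # the old value is overwritten
    warehouse_history.append(
        {"customer": customer, "balance": balance, "as_of": as_of}
    )  # old snapshots are never changed or removed


def balance_as_of(customer, as_of):
    """Return the latest warehoused balance on or before the given date."""
    rows = [
        r for r in warehouse_history
        if r["customer"] == customer and r["as_of"] <= as_of
    ]
    return max(rows, key=lambda r: r["as_of"])["balance"] if rows else None


record_balance("C001", 500, date(2014, 1, 31))
record_balance("C001", 650, date(2014, 2, 28))

print(operational_balance["C001"])               # current value only
print(balance_as_of("C001", date(2014, 2, 15)))  # historical value
```

The operational store can only answer "what is the balance now", while the append-only history can also answer "what was the balance on a past date".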
The following figure shows the behavior of data in RDBMS and in data warehouse.
[Figure: an RDBMS receives updates and queries; a data warehouse receives data loads and queries.]
DATA WAREHOUSE LIFE CYCLE:
Data warehousing is a concept. It is not a product that can be purchased off the shelf. It is a
set of hardware and software components integrated together, which can be used to efficiently
analyze the massive amount of data stored. It is a process through which one
can build a successful data warehouse. The following are the five steps towards building a
successful data warehouse:
1) JUSTIFICATION
2) REQUIREMENT ANALYSIS
3) DESIGN
4) DEVELOPMENT & IMPLEMENTATION
5) DEPLOYMENT
Tools and Technologies:
The critical steps in the construction of a data warehouse:
Extraction
Cleansing
Transformation
After these critical steps, loading the results into the target system can be carried out either
by separate products or by a single integrated solution. Integrated solutions fall into the following categories:
Code generators
Database data replication tools
Dynamic transformation engine
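The extraction, cleansing, transformation, and loading steps described above can be sketched as a tiny Python pipeline. This is a minimal illustration under stated assumptions rather than any particular product's behavior: the sample rows in `RAW_ORDERS`, the table `fact_orders`, and all function names are hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from an OLTP export.
RAW_ORDERS = [
    {"order_id": "1", "customer": " alice ", "amount": "120.50", "date": "2014-01-05"},
    {"order_id": "2", "customer": "BOB", "amount": "n/a", "date": "2014-01-06"},
    {"order_id": "3", "customer": "Carol", "amount": "75", "date": "2014-02-10"},
]


def extract(rows):
    """Extraction: read rows from the source (here, an in-memory list)."""
    return list(rows)


def cleanse(rows):
    """Cleansing: drop rows whose amount is not a valid number."""
    clean = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        clean.append(row)
    return clean


def transform(rows):
    """Transformation: normalise customer names and convert field types."""
    return [
        (int(r["order_id"]), r["customer"].strip().title(), float(r["amount"]), r["date"])
        for r in rows
    ]


def load(conn, records):
    """Loading: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(cleanse(extract(RAW_ORDERS))))

loaded = conn.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each step as a separate function mirrors the "separate products" option; an integrated solution would bundle the same steps behind one interface.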
Applications:
Online Transaction Processing:
OLTP systems are the major kinds of enterprise applications:
Examples: Order entry systems, Inventory control systems, Reservation systems, Point-of-
sale systems, Tracking systems, etc.
Executive information system (EIS) :
Present information at the highest level of summarization using corporate business
measures. They are designed for extreme ease of use and, in many cases, only a mouse
is required. Graphics are usually generously incorporated to provide at-a-glance indications of performance.
Decision Support Systems (DSS) :
They ideally present information in graphical and tabular form, providing the user
with the ability to drill down on selected information. Note the increased detail and
data manipulation options presented.
Data analysis and arrangement in a data warehouse is done with the help of:
Metadata:
Is the information about data in the data warehouse, which is
maintained by the OLAP server.
OLAP systems:
Are used to arrange and analyze data in a data warehouse.
Data is extracted, cleaned or scrubbed, and stored in the data
warehouse in a homogeneous form after being collected from various
heterogeneous sources.
Components of a data warehouse are:
Data sources: These are various source systems, such as OLTP systems and
legacy systems that manage the daily transactional data of a business
organization and store this data in a data warehouse.
Data staging area: The data staging area, also known as data preparation
area, is a collection of processes that extracts data from various sources, and
then cleans, transforms, and loads the data in a data warehouse.
Presentation services: Various presentation services, such as summary
reports, are provided by a data warehouse to enable decision-makers in
exploring the information.
Data marts: These are subsets of a data warehouse that store the data specific
to a particular business activity.
The following figure shows the various components of a data warehouse.
Roles and Responsibilities in a Data Warehouse:
A data warehouse primarily performs the following five tasks:
Data extraction
Data cleaning
Data loading
Querying
Backup and recovery
The following figure shows the data warehousing process, detailing the preceding tasks.
The preceding tasks are the responsibilities of the following three roles in a data
warehouse:
Load Manager
Warehouse Manager
Query Manager
The detailed tasks of a load manager are:
Extracting data from disparate sources
Fast-loading extracted data into a temporary database
Performing simple data transformations
The following figure shows the role structure of a load manager.
The various tools used by a load manager for extracting and loading the data are:
Fast loader: Used for fast loading of data from operational to temporary
database.
Copy management tool: Used for simple transformation.
Stored procedures: Used for checking and cleaning of data.
Shell scripts: Used for automating the processes and scheduling job control for
an unattended execution.
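A load manager's flow of fast-loading raw data into a temporary database and then applying simple transformations can be sketched in Python. This is an illustrative sketch only: the CSV content in `SOURCE_CSV`, the tables `staging_sales` and `sales`, and the choice of SQLite are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical extract from an operational source, delivered as CSV text.
SOURCE_CSV = "sale_id,region,amount\n1,north,100\n2,South,250\n3,NORTH,50\n"

conn = sqlite3.connect(":memory:")
# Temporary (staging) table: every column is text, so loading never fails on type.
conn.execute("CREATE TABLE staging_sales (sale_id TEXT, region TEXT, amount TEXT)")
# Target warehouse table with proper types.
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Step 1: fast load, copying rows as-is into staging (header row skipped).
rows = list(csv.reader(io.StringIO(SOURCE_CSV)))[1:]
conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)

# Step 2: simple transformations while moving data from staging to the warehouse.
conn.execute(
    """
    INSERT INTO sales (sale_id, region, amount)
    SELECT CAST(sale_id AS INTEGER), UPPER(TRIM(region)), CAST(amount AS REAL)
    FROM staging_sales
    """
)
conn.commit()

regions = [r[0] for r in conn.execute("SELECT DISTINCT region FROM sales ORDER BY region")]
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(regions, total)
```

Separating the raw load from the transformation keeps the load step fast and lets bad rows be inspected in staging before they reach the warehouse.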
The detailed tasks of a warehouse manager are:
Analyzing data for consistency and referential integrity check.
Creating indexes, views, and partitions of the base data.
Generating new aggregations.
Updating existing aggregations.
Creating back-up data.
Archiving obsolete data.
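Some of the warehouse manager tasks listed above (creating indexes and views, and generating or updating aggregations) can be sketched with SQLite from Python. The schema, the sample rows, and the names `idx_sales_region_month`, `v_north_sales`, `sales_monthly_summary`, and `rebuild_monthly_summary` are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, month, amount) VALUES (?, ?, ?)",
    [
        ("NORTH", "2014-01", 100.0),
        ("NORTH", "2014-01", 50.0),
        ("SOUTH", "2014-01", 250.0),
        ("NORTH", "2014-02", 75.0),
    ],
)

# Index and view over the base data.
conn.execute("CREATE INDEX idx_sales_region_month ON sales (region, month)")
conn.execute("CREATE VIEW v_north_sales AS SELECT * FROM sales WHERE region = 'NORTH'")


def rebuild_monthly_summary(conn):
    """Generate (or regenerate) the monthly aggregation from the base data."""
    conn.execute("DROP TABLE IF EXISTS sales_monthly_summary")
    conn.execute(
        """
        CREATE TABLE sales_monthly_summary AS
        SELECT region, month, SUM(amount) AS total_amount, COUNT(*) AS sale_count
        FROM sales
        GROUP BY region, month
        """
    )
    conn.commit()


rebuild_monthly_summary(conn)
summary = conn.execute(
    "SELECT region, month, total_amount FROM sales_monthly_summary ORDER BY region, month"
).fetchall()
print(summary)
```

Rebuilding the whole summary is the simplest way to keep it current; a production warehouse would more likely update only the affected rows after each load.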
The following figure shows the role structure of a warehouse manager.
The tools used by a warehouse manager are:
Stored procedures that create indexes, generate and update aggregations, and
build multidimensional schemas.
System management tools for backup and archiving data.
Data warehouse-specific tools for query-specific analysis.
The detailed tasks performed by a query manager are:
Directing query to appropriate tables.
Scheduling execution of user queries.
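Directing a query to the appropriate table can be sketched as a small routing function in Python. This is a simplified sketch under stated assumptions: the metadata catalogue `SUMMARY_TABLES`, the table names, and `choose_table` are hypothetical, and it assumes the requested measures are additive (such as sums), so a finer-grained summary can answer a coarser query.

```python
# Hypothetical metadata: which summary tables exist and at what grain.
SUMMARY_TABLES = {
    frozenset({"region", "month"}): "sales_monthly_summary",
    frozenset({"region"}): "sales_region_summary",
}
DETAIL_TABLE = "sales"


def choose_table(group_by_columns):
    """Pick the coarsest summary table that still covers the requested columns.

    Falls back to the detailed table when no summary table can answer the query.
    """
    wanted = frozenset(group_by_columns)
    candidates = [
        (len(columns), name)
        for columns, name in SUMMARY_TABLES.items()
        if wanted <= columns  # the summary must contain every requested column
    ]
    if candidates:
        return min(candidates)[1]  # fewest grouping columns means the smallest table
    return DETAIL_TABLE


print(choose_table(["region"]))
print(choose_table(["month"]))
print(choose_table(["customer"]))
```

Routing to a summary table where possible is what lets a query manager answer common questions without scanning the detailed data.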
The following figure shows the role structure of a query manager.
The tools used by a query manager are:
User access tools or stored procedures for directing queries to appropriate
tables.
Stored procedures, user access tools, third party software, or database facilities
to schedule execution of queries.
[Figure: role structure of a query manager. Labels: Query, Query Manager, Query Redirection, Query Scheduling, Stored Procedures, Query Management Tool, Metadata, Detailed Information, Summary Information.]
Understanding Data Marts
In an enterprise data warehouse, there can be a collection of smaller data warehouses
known as data marts.
Data Mart:
Is a specific subset of a data warehouse, stored within its own database.
Contains the data required at a department level or for a specific business area
of the organization.
Makes query processing faster by holding a smaller volume of data than a typical data warehouse.
Also enables mobility of data due to the reduced size of the data.
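The idea of a data mart as a department-level subset stored in its own database can be sketched in Python. The example is illustrative only: in-memory SQLite databases stand in for the warehouse and the mart, and the table `sales`, its columns, and the function `build_data_mart` are hypothetical.

```python
import sqlite3

# Stand-in for the enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, department TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO sales (department, amount) VALUES (?, ?)",
    [("marketing", 100.0), ("finance", 40.0), ("marketing", 60.0)],
)
warehouse.commit()


def build_data_mart(warehouse_conn, department):
    """Create a separate database holding only one department's rows."""
    mart = sqlite3.connect(":memory:")
    mart.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
    rows = warehouse_conn.execute(
        "SELECT sale_id, amount FROM sales WHERE department = ?", (department,)
    )
    mart.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    mart.commit()
    return mart


marketing_mart = build_data_mart(warehouse, "marketing")
mart_rows = marketing_mart.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(mart_rows)
```

Queries from the marketing department now scan two rows instead of the whole warehouse table, which is the source of the speed-up described above.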
The following figure shows the creation of several data marts from an enterprise data
warehouse.
Implementation of a Data Mart:
In an organization, a data mart is generally implemented by the
enterprise Information Technology department, by a vendor, or by both of them working together.
Combining internal expertise with vendor helpdesk support can be the best and most
cost-effective solution, and the organization's own employees can more easily translate the
organization's vision into technology.
Maintenance of a Data Mart:
Requires periodic effort to load, refresh or update, and delete data
in the data mart.
Has to be done on a regular cycle, based on the predefined frequency requirement
of the department's specific data mart.
Development approaches in a Data Mart are:
Top down approach: In the top down approach, the data warehouse is created first and the dependent data marts are created after that, as shown in the following figure.
[Figure: TOP DOWN APPROACH. Labels: ETS, Data Warehouse, Data Marts.]
Bottom up approach: In the bottom up approach, the data marts are created first, and
these data marts together contribute to the development of the data warehouse, as
shown in the following figure.
[Figure: BOTTOM UP APPROACH. Labels: ETS, Data Marts, 1, 2, 3, 4, Data Warehouse.]
Hybrid approach: This approach is fast and highly user-oriented, like the bottom up approach, and maintains the data integrity of a data warehouse, like the top down
approach.
The following figure shows the hybrid approach of creating a data mart.
[Figure: HYBRID APPROACH. Labels: ETS, Data Marts, 1, 2, 3, 4, Data Warehouse.]
Federated approach: This approach recommends ways to collect large amounts of
heterogeneous data from other data warehouses, data marts, and packaged applications that
already exist inside companies.
The goal of a federated approach is to integrate existing analytic structures wherever
and however possible.
Describing OLAP
OLAP is a crucial element of an enterprise data warehouse or data mart solution. It
fits into data warehousing and data mart strategies to deliver an exceptional and
convincing way of performing data reporting, scrutiny, analysis, modeling, and planning in an
enterprise.
OLAP is a process of analyzing and processing data from various data sources, such
as a data warehouse.
The benefits of OLAP are:
OLAP enables enterprises to respond to market demands more efficiently.
Developers using the software specially designed for OLAP solutions are able
to deliver applications to end-users faster and provide better service to them.
OLAP systems improve the performance of OLTP systems by reducing
network traffic and eliminating complex queries from the OLTP database.
The features of OLAP are:
Multidimensional views: OLAP enables business analysts to analyze and
store the data in multidimensional structures. The multidimensional data views
are referred to as cubes.
Calculation-intensive capabilities: OLAP applications have the capability to
perform complex calculations and aggregations on the stored data, such as
percentage of totals, calculation of profits, and so on. These complex
calculations and aggregations are beneficial in reaching the ultimate business
solutions.
Time intelligence: All OLAP applications use the time dimension. This is the
most important and widely used parameter for performing business analysis.
The time parameter is used to compare and judge the performance of a
business process.
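The three features above (multidimensional views, calculations such as percentage of totals, and comparison over time) can be illustrated with a toy cube in plain Python. The fact rows in `FACTS`, the dimension names, and the helper `roll_up` are hypothetical; a real OLAP server would store and index the cube rather than recompute it from a list.

```python
from collections import defaultdict

# Hypothetical fact rows with three dimensions (region, product, quarter)
# and one measure (sales).
FACTS = [
    {"region": "North", "product": "Tea", "quarter": "2014-Q1", "sales": 100},
    {"region": "North", "product": "Coffee", "quarter": "2014-Q1", "sales": 300},
    {"region": "South", "product": "Tea", "quarter": "2014-Q1", "sales": 200},
    {"region": "North", "product": "Tea", "quarter": "2014-Q2", "sales": 150},
    {"region": "South", "product": "Coffee", "quarter": "2014-Q2", "sales": 250},
]


def roll_up(facts, dimensions, measure="sales"):
    """Aggregate the measure over the chosen dimensions (one view of the cube)."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# Multidimensional views: the same facts seen along different dimensions.
by_region = roll_up(FACTS, ["region"])
by_quarter = roll_up(FACTS, ["quarter"])
by_region_product = roll_up(FACTS, ["region", "product"])

# Calculation-intensive capability: each region's percentage of total sales.
grand_total = sum(by_region.values())
share = {key[0]: 100 * value / grand_total for key, value in by_region.items()}

# Time intelligence: change in sales from one quarter to the next.
change = by_quarter[("2014-Q2",)] - by_quarter[("2014-Q1",)]

print(by_region, share, change)
```

Each call to `roll_up` with a different dimension list corresponds to viewing a different face or slice of the cube.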
The following table lists some basic differences between OLTP and OLAP
systems.
DATA MINING
What is data mining?
Data mining refers to the process of analyzing data from different perspectives and
summarizing it into useful information. Data mining software is one of a number of tools used for analyzing data from many different dimensions or angles, categorizing it, and
summarizing the relationships identified.
Definition:
Data mining is the process of finding correlations or patterns among fields in large relational
databases. It is the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business
decisions.
Different Types of Data Mining: Business, Scientific and Internet Data Mining
Five major elements of Data Mining:
1. Extract, transform, & load transaction data on to the data warehouse system.
2. Store and manage data in multidimensional database system.
3. Provide access to business analysts and IT Professionals.
4. Analyze the data by application software.
5. Present the data in useful format such as graph or table.
DATA MINING: A KDD Process
Steps of KDD Process
1. Learning the application domain: relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing
4. Data reduction and transformation: find useful features, dimensionality or variable reduction, and invariant representation
5. Choosing functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
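The data-handling part of the KDD process listed above (selection, cleaning, transformation, mining, and pattern evaluation) can be sketched as a chain of small Python functions. This is a deliberately simplified illustration: the records in `RAW`, the age bands, the frequency count used as the "mining" step, and the `min_count` threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical raw records; one has a missing age.
RAW = [
    {"customer": "A", "age": 23, "bought": "phone"},
    {"customer": "B", "age": None, "bought": "laptop"},
    {"customer": "C", "age": 35, "bought": "phone"},
    {"customer": "D", "age": 41, "bought": "phone"},
    {"customer": "E", "age": 29, "bought": "laptop"},
]


def select(records):
    """Create the target data set: keep only the fields of interest."""
    return [{"age": r["age"], "bought": r["bought"]} for r in records]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if r["age"] is not None]


def transform(records):
    """Data reduction and transformation: replace exact age with an age band."""
    return [
        {"age_band": "under_30" if r["age"] < 30 else "30_plus", "bought": r["bought"]}
        for r in records
    ]


def mine(records):
    """Data mining: count how often each (age band, product) pair occurs."""
    return Counter((r["age_band"], r["bought"]) for r in records)


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns seen at least min_count times."""
    return {pattern: n for pattern, n in patterns.items() if n >= min_count}


patterns = evaluate(mine(transform(clean(select(RAW)))))
print(patterns)
```

The earlier steps (understanding the domain, choosing the mining function and algorithm) are decisions made by the analyst and are reflected here only in how the functions were written.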
Methods of Data Mining:
1. Classification
2. Regression
3. Clustering
4. Association rules
5. Visualization
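As one concrete instance of these methods, association rules are usually judged by two standard measures, support and confidence. The following Python sketch computes both for a toy set of shopping baskets; the data in `TRANSACTIONS` and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical market-basket data: each set is one customer's purchase.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent (support of both divided by support of antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


bread_milk_support = support({"bread", "milk"}, TRANSACTIONS)
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)
print(bread_milk_support, bread_to_milk_confidence)
```

Here bread and milk appear together in two of four baskets, and two of the three baskets containing bread also contain milk, so the rule "bread implies milk" has support 0.5 and confidence 2/3.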
Summary
A data warehouse is a large repository of data, which helps in decision-making
process of an enterprise.
The four characteristics intrinsic to a data warehouse are:
Consolidated and consistent data
Subject-oriented data
Historical data
Non-volatile data
A data warehouse also holds metadata, a specific type of data that contains information
about the data stored in the warehouse.
The various components of a data warehouse are:
Data sources
Data preparation area
Presentation services
Data marts
Data warehouse primarily performs the following five tasks:
Data extraction
Data cleaning
Data loading
Querying
Backup and recovery
The following roles perform the above-defined tasks in a data warehouse:
Load Manager
Warehouse Manager
Query Manager
The four development approaches in creating a data mart are:
Top down approach
Bottom up approach
Hybrid approach
Federated approach
The features of OLAP are:
Multidimensional views
Calculation intensive capabilities
Time intelligence
CONCLUSION
Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data
warehouse is the hub for decision support data.
Data mining is a useful tool with multiple algorithms that can be tuned for specific
tasks. It can benefit business, medicine, and science. It needs more efficient algorithms to
speed up the data mining process.