BC0058 Assignment


Page 1: BC0058 Assignment

ASSIGNMENT

Name : SHEELA KANDULNA

Roll No. : 511120241

Learning Centre : Leading Edge Informatics

Learning Centre Code : 03228 Course : BCA

Subject : Data Warehouse

Semester : 6th

Module No. : BC 0058

Date of Submission : 30/11/13

Directorate of Distance Education, Sikkim Manipal University, II Floor, Syndicate House

Manipal - 576104


FALL 2013 ASSIGNMENT OF BC0058

ANSWER THE FOLLOWING QUESTIONS :-

1) Differentiate between OLTP and Data Warehouse.

Ans:- 1. Application databases are OLTP (On-Line Transaction Processing) systems, where every transaction has to be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank records. If this happens frequently, the bank wouldn't stay in business for too long. So the banking system is designed to make sure that every transaction gets recorded within the time we stand before the ATM.

2. A Data Warehouse (DW), on the other hand, is a database designed to facilitate querying and analysis. Unlike OLTP systems, these databases contain largely read-only data that can be queried and analyzed far more efficiently than your regular OLTP application databases. In this sense a data warehouse is designed to be read-optimized, whereas an OLTP system is write-optimized.

3. Separation from our application database also ensures that our business intelligence solution is scalable (our bank and ATMs don’t go down just because the CFO asked for a report), better documented and managed.

4. Creation of a DW leads to a direct increase in the quality of analysis, as the table structures are simpler, standardized, and often de-normalized. Having a well-designed DW is the foundation upon which successful BI (Business Intelligence) / Analytics initiatives are built.

5. A DW usually stores many months or years of data, in order to support historical analysis. OLTP systems usually store data from only a few weeks or months; an OLTP system keeps only as much historical data as is needed to successfully meet the requirements of the current transaction.
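The contrast between the two workloads can be sketched with a toy example in Python. Here SQLite stands in for both systems, and the table, account, and branch names are all hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# OLTP side: normalized, write-optimized, records each transaction as it occurs.
con.execute("CREATE TABLE atm_txn (txn_id INTEGER PRIMARY KEY, account TEXT, amount REAL, ts TEXT)")
con.execute("INSERT INTO atm_txn VALUES (1, 'A-100', -50.0, '2013-11-30T09:00')")

# DW side: a de-normalized, read-only fact table holding years of history.
con.execute("CREATE TABLE sales_fact (year INTEGER, branch TEXT, total REAL)")
con.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(2011, "Manipal", 120.0), (2012, "Manipal", 150.0), (2013, "Manipal", 180.0)],
)

# A typical warehouse query: an aggregate over several years of history,
# the kind of read an OLTP schema is not optimized to serve.
total = con.execute(
    "SELECT SUM(total) FROM sales_fact WHERE branch = 'Manipal'"
).fetchone()[0]
print(total)  # 450.0
```

The OLTP table is touched one row at a time; the fact table is scanned and aggregated, which is exactly why it is kept separate and read-optimized.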

2) What are the key issues in Planning a Data Warehouse?

Ans:- Planning for our Data Warehouse begins with a thorough consideration of the key issues. Answers to the key questions are vital for the proper planning and the successful completion of the project. Therefore let us consider the pertinent issues, one by one.

1. VALUES AND EXPECTATIONS :- Some companies jump into Data Warehousing without assessing the value to be derived from their proposed Data Warehouse. Of course, first we have to be sure that, given the culture and the current requirements of our company, a Data Warehouse is the most viable solution; only then can we begin to enumerate the benefits and value propositions.

2. RISK ASSESSMENT :- Planners generally associate project risks with the cost of the project. If the project fails, how much money goes down the drain? But the assessment of risks is more than calculating the loss from the project costs. What are the risks faced by the company without the benefits derivable from a Data Warehouse? What losses are likely to be incurred? What opportunities are likely to be missed?


3) Explain Source Data Component and Data Staging Components of Data Warehouse Architecture.

Ans:- The SOURCE DATA COMPONENT involves four categories of data, as follows:

1. Production data:- This category of data comes from the various operational systems of the enterprise. Based on the information requirements in the data warehouse, we choose segments of data from the different operational systems. While dealing with this data, we come across many variations in the data formats. We also notice that the data resides on different hardware platforms. Further, the data is supported by different database systems and operating systems. This is the data from many vertical applications.

2. Internal data:- In every organization, users keep their "private" spreadsheets, documents, customer profiles, and sometimes even departmental databases. This is the internal data, parts of which could be useful in the data warehouse for analysis.

3. Archived data:- Operational systems are primarily intended to run the current business. In every operational system, we periodically take the old data and store it in archived files. The circumstances in our organization dictate how often and which portions of the operational databases are archived for storage. Some data is archived after a year.

4. External data:- Most executives depend on data from external sources for a high percentage of the information they use. They use statistics relating to their industry produced by external agencies. They use market share data of competitors. They use standard values of financial indicators for their business to check on their performance.

DATA STAGING COMPONENT: Three major functions need to be performed for getting the data ready. We have to extract the data, transform the data, and then load the data into the data warehouse storage. These three major functions of extraction, transformation, and preparation for loading take place in a staging area. The data-staging component consists of a workbench for these functions. Data staging provides a place and an area with a set of functions to clean, change, combine, convert, de-duplicate, and prepare source data for storage and use in the data warehouse.

1. Data Extraction:- This function has to deal with numerous data sources. We have to employ the appropriate technique for each data source. Source data may come from different source machines in diverse data formats. Part of the source data may be in relational database systems. Some data may be on other legacy network and hierarchical data models. Many data sources may still be in flat files. We may want to include data from spreadsheets and local departmental data sets. Data extraction may become quite complex.

2. Data Transformation:- In every system implementation, data conversion is an important function. For example, when we implement an operational system such as a magazine subscription application, we have to initially populate our database with data from the prior system's records. We may be converting over from a manual system. Or, we may be moving from a file-oriented system to a modern system supported with relational database tables. In either case, we will convert the data from the prior systems. So, what is so different for a data warehouse? How is data transformation for a data warehouse more involved than for an operational system?

3. Data Loading:- Two distinct groups of tasks form the data loading function. When we complete the design and construction of the data warehouse and go live for the first time, we do the initial loading of the data into the data warehouse storage. The initial load moves large volumes of data, using up substantial amounts of time. As the data warehouse starts functioning, we continue to extract the incremental data revisions on an ongoing basis.
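The three staging functions can be sketched as a toy ETL pipeline in Python. The sources, field names, and the de-duplication rule below are invented purely for illustration:

```python
def extract(sources):
    # Extraction: pull raw rows from each (hypothetical) source, one after another.
    for source in sources:
        yield from source

def transform(rows):
    # Transformation: convert to a common format and de-duplicate on the name key.
    seen = set()
    for name, amount in rows:
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)
            yield (key, float(amount))

def load(rows, warehouse):
    # Loading: append the prepared rows into the warehouse storage.
    warehouse.extend(rows)

file_source = [("Alice ", "10"), ("BOB", "20")]   # e.g. data from a flat file
db_source = [("alice", "10"), ("carol", "30")]    # e.g. data from a relational source
warehouse = []
load(transform(extract([file_source, db_source])), warehouse)
print(warehouse)  # [('alice', 10.0), ('bob', 20.0), ('carol', 30.0)]
```

Note how the duplicate "Alice" record from the second source is dropped during transformation, before loading ever happens: cleaning belongs in the staging area, not in the warehouse.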

4) Discuss the Extraction Methods in Data Warehouses.


Ans:- The extraction methods in a data warehouse depend on the source system, performance, and business requirements. There are two types of extraction, logical and physical. We will see the logical and physical methods in detail.

Logical extraction

There are two types of logical extraction methods:

Full Extraction: Full extraction is used when the data needs to be extracted and loaded for the first time. In full extraction, the data from the source is extracted completely. This extraction reflects the current data available in the source system.

Incremental Extraction: In incremental extraction, the changes in source data need to be tracked since the last successful extraction. Only these changes in data will be extracted and then loaded. These changes can be detected from the source data which have the last changed timestamp. Also a change table can be created in the source system, which keeps track of the changes in the source data.

One more method to get the incremental changes is to extract the complete source data and then take the difference (a minus operation) between the current extraction and the last extraction. This approach, however, can cause performance problems on large tables.
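Both incremental approaches can be sketched in a few lines of Python. The timestamps, row shapes, and snapshot contents here are hypothetical:

```python
def incremental_extract(source_rows, last_run_ts):
    # Timestamp approach: keep only rows whose change timestamp is newer
    # than the last successful extraction run.
    return [row for row in source_rows if row["last_changed"] > last_run_ts]

source = [
    {"id": 1, "last_changed": "2013-11-01"},
    {"id": 2, "last_changed": "2013-11-20"},
    {"id": 3, "last_changed": "2013-11-25"},
]
changes = incremental_extract(source, "2013-11-15")
print([row["id"] for row in changes])  # [2, 3]

# The "minus" approach: re-extract everything and diff the two full snapshots.
previous_snapshot = {(1, "a"), (2, "b")}
current_snapshot = {(1, "a"), (2, "c"), (3, "d")}
delta = current_snapshot - previous_snapshot
print(sorted(delta))  # [(2, 'c'), (3, 'd')]
```

The timestamp filter touches each row once, while the minus approach requires two complete extractions plus a comparison, which is exactly the performance concern noted above.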

Physical extraction

The data can be extracted physically by two methods:

Online Extraction: In online extraction the data is extracted directly from the source system. The extraction process connects to the source system and extracts the source data.

Offline Extraction: The data from the source system is dumped outside of the source system into a flat file. This flat file is used to extract the data. The flat file can be created by a routine process daily.
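Offline extraction can be sketched with Python's standard csv module. An in-memory buffer stands in for the flat file, and the column layout is assumed for illustration:

```python
import csv
import io

# The "source system" dumps its rows into a flat file; here an in-memory
# buffer stands in for a file written by, say, a nightly routine process.
dump = io.StringIO()
csv.writer(dump).writerows([("id", "name"), ("1", "Alice"), ("2", "Bob")])

# The extraction process later reads only the flat file, never connecting
# to the source system itself -- the defining trait of offline extraction.
dump.seek(0)
rows = list(csv.DictReader(dump))
print(rows[0]["name"])  # Alice
```

Because the extractor only sees the file, the load on the live source system is limited to the dump step.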

5) Define the process of Data Profiling, Data Cleansing and Data Enrichment.

Ans:- Data Profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

1. Find out whether existing data can easily be used for other purposes.
2. Give metrics on data quality, including whether the data conforms to company standards.
3. Assess the risk involved in integrating data for new applications, including the challenges of joins.
4. Track data quality.
5. Assess whether metadata accurately describes the actual values in the source database.
6. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns.
7. Have an enterprise view of all data, for uses such as Master Data Management, where key data is needed, or data governance for improving data quality.

Data Cleansing (or Data Scrubbing) is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Data cleansing involves the following tasks:

1. Converting data fields to a common format
2. Correcting errors
3. Eliminating inconsistencies
4. Matching records to eliminate duplicates
5. Filling missing values, etc.

After cleansing, a data set will be consistent with other similar data sets in the system.
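A minimal sketch of both processes in Python follows. The records and column name are invented, and real profiling and cleansing tools do far more than this:

```python
def profile(rows, column):
    # Profiling: collect simple statistics about one column's values.
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }

records = [{"city": "Manipal"}, {"city": "manipal "}, {"city": ""}, {"city": "Udupi"}]
print(profile(records, "city"))  # {'count': 4, 'nulls': 1, 'distinct': 3}

def cleanse(rows, column, default="UNKNOWN"):
    # Cleansing: common format, fill missing values, eliminate duplicates.
    seen, out = set(), []
    for row in rows:
        value = (row.get(column) or default).strip().title()
        if value not in seen:
            seen.add(value)
            out.append({column: value})
    return out

print(cleanse(records, "city"))
# [{'city': 'Manipal'}, {'city': 'Unknown'}, {'city': 'Udupi'}]
```

The profile run is what tells us cleansing is needed at all: one null and a near-duplicate spelling of "Manipal" show up in the statistics before any record is changed.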

Data Enrichment is the process of adding values to your data. In some cases, external data providers sell data, which may be used to augment existing data. In other cases, data from multiple internal sources are simply integrated to get the “big” picture. In any event, the intended result is a data asset that has been increased in value to the user community.
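A toy enrichment step might merge purchased attributes into internal customer rows. All identifiers and attribute names here are hypothetical:

```python
internal = [{"cust_id": 1, "name": "Alice"}, {"cust_id": 2, "name": "Bob"}]

# Attributes bought from an external data provider, keyed by customer id.
external = {1: {"industry": "Retail"}, 2: {"industry": "Banking"}}

# Enrichment: merge the external attributes into each internal row,
# leaving rows without a match unchanged.
enriched = [dict(row, **external.get(row["cust_id"], {})) for row in internal]
print(enriched[0])  # {'cust_id': 1, 'name': 'Alice', 'industry': 'Retail'}
```

The internal rows still answer the same questions as before; the merge simply adds value by attaching attributes the organization did not collect itself.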

6) What is Metadata Management? Explain Integrated Metadata Management with a block diagram.

Ans:- The purpose of metadata management is to support the development and administration of the data warehouse infrastructure as well as the analysis of the data over time. Metadata is widely considered a promising driver for improving the effectiveness and efficiency of data warehouse usage, development, maintenance and administration. Data warehouse usage can be improved because metadata provides end users with the additional semantics necessary to reconstruct the business context of data stored in the data warehouse. Integrated metadata management supports all kinds of users who are involved in the data warehouse development process: end users, developers and administrators can all use the metadata. Developers and administrators mainly focus on technical metadata but make use of business metadata if they want. They need metadata to understand transformations of object data and the underlying data flows, as well as the technical and conceptual system architecture.


Fig. Central Repository of Metadata Management

Several metadata management systems are in existence. One such system/tool is the Integrated Metadata Repository System (IMRS). It is a metadata management tool used to support a corporate data management function and is intended to provide metadata management services. Thus, the IMRS will support the engineering and configuration management of data environments incorporating e-business transactions, complex databases, federated data environments, and data warehouses / data marts. The metadata contained in the IMRS is used to support applications development, data integration, and the system administration functions needed to achieve data element semantic consistency across a corporate data environment, and to implement integrated or shared data environments.

Metadata management has several sub-processes, like data warehouse development itself. Some of them are listed below:

1. Metadata definition
2. Metadata collection
3. Metadata control
4. Metadata publication to the right people at the right time
5. Determining what kind of data is to be captured
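A drastically simplified sketch of the collection and publication sub-processes is shown below. The repository structure, table, and column names are invented for illustration and are not the actual IMRS design:

```python
repository = {}

def register(table, column, dtype, description):
    # Metadata collection: store technical metadata (the data type) alongside
    # business metadata (a human-readable description) in one central place.
    repository[(table, column)] = {"type": dtype, "description": description}

register("sales_fact", "total", "REAL", "Total sales amount per branch and year")

# Metadata publication: any user or tool can now look a column up and
# recover both its technical definition and its business meaning.
entry = repository[("sales_fact", "total")]
print(entry["type"], "-", entry["description"])
```

Even in this toy form, the central repository gives developers the technical view and end users the business context from the same registration, which is the point of integrating the two kinds of metadata.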
