Data Warehousing Fundamentals

Transcript of Data Warehousing Fundamentals

  • Data Warehousing

    Company Confidential

  • Data Warehousing

    Introduction, Terminology
    Necessity / Why Data Warehouse?
    Characteristics of a Data Warehouse
    Building phases
    Security
    Tools
    The Impact of Web
    Benefits

  • Data Warehousing - A Trend

    Data Warehousing is taking the industry by storm, and is now poised to transform it.
    For too long, enterprises have been data rich and information poor - technologically condemned to be informational mazes.
    Most organizations remain structurally incapable of providing useful business intelligence to management.
    Now this era is ending.

  • Data Warehousing - Introduction

    Warehouse of Data
    Not a product to buy off the shelf
    A set of Software & Hardware
    Set of data managed after & outside SAP

  • Data Warehousing - Introduction: Data Warehouse Vs Operational Systems

    Data Warehouse Data           Operational System Data
    Long time frame               Short time frame
    Static                        Rapid changes
    Data is usually summarized    Record-level access
    Ad hoc query access           SAP standard transactions
    Updated periodically          Updated in real time
    Data driven                   Event driven

  • Data Warehousing - Introduction: Components of Data Warehouse

    Fact Data
    Dimension Data
    Aggregate Data
    Meta Data
    Staging Area
    Data Mart

  • Data Warehousing - Terminology

    Granularity
    OLAP / ROLAP / MOLAP / HOLAP
    Drill Down / Roll Up
    Slice & Dice

  • Data Warehousing - Terminology: Granularity

    Granularity (or grain) defines the level of detail stored in the physical warehouse. In the terminology used here, low granularity indicates a lot of detail, while high granularity indicates less detail.

    Example: A commercial airline is building a data warehouse. What will the granularity be?
    Choice A: Each record represents a flight (high granularity, less detail)
    Choice B: Each record represents a customer on a flight (low granularity, more detail)

    The granularity of data affects:
    Volumes of data, data maintenance, indexing
    Level of data exploration
    Query and reporting constraints
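
    As a rough illustration of the airline example, the sketch below keeps passenger-level records (Choice B) and summarizes them to one record per flight (Choice A). The field names and sample values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Choice B: one record per customer on a flight (more detail).
# Field names and values are hypothetical sample data.
passenger_records = [
    {"flight": "F100", "customer": "C1", "fare": 200.0},
    {"flight": "F100", "customer": "C2", "fare": 250.0},
    {"flight": "F200", "customer": "C1", "fare": 180.0},
]


def summarize_to_flight_grain(records):
    """Collapse passenger-level records into one record per flight (Choice A)."""
    totals = defaultdict(lambda: {"passengers": 0, "revenue": 0.0})
    for rec in records:
        totals[rec["flight"]]["passengers"] += 1
        totals[rec["flight"]]["revenue"] += rec["fare"]
    # One output record per flight, sorted by flight id for a stable order.
    return [{"flight": flight, **t} for flight, t in sorted(totals.items())]


flight_records = summarize_to_flight_grain(passenger_records)
```

    The flight-level records are smaller to store and index, but they can no longer answer a question such as "which customers flew on F100?". That trade-off is the query and reporting constraint that the choice of grain imposes.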

  • Data Warehousing - Terminology: OLAP - Online Analytical Processing

    MOLAP - Multi-dimensional Online Analytical Processing: The data from the data warehouse is queried and dumped periodically onto a server on the local network, into a data store called a Multi-dimensional Database (MDDB) provided by the OLAP tool. This MDDB forms a Data Mart, which is then used for querying and reporting.

    ROLAP - Relational Online Analytical Processing: Refers to the ability to conduct OLAP analysis directly against a relational warehouse without any constraints on the number of dimensions, database size, analytical complexity, or number and type of users.

    HOLAP - Hybrid Online Analytical Processing: An environment with a combination of MOLAP and ROLAP data storage. Summarized information is typically stored in an MDDB and detailed data is stored in a relational environment.

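  • Data Warehousing - Terminology: Drill Down / Roll Up

    Hierarchy: Region, State, District, Location (from most summarized to most detailed)

    Drill down is an analytical technique whereby the user navigates from the most summarized level to the most detailed level. Roll up is the reverse: the user navigates from detailed levels back up to summarized levels.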
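
    A minimal sketch of drilling down and rolling up along the Region, State, District, Location hierarchy follows. The sales rows, member names, and the `totals_at` helper are hypothetical and exist only to show the navigation.

```python
from collections import defaultdict

# Hierarchy from most summarized to most detailed.
HIERARCHY = ("region", "state", "district", "location")

# Hypothetical sales rows at the most detailed (location) level.
sales = [
    {"region": "West", "state": "S1", "district": "D1", "location": "L1", "amount": 100},
    {"region": "West", "state": "S1", "district": "D1", "location": "L2", "amount": 50},
    {"region": "West", "state": "S1", "district": "D2", "location": "L3", "amount": 70},
    {"region": "West", "state": "S2", "district": "D3", "location": "L4", "amount": 30},
    {"region": "East", "state": "S3", "district": "D4", "location": "L5", "amount": 40},
]


def totals_at(level, rows, **parents):
    """Sum amounts at one hierarchy level, optionally restricted to parent members."""
    if level not in HIERARCHY:
        raise ValueError(f"unknown level: {level}")
    out = defaultdict(int)
    for row in rows:
        if all(row[key] == value for key, value in parents.items()):
            out[row[level]] += row["amount"]
    return dict(out)


by_region = totals_at("region", sales)                    # most summarized view
west_states = totals_at("state", sales, region="West")    # drill down into West
s1_districts = totals_at("district", sales, state="S1")   # drill down into S1
```

    Rolling up is simply the same call at a higher level: after inspecting `s1_districts`, calling `totals_at("region", sales)` again returns to the summarized view.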

  • Data Warehousing - Terminology: Rotation or Dicing, Slicing
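
    The slide's diagrams did not survive in this transcript, so the sketch below rests on an assumed reading: slicing fixes one dimension at a single value, and rotation (which the slide equates with dicing) swaps which dimensions are shown as rows and columns. The cube contents and function names are hypothetical.

```python
# A tiny hypothetical cube: (product, region, quarter) -> sales amount.
cube = {
    ("Tea", "North", "Q1"): 10, ("Tea", "North", "Q2"): 12,
    ("Tea", "South", "Q1"): 7, ("Tea", "South", "Q2"): 9,
    ("Coffee", "North", "Q1"): 20, ("Coffee", "North", "Q2"): 18,
    ("Coffee", "South", "Q1"): 15, ("Coffee", "South", "Q2"): 11,
}


def slice_quarter(cube, quarter):
    """Slice: fix the quarter dimension, leaving a product x region table."""
    return {(p, r): v for (p, r, q), v in cube.items() if q == quarter}


def rotate(table):
    """Rotate: swap the row and column dimensions of a two-dimensional table."""
    return {(col, row): v for (row, col), v in table.items()}


q1 = slice_quarter(cube, "Q1")   # product x region view for Q1
q1_by_region = rotate(q1)        # the same numbers as a region x product view
```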

  • Data Warehousing - Why Warehouse?

    The ability to store Historical Data
    Consistent Data
    Access to Corporate & Organizational Data
    A means of Slicing & Dicing the Data
    A means to query, analyse & present information
    A place to publish used Data
    High Returns on Investment

  • Data Warehousing - Characteristics

    Separate
    Available
    Accessible
    Subject Oriented
    Integrated
    Time Variant
    Non Volatile

  • Data Warehousing - Building phases

    Designing
    Extraction
    Cleaning
    Transformation
    Loading
    Querying

  • Data Warehousing - Designing

    Top Down Approach: setting up the enterprise wide architecture first & then going for individual data marts; very difficult, time consuming, expensive
    Bottom Up Approach: start with highly focused data marts & then combine them for enterprise wide requirements
    Hybrid Approach: start with a data mart having focus on enterprise wide scope

  • Data Warehousing - Designing: Logical Data model Vs Physical Data model

    Logical Data model               Physical Data model
    Uses business names              Names limited by DBMS
    Business experts drive it        DBAs drive it
    Includes entities, attributes    Includes tables, columns, keys,
    & relationships                  database triggers, indexes etc.

  • Data Warehousing - Designing Star Schema

    Model with Central Fact table surrounded by many Dimension tables.

    Central Fact table is Long & Narrow having many rows & few Columns.

    Dimension tables are Short & Wide having few rows & many columns.

  • Data Warehousing - Designing: Sales Star Schema

    Product_sales (central fact table): time_key (FK), Product_key, Customer_key, Amount, Quantity, Tax

    Product (dimension table): Product_key, Operational_id, Product_name, UPC_code, Product_class, Color, Flavour, Product_size

    Time (dimension table): time_key, calender_date, year, month, day, fiscal_year, quarter, day_of_week

    Customer (dimension table): Customer_key, Name, Street_address, City, State_province, Country, Customer_type
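
    The sales star schema can be tried out with a small sketch like the one below, which uses Python's built-in sqlite3 module. It creates only a subset of the columns, names the time dimension `time_dim` (an assumption, to avoid confusion with SQL's time functions), and loads a few made-up rows; none of the sample data comes from the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: few rows, descriptive columns (subset of the slide's columns).
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, product_name TEXT, product_class TEXT);
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);

-- Central fact table: many rows, keys to each dimension plus numeric measures.
CREATE TABLE product_sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    product_key  INTEGER REFERENCES product(product_key),
    customer_key INTEGER REFERENCES customer(customer_key),
    amount REAL, quantity INTEGER, tax REAL
);

-- Hypothetical sample rows.
INSERT INTO time_dim VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO product  VALUES (1, 'Green tea', 'Beverage'), (2, 'Biscuits', 'Snack');
INSERT INTO customer VALUES (1, 'Asha', 'IN'), (2, 'Ben', 'UK');
INSERT INTO product_sales VALUES
    (1, 1, 1, 100.0, 10, 5.0),
    (1, 2, 2,  40.0,  4, 2.0),
    (2, 1, 2,  60.0,  6, 3.0);
""")

# A typical star join: the fact table joined to two dimensions, then summarized.
rows = conn.execute("""
    SELECT p.product_class, t.quarter, SUM(s.amount)
    FROM product_sales AS s
    JOIN product  AS p ON p.product_key = s.product_key
    JOIN time_dim AS t ON t.time_key = s.time_key
    GROUP BY p.product_class, t.quarter
    ORDER BY p.product_class, t.quarter
""").fetchall()
```

    Every query follows the same shape: start from the fact table, join one hop out to whichever dimensions the question needs, and aggregate the measures.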

  • Data Warehousing - Designing: Snow Flake Schema

    Normalized dimension tables.

    Each smaller dimension table is joined to the fact table & to a descriptive dimension table.

  • Data Warehousing - Designing: Snow Flake Schema

    [Diagram: snow flake schema. Tables shown: SALES, CUSTOMER, PRODUCT, TIME, STORE, REGION, and a SUMMARY table with REGION, PRODUCT, CUSTOMER, TIME.]
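
    To illustrate the normalization that distinguishes a snow flake schema, the sketch below moves region attributes out of the store dimension into their own table, so a query by region needs two joins instead of one. The relationships in the original diagram are not recoverable from the transcript; the STORE to REGION link, the column names, and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized dimensions: store points to region instead of repeating region attributes.
CREATE TABLE region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE store (
    store_key  INTEGER PRIMARY KEY,
    store_name TEXT,
    region_key INTEGER REFERENCES region(region_key)
);

-- Fact table keyed to the store dimension only.
CREATE TABLE sales (store_key INTEGER REFERENCES store(store_key), amount REAL);

-- Hypothetical sample rows.
INSERT INTO region VALUES (1, 'North'), (2, 'South');
INSERT INTO store  VALUES (10, 'Store A', 1), (11, 'Store B', 1), (12, 'Store C', 2);
INSERT INTO sales  VALUES (10, 100.0), (11, 50.0), (12, 70.0), (10, 30.0);
""")

# Reaching region requires going through store: fact -> store -> region.
rows = conn.execute("""
    SELECT r.region_name, SUM(s.amount)
    FROM sales AS s
    JOIN store  AS st ON st.store_key = s.store_key
    JOIN region AS r  ON r.region_key = st.region_key
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
```

    Compared with a star schema, the region name is stored once rather than on every store row, at the cost of an extra join per query.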

  • Data Warehousing - Security

    Identify Data
    Classify Data
    Quantify the value of Data
    Identify Data security vulnerabilities
    Identify Data protection measures
    Select cost effective security measures
    Evaluate effectiveness of security measures

  • Data Warehousing - Tools

    OLAP - Business Objects, Cognos
    Query & Reporting - BO Reports, Cognos, Crystal Reports
    Extraction - Informatica, D2k, Ardent, Prism
    Data modelling - ERWIN, Power Designer
    Meta Data - Platinum, Prism
    Cleansing - Vality, Trillium, I.d. Centric
    Mining - SAS Enterprise Miner, Data Mind

  • Data Warehousing - Impact of Web

    Consistency. Everyone in the organization can draw upon a common pool of data and see reports that reflect their needs.

    Accessibility. Accessing the data warehouse through a common pathway, the Web browser, simplifies the process of finding information.

    Availability. Access to information is available to anyone at any time, even if the database administrator is not available. The data warehouse is independent of operational activities and can be accessed via the Web whenever necessary.

  • Data Warehousing - Impact of Web

    Low development costs. Software provides a standard framework for developing Web-enabled applications.

    Low maintenance costs. Less time is spent maintaining client-side, typically PC-based, applications software, and support can be focused on ensuring that information in the data warehouse brings competitive advantages.

    Time savings. If information consumers are directed to reports, then information providers, typically qualified specialists whose time is expensive, spend less time answering the same questions again and again.

  • Data Warehousing - Impact of Web

    Improved business communications. By providing Web-enabled software-based corporate information to customers, business partners, and the public, you can improve your business performance, reinforce brand loyalties and increase your organization's exposure.

    Data protection. Keeping the importance of standard Web security technology in mind, you can build and deploy secure applications for your organization.

  • Data Warehousing - Impact of Web

    Low marginal cost / scalable solutions. The Rapid Warehousing Methodology -- think big, start small -- demonstrates that to maximize return on investment, it is best to develop for a small number of people first, and then extend the solution to larger groups. With browser technology, the cost of doubling or tripling the user community is negligible.

    Low training costs. Web browsers are intuitive and easy to use.

  • Data Warehousing - Benefits

    Increase customer profitability
    Cost effective decision making
    Manage customer and business partner relationships
    Manage risk, assets and liabilities
    Integrate inventory, operations and manufacturing
    Link multiple locations and geographies
    Identify developing trends and reduce time to market
    Facilitate process change
    Improve quality assurance programs
    Production & Performance awareness

    What we are going to cover in this presentation.

    A data warehouse is not just data in a warehouse, but also the architecture & tools to collect, query, analyse & present information. It is not a product to buy off the shelf; it is a set of hardware & software. It is a set of data created & managed outside operational systems.

    Components of a Data Warehouse are the data & processes that make information available to users.
    Fact Data - numerical measurements of the business. They are the transactions happening every day in the business, e.g. sales figures, costs, stock movements, orders, invoices etc.
    Dimension Data - data that remains constant over time, e.g. product, customer, location etc.
    Aggregate Data - data summarized over different dimensions, e.g. regional level, product/customer group level, year/quarter/month/week aggregation.
    Meta Data - essential additional information about the data in the warehouse.
    Staging Area - a set of database tables used to receive the information from the operational data sources. Often we get flat files containing the data. The staging area provides a simple environment from which we can create the data transforms and load the data into the data warehouse.
    Data Marts - a data mart is a collection of subject areas organized for decision support based on the needs of a given department. Finance has its data mart, sales has its own, marketing has its own, and so on. Typically the database design for a data mart is built around a star-join structure that is optimal for the needs of the users in the department. The data mart contains only a modicum of historical information and is granular only to the point that suits the needs of the department.

    Ability to store historical data - large amounts of historical data can be stored.
    Consistent data - all persons accessing the data get the same results for the same query.
    Access to corporate / organizational data - managers and analysts can get all the data, all the time, from the desktop.
    Slicing / Dicing - examine the data through different views, with every possible measure.
    Query, analyse & present information - the visible aspect of a DWH is its analysis & presentation tools.
    Publish used data - data is extracted from various sources, consolidated, cleaned, transformed, quality assured & published in a single place for use in decision support systems (DSS).

    Separate - the DWH is physically separate from operational systems.
    Available - entirely to the business/IS community for analysis.
    Accessible - to users who have limited knowledge of computers.
    Subject Oriented - mainly revolving around subjects like suppliers, vendors, products, locations etc.
    Integrated - data from the different sources is always integrated.
    Time Variant - each data item is accurate as of some moment in time; once recorded, it cannot be changed.
    Non Volatile - updates do not occur as in OLTP, but are done in batch mode.

    Designing: One of the key challenges in designing a warehouse is to structure a solution that will be effective for a reasonable period of time. The multidimensional database employs the Star/Snowflake Schema design, a physical database structure that stores factual data in the center, surrounded by reference or dimension data.

    Extracting: This process consists of identifying the type of information that the data warehouse should contain and the current location of such information (in the existing operational databases, external databases, manual records, etc.), and retrieving the relevant information from the identified sources. Data should be in a consistent state when it is extracted from the source systems. If data is being extracted from multiple data sources, they should all represent the same snapshot in time.

    Cleaning: The data extracted from the various sources must be made consistent within itself and consistent with other data. This could mean standardizing the date formats, identifying and deleting junk values, ensuring that the values in the fields conform to the data types, etc.

    Transforming: Rarely can source data be loaded directly into a warehouse without some degree of translation and transformation. The data that has been cleaned needs to be transformed into a form consistent with the design of the data warehouse database. Typical types of transformations are:
    combining several values into a single field;
    splitting a value into several fields;
    translating values (converting 1 to YES and 0 to NO);
    loading a single row into several tables;
    adding a time dimension to the data.

    Loading: The cleaned, scrubbed and transformed data is loaded into the data warehouse using loading tools. Data loading is normally done through bulk loads, which use a number of processors in parallel to improve efficiency. A number of tools are available for extraction, cleaning, transformation and loading of data warehouses.

    Querying and Analyzing: Decision makers and analysts need intelligent query tools to generate ad hoc queries without having to understand how to code efficient SQL or how the data is physically stored. An intelligent query generator should offer the following functionality:
    a conceptual layer to present data in a user-friendly language (i.e. hide the physical structure of the data)
    a point and click mechanism for building up a query
    the ability to construct efficient SQL
    access control facilities
    a facility to submit queries to run in batch mode or real time
    the ability to store query results in user tables or extracts
    facilities to manage user extracts (i.e. viewing and deleting these tables)
    tracking the types of queries submitted
    scheduling and prioritizing queries & calculating the cost of a query in terms of resources or time

    Identify Data - a complete inventory of all data available to end users needs to be documented & retained for the next phase.
    Classify - on the basis of sensitivity to disclosure & destruction. Least sensitive is public information, then confidential & lastly top secret.
    Quantify the value - cost to reconstruct the data, restore the integrity of corrupted data etc.
    Security vulnerabilities - in-built DBMS security, limitations, human factors, insider/outsider threats, natural factors.
    Protection measures - classify user access, integrity controls, data encryption etc.
    Cost effectiveness - ensure that the actual cost of protecting data does not exceed the maximum dollar loss of the data.
    Evaluation - should be done on a continual basis to make sure that measures are kept up to date & within company guidelines & industry standards.
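
    The cleaning and transforming steps described in these notes can be sketched in a few lines of Python. The source field names (`full_name`, `street`, `city`, `active_flag`, `order_date`), the list of accepted date formats, and the `load_date` column are assumptions made for illustration; a real warehouse would take them from its own source systems and design.

```python
from datetime import datetime

# Date formats assumed to occur in the source systems (hypothetical).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def clean_date(text):
    """Cleaning: standardize a date string to ISO format; junk values become None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(row, load_date):
    """Transforming: reshape one cleaned source row into the warehouse layout."""
    first, _, last = row["full_name"].partition(" ")  # split a value into several fields
    return {
        "first_name": first,
        "last_name": last,
        "address": f'{row["street"]}, {row["city"]}',            # combine several values
        "active": "YES" if row["active_flag"] == 1 else "NO",   # translate 1/0 to YES/NO
        "order_date": clean_date(row["order_date"]),
        "load_date": load_date,                                  # add a time dimension
    }


source_row = {
    "full_name": "Asha Rao", "street": "1 Main St", "city": "Pune",
    "active_flag": 1, "order_date": "25/12/2020",
}
warehouse_row = transform(source_row, load_date="2021-01-01")
```

    Keeping the cleaning function separate from the transformation makes it possible to count or reject junk values in the staging area before anything is loaded.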