datawarehousing chap01


  • 8/7/2019 datawarehousing chap01

    1/27

    Data Warehousing


    Outline
    - What is data warehousing
    - The benefits of data warehousing
    - Differences between OLTP and data warehousing
    - The architecture of a data warehouse
    - The main components
    - Data flows
    - Tools and technologies
    - Integration
    - The importance of managing meta-data
    - Data marts


    What is data warehousing?
    - A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
    - Data warehousing is data management and data analysis.
    - A data webhouse is a distributed data warehouse that is implemented over the web with no central data repository.
    - Goal: to integrate enterprise-wide corporate data into a single repository from which users can easily run queries.


    The benefits of data warehousing
    The potential benefits of data warehousing are:
    - high returns on investment
    - substantial competitive advantage
    - increased productivity of corporate decision-makers


    The difference between OLTP and data warehousing
    - A DBMS built for online transaction processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind.
    - Example: OLTP systems are designed to maximize transaction processing capacity, while data warehouses are designed to support ad hoc query processing.


    Comparison of OLTP systems and data warehousing systems

    OLTP systems                                      | Data warehousing systems
    --------------------------------------------------+-----------------------------------------------------
    Holds current data                                | Holds historical data
    Stores detailed data                              | Stores detailed, lightly, and highly summarized data
    Data is dynamic                                   | Data is largely static
    Repetitive processing                             | Ad hoc, unstructured, and heuristic processing
    High level of transaction throughput              | Medium to low level of transaction throughput
    Predictable pattern of usage                      | Unpredictable pattern of usage
    Transaction-driven                                | Analysis-driven
    Application-oriented                              | Subject-oriented
    Supports day-to-day decisions                     | Supports strategic decisions
    Serves large number of clerical/operational users | Serves relatively low number of managerial users
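
    The contrast in the comparison can be made concrete with a small sketch. It uses Python's built-in sqlite3 module; the sales table, its columns, and its rows are invented for illustration and are not part of the slides. The first statement is a short, predictable transaction of the kind an OLTP system runs all day; the second is an ad hoc summary query of the kind a warehouse is designed for.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [("north", 2018, 100.0), ("north", 2019, 150.0), ("south", 2019, 80.0)],
)

# OLTP-style work: a short transaction that touches one current row.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 20.0 WHERE id = ?", (3,))

# Warehouse-style work: an ad hoc query that scans and summarizes history.
totals_by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
print(totals_by_region)
```

    In a real deployment the two workloads would normally run on separate systems, which is the point of the comparison; a single in-memory database is used here only to keep the sketch self-contained.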


    Problems
    - Underestimation of resources for data loading
    - Hidden problems with source systems
    - Required data not captured
    - Increased end-user demands
    - Data homogenization
    - High demand for resources
    - Data ownership
    - High maintenance
    - Long-duration projects
    - Complexity of integration


    [Figure: Typical architecture of a data warehouse. Labeled elements: operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS holding meta-data, detailed data, lightly summarized data, and highly summarized data; archive/backup data; end-user access tools (reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools; data mining tools).]


    The main components
    - Operational data sources: data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available DBs, or DBs associated with an organization's suppliers or customers.
    - Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.


    The main components
    - Load manager: also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse.
    - Warehouse manager: performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
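
    As a rough illustration of how these two components divide the work, the sketch below uses Python's built-in sqlite3 module. The table names (detailed_sales, monthly_sales, yearly_sales), columns, and rows are hypothetical. The load step applies a simple transformation before inserting rows; the warehouse-manager step creates an index and generates lightly and highly summarized aggregations from the detailed data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detailed_sales (sale_date TEXT, store TEXT, amount REAL)")

# Load manager: a simple transformation (trim and upper-case the store code,
# convert the amount to a number) before the rows enter the warehouse.
source_rows = [
    ("2019-01-05", " s1 ", "10.5"),
    ("2019-01-20", "S1", "4.5"),
    ("2019-02-02", "s2", "7.0"),
]
prepared = [(day, store.strip().upper(), float(amount))
            for day, store, amount in source_rows]
conn.executemany("INSERT INTO detailed_sales VALUES (?, ?, ?)", prepared)

# Warehouse manager: create an index and generate aggregations.
conn.execute("CREATE INDEX idx_sales_store ON detailed_sales (store)")
conn.execute(
    "CREATE TABLE monthly_sales AS "
    "SELECT substr(sale_date, 1, 7) AS month, store, SUM(amount) AS total "
    "FROM detailed_sales GROUP BY month, store"
)
conn.execute(
    "CREATE TABLE yearly_sales AS "
    "SELECT substr(month, 1, 4) AS year, SUM(total) AS total "
    "FROM monthly_sales GROUP BY year"
)

monthly = conn.execute(
    "SELECT month, store, total FROM monthly_sales ORDER BY month, store"
).fetchall()
yearly = conn.execute("SELECT year, total FROM yearly_sales").fetchall()
print(monthly, yearly)
```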


    The main components
    - Query manager: also called the back-end component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries.
    - Detailed, lightly and highly summarized data, and archive/backup data
    - Meta-data
    - End-user access tools: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
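
    The query manager's two duties named above, directing queries to the appropriate tables and scheduling their execution, could be sketched as follows. The table names, the notion of a "grain", and the priority numbers are assumptions made for this example; a real query manager would consult the warehouse meta-data instead of a hard-coded dictionary.

```python
import heapq

# Hypothetical catalog: which table answers queries at which level of detail.
TABLES_BY_GRAIN = {
    "day": "detailed_sales",
    "month": "lightly_summarized_sales",
    "year": "highly_summarized_sales",
}

def route_query(grain: str) -> str:
    """Direct a request to the table that matches the requested grain."""
    try:
        table = TABLES_BY_GRAIN[grain]
    except KeyError:
        raise ValueError(f"unsupported grain: {grain!r}") from None
    return f"SELECT * FROM {table}"

def schedule(requests):
    """Order (priority, grain) requests: lower numbers run first,
    and requests with equal priority keep their arrival order."""
    heap = [(priority, seq, grain)
            for seq, (priority, grain) in enumerate(requests)]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, _, grain = heapq.heappop(heap)
        ordered.append(route_query(grain))
    return ordered

plan = schedule([(2, "day"), (1, "year"), (2, "month")])
print(plan)
```

    Routing a yearly report to the highly summarized table avoids scanning the detailed data, which is the reason the summarized levels exist.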


    Data flows
    - Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
    - Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
    - Downflow: the processes associated with archiving and backing up the data in the warehouse.
    - Outflow: the processes associated with making the data available to the end-users.
    - Meta-flow: the processes associated with the management of the meta-data.
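
    Of the five flows, downflow is perhaps the easiest to show in a few lines. The sketch below, using Python's built-in sqlite3 module, moves rows older than a cutoff date from a detailed table into an archive table inside one transaction. The table names, the cutoff, and the rows are hypothetical, and moving rows between tables is only one possible way to archive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detailed_sales (sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE archived_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO detailed_sales VALUES (?, ?)",
    [("2015-03-01", 5.0), ("2018-11-30", 9.0), ("2019-06-15", 12.0)],
)

def archive_before(conn, cutoff):
    """Downflow: move rows older than `cutoff` (ISO date text) to the archive.
    Both statements run in one transaction, so a failure moves nothing."""
    with conn:
        conn.execute(
            "INSERT INTO archived_sales "
            "SELECT * FROM detailed_sales WHERE sale_date < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM detailed_sales WHERE sale_date < ?", (cutoff,)
        ).rowcount
    return deleted

moved = archive_before(conn, "2019-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM detailed_sales").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM archived_sales").fetchone()[0]
print(moved, remaining, archived)
```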


    [Figure: Information flows of a data warehouse. The same labeled elements as the architecture figure (operational data sources 1 to n, operational data store (ODS), load manager, warehouse manager, query manager, DBMS with meta-data and detailed, lightly summarized, and highly summarized data, archive/backup data, end-user access tools), annotated with the five flows: inflow, upflow, downflow, outflow, and meta-flow.]


    Tools and technologies
    - The critical steps in the construction of a data warehouse:
      a. Extraction
      b. Cleansing
      c. Transformation
    - After the critical steps, loading the results into the target system can be carried out either by separate products or by a single product. Categories of such products:
      - code generators
      - database data replication tools
      - dynamic transformation engines
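
    The three critical steps can be sketched as three small functions chained together. This is a minimal illustration in standard-library Python, not a description of any of the product categories listed above; the CSV layout, field names, and cleansing rules are all invented for the example.

```python
import csv
import io

# Hypothetical source extract; the second data row has a missing amount.
SOURCE_CSV = "customer,country,amount\n Alice ,uk,10.50\nBob,UK,\ncarol,de,7\n"

def extract(text):
    """Extraction: read raw records from the source system's export."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleansing: trim whitespace and drop records with no amount."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["amount"]:
            cleaned.append(row)
    return cleaned

def transform(rows):
    """Transformation: convert to the warehouse's types and conventions."""
    return [
        (row["customer"].title(), row["country"].upper(), float(row["amount"]))
        for row in rows
    ]

ready_to_load = transform(cleanse(extract(SOURCE_CSV)))
print(ready_to_load)
```

    Keeping the steps as separate functions mirrors the option of using separate products for each step; a single integrated tool would hide the same stages behind one interface.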


    Data warehouse DBMS (integration)
    - Due to the maturity of such products, most relational databases will integrate predictably with other types of software.
    - The requirements for a data warehouse RDBMS:
      - Load performance
      - Load processing
      - Data quality management
      - Query performance
      - Terabyte scalability
      - Mass user scalability
      - Networked data warehouse
      - Warehouse administration
      - Integrated dimensional analysis
      - Advanced query functionality


    The importance of managing meta-data (integration)
    - The integration of meta-data, that is, data about data.
    - Meta-data is used for a variety of purposes, and its management is a critical issue in achieving a fully integrated data warehouse.
    - The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse.
    - The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data.
    - The meta-data associated with data management describes the data as it is stored in the warehouse.
    - Meta-data is required by the query manager to generate appropriate queries, and is also associated with the users of queries.


    - The major integration issue is how to synchronize the various types of meta-data used throughout the data warehouse. The challenge is to synchronize meta-data between different products from different vendors using different meta-data stores.
    - There are two major standards for meta-data and modeling in the areas of data warehousing and component-based development: MDC (Meta Data Coalition) and OMG (Object Management Group).
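
    The "pathway back to where the data began" can be pictured as a small lineage record kept for each warehouse item. The sketch below is one possible shape for such a record in Python; the class, its fields, and the example names are hypothetical and do not follow the MDC or OMG standards mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class Lineage:
    """Meta-data for one warehouse item: its source and the changes applied."""
    warehouse_item: str
    source: str
    changes: list = field(default_factory=list)

    def record(self, description):
        """Append one transformation step and return self for chaining."""
        self.changes.append(description)
        return self

    def pathway(self):
        """Pathway back to the source, most recent change first."""
        return [*reversed(self.changes), f"source: {self.source}"]

# Hypothetical item and source names, for illustration only.
lineage = Lineage("monthly_sales.total", "orders_db.order_lines.price")
lineage.record("converted currency to USD").record("summed by month")
print(lineage.pathway())
```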


    Administration and management tools
    - A data warehouse requires tools to support the administration and management of such a complex environment.
    - Given the various types of meta-data and the day-to-day operations of the data warehouse, the administration and management tools must be capable of supporting the following tasks:
      - monitoring data loading from multiple sources
      - data quality and integrity checks
      - managing and updating meta-data
      - monitoring database performance to ensure efficient query response times and resource utilization


      - auditing data warehouse usage to provide user chargeback information
      - replicating, subsetting, and distributing data
      - maintaining efficient data storage management
      - purging data
      - archiving and backing up data
      - implementing recovery following failure
      - security management
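
    One of the listed tasks, data quality and integrity checks, lends itself to a short sketch. The example below uses Python's built-in sqlite3 module and expresses each check as a query that counts offending rows. The tables, the three checks, and the sample rows are assumptions for illustration; commercial administration tools will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (store_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE sales (sale_id INTEGER, store_id TEXT, amount REAL)")
conn.executemany("INSERT INTO stores VALUES (?)", [("S1",), ("S2",)])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "S1", 10.0), (2, "S9", 5.0), (3, "S2", None), (3, "S2", 8.0)],
)

# Each hypothetical check is a query that counts the rows violating a rule.
CHECKS = {
    "missing_amount": "SELECT COUNT(*) FROM sales WHERE amount IS NULL",
    "unknown_store": (
        "SELECT COUNT(*) FROM sales "
        "WHERE store_id NOT IN (SELECT store_id FROM stores)"
    ),
    "duplicate_sale_id": (
        "SELECT COUNT(*) FROM "
        "(SELECT sale_id FROM sales GROUP BY sale_id HAVING COUNT(*) > 1)"
    ),
}

def run_checks(conn):
    """Run every check and return {check name: number of violations}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}

report = run_checks(conn)
print(report)
```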


    Data mart
    - Data mart: a subset of a data warehouse that supports the requirements of a particular department or business function.
    - The characteristics that differentiate data marts and data warehouses include:
      - a data mart focuses on only the requirements of users associated with one department or business function


      - data marts do not normally contain detailed operational data, unlike data warehouses
      - as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated
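
    These characteristics can be seen in a small sketch that derives a mart from a warehouse table. It uses Python's built-in sqlite3 module; the warehouse_sales and toys_mart tables and their rows are hypothetical. The mart keeps one department's data only, and it stores monthly summaries instead of the detailed rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales "
    "(sale_date TEXT, department TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        ("2019-01-03", "toys", "kite", 12.0),
        ("2019-01-09", "toys", "kite", 8.0),
        ("2019-01-04", "garden", "hose", 30.0),
    ],
)

# The mart holds a single department's data, summarized by month and product.
conn.execute(
    "CREATE TABLE toys_mart AS "
    "SELECT substr(sale_date, 1, 7) AS month, product, SUM(amount) AS total "
    "FROM warehouse_sales WHERE department = 'toys' "
    "GROUP BY month, product"
)

mart_rows = conn.execute("SELECT month, product, total FROM toys_mart").fetchall()
print(mart_rows)
```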


    [Figure: Typical data warehouse and data mart architecture. Labeled elements: operational data sources 1 to n; operational data store (ODS); load manager; warehouse manager; query manager; DBMS holding meta-data, detailed data, lightly summarized data, and highly summarized data; archive/backup data; data mart with summarized data (relational database) and summarized data (multi-dimensional database); end-user access tools (reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools; data mining tools). Tier labels: first tier, second tier, third tier.]


    Reasons for creating a data mart
    - To give users access to the data they need to analyze most often.
    - To provide data in a form that matches the collective view of the data by a group of users in a department or business function.
    - To improve end-user response time due to the reduction in the volume of data to be accessed.
    - To provide appropriately structured data as dictated by the requirements of end-user access tools.
    - Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse.


    - The cost of implementing data marts is normally less than that required to establish a data warehouse.
    - The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project.


    Data mart issues
    - Data mart functionality: the capabilities of data marts have increased with the growth in their popularity.
    - Data mart size: performance deteriorates as data marts grow in size, so the size of data marts needs to be reduced to gain improvements in performance.
    - Data mart load performance: there are two critical components, end-user response time and data loading performance. To improve data loading performance, database updates can be made incremental, so that only the cells affected by a change are updated and not the entire MDDB (multi-dimensional database) structure.
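
    The incremental-update idea can be illustrated with a toy stand-in for an MDDB: a dictionary of cells keyed by (month, product). This is a sketch under that simplifying assumption, with invented data; a real multi-dimensional database has its own storage structures and update mechanisms.

```python
from collections import defaultdict

# A tiny stand-in for an MDDB: one cell per (month, product) combination.
cube = defaultdict(float)

def full_rebuild(facts):
    """Recompute every cell from all detailed facts."""
    cube.clear()
    for month, product, amount in facts:
        cube[(month, product)] += amount

def incremental_update(new_facts):
    """Update only the cells affected by the new facts and return them."""
    touched = set()
    for month, product, amount in new_facts:
        cube[(month, product)] += amount
        touched.add((month, product))
    return touched

full_rebuild([("2019-01", "kite", 20.0), ("2019-01", "hose", 30.0)])
touched = incremental_update([("2019-01", "kite", 5.0)])
print(touched, dict(cube))
```

    The incremental path does work proportional to the number of new facts, while the full rebuild does work proportional to all facts ever loaded, which is why the former helps load performance as the mart grows.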


    - Users' access to data in multiple marts: one approach is to replicate data between different data marts or, alternatively, to build virtual data marts, which are views of several physical data marts or of the corporate data warehouse tailored to meet the requirements of specific groups of users.
    - Data mart Internet/intranet access: these products sit between a web server and the data analysis product. The Internet/intranet offers users low-cost access to data marts and the data warehouse using web browsers.
    - Data mart administration: an organization cannot easily perform administration of multiple data marts, giving rise to issues such as data mart versioning, data and meta-data consistency and integrity, enterprise-wide security, and performance tuning. Data mart administration tools are commercially available.
    - Data mart installation: data marts are becoming increasingly complex to build. Vendors are offering products referred to as "data mart in a box" that provide a low-cost source of data mart tools.
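
    The virtual data mart mentioned in the list above, a view over several physical data marts, could look like the following sketch in Python's built-in sqlite3 module. The two physical marts, the view name, and the rows are hypothetical. The view stores no data of its own, so it always reflects the current contents of the marts it draws from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical physical data marts.
conn.execute("CREATE TABLE sales_mart (region TEXT, revenue REAL)")
conn.execute("CREATE TABLE marketing_mart (region TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO sales_mart VALUES (?, ?)", [("north", 500.0), ("south", 300.0)]
)
conn.executemany(
    "INSERT INTO marketing_mart VALUES (?, ?)", [("north", 50.0), ("south", 60.0)]
)

# The virtual mart: a view tailored to one group of users.
conn.execute(
    "CREATE VIEW regional_managers_mart AS "
    "SELECT s.region, s.revenue, m.spend "
    "FROM sales_mart AS s JOIN marketing_mart AS m ON m.region = s.region"
)

rows = conn.execute(
    "SELECT region, revenue, spend FROM regional_managers_mart ORDER BY region"
).fetchall()
print(rows)
```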