Post on 27-Jan-2015
Chapter 13 – Data Warehousing
Databases
Databases are developed on the idea that data is one of the critical materials of the Information Age
Information, which is created from data, becomes the basis for decision making
Decision Support Systems
Created to facilitate the decision-making process
So much information that it is difficult to extract it all from a traditional database
Need for a more comprehensive data storage facility
  – Data Warehouse
Decision Support Systems
Extract information from data to use as the basis for decision making
Used at all levels of the organization
Tailored to specific business areas
Interactive
Ad hoc queries to retrieve and display information
Combines historical operational data with business activities
4 Components of DSS
Data Store – The DSS Database
  – Business Data
  – Business Model Data
  – Internal and External Data
Data Extraction and Filtering
  – Extract and validate data from the operational database and the external data sources
4 Components of DSS
End-User Query Tool
  – Create queries that access either the operational or the DSS database
End-User Presentation Tools
  – Organize and present the data
Differences with DSS
Operational
  – Stored in a normalized relational database
  – Supports transactions that represent daily operations (not query friendly)
3 Main Differences
  – Time Span
  – Granularity
  – Dimensionality
Time Span
Operational
  – Real time
  – Current transactions
  – Short time frame
  – Specific data facts
DSS
  – Historic
  – Long time frame (months / quarters / years)
  – Patterns
Granularity
Operational
  – Specific transactions that occur at a given time
DSS
  – Shown at different levels of aggregation
  – Different summary levels
  – Decompose (drill down)
  – Summarize (roll up)
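To make roll up and drill down concrete, here is a minimal sketch in Python. The sample rows and the `aggregate` helper are made up for illustration; they are not part of the original slides. The same facts are summed at two levels of the time dimension.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, month, amount)
SALES = [
    (2014, "Q1", "Jan", 100.0),
    (2014, "Q1", "Feb", 150.0),
    (2014, "Q2", "Apr", 200.0),
    (2015, "Q1", "Jan", 120.0),
]

def aggregate(rows, depth):
    """Sum amounts grouped by the first `depth` time attributes.

    depth=1 groups by year (most summarized); depth=3 groups by
    year, quarter, and month (most detailed in this sample).
    """
    totals = defaultdict(float)
    for row in rows:
        key = row[:depth]       # leading time attributes form the group key
        totals[key] += row[-1]  # the last field is the amount
    return dict(totals)

by_year = aggregate(SALES, 1)     # roll up to the year level
by_quarter = aggregate(SALES, 2)  # drill down one level to quarters
```

Moving from `by_year` to `by_quarter` is a drill down; going the other way is a roll up.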
Dimensionality
Most distinguishing characteristic of DSS data
Operational
  – Represents atomic transactions
DSS
  – Data is related in many ways
  – Develop the larger picture
  – Multidimensional view of data
DSS Database Requirements
DSS Database Schema
  – Support complex and non-normalized data
      Summarized and aggregate data
      Multiple relationships
      Queries must extract multidimensional time slices
      Redundant data
DSS Database Requirements
Data Extraction and Filtering
  – DSS databases are created mainly by extracting data from operational databases, combined with data imported from external sources
      Need for advanced data extraction and filtering tools
      Allow batch / scheduled data extraction
      Support different types of data sources
      Check for inconsistent data / data validation rules
      Support advanced data integration / data formatting conflicts
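A minimal sketch of the extraction and filtering step, assuming two hypothetical sources that format dates differently. The row layouts, the `DATE_FORMATS` tuple, and the "amount must not be negative" rule are illustrative assumptions, not rules from the slides.

```python
from datetime import datetime

# Hypothetical rows as they might arrive from two differently formatted sources.
OPERATIONAL_ROWS = [
    {"id": "1", "date": "2015-01-27", "amount": "19.99"},
    {"id": "2", "date": "2015-01-28", "amount": "-5.00"},  # fails validation
]
EXTERNAL_ROWS = [
    {"id": "3", "date": "27/01/2015", "amount": "7.50"},
    {"id": "4", "date": "not a date", "amount": "1.00"},   # fails parsing
]

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed formats for this sketch

def parse_date(text):
    """Try each known format; return an ISO date string or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def extract(rows):
    """Split rows into (clean, rejected) after formatting and validation."""
    clean, rejected = [], []
    for row in rows:
        date = parse_date(row["date"])
        try:
            amount = float(row["amount"])
        except ValueError:
            amount = None
        # Assumed validation rule: a date must parse and amounts are non-negative.
        if date is None or amount is None or amount < 0:
            rejected.append(row)
        else:
            clean.append({"id": int(row["id"]), "date": date, "amount": amount})
    return clean, rejected

clean, rejected = extract(OPERATIONAL_ROWS + EXTERNAL_ROWS)
```

Rejected rows are kept rather than dropped so that inconsistent data can be reviewed before the next scheduled load.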
DSS Database Requirements
End-User Analytical Interface
  – Must support advanced data modeling and data presentation tools
  – Data analysis tools
  – Query generation
  – Must allow the user to navigate through the DSS
Size Requirements
  – Very large: terabytes
  – Advanced hardware (multiple processors, multiple disk arrays, etc.)
Data Warehouse
The DSS-friendly data repository is the data warehouse
Definition: an integrated, subject-oriented, time-variant, nonvolatile database that provides support for decision making
Integrated
The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization
  – Multiple sources
  – Diverse sources
  – Diverse formats
Subject-Oriented
Data is arranged and optimized to provide answers to questions from diverse functional areas
  – Data is organized and summarized by topic
      Sales / Marketing / Finance / Distribution / etc.
Time-Variant
The data warehouse represents the flow of data through time
Can contain projected data from statistical models
Data is periodically uploaded, then time-dependent data is recomputed
Nonvolatile
Once data is entered it is NEVER removed
Represents the company's entire history
  – Near-term history is continually added to it
  – Always growing
  – Must support terabyte databases and multiprocessors
Read-only database for data analysis and query processing
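As a rough illustration of "nonvolatile" and "time-variant", the sketch below models a store that only accepts periodic batch loads and answers read-only queries as of a period. The `AppendOnlyStore` class and its period labels are hypothetical; a real warehouse would enforce this at the database level.

```python
class AppendOnlyStore:
    """Toy nonvolatile store: rows can be added and read, never changed."""

    def __init__(self):
        self._rows = []

    def load_batch(self, period, rows):
        """Append one periodic batch, stamping each row with its period."""
        for row in rows:
            self._rows.append((period, row))

    def as_of(self, period):
        """Return every row loaded in or before the given period."""
        # Period labels such as "2014-Q4" sort correctly as plain strings.
        return [row for p, row in self._rows if p <= period]

store = AppendOnlyStore()
store.load_batch("2014-Q4", [{"sku": "A", "units": 10}])
store.load_batch("2015-Q1", [{"sku": "A", "units": 12}])
```

There is deliberately no update or delete method: new history is added next to old history, so the store only grows.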
Data Marts
Small data stores
More manageable data sets
Targeted to meet the needs of small groups within the organization
Small, single-subject data warehouse subset that provides decision support to a small group of people
OLAP
Online Analytical Processing tools
DSS tools that use multidimensional data analysis techniques
  – Support for a DSS data store
  – Data extraction and integration filter
  – Specialized presentation interface
12 Rules of a Data Warehouse
Data warehouse and operational environments are separated
Data is integrated
Contains historical data over a long period of time
Data is snapshot data captured at a given point in time
Data is subject-oriented
12 Rules of Data Warehouse
Mainly read-only with periodic batch updates
Development life cycle has a data-driven approach versus the traditional process-driven approach
Data contains several levels of detail
  – Current, old, lightly summarized, highly summarized
12 Rules of Data Warehouse
Environment is characterized by read-only transactions to very large data sets
System that traces data sources, transformations, and storage
Metadata is a critical component
  – Source, transformation, integration, storage, relationships, history, etc.
Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
OLAP
Need for more intensive decision support
4 Main Characteristics
  – Multidimensional data analysis
  – Advanced database support
  – Easy-to-use end-user interfaces
  – Support client/server architecture
Multidimensional Data Analysis Techniques
Advanced Data Presentation Functions
  – 3-D graphics, pivot tables, crosstabs, etc.
  – Compatible with spreadsheets and statistical packages
  – Advanced data aggregation, consolidation, and classification across time dimensions
  – Advanced computational functions
  – Advanced data modeling functions
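A crosstab (pivot table) is one of the presentation functions listed above. The following sketch builds one from a flat list of facts using only the Python standard library; the sample data and the `crosstab` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, quarter, amount)
FACTS = [
    ("East", "Q1", 100),
    ("East", "Q2", 150),
    ("West", "Q1", 80),
    ("East", "Q1", 20),
]

def crosstab(facts):
    """Return {row_label: {column_label: total}} summed over the facts."""
    table = defaultdict(lambda: defaultdict(int))
    for row_label, col_label, amount in facts:
        table[row_label][col_label] += amount
    # Convert nested defaultdicts to plain dicts for a stable result.
    return {r: dict(cols) for r, cols in table.items()}

table = crosstab(FACTS)
```

Here regions become rows and quarters become columns; swapping the first two fields of each fact would pivot the table the other way.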
Advanced Database Support
Advanced Data Access Features
  – Access to many kinds of DBMSs, flat files, and internal and external data sources
  – Access to aggregated data warehouse data
  – Advanced data navigation (drill-downs and roll-ups)
  – Ability to map end-user requests to the appropriate data source
  – Support for very large databases
Easy-to-Use End-User Interface
Graphical user interfaces
Much more useful if access is kept simple
Client/Server Architecture
Framework for the new systems to be designed, developed, and implemented
Divide the OLAP system into several components that define its architecture
  – Same computer
  – Distributed among several computers
OLAP Architecture
3 Main Modules
  – GUI
  – Analytical processing logic
  – Data-processing logic
OLAP Client/Server Architecture
Relational OLAP
Relational Online Analytical Processing
  – OLAP functionality using relational databases and familiar query tools to store and analyze multidimensional data
Multidimensional data schema support
Data access language and query performance for multidimensional data
Support for very large databases
Multidimensional Data Schema Support
Decision support data tends to be
  – Nonnormalized
  – Duplicated
  – Preaggregated
Star Schema
  – Special design technique for multidimensional data representations
  – Optimize data query operations instead of data update operations
Star Schemas
Data modeling technique to map multidimensional decision support data into a relational database
Current relational modeling techniques do not serve the needs of advanced data requirements
Star Schema
4 Components
  – Facts
  – Dimensions
  – Attributes
  – Attribute hierarchies
Facts
Numeric measurements (values) that represent a specific business aspect or activity
Stored in a fact table at the center of the star schema
Contains facts that are linked through their dimensions
Can be computed or derived at run time
Updated periodically with data from operational databases
Dimensions
Qualifying characteristics that provide additional perspectives to a given fact
  – DSS data is almost always viewed in relation to other data
Dimensions are normally stored in dimension tables
Attributes
Dimension tables contain attributes
Attributes are used to search, filter, or classify facts
Dimensions provide descriptive characteristics about the facts through their attributes
Must define common business attributes that will be used to narrow a search, group information, or describe dimensions (e.g., Time / Location / Product)
No mathematical limit to the number of dimensions (3-D makes it easy to model)
Attribute Hierarchies
Provides a top-down data organization
  – Aggregation
  – Drill-down / roll-up data analysis
Attributes from different dimensions can be grouped to form a hierarchy
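An attribute hierarchy can be sketched as a chain of child-to-parent mappings. The location hierarchy below (city, state, region) and the `roll_up` helper are hypothetical examples, not taken from the slides.

```python
# Hypothetical location hierarchy: each city belongs to a state,
# and each state belongs to a region.
CITY_TO_STATE = {"Austin": "TX", "Dallas": "TX", "Miami": "FL"}
STATE_TO_REGION = {"TX": "South", "FL": "South"}

# Hypothetical facts at the lowest level of the hierarchy: (city, units sold)
CITY_SALES = [("Austin", 5), ("Dallas", 7), ("Miami", 3)]

def roll_up(facts, parent_of):
    """Aggregate facts one level up the hierarchy using a child-to-parent map."""
    totals = {}
    for child, value in facts:
        parent = parent_of[child]
        totals[parent] = totals.get(parent, 0) + value
    return sorted(totals.items())

state_sales = roll_up(CITY_SALES, CITY_TO_STATE)      # city -> state
region_sales = roll_up(state_sales, STATE_TO_REGION)  # state -> region
```

Each call to `roll_up` moves one step toward the top of the hierarchy; keeping the lower-level results around is what makes drilling back down possible.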
Star Schema for Sales
[Figure: star schema for sales, showing the fact table and its dimension tables]
Star Schema Representation
Facts and dimensions are represented by physical tables in the data warehouse database
Fact tables are related to each dimension table in a many-to-one relationship (primary/foreign key relationships)
Fact table is related to many dimension tables
  – The primary key of the fact table is a composite primary key made up of the dimension table keys
Each fact table is designed to answer a specific DSS question
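The representation described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`time_dim`, `product_dim`, `location_dim`, `sales_fact`) are assumptions chosen for illustration; the slides do not give a concrete schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE time_dim     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE location_dim (location_id INTEGER PRIMARY KEY, region TEXT);

-- Fact table: many fact rows point to one row in each dimension table.
CREATE TABLE sales_fact (
    time_id     INTEGER NOT NULL REFERENCES time_dim(time_id),
    product_id  INTEGER NOT NULL REFERENCES product_dim(product_id),
    location_id INTEGER NOT NULL REFERENCES location_dim(location_id),
    units       INTEGER NOT NULL,
    PRIMARY KEY (time_id, product_id, location_id)  -- composite key from the dimensions
);
""")

# Hypothetical sample rows: one member per dimension and one fact.
conn.execute("INSERT INTO time_dim VALUES (1, 2015, 1)")
conn.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
conn.execute("INSERT INTO location_dim VALUES (1, 'East')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 40)")
```

With this layout, inserting a second fact for the same (time, product, location) combination, or a fact that points to a missing dimension row, is rejected by the database.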
Star Schema
The fact table is always the largest table in the star schema
Each dimension record is related to thousands of fact records
Star schema facilitates data retrieval functions
DBMS first searches the dimension tables before the larger fact table
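The idea of searching the small dimension tables before the large fact table can be sketched in plain Python. The data and the `units_for_category` helper are hypothetical, and a real DBMS optimizer chooses its own access plan; this only shows the order of the two steps.

```python
# Hypothetical small dimension table: product_id -> attributes
PRODUCT_DIM = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Books"},
}

# Hypothetical larger fact table: (product_id, units)
SALES_FACT = [(1, 10), (3, 4), (2, 6), (1, 5), (3, 1)]

def units_for_category(category):
    """Filter the small dimension first, then scan facts for matching keys."""
    matching_ids = {
        pid for pid, attrs in PRODUCT_DIM.items()
        if attrs["category"] == category
    }
    return sum(units for pid, units in SALES_FACT if pid in matching_ids)

hardware_units = units_for_category("Hardware")
```

Because the dimension lookup yields only a handful of keys, the fact table is touched once, using a cheap membership test per row.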
Data Warehouse Implementation
An Active Decision Support Framework
  – Not a static database
  – Always a work in progress
  – Complete infrastructure for company-wide decision support
  – Hardware / Software / People / Procedures / Data
  – Data warehouse is a critical component of the modern DSS, but not the only critical component
Data Mining
Discover previously unknown data characteristics, relationships, dependencies, or trends
Typical data analysis relies on end users to
  – Define the problem
  – Select the data
  – Initiate the data analysis
  – React to external stimulus
Data Mining
Proactive
Automatically searches
  – Anomalies
  – Possible relationships
  – Identify problems before the end user
Data mining tools analyze the data, uncover problems or opportunities hidden in data relationships, form computer models based on their findings, and then use the models to predict business behavior, with minimal end-user intervention
Data Mining
A methodology designed to perform knowledge-discovery expeditions over the database data with minimal end-user intervention
3 Stages of Data
  – Data
  – Information
  – Knowledge
Extraction of Knowledge from Data
4 Phases of Data Mining
Data Preparation
  – Identify the main data sets to be used by the data mining operation (usually the data warehouse)
Data Analysis and Classification
  – Study the data to identify common data characteristics or patterns
      Data groupings, classifications, clusters, sequences
      Data dependencies, links, or relationships
      Data patterns, trends, deviations
4 Phases of Data Mining
Knowledge Acquisition
  – Uses the results of the Data Analysis and Classification phase
  – Data mining tool selects the appropriate modeling or knowledge-acquisition algorithms
      Neural networks
      Decision trees
      Rules induction
      Genetic algorithms
      Memory-based reasoning
Prognosis
  – Predict future behavior
  – Forecast business outcomes
      65% of customers who did not use a particular credit card in the last 6 months are 88% likely to cancel the account.
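A prognosis of this kind is usually backed by simple rule statistics. The sketch below computes support and confidence for a hypothetical rule "inactive in the last 6 months, therefore cancels the account". The customer records are invented and are not meant to reproduce the percentages quoted on the slide.

```python
# Hypothetical customer records: (inactive_last_6_months, cancelled_account)
CUSTOMERS = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, True), (False, False),
]

def rule_stats(records):
    """Return (support, confidence) for the rule: inactive -> cancelled.

    support    = share of all customers that are inactive and cancelled
    confidence = share of inactive customers that cancelled
    """
    inactive = [r for r in records if r[0]]
    both = [r for r in inactive if r[1]]
    support = len(both) / len(records)
    confidence = len(both) / len(inactive)
    return support, confidence

support, confidence = rule_stats(CUSTOMERS)
```

A data mining tool would search over many candidate rules like this one and keep those whose support and confidence pass chosen thresholds.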
Data Mining
Still a new technique
May find many meaningless relationships
Good at finding practical relationships
  – Define customer buying patterns
  – Improve product development and acceptance
  – Etc.
Potential to become the next frontier in database development