Data Warehousing Lifecycle
![Page 1: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/1.jpg)
Data Warehousing Lifecycle
Conceptual modeling:
System requirements, data sources and warehousing activities.
Logical design:
Data flow from sources to DW, composition and semantics of activities.
DW construction:
Schema implementation, data population and warehouse tuning.
Application development:
DW interfaces, OLAP and data mining tools.
![Page 2: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/2.jpg)
On-Line Analytical Processing (OLAP)
Figure: a three-dimensional data cube with axes Store (NY, SF, LA), Product (Juice, Milk, Coke, Cream, Soap, Bread) and Time.
• At day granularity (M T W Th F S S), one row of cells holds 10 15 18 5 24 32 16.
• After roll-up to week (W1 2 3 4), the corresponding cell holds 120.
Dimensions: Time, Product, Store
Hierarchies: Day → Week → Quarter; Product → Brand → …; Store → Region → Country
Roll-ups shown: roll-up to week, roll-up to brand, roll-up to region
Operators: roll-up, drill-down, slice and dice.
Uses: Business data analysis, e.g., market-driven trend analysis.
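The roll-up from day to week can be sketched in a few lines of Python. This is a minimal illustration, assuming the seven daily values on the slide (10 15 18 5 24 32 16) belong to a single product/store cell; the helper `roll_up`, the week label "W1" and the day keys "Sa"/"Su" are names chosen here, not taken from the slides.

```python
# Daily unit counts for one (product, store) cell, Monday through Sunday.
# The seven values are the ones shown on the day-level cube; the keys
# "Sa" and "Su" are assumed labels for the slide's two "S" columns.
daily_units = {"M": 10, "T": 15, "W": 18, "Th": 5, "F": 24, "Sa": 32, "Su": 16}


def roll_up(cells, to_parent):
    """Aggregate cell values by mapping each key to its parent level."""
    totals = {}
    for key, value in cells.items():
        parent = to_parent(key)
        totals[parent] = totals.get(parent, 0) + value
    return totals


# Every day of this week maps to the same week label (assumed to be "W1").
weekly_units = roll_up(daily_units, lambda day: "W1")
print(weekly_units)  # {'W1': 120}
```

The weekly total matches the 120 shown on the week-level cube. Drill-down is the inverse navigation: going back from `weekly_units` to `daily_units`.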
![Page 3: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/3.jpg)
CSE601 3
Cube Aggregates Lattice
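Lattice of group-by aggregates, from finest to coarsest:
  city, product, date
  city, product | city, date | product, date
  city | product | date
  all
Example (a dash marks an empty cell):
  day 1        c1   c2   c3
    p1         12    -   50
    p2         11    8    -
  day 2        c1   c2   c3
    p1         44    4    -
    p2          -    -    -
  aggregated over date:
               c1   c2   c3
    p1         56    4   50
    p2         11    8    -
  aggregated over date and product:
               c1   c2   c3
               67   12   50
  aggregated over everything (all): 129
Use a greedy algorithm to decide what to materialize.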
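The lattice can be made concrete by computing every group-by from the base cells. The sketch below is illustrative only: the placement of the base values in cells is reconstructed from the totals on the slide (56, 4, 50, 11, 8, then 67, 12, 50, then 129), and the function name `aggregate` is an assumption. It enumerates the candidate aggregates; it does not implement the greedy selection of which ones to materialize.

```python
from itertools import combinations

# Base facts (city, product, date) -> units, read off the two day slices.
facts = {
    ("c1", "p1", "day1"): 12,
    ("c3", "p1", "day1"): 50,
    ("c1", "p2", "day1"): 11,
    ("c2", "p2", "day1"): 8,
    ("c1", "p1", "day2"): 44,
    ("c2", "p1", "day2"): 4,
}
DIMENSIONS = ("city", "product", "date")


def aggregate(facts, keep):
    """Sum units over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    out = {}
    for key, units in facts.items():
        group = tuple(key[i] for i in positions)
        out[group] = out.get(group, 0) + units
    return out


# One aggregate per lattice node: every subset of the dimensions,
# from (city, product, date) down to the empty tuple, which is "all".
lattice = {
    keep: aggregate(facts, keep)
    for size in range(len(DIMENSIONS), -1, -1)
    for keep in combinations(DIMENSIONS, size)
}

print(len(lattice))                             # 8 lattice nodes
print(lattice[("city", "product")][("c1", "p1")])  # 56
print(lattice[()])                              # {(): 129}
```

Each node can be computed from any node above it in the lattice, which is what makes the choice of materialized views a cost trade-off.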
![Page 4: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/4.jpg)
Dimension Hierarchies
Hierarchy: city → state → all
cities table:
  city   state
  c1     CA
  c2     NY
![Page 5: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/5.jpg)
Dimension Hierarchies
Lattice nodes once the state level of the city hierarchy is added (not all arcs shown):
  city, product, date
  state, product, date
  city, product | city, date | product, date
  state, product | state, date
  city | product | date
  state
  all
![Page 6: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/6.jpg)
Logical Data Modeling: A Star Schema Example
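Fact table:
  Sales (time_key, branch_key, location_key, product_key, num_units, amount_usd)
Dimension tables:
  Time (time_key, day, month, year)
  Product (product_key, name, brand, type)
  Location (location_key, city, state, country)
  Branch (branch_key, name, type)
  Supplier (supplier_key, name, type)
Cardinalities: four 1-to-n relationships, one from each of Time, Product, Location and Branch (1) to Sales (n).
The figure also carries a "???" label; Supplier is the one table that has no key in Sales.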
• One-to-many relationships between the fact and dimensions.
• The fact-dimension relationships are certain.
• Dimensions in star models are often tightly coupled.
• Star schema does not appear to be very extensible.
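As a rough illustration, the Sales star schema can be built and queried with Python's built-in `sqlite3` module. The table and column names follow the slide, except that the Time table is called `time_dim` here; all inserted rows are invented for the example, and the Supplier table is left out because the fact table has no key for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, name TEXT, brand TEXT, type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, name TEXT, type TEXT);

-- Fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE sales (
    time_key     INTEGER NOT NULL REFERENCES time_dim(time_key),
    branch_key   INTEGER NOT NULL REFERENCES branch(branch_key),
    location_key INTEGER NOT NULL REFERENCES location(location_key),
    product_key  INTEGER NOT NULL REFERENCES product(product_key),
    num_units    INTEGER,
    amount_usd   REAL
);

-- Illustrative rows only; none of these values come from the slides.
INSERT INTO time_dim VALUES (1, 1, 1, 2024), (2, 2, 1, 2024);
INSERT INTO product  VALUES (1, 'Milk', 'BrandA', 'dairy'), (2, 'Juice', 'BrandB', 'drink');
INSERT INTO location VALUES (1, 'SF', 'CA', 'USA'), (2, 'LA', 'CA', 'USA');
INSERT INTO branch   VALUES (1, 'Main', 'retail');
INSERT INTO sales    VALUES (1, 1, 1, 1, 10, 25.0), (2, 1, 1, 1, 5, 12.5), (1, 1, 2, 2, 7, 21.0);
""")

# Roll the fact table up to (state, brand) by joining two dimensions.
rows = conn.execute("""
    SELECT l.state, p.brand, SUM(s.num_units), SUM(s.amount_usd)
    FROM sales s
    JOIN location l ON l.location_key = s.location_key
    JOIN product  p ON p.product_key  = s.product_key
    GROUP BY l.state, p.brand
    ORDER BY p.brand
""").fetchall()
print(rows)  # [('CA', 'BrandA', 15, 37.5), ('CA', 'BrandB', 7, 21.0)]
```

Every fact row must name exactly one row in each dimension, which is the many-to-one assumption the later slides question for clinical and genomic data.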
![Page 7: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/7.jpg)
Biomedical Data Resources
• Static data: data on genotypes and on biological entities such as nucleic acids and proteins, and the relationships between these entities.
• Dynamic data: data on phenotypes, the dynamics of biological processes.
• Data on analysis tools: data on biological and computer science methods which can be used to identify the entities and relationships.
• References and annotations: to scientific papers and textual explanations.
![Page 8: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/8.jpg)
Biomedical Data Modeling
• Flat file collections: Databases were built up as indexed ASCII text files.
• Relational databases: many biology databases were implemented using Oracle, Sybase, or MySQL.
• Object-oriented databases: data are modeled as objects that are organized in classes.
• Multidimensional databases: data are organized in a star-like schema.
![Page 9: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/9.jpg)
Using Star Schema in Gene Expression Data Management
• “Applying Data Warehouse Concepts to Gene Expression Data Management”, by V. Markowitz and T. Topaloglou
• Three modeling data spaces:
  – Sample data space
  – Gene annotation data space
  – Gene expression data space
![Page 10: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/10.jpg)
Gene Expression Data Space
Fact table:
  Expression (Gene_id, Experiment_id, Analysis_id, Expression_call)
Dimension tables:
  Gene (Gene_id, Gene_name, Gene_symbol)
  Experiment (Experiment_id, Exp_name, Exp_date, Exp_file)
  Analysis (Analysis_id, Algorithm, version)
Other labels in the figure: Sample, Clinical Sample
![Page 11: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/11.jpg)
Sample Data Space
Entities in the figure: Biological Sample, Pathways, Study, Donor, Donor Demographics, Donor Clinical
![Page 12: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/12.jpg)
Gene Annotation Data Space
Entities in the figure: Gene Fragments, Sequence, Pathways, Sequence Cluster, Known Gene, Microarray Design, Chromosome
![Page 13: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/13.jpg)
OLAP Operations
• Sample selection: extract sets of samples with a certain profile on the sample data space. E.g., a sample set of male colon samples with adenocarcinoma for donors in the age group 40-60.
• Classification on organ: total number of samples classified by liver, brain, …
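Both operations above can be sketched over a small in-memory sample table. This is a minimal sketch under stated assumptions: the records, the field names (`organ`, `sex`, `age`, `diagnosis`) and the helper `select_samples` are all hypothetical and only mirror the example profile given in the bullet.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
samples = [
    {"id": "s1", "organ": "colon", "sex": "M", "age": 52, "diagnosis": "adenocarcinoma"},
    {"id": "s2", "organ": "colon", "sex": "F", "age": 47, "diagnosis": "adenocarcinoma"},
    {"id": "s3", "organ": "liver", "sex": "M", "age": 61, "diagnosis": "normal"},
    {"id": "s4", "organ": "colon", "sex": "M", "age": 35, "diagnosis": "adenocarcinoma"},
    {"id": "s5", "organ": "brain", "sex": "M", "age": 58, "diagnosis": "glioblastoma"},
]


def select_samples(samples, **criteria):
    """Keep the samples for which every named predicate returns True."""
    return [
        s for s in samples
        if all(test(s[field]) for field, test in criteria.items())
    ]


# Sample selection: male colon adenocarcinoma samples, donors aged 40-60.
selected = select_samples(
    samples,
    sex=lambda v: v == "M",
    organ=lambda v: v == "colon",
    diagnosis=lambda v: v == "adenocarcinoma",
    age=lambda v: 40 <= v <= 60,
)
print([s["id"] for s in selected])  # ['s1']

# Classification on organ: number of samples per organ.
per_organ = Counter(s["organ"] for s in samples)
print(per_organ["colon"], per_organ["liver"], per_organ["brain"])  # 3 1 1
```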
![Page 14: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/14.jpg)
OLAP Operations
• Gene selection: extract sets of genes with certain properties over the gene annotation data space. E.g., a gene set of the genes on chromosome 22 …
• Aggregates: gene summarization on the sample dimension, sample summarization on the gene dimension, etc.
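Gene selection and the two aggregate directions can be illustrated on a tiny expression matrix. The gene and sample identifiers, the values, and the use of the mean as the summary function are assumptions made for this sketch; the slides do not say which summary statistic is intended.

```python
from statistics import mean

# Hypothetical annotations and expression values; all ids are made up.
gene_chromosome = {"g1": "22", "g2": "22", "g3": "7"}
expression = {
    "g1": {"s1": 2.0, "s2": 4.0, "s3": 6.0},
    "g2": {"s1": 1.0, "s2": 1.0, "s3": 4.0},
    "g3": {"s1": 9.0, "s2": 9.0, "s3": 9.0},
}

# Gene selection over the annotation space: genes on chromosome 22.
selected_genes = sorted(g for g, chrom in gene_chromosome.items() if chrom == "22")

# Gene summarization on the sample dimension: one mean per selected gene.
gene_summary = {g: mean(expression[g].values()) for g in selected_genes}

# Sample summarization on the gene dimension: one mean per sample.
sample_ids = sorted({s for g in selected_genes for s in expression[g]})
sample_summary = {s: mean(expression[g][s] for g in selected_genes) for s in sample_ids}

print(selected_genes)  # ['g1', 'g2']
print(gene_summary)    # {'g1': 4.0, 'g2': 2.0}
print(sample_summary)  # {'s1': 1.5, 's2': 2.5, 's3': 5.0}
```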
![Page 15: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/15.jpg)
Clinical Data Space
Entities in the figure: Patient, Clinical Sample, Medical Image, Followup, Drug, Demographics, Clinical Test, Physiology, Disease.
Relationship lines carry 1 and n cardinality labels.
![Page 16: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/16.jpg)
Sample Data Space
Entities in the figure: Clinical Sample, Protein Expression, mRNA Expression, Anatomy Ontology, Biochemical Assay, Genetic Screening, Patient.
Relationship lines carry 1 and n cardinality labels.
![Page 17: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/17.jpg)
Microarray Data Space
Entities in the figure: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample.
Relationship lines carry 1 and n cardinality labels.
![Page 18: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/18.jpg)
Proteomic Data Space
Entities in the figure: Protein Expression, Experiment, Measurement Unit, Gene Sequence, Clinical Sample.
Relationship lines carry 1 and n cardinality labels.
![Page 19: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/19.jpg)
Experiment Data Space
Entities in the figure: Experiment, Project, Publication, Normalization, Protocol, Person, Platform.
Relationship lines carry 1 and n cardinality labels.
![Page 20: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/20.jpg)
Gene Data Space
Entities in the figure: Gene Sequence, Protein Expression, Promoter, Gene Ontology, Protein Domain, Protein-Protein Interaction, Gene Cluster, mRNA Expression, Array Probe.
Relationship lines carry 1 and n cardinality labels, and one label reads 2.
![Page 21: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/21.jpg)
Explicit Definition of Concept Hierarchies
Entities in the figure: mRNA Expression, Experiment, Measurement Unit, Array Probe, Gene Sequence, Clinical Sample, Anatomy Ontology, Patient, Disease, Project, Platform, Normalization, Gene Ontology, Gene Cluster.
Relationship lines carry 1 and n cardinality labels.
![Page 22: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/22.jpg)
Characteristics of Clinical and Genomic Data
| Clinical and Genomic Data | Business Data |
| --- | --- |
| Complex data structure with many potential dimensions | Easy-to-understand data structure with few dimensions |
| Often many-to-many relationships between facts and dimensions | Many-to-one relationships between facts and dimensions |
| Uncertain relationships between fact and dimension objects | Certain relationships between fact and dimension objects |
| Some measures require advanced temporal support for time validity | Historical data, no advanced temporal support needed |
| Incomplete and/or imprecise data very common | Few incomplete and/or imprecise data |
![Page 23: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/23.jpg)
Large Number of Dimensions and Evolution of Dimensions
• If a star schema is used and the number of dimensions is large, the fact table will be huge (its rows are combinations of foreign keys).
• Adding a new dimension to a star schema requires recomputing all data entries in the fact table.
![Page 24: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/24.jpg)
Many-to-Many relationships
• Many-to-many relationships cannot be easily modeled using a star schema, which was originally designed to handle many-to-one relationships between a business fact and a dimension.
![Page 25: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/25.jpg)
Incompleteness of Data
• Clinical data may be incomplete. This can leave many null values in the foreign-key columns of the fact table, which results in inconsistency.
![Page 26: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/26.jpg)
Star Schema
  Fact (DimKey1, DimKey2, DimKey3, DimKey4, Measure1, Measure2, Measure3, Measure4)
  Dim1 (DimKey1, . . .)
  Dim2 (DimKey2, . . .)
  Dim3 (DimKey3, . . .)
  Dim4 (DimKey4, . . .)
BioStar Schema
  Fact (FactKey, . . .)
  MTable1 (DimKey1, FactKey, Measure1)
  MTable2 (DimKey2, FactKey, Measure2)
  MTable3 (DimKey3, FactKey, Measure3)
  MTable4 (DimKey4, FactKey, Measure4)
  Dim1 (DimKey1, . . .)
  Dim2 (DimKey2, . . .)
  Dim3 (DimKey3, . . .)
  Dim4 (DimKey4, . . .)
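The structural difference can be tried out with `sqlite3`. The sketch below builds a two-dimension BioStar layout using the slide's generic names (Fact, Dim1, Dim2, MTable1, MTable2); the `Label` columns and all rows are invented, and the composite primary keys are a design assumption not stated on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Fact (FactKey INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Dim1 (DimKey1 INTEGER PRIMARY KEY, Label TEXT);
CREATE TABLE Dim2 (DimKey2 INTEGER PRIMARY KEY, Label TEXT);

-- One measure table per dimension: each row links one fact row to one
-- dimension row and carries the measure for that pairing.
CREATE TABLE MTable1 (
    DimKey1  INTEGER NOT NULL REFERENCES Dim1(DimKey1),
    FactKey  INTEGER NOT NULL REFERENCES Fact(FactKey),
    Measure1 REAL NOT NULL,
    PRIMARY KEY (DimKey1, FactKey)
);
CREATE TABLE MTable2 (
    DimKey2  INTEGER NOT NULL REFERENCES Dim2(DimKey2),
    FactKey  INTEGER NOT NULL REFERENCES Fact(FactKey),
    Measure2 REAL NOT NULL,
    PRIMARY KEY (DimKey2, FactKey)
);

-- Illustrative rows only.
INSERT INTO Fact VALUES (1, 'fact A'), (2, 'fact B');
INSERT INTO Dim1 VALUES (10, 'd1 x'), (11, 'd1 y');
INSERT INTO Dim2 VALUES (20, 'd2 x');
INSERT INTO MTable1 VALUES (10, 1, 0.5), (11, 1, 0.7), (10, 2, 0.9);
INSERT INTO MTable2 VALUES (20, 1, 3.0);
""")

# Many-to-many: fact 1 links to two Dim1 rows, and Dim1 row 10 to two facts.
links_per_fact = dict(conn.execute(
    "SELECT FactKey, COUNT(*) FROM MTable1 GROUP BY FactKey"
).fetchall())
print(links_per_fact)  # {1: 2, 2: 1}

# Missing data: fact 2 has no Dim2 information, so it simply has no row
# in MTable2 rather than a NULL foreign key in the fact table.
missing = conn.execute(
    "SELECT COUNT(*) FROM MTable2 WHERE FactKey = 2"
).fetchone()[0]
print(missing)  # 0
```

In the plain star layout the same data would need either several fact rows per fact or a NULL in the DimKey2 column for fact B.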
![Page 27: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/27.jpg)
BioStar Schema for Part of the Clinical Data Space
Tables in the figure:
  Patient (PatientID, SSN, Name, Gender, DOB)
  DrugUse (DrugID, PatientID, Dosage, ValidFrom, ValidTo)
  Drug (DrugID, DrugName, DrugType, Description)
  TestResult (TestID, PatientID, Result, DateTested)
  ClinicalTest (TestID, TestName, TestType, TestSetting)
  Diagnosis (DiseaseID, PatientID, Symptom, ValidFrom, ValidTo)
  Disease (DiseaseID, Name, Type, Description)
  ClinicalSample (SampleID, PatientID, Source, Amount, DateTaken)
Extensibility and flexibility
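The ValidFrom and ValidTo columns on DrugUse allow time-validity queries. The sketch below is one possible reading, assuming dates are stored as ISO-8601 text and validity intervals are closed on both ends; the helper `drugs_valid_on` and all rows are hypothetical, and the SSN column is left out of the Patient table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (PatientID INTEGER PRIMARY KEY, Name TEXT, Gender TEXT, DOB TEXT);
CREATE TABLE Drug (DrugID INTEGER PRIMARY KEY, DrugName TEXT, DrugType TEXT, Description TEXT);
CREATE TABLE DrugUse (
    DrugID    INTEGER NOT NULL REFERENCES Drug(DrugID),
    PatientID INTEGER NOT NULL REFERENCES Patient(PatientID),
    Dosage    TEXT,
    ValidFrom TEXT NOT NULL,  -- ISO-8601 dates compare correctly as text
    ValidTo   TEXT NOT NULL
);

-- Illustrative rows only; none of these values come from the slides.
INSERT INTO Patient VALUES (1, 'Patient A', 'F', '1960-05-01');
INSERT INTO Drug VALUES (1, 'DrugX', 'typeA', ''), (2, 'DrugY', 'typeB', '');
INSERT INTO DrugUse VALUES
    (1, 1, '10 mg', '2020-01-01', '2020-06-30'),
    (2, 1, '5 mg',  '2020-06-01', '2020-12-31');
""")


def drugs_valid_on(conn, patient_id, day):
    """Return the drug names whose validity interval contains `day`."""
    rows = conn.execute(
        """
        SELECT d.DrugName
        FROM DrugUse u JOIN Drug d ON d.DrugID = u.DrugID
        WHERE u.PatientID = ? AND u.ValidFrom <= ? AND u.ValidTo >= ?
        ORDER BY d.DrugName
        """,
        (patient_id, day, day),
    ).fetchall()
    return [name for (name,) in rows]


print(drugs_valid_on(conn, 1, "2020-03-15"))  # ['DrugX']
print(drugs_valid_on(conn, 1, "2020-06-15"))  # ['DrugX', 'DrugY']
print(drugs_valid_on(conn, 1, "2021-01-01"))  # []
```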
![Page 28: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/28.jpg)
BioStar Schema for the Sample Data Space
Tables in the figure:
  ClinicalSample (SampleID, PatientID, Source, Amount, DateTaken)
  mRNAExpression (SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression)
  AssayResult (AssayID, SampleID, Result, Comment, DateTested)
  BiochemAssay (AssayID, AssayName, AssayType, AssaySetting, Description)
  SampleAnatomy (TermID, SampleID, Description)
  AnatomyTerm (TermID, TermType, TermName, Definition)
  GeneticScreen (MarkerID, SampleID, Result, RawData, Comment, DateTested)
  GeneticMarker (MarkerID, MarkerName, MarkerType, GeneticLocus, Description)
![Page 29: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/29.jpg)
BioStar Schema for Part of the Gene Data Space
Tables in the figure:
  GeneSequence (UID, SeqType, Accession, Version, Seq, Dataset, SpeciesID, Status)
  GOAnnotation (GOID, UID, Evidence)
  GOTerm (GOID, Accession, TermType, TermName, Definition)
  Promoter (PromoterID, UID, PromoterType, PromoterSeq, Length, Description)
  ProteinInteract (UID1, UID2, Evidence, Description)
  GeneCluster (ClusterID, UID)
  Cluster (ClusterID, NumOfGenes, ExprPattern, ClusteringTool, ToolSetting, Description)
  ArrayProbe (ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC)
  GeneDomain (DomainID, UID, Alignment, SeqFrom, SeqTo, DomainFrom, DomainTo, EValue, BitScore)
  DomainModel (DomainID, ModelType, SourceDB, Accession, Title, Length, Description)
![Page 30: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/30.jpg)
Star Schema for the Microarray Data Space
Fact table:
  mRNAExpression (SampleID, ArrayProbeID, ExperimentID, MeasureUnitID, Expression)
Other tables in the figure:
  Experiment (ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
  ArrayProbe (ArrayProbeID, UID, ArrayID, ProbeName, Description, IsQC)
  MeasurementUnit (MeasureUnitID, MeasureUnitName, MeasureUnitType, Description)
  GeneSequence (UID, SeqType, Accession, Version, Seq, Dataset, SpeciesID, Status)
  ClinicalSample (SampleID, PatientID, Source, Amount, DateTaken)
![Page 31: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/31.jpg)
Star Schema for the Proteomic Data Space
Fact table:
  ProteinExpression (SampleID, UID, ExperimentID, MeasureUnitID, Expression)
Other tables in the figure:
  Experiment (ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
  MeasurementUnit (MeasureUnitID, MeasureUnitName, MeasureUnitType, Description)
  GeneSequence (UID, SeqType, Accession, Version, Seq, Dataset, SpeciesID, Status)
  ClinicalSample (SampleID, PatientID, Source, Amount, DateTaken)
![Page 32: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/32.jpg)
Star Schema for the Experiment Data Space
Fact table:
  Experiment (ExperimentID, ExperimentName, ExperimentType, ProjectID, PersonID, PlatformID, ProtocolID, NormalizationID, PublicationID)
Dimension tables:
  Project (ProjectID, ProjectName, Investigator, Description)
  Person (PersonID, PersonName, LabName, Contact)
  Platform (PlatformID, Hardware, Software, Settings, Description)
  Protocol (ProtocolID, ProtocolName, ProtocolText, CreatedBy)
  Normalization (NormalizationID, NormType, Software, Parameters, Description)
  Publication (PublicationID, PubMedID, Title, Authors, Abstract, PubDate, Citation)
![Page 33: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/33.jpg)
BioStar is not Fact Constellation
• You may view measure tables as small “fact” tables, but fact tables in a constellation usually share multiple dimension tables.
Figure: a fact constellation in which three fact tables are linked to eight dimension tables.
![Page 34: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/34.jpg)
Extensibility of BioStar
• Add a protein structure information dimension to gene data space.
Tables in the figure:
  GeneSequence (UID, SeqType, Accession, Version, Seq, Dataset, SpeciesID, Status)
  ProteinSequence (UID, PDBID, …..): measure table
  ProteinStructure (PDBID, …..): dimension table
Populating the two new tables will not affect other tables.
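A small `sqlite3` experiment illustrates this claim. It is a sketch under assumptions: ProteinStructure is read as the new dimension table and ProteinSequence as the new measure table, only the key columns are modeled (the columns elided on the slide are left out), and the accession and PDB identifier values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing gene data space (only two of the GeneSequence columns are used).
conn.executescript("""
CREATE TABLE GeneSequence (UID INTEGER PRIMARY KEY, Accession TEXT);
INSERT INTO GeneSequence VALUES (1, 'ACC_1'), (2, 'ACC_2');
""")
before = conn.execute("SELECT * FROM GeneSequence ORDER BY UID").fetchall()

# Extension: one new dimension table and one new measure table.
# No existing table is altered or reloaded.
conn.executescript("""
CREATE TABLE ProteinStructure (PDBID TEXT PRIMARY KEY);
CREATE TABLE ProteinSequence (
    UID   INTEGER NOT NULL REFERENCES GeneSequence(UID),
    PDBID TEXT    NOT NULL REFERENCES ProteinStructure(PDBID),
    PRIMARY KEY (UID, PDBID)
);
-- Placeholder identifiers, not real PDB entries.
INSERT INTO ProteinStructure VALUES ('PDB_A'), ('PDB_B');
INSERT INTO ProteinSequence VALUES (1, 'PDB_A'), (1, 'PDB_B');
""")
after = conn.execute("SELECT * FROM GeneSequence ORDER BY UID").fetchall()
print(before == after)  # True

# The new dimension is immediately queryable through the measure table.
structures_for_gene_1 = [
    pdb_id for (pdb_id,) in conn.execute(
        "SELECT PDBID FROM ProteinSequence WHERE UID = 1 ORDER BY PDBID"
    )
]
print(structures_for_gene_1)  # ['PDB_A', 'PDB_B']
```

Gene 2 has no structure information and therefore has no row in ProteinSequence, so no null values are introduced.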
![Page 35: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/35.jpg)
Flexibility of BioStar
• Separate tables for fact measures:
  – solve the many-to-many relationship problem;
  – let a dimension table and its associated measure table be populated independently;
  – avoid null values.
![Page 36: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/36.jpg)
Sample Classification Hierarchy
Figure: a classification tree rooted at All_sample, with these levels:
  All_sample
  Normal, Tumor
  Brain, Blood, Colon, Breast
  CNS_tumor, Leukemia, Colon tumor, Breast tumor, . . .
  Glioblastoma, ALL, AML, Adenocarcinoma, . . .
  (Patients)
![Page 37: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/37.jpg)
OLAP for Microarray Data Exploration
Figure: a three-dimensional data cube with axes Sample (patient) (1 2 3 4 5 6 7), Gene (D13626, D13627, D13628, J04605, L37042, S78653, X60003, Z11518) and Measurement Unit (PAVal).
One row of cells holds 10 15 18 5 24 32 16.
Roll-ups shown: roll-up to disease types, roll-up to GO terms, roll-up to expression
Dimensions: Sample, Gene, Measurement Unit
Operators: roll-up, drill-down, slice, dice, t-test, p-select
Application: Exploration of gene expression data
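The slides list t-test among the operators but do not define it. One plausible reading is a two-sample test comparing a gene's expression between two groups of samples; the sketch below computes Welch's t statistic with the standard library. The function name `welch_t` and the values are hypothetical, and turning the statistic into a p-value (as a p-select operator would need) is not shown.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups of measurements."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


# Hypothetical expression values of one gene in two sample groups.
tumor = [10.0, 12.0, 14.0]
normal = [4.0, 5.0, 6.0]

t_stat = welch_t(tumor, normal)
print(round(t_stat, 2))  # 5.42
```

Applied gene by gene after a roll-up of samples to disease types, such a statistic gives one score per gene for ranking or selection.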
![Page 38: Data Warehousing Lifecycle](https://reader035.fdocuments.us/reader035/viewer/2022062723/56813fbf550346895daa9d21/html5/thumbnails/38.jpg)
Biomedical Data Warehouse System Architecture
Components, left to right: Data Sources, Data Warehouse, Unified Access
Data sources:
• Clinical data and sample annotations
• Gene functional annotations
• Microarray mRNA expression
• Proteomics protein expression
• Promoter sequences and motifs
• Protein domains & interactome
Data Integration:
• Data extraction, transformation, cleaning & loading
• Metadata capturing & integration
• Data quality control
• Refreshment
Data Mining:
• Ad hoc queries
• OLAP
• Cluster analysis
• Mining gene regulatory networks
• Interactome prediction
• Pathway analysis
Unified access:
• A standard interface for application tools
• Object-oriented
• Defining basic operators for data access