A Data Warehouse Platform for the Analysis of Molecular-biological and Clinical Data
Erhard Rahm, H.-H. Do, M. Hartung, T. Kirsten, J. Lange
http://dbs.uni-leipzig.de
www.izbi.de
Data Warehouse Technologies in Bioinformatics (DWTB06)
December 05, 2006
Interdisciplinary Center for Bioinformatics
IZBI: Bioinformatics Center of the Univ. Leipzig
Grant of the DFG-Initiative Bioinformatics
founded 2001
several working groups, including Databases / Data Integration
Initiation of the international workshop series Data Integration in the Life Sciences (DILS)
DILS2004: Leipzig (IZBI)
DILS2005: San Diego (UCSD Supercomputing Center)
DILS2006: Cambridge, UK (EBI)
DILS2007: Philadelphia (UPenn)
LNBI 2994
Agenda
Data Integration in Bioinformatics
Data characteristics
Data integration alternatives: Warehousing / Mediators / P2P
The GeWare data integration and analysis platform
System architecture
Integration of clinical data
Multidimensional data organization
Annotation management
BioFuice: Mapping-based P2P-like data integration
Summary
Data Integration in Bioinformatics
Many heterogeneous data sources:
Experimental data
Experimental annotations
Clinical data
Lots of inter-connected web data sources and ontologies:
Sequence data
Annotation data
Private vs. public data
Different kinds of analysis needs:
Analysis of sequence data (e.g. multiple alignments)
Gene expression analysis
Pathway analysis and reconstruction
Functional profiling
Transcription analysis
Identification of transcription factor binding sites, …
High-Volume Experimental Data
High-throughput, chip-based measurement techniques
Genome-wide measurements
Gene expression data, e.g. by expression microarrays
Mutation data, e.g. Matrix-CGH arrays, SNP arrays, …
Different chip types, continuous improvement
Very voluminous raw data
Several pre-processing routines (no standard)
Different data aggregation levels (e.g. Affy probe vs. probeset expression values)
Wide spectrum of analysis methods
Statistical approaches, e.g. tests and resampling procedures, …
Data mining techniques, e.g. clustering, …
Visualizations, e.g. Heatmap, M/A plot, …
Affymetrix microarray
Clinical Data
Patient-related data and findings
Data about patients and their clinical and pathological state
Typically manually captured in hospitals
Mostly of textual nature
Relatively small volume (compared to chip-based data)
Requirements
Uniform data specification (metadata, values)
Autonomous data input (online data input)
Data integration: Patient-related findings + chip-based genetic data
Protection of patients' privacy
Utilization of existing software, e.g. for study management
Molecular-biological annotations
Annotation data vs. mapping data (cross-references)
References to other data sources: Enzyme, GeneOntology, OMIM, UniGene, KEGG
Source-specific ID (accession)
Annotations: names, symbols, synonyms, etc.
Interconnected data sources
heterogeneous schemas, formats, semantics
many, highly connected data sources and ontologies
frequent changes
incomplete data sources
common (global) database schema ???
Data integration: physical vs. virtual
Virtual Integration (query mediators): Sources 1 … n, Wrappers 1 … n, Mediator, Clients 1 … k, Meta data
Physical Integration (Data Warehousing): Operational Systems, Import (ETL), Data Warehouse, Data Marts, Analysis Tools, Meta data
P2P Integration: Typical Scenario
Peers: Gene Ontology, SwissProt, Ensembl, NetAffx, local data
Example queries: Protein annotations for gene X? Check GO annotation for genes of interest?
Bidirectional mappings between data sources instead of global schema
Queries refer to single source and are propagated to relevant peers
Adding new sources becomes simpler
Support for local data sources (e.g. private gene list)
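The propagation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Peer` class, the `propagate` helper and all identifiers such as `my_gene_1` or `go_term_X` are hypothetical and are not taken from any of the named sources.

```python
from collections import defaultdict


class Peer:
    """A data source that only knows its mappings to neighboring peers."""

    def __init__(self, name):
        self.name = name
        # target peer name -> {local id: set of ids in the target peer}
        self.mappings = defaultdict(dict)

    def add_mapping(self, target, pairs):
        """Register a bidirectional mapping given as (local id, target id) pairs."""
        for local_id, target_id in pairs:
            self.mappings[target.name].setdefault(local_id, set()).add(target_id)
            target.mappings[self.name].setdefault(target_id, set()).add(local_id)


def propagate(peers, path, ids):
    """Follow the mappings along a path of peer names, starting from ids."""
    current = set(ids)
    for src, dst in zip(path, path[1:]):
        table = peers[src].mappings[dst]
        current = {t for i in current for t in table.get(i, ())}
    return current


# Hypothetical peers and identifiers, for illustration only.
local, netaffx, ensembl, go = (Peer(n) for n in ("Local", "NetAffx", "Ensembl", "GeneOntology"))
peers = {p.name: p for p in (local, netaffx, ensembl, go)}
local.add_mapping(netaffx, [("my_gene_1", "probeset_1"), ("my_gene_2", "probeset_2")])
netaffx.add_mapping(ensembl, [("probeset_1", "ens_gene_A"), ("probeset_2", "ens_gene_B")])
ensembl.add_mapping(go, [("ens_gene_A", "go_term_X"), ("ens_gene_A", "go_term_Y")])

go_terms = propagate(peers, ["Local", "NetAffx", "Ensembl", "GeneOntology"],
                     ["my_gene_1", "my_gene_2"])
print(sorted(go_terms))
```

Adding a new source here only requires one `add_mapping` call to an existing peer; no global schema has to be revised.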
Data integration: physical vs. virtual
Criterion                        Physical (Warehouse)   Virtual: Query mediators   Virtual: Peer-to-Peer
Schema integration               A priori               A priori                   No schema integration
Instance data integration        A priori               At query runtime           At query runtime
Scalability to many sources      -                      o                          o
Analysis of large data volumes   +                      -                          -
Achievable data quality          +                      o                          o
Data freshness                   o                      +                          +
Source autonomy                  o                      +                          +
(HW) resource requirements       -                      o                          -
The GeWare System
GeWare – Genetic Data Warehouse
Central data management and analysis platform
Data of chip-based experiments (i.e. expression microarrays & Matrix-CGH arrays)
Uniform and autonomous specification of experiment annotations
Import of clinical data
Integration of gene annotations from public sources
Various methods for pre-processing, analysis and visualization
Coupling with existing tools for powerful and flexible analysis
Applications
Two collaborative cancer research studies:
Molecular Mechanism in Malignant Lymphoma (MMML): http://www.lymphome.de/Projekte/MMML
German Glioma Network: http://www.gliomnetzwerk.de/
Data from several national clinical, pathological and molecular-genetics centers
Experimental and clinical data for hundreds of patients
Local research groups at the Univ. Leipzig, e.g.
Expression analysis of different types of human thyroid nodules
Expression analysis of physiological properties of mice
Analysis of factors influencing the specific binding of sequences on microarrays
System Architecture
Data Sources:
Expression/Mutation Data: CEL Files & Expression/CGH Matrices (CSV)
Experimental & Clinical Annotation Data: Manual User Input, Daily Import from Study Management System
Gene Annotations: Public Data Sources, Local Copies, SRS, Mapping DB
Public sources shown: GO, Ensembl, NetAffx
Data Warehouse:
Staging Area
Data Im-/Export, Database API, Stored Procedure
Pre-processing Results
Core Data Warehouse: Multidimensional Data Model including Gene Expression Data, Clone Copy Numbers, Experimental & clinical Annotations, Public Data
Data Mart: Expression / CGH Matrix
Web Interface:
• Data pre-processing
• Data analysis (canned queries, statistics, visualization)
• Administration
GeWare – System Workflows
Import Workflow:
Experiment creation / selection
Manual experiment annotation
Import of raw data
Preprocessing (Normalization / aggregation)
Import of pre-processed data
Analysis Workflow (Closed Loop):
Browse / search in annotations
Gene/Clone groups
Treatment groups
Expression / CGH matrices
Internal / integrated analysis: Statistics, Visualization
External analysis (Functional profiling, clustering)
Management of analysis objects
Export
Reporting
Integrated Analysis
Different types of pre-processing methods
MAS5, RMA, Li/Wong, …
Statistics and reporting
Different kinds of statistical analysis, e.g. Multivariate oligo-based t-Test, …
Various canned queries, e.g. for outlier detection
Visualization
M/A plot to visualize differentially expressed genes
Figure: M/A plot with possibly differentially expressed genes highlighted
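For readers unfamiliar with the M/A plot, the values behind it can be computed as below. This is a minimal sketch using the common definition M = log2(a) - log2(b) and A = (log2(a) + log2(b)) / 2; the function names, the sample intensities and the |M| >= 1 cut-off (a twofold change) are assumptions for illustration and do not describe how GeWare selects genes.

```python
import math


def ma_values(intensity_a, intensity_b):
    """Return (M, A): the log2 ratio and the mean log2 intensity of two values."""
    log_a = math.log2(intensity_a)
    log_b = math.log2(intensity_b)
    return log_a - log_b, 0.5 * (log_a + log_b)


def candidate_genes(chip_a, chip_b, m_threshold=1.0):
    """List genes whose absolute log ratio reaches the (assumed) threshold."""
    hits = []
    for gene in sorted(chip_a.keys() & chip_b.keys()):
        m, a = ma_values(chip_a[gene], chip_b[gene])
        if abs(m) >= m_threshold:
            hits.append((gene, m, a))
    return hits


# Hypothetical intensities for three genes on two chips.
chip_a = {"gene_1": 1024.0, "gene_2": 128.0, "gene_3": 64.0}
chip_b = {"gene_1": 256.0, "gene_2": 128.0, "gene_3": 256.0}

for gene, m, a in candidate_genes(chip_a, chip_b):
    print(gene, m, a)
```

Plotting M against A for all genes gives the M/A plot; the genes returned by `candidate_genes` are the points far from the M = 0 line.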
Integrated Analysis cont.
Visualizations of expression values using clinical data
Heatmap of a selected gene expression matrix
Figure: heatmap with the axes Chips/Patients (Chip 1 to Chip 25) and Genes, with a chip/patient dendrogram and a gene dendrogram
Clinical Data: Integration Architecture*
Patient-related Findings:
Clinical Centers: Clinical findings
Pathological Centers: Pathological findings
Genetics Centers: Location specific genetic findings
Management of Clinical Studies (eResearch Network): Study Repository, validation by data checks
• Administration
• Simple reports
• Data export
Chip-based genetic Data: Gene expression data, Matrix-CGH data, Lab annotation data, Chip Id
Public Gene/Clone Annotations: GO, Ensembl, NetAffx, …
Management of Chip-related Data (GeWare): Data Warehouse
• Data analysis & reports
• Data export
Link between both systems: common Patient ID, mapping table Patient IDs / Chip IDs, periodic transfer
*Kirsten, T.; Lange, J.; Rahm, E.: An integrated platform for analyzing molecular-biological data within clinical studies. Information Integration in Healthcare Application, LNCS 4254, 2006
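The role of the mapping table between patient IDs and chip IDs can be illustrated with a small Python sketch. All records, field names and helper functions below are hypothetical; the sketch only shows how such a table lets clinical findings select chips, and how a simple data check could flag unmatched IDs.

```python
# Hypothetical clinical findings keyed by patient ID.
clinical_findings = {
    "patient_01": {"diagnosis": "type_A"},
    "patient_02": {"diagnosis": "type_B"},
}

# Hypothetical mapping table: (patient ID, chip ID) pairs.
patient_chip_map = [
    ("patient_01", "chip_101"),
    ("patient_01", "chip_102"),
    ("patient_02", "chip_201"),
    ("patient_03", "chip_301"),
]


def chips_by_finding(findings, mapping, field, value):
    """Return the chip IDs of all patients whose finding `field` equals `value`."""
    patients = {pid for pid, record in findings.items() if record.get(field) == value}
    return sorted(chip for pid, chip in mapping if pid in patients)


def unmatched_patients(findings, mapping):
    """Data check: patient IDs in the mapping table without a clinical record."""
    return sorted({pid for pid, _ in mapping} - set(findings))


print(chips_by_finding(clinical_findings, patient_chip_map, "diagnosis", "type_A"))
print(unmatched_patients(clinical_findings, patient_chip_map))
```

Because only IDs cross the system boundary in this sketch, identifying patient details can stay in the study management system.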
Annotation management
Generic approach to specify structure and vocabulary for experimental, clinical and genetic annotations
Consistent metadata instead of free text or undocumented abbreviations and naming
Manual specification of experimental annotations describing the experimental set-up and procedure: sample modifications, hybridization process, utilized devices, …
Automatic import of clinical annotations and genetic annotations
Annotation templates: collections of hierarchically structured annotation categories
permissible annotation values can be restricted to controlled vocabularies
MIAME compliant templates
Controlled vocabularies: locally developed or external (e.g. NCBI Taxonomy)
MAGE-ML export (data exchange)
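A template with hierarchically structured categories and controlled vocabularies could be modeled as below. This is a hedged sketch: the category names, the vocabulary terms and the `validate` function are invented for illustration and do not reflect GeWare's actual template format or the MIAME checklist.

```python
# Hypothetical template: category -> field -> specification.
# A vocabulary of None means that free text is allowed for the field.
template = {
    "Sample": {
        "Organism": {"vocabulary": {"Homo sapiens", "Mus musculus"}},
        "Tissue": {"vocabulary": None},
    },
    "Hybridization": {
        "Protocol": {"vocabulary": {"protocol_A", "protocol_B"}},
    },
}


def validate(template, annotation):
    """Return a list of problems found in an annotation; empty if it is valid."""
    errors = []
    for category, fields in template.items():
        for field, spec in fields.items():
            value = annotation.get(category, {}).get(field)
            if value is None:
                errors.append(f"{category}/{field}: missing value")
            elif spec["vocabulary"] is not None and value not in spec["vocabulary"]:
                errors.append(f"{category}/{field}: '{value}' not in controlled vocabulary")
    return errors


# An annotation with an uncontrolled term and a missing field.
bad = {"Sample": {"Organism": "human"}, "Hybridization": {"Protocol": "protocol_A"}}
for problem in validate(template, bad):
    print(problem)
```

Restricting values at input time is what later makes annotation searches reliable, since "human" and "Homo sapiens" can no longer coexist.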
Experiment Annotation: Implementation (1)
Template example
Easy specification and adaptation
Association of available vocabularies
Description
Experiment Annotation: Implementation (2)
Template example
Automatically generated web GUI
Hierarchically ordered categories
Index page
Generated page to capture annotation values
Utilization of terms of associated vocabularies
Experiment Annotation: Application
Search in experiment annotation: Create treatment groups (later reuse in analysis)
Search for relevant chips by specifying queries
Save result as group
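Searching annotations and saving the result as a reusable group might look like this in Python. The annotation fields, chip IDs and function names are assumptions made for the sketch, not GeWare's query interface.

```python
# Hypothetical chip annotations captured through annotation templates.
chip_annotations = {
    "chip_1": {"tissue": "thyroid", "state": "nodule"},
    "chip_2": {"tissue": "thyroid", "state": "normal"},
    "chip_3": {"tissue": "thyroid", "state": "nodule"},
}


def search_chips(annotations, **criteria):
    """Return the IDs of chips whose annotations match all given criteria."""
    return sorted(
        chip_id
        for chip_id, values in annotations.items()
        if all(values.get(key) == wanted for key, wanted in criteria.items())
    )


def save_group(groups, name, chip_ids):
    """Store a search result under a name so that analyses can reuse it."""
    groups[name] = list(chip_ids)
    return groups


treatment_groups = {}
save_group(treatment_groups, "nodules",
           search_chips(chip_annotations, tissue="thyroid", state="nodule"))
print(treatment_groups)
```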
Multidimensional Data Management
Fact tables: expression values for different chip types and many chips
Scalability and extensibility
Dimensions (chips/patients, genes, analysis methods)
Multidimensional analysis
Easy selection, aggregation and comparison of values
Basis to support more advanced analysis methods
Focused selection and creation of matrices
Figure: data cube with the axes Analysis methods, Experiments (chips), Genes
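The fact/dimension organization can be tried out with Python's built-in `sqlite3` module. The schema below is a deliberately reduced sketch: table and column names, the sample values and the `mean_by_group` query are assumptions for illustration and are not the GeWare schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chip   (chip_id INTEGER PRIMARY KEY, treatment_group TEXT);
CREATE TABLE gene   (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE method (method_id INTEGER PRIMARY KEY, name TEXT);
-- Fact table: one expression value per chip, gene and pre-processing method.
CREATE TABLE gene_intensity (
    chip_id   INTEGER REFERENCES chip(chip_id),
    gene_id   INTEGER REFERENCES gene(gene_id),
    method_id INTEGER REFERENCES method(method_id),
    value     REAL
);
""")
conn.executemany("INSERT INTO chip VALUES (?, ?)",
                 [(1, "control"), (2, "control"), (3, "treated")])
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "gene_A"), (2, "gene_B")])
conn.executemany("INSERT INTO method VALUES (?, ?)", [(1, "RMA"), (2, "MAS5")])
conn.executemany("INSERT INTO gene_intensity VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 8.0), (2, 1, 1, 10.0), (3, 1, 1, 12.0),
    (1, 2, 1, 5.0), (2, 2, 1, 7.0), (3, 2, 1, 6.0),
    (1, 1, 2, 100.0),
])


def mean_by_group(conn, method_name):
    """Aggregate the facts along the gene and treatment-group dimensions."""
    return conn.execute("""
        SELECT g.symbol, c.treatment_group, AVG(f.value)
        FROM gene_intensity f
        JOIN chip c   ON c.chip_id = f.chip_id
        JOIN gene g   ON g.gene_id = f.gene_id
        JOIN method m ON m.method_id = f.method_id
        WHERE m.name = ?
        GROUP BY g.symbol, c.treatment_group
        ORDER BY g.symbol, c.treatment_group
    """, (method_name,)).fetchall()


for row in mean_by_group(conn, "RMA"):
    print(row)
```

Selecting a method, grouping by a dimension and averaging the facts is the kind of selection, aggregation and comparison the slide refers to; a new chip type or method adds rows rather than new tables.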
GeWare – Data Warehouse Model
Annotation-related Dimensions:
Experiment
Chip (Sample, Array, Treatment, …)
Treatment Group
Gene (GO function, Location, Pathway, ...)
Gene Group
Clone (Chromosomal Location, …)
Clone Group
Facts: Expression Data, Analysis Results
Gene Intensity
Clone Intensity
Expression Matrix
CGH Matrix
Processing-related Dimensions:
Transformation Method (MAS5, RMA, Li-Wong, …)
Analysis Method (Clustering, Classification, Westfall/Young, ...)
Layers: Data Warehouse, Data Mart
Integration of Public Sources*
Annotation Analysis: Identification of relevant genes using annotation data
Molecular function, Gene location, Protein (product), Disease, …
Expression Analysis: Identification of relevant genes using experimental data
Expression (signal) value, P-Value, …
Components: DWH + Analysis Tools, SRS, Gene annotation, Mapping-DB, Query-Mediator, gene / clone groups
*Kirsten, T; Rahm, E: Hybrid integration of molecular-biological annotation data. Proc. 2nd Intl. Workshop DILS, July 2005
BioFuice*
BioFuice: Bioinformatics information fusion utilizing instance correspondences and peer mappings
Based on iFuice approach for P2P data integration
P2P-like infrastructure
Mappings between autonomous data sources (peers)
Mapping: Set of instance correspondences
Simple integration of new sources
High-level operators to process mappings and objects
Mapping Mediator
Control of mapping and operator execution
Utilization of application-specific semantic domain model
*Kirsten, T; Rahm, E: BioFuice: Mapping-based data integration in bioinformatics. Proc. 3rd Intl. Workshop DILS, July 2006
Script Example
Scenario
Given: Set of sequences in local source MySequences
Wanted: Three classes: unaligned sequences, non-coding sequences, protein-coding sequences
$alignedSeqMR := map( MySequences, { SeqDnaBlast } );
$codingSeqMR := compose( $alignedSeqMR, { Ensembl.SRegionExons } );
$unalignedSeqOI := diff ( MySequences, domain ( $alignedSeqMR ));
$protCodingSeqOI := domain ( $codingSeqMR );
$nonCodingSeqOI := diff ( domain ( $alignedSeqMR ) , $protCodingSeqOI );
Figure: sources and mappings used by the script
Sources: MySequences, Ensembl
Object types: Sequence, Sequence Region, Exon
Mappings: SeqDnaBlast, Ensembl.SRegionExons
Legend: LDS, PDS, mapping (same: )
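The script can be emulated in Python with sets. This is a sketch under assumptions: a mapping result is modeled as a set of (source object, target object) pairs, the operator semantics are inferred from their names, and the sequence, region and exon IDs are invented. It is not the BioFuice implementation.

```python
def map_objects(objects, mapping):
    """map: keep the correspondences whose source object is in `objects`."""
    return {(src, dst) for src, dst in mapping if src in objects}


def compose(mapping_result, mapping):
    """compose: chain correspondences (a, b) and (b, c) into (a, c)."""
    return {(a, c) for a, b in mapping_result for b2, c in mapping if b == b2}


def domain(mapping_result):
    """domain: the source objects that occur in a mapping result."""
    return {src for src, _ in mapping_result}


def diff(objects, other):
    """diff: the objects of the first set that are not in the second."""
    return set(objects) - set(other)


# Hypothetical instance data standing in for the sources of the script.
my_sequences = {"seq_1", "seq_2", "seq_3"}
seq_dna_blast = {("seq_1", "region_1"), ("seq_2", "region_2")}  # sequence -> region
s_region_exons = {("region_1", "exon_1")}                        # region -> exon

aligned_mr = map_objects(my_sequences, seq_dna_blast)
coding_mr = compose(aligned_mr, s_region_exons)

unaligned = diff(my_sequences, domain(aligned_mr))
prot_coding = domain(coding_mr)
non_coding = diff(domain(aligned_mr), prot_coding)

print(sorted(unaligned), sorted(prot_coding), sorted(non_coding))
```

Each sequence ends up in exactly one of the three classes, which mirrors the three result variables of the script.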
Conclusions
Different data integration architectures for bioinformatics needed:
Data Warehousing
Virtual integration approaches (Mediators, P2P)
Combinations
GeWare:
Management of a high volume of expression data and Matrix-CGH mutation data
Comprehensive support for consistent experimental annotations
Import of clinical data from study management system
Access to gene annotations from web sources
Different kinds of pre-processing methods and analysis
BioFuice:
P2P-like data integration
Domain model using semantic object and mapping types
Set of high-level operators for query and mapping execution