Integrated Data Systems for Genomic Analysis
Genomics and Bioinformatics for the Advancement of Clinical Sciences
Thomas Jefferson University, Oct. 14, 2002
Chris Stoeckert, Ph.D.
Dept. of Genetics & Center for Bioinformatics
University of Pennsylvania
Plasmodium genomics: Genomics and proteomics pave the way for controlling malaria
Nature, October 3, 2002
Thinking Genomically
Genome
Phenotype
•Genome structure
•Genes and function
•Pathways
•Expression patterns
•(Complex) diseases
The Genomics Unified Schema (GUS) is a relational database that warehouses and integrates biological sequence, sequence annotation, and gene expression data from a number of heterogeneous sources. User-friendly web interfaces present slices of the GUS database and allow researchers to execute structured queries for information concerning gene structure, function, and expression.
Using a Genomics Unified Schema (GUS) to ask genomic questions
GUS Powers Multiple Genomics Projects
AllGenes
PlasmoDB
EPConDB
AllGenes is based on a comprehensive mouse and human gene index. The genes are approximated by transcripts predicted from EST and mRNA clustering.
PlasmoDB is the official database of the Plasmodium falciparum genome project. It provides an integrated view of genome sequence data, including expression data from EST, SAGE, and microarray projects.
EPConDB is an index of genes expressed in endocrine pancreas. Expression is defined either through microarray experiments or sequence annotation.
"Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?"
http://www.allgenes.org/
Steve Fischer, Debbie Pinney, Brian Brunk, Joan Mazzarelli, Jonathan Crabtree, Yongchang Gan, Sharon Diskin
Nikolay Kolchanov, Alexey Katohkin
Data Integration
Data Analysis Tools
•RHMap
•GOFunction
•Sequence
•GOFunction assignments
•Boolean function
•History function
•BLAST
This query illustrates several aspects of the GUS database including:
allgenes.org query
There are 26 mouse RNAs (assemblies) that meet these criteria:
This query result set now appears on the query "history" page:
Now use the BLAST page to identify RNAs similar to my cDNA
The results of the BLAST search appear in the query history
Intersect ("AND") the BLAST search with the previous query:
And we have our answer (the third row on the query history page):
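The history-and-intersect workflow above can be sketched in a few lines. This is only an illustration of the idea, not the allgenes.org implementation: the `QueryHistory` class, its method names, and the assembly identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QueryHistory:
    """Keeps each query's result set so earlier results can be combined later."""
    entries: list = field(default_factory=list)

    def add(self, label, ids):
        """Store a result set and return its 1-based row number in the history."""
        self.entries.append((label, frozenset(ids)))
        return len(self.entries)

    def result(self, row):
        """Return the result set stored at a given history row."""
        return self.entries[row - 1][1]

    def intersect(self, row_a, row_b):
        """Boolean AND of two earlier rows, stored as a new history row."""
        label = f"{row_a} AND {row_b}"
        return self.add(label, self.result(row_a) & self.result(row_b))


history = QueryHistory()
# Hypothetical assembly identifiers, for illustration only.
q1 = history.add("chr5, GO function: transcription factor", {"DT.101", "DT.102", "DT.103"})
q2 = history.add("BLAST hits for my cDNA", {"DT.103", "DT.200"})
q3 = history.intersect(q1, q2)
print(q3, sorted(history.result(q3)))  # 3 ['DT.103']
```

Storing the combined result as a new row mirrors the slides, where the answer shows up as the third row of the history page and could itself be combined with further queries.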
Predicted GO function(s)(some manually reviewed)
Other transcripts from the same gene
External links
Mapping information
Gene trap insertions
Protein/motifs, etc.
predicted protein CAP4 assembly EST expression profile UCSC BLAT
"List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum's merozoite stage."
http://plasmodb.org/
David Roos, Jessie Kissinger, Bindu Gajria, Martin Fraunholz, Jules Milgram, Phil Labo, Amit Bahl, Dave Pearson, Dinesh Gupta, Hagai Ginsburg
Jonathan Crabtree, Jonathan Schug, Brian Brunk, Greg Grant, Trish Whetzel, Matt Mailman, Li Li
Data Integration
Data Analysis Tools
•Genome annotation
•Mass spec
•Sequence analysis
•History function
This query illustrates several aspects of the GUS database including:
PlasmoDB query:
Choose chromosome and gene/prediction type, then submit
There are 651 genes with predicted signal peptides
Choose Gene Expression from the queries page, then Proteomics
Then choose chromosome, lifecycle stage, evidence - submit
There are 828 gene predictions that satisfy this query
Go to the history page and choose which simple queries to combine. Select intersect.
We have an answer. There are 86 predicted genes that satisfy our complex query
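Because GUS is a relational database, the same combination could in principle be written as one SQL statement. The sketch below assumes two made-up tables (`signal_peptide`, `proteomics_evidence`) with toy rows, and uses SQLite in place of the Oracle back end so that it runs anywhere; none of these names come from the GUS schema.

```python
import sqlite3

# In-memory SQLite database standing in for the real warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signal_peptide (gene_id TEXT PRIMARY KEY);
    CREATE TABLE proteomics_evidence (gene_id TEXT, stage TEXT);
""")

# Toy rows with made-up gene identifiers.
conn.executemany(
    "INSERT INTO signal_peptide VALUES (?)",
    [("gene_A",), ("gene_B",), ("gene_C",)],
)
conn.executemany(
    "INSERT INTO proteomics_evidence VALUES (?, ?)",
    [("gene_B", "merozoite"), ("gene_C", "gametocyte"), ("gene_D", "merozoite")],
)

# INTERSECT keeps only genes returned by both simple queries.
rows = conn.execute(
    """
    SELECT gene_id FROM signal_peptide
    INTERSECT
    SELECT gene_id FROM proteomics_evidence WHERE stage = ?
    ORDER BY gene_id
    """,
    ("merozoite",),
).fetchall()
matching_genes = [row[0] for row in rows]
print(matching_genes)  # ['gene_B']
```

The web interface reaches the same kind of answer interactively, by intersecting two saved result sets on the history page.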
Click on a gene to get a full report
There is a variety of information available from the report page including:
Predicted protein features
and gene models
"Which DoTS assemblies (RNA) represented on the Endocrine Pancreas Consortium’s chip 2.0 are constituents of the insulin initiated signal transduction pathway?"
EPConDB query:
Data Integration
Data Analysis Tools
•Sequence
•Microarray experiment
•Transduction pathway
•BLAST
•History function
http://www.cbil.upenn.edu/EPConDB
Klaus Kaestner, Marie Scearce, John Brestelli, Phillip Le
Elisabetta Manduchi, Angel Pizarro, Debbie Pinney, Greg Grant, Joan Mazzarelli, Jonathan Crabtree, Hongxian He, Shannon Mcweeney, Matt Mailman
Choose the insulin pathway, a p-value, pancreas, the species, and whether an assembly must include an mRNA - submit
There are 59 DoTS assemblies that are constituents of the insulin pathway
Return to the gene information query page and select clone sets. Choose chip 2.0 - submit
There are 3242 assemblies represented on chip 2.0
Go to the history page, select the queries to combine and select intersect – view the results
There are 8 assemblies that satisfy the complex query. Clicking on an RNA retrieves an allgenes report.
Using Databases to Think Genomically
• Draw attention to these resources
• Show how different data sources and approaches can be used to ask powerful questions
• This can be done for different organisms, different systems
How GUS Works
AllGenes, PlasmoDB, EPConDB
Java Servlets
Oracle RDBMS; Object Layer for Data Loading
Core, SRes, TESS, RAD, DoTS
Other sites, other projects, e.g. GeneDB
Goals of GUS
• Generic platform for model organism or disease specific databases
• Integration of genome, transcript and protein data, including:
– Sequence
– Function
– Expression
– Interaction
– Regulation
– Orthologs and paralogs
• Support for:
– automated annotation and integration
– manual curation
– data mining/analysis and sophisticated queries
– web access
http://www.gusdb.org
Jonathan Crabtree, Jonathan Schug, Steve Fischer, Elisabetta Manduchi, Angel Pizarro, Junmin Liu, Debbie Pinney, Greg Grant, Trish Whetzel, Li Li, Sharon Diskin, Hongxian He
Architecture of GUS
[Figure: GUS architecture diagram. Labels shown:]
• Inputs: GenBank, InterPro, GO, etc.; Genomic Sequence; GSSs & ESTs; microarray & SAGE Experiments; Mapping Data; Annotation; QTL, POP, SNP, Clinical
• Automated Analysis & Integration
• Oracle/SQL: DoTS, RAD, Core, SRes, TESS
• Object Layer
• Java Servlets & Perl CGI: WWW queries, browsing, & download
• Mining Applications
• Annotator’s Interface
GUS is divided into 5 domains* (separate name spaces)
Namespace                        Domain                   Highlights
DoTS (DB of Transcribed Seqs)    Sequence and annotation  Central dogma
RAD (RNA Abundance DB)           Gene expression          MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation          Grammars
Core                             Data Provenance          Evidence
SRes (Shared Resources)          Shared Resources         Ontologies
* Protein Abundance DB domain underway
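Since each domain is a separate name space, tables are referred to with a namespace prefix, as in `TESS.Model` or `DoTS.NaSequence` elsewhere in these slides. A minimal sketch of resolving such a qualified name to its domain follows; the dictionary and the `describe` helper are illustrative and are not part of the GUS code base.

```python
# Namespace -> (domain, highlight), taken from the five-domain slide.
GUS_NAMESPACES = {
    "DoTS": ("Sequence and annotation", "Central dogma"),
    "RAD": ("Gene expression", "MIAME/MAGE"),
    "TESS": ("Gene regulation", "Grammars"),
    "Core": ("Data provenance", "Evidence"),
    "SRes": ("Shared resources", "Ontologies"),
}


def describe(qualified_name):
    """Split a 'Namespace.Table' name and look up the namespace's domain."""
    namespace, _, table = qualified_name.partition(".")
    if not table or namespace not in GUS_NAMESPACES:
        raise ValueError(f"not a known GUS-style qualified name: {qualified_name!r}")
    domain, _highlight = GUS_NAMESPACES[namespace]
    return namespace, table, domain


print(describe("TESS.Model"))       # ('TESS', 'Model', 'Gene regulation')
print(describe("DoTS.NaSequence"))  # ('DoTS', 'NaSequence', 'Sequence and annotation')
```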
DoTS central dogma schema
Gene: GeneInstance, GeneFeature (isa NA Feature), GenomicSequence (isa NA Sequence)
RNA: RNAInstance, RNAFeature (isa NA Feature), RNASequence (isa NA Sequence)
Protein: ProteinInstance, ProteinFeature (isa NA Feature), ProteinSequence (isa AA Sequence)
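The central dogma layout (gene, then RNA, then protein) can be pictured as a small object model. The sketch below is a deliberate simplification: it collapses the Instance, Feature, and Sequence tables into plain attributes, and the class fields and example names are assumptions rather than the GUS object layer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protein:
    name: str
    aa_sequence: str  # stands in for ProteinSequence (an AA sequence)


@dataclass
class RNA:
    name: str
    na_sequence: str  # stands in for RNASequence (an NA sequence)
    proteins: List[Protein] = field(default_factory=list)


@dataclass
class Gene:
    name: str
    rnas: List[RNA] = field(default_factory=list)  # alternative forms of one gene

    def proteins(self):
        """Collect the proteins reachable from this gene through its RNAs."""
        return [p for rna in self.rnas for p in rna.proteins]


# Hypothetical gene with two RNA forms, only one of which has a predicted protein.
gene = Gene(
    "geneX",
    rnas=[
        RNA("geneX_rna1", "AUGGCC", [Protein("geneX_prot1", "MA")]),
        RNA("geneX_rna2", "AUGGCCUAA"),
    ],
)
print([p.name for p in gene.proteins()])  # ['geneX_prot1']
```

Keeping RNA as its own level between gene and protein is what lets one gene carry several transcripts.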
[Figure: schema class diagram. Tables shown: ElementAnnotation, Analysis, AnalysisImplementationParam, AnalysisInput, AnalysisImplementation, AnalysisInvocationParam, AnalysisInvocation, AnalysisOutput, CompositeElementAnnotation, ArrayAnnotation, CompositeElementImp, ElementResultImp, CompositeElementResultImp, QuantificationParam, RelatedQuantification, Study, StudyDesignDescription, StudyAssay, StudyDesignAssay, StudyFactorValue, AssayLabeledExtract, BioMaterialImp, LabelMethod, ProtocolParam, MAGEDocumentation, MAGE_ML, AcquisitionParam, Assay, Channel, Quantification, Acquisition, RelatedAcquisition, ProcessImplementationParam, ProcessIO, ProcessInvocation, ProcessInvocationParam, Array, BioMaterialMeasurement, Protocol, Treatment, StudyDesign, BioMaterialCharacteristic, ProcessImplementation, ElementImp, Control, ProcessResult, StudyFactor, OntologyEntry. Associations are labeled 1 to 0..* or 0..1 to 0..*.]
RAD schema uses MAGE/MIAME
MAGE: Experiment; Array; BioMaterial; BioAssay; BioAssayData; Protocol, Descr.; HigherLevelAnalysis
MIAME: Experimental Design; Array design; Samples; Hybridization, Measure; Normalization
Journals are Adopting the MGED Standards
Use of Minimum Information About a Microarray Experiment (MIAME)
TESS Schema
ModelString
ModelConsensusString
ModelPositionalWeightMatrix
ModelGrammar
TESS.Model
ActivityProteinDnaBinding
ActivityTissueSpecificity
TESS.Activity
Moiety
TESS.Moiety
MoietyMultimer
MoietyHeterodimer
MoietyComplex
TESS.FootprintInstance
DoTS.NaFeatureBindingSite
Promoter
. . .
DoTS.NaSequence
TESS.TrainingSet
TESS.ParameterGroup
TESS.Note
RAD
EST clustering and assembly
DoTS
Genomic alignment and comparative sequence analysis
Identify shared TF binding sites
TESS
Using GUS for Genomic Research
• Annotating mouse chromosome 5
– Maja Bucan
• Identifying novel genes expressed in the endocrine pancreas
– Klaus Kaestner, Alan Permutt, Doug Melton
• Identifying genes regulated by CREB
– Allan Pack, Mirek Mackiewicz
Annotation of Mouse Chromosome 5
• What are all the genes?
• What is their structure and function?
• Where are they expressed and how is this regulated?
Maja Bucan, Otto Valladeres, Kyle Gaulton
Jonathan Crabtree, Yongchang Gan, Joan Mazzarelli, Jonathan Schug
Areas of Focus on Mouse Chromosome 5
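[Figure: map of mouse chromosome 5. Gene labels: Dpp6; Adra2c; Reln; Htr5a; Drd5; Gabrb1, a2, g1; Clock; Nos3; Qdpr; Sema3a,c,d,e; Pdgfra, Kit, Flk1; Hdh. Note on figure: Rw as a balancer. Cytogenetic band labels: 7q36, 4p16.3, 4q12, 4p15.31, 7q21-22. Numeric labels: 12, 20, 40, 43, 23, 8, 15, 30.]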
Approach to Annotating Mouse Chromosome 5
• Genomic sequence
– Public release: chromosome 5 has many gaps
– Celera
– Combine to eliminate gaps where possible
• Gene models
– Ensembl predictions
– Celera predictions
– BLAT alignment of DoTS
– Comparison to human regions
Only 14 RefSeq Genes plus an additional 7 from Ensembl
Known RefSeq Genes in (72-76Mb) Region as Viewed in UCSC Genome Browser
5033405K12Rik, 6030432N09Rik, 1810027I20Rik, AI836376, Sgcb, 1700067I02Rik, C78283, 2700023E23Rik, 1190017B18Rik, 6720475M21Rik, 1300019H17Rik, Lnx1, Chic2, Gsh2, Pdgfra, Kit, Kdr, Gabarapl2 (homolog), Srd5a2l, Tparl, Clock, Pdcl2, Nmu
MGI approved symbols
Gene symbol synonyms
KIAA1458, KIAA0826, LOC231293, KIAA0276, FLJ12552
Identified 28 known genes
~76Mb
~72Mb
15 genes have assigned GO Functions
• 5 enzyme
• 4 signal transducer
• 4 ligand binding or carrier
• 3 nucleic acid binding
• 2 transporter
Known Genes on Mouse Chromosome 5
*Alignment reveals exon differences between RNAs belonging to gene (Alternative forms)
Example of Known Mouse Chromosome 5 Gene - Chic2
putative gene mouse chr5 Note: multi-exon alignment; single IMAGE clone 583253; polyA signal suggests 3’ end of gene
putative gene mouse chr5 Note: singleton ESTs from IMAGE clone 551428 align
putative gene mouse chr5 Note: multi-exon alignment; ESTs from single IMAGE clone 515319; possible polyA signal in 3' sequence
putative gene mouse chr5 Note: multiple span alignment; 9/02: RNAs also aligning to another region of mouse chr5
putative gene mouse chr5 Note: 3 ESTs in assembly from embryo
…
Total 21 (some putative genes may later be merged)
Putative Genes on Mouse Chromosome 5
Example of a Putative Mouse Gene
Example DT.40155293 image clone sequences (5’ and 3’ in same assembly)
Genes on Mouse Chromosome 5
• 72-76 Mb region
– 65 genes from automated DoTS analysis
– 49 manual evaluation
– 21 Ensembl genes
– 14 RefSeq genes
• Whole chromosome 5 (151 Mb)
– 2157 genes from automated DoTS analysis
– 1275 Ensembl genes
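To put these counts side by side, each can be expressed as a multiple of the Ensembl count for the same region. The helper below is illustrative arithmetic on the numbers from the slide. Reading "49 manual evaluation" as the number of genes retained after manual review is an assumption.

```python
# Gene counts copied from the slide.
counts_region = {"DoTS automated": 65, "manual evaluation": 49, "Ensembl": 21, "RefSeq": 14}
counts_chr5 = {"DoTS automated": 2157, "Ensembl": 1275}


def fold_over(counts, baseline):
    """Express every count as a multiple of the baseline source's count."""
    base = counts[baseline]
    return {name: round(n / base, 2) for name, n in counts.items() if name != baseline}


print(fold_over(counts_region, "Ensembl"))
# {'DoTS automated': 3.1, 'manual evaluation': 2.33, 'RefSeq': 0.67}
print(fold_over(counts_chr5, "Ensembl"))
# {'DoTS automated': 1.69}
```

On these numbers the 72-76 Mb region shows roughly two to three times as many genes as Ensembl alone, and the whole chromosome about 1.7 times as many.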
Summary
• To make links between genotype and phenotype, the output of technologies such as genomic sequencing, microarrays, mass spec, etc., must be integrated
• Our solution is GUS, the Genomics Unified Schema, used for multiple systems: AllGenes, PlasmoDB, EPConDB
– GUS is freely available as a system for use and development
– RAD is part of GUS and uses the microarray standards now available
• Using GUS for genomic research such as annotating mouse chromosome 5
– Possibly doubling the number of genes in annotated regions!
http://www.cbil.upenn.edu