Integrated Data Systems for Genomic Analysis Genomics and Bioinformatics for the Advancement of Clinical Sciences Thomas Jefferson University, Oct. 14, 2002 Chris Stoeckert, Ph.D. Dept. of Genetics & Center for Bioinformatics University of Pennsylvania

Integrated Data Systems for Genomic Analysis

Genomics and Bioinformatics for the Advancement of Clinical Sciences

Thomas Jefferson University, Oct. 14, 2002

Chris Stoeckert, Ph.D.

Dept. of Genetics & Center for Bioinformatics

University of Pennsylvania

Plasmodium genomics: Genomics and proteomics pave the way for controlling malaria

Nature, October 3, 2002

Thinking Genomically

Genome

Phenotype

• Genome structure
• Genes and function
• Pathways
• Expression patterns
• (Complex) diseases

The Genomics Unified Schema (GUS) is a relational database that warehouses and integrates biological sequence, sequence annotation, and gene expression data from a number of heterogeneous sources. User-friendly web interfaces present slices of the GUS database and let researchers run structured queries about gene structure, function, and expression.

Using a Genomics Unified Schema (GUS) to ask genomic questions

GUS Powers Multiple Genomics Projects

• AllGenes
• PlasmoDB
• EPConDB

AllGenes is based on a comprehensive mouse and human gene index. The genes are approximated by transcripts predicted from EST and mRNA clustering.

PlasmoDB is the official database of the Plasmodium falciparum genome project. It provides an integrated view of genome sequence data, including expression data from EST, SAGE, and microarray projects.

EPConDB is an index of genes expressed in the endocrine pancreas. Expression is defined either through microarray experiments or sequence annotation.

"Is my cDNA similar to any mouse genes that are predicted to encode transcription factors and have been localized to mouse chromosome 5?"

http://www.allgenes.org/

Steve Fischer, Debbie Pinney, Brian Brunk, Joan Mazzarelli, Jonathan Crabtree, Yongchang Gan, Sharon Diskin

Nikolay Kolchanov, Alexey Katohkin

allgenes.org query

This query illustrates several aspects of the GUS database, including:

Data Integration
• RHMap
• GOFunction
• Sequence

Data Analysis Tools
• GOFunction assignments
• Boolean function
• History function
• BLAST

Select the allgenes.org boolean query page

Click on the "AND" button

Choose the RH map and GO function queries

Select mouse chromosome 5 and "transcription factor"

There are 26 mouse RNAs (assemblies) that meet these criteria:

This query result set now appears on the query "history" page:

Now use the BLAST page to identify RNAs similar to my cDNA

The results of the BLAST search appear in the query history

Intersect ("AND") the BLAST search with the previous query:

And we have our answer (the third row on the query history page):
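The walkthrough above amounts to storing each result set as a row in a history and combining rows with boolean operators. A minimal sketch of that idea, assuming a hypothetical `QueryHistory` class and made-up assembly identifiers (neither comes from allgenes.org); the "OR" operator is included only for symmetry and is likewise an assumption:

```python
from dataclasses import dataclass, field


@dataclass
class QueryHistory:
    """Toy stand-in for a query history page: each row is a named result set."""
    rows: list = field(default_factory=list)

    def add(self, label, ids):
        """Store a result set and return its 1-based row number."""
        self.rows.append((label, frozenset(ids)))
        return len(self.rows)

    def result(self, row):
        """Return the set of identifiers stored in a given row."""
        return self.rows[row - 1][1]

    def combine(self, op, row_a, row_b):
        """Combine two earlier rows with AND (intersection) or OR (union)."""
        a, b = self.result(row_a), self.result(row_b)
        ops = {"AND": a & b, "OR": a | b}
        return self.add(f"#{row_a} {op} #{row_b}", ops[op])


# Made-up assembly identifiers, for illustration only.
history = QueryHistory()
chr5_tf = history.add("chr5 AND transcription factor", {"A1", "A2", "A3"})
blast = history.add("BLAST hits for my cDNA", {"A2", "A9"})
answer = history.combine("AND", chr5_tf, blast)
print(answer, sorted(history.result(answer)))
```

As in the walkthrough, the combined query lands in the third row of the history, after the two simple queries it was built from.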

Predicted GO function(s) (some manually reviewed)

Other transcripts from the same gene

External links

Mapping information

Gene trap insertions

Protein/motifs, etc.

Predicted protein, CAP4 assembly, EST expression profile, UCSC BLAT

"List all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum's merozoite stage."

http://plasmodb.org/

David Roos, Jessie Kissinger, Bindu Gajria, Martin Fraunholz, Jules Milgram, Phil Labo, Amit Bahl, Dave Pearson, Dinesh Gupta, Hagai Ginsburg

Jonathan Crabtree, Jonathan Schug, Brian Brunk, Greg Grant, Trish Whetzel, Matt Mailman, Li Li

PlasmoDB query:

This query illustrates several aspects of the GUS database, including:

Data Integration
• Genome annotation
• Mass spec

Data Analysis Tools
• Sequence analysis
• History function

Select Queries from the PlasmoDB homepage

Choose signal peptide

Choose chromosome and gene/prediction type, then submit

There are 651 genes with predicted signal peptides

Choose Gene Expression from the queries page, then Proteomics

Then choose chromosome, lifecycle stage, and evidence, then submit

There are 828 gene predictions that satisfy this query

Go to the history page and choose which simple queries to combine. Select intersect.

We have an answer. There are 86 predicted genes that satisfy our complex query

Click on a gene to get a full report

There is a variety of information available from the report page, including predicted protein features and gene models.

"Which DoTS assemblies (RNA) represented on the Endocrine Pancreas Consortium’s chip 2.0 are constituents of the insulin-initiated signal transduction pathway?"

EPConDB query:

Data Integration
• Sequence
• Microarray experiment
• Transduction pathway

Data Analysis Tools
• BLAST
• History function

http://www.cbil.upenn.edu/EPConDB

Klaus Kaestner, Marie Scearce, John Brestelli, Phillip Le

Elisabetta Manduchi, Angel Pizarro, Debbie Pinney, Greg Grant, Joan Mazzarelli, Jonathan Crabtree, Hongxian He, Shannon McWeeney, Matt Mailman

Go to the gene information query page and click on “DOTS assemblies involved in a pathway”

Choose the insulin pathway, a p-value, pancreas, the species, and whether an assembly must include an mRNA, then submit

There are 59 DoTS assemblies that are constituents of the insulin pathway

Return to the gene information query page and select clone sets. Choose chip 2.0, then submit

There are 3242 assemblies represented on chip 2.0

Go to the history page, select the queries to combine and select intersect – view the results

There are 8 assemblies that satisfy the complex query. Clicking on an RNA retrieves an allgenes report.
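Because GUS is a relational database, a combined query like this one can also be pictured as a single SQL statement. The sketch below uses Python's `sqlite3` as a stand-in for Oracle; the table names `pathway_member` and `chip_element`, their columns, and all rows are hypothetical and do not reflect the real GUS schema:

```python
import sqlite3

# In-memory database with two hypothetical tables (not the real GUS schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pathway_member (assembly_id TEXT, pathway TEXT);
    CREATE TABLE chip_element   (assembly_id TEXT, chip TEXT);
""")

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO pathway_member VALUES (?, ?)",
    [("DT.1", "insulin"), ("DT.2", "insulin"), ("DT.3", "glucagon")],
)
conn.executemany(
    "INSERT INTO chip_element VALUES (?, ?)",
    [("DT.2", "2.0"), ("DT.3", "2.0"), ("DT.4", "1.0")],
)

# Assemblies in the pathway, intersected with assemblies on the chip.
sql = """
    SELECT assembly_id FROM pathway_member WHERE pathway = ?
    INTERSECT
    SELECT assembly_id FROM chip_element WHERE chip = ?
    ORDER BY assembly_id
"""
hits = [row[0] for row in conn.execute(sql, ("insulin", "2.0"))]
print(hits)
```

The history page lets a user build the same intersection step by step without writing SQL, and keeps each intermediate result available for reuse.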

Using Databases to Think Genomically

• Draw attention to these resources

• Show how different data sources and approaches can be used to ask powerful questions

• This can be done for different organisms, different systems

How GUS Works

[Diagram. Labels recovered from the figure:]
• Web sites: AllGenes, PlasmoDB, EPConDB
• Other sites, other projects, e.g. GeneDB
• Java Servlets
• Object Layer for Data Loading
• Oracle RDBMS
• Schema domains: Core, SRes, TESS, RAD, DoTS

Goals of GUS

• Generic platform for model organism or disease-specific databases
• Integration of genome, transcript and protein data, including:
  – Sequence
  – Function
  – Expression
  – Interaction
  – Regulation
  – Orthologs and paralogs
• Support for:
  – automated annotation and integration
  – manual curation
  – data mining/analysis and sophisticated queries
  – web access

http://www.gusdb.org

Jonathan Crabtree, Jonathan Schug, Steve Fischer, Elisabetta Manduchi, Angel Pizarro, Junmin Liu, Debbie Pinney, Greg Grant, Trish Whetzel, Li Li, Sharon Diskin, Hongxian He

Architecture of GUS

[Diagram. Labels recovered from the figure:]
• Data sources: genomic sequence; GSSs and ESTs; microarray and SAGE experiments; mapping data; GenBank, InterPro, GO, etc.; annotation; QTL, POP, SNP, clinical
• Automated analysis and integration
• Oracle/SQL database: DoTS, RAD, TESS, Core, SRes
• Object Layer
• Annotator’s Interface
• Mining applications
• Java Servlets and Perl CGI: WWW queries, browsing, and download

Five domains

GUS is divided into 5 domains* (separate name spaces)

Namespace                        Domain                    Highlights
DoTS (DB of Transcribed Seqs)    Sequence and annotation   Central dogma
RAD (RNA Abundance DB)           Gene expression           MIAME/MAGE
TESS (Trans Elem Search Site)    Gene regulation           Grammars
Core                             Data Provenance           Evidence
SRes (Shared Resources)          Shared Resources          Ontologies

* Protein Abundance DB domain underway

DoTS central dogma schema

Gene: GeneInstance, GeneFeature (isa NA Feature), GenomicSequence (isa NA Sequence)

RNA: RNAInstance, RNAFeature (isa NA Feature), RNASequence (isa NA Sequence)

Protein: ProteinInstance, ProteinFeature (isa NA Feature), ProteinSequence (isa AA Sequence)
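One way to read the central dogma schema is that an abstract entity (Gene, RNA, Protein) is tied, through an Instance record, to a Feature located on a concrete Sequence. The sketch below models that reading with Python dataclasses. The class layout, field names, the `product_of` link, and the example sequence are all assumptions made for illustration; the slide gives only the table names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sequence:
    sequence_id: int
    kind: str       # "NA" or "AA", mirroring the slide's "isa NA/AA Sequence"
    residues: str


@dataclass
class Feature:
    feature_id: int
    sequence: Sequence  # the sequence this feature is located on (assumed link)
    start: int          # 0-based, inclusive
    end: int            # 0-based, exclusive

    def residues(self) -> str:
        """Return the stretch of the parent sequence covered by this feature."""
        return self.sequence.residues[self.start:self.end]


@dataclass
class Entity:
    """Abstract Gene, RNA, or Protein, independent of any one sequence."""
    name: str
    product_of: Optional["Entity"] = None  # assumed link: RNA to Gene, Protein to RNA


@dataclass
class Instance:
    """Ties an abstract entity to one concrete feature on one sequence."""
    entity: Entity
    feature: Feature


# Made-up example data.
genomic = Sequence(1, "NA", "ATGGCCTAA")
gene = Entity("example_gene")
rna = Entity("example_rna", product_of=gene)
gene_instance = Instance(gene, Feature(10, genomic, 0, 9))
print(gene_instance.feature.residues(), rna.product_of.name)
```

Separating the entity from its instances would let one gene be represented by several features, for example on different sequence releases, without duplicating the gene record.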

[Schema diagram: UML class diagram whose associations are labeled 1 to 0..* or 0..1 to 0..*. Classes recovered from the figure: ElementAnnotation, Analysis, AnalysisImplementationParam, AnalysisInput, AnalysisImplementation, AnalysisInvocationParam, AnalysisInvocation, AnalysisOutput, CompositeElementAnnotation, ArrayAnnotation, CompositeElementImp, ElementResultImp, CompositeElementResultImp, QuantificationParam, RelatedQuantification, Study, StudyDesignDescription, StudyAssay, StudyDesignAssay, StudyFactorValue, AssayLabeledExtract, BioMaterialImp, LabelMethod, ProtocolParam, MAGEDocumentation, MAGE_ML, AcquisitionParam, Assay, Channel, Quantification, Acquisition, RelatedAcquisition, ProcessImplementationParam, ProcessIO, ProcessInvocation, ProcessInvocationParam, Array, BioMaterialMeasurement, Protocol, Treatment, StudyDesign, BioMaterialCharacteristic, ProcessImplementation, ElementImp, Control, ProcessResult, StudyFactor, OntologyEntry]

RAD schema uses MAGE/MIAME

MAGE: Experiment, Array, BioMaterial, BioAssay, BioAssayData, Protocol, Descr., HigherLevelAnalysis

MIAME: Experimental Design, Array design, Samples, Hybridization, Measure, Normalization

http://www.mged.org

Journals are Adopting the MGED Standards

Use of Minimum Information About a Microarray Experiment (MIAME)

TESS Schema

TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar

TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity

TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex

TESS.FootprintInstance

DoTS.NaFeature: BindingSite, Promoter, . . .

DoTS.NaSequence

TESS.TrainingSet

TESS.ParameterGroup

TESS.Note
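To give a sense of what a positional weight matrix model represents, here is a minimal sketch of scanning a sequence with a toy matrix and log-odds scoring. The matrix values, the uniform background, and the function names are invented for illustration; this is a generic textbook technique and not a description of how TESS itself scores binding sites.

```python
import math

# Toy position weight matrix: one dict of base -> probability per position.
# The numbers are invented for illustration.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds(window):
    """Sum of log2(p / background) over the positions of one window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def best_site(sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(PWM)
    scores = [
        (log_odds(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scores)
    return start, score


start, score = best_site("TTAGCTT")
print(start, round(score, 2))
```

A consensus-string model can be seen as the limiting case in which each position allows exactly one base.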

RAD

EST clustering and assembly

DoTS

Genomic alignment and comparative sequence analysis

Identify shared TF binding sites

TESS

Using GUS for Genomic Research

• Annotating mouse chromosome 5
  – Maja Bucan
• Identifying novel genes expressed in the endocrine pancreas
  – Klaus Kaestner, Alan Permutt, Doug Melton
• Identifying genes regulated by CREB
  – Allan Pack, Mirek Mackiewicz

Annotation of Mouse Chromosome 5

• What are all the genes?

• What is their structure and function?

• Where are they expressed and how is this regulated?

Maja Bucan, Otto Valladeres, Kyle Gaulton

Jonathan Crabtree, Yongchang Gan, Joan Mazzarelli, Jonathan Schug

Areas of Focus on Mouse Chromosome 5

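[Figure: map of mouse chromosome 5. Labels recovered from the figure: genes Dpp6, Reln, Htr5a, Nos3, Adra2c, Drd5, Qdpr, Gabrb1/a2/g1, Clock, Pdgfra, Kit, Flk1, Sema3a/c/d/e, Hdh; human cytogenetic bands 7q36, 4p16.3, 4p15.31, 4q12, 7q21-22; numeric map labels 8, 12, 15, 20, 23, 30, 40, 43; note "Rw as a balancer"]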

Approach to Annotating Mouse Chromosome 5

• Genomic sequence
  – Public release: chromosome 5 has many gaps
  – Celera
  – Combine to eliminate gaps where possible
• Gene models
  – Ensembl prediction
  – Celera predictions
  – BLAT alignment of DoTS
  – Comparison to human regions

Only 14 RefSeq Genes plus an additional 7 from Ensembl

Known RefSeq Genes in (72-76Mb) Region as Viewed in UCSC Genome Browser

MGI approved symbols: 5033405K12Rik, 6030432N09Rik, 1810027I20Rik, AI836376, Sgcb, 1700067I02Rik, C78283, 2700023E23Rik, 1190017B18Rik, 6720475M21Rik, 1300019H17Rik, Lnx1, Chic2, Gsh2, Pdgfra, Kit, Kdr, Gabarapl2 (homolog), Srd5a2l, Tparl, Clock, Pdcl2, Nmu

Gene symbol synonyms: KIAA1458, KIAA0826, LOC231293, KIAA0276, FLJ12552

Identified 28 known genes (region shown: ~72 Mb to ~76 Mb)

15 genes have assigned GO Functions:
• 5 enzyme
• 4 signal transducer
• 4 ligand binding or carrier
• 3 nucleic acid binding
• 2 transporter

Known Genes on Mouse Chromosome 5

* Alignment reveals exon differences between RNAs belonging to the same gene (alternative forms)

Example of Known Mouse Chromosome 5 Gene - Chic2

putative gene mouse chr5 Note: multi-exon alignment; single IMAGE clone 583253; polyA signal suggests 3’ end of gene

putative gene mouse chr5 Note: singleton ESTs from IMAGE clone 551428 align

putative gene mouse chr5 Note: multi-exon alignment; ESTs from single IMAGE clone 515319; possible polyA signal in 3' sequence

putative gene mouse chr5 Note: multiple span alignment; 9/02: RNAs also aligning to another region of mouse chr5

putative gene mouse chr5 Note: 3 ESTs in assembly from embryo

…

Total 21 (some putative genes may later be merged)

Putative Genes on Mouse Chromosome 5

Example of a Putative Mouse Gene

Example DT.40155293 image clone sequences (5’ and 3’ in same assembly)

Genes on Mouse Chromosome 5

• 72-76 Mb region
  – 65 genes from automated DoTS analysis
  – 49 after manual evaluation
  – 21 Ensembl genes
  – 14 RefSeq genes
• Whole chromosome 5 (151 Mb)
  – 2157 genes from automated DoTS analysis
  – 1275 Ensembl genes
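The counts above can be compared directly. The short calculation below uses only numbers from the slides; reading the 49 manually evaluated genes as the 28 known plus the 21 putative genes reported earlier is an assumption, although the totals agree.

```python
# Counts copied from the slides.
known, putative = 28, 21
region = {"DoTS automated": 65, "manual evaluation": 49, "Ensembl": 21, "RefSeq": 14}
chromosome = {"DoTS automated": 2157, "Ensembl": 1275}

# Assumed reading: manually evaluated genes = known + putative.
manual_total = known + putative

# Gene counts relative to Ensembl, for the region and the whole chromosome.
region_ratio = region["manual evaluation"] / region["Ensembl"]
chromosome_ratio = chromosome["DoTS automated"] / chromosome["Ensembl"]

print(manual_total)                # 49
print(round(region_ratio, 2))      # 2.33
print(round(chromosome_ratio, 2))  # 1.69
```

In the 72-76 Mb region, manual evaluation yields a little more than twice the Ensembl count; across the whole chromosome, the automated DoTS count is about 1.7 times the Ensembl count.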

Summary

• To make links between genotype and phenotype, the output of technologies such as genomic sequencing, microarrays, mass spec, etc., must be integrated

• Our solution is GUS, the Genomics Unified Schema, used for multiple systems: AllGenes, PlasmoDB, EPConDB
  – GUS is freely available as a system for use and development
  – RAD is part of GUS and uses the microarray standards now available

• Using GUS for genomic research such as annotating mouse chromosome 5
  – Possibly doubling the number of genes in annotated regions!

http://www.cbil.upenn.edu