Understanding biological systems by using DNA microarrays and bioinformatics
Joaquín Dopazo
Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), Spain.
http://bioinfo.cnio.es
The use of high-throughput methodologies allows us to query our systems in a new way but, at the same time, generates new challenges for data analysis and requires us to change our data management habits.
National Institute of Bioinformatics, Functional Genomics node
From genotype to phenotype.
(only the genetic component)
>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
…code for the structure of proteins...
…which accounts for the function...
…providing they are expressed in the proper moment and place...
…in cooperation with other proteins…
…conforming complex interaction networks...
Genes in the DNA... …whose final
effect can be different because of the variability.
Now: 23,531 (NCBI 34 assembly, 02/04). Estimates: 20,000 to 100,000.
50% of mRNAs do not code for proteins (mouse)
50% display alternative splicing
Each protein has an average of 8 interactions
A typical tissue expresses between 5,000 and 10,000 genes
That undergo post-translational modifications
More than 3.5 million SNPs have been mapped
25%-60% unknown
Pre-genomics scenario in the lab
>protein kinase
acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....
Sequence
Molecular databases
Search results
Phylogenetic tree
alignment
Conserved region
Motif
Motif databases
Information
Secondary and tertiary protein structure
The aim:
Extracting as much information as possible from one single piece of data
Bioinformatics tools for pre-genomic sequence data analysis
Genome sequencing
2-hybrid systems
Mass spectrometry for protein complexes
Post-genomic vision
Expression arrays
Literature, databases
Who?
What do we know?
In what way?
Where, when and how much?
SNPs
And who else?
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Post-genomic vision
genes
interactions
Gene expression
Information
polymorphisms
Information
Databases
The new tools:
Clustering
Feature selection
Data integration
Information mining
Gene expression profiling.
The rationale, what we would like and related problems
Differences at phenotype level are the visible consequence of differences at molecular level which, in many cases, can be detected by measuring the levels of gene expression. The same holds for different experiments, treatments, etc.
• Classification of phenotypes / experiments (Can I distinguish among classes, values of variables, etc. using molecular gene expression data?)
• Selection of differentially expressed genes among the phenotypes / experiments (did I select the relevant genes, all the relevant genes and nothing but the relevant genes?)
• Biological roles the genes are playing in the cell (what general biological roles are really represented in the set of relevant genes?)
A note of caution:
Genome-wide technologies allow us to produce vast amounts of data. But... data is not knowledge.
Misunderstanding of this has led to “new” (not necessarily good) ways of asking (scientific) questions
Question → Experiment → test
Is gene A involved in process B?
Experiment → (sometimes) test → Question
Is there any gene (or set of genes) involved in any process?
Gene expression analysis using DNA microarrays
Cy5
There are two dominant technologies: spotted arrays and oligo arrays, although new players are entering the arena
Cy3
cDNA arrays Oligonucleotide arrays
Transforming images into data
Test sample labeled red (Cy5)
Reference sample labeled green (Cy3)
Red: gene overexpressed in test sample
Green: gene underexpressed in test sample
Yellow: equally expressed
red/green - ratio of expression
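A minimal sketch of how the red/green ratio can be turned into a number per spot. Taking the base-2 logarithm is an assumption here (the slide only mentions the ratio); it is a common choice because over- and under-expression then become symmetric around zero. The gene names and intensities are hypothetical.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

# Hypothetical background-corrected intensities (Cy5, Cy3) for three spots.
spots = {
    "gene_up": (4000.0, 1000.0),    # red spot: overexpressed in the test sample
    "gene_down": (500.0, 2000.0),   # green spot: underexpressed in the test sample
    "gene_same": (1500.0, 1500.0),  # yellow spot: equally expressed
}
ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}
```

With these values a four-fold increase and a four-fold decrease give log ratios of the same magnitude and opposite sign, which a raw ratio (4 versus 0.25) does not.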
Normalisation
There are many sources of error that can affect and seriously bias the interpretation of the results: differences in the efficiency of labelling, the hybridisation, local effects, etc.
Normalisation is a necessary step before proceeding with the analysis
Before (left) and after (right) normalization. A) BoxPlots, B) BoxPlots of subarrays and C) MA plots (ratio versus intensity)
(a) After normalization by average, (b) after print-tip lowess normalization, (c) after normalization taking into account spatial effects
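The simplest of these normalizations can be sketched as follows. This is an illustration only: it computes the M (log ratio) and A (mean log intensity) values used in MA plots and then centres M on the array-wide median. The median is swapped in for the average as an assumption, and print-tip lowess and spatial correction are not shown. The intensities are hypothetical.

```python
import math
from statistics import median

def ma_values(red, green):
    """M = log2(R/G) and A = mean log2 intensity, one value per spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Global normalisation: subtract the array-wide median log ratio."""
    centre = median(m)
    return [value - centre for value in m]

# Hypothetical array in which the red channel is globally about twice as bright.
red = [2000.0, 4000.0, 8000.0, 1000.0, 16000.0]
green = [1000.0, 2000.0, 4000.0, 1000.0, 4000.0]
m, a = ma_values(red, green)
m_norm = median_center(m)
```

After centring, most spots sit at a log ratio of zero, so the global labelling bias no longer looks like overexpression.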
The data
Characteristics of the data:
• Many more variables (genes) than measurements (experiments / arrays)
• Low signal to noise ratio
• High redundancy and intra-gene correlations
• Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.)
• Many genes have no annotation!!
Genes (thousands)
Experimental conditions (from tens up to no more than a few hundred)
Expression profile of a gene across the experimental conditions
Expression profile of all the genes for an experimental condition (array)
Different classes of experimental conditions (A, B, C...), e.g. cancer types, tissues, drug treatments, survival time, etc.
Co-expressing genes... What do they have in common?
Different phenotypes... Which genes are responsible?
Genes interacting in a network (A, B, C...)... What is the network like?
Molecular classification of samples
Multiple array experiments. Can we find groups of experiments with similar gene expression profiles?
Unsupervised
Supervised
Reverse engineering
Unsupervised clustering methods: useful for class discovery (we do not have any a priori knowledge of the classes)
Non-hierarchical: K-means, PCA, SOM
Hierarchical: UPGMA, SOTA
Different levels of information
quick and robust
An unsupervised problem: clustering of genes.
• Gene clusters are unknown beforehand
• Distance function
• Cluster gene expression patterns based uniquely on their similarities.
• Results are subjected to further interpretation (if possible)
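The two ingredients listed above, a distance function and a grouping based only on similarity, can be sketched as follows. The choice of 1 minus the Pearson correlation as distance and of average linkage (UPGMA) as the clustering rule are assumptions made for illustration, and the four gene profiles are hypothetical.

```python
from itertools import combinations
from statistics import mean

def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for same-shaped profiles, 2 for opposite ones.
    Profiles must not be constant (their variance would be zero)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return 1.0 - cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    def dist(c1, c2):
        # Average of all pairwise gene distances between the two clusters.
        return mean(pearson_distance(profiles[g], profiles[h]) for g in c1 for h in c2)

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical expression profiles across four conditions.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],   # same shape as g1, different scale
    "g3": [4, 3, 2, 1],   # opposite trend
    "g4": [8, 6, 4, 3],
}
merges = average_linkage(profiles)
```

Because the distance depends on the shape of the profile and not on its absolute level, g1 and g2 are joined first even though their expression values differ two-fold.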
Perou et al., PNAS 96 (1999)
Distinctive gene expression patterns in human mammary epithelial cells and breast cancers
Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells.
Clustering of experiments: the rationale
If enough genes have their expression levels altered in the different experiments, we might be able to find these classes by comparing gene expression profiles.
Clustering of experiments: the problems
Any gene (regardless of its relevance for the classification) has the same weight in the comparison. If relevant genes are not in an overwhelming majority, this produces noise and/or irrelevant trends.
Supervised analysis.
If we already have information on the classes, our question to the data should use it.
Class prediction based on gene expression profiles:
Problems:
How can classes A, B, C... be distinguished based on the corresponding profiles of gene expression?
How can a continuous phenotypic trait (resistance to drugs, survival, etc.) be predicted?
And
Which genes among the thousands analysed are relevant for the classification?
Genes (thousands)
Predictor
Gene selection
Experimental conditions (from tens up to no more than a few hundred)
Gene selection.
We are interested in selecting those genes showing differential expression among the classes studied.
• Contingency table (Fisher's test)
For discrete data (presence/absence, etc.).
• T-test
We could compare gene expression data between two types of patients.
• ANOVA
Analysis of variance. We compare the value of an interval-scale variable between two or more groups.
The Pomelo tool
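A t-test for one gene between two types of patients can be sketched as follows. This is an illustration under stated assumptions, not the Pomelo implementation: it uses Welch's form of the t statistic and obtains the p-value by permuting the class labels (chosen so that only the standard library is needed). The expression values and the parameters `n_perm` and `seed` are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances allowed)."""
    return (mean(x) - mean(y)) / (variance(x) / len(x) + variance(y) / len(y)) ** 0.5

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of label shufflings with |t| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical log expression values of one gene in two groups of patients.
tumour = [8.1, 7.9, 8.4, 8.0, 8.3]
normal = [6.9, 7.2, 7.0, 7.1, 6.8]
t = welch_t(tumour, normal)
p = permutation_p_value(tumour, normal)
```

In a real analysis the same test is repeated for every gene on the array, so thousands of p-values are produced at once.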
Gene selection and class discrimination
Genes differentially expressed among classes (t-test or ANOVA), with p-value < 0.05
10 cases, 10 controls
Sorry... the data was a collection of random numbers labelled for two classes
This is a multiple-testing statistical contrast.
Adjusted p-values must be used!
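The effect can be sketched with simulated data. The example assumes that, when no gene is truly different, the p-values are uniformly distributed between 0 and 1, and it uses the Benjamini-Hochberg FDR procedure as one possible adjustment. The number of genes and the seed are hypothetical.

```python
import random

def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values (Benjamini-Hochberg), in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# 5000 "genes" with no real difference between classes: p-values are pure noise.
rng = random.Random(0)
null_p = [rng.random() for _ in range(5000)]
raw_hits = sum(p < 0.05 for p in null_p)                      # roughly 5% of the genes
adj_hits = sum(q < 0.05 for q in benjamini_hochberg(null_p))  # typically few or none

example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

About one gene in twenty passes the unadjusted 0.05 threshold by chance alone, which is exactly the trap described on the slide.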
Gene selection between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC)
Hierarchical clustering of 86 genes with different expression patterns between Normal Endometrium and Endometrioid Endometrial Carcinoma (p<0.05), selected among the ~7000 genes in the CNIO oncochip
Moreno et al., BREAST AND GYNAECOLOGICAL CANCER LABORATORY, Molecular Pathology Programme, CNIO
And genes are not only related to discrete classes...
Pomelo: a tool for finding differentially expressed genes
• Among classes
• Survival
• Related to a continuous parameter
Of predictors and molecular signatures
1. Training (with internal and/or external CV): samples of known class (A, B) are used to build a model, or classifier
2. Classification / prediction: the classifier assigns an unknown sample to class A or B
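Training with internal cross-validation can be sketched as follows. The nearest-centroid classifier, the gene score (absolute difference of class means) and the leave-one-out scheme are assumptions chosen for brevity; they stand in for whatever predictor is actually used. The point of the sketch is that gene selection is repeated inside every fold, so the held-out sample never influences which genes are chosen. All data are hypothetical.

```python
from statistics import mean

def select_genes(samples, labels, k):
    """Rank genes by absolute difference of the two class means; keep the top k."""
    first, second = sorted(set(labels))
    def score(g):
        a = mean(s[g] for s, l in zip(samples, labels) if l == first)
        b = mean(s[g] for s, l in zip(samples, labels) if l == second)
        return abs(a - b)
    return sorted(range(len(samples[0])), key=score, reverse=True)[:k]

def nearest_centroid(train, labels, genes, sample):
    """Assign the sample to the class whose mean profile on the selected genes is closest."""
    best, best_d = None, float("inf")
    for c in sorted(set(labels)):
        centroid = [mean(s[g] for s, l in zip(train, labels) if l == c) for g in genes]
        d = sum((sample[g] - m) ** 2 for g, m in zip(genes, centroid))
        if d < best_d:
            best, best_d = c, d
    return best

def loo_cv_accuracy(samples, labels, k=2):
    """Leave-one-out cross-validation with gene selection inside each fold."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, k)
        if nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Six hypothetical samples, four genes; only the first two genes separate A from B.
samples = [
    [5.0, 1.0, 3.0, 2.0], [5.2, 1.1, 2.0, 3.0], [4.9, 0.9, 3.1, 2.5],
    [1.0, 5.0, 3.0, 2.2], [1.2, 5.1, 2.1, 3.1], [0.9, 4.8, 2.9, 2.4],
]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = loo_cv_accuracy(samples, labels)
```

Selecting the genes once on the full data set and cross-validating only the classifier would give an over-optimistic accuracy, which is why the selection step sits inside the loop.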
Predictor of clinical outcome in breast cancer
Genes are arranged according to their correlation with the prognostic groups
Prognostic classifier with optimal accuracy
van’t Veer et al., Nature, 2002
Information mining
How are they structured?
What is this gene?
Clustering
Links
What are these groups?
Information mining
Cell cycle...
DBs Information
My data...
Information mining applications.
1) Use of biological information as a validation criterion
Information mining of DNA array data.
Allows quick assignment of function, biological role and subcellular location to groups of genes.
Used to understand why genes differ in their expression between two different conditions
Sources of information:
• Free text
• Curated terms (ontologies, etc.)
Gene Ontology Consortium
http://www.geneontology.org
• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.
• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.
• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.
FatiGO: GO-driven data analysis
The aim: to develop a statistical framework able to deal with multiple-testing questions
GO: source of information. A reduced number of curated terms
The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29
How does FatiGO work?
• Compares two sets of genes (query and reference)
• Has ontology information [Process, Function and Component] on different organisms
• Select level [2-5]. Important: annotations are upgraded to the level chosen. This increases the power of the test: there are fewer terms to be tested and more genes per term.
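Upgrading annotations to a chosen level can be sketched as follows. Everything here is a simplification for illustration: the toy ontology is hypothetical (not real GO identifiers or structure), every term has a single parent whereas GO is a graph in which terms can have several, the root is counted as level 1, and terms shallower than the chosen level are simply dropped. None of these choices is taken from FatiGO itself.

```python
def path_from_root(term, parent):
    """Return [root, ..., term] by following single-parent links."""
    path = [term]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path[::-1]

def upgrade_to_level(term, parent, level):
    """Map a term to its ancestor at the given level (root = level 1); None if too shallow."""
    path = path_from_root(term, parent)
    return path[level - 1] if len(path) >= level else None

# Hypothetical toy ontology: child term -> parent term.
parent = {
    "biological_process": None,
    "metabolism": "biological_process",
    "cell communication": "biological_process",
    "protein metabolism": "metabolism",
    "protein phosphorylation": "protein metabolism",
    "signal transduction": "cell communication",
}
annotations = {
    "geneA": "protein phosphorylation",  # level 4, moved up to level 3
    "geneB": "signal transduction",      # already at level 3
    "geneC": "metabolism",               # level 2, too shallow for level 3
}
level3 = {g: upgrade_to_level(t, parent, 3) for g, t in annotations.items()}
```

Mapping deep, specific terms onto their ancestors merges many sparsely populated terms into fewer, better populated ones, which is where the gain in power comes from.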
1. Input: query cluster genes and reference cluster genes
2. Remove genes repeated in the query cluster, repeated in the reference cluster, and repeated between clusters
3. Clean query cluster and clean reference cluster
4. GO DB: search GO term at the level and ontology selected
5. Distribution of GO terms in the query cluster and in the reference cluster
6. p-value, multiple test
Important: since we are performing as many tests as GO terms, multiple-testing adjustment must be used
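The comparison for a single GO term can be sketched as a 2x2 contingency table (genes with and without the term, in query and reference). Using Fisher's exact test here is an assumption made for illustration; the gene counts are hypothetical. The resulting p-value is unadjusted and is one of as many tests as there are GO terms.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    observed = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))

# Hypothetical counts for one term: 12 of 40 query genes and 20 of 400
# reference genes carry the annotation.
p_term = fisher_exact_two_sided(12, 28, 20, 380)

# Small table used as a hand-checkable sanity check.
p_check = fisher_exact_two_sided(1, 9, 11, 3)
```

The small table can be verified by hand: only the four most extreme tables count, giving (66 + 2640 + 2640 + 66) / 1961256.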
Number of genes with GO term at the level and ontology selected for each cluster
• Unadjusted p-value
• Step-down min p adjusted p-value
• FDR (indep.) adjusted p-value
• FDR (arbitrary depend.) adjusted p-value
Tables: GO Term - Genes
Genes of old versions (Unigene)
Genes without result
Repeated genes
GO tree with different levels of information
FatiGO Results
The application extracts biologically relevant terms (showing a significant differential distribution) for a set of genes
Understanding why genes differ in their expression between two different phenotypes
Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL).
Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip
Genes differentially expressed between the two groups were mainly related to immune response (activated in mature lymphocytes)
Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO
Biological processes shown by the genes differentially expressed between PTL and LB
Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO
Algorithms are used if they are available in programs.
GEPAS, a package for DNA array data analysis
• Array scanning, image processing
• Preprocessor + hub
• Normalization: DNMAD
• Unsupervised clustering: Hierarchical, SOM, SOTA, SomTree
• Supervised clustering: SVM
• Two-conditions comparison
• Gene selection: two classes, multiple classes, continuous variable, categorical variable, survival
• Predictor: tnasas
• Data mining: FatiGO, FatiWise
• In silico CGH
• Viewers: SOTATree, TreeView, SOMplot
• External tools: EP, HAPI
Bioinformatics Group, CNIO
From left to right: Lucía Conde, Joaquín Dopazo, Alvaro Mateos, Fátima Al-Shahrour, Víctor Calzado, Hernán Dopazo, Javier Herrero, Javier Santoyo, Ramón Díaz, Michal Karzinstky & Juanma Vaquerizas
http://gepas.bioinfo.cnio.es
http://fatigo.bioinfo.cnio.es
http://bioinfo.cnio.es