Case Study I: Two-Sample Analysis
Ru-Fang Yeh
October 23, 2004 Genentech Hall Auditorium, Mission Bay, UCSF
Microarrays: Case Studies and Advanced Analysis
Workflow overview (flowchart):
Biological question → Experimental design → Microarray experiment → Image analysis → Quality measurement (Pass / Failed) → Preprocessing (Normalization) → Analysis (Estimation, Testing, Clustering, Discrimination; Annotation) → Biological verification and interpretation

Preprocessed data matrix (genes × samples/conditions):
Gene   Sample 1   Sample 2   Sample 3   Sample 4   …
1        0.46       0.30       0.80       1.51     …
2       -0.10       0.49       0.24       0.06     …
3        0.15       0.74       0.04       0.10     …
:        …
Preprocessing and analysis by platform (diagram)

Image analysis output → Quality assessment / Pre-processing:
• Short-oligonucleotide chip data (CEL, CDF files): quality assessment; background correction; probe-level normalization; probe set summary.
• Two-color spotted array data (gpr, gal files): quality assessment and diagnostic plots; background correction; array normalization.
• Array CGH data (UCSF spot file): quality assessment and diagnostic plots; background correction; clones summary; array normalization.

Result: probes by sample matrix of log-ratios or log-intensities.

Analysis of expression data: identify D.E. genes (estimation and testing), clustering, discrimination, etc.
Biological Question: Molecular Phenotypic Difference in Rat Alveolar Type I and Type II Cells
From “Freshly-isolated Rat Alveolar Type I Cells, Type II Cells, and Cultured Type II Cells Have Distinct Molecular Phenotypes.” (To appear, A J Phys) By Robert Gonzalez, Yee Hwa Yang, Chandi Griffin, Lennell Allen, Zachary Tigue, and Leland Dobbs.
Pulmonary Alveolar Epithelium
Type I Cells
Type II Cells
Alveolar Epithelial Type I and Type II Cells

                                 Type I Cells    Type II Cells
% Lung cells                     ~8%             ~15%
% Lung internal surface area     ~98%            ~2%
Volume / cell                    ~2000 µm³       ~400 µm³
Surface area / cell              ~5300 µm²       ~100 µm²
(Stone, AJRCMB 1992)

• Morphologic characteristics conserved across the entire range of mammals.

Known/Possible Functions
Type I Cells:
- water and ion transport
- host defense (oxidants & microorganisms)
- tumor suppression
- matrix preservation
Type II Cells:
- surfactant metabolism
- ion transport
- produce immune effector molecules
- progenitor cells for TI cells after oxidant injury (and in lung development)
Alveolar Epithelial Cell Lineage Following Lung Injury
(Diagram: Type II cell, proliferation, and transdifferentiation to Type I cell. Evans, 1975; Adamson, 1975.)
Transdifferentiation: the process by which one “stable” (differentiated) cellular phenotype changes into a different “stable” cellular phenotype.
Study Objectives
Long term goals: increase understanding of
• alveolar epithelial cell lineages;
• the mechanisms that regulate alveolar epithelial development and differentiation.

Use microarrays to establish molecular profiles of TI and TII cells:
• Identify differences in expression of single genes to
  - provide additional marker genes
  - develop new hypotheses about cellular functions of each cell type
• Determine changing patterns of expression of groups of genes to
  - understand processes of (trans)-differentiation in vivo and in vitro
  - identify candidate factors (transcription cascades) important in regulating differentiation
Gene Expression Experiment
(Diagram: TI cells; TII cells; cultured TII cells.)

Freshly Isolated TI and TII Cells
(Figure panels: TII cells; TI cells, with a TI cell fragment and a TII cell labeled.)

Type II Cells in vitro
Two contrasting sets of culture conditions:
• Matrix (EHS, contracted collagen gels); soluble factors (e.g. KGF); apical surface exposed to air; mechanical contraction.
• Matrix (TCP, fibronectin); apical surface covered by liquid; mechanical distention.
Experimental design
• Probe: Affymetrix Rat U34 chip A, with 8799 probe sets.
• Target: 4 biological replicates of each cell type:
  - TID0: freshly isolated TI cells
  - TIID0: freshly isolated TII cells
  - TIID7: cultured TII cells (for 7 days) [traditionally used as a model for TI day 0 cells]
• Cell purity criterion: < 2% cross-contamination
Preparing mRNA samples:
Dissection of tissue → RNA isolation → Amplification → Probe labelling → Hybridization
(The slide repeats this pipeline twice more, once marked “Biological Replicate” and once marked “Technical replicate”.)
Analysis Aims
Main Questions: Establish molecular profiles of TI and TII cells:
1. Identify differences in expression of single genes to:
   - provide additional marker genes
   - develop new hypotheses about cellular functions of each cell type.
2. Understand the process of (trans)-differentiation in vivo and in vitro.
3. Identify candidate factors (transcription cascades) important in regulating differentiation.
Approaches:
1. Identify differentially expressed (DE) genes between TID0 and TIID0.
2. Compare TID0 and TIID7: are they similar?
3. Find common regulatory elements (transcription factor binding sites) in groups of candidate co-regulated genes.
Preprocessing
• Quality assessment.
• Background subtraction.
• Normalization.
• Summarization of probe set values.
High Density Oligonucleotide Arrays (Affymetrix)
(Figure: GeneChip probe array, 1.28 cm, with ~500,000 probe cells on each chip. Each 24 µm probe cell holds millions of copies of a specific oligonucleotide probe. Single-stranded, labeled RNA target binds the oligonucleotide probe. Also shown: a hybridized probe cell and the image of the hybridized probe array.)
How Affymetrix Arrays Are Made
Figure from Lipshutz et al. Nat. Gen. 1999.
mRNA reference sequence (5’ → 3’):
…TCGTCTGTATCACAGACACAAAGTTGACTG…
PM (perfect match): CAGACATAGTGTCTGTGTTTCAACT
MM (mismatch):      CAGACATAGTGTGTGTGTTTCAACT
For one gene (probe set): 16 probes/gene for Rat U34.
(Figure: fluorescent probe intensities for the PM and MM probes.)
Affymetrix data files and preprocessing software (diagram)
• Hybridization + scanning → DAT file
• Image analysis → CEL file
• CEL file + CDF file → preprocessing software: MAS / GCOS, RMA / GCRMA, dChip
• Outputs: CHP file (intensity value, Absent / Present call); report file (quality); text file or Excel file (probe ID + log2 intensity)

Preprocessing
0. Quality assessment.
1. Background subtraction (B).
2. Normalization (N).
3. Summarization of probe set values (S).
Quantile Normalization (Bolstad et al 2003)
• Quantile normalization is a method to make the distribution of probe intensities the same for every chip.
• The normalization distribution is chosen by averaging each quantile across chips.
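A minimal sketch of the quantile normalization idea, in plain Python. The function name, the toy intensities, and the tie-breaking rule (ties keep their original order) are assumptions for illustration; this is not the Bioconductor implementation.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one list of probe intensities per chip."""
    n_probes = len(chips[0])
    # Sort each chip, then average each quantile (rank position) across chips.
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(s[r] for s in sorted_chips) / len(chips)
                 for r in range(n_probes)]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            # Replace each intensity by the reference value of the same rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: 3 chips, 4 probes each.
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.5, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After normalization every chip contains exactly the same set of values (the averaged quantiles); only their assignment to probes differs from chip to chip.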
Probe Set Summarization: Robust Multi-array Average (Irizarry et al 2003)
• The RMA model assumes that each PM probe cell is made up of
\[ \log_2\bigl(\text{Normalized}(\text{Observed Intensity} - \text{Background})\bigr) = \text{Chip effect} + \text{Probe-specific effect} + \text{error} \]
• The expression level is estimated using a robust procedure (such as median polish or IRLS) to fit the above linear model.
• RMA values: log2 expression for chip i.
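Median polish, one of the robust procedures named above, can be sketched in a few lines. This is an illustrative stdlib-only version for a single probe set; the function name, the fixed iteration count, and the toy matrix are assumptions, and the input is taken to be already background-corrected, normalized, and log2-transformed.

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """matrix[j][i]: log2 intensity of probe j on chip i (one probe set).

    Returns (expression per chip, probe effects, residuals).
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall = 0.0
    row_eff = [0.0] * n_rows  # probe-specific effects
    col_eff = [0.0] * n_cols  # chip effects, relative to the overall level
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for j in range(n_rows):
            m = median(resid[j])
            row_eff[j] += m
            resid[j] = [x - m for x in resid[j]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column (chip) medians.
        for i in range(n_cols):
            m = median([resid[j][i] for j in range(n_rows)])
            col_eff[i] += m
            for j in range(n_rows):
                resid[j][i] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    expression = [overall + c for c in col_eff]
    return expression, row_eff, resid


# Hypothetical, exactly additive toy data: 3 probes x 3 chips.
probe_effects = [-1.0, 0.0, 1.0]
chip_levels = [7.0, 8.0, 10.0]
matrix = [[c + p for c in chip_levels] for p in probe_effects]
expression, fitted_probe_effects, residuals = median_polish(matrix)
```

On additive data like this the fit recovers the chip levels and probe effects with zero residuals; on real data the medians make the fit resistant to outlying probes.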
Summarization Method Comparison: AffyComp http://affycomp.biostat.jhsph.edu/
(Figure panels: median SD across replicates; average false positives if we use fold-change > 2 as a cut-off.)
Software
• Affymetrix: MAS v5.1 or GCOS v1.0
• RMA (Robust Multi-array Average) / GCRMA / PLM:
  - Bioconductor http://www.bioconductor.org
  - affylmGUI http://bioinf.wehi.edu.au/affylmGUI/
  - RMAExpress http://stat.www.berkeley.edu/~bolstad/RMAExpress/RMAExpress.html
  - Axon: Acuity (RMA only, commercial)
  - GeneTraffic (RMA only, commercial)
• Li & Wong’s MBEI (Multiplicative Model-Based Expression Index):
  - dChip http://www.dchip.org/
Qualitative Quality Assessment Using PLM
(Figure: images of PLM weights and residuals.)
More QC Examples: http://stat-www.berkeley.edu/users/bolstad/PLMImageGallery/index.html
QC with affyPLM (figure)
QC with boxplots (figure)
RMAExpress (figure)
affylmGUI (figure)
Analysis
1. Identify differentially expressed (DE) genes between TID0 and TIID0.
2. Compare TID0 and TIID7.
3. Beyond expression.
DE by Average Fold-Change (M): Freshly Isolated TI vs TII Cells
(Figure: M plotted against A for ~8800 probe sets, with the 2x and 4x cut-offs marked in both the TI and TII directions.)
• > 4x (|M| > 2): 50 + 131 probe sets
• > 2x: 163 + 401 probe sets
• Simple fold-change rules give no assessment of statistical significance.
• Need to construct test statistics incorporating variability estimates (from replicates).
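As a small illustration of the fold-change rule, M and A for one probe set can be computed from the log2 expression values of the two groups. The function names and the numbers below are hypothetical; M is taken as the difference of group means and A as their average, which is an assumption about how the plot axes were defined.

```python
def m_and_a(group1, group2):
    """Return (M, A) from log2 expression values of two groups of chips."""
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    m = mean1 - mean2          # average log2 fold-change
    a = (mean1 + mean2) / 2.0  # average log2 intensity
    return m, a


def fold_call(m):
    """Classify a log2 fold-change against the 2x and 4x cut-offs."""
    if abs(m) > 2:
        return "> 4x"
    if abs(m) > 1:
        return "> 2x"
    return "<= 2x"


# Hypothetical log2 values for one probe set, 4 replicates per cell type.
ti = [10.5, 10.5, 10.5, 10.5]
tii = [8.0, 8.0, 8.0, 8.0]
m, a = m_and_a(ti, tii)
call = fold_call(m)
```

The call depends only on the group means, which is exactly why the rule says nothing about statistical significance.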
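Two-sample t-statistic & p-value
• The two-sample t-statistic
\[ T = \frac{\bar{X}_1 - \bar{X}_2}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
is used to test equality of the group means $\mu_1$, $\mu_2$. Here $s$ is the pooled standard deviation and $n_1$, $n_2$ are the group sizes.
• The p-value $p^*$ is the probability that, under the null hypothesis ($H_0$: $\mu_1 = \mu_2$), the test statistic is at least as extreme as the observed value $t^*$.
(Figure: null distribution of T, with tail areas $p^*/2$ below $-t^*$ and above $t^*$.)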
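A short sketch of the equal-variance two-sample t-statistic for a single gene, using only the Python standard library. The function name and the example values are hypothetical; the p-value is left out here because it requires the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(x1, x2):
    """Two-sample t-statistic with a pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances.
    s2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / (sqrt(s2) * sqrt(1 / n1 + 1 / n2))


# Hypothetical log2 expression values, 4 replicates per group.
group1 = [1.0, 2.0, 3.0, 4.0]
group2 = [3.0, 4.0, 5.0, 6.0]
t_stat = pooled_t(group1, group2)
```

With these numbers the mean difference is -2 and the pooled variance is 5/3, giving a t-statistic of about -2.19 on 6 degrees of freedom.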
More Two-Sample Statistics
Perform statistical tests on normalized, log-transformed data:
• Standard t-test: assumes normally distributed data in each class (!), equal variances within classes
• Welch t-test: as above, but allows unequal variances
• Wilcoxon test: non-parametric, rank-based
• Permutation test: estimate the distribution of the test statistic under the null hypothesis by permutations of the sample labels
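The permutation idea can be sketched for one gene by enumerating every relabelling of the samples. This version uses the absolute difference of group means as the test statistic, which is an assumption (any of the statistics above could be plugged in), and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(x1, x2):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = list(x1) + list(x2)
    n1 = len(x1)
    observed = abs(mean(x1) - mean(x2))
    count = 0
    total = 0
    # Every way of choosing which samples carry the "group 1" label.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical, perfectly separated groups with 4 replicates each.
p_value = permutation_p_value([1, 2, 3, 4], [11, 12, 13, 14])
```

With 4 versus 4 samples there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70 ≈ 0.029, however strong the separation. This limits what per-gene permutation tests can show with few replicates.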
When there are few replicates…
• (Fold-change) Averages, $M$, can be driven by outliers.
• T-statistics, $M / \mathrm{se}(M)$, can be driven by tiny variances.
Solution: “robust” version of t-statistic
- Replace mean by median
- Replace standard deviation by median absolute deviation
Alternative Test Statistics
1. Penalized-t
Trying to find a compromise between solely using t and solely using the mean. There are several similar solutions of the following form:
\[ t^* = \frac{M}{(s + a)/\sqrt{n}} \]
where s = standard deviation.
Question: how to estimate a?
- 90th percentile of standard deviations (s values). Efron et al (2000).
- The value that minimizes the coefficient of variation (cv) of the absolute t-values (SAM). Tusher et al (2001).
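A minimal sketch of the penalized-t with the 90th-percentile choice of a. The per-gene values are hypothetical, and the percentile is computed with `statistics.quantiles` using the inclusive method, which is one of several percentile definitions and an assumption here.

```python
from math import sqrt
from statistics import quantiles


def penalized_t(m, s, n, a):
    """t* = M / ((s + a) / sqrt(n)) for one gene."""
    return m / ((s + a) / sqrt(n))


# Hypothetical per-gene mean log-ratios (M) and standard deviations (s).
m_values = [2.0, 0.3, -1.5, 0.1]
s_values = [0.5, 0.02, 0.4, 0.3]
n = 4

# a = 90th percentile of the s values (last of the 9 decile cut points).
a = quantiles(s_values, n=10, method="inclusive")[-1]

ordinary = [m / (s / sqrt(n)) for m, s in zip(m_values, s_values)]
penalized = [penalized_t(m, s, n, a) for m, s in zip(m_values, s_values)]
```

The second gene has a small M but a tiny s, so its ordinary t is 30; adding a to the denominator pulls it down to roughly 1, below the genes with large fold-changes.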
Other Statistics (cont.)
2. Moderated t-statistics (G Smyth 2004, Limma):
\[ t^* = \frac{M}{\tilde{s}/\sqrt{n}} \]
where $\tilde{s}$ is the shrinkage estimate of the standard deviation:
\[ \tilde{s}^2 = \frac{s^2 d + s_0^2 d_0}{d + d_0} \]
Here $s$ is the sd for gene i, $s_0$ is the pooled sd from all genes, and $d$, $d_0$ are the corresponding degrees of freedom.
Estimation is done using an extension to the empirical Bayes method in Lönnstedt & Speed (2002).
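The shrinkage step can be illustrated numerically. In limma the prior values $s_0^2$ and $d_0$ are estimated from all genes by empirical Bayes; in this sketch they are simply supplied as hypothetical inputs, and the function name is an assumption.

```python
from math import sqrt


def moderated_t(m, s2, d, s0_2, d0, n):
    """Moderated t for one gene.

    m: mean log-ratio; s2: gene-wise variance with d degrees of freedom;
    s0_2: prior (pooled) variance with d0 degrees of freedom; n: replicates.
    """
    # Shrink the gene-wise variance towards the pooled variance.
    s2_tilde = (d * s2 + d0 * s0_2) / (d + d0)
    return m / (sqrt(s2_tilde) / sqrt(n))


# Hypothetical gene with a very small variance estimate.
m, s2, d, n = 1.0, 0.01, 3, 4
s0_2, d0 = 0.09, 3  # assumed prior values, not estimated here

t_ordinary = m / (sqrt(s2) / sqrt(n))
t_moderated = moderated_t(m, s2, d, s0_2, d0, n)
```

Here the shrunken variance is (0.03 + 0.27) / 6 = 0.05, so the statistic drops from 20 to about 8.9: the gene is no longer ranked highly just because its variance estimate happened to be tiny.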
Other Statistics (cont.)
3. B-statistic: log posterior odds ratio
\[ B = \log \frac{\Pr(\text{gene } i \text{ IS DE})}{\Pr(\text{gene } i \text{ IS NOT DE})} \]
Equivalent to moderated-t in terms of ranking genes.
4. Single-channel methods modeling absolute gene-expression levels:
- Newton et al 2001: log-intensities ~ Gamma
- Wolfinger et al 2001: linear mixed model on log-intensities
5. Composite methods: Differential Expressed genes via Distance Synthesis (Yang et al 2004) chooses genes that are extreme on all measures by defining a “distance” statistic based on measures of choice.
DE by Fold Changes, (limma) Moderated-t, B (lods)
Assessing Significance I: Diagnostic Plots
Assessing Significance II: Testing
Univariate hypothesis testing: for a single gene, test the null hypothesis
H0: the gene is NOT differentially expressed.
A p-value can be generated via theory or permutation tests.
Is this p-value correct?
• Yes, if only looking at ONE gene…
• Among 10,000 non-DE genes, we expect 10000 × 0.01 = 100 genes with p-value < 0.01, so we clearly can’t just use standard p-value thresholds (.05, .01)!
• Need to adjust p-values for meaningful interpretation!
(Unadjusted) p-values of moderated-t (figure)
Multiple Hypothesis Testing
(Table: hypotheses classified as H0 or Ha.)
Type I Error Rates (False Positives)
Let V be the number of false positives and R the total number of rejected hypotheses.
• Family-Wise Error Rate (FWER):
\[ \mathrm{FWER} = \Pr(V > 0) = \Pr(\text{at least one false positive}) \]
• False Discovery Rate (FDR): the FDR (Benjamini & Hochberg 1995) is the expected proportion of type I errors among the rejected hypotheses.
\[ \mathrm{FDR} = E(Q), \qquad Q = \begin{cases} V/R, & \text{if } R > 0 \\ 0, & \text{if } R = 0 \end{cases} \]
Multiple Testing: Controlling a Type I Error Rate
AIM: For a given type I error rate $\alpha$, use a procedure to select a set of “significant” genes that guarantees a type I error rate $\le \alpha$.
Adjusted p-values: Controlling the FWER
For m genes with unadjusted p-values $p_g$:
• The Bonferroni correction: $\tilde{p}_g = \min(m\,p_g, 1)$; most conservative adjustment.
• Šidák: $\tilde{p}_g = 1 - (1 - p_g)^m$; assumes independence among genes.
• minP (Westfall & Young):
\[ \tilde{p}_g = \Pr\Bigl( \min_{k=1,\dots,m} P_k \le p_g \,\Big|\, H_0 \Bigr) \]
estimated through permutation; allows dependency between genes.
• maxT: replace $p_g$ by test statistics $T_g$, min by max. Less computationally intensive than minP.
• Step-down
• Step-up
Choosing all genes with adjusted p-value $\tilde{p}_g \le \alpha$ controls the FWER at level $\alpha$.
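The two closed-form FWER adjustments are simple enough to sketch directly. The function names and the p-values are hypothetical; the permutation-based minP and maxT adjustments are not shown.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: min(m * p, 1)."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def sidak(pvals):
    """Sidak-adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]


# Hypothetical unadjusted p-values for m = 4 genes.
pvals = [0.001, 0.01, 0.04, 0.5]
bonf = bonferroni(pvals)
sid = sidak(pvals)

# Genes declared significant at FWER level alpha = 0.05.
alpha = 0.05
significant = [i for i, p in enumerate(bonf) if p <= alpha]
```

The Šidák values are always slightly smaller than the Bonferroni ones, and the difference is negligible for small p-values.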
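Controlling the FDR (Benjamini/Hochberg)
• Order unadjusted p-values: $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$.
• To control FDR = E(V/R) at level $\alpha$, reject the hypotheses $H_{(1)}, \dots, H_{(j^*)}$, where $j^* = \max\{ j : p_{(j)} \le (j/m)\,\alpha \}$.
• Adjusted p-values: $\tilde{p}_{(i)} = \min_{k=i,\dots,m} \min\bigl( (m/k)\,p_{(k)},\, 1 \bigr)$.
• Interpretation: expect 5% false positives among genes with FDR-adjusted p-values < 0.05.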
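A compact sketch of Benjamini/Hochberg adjusted p-values, written as the standard step-up computation. The function name and the example p-values are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping a running minimum of m*p/rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, m * pvals[i] / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for m = 4 genes.
pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pvals)
```

For these inputs the adjusted values are 0.02, 0.04, 0.04 and 0.02, so all four genes pass a 0.05 FDR threshold, whereas a Bonferroni threshold of 0.05 would keep only the two smallest p-values.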
Adjusted p-values
(Figures: adjusted p-values, with the p = 0.01 level marked.)
Identify DE Genes: TI vs TII Cells
1. Select a statistic which will rank the genes in order of strength of the evidence for differential expression, from strongest to weakest.
2. Choose a critical value for the ranking of statistics above which any value is considered to be significant.
Number of estimated DE genes:

                  B-statistic    Bonferroni-adj.    Median fold-change    Median fold-change
                  > 0            p-value < 0.01     > 4x                  > 2x
TIID0 vs TID0     1500           538                138 + 68 = 206        415 + 193 = 608
TIID7 vs TID0     1411           295                80 + 83 = 163         528 + 210 = 738
FWER or FDR?
• Choose FWER if high confidence in ALL selected genes is desired (for example, selecting candidate genes for RT-PCR validation). Loss of power due to strong control of type-I error.
• Use more flexible FDR procedures if certain proportions of false positives are tolerable (e.g. gene discovery, selecting candidate co-regulated gene sets for GO/pathway analysis).
Analysis
1. Identify differentially expressed (DE) genes between TID0 and TIID0.
2. Compare TID0 and TIID7.
3. Beyond differential expression…
What is next ?
• Further experiments: RT-PCR, immunohistochemistry, ELISA…
• Annotation
• Functional categories of DE genes between TID0 and TIID0.
  - Gene Ontology [http://www.geneontology.org]
  - GenMapp [http://www.genmapp.org/]
  - GOStats [http://gostat.wehi.edu.au/]
  - Bioconductor: GOStats and goTools
• Finding upstream regulatory elements with our current experiments alone?
  - Experimental and methodological challenges.
RT-PCR Verification
Annotation
(Diagram: annotation resources linked to one probe set.)
Affy ID: L26913_at
GenBank Accession/Refseq: NM_053828, NP_446280
Locuslink: 116553
UniGene: Rn.9921
RGD: Il13
Name: Interleukin 13
Gene Symbol: IL13
Swiss-Prot: P42203
GO: GO:0005144 [interleukin-13 receptor binding]; GO:0005576 extracellular; GO:0006955 [immune response]
Map Position: Chromosome 10q22, 39.1 Mb
PubMed: 12162438, 7916615, …
Biochemical pathways (KEGG)
Literature
Nucleotide Sequence: GACAAGCCAGCAGCCTAGGCCAGCCCACAGTTCTACAGCTCCCTGGTTCTCTCACTGGCTCTGGGCTTCATGGCGCTCTGGGTGACTGCAGTCCTGGCTCTTGCTTGCCTTGGTGGTCTCGCCGCCCCAGGGCCGGTGCCAAGATCTGTGTCTCTCCCTCTGACCCTTAAGGAGCTTATTGAGGAGCTGAGCAACATCACACAAGACCAGACTCCCCTGTGCAACGGCAGCATGGTATGG
GO Functional Category Enrichment
Can we find common transcriptional regulatory elements/motifs in the co-expressed genes?
(Flowchart)
• List of Affy IDs (DE genes) (co-expression → co-regulation).
• Upstream sequences of co-expressed genes + additional data, retrieved with EZRetrieve, SOURCE… from genome resources (Genbank, Ensembl, UCSC Genome Browser).
• Computational methods, drawing on TRANSFAC, expression, and other genomes.
• Candidate transcription factor binding sites/motifs.
• Biological verification (chromatin immunoprecipitation…).
• Hypotheses of gene modules, network…
Transcriptional Regulation
Challenges for higher eukaryotes
• Getting the right sequences is hard
  - (now minor): Transcription start sites (TSS) could be very far from translation start sites (ATG). Typically undetermined, with low prediction accuracy, unless full-length cDNAs or 5’ ESTs are available.
  - Regulatory motifs could be anywhere: promoter (TSS proximity), very far 5’ upstream (a few to hundreds of kb), introns, even 3’ downstream.
• Low signal-to-noise ratio
  - Motifs are weak and short: 6-12 bp with 8-9 bits of information (~4 conserved bases).
  - Large target regions yield many false positives.
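A rough worked calculation shows why such weak motifs produce many chance hits. It assumes uniform base frequencies (1/4 each), independent positions, and a hypothetical 2000 bp search region per gene; these are simplifying assumptions, not figures from the study.

```latex
% One fully conserved base carries log2(4) = 2 bits under a uniform background,
% so a motif with I = 8 bits corresponds to about 4 conserved bases.
\[
  I_{\text{conserved base}} = \log_2 4 = 2 \text{ bits}
  \quad\Rightarrow\quad
  8 \text{ bits} \approx 4 \text{ conserved bases}
\]
% Expected number of chance matches on one strand of a region of length L:
\[
  E[\text{chance matches}] \approx L \cdot 2^{-I}
  = 2000 \cdot 2^{-8} \approx 7.8
\]
```

Under these assumptions, every gene is expected to carry several matches by chance alone, which is why observed match counts need to be compared against an expected count.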
Computational Motif Finding in Co-expressed Genes
1. Supervised approach: mapping known motif sites
(Flowchart)
• Input: 68 TI / 138 TII Affy IDs of marker genes (DE > 4x).
• 23 / 70 Refseqs: retrieve peptides + repeat-masked (-2000, -1) bp from annotated TSS, using EZRetrieve, SOURCE… and genome resource Ensembl/EnsMart (www.ensembl.org).
• TFblast against TRANSFAC (www.gene-regulation.com): 0 / 7 DE TFs, 6 with binding site info.
• Score matches: #matches > #expected? Background set: 200 non-DE genes.
• Yes*: candidate regulator (TF) with binding sites in DE genes.
• Biological verification (chromatin immunoprecipitation…).
Mapping Known DE TF Motifs to DE Genes

TF              Type-I           Type-II
ATF             17 (13) / 14     58 (37) / 44
V-Jun           0 (0) / 0        6 (3) / (<1)
IRF-1           4 (4) / 0        7 (6) / 4
EGR-1/EGR-2     0 (0) / 1        2 (2) / 4

Entries: #matches (#genes) / #expected matches
Motif Finding in Co-expressed Genes
2. De novo Stochastic Algorithms
(Flowchart)
• Input: 68 TI / 138 TII Affy IDs of marker genes (DE > 4x).
• 23 / 70 Refseqs: retrieve repeat-masked (-2000, -1) bp from TSS, using EZRetrieve, SOURCE… and genome resource Ensembl/EnsMart (www.ensembl.org).
• Algorithms: Gibbs Sampler, AlignACE, MEME.
• Candidate sequence motifs and sites in DE genes.
• Computational verification: match known TFBS? (TRANSFAC, www.gene-regulation.com)
• Biological verification.
Top 12 motifs by AlignACE (default parameters, background %GC = 50)
(Figure panels: Type-I; Type-II.)
Motif Finding in Co-expressed Genes
1. Mapping known motif sites.
   Input: subsets of sequences + known binding sites.
   Limited by known sites, with lots of false positives.
2. De novo motif finding algorithms.
   Input: subsets of sequences.
   Need a good filter to arrive at a good subset of sequences. Mostly stochastic, so results are harder to translate.
3. Regression methods on expression data (REDUCE: Bussemaker et al 2001).
   Input: expression data + corresponding upstream sequences.
   Usually Y ~ X, where Y: expression data and X: words/motifs.
4. Phylogenetic Footprinting/Shadowing (Vista: Loots et al 2002).
   Input: subset of upstream sequences of orthologous genes.
   Can’t find organism-specific sites (an estimated 32-40% of human sites are not functional in mouse), but this could be compensated for by using various species for resolution.
Conclusion
Workflow for this case study (flowchart):
• Biological question: alveolar TI vs TII cells
• Experimental design: Affy arrays
• Microarray experiment
• Quality measurement (Pass / Failed)
• Preprocessing: quantile normalization, RMA summarization → genes × samples/conditions matrix
• Ranking genes for DE: fold-change, moderated-t, lods
• Adjust p-values for multiple testing
• Finding TFBS for co-expressed genes
• Biological verification & interpretation: TI vs TII (D0 or D7), candidate regulator…