Supervised analysis of gene expression...
Transcript of Supervised analysis of gene expression...
Supervised analysis of gene expression data
Bing Zhang Department of Biomedical Informatics
Vanderbilt University
Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product.
For a specific cell at a specific time, only a subset of the genes coded in the genome are expressed.
Transcriptional control is critical in gene expression regulation.
Measure of mRNA expression level can Provide a good indicator of corresponding
protein expression level
Provide insight on the mechanisms of transcriptional regulation
Applied Bioinformatics, Spring 2011
graph courtesy of Wikipedia
Candidate gene approach vs high-throughput approach
Advantages of high-throughput technologies High-throughput
Exploratory analysis
Relationship between genes or between samples
Challenges in high-throughput technologies Cost
Data analysis
Applied Bioinformatics, Spring 2011
0 10m 30m 1h 3h 6h 24h Chalcone synthase
Actin
Protein kinase
Northern 10m 30m 1h 3h 6h 24h
Microarray
High-throughput transcriptome profiling approaches
Transcriptome: the set of all messenger RNA (mRNA) molecules, or "transcripts”, produced in one or a population of cells.
Hybridization based approaches: incubating fluorescently labeled cDNA with microarrays. Hybridization signal is measured. cDNA microarray (printed arrays)
High density olio arrays (synthesized arrays)
Sequencing based approaches: directly determine the cDNA sequence. Count is measured. Sanger sequencing of cDNA or EST libraries
Serial Analysis of Gene Expression (SAGE)
Massively Parallel Signature Sequencing (MPSS)
RNA-Seq
Applied Bioinformatics, Spring 2011
Microarray: two-color vs single-color
Applied Bioinformatics, Spring 2011
technology review
NATURE CELL BIOLOGY VOL 3 AUGUST 2001 http://cellbio.nature.com E191
As array technology has advanced, more sensitive and quantita-tive methods for target preparation have had to be developed. Incases in which the quantity of RNA is not limited, incorporationof nucleotides coupled to fluorescent dyes during synthesis of thefirst strand of cDNA is the method of choice, as it provides themost linear relationship between starting material and labelledproduct. However, most protocols require between 25–100 µgtotal RNA, which is often not readily available in studies using pri-mary cells or tissues. Various procedures have been developed to
increase sensitivity and reduce the amount of RNA required. Onestrategy is target amplification by in vitro transcription, whereby upto 50 µg of labelled cRNA can be produced from 1 µg of mRNA. Inaddition, several rounds of in vitro transcription can be combinedwith cDNA synthesis to enhance the amplification even further4.Using these protocols, it is even possible to profile the transcripts ofa single cell5. Another strategy is post-hybridization amplificationusing labelled antibodies or molecules carrying large numbers offluorophors6. Several studies have used target-amplification tech-
TTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTT
TTTTTTTT
TTTTTTTT
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
Hybridizationmixing
First-strand cDNAsynthesis
cDNA collection
Insert amplification by PCRVector-specific primersGene-specific primers
PrintingCoupling
Denaturing
Ratio Cy5/Cy3
Cy3 or Cy5labelled cDNA
Total RNA
Cells/tissue
Targ
et p
repa
ratio
nA
rray
pre
para
tion
AAAAAAAAA
AAAAAAAATTTTTTTT
AAAAAAAATTTTTTTT
mRNA referncesequence
Probe set
In situ synthesisby photolithography
Ratio array 1/array 2
Biotin-labelledcRNA
Double-strandedcDNA
PolyA+ RNA
Cells/tissue
Staininghybridization
In vitro transcription
a b
Cy3 Cy5
cDNA microarray High-density oligonucleotide microarrays
Perfect matchMismatch
cDNA synthesis
Array 1Array 2
T7
T7
Figure 1 Schematic overview of probe array and target preparation for spottedcDNA microarrays and high-density oligonucleotide microarrays. a, cDNA microar-rays. Array preparation: inserts from cDNA collections or libraries (such as IMAGElibraries) are amplified using either vector-specific or gene-specific primers. PCRproducts are printed at specified sites on glass slides using high-precision arrayingrobots. Through the use of chemical linkers, selective covalent attachment of thecoding strand to the glass surface can be achieved. Target preparation: RNA fromtwo different tissues or cell populations is used to synthesize single-stranded cDNAin the presence of nucleotides labelled with two different fluorescent dyes (for exam-ple, Cy3 and Cy5). Both samples are mixed in a small volume of hybridization bufferand hybridized to the array surface, usually by stationary hybridization under a cover-slip, resulting in competitive binding of differentially labelled cDNAs to the correspon-ding array elements. High-resolution confocal fluorescence scanning of the array withtwo different wavelengths corresponding to the dyes used provides relative signal
intensities and ratios of mRNA abundance for the genes represented on the array. b, High-density oligonucleotide microarrays. Array preparation: sequences of 16–20short oligonucleotides (typically 25mers) are chosen from the mRNA referencesequence of each gene, often representing the most unique part of the transcript inthe 5!-untranslated region. Light-directed, in situ oligonucleotide synthesis is used togenerate high-density probe arrays containing over 300,000 individual elements.Target preparation: polyA+ RNA from different tissues or cell populations is used togenerate double-stranded cDNA carrying a transcriptional start site for T7 DNA poly-merase. During in vitro transcription, biotin-labelled nucleotides are incorporated intothe synthesized cRNA molecules. Each target sample is hybridized to a separateprobe array and target binding is detected by staining with a fluorescent dye coupledto streptavidin. Signal intensities of probe array element sets on different arrays areused to calculate relative mRNA abundance for the genes represented on the array.
© 2001 Macmillan Magazines Ltd
technology review
NATURE CELL BIOLOGY VOL 3 AUGUST 2001 http://cellbio.nature.com E191
As array technology has advanced, more sensitive and quantita-tive methods for target preparation have had to be developed. Incases in which the quantity of RNA is not limited, incorporationof nucleotides coupled to fluorescent dyes during synthesis of thefirst strand of cDNA is the method of choice, as it provides themost linear relationship between starting material and labelledproduct. However, most protocols require between 25–100 µgtotal RNA, which is often not readily available in studies using pri-mary cells or tissues. Various procedures have been developed to
increase sensitivity and reduce the amount of RNA required. Onestrategy is target amplification by in vitro transcription, whereby upto 50 µg of labelled cRNA can be produced from 1 µg of mRNA. Inaddition, several rounds of in vitro transcription can be combinedwith cDNA synthesis to enhance the amplification even further4.Using these protocols, it is even possible to profile the transcripts ofa single cell5. Another strategy is post-hybridization amplificationusing labelled antibodies or molecules carrying large numbers offluorophors6. Several studies have used target-amplification tech-
TTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTT
TTTTTTTT
TTTTTTTT
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAA
Hybridizationmixing
First-strand cDNAsynthesis
cDNA collection
Insert amplification by PCRVector-specific primersGene-specific primers
PrintingCoupling
Denaturing
Ratio Cy5/Cy3
Cy3 or Cy5labelled cDNA
Total RNA
Cells/tissue
Targ
et p
repa
ratio
nA
rray
pre
para
tion
AAAAAAAAA
AAAAAAAATTTTTTTT
AAAAAAAATTTTTTTT
mRNA referncesequence
Probe set
In situ synthesisby photolithography
Ratio array 1/array 2
Biotin-labelledcRNA
Double-strandedcDNA
PolyA+ RNA
Cells/tissue
Staininghybridization
In vitro transcription
a b
Cy3 Cy5
cDNA microarray High-density oligonucleotide microarrays
Perfect matchMismatch
cDNA synthesis
Array 1Array 2
T7
T7
Figure 1 Schematic overview of probe array and target preparation for spottedcDNA microarrays and high-density oligonucleotide microarrays. a, cDNA microar-rays. Array preparation: inserts from cDNA collections or libraries (such as IMAGElibraries) are amplified using either vector-specific or gene-specific primers. PCRproducts are printed at specified sites on glass slides using high-precision arrayingrobots. Through the use of chemical linkers, selective covalent attachment of thecoding strand to the glass surface can be achieved. Target preparation: RNA fromtwo different tissues or cell populations is used to synthesize single-stranded cDNAin the presence of nucleotides labelled with two different fluorescent dyes (for exam-ple, Cy3 and Cy5). Both samples are mixed in a small volume of hybridization bufferand hybridized to the array surface, usually by stationary hybridization under a cover-slip, resulting in competitive binding of differentially labelled cDNAs to the correspon-ding array elements. High-resolution confocal fluorescence scanning of the array withtwo different wavelengths corresponding to the dyes used provides relative signal
intensities and ratios of mRNA abundance for the genes represented on the array. b, High-density oligonucleotide microarrays. Array preparation: sequences of 16–20short oligonucleotides (typically 25mers) are chosen from the mRNA referencesequence of each gene, often representing the most unique part of the transcript inthe 5!-untranslated region. Light-directed, in situ oligonucleotide synthesis is used togenerate high-density probe arrays containing over 300,000 individual elements.Target preparation: polyA+ RNA from different tissues or cell populations is used togenerate double-stranded cDNA carrying a transcriptional start site for T7 DNA poly-merase. During in vitro transcription, biotin-labelled nucleotides are incorporated intothe synthesized cRNA molecules. Each target sample is hybridized to a separateprobe array and target binding is detected by staining with a fluorescent dye coupledto streptavidin. Signal intensities of probe array element sets on different arrays areused to calculate relative mRNA abundance for the genes represented on the array.
© 2001 Macmillan Magazines Ltd
Schulze and Downward, Nature Cell Biol, 3:E190, 2001
two-color arrays single-color arrays
Overall workflow of a microarray study
Microarray experiment
Biological question
Experiment design
Image analysis
Pre-processing
Data Analysis
Hypothesis Experimental verification
Applied Bioinformatics, Spring 2011
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$
Data matrix
Applied Bioinformatics, Spring 2011
Gen
es
Samples
Three major goals of gene expression studies
Class comparison (supervised analysis) e.g. disease biomarker discovery
Differential expression analysis
Input: gene expression data, class label of the samples
Output: differentially expressed genes
Class detection (unsupervised analysis) e.g. patient subgroup detection
Clustering analysis
Input: gene expression data
Output: groups of similar samples or genes
Class prediction (supervised learning) e.g. disease diagnosis and prognosis
Machine learning techniques
Input: gene expression data, class label of the samples (training data)
Output: prediction model
Applied Bioinformatics, Spring 2011
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$
Data preprocessing I: missing value imputation
Replace with zeros Replace all missing values with 0
Replace with row averages Replace missing values with mean of available values in each row
(gene)
KNN imputation Estimate missing values via the K-nearest neighbors analysis
Applied Bioinformatics, Spring 2011
Data preprocessing II: normalization
To make arrays comparable
Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples
Adjust using spike control
Multiply each array by a constant to make the mean (median) intensity the same for each individual array (Global normalization)
Match the percentiles of each array (Quantile normalization)
Applied Bioinformatics, Spring 2011
No normalization Global normalization Quantile normalization
Data preprocessing III: transformation
To make the data more closely meet the assumptions of a statistical inference procedure
log transformation to improve normality
Applied Bioinformatics, Spring 2011
Histogram of a
a
Freque
ncy
50 150 250
050
100150
Histogram of log(a)
log(a)
Freque
ncy
3.5 4.0 4.5 5.0 5.5
050
100150
200
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$
Differential expression
Applied Bioinformatics, Spring 2011
Case Control
Gen
es
Samples
Fold change
n-fold change Arbitrarily selected fold change cut-offs
Usually ≥ 2 fold
Pros Intuitive
Simple and rapid
Cons Statistically inefficient
Magnitude does not necessarily indicate importance
Applied Bioinformatics, Spring 2011
Statistical analysis: hypothesis testing
Applied Bioinformatics, Spring 2011
Null hypothesis
Alternative hypothesis
!
H0 : µ1 = µ2
!
H1 : µ1 " µ2
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$
Case Control
Gen
es Samples
A statistical hypothesis is an assumption about a population parameter, e.g. group mean.
Statistical analysis: comparing means of two groups
Parametric method Student’s t-test
Assumes normal distribution of the data
Non-parametric method Mann-Whitney U test
Does not rely on data belonging to any particular distribution
Based on ranks of observations
Student’s t-test vs Mann-Whitney U test Robustness: U-test is more robust to outliers
Efficiency: When normality holds, the efficiency of the U-test is about 0.95 when compared to the t-test. For distributions sufficiently far from normal and for sufficiently large sample sizes, the U-test can be considerably more efficient than the t-test.
Applied Bioinformatics, Spring 2011
t-test: p=0.06; U test: p=0.1 GeneX 9.61 11.03 10.50 11.44 12.23 13.61
GeneX 9.61 11.03 10.50 11.44 12.23 25.61
t-test: p=0.32; U test: p=0.1
Statistical tests for different types of comparisons
Applied Bioinformatics, Spring 2011
DATA Continuous/normal
Rank Nominal
GOAL
Compare two unpaired groups
Unpaired t-test
Mann-Whitney test
Fisher’s exact test or chi-square test
Compare two paired groups
Paired t-test Wilcoxon test
McNemar’s test
Compare three or more groups
One-way ANOVA
Kruskal-Wallis test
Chi-square test
Association to quantitative phenotypes
Pearson’s correlation
Spearman’s correlation
Contingency coefficients
Correction for multiple testing: why?
In an experiment with a 10,000-gene array in which the significance level p is set at 0.05, 10,000 x 0.05 = 500 genes would be inferred as significant even though none is differentially expressed
The probability of drawing the wrong conclusion in at least one of the n different test is
Where is the significance level at single gene level, and is the global significance level.
Applied Bioinformatics, Spring 2011
!
P(wrong) =1" (1"# s)n = #g
!
"g
!
" s
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$
Eac
h ro
w is
a te
st
1 0.05 0.05 10 0.05 0.40 100 0.05 0.99 1000 0.05 1.00 10000 0.05 1.00 !
" s
!
"gn
Correction for multiple testing: how?
Control the family-wise error rate (FWER), the probability that there is a single type I error in the entire set (family) of hypotheses tested. e.g. Standard Bonferroni Correction: uncorrected p value x no. of genes tested
Control the false discovery rate (FDR), the expected proportion of false positives among the number of rejected hypotheses. e.g. Benjamini and Hochberg correction. Ranking all genes according to their p value
Picking a desired FDR level, q (e.g. 5%)
Starting from the top of the list, accept all genes with , where i is the number of genes accepted so far, and m is the total number of genes tested.
Applied Bioinformatics, Spring 2011
!
p " imq
p Bonferroni Rank (i) q (i/m)*q significant? 0.00003 0.0003 1 0.05 0.0050 1 0.00004 0.0004 2 0.05 0.0100 1 0.0003 0.003 3 0.05 0.0150 1 0.0008 0.008 4 0.05 0.0200 1 0.002 0.02 5 0.05 0.0250 1 0.01 0.1 6 0.05 0.0300 1 0.049 0.49 7 0.05 0.0350 0 0.23 1 8 0.05 0.0400 0 0.55 1 9 0.05 0.0450 0 0.92 1 10 0.05 0.0500 0
Resources
Data source Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
Microarray data analysis tools Bioconductor: http://www.bioconductor.org/
Expression profiler: http://www.ebi.ac.uk/expressionprofiler/
Applied Bioinformatics, Spring 2011
Summary
Three major goals of gene expression studies Class comparison
Class detection
Class prediction
Gene expression data pre-processing steps Missing data imputation
Normalization
Transformation
Statistical tests for two group comparative studies Student’s t-test
Mann-Whitney U test
Multiple-test adjustment Control the family-wise error rate (FWER)
Control the false discovery rate (FDR)
Applied Bioinformatics, Spring 2011
Exercise
Data set: james_west_2005_hne_6h_60vs0.txt (or james_west_2005_hne_6h_60vs0_head100.txt)
54675 probe sets (or the top 100 probe sets)
Two groups (HNE0 and HNE60, three replicates in each group)
No missing value; Already normalized; Already log transformed
Use t-test in expression profiler (http://www.ebi.ac.uk/expressionprofiler) or excel to identify genes that are differentially expressed between the two groups.
Apply multiple test adjustment on the raw p-values
Applied Bioinformatics, Spring 2011
!"#$%&'%(&)* +,-.&/ +,-.&0 +,-.&1 +,-2.&/ +,-2.&0 +,-2.&1/..3&'&4( !"#!!! !"$%&$ !"$'() !"$')& !"$#&' !"*%(*/.51&4( +")$$! +")!*$ +"'&+' +"&))) +")&%' +"&'+'//3&4( ("%(%% ("%%*' #"+%'( +"%')' !"#*!& +"&##*/0/&4( +"()(' +"(''% +"#)&% +"($!) +"('&& +"(*'$/055&6&4( '"&!%) '"'##+ '"&*#% '"*(%% '"'$(* '"&+(+/078&4( #"*$$# #"&*!) #"&%$* #"'&+% #"$%(' #"&(()/1/2&4( #"$($+ #"$**% #"'(%+ #"##*# #"#'*! #"'#!!/10.&4( #"$'+( #"$*!! #"$')% #"##%$ #"$+!( #"(&*#/8.5&)&4( '"*&#% '"'#'% '")'*! '"*'#& '"*!(# '"#!'+/81/&4( $"&)+) $"&%(% $"&#$( $"&!&* $"&$&& $")!%!/819&4( ("%)$$ #"+*$+ #"+&') ("%&'! ("%)'& ("%+()/893&4( !"#*#) !"'!(+ !"''+! !"''(% !"$*)) !"'&&$/878&:&4( ("*&+# ("*+%) ("%!!# ("&#'! ("#%$! ("&+'+/550052&4&4( )%"#&'$ )%"$&*$ )%"#$&& )%"'&%$ )%"&*'' )%"*)''/550053&4&4( )%"*&&' )%")('+ )%")++& )%"&'#' )%"&)+) )%"&'%$