Source: bioinfo.vanderbilt.edu/zhanglab/lectures/AB2011Lecture12.pdf

Supervised analysis of gene expression data

Bing Zhang Department of Biomedical Informatics

Vanderbilt University

[email protected]


Gene expression

  Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product.

  For a specific cell at a specific time, only a subset of the genes coded in the genome are expressed.

  Transcriptional control is critical in gene expression regulation.

  Measurement of mRNA expression level can

    Provide a good indicator of the corresponding protein expression level

    Provide insight into the mechanisms of transcriptional regulation

Applied Bioinformatics, Spring 2011

graph courtesy of Wikipedia


Candidate gene approach vs high-throughput approach

  Advantages of high-throughput technologies

    High throughput

    Exploratory analysis

    Relationship between genes or between samples

  Challenges in high-throughput technologies

    Cost

    Data analysis


[Figure: expression of chalcone synthase, actin, and protein kinase over a time course (0, 10m, 30m, 1h, 3h, 6h, 24h), measured by Northern blot and by microarray]


High-throughput transcriptome profiling approaches

  Transcriptome: the set of all messenger RNA (mRNA) molecules, or "transcripts", produced in one cell or a population of cells.

  Hybridization-based approaches: fluorescently labeled cDNA is incubated with microarrays, and the hybridization signal is measured.

    cDNA microarrays (printed arrays)

    High-density oligo arrays (synthesized arrays)

  Sequencing-based approaches: the cDNA sequence is determined directly, and sequence counts are measured.

    Sanger sequencing of cDNA or EST libraries

    Serial Analysis of Gene Expression (SAGE)

    Massively Parallel Signature Sequencing (MPSS)

    RNA-Seq


Microarray: two-color vs single-color


Excerpt from the technology review (NATURE CELL BIOLOGY VOL 3 AUGUST 2001, http://cellbio.nature.com, E191; © 2001 Macmillan Magazines Ltd):

As array technology has advanced, more sensitive and quantitative methods for target preparation have had to be developed. In cases in which the quantity of RNA is not limited, incorporation of nucleotides coupled to fluorescent dyes during synthesis of the first strand of cDNA is the method of choice, as it provides the most linear relationship between starting material and labelled product. However, most protocols require between 25–100 µg total RNA, which is often not readily available in studies using primary cells or tissues. Various procedures have been developed to increase sensitivity and reduce the amount of RNA required. One strategy is target amplification by in vitro transcription, whereby up to 50 µg of labelled cRNA can be produced from 1 µg of mRNA. In addition, several rounds of in vitro transcription can be combined with cDNA synthesis to enhance the amplification even further (ref. 4). Using these protocols, it is even possible to profile the transcripts of a single cell (ref. 5). Another strategy is post-hybridization amplification using labelled antibodies or molecules carrying large numbers of fluorophors (ref. 6). Several studies have used target-amplification tech-

[Figure 1 diagram, panels a (cDNA microarray) and b (High-density oligonucleotide microarrays), each divided into array preparation and target preparation. Labels: cDNA collection; Insert amplification by PCR (vector-specific primers, gene-specific primers); Printing; Coupling; Denaturing; Cells/tissue; Total RNA; First-strand cDNA synthesis; Cy3 or Cy5 labelled cDNA; Mixing; Hybridization; Ratio Cy5/Cy3; mRNA reference sequence; Probe set (perfect match, mismatch); In situ synthesis by photolithography; PolyA+ RNA; cDNA synthesis; Double-stranded cDNA; T7; In vitro transcription; Biotin-labelled cRNA; Staining; Array 1; Array 2; Ratio array 1/array 2]

Figure 1 Schematic overview of probe array and target preparation for spotted cDNA microarrays and high-density oligonucleotide microarrays. a, cDNA microarrays. Array preparation: inserts from cDNA collections or libraries (such as IMAGE libraries) are amplified using either vector-specific or gene-specific primers. PCR products are printed at specified sites on glass slides using high-precision arraying robots. Through the use of chemical linkers, selective covalent attachment of the coding strand to the glass surface can be achieved. Target preparation: RNA from two different tissues or cell populations is used to synthesize single-stranded cDNA in the presence of nucleotides labelled with two different fluorescent dyes (for example, Cy3 and Cy5). Both samples are mixed in a small volume of hybridization buffer and hybridized to the array surface, usually by stationary hybridization under a coverslip, resulting in competitive binding of differentially labelled cDNAs to the corresponding array elements. High-resolution confocal fluorescence scanning of the array with two different wavelengths corresponding to the dyes used provides relative signal intensities and ratios of mRNA abundance for the genes represented on the array. b, High-density oligonucleotide microarrays. Array preparation: sequences of 16–20 short oligonucleotides (typically 25mers) are chosen from the mRNA reference sequence of each gene, often representing the most unique part of the transcript in the 5′-untranslated region. Light-directed, in situ oligonucleotide synthesis is used to generate high-density probe arrays containing over 300,000 individual elements. Target preparation: polyA+ RNA from different tissues or cell populations is used to generate double-stranded cDNA carrying a transcriptional start site for T7 DNA polymerase. During in vitro transcription, biotin-labelled nucleotides are incorporated into the synthesized cRNA molecules. Each target sample is hybridized to a separate probe array and target binding is detected by staining with a fluorescent dye coupled to streptavidin. Signal intensities of probe array element sets on different arrays are used to calculate relative mRNA abundance for the genes represented on the array.

Schulze and Downward, Nature Cell Biol, 3:E190, 2001

two-color arrays single-color arrays


Overall workflow of a microarray study

Biological question

Experiment design

Microarray experiment

Image analysis

Pre-processing

Data analysis

Hypothesis

Experimental verification

Data matrix

[Table: expression values for 15 example probe sets across six samples. Columns: Probe_Set_ID, HNE0_1, HNE0_2, HNE0_3, HNE60_1, HNE60_2, HNE60_3. Rows: 1007_s_at, 1053_at, 117_at, 121_at, 1255_g_at, 1294_at, 1316_at, 1320_at, 1405_i_at, 1431_at, 1438_at, 1487_at, 1494_f_at, 1552256_a_at, 1552257_a_at. The numeric values are not recoverable from the extracted text.]

Rows: genes. Columns: samples.


Three major goals of gene expression studies

  Class comparison (supervised analysis)

    e.g. disease biomarker discovery

    Differential expression analysis

    Input: gene expression data, class label of the samples

    Output: differentially expressed genes

  Class detection (unsupervised analysis)

    e.g. patient subgroup detection

    Clustering analysis

    Input: gene expression data

    Output: groups of similar samples or genes

  Class prediction (supervised learning)

    e.g. disease diagnosis and prognosis

    Machine learning techniques

    Input: gene expression data, class label of the samples (training data)

    Output: prediction model


Data preprocessing I: missing value imputation

  Replace with zeros

    Replace all missing values with 0

  Replace with row averages

    Replace missing values with the mean of the available values in each row (gene)

  KNN imputation

    Estimate missing values from the K nearest neighbors, i.e. the K genes with the most similar expression profiles
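The three imputation strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: the function names, the use of `None` for a missing value, the toy matrix, and the choice of root-mean-square distance over shared samples are all assumptions made for the example.

```python
import math

def impute_zero(matrix):
    """Replace every missing value (None) with 0."""
    return [[0.0 if v is None else v for v in row] for row in matrix]

def impute_row_mean(matrix):
    """Replace missing values with the mean of the observed values in the same row (gene)."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        row_mean = sum(observed) / len(observed) if observed else 0.0
        out.append([row_mean if v is None else v for v in row])
    return out

def impute_knn(matrix, k=2):
    """Replace a missing value with the mean of that sample in the k most similar genes."""
    def distance(a, b):
        # Root-mean-square difference over samples observed in both genes.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes that have a value for sample j.
            candidates = [(distance(row, other), other[j])
                          for h, other in enumerate(matrix)
                          if h != i and other[j] is not None]
            nearest = sorted(c for c in candidates if c[0] != math.inf)[:k]
            if nearest:
                out[i][j] = sum(v for _, v in nearest) / len(nearest)
    return out

# Toy matrix (made-up values): rows are genes, columns are samples.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
print(impute_zero(data)[0])
print(impute_row_mean(data)[0])
print(impute_knn(data, k=2)[0])
```

In this toy matrix the first gene closely tracks the second and third, so the KNN estimate borrows from them, while the row average only sees the gene's own two observed values.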


Data preprocessing II: normalization

  To make arrays comparable

  Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples

  Adjust using spike-in controls

  Multiply each array by a constant to make the mean (median) intensity the same for each individual array (Global normalization)

  Match the percentiles of each array (Quantile normalization)

[Figure: the same arrays shown with no normalization, global normalization, and quantile normalization]
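Global and quantile normalization can be sketched as follows. This is an illustration under stated assumptions: each array is a plain list of intensities for the same genes in the same order, the global version scales to a common median, ties are broken by position, and the function names and toy values are hypothetical.

```python
from statistics import mean, median

def global_normalize(arrays):
    """Multiply each array by a constant so that all arrays share the same median."""
    target = mean(median(a) for a in arrays)
    return [[v * target / median(a) for v in a] for a in arrays]

def quantile_normalize(arrays):
    """Give every array the same distribution by matching values rank by rank."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean across arrays of the values at each rank.
    reference = [mean(s[r] for s in sorted_arrays) for r in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda g: a[g])  # gene indices, lowest value first
        normalized = [0.0] * n
        for rank, gene in enumerate(order):
            normalized[gene] = reference[rank]
        result.append(normalized)
    return result

# Two toy arrays (made-up intensities) measured on the same three genes.
arrays = [[2.0, 4.0, 6.0], [10.0, 30.0, 20.0]]
scaled = global_normalize(arrays)
matched = quantile_normalize(arrays)
print(scaled)
print(matched)
```

Global normalization only rescales each array, so differences in the shape of the distributions remain. Quantile normalization forces the sorted values of every array to be identical while keeping each gene's rank within its array.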


Data preprocessing III: transformation

  To make the data more closely meet the assumptions of a statistical inference procedure

  log transformation to improve normality

[Figure: two histograms, "Histogram of a" (x-axis: a, y-axis: Frequency) and "Histogram of log(a)" (x-axis: log(a), y-axis: Frequency)]
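The effect of the log transformation can be checked numerically by comparing the skewness of the data before and after. The sketch below uses made-up, right-skewed intensities, and the `skewness` helper is a hypothetical name for the standardized third moment. It is only meant to illustrate the idea, not to reproduce the histograms on the slide.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Standardized third moment: 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

# Made-up raw intensities with a long right tail.
a = [50, 55, 60, 70, 80, 100, 130, 180, 250, 400]
log_a = [math.log(v) for v in a]

print("skewness of a:     ", skewness(a))
print("skewness of log(a):", skewness(log_a))
```

For these values the skewness of log(a) is closer to zero than that of a, which is the sense in which the log transform brings the data closer to the symmetry assumed by the normal distribution.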

Differential expression

[Data matrix: rows are genes, columns are samples; the sample columns are divided into two classes, Case and Control.]


Fold change

  n-fold change

    Arbitrarily selected fold change cut-offs

    Usually ≥ 2 fold

  Pros

    Intuitive

    Simple and rapid

  Cons

    Statistically inefficient: the cut-off ignores the variability within each group

    Magnitude does not necessarily indicate importance
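A fold-change filter can be sketched as below. The function names and the intensities are hypothetical, and the inputs are assumed to be on the raw (not log-transformed) scale. For data that are already log2 transformed, the log2 fold change would instead be the difference of the group means.

```python
import math
from statistics import mean

def log2_fold_change(case, control):
    """log2 of the ratio of group means, for raw-scale intensities."""
    return math.log2(mean(case) / mean(control))

def passes_cutoff(case, control, min_fold=2.0):
    """True if the gene changes by at least min_fold in either direction."""
    return abs(log2_fold_change(case, control)) >= math.log2(min_fold)

# Made-up raw intensities for three hypothetical genes.
up = ([400, 420, 380], [100, 110, 90])       # 4-fold up
down = ([50, 50, 50], [100, 100, 100])       # 2-fold down
flat = ([100, 120, 110], [100, 110, 90])     # about 1.1-fold

for name, (case, control) in (("up", up), ("down", down), ("flat", flat)):
    print(name, log2_fold_change(case, control), passes_cutoff(case, control))
```

Working on the log2 scale treats up- and down-regulation symmetrically: a 2-fold increase is +1 and a 2-fold decrease is -1. Note that the filter never looks at the spread of the replicates, which is the limitation listed under Cons.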


Statistical analysis: hypothesis testing

A statistical hypothesis is an assumption about a population parameter, e.g. group mean.

Null hypothesis: $H_0 : \mu_1 = \mu_2$

Alternative hypothesis: $H_1 : \mu_1 \neq \mu_2$

[Data matrix: rows are genes, columns are samples; the sample columns are divided into two classes, Case and Control.]


Statistical analysis: comparing means of two groups

  Parametric method

    Student’s t-test

    Assumes normal distribution of the data

  Non-parametric method

    Mann-Whitney U test

    Does not rely on data belonging to any particular distribution

    Based on ranks of observations

  Student’s t-test vs Mann-Whitney U test

    Robustness: U-test is more robust to outliers

    Efficiency: When normality holds, the efficiency of the U-test is about 0.95 when compared to the t-test. For distributions sufficiently far from normal and for sufficiently large sample sizes, the U-test can be considerably more efficient than the t-test.

Example (first three values vs last three values; the second row replaces 13.61 with the outlier 25.61):

GeneX   9.61  11.03  10.50  |  11.44  12.23  13.61    t-test: p=0.06; U test: p=0.1

GeneX   9.61  11.03  10.50  |  11.44  12.23  25.61    t-test: p=0.32; U test: p=0.1
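The GeneX example can be reproduced in part with the standard library. The sketch below computes the t statistic in its unequal-variance (Welch) form, which is an assumption about how the slide's p-values were obtained, and an exact two-sided Mann-Whitney p-value by enumerating all possible rank assignments. The function names are hypothetical, ties are assumed absent, and the enumeration is only practical for small samples. Converting the t statistic to a p-value needs the t distribution, which the standard library does not provide, so that step is left out.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """t statistic for the difference in means, allowing unequal variances."""
    return (mean(y) - mean(x)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def mann_whitney_exact_p(x, y):
    """Exact two-sided p-value from the rank sum of x (assumes no tied values)."""
    pooled = x + y
    ranks = {v: r for r, v in enumerate(sorted(pooled), start=1)}
    n, total = len(x), len(pooled)
    expected = n * (total + 1) / 2                  # rank sum of x under H0
    observed = abs(sum(ranks[v] for v in x) - expected)
    # Every way of choosing which ranks belong to x is equally likely under H0.
    all_sums = [sum(c) for c in combinations(range(1, total + 1), n)]
    extreme = sum(1 for s in all_sums if abs(s - expected) >= observed)
    return extreme / len(all_sums)

group1 = [9.61, 11.03, 10.50]
group2 = [11.44, 12.23, 13.61]
group2_outlier = [11.44, 12.23, 25.61]

print(welch_t(group1, group2), mann_whitney_exact_p(group1, group2))
print(welch_t(group1, group2_outlier), mann_whitney_exact_p(group1, group2_outlier))
```

The outlier inflates the variance of the second group, so the t statistic drops even though the difference in means grows. The ranks are unchanged, so the U test result is the same in both rows.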


Statistical tests for different types of comparisons

GOAL                                     DATA: Continuous/normal   DATA: Rank               DATA: Nominal

Compare two unpaired groups              Unpaired t-test           Mann-Whitney test        Fisher’s exact test or chi-square test

Compare two paired groups                Paired t-test             Wilcoxon test            McNemar’s test

Compare three or more groups             One-way ANOVA             Kruskal-Wallis test      Chi-square test

Association to quantitative phenotypes   Pearson’s correlation     Spearman’s correlation   Contingency coefficients


Correction for multiple testing: why?

  In an experiment with a 10,000-gene array in which the significance level is set at 0.05, about 10,000 x 0.05 = 500 genes would be expected to be called significant by chance even if none is differentially expressed.

  The probability of drawing the wrong conclusion in at least one of n independent tests is

$P(\text{wrong}) = 1 - (1 - \alpha_s)^n = \alpha_g$

where $\alpha_s$ is the significance level at the single-gene level and $\alpha_g$ is the global significance level.

n        $\alpha_s$   $\alpha_g$

1        0.05         0.05

10       0.05         0.40

100      0.05         0.99

1000     0.05         1.00

10000    0.05         1.00

[Data matrix: rows are genes, columns are samples. Each row is a test.]
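The table of global significance levels follows directly from the formula $\alpha_g = 1 - (1 - \alpha_s)^n$. A short check, with `global_alpha` as a hypothetical helper name and the tests assumed to be independent:

```python
def global_alpha(alpha_s, n):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha_s) ** n

for n in (1, 10, 100, 1000, 10000):
    # Expected number of false positives if no gene is truly differentially expressed.
    expected_false_positives = n * 0.05
    print(n, round(global_alpha(0.05, n), 2), expected_false_positives)
```

With 100 tests at a per-gene level of 0.05, at least one false positive is already almost certain.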

Correction for multiple testing: how?

  Control the family-wise error rate (FWER), the probability of making at least one type I error in the entire set (family) of hypotheses tested. e.g. standard Bonferroni correction: corrected p value = uncorrected p value x no. of genes tested (capped at 1)

  Control the false discovery rate (FDR), the expected proportion of false positives among the rejected hypotheses. e.g. Benjamini and Hochberg correction:

    Rank all genes according to their p value, smallest first

    Pick a desired FDR level, q (e.g. 5%)

    Find the largest rank i for which $p_{(i)} \le \frac{i}{m} q$ and accept all genes ranked 1 to i, where $p_{(i)}$ is the i-th smallest p value and m is the total number of genes tested.

p         Bonferroni   Rank (i)   q      (i/m)*q   significant? (FDR)

0.00003   0.0003       1          0.05   0.0050    1

0.00004   0.0004       2          0.05   0.0100    1

0.0003    0.003        3          0.05   0.0150    1

0.0008    0.008        4          0.05   0.0200    1

0.002     0.02         5          0.05   0.0250    1

0.01      0.1          6          0.05   0.0300    1

0.049     0.49         7          0.05   0.0350    0

0.23      1            8          0.05   0.0400    0

0.55      1            9          0.05   0.0450    0

0.92      1            10         0.05   0.0500    0
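Both corrections are short enough to sketch directly. The code below applies them to the ten p-values from the table. The function names are hypothetical, and the Benjamini and Hochberg version is written in its step-up form: find the largest rank whose p-value is under its threshold, then accept everything up to that rank.

```python
def bonferroni(p_values):
    """Corrected p value = p x number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values, q=0.05):
    """Return True for each test accepted at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        significant[idx] = rank <= largest_rank
    return significant

# p-values from the table above.
p = [0.00003, 0.00004, 0.0003, 0.0008, 0.002, 0.01, 0.049, 0.23, 0.55, 0.92]

print(bonferroni(p))
print(benjamini_hochberg(p, q=0.05))
```

At a cut-off of 0.05, Bonferroni keeps the first five genes while the FDR procedure keeps six, which illustrates that controlling the FWER is the more conservative choice.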


Resources

  Data sources

    Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/

    ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

  Microarray data analysis tools

    Bioconductor: http://www.bioconductor.org/

    Expression Profiler: http://www.ebi.ac.uk/expressionprofiler/


Summary

  Three major goals of gene expression studies

    Class comparison

    Class detection

    Class prediction

  Gene expression data pre-processing steps

    Missing data imputation

    Normalization

    Transformation

  Statistical tests for two group comparative studies

    Student’s t-test

    Mann-Whitney U test

  Multiple-test adjustment

    Control the family-wise error rate (FWER)

    Control the false discovery rate (FDR)


Exercise

  Data set: james_west_2005_hne_6h_60vs0.txt (or james_west_2005_hne_6h_60vs0_head100.txt)

  54675 probe sets (or the top 100 probe sets)

  Two groups (HNE0 and HNE60, three replicates in each group)

  No missing values; already normalized; already log transformed

  Use the t-test in Expression Profiler (http://www.ebi.ac.uk/expressionprofiler) or Excel to identify genes that are differentially expressed between the two groups.

  Apply multiple-test adjustment to the raw p-values
