Part 5 of RNA-seq for DE analysis: Detecting differential expression

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof. RNA-seq for DE analysis training Detecting differentially expressed genes Joachim Jacob 22 and 24 April 2014

description

Fifth part of the training session 'RNA-seq for Differential expression analysis'. We explain the most important concepts of detecting differential expression based on a count table, and explain the DESeq2 algorithm. Interested in following this session? Please contact http://www.jakonix.be/contact.html

Transcript of Part 5 of RNA-seq for DE analysis: Detecting differential expression

Page 1: Part 5 of RNA-seq for DE analysis: Detecting differential expression

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

RNA-seq for DE analysis training

Detecting differentially expressed genes

Joachim Jacob

22 and 24 April 2014

Page 2: Part 5 of RNA-seq for DE analysis: Detecting differential expression

2 of 44

Bioinformatics analysis will take most of your time

Quality control (QC) of raw reads

Preprocessing: filtering of reads and read parts, to help our goal of differential detection.

QC of preprocessing

Mapping to a reference genome (alternative: to a transcriptome)

QC of the mapping

Count table extraction

QC of the count table

DE test

Biological insight

Page 3: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Goal: get me some DE genes!

Based on a raw count table, we want to detect differentially expressed genes between conditions of interest.

We will assign to each gene a p-value (0-1), which shows us 'how surprised we should be' to see this difference, when we assume there is no difference.

p-value scale from 0 to 1:

Close to 0: very big chance there is a difference

Close to 1: very small chance there is a real difference

Page 4: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Goal

Every single decision we have taken in previous analysis steps was made to improve this outcome of detecting DE genes.

Page 5: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Raw counts to DE genes

Page 6: Part 5 of RNA-seq for DE analysis: Detecting differential expression

DE detection tools from count tables

http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Detecting_differential_expression_by_count_analysis

Page 7: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Algorithms under active development

http://genomebiology.com/2010/11/10/r106

Page 8: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Intuition: how to detect DE?

gene_id CAF0006876

Condition A:

sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
23171    22903    29227    24072    23151    26336    25252    24122

Condition B:

sample9  sample10  sample11  sample12  sample13  sample14  sample15  sample16
19527    26898     18880     24237     26640     22315     20952     25629

Variability X

Variability Y

Compare and conclude, given a mean (or 'base') level: similar or not?

Page 9: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Intuition

gene_id CAF0006876

Condition A:

sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
23171    22903    29227    24072    23151    26336    25252    24122

Condition B:

sample9  sample10  sample11  sample12  sample13  sample14  sample15  sample16
19527    26898     18880     24237     26640     22315     20952     25629

Page 10: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Intuition – model is fitted

gene_id CAF0006876

Condition A:

sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
23171    22903    29227    24072    23151    26336    25252    24122

Condition B:

sample9  sample10  sample11  sample12  sample13  sample14  sample15  sample16
19527    26898     18880     24237     26640     22315     20952     25629

NB model is estimated

Page 11: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Intuition – difference is quantified

gene_id CAF0006876

Condition A:

sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
23171    22903    29227    24072    23151    26336    25252    24122

Condition B:

sample9  sample10  sample11  sample12  sample13  sample14  sample15  sample16
19527    26898     18880     24237     26640     22315     20952     25629

NB model is estimated: 2 parameters are needed, mean and dispersion.

The difference is put into a p-value.
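
As a rough illustration of this intuition, the sketch below takes the counts of gene CAF0006876 from the slide and computes the two NB parameters per condition. It assumes that samples 1 to 8 are condition A and samples 9 to 16 are condition B, uses the raw (not normalized) counts, and uses a simple method-of-moments estimate (variance = mean + dispersion * mean^2). This is not how DESeq2 estimates the dispersion; the function name nb_moments is hypothetical and only shows what the two parameters represent.

```python
import math
from statistics import mean, variance

# Counts for gene CAF0006876 as listed on the slide.
# Assumption: samples 1-8 are condition A, samples 9-16 are condition B.
condition_a = [23171, 22903, 29227, 24072, 23151, 26336, 25252, 24122]
condition_b = [19527, 26898, 18880, 24237, 26640, 22315, 20952, 25629]


def nb_moments(counts):
    """Return (mean, dispersion) using var = mean + dispersion * mean**2."""
    mu = mean(counts)
    var = variance(counts)  # sample variance
    dispersion = max((var - mu) / mu ** 2, 0.0)
    return mu, dispersion


mean_a, disp_a = nb_moments(condition_a)
mean_b, disp_b = nb_moments(condition_b)
log2_fold_change = math.log2(mean_b / mean_a)

print(f"A: mean={mean_a:.2f} dispersion={disp_a:.4f}")
print(f"B: mean={mean_b:.2f} dispersion={disp_b:.4f}")
print(f"log2 fold change (B vs A) = {log2_fold_change:.3f}")
```

Whether such a difference in means is convincing depends on how large it is compared with the variability within each condition, which is what the dispersion captures.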

Page 12: Part 5 of RNA-seq for DE analysis: Detecting differential expression

But counts are dependent on

The read counts of a gene in different conditions depend on (see first part):

1. Chance (NB model)
2. Expression level
3. Library size (number of reads in that library)
4. Length of transcript
5. GC content of the genes

Page 13: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Normalize for library size

Assumption: most genes are not DE between samples. For every sample, DESeq calculates the 'effective library size' by means of a scale factor.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/

Page 14: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Normalize for library size

DESeq computes a scaling factor for a given sample by computing the median of the ratio, for each gene, of its read count over its geometric mean across all samples. It then uses the assumption that most genes are not DE and uses this median of ratios to obtain the scaling factor associated with this sample.

Original library size * scale factor = effective library size

DESeq will divide the original counts by the sample scaling factor to obtain normalized counts.

DESeq: This normalization method [14] is included in the DESeq Bioconductor package (version 1.6.0) [14] and is based on the hypothesis that most genes are not DE. A DESeq scaling factor for a given lane is computed as the median of the ratio, for each gene, of its read count over its geometric mean across all lanes. The underlying idea is that non-DE genes should have similar read counts across samples, leading to a ratio of 1. Assuming most genes are not DE, the median of this ratio for the lane provides an estimate of the correction factor that should be applied to all read counts of this lane to fulfill the hypothesis.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3426807/

Page 15: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Normalize for library size

Geom Mean: 24595, ... (one value per gene)

[Table of ratios: one row per gene, one column per sample, with values between about 0.8 and 1.2.]

1. Divide each count by the geometric mean of its gene.

2. Take the median per column.
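
The two steps above can be sketched in a few lines of Python. The count table, the gene names and the function name size_factors are hypothetical; the sketch only follows the median-of-ratios description from the slides and skips genes with a zero count, because their geometric mean is zero.

```python
import math
from statistics import median

# Hypothetical toy count table: rows are genes, columns are 3 samples.
counts = {
    "geneA": [100, 200, 50],
    "geneB": [10, 20, 5],
    "geneC": [1000, 2000, 500],
    "geneD": [30, 60, 150],  # behaves differently in sample 3 (a "DE" gene)
    "geneE": [40, 80, 20],
}


def size_factors(table):
    """Median-of-ratios scaling factor per sample (column)."""
    n_samples = len(next(iter(table.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in table.values():
        if min(row) == 0:
            continue  # geometric mean would be 0, so the gene is skipped
        geo_mean = math.exp(sum(math.log(c) for c in row) / len(row))
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]


factors = size_factors(counts)
# Normalized counts: each count divided by the factor of its sample.
normalized = {
    gene: [c / f for c, f in zip(row, factors)]
    for gene, row in counts.items()
}
print([round(f, 3) for f in factors])  # prints [1.0, 2.0, 0.5]
```

Because the median is used, the single gene that behaves differently in sample 3 does not disturb the scaling factors, which is the point of the assumption that most genes are not DE.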

Page 16: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Other normalisations

● edgeR: TMM, trimmed mean of M-values

In the end, the algorithms perform the normalization internally and simply continue.

Page 17: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Dispersion estimation

● For every gene, an NB is fitted based on the counts. The most important factor in that model to be estimated is the dispersion.

● DESeq2 applies three steps:
  ● Estimates the dispersion parameter for each gene
  ● Plots and fits a curve
  ● Adjusts the dispersion parameter towards the curve ('shrinking')

Page 18: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Dispersion estimation

1. Black dots: estimated from normalized data.

2. Red line: curve fitted.

3. Blue dots: final assigned dispersion parameter for that gene.

Model is fit!
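
To make the idea of 'shrinking' concrete, here is a deliberately simplified sketch. It assumes a curve of the form dispersion = asymptotic + extra / mean with made-up parameter values, and it moves each gene-wise estimate towards the curve with a fixed weight on the log scale. DESeq2 itself decides per gene how strongly to shrink, so the function names, the weight and all numbers below are hypothetical.

```python
import math


def trend_dispersion(mean_count, asymptotic=0.05, extra=4.0):
    """Hypothetical fitted curve: dispersion falls as the mean count rises."""
    return asymptotic + extra / mean_count


def shrink(gene_estimate, trend_value, weight=0.5):
    """Move the gene-wise estimate towards the curve on the log scale.

    weight=0 keeps the gene-wise estimate, weight=1 returns the curve value.
    """
    log_value = (1 - weight) * math.log(gene_estimate) + weight * math.log(trend_value)
    return math.exp(log_value)


# Hypothetical genes: (mean normalized count, gene-wise dispersion estimate)
genes = {"geneA": (50.0, 0.40), "geneB": (500.0, 0.02), "geneC": (5000.0, 0.10)}

for name, (mean_count, gene_estimate) in genes.items():
    fitted = trend_dispersion(mean_count)
    final = shrink(gene_estimate, fitted)
    print(f"{name}: gene-wise={gene_estimate:.3f} curve={fitted:.3f} final={final:.3f}")
```

In terms of the plot: gene_estimate plays the role of a black dot, trend_dispersion of the red line, and the value returned by shrink of a blue dot.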

Page 19: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Test is run between conditions

If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot).

Significant (p-value < 0.01)

Not significant

MA-plot: mean of counts versus the log2 fold change between 2 conditions.
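
A Wald test, in its general form, divides an estimate by its standard error and compares the result with a standard normal distribution. The sketch below applies that to a log2 fold change; the function name and the example numbers are hypothetical, and the estimation of the fold change and its standard error is left out.

```python
from statistics import NormalDist


def wald_p_value(log2_fold_change, standard_error):
    """Two-sided p-value for the hypothesis 'log2 fold change = 0'."""
    z = log2_fold_change / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical genes with the same fold change but different uncertainty.
print(wald_p_value(1.0, 0.25))  # small standard error: very small p-value
print(wald_p_value(1.0, 1.00))  # large standard error: not significant
```

The same fold change can therefore be significant for one gene and not for another, depending on how precisely it is estimated.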

Page 20: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Test is run between conditions

If 2 conditions are compared, for each gene 2 NB models (one for each condition) are made, and a test (Wald test) decides whether the difference is significant (red in plot).

This means that we are going to perform 1000's of tests.

If we set a cut-off on the p-value of 0.01 and we have performed 20000 tests (= genes), up to about 200 genes (20000 x 0.01) that do not differ are expected to turn up significant by chance alone.
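
The arithmetic can be checked with a small simulation. The sketch assumes that none of the genes differ, so that every p-value is drawn uniformly between 0 and 1; the function names and the seed are arbitrary.

```python
import random


def expected_false_positives(n_tests, cutoff):
    """Expected number of tests below the cut-off when no gene differs."""
    return n_tests * cutoff


def simulate_null(n_tests, cutoff, seed=0):
    """Draw uniform p-values (no real differences) and count 'significant' ones."""
    rng = random.Random(seed)
    return sum(rng.random() < cutoff for _ in range(n_tests))


print(expected_false_positives(20000, 0.01))  # the 200 from the slide
print(simulate_null(20000, 0.01))             # a simulated count close to 200
```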

Page 21: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Check the distribution of p-values

An enrichment (smaller or bigger) should be seen at low p-values.

Other p-values should not show a trend.

The histogram of the p-values must look like the one below. If not, the test is not reliable. Perhaps the NB fitting step did not succeed, or confounding variables are present.
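
For readers who want to inspect this shape outside the report, a histogram can be produced with a few lines of Python. The p-values below are simulated (900 genes without a difference and 100 genes with a clear difference), so the numbers are hypothetical and only show what a well-behaved histogram looks like.

```python
import random


def p_value_histogram(p_values, n_bins=10):
    """Count p-values per equal-width bin between 0 and 1."""
    bins = [0] * n_bins
    for p in p_values:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index] += 1
    return bins


# Hypothetical mixture: 900 genes without a difference (uniform p-values)
# and 100 genes with a real difference (p-values close to 0).
rng = random.Random(1)
p_values = [rng.random() for _ in range(900)]
p_values += [rng.random() * 0.01 for _ in range(100)]

for i, count in enumerate(p_value_histogram(p_values)):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * (count // 10)}")
```

The first bin stands out, and the remaining bins are roughly flat.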

Page 22: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Confounded distribution of p-values

Page 23: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Improve test results

You set a cut-off of 0.05. Of the genes that pass it, a fraction is false positive and a fraction is correctly identified as DE.

Page 24: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Improve test results

We can improve testing by 2 measures:

● avoid testing: apply a filtering before testing, an independent filtering.

● apply a multiple testing correction

Page 25: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Avoid testing by independent filtering

Some scientists just remove genes with mean counts in the samples <10. But there is a more formal method to remove genes, in order to reduce the testing.

http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/

From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf

Page 26: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Avoid testing by independent filtering

Left: a scatter plot of mean counts versus transformed p-values. The red line depicts a cut-off of 0.1. Note that genes with lower counts do not reach the p-value threshold. Some of them are safe to exclude from testing.

Page 27: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Avoid testing by independent filtering

If we filter out increasingly bigger portions of genes based on their mean counts, the number of significant genes increases.

Page 28: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Avoid testing by independent filtering

See later (slide 30)

Choose the variable of interest. You can run it once on all to check the outcome.

http://www.bioconductor.org/help/course-materials/2012/Bressanone2012/

From this collection, read 2012-07-04-Huber-Multiple-testing-independent-filtering.pdf

Page 29: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Avoid testing by independent filtering

Page 30: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Packages for independent filtering

HTSFilter is a package especially developed for independent filtering in a non-arbitrary way.

In our Galaxy, during the exercises, you will be using another approach.

http://www.bioconductor.org/packages/release/bioc/html/HTSFilter.html

Page 31: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Multiple testing correction

Automatically performed and reported in results: Benjamini/Hochberg correction, to control false discovery rate (FDR).

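FDR is the expected fraction of false positives among the genes that are classified as DE.

If we set a threshold α of 0.05 on the uncorrected p-values, about 5% of all tested genes that are not DE will still pass it, so the fraction of false positives in the final list can be much higher than 5%. If we apply FDR correction at 0.05, about 5% of the genes in the final list are expected to be false positives.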
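
The Benjamini/Hochberg correction itself is short enough to write out. The sketch below implements the standard procedure and applies it twice to a hypothetical list of genes: once to all genes, and once after removing the genes with a mean count below 10, to show why independent filtering before testing can increase the number of significant genes. The gene values and the threshold of 10 are made up for the illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini/Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def count_significant(p_values, fdr=0.05):
    """Number of genes with an adjusted p-value below the FDR cut-off."""
    return sum(q < fdr for q in benjamini_hochberg(p_values))


# Hypothetical genes: (mean count, uncorrected p-value)
genes = [
    (850, 0.001), (1200, 0.008), (430, 0.02), (975, 0.03),
    (3, 0.5), (5, 0.6), (2, 0.7), (8, 0.8), (1, 0.9), (4, 0.95),
]

all_p = [p for _, p in genes]
kept_p = [p for mean_count, p in genes if mean_count >= 10]

print(count_significant(all_p))   # prints 2: all 10 genes tested
print(count_significant(kept_p))  # prints 4: low-count genes removed first
```

The low-count genes had no chance of being called DE, but each of them still counted as a test and made the correction stricter for the other genes.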

Page 32: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Including influencing factors

Through a generalized linear model (GLM), the influencing factors are modeled to predict the counts. The factors come from the sample descriptions file.

Yeast (=WT)

GDA (=G)

Yeast mutant (=UPC)

GDA + vit C (=AG)

Additional metadata (batch factor)

Day 1  Day 1  Day 2  Day 2

Page 33: Part 5 of RNA-seq for DE analysis: Detecting differential expression

DESeq2 to detect DE genes

We provide a combination of factors (the model, GLM) which influence the counts. Every factor should match the column name in the sample descriptions.

The levels of the factors corresponding to the 'base' or 'no perturbation'.

The fraction filtered out, determined by the independent filter tool.

Adjusted p-value cut-off
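
What 'factors' and 'base levels' mean for the model can be shown by writing out the design matrix that a GLM works with. Everything in the sketch is hypothetical: the column names (strain, treatment, day), the sample names, and the choice of WT, G and Day1 as base levels are assumptions for illustration, not the settings of the Galaxy tool.

```python
# Hypothetical sample descriptions: one row per sample, one column per factor.
samples = [
    {"sample": "s1", "strain": "WT", "treatment": "G", "day": "Day1"},
    {"sample": "s2", "strain": "WT", "treatment": "G", "day": "Day2"},
    {"sample": "s3", "strain": "WT", "treatment": "AG", "day": "Day1"},
    {"sample": "s4", "strain": "WT", "treatment": "AG", "day": "Day2"},
    {"sample": "s5", "strain": "UPC", "treatment": "G", "day": "Day1"},
    {"sample": "s6", "strain": "UPC", "treatment": "G", "day": "Day2"},
    {"sample": "s7", "strain": "UPC", "treatment": "AG", "day": "Day1"},
    {"sample": "s8", "strain": "UPC", "treatment": "AG", "day": "Day2"},
]

# Assumed base ('no perturbation') level for every factor in the model.
base_levels = {"strain": "WT", "treatment": "G", "day": "Day1"}


def design_matrix(rows, base):
    """One intercept column plus one 0/1 column per non-base factor level."""
    terms = []
    for factor, base_level in base.items():
        for level in sorted({row[factor] for row in rows}):
            if level != base_level:
                terms.append((factor, level))
    columns = ["intercept"] + [f"{factor}_{level}" for factor, level in terms]
    matrix = [
        [1] + [int(row[factor] == level) for factor, level in terms]
        for row in rows
    ]
    return columns, matrix


columns, matrix = design_matrix(samples, base_levels)
print(columns)
for row, values in zip(samples, matrix):
    print(row["sample"], values)
```

A sample at the base level of every factor gets zeros in all columns except the intercept, so each remaining column measures the effect of moving away from the base level.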

Page 34: Part 5 of RNA-seq for DE analysis: Detecting differential expression

The output of DESeq2

The 'detect differential expression' tool gives you four results:

1. The report, including graphs.

2. Only the genes below the cut-off, with independent filtering applied.

3. All genes, with independent filtering applied.

4. Complete DESeq results, without independent filtering applied.

Page 35: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Effect of variance on DE detection

[Figure, two panels: 'All genes, with their logFC' and 'Only the DE genes'. In each panel the x-axis is Log2(FC) and the y-axis is the standard error (SE) of logFC.]

Page 36: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Volcano plot is often asymmetric

Volcano plot: shows the DE genes with our given cut-off.

[Figure: volcano plot with log10(FC) on the x-axis (marks at -0.3 and 0.3) and -log10(pvalue) on the y-axis.]

Page 37: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Comparing different conditions

Yeast (=WT)

GDA (=G)

Yeast mutant (=UPC)

GDA + vit C (=AG)

Day 1  Day 1  Day 2  Day 2

Which genes are DE between UPC and WT?

Which genes are DE between G and AG?

Which genes are DE in WT between G and AG?

Page 38: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Comparing different conditions

Adjust the sample descriptions file and the model:

1. Which genes are DE between UPC and WT?

2. Which genes are DE between G and AG?

3. Which genes are DE in WT between G and AG?

[Figure: the sample descriptions file for questions 1, 2 and 3, with the samples to drop marked 'Remove these'.]

Page 39: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Congratulations!

We have reached our goal!

Page 40: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Overview

http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html

Page 41: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Reads

Page 42: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Keywords

Effective library size

dispersion

shrinking

Significantly differentially expressed

MA-plot

Alpha cut-off

Independent filtering

FDR

p-value

Write in your own words what the terms mean

Page 44: Part 5 of RNA-seq for DE analysis: Detecting differential expression

Break