PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform...

< Journal meeting >

PennSeq: accurate isoform-specific gene ex-pression quantification in RNA-Seq by model-ing non-uniform read distribution

Yu Hu1, Yichuan Liu1, Xianyun Mao1, Cheng Jia1, Jane F. Ferguson2, Chenyi Xue2, Muredach P. Reilly2, Hongzhe Li1 and Mingyao Li1,*

1Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA and 2Cardiovascular Institute, Univer-sity of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA

In Seok Yang

13 May 2015

RNA sequencing

Transcriptomics studies using RNA sequencing (RNA-Seq) has become a promising avenue for characterization and understanding of the molecular basis of human dis-eases.

In the past decade, microarrays have been the method of choice for transcriptomics studies due to their ability to measure thousands of transcripts simultaneously. Re-cently, RNA-Seq has emerged as a new approach for transcriptome profiling.

With high coverage and single nucleotide resolution, RNA-Seq can be used to study expressions of genes or isoforms, alternative splicing, non-coding RNAs, post-tran-scriptional modifications and gene fusions

Introduction

Isoform-specific expression

Recent evidence suggests that almost all multiexon human genes have more than one isoform, and different isoforms are often differentially expressed across differ-ent tissues, developmental stages and disease conditions.

Therefore, correctly estimating isoform-specific gene expression is important for understanding complicated biological mechanisms and for mapping disease suscep-tibility genes using expression quantitative trait locus (eQTL) or splicing QTL ap-proaches.

Introduction

Challenging point in estimating isoform-specific gene expression using RNA seq

The current technologies can only sequence complementary DNA (cDNA) mole-cules that represent partial fragments of the RNA. Additionally, most reads that are mapped to a gene are shared by more than one isoform, making it difficult to discern their isoform origin.

An even more serious issue that complicates gene expression estimation is various biases present in RNA-Seq data. Many methods for estimating gene expression in RNA-Seq assume the sequenced fragments (or reads) are uniformly distributed along transcripts.

However, it is widely acknowledged that the true distribution of fragment start posi-tions deviates substantially from uniformity and varies with the fragmentation pro-tocol and sequencing technology.

Introduction

In this study

The authors presents PennSeq, a statistical method that allows each isoform to have its own non-uniform read distribution

Instead of making parametric assumptions, they give adequate weight to the under-lying data by the use of a non-parametric approach.

They evaluate the performance of the program using both simulated data with known ground truth, and using two real Illumina RNA-Seq data sets.

Introduction

Notation

Materials and Method

Symbols Notations Values

R The set of read pairs {r}

r Each read pair

I The set of known isoforms {i}

i Each known isoform

li Length of isoform i Effective length of isoform i

The set of relative abundance {i, iI}

i Relative abundance of isoform i 0 i 1, iIi = 1

ZR,I(r, i) |R||I| matrix 0 or 1

i

P(iso.=i) = = Probability that a read pair originates from isoform i.

Likelihood function 1 PennSeq


Cufflinks

Non-uniform read distribution

Uniform read distribution

Cufflinks estimates transcript abundances using a generative statistical model of RNA-Seq experiments. The model was parameterized by the relative abundances of the set of all transcripts in a sample.

𝐿 (Θ|R ,Z )=∏𝑟 ∈ R

∏𝑖∈ I

[ P (read pair=𝑟 , start=𝑠 ) ]Z R, I ( 𝑟 , 𝑖 )

¿∏𝑟 ∈R

∏𝑖 ∈I

[~𝜃 𝑖P (read pair=𝑟∨start=𝑠 , frag . len.=𝐿𝑖 (𝑟 , 𝑠) , iso .=𝑖)× P (start=𝑠 , frag . len.=𝐿𝑖 (𝑟 , 𝑠)∨iso .=𝑖) ]ZR ,I (𝑟 ,𝑖 )

𝐿 (𝜌|R )=∏𝑟∈ R

𝑃𝑟 (𝑟𝑑 .𝑎𝑙𝑛.=𝑟 )

¿∏𝑟 ∈R

∑𝑡∈𝑇

𝑃𝑟 (𝑟𝑑 .𝑎𝑙𝑛.=𝑟|𝑡𝑟𝑎𝑛𝑠 .=𝑡 )𝑃𝑟 (𝑡𝑟𝑎𝑛𝑠 .=𝑡 )

Likelihood function 2 PennSeq


x The sequence of read pair r

yi The sequence of isoform i

m The length of read pair

qj(a, b) Probability that we observe base a at position j of read pair given that the true base is b

Non-uniform read distribution is con-sidered in this equation, which is a key to the likelihood calculation.

¿1−10−𝑄𝑗 /10 Qj: per-base Phred quality score

Non-uniform read distribution𝐿 (Θ|R ,Z )=∏𝑟 ∈ R

∏𝑖∈ I

[ P (read pair=𝑟 , start=𝑠 ) ]Z R, I ( 𝑟 , 𝑖 )

¿∏𝑟 ∈R

∏𝑖 ∈I

[~𝜃 𝑖P (read pair=𝑟∨start=𝑠 , frag . len.=𝐿𝑖 (𝑟 , 𝑠) , iso .=𝑖)× P (start=𝑠 , frag . len.=𝐿𝑖 (𝑟 , 𝑠)∨iso .=𝑖) ]ZR ,I (𝑟 ,𝑖 )

h 𝑖 (𝑟 , 𝑠 )=P (start=𝑠 , frag . len.=𝐿𝑖 (𝑟 , 𝑠 )|iso .=𝑖)P (read pair=𝑟∨start=𝑠 , frag .len .=𝐿𝑖 (𝑟 ,𝑠 ) ,iso .=𝑖)=∏

𝑗=1

𝑚

𝑞 𝑗 (𝑥 𝑗 , 𝑦 𝑖 , 𝑗+𝑠−1 )=𝛽𝑖 (𝑟 ,𝑠 )

Modeling of read start distribution 1 Most existing methods assume that the read start position is uniformly distributed.


Illustration of the empirical modeling of non-uniform read distribution

h 𝑖 (𝑟 , 𝑠)= 1~𝑙𝑖−𝐿𝑖 (𝑟 ,𝑠 )+1

h 𝑖 (𝑟 , 𝑠)=∑𝑟 1∈R

∑𝑠1∈ S𝑖 ,𝑟

ZR , I (𝑟1 ,𝑖 )~𝐿𝑖 (𝑟 1 ,𝑠1 )

∑𝑟 2∈R

∑𝑠2∈ S𝑖

ZR , I (𝑟2 ,𝑖 )𝐿𝑖 (𝑟 2 ,𝑠2 )

¿h𝑠 𝑎𝑑𝑒𝑑𝑎𝑟𝑒𝑎𝑜𝑣𝑒𝑟𝑎𝑙𝑙 𝑎𝑟𝑒𝑎

Modeling of read start distribution 2 Unlike the previous approaches, the modeling of hi(r,s) dose not make any paramet-

ric assumptions.

In practice, the isoform origin of a read pair is unobserved. The authors can treat hi(r,s) as an unknown quantity and estimate it non-parametrically in an Expectation Maximization (EM) algorithm. Although feasible, this approach is computationally prohibitive, as it requires the calculation of per-base coverage during every EM up-date.

To speed up the calculation, they propose an estimate of hi(r,s) by approximating the isoform-specific read distribution. Specifically, for isoform informative reads, they assign them to the corresponding isoforms; for those non-informative reads, they as-sign them to all compatible isoforms.

Once the isoform-specific read distribution is determined, they can easily estimate hi(r,s) based on the procedure illustrated in Figure 1.


Parameter estimation using expectation-maximization algorithm


E (expectation)-step

M (maximization)-step

𝐶𝑎𝑙𝑐𝑢𝑙𝑎𝑡𝑒𝑄 (~Θ|~Θ (𝑡 ) )=EZR ,I∨R~Θ( 𝑡 ) [ log 𝐿 (~Θ|R ) ]¿ ∑𝑟 ∈R

∑𝑖∈ I

EZR ,I∨R~Θ( 𝑡 ) (ZR , I (𝑟 , 𝑖 ) ) log (~𝜃𝑖 𝛽𝑖 (𝑟 ,𝑠 )h 𝑖 (𝑟 ,𝑠) )

h𝑊 𝑒𝑟𝑒 EZR ,I∨R~Θ( 𝑡 ) (ZR, I (𝑟 , 𝑖 ) )=~𝜃 𝑖

(𝑡 )𝛽𝑖 (𝑟 ,𝑠 ) h𝑖 (𝑟 ,𝑠 )

∑𝑢∈I

~𝜃𝑢

(𝑡 ) 𝛽𝑢 (𝑟 ,𝑠 ) h𝑢 (𝑟 ,𝑠 )

𝑀𝑎𝑥𝑖𝑚𝑖𝑧𝑒𝑄 (~Θ|~Θ(𝑡 ) ) ~̂𝜃𝑖

(𝑡+1 )=~𝜃 𝑖

(𝑡 )𝛽𝑖 (𝑟 ,𝑠 ) h𝑖 (𝑟 ,𝑠 )|R|

The EM algorithm consists of alternating between the E- and M-steps until convergence. The author start the algorithm with assuming all isoforms are equally expressed and stop when the log likelihood is no longer increasing significantly.

~Θ (0 )

Quantification of isoform expression level For paired-end RNA-Seq data, the standard is to report Fragments per Kilobase of transcript

per Million mapped reads (FPKM).


𝐹𝑃𝐾𝑀=𝐶𝐿 ∙𝑁

∙10 6 ∙103

C The total number of fragments (or read pairs) mapped in a region of interest

N The total number of mapped reads in the experiment

L The length of the region

FPKM for a particular isoform is similar to the equation, except that C should be replaced by the estimated number of read pairs that originate from isoform i.

𝐹𝑃𝐾𝑀 (𝑖 )=~̂𝜃 𝑖 ∙¿𝐑∨ ¿

~𝑙 𝑖 ∙𝑁

∙10 6 ∙10 3¿

Program running of PennSeq

#bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds id name2 cdsStartStat cdsEndStat exonFrames

774 NM_001251984 chr1 + 24828840 24863510 24859578 24861767 4 24828840,24857707,24859572,24861582,24828953,24857881,24859744,24863510,0 RCAN3 cmpl cmpl -1,-1,0,1,

774 NM_001251979 chr1 + 24828840 24863510 24840862 24861767 5 24828840,24840803,24857707,24859572,24861582,24828953,24841057,24857881,24859744,24863510,0 RCAN3 cmpl cmpl -1,0,0,0,1,

774 NM_001251977 chr1 + 24829386 24863510 24840862 24861767 5 24829386,24840803,24857707,24859572,24861582,24829607,24841057,24857881,24859744,24863510,0 RCAN3 cmpl cmpl -1,0,0,0,1,

774 NM_001251981 chr1 + 24829386 24863510 24840862 24861767 4 24829386,24840803,24859572,24861582,24829607,24841057,24859744,24863510,0 RCAN3 cmpl cmpl -1,0,0,1,

1st step: preprocess

./preprocess.pl -r RefSeqAnnotation -o Outputfile-r RefSeqAnnotation file-o The file name that you want to save the results; it will be the input file for "pennseq.pl"

RCAN3 chr1 + 24828840 24863510 NM_001251983,NM_001251978,NM_013441,NM_001251984,NM_001251979,NM_001251981,NM_001251977,NM_001251985,NM_001251982,NM_001251980,

RCAN3 chr1 + 24828840 24828953 0,0,0,1,1,0,0,0,0,0,

RCAN3 chr1 + 24829386 24829607 1,0,1,0,0,1,1,0,0,0,

RCAN3 chr1 + 24829608 24829640 1,0,1,0,0,0,0,0,0,0,

RCAN3 chr1 + 24834083 24834218 0,1,0,0,0,0,0,0,0,0,

RCAN3 chr1 + 24840803 24841057 0,1,1,0,1,1,1,1,1,1,

RCAN3 chr1 + 24857707 24857881 1,1,1,1,1,0,1,0,1,1,

RCAN3 chr1 + 24859572 24859601 1,1,1,1,1,1,1,0,0,0,

RCAN3 chr1 + 24859602 24859744 1,1,1,1,1,1,1,0,0,1,

RCAN3 chr1 + 24861582 24863510 1,1,1,1,1,1,1,1,1,1,

2nd step: calculate the FPKM and relative abundance for each isoform

./pennseq.pl -I ISOFORM_Compatible_Matrix -s SAM_FILE -o Output_Result_FILE-s SAM_FILE is the mapping result file generated by mapping tools such as TopHat. It is in SAM for-mat. -I ISOFORM_Compatible_Matrix is from the output of preprocess.pl-o The file name that you want to save the results

Gene_name Location Isoform_name Number_of_Fragments FPKM Relative_Abundance

RCAN3 chr1:24828840-24863510 NM_001251983 467 1.28E-07 2.88E-13

RCAN3 chr1:24828840-24863510 NM_001251978 467 0.000110771 2.48E-10

RCAN3 chr1:24828840-24863510 NM_013441 467 13363.39519 0.029968735

RCAN3 chr1:24828840-24863510 NM_001251984 467 1.59E-13 3.56E-19

RCAN3 chr1:24828840-24863510 NM_001251979 467 176833.715 0.396567091

RCAN3 chr1:24828840-24863510 NM_001251981 467 233627.4159 0.523932581

RCAN3 chr1:24828840-24863510 NM_001251977 467 22086.69278 0.049531592

RCAN3 chr1:24828840-24863510 NM_001251985 467 1.24E-22 2.78E-28

RCAN3 chr1:24828840-24863510 NM_001251982 467 1.75E-18 3.91E-24

RCAN3 chr1:24828840-24863510 NM_001251980 467 3.77E-16 8.46E-22

RefSeqAnnotation

ISOFORM_Compatible_Matrix Output_Result_FILE


10 is

ofor

ms

Command lines used to run other programs 1. Cufflinks

cufflinks-2.1.1.Linux_x86_64/cufflinks -v -G hg19.gtf INPUT_SAM_FILE -o OUTPUT_PATH

2. Cufflinks-bias

cufflinks-2.1.1.Linux_x86_64/cufflinks -b hg19.fa -v -G hg19.gtf INPUT_SAM_FILE -o OUTPUT_PATH

3. RD

Step 1: perl ./isoform_RefSeq_process.pl GTF_files/refGene.txt out1

Step 2: perl ./isoform_RefSeq.pl out1 INPUT_SAM_FILE out2 0

Step 3: R CMD BATCH r-RefSeq-isoform.R

4. CEM

cem/bin/runcem.py --forceref --annotation hg19.bed INPUT_SAM_FILE

5. CEM-bias

cem/bin/runcem.py --forceref --usebias --annotation hg19.bed INPUT_SAM_FILE

6. IsoEM-bias

isoem-1.1.1/bin/isoem -G hg19.gtf -m 250 -d 50 -b INPUT_SAM_FILE

7. iReckon

java -Xmx15000M -jar IReckon-1.0.7.jar INPUT_SAM_FILE hg19.fa hg19.refgene

-1 INPUT_FASTQ_1 -2 INPUT_FASTQ_2 -o result


RNA-Seq data simulation 1 Flux Simulator is used to simulate paired-end RNA-Seq data by modeling RNA-Seq experi-

ments in slico.

The human genome sequence (hg19, NCBI build 37) was used to reference sequence.

Using Flux Simulator, we generated 100 million (100M) 76-bp paired-end reads.

The authors selected 10, 20, 60 million read from the simulated data, which denoted these sub-sets by 10M, 20M, and 60M, respectively.

For each gene, the authors estimated the isoform relative abundance using PennSeq, Cufflinks, CEM, RD, IsoEM and iReckon.

Results

RNA-Seq data simulation 2

Results

Median: 200 Median: 402

Median: 1208 Median: 2015

Interquartile range=0.75Median=0.041

The median numbers of read pairs mapped in each gene in the 10M, 20M, 60M and 100M data sets are 200, 402, 1208 and 2015, respectively.

Among the evaluated genes, 49% have two isoforms, 24% have three isoforms and 27% have four or more isoforms.

The simulated isoforms have a wide range of relative abundance (interquartile range=0.75, me-dian=0.041).

RNA-Seq data simulation 3

Results

The coverage plots of the simulated data resemble those seen in real studies, demonstrating var-ious biases. These simulated data thus provide an ideal basis to evaluate the performance of PennSeq as the ground truth is known.

Comparison of estimation accuracy

Results

The authors explored several measures to quantify the estimation accuracy of each method.

First, they measured the similarity between the estimated isoform relative abundance and the ground truth by calculating R2, the coefficient of determination (i.e. squared Pearson correla-tion coefficient).

Second, they measured the estimation accuracy by calculating the root mean squared error (RMSE), defined as

Third, they calculated the fraction of genes that have incorrectly inferred major isoforms. The major isoform of a gene is defined as the most abundant isoform of the gene.

𝑅𝑀𝑆𝐸=√∑𝑔 ∑𝑖

(�̂�𝑔 ,𝑖−𝜃𝑔 , 𝑖)2

𝑛

Comparison of estimation accuracy using R2

Results

Comparison of estimation accuracy using RMSE

Results

Comparison of estimation accuracy using percentage of genes with incorrectly inferred major iso-form

Results

Application to the MicroArray Quality Control (MAQC) data

Results

Human Brain Reference (HBR)

Universal Human Reference (UHR)

RNA-Seq data using Illu-mina GenomeAnalyzer us-ing seven lanes, yielding 35-bp single-end data

Application to the human adipose RNA-Seq data

Results

Discussion 1 Accurate estimation of isoform-specific gene expression is critical for eQTL and

splicing QTL studies using RNASeq.

A major challenge in the analysis of RNA-Seq data is the presence of various biases, which if not appropriately corrected, can affect isoform-specific expression estima-tion.

The central idea of the authors’ method is to model non-uniformity by using the em-pirical read distribution in RNA-Seq data. It is the first time that the non-uniformity is modeled at the isoform level.

Because of the non-parametric nature of the authors’ method, it can model any bi-ases that lead to non-uniformity. This flexibility is important as there are still un-known factors that contribute to non-uniformity and they are unlikely to be fully captured by parametric models.

Discussion 2 As a non–parametric-based approach that relies on empirical read distributions,

PennSeq is inevitably computationally intensive. However, our approximation of hi(r, s) significantly improved the computation speed.

Although PennSeq significantly outperforms the other tools, there is still room for improvement. Even with 100M reads, the R2 value of PennSeq is 0.86. Several steps can be taken to further improve the performance.

A drawback of the EM algorithm is overfitting because all isoforms are assigned a positive abundance estimate even if they are not expressed. To prevent overfitting, a simple solution is to refit the data while eliminating those isoforms with estimated relative abundance below a threshold. A more systematic approach would be to use regularized EM algorithm.

Discussion 3 The authors are currently extending their method to do simultaneous transcriptome

assembly and isoform expression estimation by using the component elimination EM algorithm. Other extensions that they are pursuing include detection of differen-tial expression and differential alternative splicing.

http://seqanswers.com/fo-rums/showpost.php?p=102911&postcount=60

Thank you for your attention.

PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform...

Documents

Transcript of PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform...