An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq Reads

19
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia State University Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

description

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq Reads. Serghei Mangul Department of Computer Science Georgia State University. Joint work with Adrian Caciula , Sahar Al Seesi , Dumitru Brinza , - PowerPoint PPT Presentation

Transcript of An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq Reads

Page 1: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End

RNA-Seq Reads

Serghei MangulDepartment of Computer Science

Georgia State University

Joint work with Adrian Caciula, Sahar Al Seesi, Dumitru Brinza, Abdul Rouf Banday, Rahul N. Kanadia, Ion Mandoiu and Alex Zelikovsky

Page 2: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

2

Advances in Next Generation Sequencing

http://www.economist.com/node/16349358

Roche/454 FLX Titanium400-600 million reads/run

400bp avg. length

Illumina HiSeq 2000Up to 6 billion PE reads/run

35-100bp read length

SOLiD 4/55001.4-2.4 billion PE reads/run

35-50bp read length

Ion Proton Sequencer

Page 3: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

3

RNA-SeqRNA-Seq

A B C D E

Make cDNA & shatter into fragments

Sequence fragment ends

Map reads

Gene Expression

A B C

A C

D E

Transcriptome Reconstruction Isoform Expression

Page 4: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

4

Transcriptome Reconstruction

• Given partial or incomplete information about something, use that information to make an informed guess about the missing or unknown data.

Page 5: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

5

Transcriptome Reconstruction Types

• Genome-independent reconstruction (de novo)– de Brujin k-mer graph

• Genome-guided reconstruction (ab initio)– Spliced read mapping – Exon identification– Splice graph

• Annotation-guided reconstruction– Use existing annotation (known transcripts) – Focus on discovering novel transcripts

Page 6: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

6

Previous approaches

• Genome-independent reconstruction – Trinity(2011), Velvet(2008), TransABySS(2008)

• Genome-guided reconstruction – Scripture(2010)

• Reports “all” transcripts

– Cufflinks(2010), IsoLasso(2011), SLIDE(2012)• Minimizes set of transcripts explaining reads

• Annotation-guided reconstruction– RABT(2011), DRUT(2011)

Page 7: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

7

Challenges and Solutions

• Read length is currently much shorter then transcripts length

• Paired-end reads • Fragment length distribution

Page 8: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

1 742 3 65t1 :

1 743 65t2 :

1 742 3 5t3 :

t4 : 1 743 5

1 742 3 65

Exon 2 and 6 are “distant” exons : how to phase them?

Page 9: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

• Map the RNA-Seq reads to genome

• Construct Splice Graph - G(V,E)– V : exons– E: splicing events

• Candidate transcripts– depth-first-search (DFS)

• Filter candidate transcripts– fragment length distribution (FLD)– integer programming

9

Genome

TRIPTransciptome Reconstruction using Integer Programming

Page 10: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

10

Gene representation

• Pseudo-exons - regions of a gene between consecutive transcriptional or splicing events

• Gene - set of non-overlapping pseudo-exons

e1 e3 e5

e2 e4 e6

Spse1Epse1

Spse2

Epse2Spse3

Epse3

Spse4

Epse4

Spse5

Epse5 Spse6

Epse6

Spse7Epse7

Pseudo-exons:

e1 e5

pse1 pse2 pse3 pse4 pse5 pse6 pse7

Tr1:

Tr2:

Tr3:

Page 11: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

11

Splice GraphGenome

1 42 3 5 6 7 8 9

TSSpseudo-exons

TES

Page 12: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

How to select?• Select the smallest set of candidate transcripts

that yields a good statistical fit between– the fragment length distribution empirically determined

during library preparation– fragment lengths implied by mapped read pairs

12

500

300

1 2 3

200 200 200

1 3

200 200

Series1

Mean : 500; Std. dev. 50

Series1

Mean : 500; Std. dev. 50

Page 13: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

Simplified IP Formulation • Objective

• Constraints

T(p) - set of candidate transcripts on which paired-end read p can be mapped y(t) - 1 if a candidate transcript t is selected, 0 otherwisex(p) - 1 if the pe read p is selected to be mapped 13

Tt

ty )(min

ppxtypTt

),()()1()(

p readsNpx )()2(

for each pe read at least one transcript is selected

Page 14: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

IP Formulation• Objective

• Constraints

14

Tt

ty )(min

for each pe read from every category of std.dev. at least one transcript is selected

restricts the number of pe reads mapped within different std. dev.

each pe read is mapped no morethen with one category of std. dev.

every splice junction to be covered

Page 15: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

15

Comparison on Simulated Data

100x coverage, 2x100bp pe reads, 500 mean fragment length, 10% sd

FPTP

TPPPV

SensPPV

SensPPVFScore

2

FNTP

TPSens

Page 16: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

16

Influence of Sequencing Parameters

100x coverage, 2x100bp pe reads, 500 mean fragment length, 10%-100% sd

FNTP

TPSens

FPTP

TPPPV

SensPPV

SensPPVFScore

2

TRIP-L : individual fragment lengths estimates

Page 17: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

17

Results on Real RNA-Seq Data

• CD1 mouse retina RNA samples• specific gene that has 33 annotated transcripts

in Ensembl • TRIP : 5 out of 10 transcripts, confirmed using

qPCR. • Cufflinks : 3 out of 10 transcripts

6906 alignments for 22346 read pairs with read length of 68

Page 18: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

18

Conclusions

• We introduced a novel method for transcriptome reconstruction from paired-end RNA-Seq reads.

• Our method :– exploits distribution of fragment lengths– additional experimental data

• TSS/TES (TRIP with TSS/TES)• individual fragment lengths estimates (TRIP-L)

Page 19: An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA- Seq  Reads

19