Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Post on 19-Mar-2016


Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)

Outline

• DGE/SAGE-Seq protocol
• EM algorithm
• Experimental results
• Conclusions

RNA-Seq Protocol

• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Gene Expression (GE)
• Isoform Expression (IE)
• Isoform Discovery (ID)

[Figure: exons A B C D E, with the groupings A B C, A C, and D E]

DGE Protocol

• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
• Gene Expression (GE)

[Figure: poly-A mRNA (AAAAA) with AE site CATG and TE primer site TCCRAC; exons A B C D E]
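As a rough illustration of the protocol steps above, the sketch below finds the anchoring enzyme sites in a transcript sequence and reads off the tag that starts at each site. It assumes that a tag consists of the AE recognition site plus the bases that follow it toward the 3' end, up to a fixed tag length; the function name `candidate_tags`, its parameters, and the example sequence are hypothetical and not taken from the slides.

```python
def candidate_tags(transcript, ae_site="CATG", tag_len=21):
    """Return (site_number, tag) pairs for one transcript given 5' to 3'.

    Sites are numbered 1, 2, ... starting from the 3' end.
    Assumption: a tag is the AE site plus the bases after it, tag_len bases in total.
    Sites too close to the 3' end to give a full-length tag keep their
    site number but produce no tag.
    """
    # Collect every occurrence of the AE site, including overlapping ones.
    positions = []
    start = transcript.find(ae_site)
    while start != -1:
        positions.append(start)
        start = transcript.find(ae_site, start + 1)

    tags = []
    # Walk the sites from the 3' end (last occurrence) toward the 5' end.
    for site_number, pos in enumerate(reversed(positions), start=1):
        tag = transcript[pos:pos + tag_len]
        if len(tag) == tag_len:
            tags.append((site_number, tag))
    return tags


example = candidate_tags("AACATGTTTTTTCATGGGGGGGAAAA", ae_site="CATG", tag_len=8)
```

Numbering the sites from the 3' end keeps the site index compatible with the tag formation probabilities used later in the deck.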

Our Approach

Previous methods
• Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
• Heuristics to rescue some ambiguous tags [Wu et al. 10]

New DGE-EM algorithm
• Uses all tags, including all ambiguous ones
• Uses quality scores
• Takes into account partial digest and gene isoforms

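Tag Formation Probability

AE sites on the mRNA are numbered 1, 2, …, k from the 3' end to the 5' end.

Tag formation probability, for cutting probability $p$:

• site 1: $p$
• site 2: $p(1-p)$
• site k: $p(1-p)^{k-1}$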
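The tag formation probabilities on this slide can be tabulated with a few lines of code. The sketch below reads them as a geometric pattern: site j yields the tag when it is cut and the j-1 sites closer to the 3' end are not. The function names are illustrative, not taken from the authors' software.

```python
def tag_formation_probabilities(k, p):
    """Probability that the tag comes from site j, for j = 1..k.

    Sites are numbered from the 3' end; p is the cutting probability.
    """
    return [p * (1.0 - p) ** (j - 1) for j in range(1, k + 1)]


def prob_any_tag(k, p):
    """Probability that at least one of the k sites is cut: 1 - (1 - p)^k."""
    return 1.0 - (1.0 - p) ** k


probs = tag_formation_probabilities(3, 0.5)
```

With complete digestion (p = 1) only site 1 can produce a tag; with partial digestion the remaining probability mass spreads over sites further from the 3' end.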

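Tag-Isoform Compatibility

$w_{t,i,j} = Q_a \, p(1-p)^{j-1}$

(weight of tag $t$ for site $j$ of isoform $i$, where $a$ is the corresponding alignment)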

DGE-EM Algorithm

assign random values to all f(i)
while not converged
    E-step:
        init all n(i,j) to 0
        for each tag t
            $s = \sum_{(i,j,w) \in t} f(i)\, w$
            for (i,j,w) in t
                $n(i,j) \mathrel{+}= f(i)\, w / s$
    M-step:
        for each isoform i
            $N_i = \sum_{j=1}^{sites(i)} n(i,j)$
            $f(i) = N_i / \left(1 - (1-p)^{sites(i)}\right)$
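The E-step and M-step above translate fairly directly into code. The following is a minimal sketch under stated assumptions rather than the authors' implementation: it starts from uniform frequencies instead of random ones (so results are reproducible), rescales the frequencies to sum to 1 after each M-step (the slide does not show a normalization), and requires every isoform to have at least one AE site. The function name `dge_em` and the data layout are hypothetical.

```python
from collections import defaultdict


def dge_em(tags, sites, p, max_iter=1000, tol=1e-9):
    """Estimate relative isoform frequencies f(i) from DGE tags.

    tags  : list of tags; each tag is a list of (i, j, w) triples, meaning the
            tag is compatible with site j of isoform i with weight w.
    sites : dict mapping isoform i -> number of AE sites (assumed >= 1).
    p     : cutting probability of the anchoring enzyme (0 < p <= 1).
    """
    # Assumption: uniform start instead of the random start on the slide.
    f = {i: 1.0 / len(sites) for i in sites}

    for _ in range(max_iter):
        # E-step: split each tag fractionally among its compatible (isoform, site) pairs.
        n = defaultdict(float)
        for t in tags:
            s = sum(f[i] * w for i, j, w in t)
            if s == 0.0:
                continue  # no support for this tag under the current estimates
            for i, j, w in t:
                n[(i, j)] += f[i] * w / s

        # M-step: divide each isoform's tag count by the probability
        # that at least one of its sites is cut.
        new_f = {}
        for i, k in sites.items():
            N_i = sum(n[(i, j)] for j in range(1, k + 1))
            new_f[i] = N_i / (1.0 - (1.0 - p) ** k)

        # Assumption: rescale so the frequencies sum to 1 (not shown on the slide).
        total = sum(new_f.values())
        if total > 0.0:
            new_f = {i: v / total for i, v in new_f.items()}

        converged = max(abs(new_f[i] - f[i]) for i in sites) < tol
        f = new_f
        if converged:
            break
    return f


# Three tags unique to isoform "A" and one tag shared by "A" and "B".
toy_tags = [[("A", 1, 1.0)]] * 3 + [[("A", 1, 1.0), ("B", 1, 1.0)]]
toy_f = dge_em(toy_tags, {"A": 1, "B": 1}, p=1.0)
```

In the toy input the shared tag is attributed more and more to isoform "A" at each iteration, so the estimate for "A" approaches 1. A method that discards ambiguous tags would simply ignore that shared tag.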

MAQC Data (UHRR, HBRR)

DGE
• 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
• Anchoring enzyme DpnII (GATC)

RNA-Seq
• 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]

qPCR
• Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

Compared Algorithms

DGE
• Uniq [Asmann et al. 09, Zaretzki et al. 10]
• DGE-EM

RNA-Seq
• IsoEM [Nicolae et al. 10]
• Cufflinks [Trapnell et al. 10]

DGE-EM vs. Uniq on HBRR Library 4

[Figure: Median Percent Error (y-axis, 65 to 85) against an x-axis running from 0 to 60,000,000; series: Uniq and DGE-EM, each with 0, 1, and 2 mismatches]
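The plots in this part of the deck report Median Percent Error. Assuming this is the median over genes of the relative error against a reference measurement (here qPCR), expressed as a percentage, it can be sketched as follows. The function name, the dictionary-based inputs, and the handling of missing or zero values are illustrative choices, and both inputs are assumed to be on the same scale.

```python
from statistics import median


def median_percent_error(estimates, truth):
    """Median over genes of 100 * |estimate - truth| / truth.

    estimates, truth : dicts mapping gene -> expression level (same units).
    Genes missing from estimates count as estimated at 0; genes whose
    reference value is not positive are skipped to avoid dividing by zero.
    """
    errors = [
        100.0 * abs(estimates.get(gene, 0.0) - ref) / ref
        for gene, ref in truth.items()
        if ref > 0
    ]
    return median(errors)


mpe = median_percent_error(
    {"g1": 11.0, "g2": 10.0, "g3": 40.0},
    {"g1": 10.0, "g2": 20.0, "g3": 40.0},
)
```

Using the median rather than the mean keeps a few badly estimated genes from dominating the summary.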

DGE vs. RNA-Seq

[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases (x-axis); series: RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5 with IsoEM; DGE libraries HBRR 1 to HBRR 8 and UHRR 1 with DGE-EM]

DGE vs. RNA-Seq

[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases (x-axis); series: RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5, each with IsoEM and Cufflinks; DGE libraries HBRR 1 to HBRR 8 and UHRR 1, each with DGE-EM and Uniq]

DGE vs. RNA-Seq

[Figure: $R^2$ (y-axis, 0.35 to 0.85) vs. Million Mapped Bases (x-axis); series: RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2, UHRR 3, UHRR 4, UHRR 5, each with IsoEM and Cufflinks; DGE libraries HBRR 1 to HBRR 8 and UHRR 1, each with DGE-EM and Uniq]

Synthetic Data

• 1-30M tags, lengths 14-26bp
• UCSC hg19 genome and known isoforms
• Simulated expression levels
  – Gene expression for 5 tissues from the GNFAtlas2
  – Geometric expression for the isoforms of each gene
• Anchoring enzymes from REBASE
  – DpnII (GATC) [Asmann et al. 09]
  – NlaIII (CATG) [Wu et al. 10]
  – CviJI (RGCY, R=G or A, Y=C or T)
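Degenerate recognition sites such as RGCY match several concrete sequences, which is why such an enzyme cuts more often than one with a fixed 4-base site. A small sketch of how a degenerate site could be expanded into a regular expression and located in a sequence is shown below. The R and Y codes follow the slide; S (C or G) is the standard IUPAC code, included because ASST appears among the enzymes in the plots. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the recognition sites in this deck.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",  # R = G or A
    "Y": "[CT]",  # Y = C or T
    "S": "[CG]",  # S = C or G
}


def site_pattern(recognition_site):
    """Compile a regex for a (possibly degenerate) recognition site.

    The pattern is wrapped in a lookahead so overlapping sites are all found.
    """
    body = "".join(IUPAC[base] for base in recognition_site)
    return re.compile("(?=" + body + ")")


def site_positions(sequence, recognition_site):
    """0-based start positions of every occurrence of the site in sequence."""
    return [m.start() for m in site_pattern(recognition_site).finditer(sequence)]


seq = "AAGGCCTTAGCT"
rgcy_sites = site_positions(seq, "RGCY")
gatc_sites = site_positions(seq, "GATC")
```

On this toy sequence RGCY matches both GGCC and AGCT, while GATC does not occur at all.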

MPE for 30M 21bp tags

[Figure: Median Percent Error (y-axis, 0 to 30) for anchoring enzymes GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY; series: Uniq p=1.0, Uniq p=0.5, DGE-EM p=1.0, DGE-EM p=0.5; reference: RNA-Seq, 8.3 MPE]

Conclusions

Introduced new DGE-EM algorithm
• Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
• Source code freely available at http://www.dna.engr.uconn.edu/software/DGE-EM

First direct comparison of RNA-Seq and DGE protocols
• Best inference algorithms yield comparable cost-normalized accuracy on MAQC data

Simulations suggest possible DGE protocol improvements
• Enzymes with degenerate recognition sites (e.g. CviJI)
• Optimizing cutting probability

Questions?

ACKNOWLEDGEMENTS

Work supported in part by NSF awards IIS-0546457 and IIS-0916948

Anchoring Enzyme Statistics

[Figure: % Genes Cut, % Unique Tags (p=1.0), and % Unique Tags (p=0.5) (y-axis, 75 to 100) for anchoring enzymes GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY]

RNA-Seq

[Figure: MPE (0 to 25) as a function of #Reads (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000) and Read Length (14, 18, 21, 26, 36, 50, 75, 100)]

DGE enzymes GATC, CATG, and RGCY at p=1.0 and p=.5

[Figures: six panels, one per enzyme (GATC, CATG, RGCY) and cutting probability (p=1.0, p=.5), each showing MPE as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1,000,000; 5,000,000; 10,000,000; 15,000,000; 30,000,000)]