Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)


Page 1: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)

Page 2: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Outline

• DGE/SAGE-Seq protocol
• EM algorithm
• Experimental results
• Conclusions

Page 3: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

RNA-Seq Protocol

• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads

Gene Expression (GE)
Isoform Expression (IE)
Isoform Discovery (ID)

[Diagram labels: A B C D E; A B C; A C; D E]

Page 4: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE Protocol

• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags

Gene Expression (GE)

[Diagram labels: poly-A tail AAAAA; AE site CATG; TE site TCCRAC; A B C D E]
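As a rough illustration of the protocol on this slide, the sketch below enumerates the candidate tags of an mRNA sequence in silico. It assumes that a tag starts at the AE recognition site and extends a fixed number of bases toward the 3' end, and that sites are numbered starting from the 3' end. The function name `candidate_tags`, the default tag length, and the toy sequence are hypothetical and do not come from the slides.

```python
def candidate_tags(mrna, ae_site="CATG", tag_len=21):
    """Return (site_number, tag) pairs; site 1 is the AE site closest to the 3' end."""
    # Collect every occurrence of the AE site, including overlapping ones.
    positions = []
    pos = mrna.find(ae_site)
    while pos != -1:
        positions.append(pos)
        pos = mrna.find(ae_site, pos + 1)

    result = []
    # Walk from the 3' end (rightmost site) toward the 5' end.
    for site_number, start in enumerate(reversed(positions), start=1):
        tag = mrna[start:start + tag_len]
        if len(tag) == tag_len:  # skip sites too close to the 3' end for a full tag
            result.append((site_number, tag))
    return result


# Toy sequence with two CATG sites and a short tag length for readability.
example = candidate_tags("GGCATGAAACCCTTTCATGTTTGGGAAACCC", tag_len=8)
```

Here `example` holds one tag per AE site, each beginning with CATG.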

Page 5: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Our Approach

Previous methods
• Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
• Heuristics to rescue some ambiguous tags [Wu et al. 10]

New DGE-EM algorithm
• Uses all tags, including all ambiguous ones
• Uses quality scores
• Takes into account partial digest and gene isoforms

Page 6: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Tag Formation Probability

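[Diagram: mRNA with AE sites numbered 1, 2, …, k from the 3' end toward the 5' end]

Tag formation probability for sites 1, 2, …, k:

$p,\quad p(1-p),\quad \ldots,\quad p(1-p)^{k-1}$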
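A minimal sketch of this calculation, assuming each AE site is cut independently with probability p and that the tag comes from the cut site closest to the 3' end. The function names are illustrative only.

```python
def tag_formation_probs(k, p):
    """Probability that the tag is formed at site j = 1..k (site 1 nearest the 3' end)."""
    # Site j yields the tag when it is cut and sites 1..j-1 are all uncut.
    return [p * (1 - p) ** (j - 1) for j in range(1, k + 1)]


def prob_any_tag(k, p):
    """Probability that at least one of the k sites is cut, i.e. some tag is formed."""
    return 1 - (1 - p) ** k


probs = tag_formation_probs(3, 0.5)
```

The per-site probabilities sum to `prob_any_tag(k, p)`, which is less than 1 whenever p < 1 because the transcript may escape digestion entirely.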

Page 7: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Tag-Isoform Compatibility

Weight of tag t for AE site j of isoform i:

$w_{t,i,j} = Q_a \cdot p\,(1-p)^{j-1}$
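The sketch below shows one way such a weight could be computed, reading the slide's formula as w = Q_a · p · (1-p)^(j-1). How Q_a is obtained is not stated on the slide; the version here assumes it is the product of per-base probabilities derived from Phred quality scores (1-ε for a match, ε/3 for a mismatch), which is a common convention but an assumption in this context. Both function names are hypothetical.

```python
def alignment_quality(tag, ref, phred):
    """ASSUMED form of Q_a: product of per-base match/mismatch probabilities."""
    q = 1.0
    for base, ref_base, score in zip(tag, ref, phred):
        eps = 10 ** (-score / 10)  # Phred score -> base-call error probability
        q *= (1 - eps) if base == ref_base else eps / 3
    return q


def compatibility_weight(q_a, p, j):
    """w = Q_a * p * (1 - p)^(j - 1) for the j-th AE site counted from the 3' end."""
    return q_a * p * (1 - p) ** (j - 1)


# A perfectly matching two-base alignment with Phred 10 at each position,
# placed at the second AE site with cutting probability 0.5.
w = compatibility_weight(alignment_quality("AC", "AC", [10, 10]), p=0.5, j=2)
```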

Page 8: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE-EM Algorithm

assign random values to all f(i)
while not converged
    E-step
        init all n(i,j) to 0
        for each tag t
            s = Σ_{(i,j,w) in t} w·f(i)
            for (i,j,w) in t
                n(i,j) += w·f(i)/s
    M-step
        for each isoform i
            N_i = Σ_{j=1}^{sites(i)} n(i,j)
            f(i) = N_i / (1 - (1-p)^{sites(i)})
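The loop on this slide can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: it starts from uniform rather than random frequencies so that runs are reproducible, it renormalizes f after each M-step, and it uses a simple convergence test. All three choices, the function name `dge_em`, and the toy data are assumptions.

```python
from collections import defaultdict


def dge_em(tags, sites, p, max_iter=500, tol=1e-10):
    """Estimate isoform frequencies f(i).

    tags  : list of tags; each tag is a list of (isoform, site, weight) triples
    sites : dict mapping isoform -> number of AE sites
    p     : cutting probability of the anchoring enzyme, 0 < p <= 1
    """
    f = {i: 1.0 / len(sites) for i in sites}  # uniform start (assumption)
    for _ in range(max_iter):
        # E-step: split each tag fractionally among its compatible (isoform, site) pairs.
        n = defaultdict(float)
        for t in tags:
            s = sum(w * f[i] for i, j, w in t)
            if s == 0.0:
                continue  # no support for this tag under the current estimates
            for i, j, w in t:
                n[(i, j)] += w * f[i] / s
        # M-step: correct each isoform's count for the chance that none of its sites is cut.
        new_f = {}
        for i, k in sites.items():
            n_i = sum(n[(i, j)] for j in range(1, k + 1))
            new_f[i] = n_i / (1.0 - (1.0 - p) ** k) if k > 0 else 0.0
        total = sum(new_f.values())
        if total > 0:
            new_f = {i: v / total for i, v in new_f.items()}  # normalization (assumption)
        converged = max(abs(new_f[i] - f[i]) for i in sites) < tol
        f = new_f
        if converged:
            break
    return f


# Toy data: 3 tags unique to A, 1 unique to B, 4 ambiguous between A and B.
tags = ([[("A", 1, 1.0)]] * 3 + [[("B", 1, 1.0)]]
        + [[("A", 1, 1.0), ("B", 1, 1.0)]] * 4)
f = dge_em(tags, {"A": 1, "B": 1}, p=1.0)
```

In this toy case the ambiguous tags end up split in the same 3:1 ratio as the unique tags, which is the behavior that discarding ambiguous tags cannot provide.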

Page 9: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

MAQC Data (UHRR, HBRR)

DGE
• 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
• Anchoring enzyme DpnII (GATC)

RNA-Seq
• 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]

qPCR
• Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]

Page 10: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Compared Algorithms

DGE
• Uniq [Asmann et al. 09, Zaretzki et al. 10]
• DGE-EM

RNA-Seq
• IsoEM [Nicolae et al. 10]
• Cufflinks [Trapnell et al. 10]

Page 11: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE-EM vs. Uniq on HBRR Library 4

[Figure: Median Percent Error (y-axis, 65-85; x-axis ticks 0 to 60,000,000) for Uniq and DGE-EM, each with 0, 1, and 2 mismatches]
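The accuracy measure in these comparisons is Median Percent Error (MPE). A minimal sketch of how it might be computed is given below, assuming percent error means |estimate - reference| / reference × 100 and that genes with a zero reference value are skipped. The slides do not spell out either detail, and the function name is illustrative.

```python
from statistics import median


def median_percent_error(estimated, reference):
    """Median over genes of |estimate - reference| / reference * 100."""
    errors = [
        abs(est - ref) / ref * 100
        for est, ref in zip(estimated, reference)
        if ref > 0  # percent error is undefined for a zero reference (assumption: skip)
    ]
    return median(errors)


mpe = median_percent_error([110.0, 80.0, 100.0], [100.0, 100.0, 100.0])
```

Because it is a median, MPE is insensitive to a small number of genes with very large relative errors.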

Page 12: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE vs. RNA-Seq

[Figure: Median Percent Error (y-axis, 60-100) vs. Million Mapped Bases (x-axis). Series: RNA HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2-5 (IsoEM); DGE HBRR 1-8 and UHRR 1 (DGE-EM)]

Page 13: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE vs. RNA-Seq

[Figure: Median Percent Error (y-axis, 60-100) vs. Million Mapped Bases (x-axis). Series: RNA HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2-5 (IsoEM and Cufflinks); DGE HBRR 1-8 and UHRR 1 (DGE-EM and Uniq)]

Page 14: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE vs. RNA-Seq

[Figure: R² (y-axis, 0.35-0.85) vs. Million Mapped Bases (x-axis). Series: RNA HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, UHRR 2-5 (IsoEM and Cufflinks); DGE HBRR 1-8 and UHRR 1 (DGE-EM and Uniq)]

Page 15: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Synthetic Data

• 1-30M tags, lengths 14-26bp
• UCSC hg19 genome and known isoforms
• Simulated expression levels
  – Gene expression for 5 tissues from the GNFAtlas2
  – Geometric expression for the isoforms of each gene
• Anchoring enzymes from REBASE
  – DpnII (GATC) [Asmann et al. 09]
  – NlaIII (CATG) [Wu et al. 10]
  – CviJI (RGCY, R=G or A, Y=C or T)
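To make the degenerate recognition sites concrete, the sketch below locates AE sites for IUPAC patterns such as RGCY by translating them into regular expressions. The `IUPAC` table covers only the codes needed for the enzymes in this deck plus W and N; the function names and test sequences are illustrative.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes (subset).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]",
}


def site_regex(site):
    """Compile a recognition site such as 'RGCY' into a regex."""
    pattern = "".join(IUPAC[code] for code in site)
    # A lookahead lets finditer report overlapping occurrences as well.
    return re.compile("(?=(" + pattern + "))")


def find_sites(seq, site):
    """Return the 0-based start positions of every occurrence of the site."""
    return [m.start() for m in site_regex(site).finditer(seq)]


cviji_sites = find_sites("AAGGCCTTAGCT", "RGCY")
dpnii_sites = find_sites("GATCGATC", "GATC")
```

A degenerate site such as RGCY matches four concrete sequences, so it occurs more often in a transcript than a fixed four-base site.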

Page 16: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

MPE for 30M 21bp tags

RNA-Seq: 8.3 MPE

[Figure: Median Percent Error (y-axis, 0-30) by anchoring enzyme recognition site (GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY) for Uniq p=1.0, Uniq p=0.5, DGE-EM p=1.0, and DGE-EM p=0.5]

Page 17: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Conclusions

Introduced new DGE-EM algorithm
• Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
• Source code freely available at http://www.dna.engr.uconn.edu/software/DGE-EM

First direct comparison of RNA-Seq and DGE protocols
• Best inference algorithms yield comparable cost-normalized accuracy on MAQC data

Simulations suggest possible DGE protocol improvements
• Enzymes with degenerate recognition sites (e.g. CviJI)
• Optimizing cutting probability

Page 18: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Questions?

ACKNOWLEDGEMENTS

Work supported in part by NSF awards IIS-0546457 and IIS-0916948

Page 19: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

Anchoring Enzyme Statistics

[Figure: % Genes Cut, % Unique Tags (p=1.0), and % Unique Tags (p=0.5) (y-axis, 75-100) by anchoring enzyme recognition site (GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY)]

Page 20: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

RNA-Seq

[Figure: MPE (0-25) as a function of #Reads (1M, 5M, 10M, 15M, 30M) and Read Length (14, 18, 21, 26, 36, 50, 75, 100)]

Page 21: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme GATC p=1.0

[Figure: MPE (0-16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]

Page 22: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme CATG p=1.0

[Figure: MPE (0-16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]

Page 23: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme RGCY p=1.0

[Figure: MPE (0-14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]

Page 24: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme GATC p=.5

[Figure: MPE (0-16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]

Page 25: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme CATG p=.5

[Figure: MPE (0-14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]

Page 26: Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data

DGE enzyme RGCY p=.5

[Figure: MPE (0-14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]