Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted...

23
Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z . Ratz U Washington UC Berkeley & U Penn Microsoft Research 016

Transcript of Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted...

Page 1: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Sequence assemblyfrom corrupted shotgun reads

Shirshenducanguly Elhanan Mossel Miklos Z.

Ratz

U. Washington UC Berkeley & U

. Penn Microsoft Research

1¥016

Page 2: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

DNA sequencing :

• Prevalent technique : shotgun sequencing

• Goal of de novo assembly :

reconstruct X from reads R

Page 3: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

DNA

sequencing

:

robustalgorithms.2.ee

Prevalent technique : shotgun sequencing

• Goal of de novo assembly :

reconstruct X from reads R

•. Many sequencing technologiesw/ different error profiles

: Are there robust assembly algorithms ?

Page 4: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

DNA

sequencing

: robustalgorithmsn

• Prevalent technique : shotgun sequencing

• Goal of de novo assembly :

reconstruct X from reads R

•. Many sequencing technologiesw/ different error profiles

: Are there robust assembly algorithms ?

A : Yes,

and a simple sequential algorithm works well.

Page 5: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Sequencing technologies• Sanger sequencing

- 800-1000 bp reads,

< 1% error

- very expensive

•. Next gen ( Edgar) sequencing- high throughput , cheap Example :

- short reads ( too -200 bp ) Illuming- low error rate ( 1- 3%)

•• Emerging ( 3rd gen) technologiesExamples :

- long reads ( > 10000 bp )• Paasio 's SMRT

- high error rate ( 10 - 22% ) • Oxford Nanopore

Page 6: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Sequencing technologies• Sanger sequencing

- 800-1000 bp reads,

< 1% error

- very expensive

•• Next gen ( Edgar) sequencing- high throughput , cheap Example :

- short reads ( too -200 bp ) Illuming- low error rate ( 1- 3%)

•• Emerging ( 3rd gen) technologiesExamples :

- long reads ( > 10000 bp )• Paasio 's SMRT

- high error rate ( 10 - 22% ) • Oxford Nanopore

: how robust are reconstruction algorithmsWirt .

different sequencing technologies ?

Page 7: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Adversarial corruption / error model

• Instead of getting true reads R ,

get corrupted reads FL

• Assume only that

ed ( Ri ,R~i)< EL

where ed = edit distance ,

and L = length of R ; .

Page 8: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Approximate reconstruction problem1

.

Choose XE ["

uniformly at random,

E = { A ,C , G ,T]

2.

Draw reads

R = { R, ,R , , ... ,Rµ } of length L

from uniformly random positions

3.

Get corrupted reads

k={ RT,RT , . . spin }

Satisfying edl Ri ,ki)< EL

Page 9: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Approximate reconstruction problem1

.

Choose XE ["

uniformly at random,

E={ A ,C , G ,T]

2.

Draw reads

R = { R, ,R , , ... ,Rµ } of length L

from uniformly random positions

3.

Get corrupted reads

k={ RT,RT , . . spin }

Satisfying edl Ri ,ki)< EL

Goal : approximate reconstruction

Output : F- Re )eE* st.

ed ( I ,X ) ecen

Wlprob . 21-8.

Page 10: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Main obstructions to reconstruction

I.

Short reads lead to repeats ( Ukkonen ' 92 )repeat -

limited

regime

Page 11: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Main obstructions to reconstruction

1.

Short reads lead to repeats ( Ukkonen ' 92 )repeat -

limited

regime

2. Need enough reads to cover Xcoverage

-

limited

regime

Lander,

Waterman ( 1988 ) :

Nav = Naval ,

D=Eln ( to )

Page 12: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Exact reconstruction ( to )Thin ( A

.Motahan , G. Bresler

, D Tse , 2013 )+

These are the only obstructions.

Page 13: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Exact reconstruction ( to )Thin ( A

.Motahan , G

.Bresler

, D Tse , 2013 )+

These are the only obstructions.

More precisely : let X be random ,L= I lnlnl

,5<112 .

Then :

repeated • if I < ¥2 , then exact reconstruction is impossible ;vigifted . if I > Ii then king ,NTo÷ = 1.

Page 14: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Exact reconstruction ( to )Thin ( A

.Motahan , G

.Bresler

, D Tse , 2013 )+

These are the only obstructions.

More precisely : let X be random ,L= I ln (nl

,5<42 .

Then :

repeated • if I < ¥2 , then exact reconstruction is impossible ;viwaited . if I > ¥ ,then nltg ,Nvmj÷ = 1

.

low

For arbitrary sequences:

G.

Bresler,

M Bresler, D .

Tse ( 2013 )thresholds based on repeat statistics of genome

Page 15: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Approximate reconstruction

Thy Approximate reconstruction is possible ,

if L and N are large enough .

Page 16: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Approximate reconstruction

Thy Approximate reconstruction is possible ,

if L and N are large enough .

More precisely : Let X be random ,and L=Iln(n )

.

For every C >3 there exist constants E= EK ) ,Eo=E°K , C)

, CECYE, C)

st. for every E e ( 0

, ED if [ 2 Ek,

NZC ' Nav /E

then there exists an approximate reconstruction algorithmfor error rate E with approximation factor C

.

Page 17: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Approximate reconstruction

Thy Approximate reconstruction is possible ,

if L and N are large enough .

More precisely : Let X be random ,and L=Iln(n )

.

For every C >3 there exist constants E= EK ) ,Eo=E°K , C)

, ECYE , C)

st. for every E e ( 0

, ED if [ 2 Ek,

NZC ' Nav /E

then there exists an approximate reconstruction algorithmfor error rate E with approximation factor C

.

Comments• simple sequential algorithm works

•. dependence of [ and N on E not necessary( but get worse C)

••• best achievable C might depend on [ and N

• related work :

Motahari ,Ramchandran

, Tse ,Ma ( 2013 ) ; Shomorony ,

Gurtade,Tse ( 2015 )

Page 18: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Edit distance between random strings

Lemme Xm ,Yme Em independent,

Uniformly random .

Then

lim In ed ( Xm ,Ym) = Cind > 0

.

For KK 4 :

• empirically Cnd = 0.51 .

• volume argument : Cind > 0.33.

Page 19: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Edit distance between random strings

^

utftmntntqrxanmdgtmmewenmindependent,

} )limtmedl Xm ,Ym)=Cind>0 .

Lenny XEIM uniformly random .

Then :

For kI=4 :

• empirically cindaost . ed(X[1,m],X[ t.tk ,m+k])=2k• volume argument : Cind >0 -33

. foray k£cm with prob .71 - £ " ?

Page 20: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Sequential reconstruction algorithm~

Rk XEL

pi

• Fix a appropriately .

• Given Pin , find TZETZ s .

t.

ed ( ksiftix,Fire ' ) -42+214 el

• Concatenates pen and lIn%×.

• at each step , gain = (1- Le)L ,make error s3EL

Page 21: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Summary• introduced adversarial read error model

• approximate reconstruction is possible• simple sequential algorithm works

• edit distance results key to analysis

Page 22: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Summary• introduced adversarial read error model

• approximate reconstruction is possible• simple sequential algorithm works

• edit distance results key to analysis

Challenges• determine fundamental limits

of approximate reconstruction

• results for arbitrary sequences•. bridge gap between models

Page 23: Sequence assembly - Princeton University · 2017. 9. 15. · Sequence assembly from corrupted shotgun reads Shirshenducanguly Elhanan Mossel Miklos Z Ratz U. Washington UC Berkeley

Summary• introduced adversarial read error model

• approximate reconstruction is possible• simple sequential algorithm works

• edit distance results key to analysis

Challenges• determine fundamental limits

of approximate reconstruction

• results for arbitrary sequences•. bridge gap between models

Thank you!