Reads must be processed to turn data into ... · Greedy Reconstruction: It was the best...

Post on 15-Aug-2020


Reads must be processed to turn data into information

Assembly  

In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.

A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads).

Generalized NGS analysis

[Figure: workflow from "Question" to "Answer?": Raw reads → Pre-processing → Assembly (alignment / de novo) → Application specific (variant calling, count matrix, ...) → Compare samples / methods; side axis: Data size]

What is de novo assembly?

Merge small DNA fragments together so they form a previously unknown sequence

Merge millions of reads together so they form previously unknown sequences

de novo assembly
• Assemble reads into longer fragments
• Find overlap between reads
• Many approaches

[Figure: reads → contigs → scaffolds]


Let's try to assemble some reads!

• Rules:
  – a minimum of 7-bp overlap
  – overlap must not include any N bases
  – same orientation, so that the sequence can be read left to right
  – there may be 1-bp differences
  – simplified: no double-stranded DNA

Valid assemblies

  ..NNNNGGACTATGATTCG
              |||||||
              TGATTCGAGGCTAANN..

  ..NNNNNNNNCGATTCTGATCCGA
            |||||||
       GTCCTCGATTCTNNNNNNNN..

Invalid assemblies

  ..NNNNCGGACTATGATT
              ||||||
              ATGATTCGAGGCTAANN..

  ..NNNNNNNNCGCTACTGATCCGA
            || | |||
       GTCCTCGATTCTGNNNNNNN..
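The rules above can be checked mechanically. The following is a minimal sketch, assuming that "1-bp differences" means at most one mismatch inside the overlap and that the unknown N flanks have been trimmed from the reads; the function name `valid_overlap` and its parameters are illustrative and not part of the slides.

```python
def valid_overlap(left: str, right: str, min_len: int = 7, max_mismatch: int = 1) -> bool:
    """True if a suffix of `left` overlaps a prefix of `right` under the slide rules.

    Assumptions (not from the slides): at most `max_mismatch` mismatches are
    tolerated, and both reads are given in the same orientation.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        suffix, prefix = left[-n:], right[:n]
        if "N" in suffix or "N" in prefix:
            continue  # rule: the overlap must not include any N bases
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatch:
            return True
    return False


# The four slide examples, with the N flanks trimmed.
print(valid_overlap("GGACTATGATTCG", "TGATTCGAGGCTAA"))   # valid: 7-bp exact overlap
print(valid_overlap("GTCCTCGATTCT", "CGATTCTGATCCGA"))    # valid: 7-bp exact overlap
print(valid_overlap("CGGACTATGATT", "ATGATTCGAGGCTAA"))   # invalid: only 6 bp overlap
print(valid_overlap("GTCCTCGATTCTG", "CGCTACTGATCCGA"))   # invalid: 2 mismatches in 8 bp
```

Under these assumptions the first invalid example fails because the exact overlap is only 6 bp, and the second because its 8-bp overlap contains two mismatches.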

NGS de novo assembly

• Success depends on:
  – Genome size, genomic repeats (!), ploidy
  – High coverage, long read lengths, PE/MP libraries

Repeats in E. coli

Tomorrow we will see a success story of a genome assembly

[Figure: de Bruijn graphs of two bacterial genomes; panels: "Few repeats" and "'more' repeats"]

By the end of today you will see not two scribbles, but much more

Which approaches?

• Greedy ("simple" approach)
• Overlap-Layout-Consensus (fewer, long reads)
• de Bruijn graphs (many short reads)

Simple approach - Greedy

• Pseudo code:
  1. Pairwise alignment of all reads
  2. Identify fragments that have largest overlap
  3. Merge these
  4. Repeat until all overlaps are used
• Can only resolve repeats smaller than read length
• High computational cost with increasing no. of reads
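The pseudo code above can be sketched in a few lines. This is an illustrative toy, assuming exact (error-free) suffix/prefix overlaps instead of a real pairwise alignment; `overlap`, `greedy_assemble` and `min_len` are hypothetical names chosen for this example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop identical reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        # Step 1-2: compare all ordered pairs and remember the largest overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # step 4: no overlaps left
        # Step 3: merge the best pair into one longer sequence.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCG", "GGCGTG", "CGTGCA"]))
```

The all-against-all comparison in every round is what makes the cost grow quickly with the number of reads.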

Shredded Book Reconstruction

• Dickens accidentally shreds the first printing of A Tale of Two Cities
  – Text printed on 5 long spools
• How can he reconstruct the text?
  – 5 copies x 138,656 words / 5 words per fragment = 138k fragments
  – The short fragments from every copy are mixed together
  – Some fragments are identical

It was the best of of times, it was the times, it was the worst age of wisdom, it was the age of foolishness, …

It was the best worst of times, it was of times, it was the the age of wisdom, it was the age of foolishness,

It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …

It was was the worst of times, the best of times, it it was the age of wisdom, it was the age of foolishness, …

It it was the worst of was the best of times, times, it was the age of wisdom, it was the age of foolishness, …

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …


Greedy Reconstruction

It was the best of

of times, it was the

best of times, it was

times, it was the worst

was the best of times,

the best of times, it

it was the worst of

was the worst of times,

worst of times, it was

of times, it was the

times, it was the age

it was the age of

was the age of wisdom,

the age of wisdom, it

age of wisdom, it was

of wisdom, it was the

wisdom, it was the age

it was the age of

was the age of foolishness,

the worst of times, it

The repeated sequence makes the correct reconstruction ambiguous
• It was the best of times, it was the [worst/age]

Model sequence reconstruction as a graph problem.

Graph Theory

Graph theory studies graphs, discrete objects that make it possible to schematize a wide variety of situations and processes, and often to analyze them in quantitative and algorithmic terms.

Graph theory is a way of looking at things

A graph is a structure consisting of:
• simple objects, called vertices or nodes,
• connections between the vertices. The connections can be:
  – directed, in which case they are called arcs or paths, and the graph is called a directed graph
  – undirected, in which case they are called edges, and the graph is called an undirected graph
• optionally, data associated with nodes and/or connections.

The information structure of Wikipedia

The Königsberg bridge problem

Königsberg is crossed by the river Pregel and its tributaries and has two large islands, which are connected to each other and to the two main areas of the city by seven bridges.

Over the centuries the question has repeatedly been raised whether it is possible to take a walk along a route that crosses every bridge once and only once and returns to the starting point.

In 1736 Leonhard Euler tackled this problem, proving that the proposed walk was not possible.

Euler deserves credit for formulating the problem in terms of graph theory, abstracting from the specific situation of Königsberg. First, he eliminated all incidental aspects except the urban areas bounded by the river branches and the bridges connecting them; second, he replaced each urban area with a point, now called a vertex or node, and each bridge with a line segment, called an edge, arc or link.

Euler represented the layout of the seven bridges by joining the four large areas of the city with as many lines, as in the first image. Note that three bridges leave (and arrive at) each of the nodes A, B and D, while five leave node C. These are the degrees of the nodes: 3, 3, 5, 3 respectively. Before reaching a conclusion, Euler considered different arrangements of areas and bridges (nodes and links): with four nodes and four bridges it is possible to start, for example, from A and return to it crossing every bridge once and only once. The degree of each node is an even number. If instead one starts from A to arrive at D, every node has even degree except two nodes of odd degree (one). On the basis of these observations, Euler stated the following theorem:

Any connected graph can be traversed if and only if all its nodes have even degree, or exactly two of them have odd degree; to traverse a "possible" graph with two odd-degree nodes, one must start from one of them, and the walk will end at the other odd node.
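Euler's degree argument is easy to reproduce. The sketch below assumes the classic bridge layout: node C is the island with five bridges, while the assignment of the labels A, B and D to the remaining areas is a guess made only to match the degrees 3, 3, 5, 3 quoted above. It also assumes the graph is connected; `degrees` and `classify` are illustrative names.

```python
from collections import Counter


def degrees(edges):
    """Count how many edge ends touch each node of an undirected multigraph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def classify(edges):
    """Apply Euler's theorem, assuming the graph is connected."""
    odd = [node for node, d in degrees(edges).items() if d % 2 == 1]
    if not odd:
        return "closed walk possible"
    if len(odd) == 2:
        return "open walk possible"
    return "no walk"


# Seven bridges; the exact labelling of A, B, D is an assumption.
koenigsberg = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
               ("A", "D"), ("B", "D"), ("C", "D")]
square = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]  # four nodes, four bridges

print(dict(degrees(koenigsberg)))
print(classify(koenigsberg))
print(classify(square))
```

All four Königsberg nodes have odd degree, so no walk exists, whereas the four-node, four-bridge arrangement has only even degrees and admits a closed walk.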

Overlap-Layout-Consensus and de Bruijn: Graph Theory!!!

Overlap-Layout-Consensus

• Create overlap graph by all-vs-all alignment
• Contigs created based on overlap
• In the graph each node is a read, edges are overlaps between reads
• Layout: Hamiltonian path (visit each node exactly once)
• Computationally hard problem

Overlap-Layout-Consensus

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs

Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..)

Overlap

• Find the best match between the suffix of one read and the prefix of another
• Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment
• Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring

Overlapping Reads

• Sort all k-mers in reads (k ~ 24)
• Find pairs of reads sharing a k-mer
• Extend to full alignment; throw away if not >95% similar

[Figure: two reads aligned on the shared k-mer TAGATTACACAGATTAC; the alignment is then extended into the flanking bases]
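The k-mer filtration step above can be illustrated with a small index. This sketch uses a hash table rather than the sorting mentioned on the slide (a swapped-in but equivalent way to group identical k-mers), and it stops before the alignment-extension step; `candidate_pairs` and the toy reads are made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=8))  # reads 0 and 1 share the 8-mer TTACACAG
```

Only the candidate pairs would then be passed to the more expensive full alignment and the >95% similarity check.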

What is a k-mer?

Overlapping Reads and Repeats

• A k-mer that appears N times initiates N² comparisons
• For an Alu that appears 10⁶ times → 10¹² comparisons – too much
• Solution: discard all k-mers that appear more than t × Coverage (t ~ 10)

Alu elements are the most abundant transposable elements in the human genome
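The filtering rule above amounts to a count threshold. A minimal sketch, assuming the coverage is already known and passed in as a number; `non_repetitive_kmers` and the toy reads are hypothetical.

```python
from collections import Counter


def non_repetitive_kmers(reads, k, coverage, t=10):
    """Keep only k-mers that occur at most t * coverage times across all reads."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    limit = t * coverage
    return {kmer for kmer, count in counts.items() if count <= limit}


# Toy data: "TTTT" plays the role of a highly repeated element.
reads = ["ACGT"] * 3 + ["TTTT"] * 25
print(non_repetitive_kmers(reads, k=4, coverage=2))  # limit is 10 * 2 = 20

# The quadratic blow-up quoted above: 10^6 copies give 10^12 comparisons.
print((10 ** 6) ** 2 == 10 ** 12)
```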

Finding Overlapping Reads

• Create local multiple alignments from the overlapping reads

  TAGATTACACAGATTACTGA
  TAGATTACACAGATTACTGA
  TAG TTACACAGATTATTGA
  TAGATTACACAGATTACTGA
  TAGATTACACAGATTACTGA

Finding Overlapping Reads (cont'd)

• Correct errors using multiple alignment
  – Mismatched column, base calls with quality scores: C: 20, C: 35, T: 30, C: 35, C: 40
    corrected to: C: 20, C: 35, C: 0, C: 35, C: 40
  – Gapped column, base calls with quality scores: A: 15, A: 25, A: 40, A: 25, -
    corrected to: A: 15, A: 25, A: 40, A: 25, A: 0
• Score alignments
• Accept alignments with good scores
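Error correction from a multiple alignment can be sketched as a column-wise majority vote. This simplification ignores the quality scores shown above and assumes the reads are already aligned, with "-" marking a gap; `consensus` is an illustrative name.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority base per alignment column; columns won by a gap are dropped."""
    result = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            result.append(base)
    return "".join(result)


alignment = [
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",  # one missing base and one T/C miscall
    "TAGATTACACAGATTACTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(alignment))
```

A quality-aware variant would weight each vote by its score instead of counting every read equally.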

Layout

• Repeats are a major challenge
• Do two aligned fragments really overlap, or are they from two copies of a repeat?
• Solution: repeat masking – hide the repeats!!!
• Masking results in high rate of misassembly (up to 20%)
• Misassembly means a lot more work at the finishing step

Repeats, Errors, and Contig Lengths

• Repeats shorter than read length are OK
• Repeats with more base pair differences than sequencing error rate are OK
• To make a smaller portion of the genome appear repetitive, try to:
  – Increase read length
  – Decrease sequencing error rate


De  Bruijn  graph  assembly  


de Bruijn Graph Construction

• Dk = (V,E)
• V = All length-k subfragments
• E = Directed edges between consecutive subfragments
• Nodes overlap by k-1 words
• Locally constructed graph reveals the global sequence structure
• Overlaps between sequences implicitly computed

Original fragment: It was the best of
Directed edge: It was the best → was the best of

de Bruijn, 1946; Idury and Waterman, 1995; Pevzner, Tang, Waterman, 2001
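The construction above can be tried directly on the book fragments. A minimal sketch, assuming fragments are split into words on whitespace and that k is counted in words; `de_bruijn` is an illustrative name.

```python
from collections import defaultdict


def de_bruijn(fragments, k):
    """Nodes are length-k word windows; edges join consecutive windows of a fragment."""
    graph = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        nodes = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
        for a, b in zip(nodes, nodes[1:]):
            graph[a].add(b)  # consecutive nodes overlap by k-1 words
    return graph


fragments = [
    "It was the best of",
    "of times, it was the",
    "times, it was the worst",
    "times, it was the age",
]
graph = de_bruijn(fragments, k=4)
for node, successors in graph.items():
    print(node, "->", sorted(successors))
```

The node "times, it was the" gets two successors, which is exactly the [worst/age] ambiguity caused by the repeated phrase. No pairwise overlap between fragments is ever computed.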

• Can this really work?
• How do we choose a value for k?
  – Needs to be big enough to be unique
  – But repeats make it impossible to use such a large k, because entire reads are not unique
  – So pick k to be "big enough"

No need to compute overlaps!


de Bruijn Graph Assembly

[Figure: de Bruijn graph with nodes "It was the best", "was the best of", "the best of times,", "best of times, it", "of times, it was", "times, it was the", "it was the worst", "was the worst of", "the worst of times,", "worst of times, it", "it was the age", "was the age of", "the age of wisdom,", "age of wisdom, it", "of wisdom, it was", "wisdom, it was the", "the age of foolishness"]

A unique Eulerian tour of the graph reconstructs the original text

If a unique tour does not exist, try to simplify the graph as much as possible
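Finding an Eulerian tour is computationally cheap. The sketch below uses Hierholzer's algorithm, named here as one standard way to do it rather than as the method of any particular assembler. It assumes that an Eulerian path exists and that the graph is given as a list of directed edges; `eulerian_path` is an illustrative name.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return a node sequence that uses every directed edge exactly once.

    Assumes such a path exists (Hierholzer's algorithm does not check this).
    """
    outgoing = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        outgoing[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if outgoing[node]:
            stack.append(outgoing[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())            # dead end: emit the node
    return path[::-1]


print(eulerian_path([("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]))
```

In this toy graph the path must go around the A → B → C → A cycle before leaving through A → D. When a repeat creates several valid tours, the algorithm returns just one of them, which is why the graph is simplified first.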

de Bruijn Graph Assembly

[Figure: simplified graph with nodes "It was the best of times, it", "of times, it was the", "it was the worst of times, it", "it was the age of", "the age of wisdom, it was the", "the age of foolishness"; two points in the graph are labelled 1 and 2]

Example

TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

Reads:
AGTCGAG CTTTAGA CGATGAG CTTTAGA
GTCGAGG TTAGATC ATGAGGC GAGACAG
GAGGCTC ATCCGAT AGGCTTT GAGACAG
AGTCGAG TAGATCC ATGAGGC TAGAGAA
TAGTCGA CTTTAGA CCGATGA TTAGAGA
CGAGGCT AGATCCG TGAGGCT AGAGACA
TAGTCGA GCTTTAG TCCGATG GCTCTAG
TCGACGC GATCCGA GAGGCTT AGAGACA
TAGTCGA TTAGATC GATGAGG TTTAGAG
GTCGAGG TCTAGAT ATGAGGC TAGAGAC
AGGCTTT ATCCGAT AGGCTTT GAGACAG
AGTCGAG TTAGATT ATGAGGC AGAGACA
GGCTTTA TCCGATG TTTAGAG
CGAGGCT TAGATCC TGAGGCT GAGACAG
AGTCGAG TTTAGATC ATGAGGC TTAGAGA
GAGGCTT GATCCGA GAGGCTT GAGACAG

       Velvet / Curtain

Velvet / Curtain 09.03.12 39

Example

Read: GTCGAGG
GTCG (1x) → TCGA (1x) → CGAG (1x) → GAGG (1x)

New read: CGAGGCT
GTCG (1x) → TCGA (1x) → CGAG (2x) → GAGG (2x) → AGGC (1x) → GGCT (1x)

New read: TCGACGC
GTCG (1x) → TCGA (2x) → CGAG (2x) → GAGG (2x) → AGGC (1x) → GGCT (1x)
TCGA (2x) → CGAC (1x) → GACG (1x) → ACGC (1x)
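The step-by-step build above can be reproduced with a k-mer counter. This is a sketch of the general idea for k = 4, not of Velvet's actual data structures; `kmer_graph` is an illustrative name.

```python
from collections import Counter, defaultdict


def kmer_graph(reads, k):
    """Return (coverage per k-mer, successors per k-mer) for a set of reads."""
    coverage = Counter()
    successors = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        coverage.update(kmers)              # each occurrence adds 1x
        for a, b in zip(kmers, kmers[1:]):
            successors[a].add(b)            # consecutive k-mers share k-1 bases
    return coverage, successors


coverage, successors = kmer_graph(["GTCGAGG", "CGAGGCT", "TCGACGC"], k=4)
for kmer, count in coverage.items():
    print(f"{kmer} ({count}x)")
print(sorted(successors["TCGA"]))  # the branch introduced by the third read
```

After the third read the node TCGA has two successors, CGAG and CGAC. The low-coverage branch through CGAC is the kind of structure that later simplification steps remove.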

Example, etc…

k-mer nodes with their coverage after all reads have been added:
AGAT (8x), ATCC (7x), TCCG (7x), CCGA (7x), CGAT (6x), GATG (5x), ATGA (8x), TGAG (9x), GATC (8x), GATT (1x), TAGT (3x), AGTC (7x), GTCG (9x), TCGA (10x), GGCT (11x), TAGA (16x), AGAG (9x), GAGA (12x), GACA (8x), ACAG (5x), GCTT (8x), GCTC (2x), CTTT (8x), CTCT (1x), TTTA (8x), TCTA (2x), TTAG (12x), CTAG (2x), AGAC (9x), AGAA (1x), CGAG (8x), CGAC (1x), GAGG (16x), GACG (1x), AGGC (16x), ACGC (1x)

Example: after simplification…
Nodes: TAGTCGA, AGAGA, TAGA, AGAT, GCTTTAG, GCTCTAG, AGACAG, AGAA, CGAG, CGACGC, GAGGCT, GATCCGATGAG, GATT, GGCT

Example: tips removed…
Nodes: TAGTCGA, AGAGA, TAGA, AGAT, GCTTTAG, GCTCTAG, AGACAG, CGAG, GAGGCT, GATCCGATGAG, GGCT

Example: bubbles removed… by TourBus
Nodes: TAGTCGA, AGAGA, TAGA, AGAT, GCTTTAG, AGACAG, CGAG, GAGGCT, GATCCGATGAG, GGCT

Example: final simplification…
Nodes: TAGTCGAG, GAGGCTTTAGA, AGATCCGATGAG, AGAGACAG

One possible walk through the graph ...
TAGTCGAG → GAGGCTTTAGA → AGATCCGATGAG → GAGGCTTTAGA → AGAGACAG
which spells TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
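The walk through the simplified graph can be turned back into a sequence by gluing consecutive nodes on their shared k-1 bases. A minimal sketch for k = 4 (so a 3-base overlap); `spell_walk` is an illustrative name.

```python
def spell_walk(nodes, k):
    """Concatenate the nodes of a walk, collapsing the (k-1)-base overlaps."""
    sequence = nodes[0]
    for node in nodes[1:]:
        if sequence[-(k - 1):] != node[:k - 1]:
            raise ValueError(f"{node} does not overlap the walk by {k - 1} bases")
        sequence += node[k - 1:]
    return sequence


walk = ["TAGTCGAG", "GAGGCTTTAGA", "AGATCCGATGAG", "GAGGCTTTAGA", "AGAGACAG"]
print(spell_walk(walk, k=4))
```

The repeated node GAGGCTTTAGA is visited twice, which is how the walk accounts for the repeat in the original sequence.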

We have now created a draft assembly made of contigs, but this is not sufficient to understand the characteristics of a genome

[Figure: Reads → 'De Bruijn' assembly → Contigs → Scaffolds]

To go further we have to talk about paired-end sequencing technology

Paired-end Sequencing

Scaffolding

[Figure: Reads → 'De Bruijn' assembly → Contigs → Scaffolds (an assembly)]

• Align reads to De Bruijn contigs
• Join contigs using evidence from paired-end data
• "Captured" gaps caused by repeats are represented by "NNN" in the assembly

SUPERSCAFFOLDING!!!

A "real" protocol

1. Retrieve reads
2. Quality check of reads
3. Trimming and filtering
4. Assembly
5. Using paired-end for scaffolding
6. Check the genome quality

[Figure: assembly workflow with the boxes Reads, Overlap, Local Multiple Alignment, Contigs, Scaffolding, Alignment Scoring, Finishing]

Assembly Problems: -Repeats

-Chimerism

-Gaps

Important Assembler Metrics

How can we assess the quality of a genome? How can we understand if we performed a good assembly?

• Number of large contigs
• Total size
• Coverage
• Average length
• N50
• Longest contig
• % genome assembled
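Of the metrics above, N50 is the least self-explanatory: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly size. A minimal sketch with made-up contig lengths; `n50` is an illustrative name.

```python
def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return None  # empty input


contig_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical values, total 290
print(n50(contig_lengths))  # 80 + 70 = 150 >= 145, so N50 is 70
```

The same function applies to scaffold lengths, which gives the scaffold N50 reported for published genomes.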

Species | Genome size (Mb) | N50 scaffold index | N50 scaffold size (Mb) | # scaffolds | N50 contig size (Kb) | Sequencing technology | Reference
Melon | 450 | 26 | 4,678 | 1,594 | 18.2 | 454, Sanger | this report
Potato | 844 | 121 | 1,782 | 2,043 | 31,4 | Illumina, 454, Sanger | The Potato Genome Sequencing Consortium 2011
Apple | 743 | 102 | 1,542 | 1,629 | 13.4 | Sanger, 454 | Velasco et al 2010
Fragaria | 240 | n.a. | 1,361 | 3,263 | n.a. | 454, Illumina, SOLiD | Shulaev et al 2011
Cucumber | 367 | 59 | 1,144 | 47,837 | 19.8 | Illumina, Sanger | Huang et al 2009
Brassica rapa | 529 | n.a. | 1,97 | n.a. | 27.3 | Illumina | The Brassica rapa Genome Sequencing Project Consortium 2011
Cacao | 430 | 178 | 0,47 | 4,792 | 19,8 | 454 | Argout et al 2011
Date palm | 658 | n.a. | 0,03 | 57,277 | 6.4 | Illumina | Al-Dous et al 2011