Hard assembly Jan Pačes Institute of Molecular Genetics AS CR.

Page 1:

hard assembly

Jan Pačes

Institute of Molecular Genetics AS CR

Page 2:

problems

genomes:

high GC content

repetitions (short, with low informational content; long)

polymorphic

"unreadable" sequences, "weird" structures

technologies:

nonrandom libraries

wrong sizes

erroneous or chimeric reads

Page 3:

sequencing technologies

ABI (Sanger)

454 (pyrosequencing)

Solexa (reversible terminator)

SOLiD (2-base ligation)

PacBio (SMRT)

Page 4:

example of errors in one technology

http://chevreux.org/mira_ex_454sanger.html

Page 5:

Aird et al. Genome Biology 2011

high GC regions are underrepresented
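
To see which regions of a sequence are GC-rich enough to be at risk of under-coverage, GC content can be computed in sliding windows. The following is a minimal Python sketch; the function names `gc_fraction` and `gc_windows` and the default window and step sizes are illustrative assumptions, not taken from Aird et al.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences made only of ambiguous bases (e.g. N) get 0.0 by convention.
    return gc / total if total else 0.0


def gc_windows(seq, window=100, step=50):
    """Yield (start, gc_fraction) for sliding windows along seq.

    A sequence shorter than the window yields a single, shorter window.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_fraction(seq[start:start + window])
```

Plotting read depth against the per-window GC fraction is one simple way to check whether a given library shows this kind of bias.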

Page 6:

Aird et al. Genome Biology 2011

protocol optimization for high GC content

Page 7:

repetitions

scaffold

repetition

Page 8:

repetitions

Page 9:

recognition of repetitions

MIRA http://sourceforge.net/projects/mira-assembler/

MaSuRCA http://www.genome.umd.edu/masurca.html

SPAdes http://bioinf.spbau.ru/spades

RepeatMasker http://www.repeatmasker.org/

RepeatModeler (RECON and RepeatScout) http://www.repeatmasker.org/RepeatModeler.html

position-aware assemblers

Page 10:

k-mer distribution
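
A k-mer distribution (k-mer spectrum) records how many distinct k-mers occur once, twice, three times, and so on in the reads. The sketch below shows the idea in plain Python. It is a simplification for illustration only: it counts forward-strand k-mers in memory, whereas dedicated k-mer counters typically merge each k-mer with its reverse complement and are built for much larger data. The function names are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads (forward strand only, skipping N)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map multiplicity -> number of distinct k-mers with that multiplicity."""
    return Counter(counts.values())
```

In such a spectrum, k-mers from unique genomic sequence tend to cluster around the sequencing depth, sequencing errors pile up at very low multiplicities, and repetitions show up at multiples of the main peak.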

Page 11:

k-mer analysis

JELLYFISH: fast, parallel k-mer counting for DNA http://www.cbcb.umd.edu/software/jellyfish/

Quake: a package to correct substitution sequencing errors in experiments with deep coverage http://www.cbcb.umd.edu/software/quake/

KHMER: trims off likely erroneous k-mers https://khmer-protocols.readthedocs.org/en/v0.8.2/
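
The idea of trimming at likely erroneous k-mers can be sketched as follows: count k-mers over all reads, then cut each read just before the first k-mer whose count falls below a cutoff. This is a deliberately simplified illustration and not the algorithm implemented by any of the tools listed above; the function names, `k`, and `cutoff` are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count forward-strand k-mers over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trim_low_abundance(read, counts, k, cutoff):
    """Truncate read just before the first k-mer seen fewer than cutoff times."""
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < cutoff:
            # Keep only the bases covered by the preceding trusted k-mers.
            return read[:i + k - 1] if i > 0 else ""
    return read


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
trimmed = [trim_low_abundance(r, counts, k=4, cutoff=2) for r in reads]
```

Here the last read carries a k-mer seen only once and is cut back to its trusted prefix, while the three concordant reads are left unchanged.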

Page 12:

repetitions

scaffold

repetition

Page 13:

filling gaps

GapCloser (part of SOAPdenovo) http://soap.genomics.org.cn/soapdenovo.html

GapFiller (part of SSPACE) http://www.baseclear.com/lab-products/bioinformatics-tools/gapfiller/

GapFiller http://sourceforge.net/projects/gapfiller/
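
Gaps in scaffolds are conventionally written as runs of N, so before and after running a gap-filling tool it can be useful to list where the gaps are and how long they are. A minimal sketch, assuming the scaffold is available as a plain string; `find_gaps` and `min_len` are illustrative names, not part of any of the tools above.

```python
import re


def find_gaps(scaffold, min_len=1):
    """Return half-open (start, end) coordinates of N runs in a scaffold."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]
```

Comparing the gap lists before and after gap filling shows which gaps were closed completely and which were only shortened.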

Page 14:

454 multiplicates
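
Assuming that "multiplicates" refers to artificially replicated reads, that is, reads that start at the same position and carry nearly the same sequence, one simple way to collapse them is to group reads by their leading bases and keep one representative per group. The sketch below is a hypothetical illustration of that idea only; `collapse_prefix_duplicates` and `prefix_len` are assumed names, and real replicate filters also tolerate mismatches.

```python
def collapse_prefix_duplicates(reads, prefix_len=20):
    """Keep the longest read for each distinct leading prefix_len bases."""
    best = {}
    for read in reads:
        key = read[:prefix_len].upper()
        if key not in best or len(read) > len(best[key]):
            best[key] = read
    return list(best.values())
```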

Page 15:

contig coverage by large libraries

Page 16:

Illumina paired-end (PE) and mate-pair libraries

Page 17:

highly polymorphic genomes

scaffold

two copies of polymorphic contigs

Page 18:

polymorphic assembly workflow

normal assembly

condensing alternative contigs

mapping to identify SNPs

"repair" reads

second "polymorphic" assembly

http://www.fishbrowser.org/software/L_RNA_scaffolder
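
One possible reading of the "repair" step in the workflow above is that, once SNPs have been identified by mapping, each read is rewritten so that it carries a single chosen allele at every SNP position, which keeps the second assembly from splitting into two haplotype copies. The sketch below illustrates only that interpretation. It assumes ungapped, forward-strand alignments, and `repair_read`, `ref_start`, and `snps` are hypothetical names rather than part of any tool named in this talk.

```python
def repair_read(read, ref_start, snps):
    """Overwrite SNP positions in an ungapped, forward-strand aligned read.

    ref_start is the 0-based contig position of the read's first base;
    snps maps 0-based contig positions to the allele to enforce.
    """
    bases = list(read)
    for pos, allele in snps.items():
        offset = pos - ref_start
        # Only SNPs that fall inside this read are applied.
        if 0 <= offset < len(bases):
            bases[offset] = allele
    return "".join(bases)
```

A real implementation would additionally have to handle indels, reverse-strand alignments, and base qualities.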

Page 19:
Page 20:

G-quadruplex
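
Putative G-quadruplex-forming sequences are commonly predicted with a simple motif: four runs of at least three G separated by loops of one to seven bases. The sketch below scans both strands for that motif, since a C-rich stretch on one strand is G-rich on the other. It is a rough motif search for illustration, not a structural prediction; `find_g4` and `reverse_complement` are assumed helper names.

```python
import re

# Four runs of >= 3 G separated by loops of 1-7 bases.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence (other symbols are kept)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_g4(seq):
    """Return (strand, start, end) for putative G4 motifs on both strands.

    Coordinates are half-open and always refer to the input strand.
    """
    seq = seq.upper()
    hits = [("+", m.start(), m.end()) for m in G4_PATTERN.finditer(seq)]
    n = len(seq)
    for m in G4_PATTERN.finditer(reverse_complement(seq)):
        # Convert reverse-strand coordinates back to the input strand.
        hits.append(("-", n - m.end(), n - m.start()))
    return hits
```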

Page 21:

AGCGACCCCCCCCCACCACCGCCACCACCACCTCTGCCATTGGCCGCCGCCGCCCCCCCCCCATTAAACCCCCCCACCCCCCCCCGCGCTGCCCCCTCCCCGGTGG

Chicken p53 – coverage from RNAseq data

Coverage > 13,000X

Page 22:

CCCGCCCACCCCCACCCCCACCCGCACCCCCCACTCTCCCACCCCCACCCCCTTTTCTCCCACCCCCTCTTCTCCCACCCCCTTTTCCCCCCCTTCCTCCCCCCACTCCGCCCCCCCCCCGCCCCCTCCCCCCCCCCAGGTGAGGACCCT

Chicken erythropoietin (EPO) – coverage from RNAseq data

Coverage > 500X from RNAseq

(*the EPO locus could not be completed even from 1000X-coverage genomic Illumina data!)

Page 23:

chicken missing genes

Page 24:

that’s it, thank you

many thanks also to:

Daniel Elleder, Tomáš Hron, Michal Kolář, Hynek Strnad