MCB3895-004 Lecture #9 Sept 23/14 Illumina library preparation, de novo genome assembly.
De novo whole genome assembly - Cornell...
Transcript of De novo whole genome assembly - Cornell...
![Page 1: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/1.jpg)
De novo whole genome assembly
Lecture 1
Qi Sun
Bioinformatics FacilityCornell University
![Page 2: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/2.jpg)
DNA Sequencing Platform 1
Illumina sequencing (100 to 300 bp reads)
Paired-end reads
Mate pair reads>5kb fragment
<500bp fragment
![Page 3: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/3.jpg)
PacBio long-read (10 – 15 kb reads)
10-15 kb
1. Long contigs;
2. Resolve paralogs and heterozygous regions
DNA Sequencing Platform 2
![Page 4: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/4.jpg)
NNNScaffolds
Contigs
Illumina Reads Assembly Strategy
Paired-end reads
Mate-pair reads
NNNN
Contiging
Scaffolding
![Page 5: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/5.jpg)
SOAP de novo: easy to run; relatively fast
Discovar: require longer reads (>=250bp);
MaSuRCA: high quality assembly; slow and memory
intensive
Software for short-read assembly
Easy
difficult
![Page 6: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/6.jpg)
Two categories of contiging strategies
overlap–layout–consensus de-bruijn-graph
source: http://www.homolog.us/Tutorials/index.php
Velvet, Soap-denovo, Abyss, Trinity et alCanu, Falcon, Celera et al.
Long reads Short reads
![Page 7: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/7.jpg)
de-bruijn-graph for contiging short reads
source: http://www.homolog.us/Tutorials/index.php
Kmers
![Page 8: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/8.jpg)
GGATGGAAGTCG…………… CGATGGAAGGATParalogous regions
GATGGAA ATGGAAGTGGAAGT
TGGAAGG
GGATGGAAGT
CGATGGAAGG
GATGGAAGT
GATGGAAGG
7merGGATTGGA
CCGATTGGA
9mer
Kmer Paths in paralogous regions
![Page 9: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/9.jpg)
Sequencing errors and
Tips, bubbles and crosslinks
source: http://www.homolog.us/Tutorials/index.php
Tips
Bubbles
Crosslinks
![Page 10: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/10.jpg)
Deal with sequencing errors and repetitive regions
1. Sequencing errors• Remove low depth kmers in a bubble;• Too long kmers would cause coverage problem;
2. Repetitive region• Longer kmers
Tiny repeat:Separate the path
Break boundary between low and high copy regions
![Page 11: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/11.jpg)
Read depth vs Kmer depth
Read depth: number of reads at a genome position
Kmer depth: number of occurrence of an identical kmer
Read depth: 4
Kmeroccurance: 3
![Page 12: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/12.jpg)
Impact of kmer-length (2)Read depth vs Kmer depth
30mer80mer
Kmer depth =3 Kmer depth =1
Longer kmers could results in too little kmer depth
Read depth: 3
![Page 13: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/13.jpg)
AATTTGTGAATGGCT 1ATGCAAGACCAATTG 128ACAAATAGGGGTAGT 1AACACAAAATAATAA 1GCTGTAAGCTTTAAC 1AACCTATGAGTGAAA 1ATTCGTAGATCTTCC 101CAGGTATCAGGCATG 146TATCAGCTGGAGTCA 60ACCAGAGATATAAAA 1GGTCCATTTTGTTTA 1GTCAAAAAATTACGA 1ACATTCCCTCGGTAA 1AGCATTTCTTAACAC 1TAAAAACCATCTTTA 137ATTCGATTTGTCAAG 62ACATTGGAAAAAACT 1AATGGAAATACAATA 2…CATTCGTTAGATCAA 1…CATTCGTGAGATCAA 128…CCTTGTCATTAACTC 667…
Kmer counting from sequencing reads
(15mers)
15mer
31mer
Kmer depth distribution plots
![Page 14: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/14.jpg)
Heterozygous genomes
Experimentally:
• Create inbred lines or haploid cell culture.• Assembly of clonal fragments, merging allelic regions.
Assembly
• SNP: merge bubbles• Highly polymorphic regions
Highly polymorphic regions
Kmer coverage distribution
![Page 15: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/15.jpg)
Examine an assembly software 1. Soap denovo
![Page 16: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/16.jpg)
Scaffolding strategies 1. Mate-pair reads
Contigs
Illumina mate pair data: 5kb, 8kb or 15kb inserts
Software: Soap denovo, et al
![Page 17: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/17.jpg)
Schematic overview of the Platanus algorithm.
Rei Kajitani et al. Genome Res. 2014;24:1384-1395
© 2014 Kajitani et al.; Published by Cold Spring Harbor Laboratory Press
Examine an assembly software 2. Platanus
![Page 18: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/18.jpg)
Scaffolding strategies 2. Illumina Moleculo and 10X
• Scaffolding contigs• Phasing
Illumina Moleculo
![Page 19: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/19.jpg)
Scaffolding strategies 3. Physical maps
Hi-C: Sequence cross-linked and ligated DNA fragments
BioNano:Generating high-quality genome maps by labeling specific 7-mer nickaserecognition sites in a genome with a single-color fluorophore
![Page 20: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/20.jpg)
Eucaryotic genome assembly of in 2016
Small and Inbred Highly heterozyougs/Polyploidy
Illumina paired-end + mate-pair PacBio long readsOR
Scaffolding by physical mapping• BioNano• Hi-C• 10X
Pseudo-chromsoomes through genetics mapping
![Page 21: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/21.jpg)
/programs/jellyfish-2.1.4/bin/jellyfish count -s 1G -m 21 -t 8 -o kmer -C SRR1554178.fastq
/programs/jellyfish-2.1.4/bin/jellyfish dump -c -t kmer > kmer21.txt
CATTATTGGAGTTGATGAAAA 134CCGCCACCACCACCACCGCCA 52TAATTAATTTCAAATTTATAA 1CCCTGTAGAAGCCTGTGTACC 1AAAAAAAAATTTGTAGGGGGG 79GGCATCTAGGTCATGTATTTA 3AAATGAAAGTACATCAGCTTA 1AAAATTTGTTCTACCACCGCC 2CGAGGAGAATTGGGAAAAAAA 103CTATTTGTGGAATCATAGATC 1CATCACAAGAACCTATAGAAG 1CGCCAACCATCCATTATCAAA 1AAAGTATCGCCAACTTTTAAT 1CAAAATGAATTTCTTATTTTC 76GGAAATATGGAGTAGTTACCA 1CTGATTTCTCTCTCTATCTCC 1ATTTATCTCGCAAATTTTAGA 1CAAAGAGAGAATCACAACAAA 86……
awk '{print $2}' kmer21.txt |sort -n |uniq -c \| awk '{print $2 "\t" $1}'> histogram
1 147475392 7989193 1311604 391505 192026 117557 87348 67529 579510 467911 384812 323213 297814 281915 265516 2373
Exercise: Estimate genome size based on kmer distribution
![Page 22: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/22.jpg)
100 60272101 64317102 68183103 72164104 75022105 78876106 81907107 85077108 88059109 89946110 91659111 91950112 92535113 91935114 91286115 90194116 87616117 85215118 82041119 78751120 75306
Peak kmerdepth: 112
Exercise: Estimate genome size based on kmer distribution
![Page 23: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/23.jpg)
Estimate genome size based on kmer distribution
N = M* L/(L-K+1)
Step 1: convert read depth to kmer depth
M: kmer depth = 112L: read length = 101 bpK: Kmer size =21 bpN: read depth =140
Step 2: genome size is total sequenced basepairsdevided by read depth
Genome size = T/NT: total base pairs = 0.505 gbN: read depth = 140Genome size: 3.6 mb
75 merRead depth = 3Kmer depth = 2
![Page 24: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/24.jpg)
Estimate genome size based on kmer distribution
Total kmer bases: 387 mbKmer depth: 113Genome size: 3.4 mb
data <- read.table(file="histogram",header=F)poisson_expeact_K=function(file,start=3,end=200,step=0.1){diff<-1e20min<-1e20pos<-0total<-sum(as.numeric(file[start:dim(file)[1],1]*file[start:dim(file)[1],2]))singleCopy_total <- sum(as.numeric(file[10:500,1]*file[10:500,2]))for (i in seq(start,end,step)){
singleC <- singleCopy_total/ia<-sum((dpois(start:end, i)*singleC-file[start:end,2])^2)if (a < diff){
pos<-idiff<-a
}}print(paste("Total kmer bases: ", total))print(paste("Kmer depth: ", pos))print(paste("Genome size: ", total/pos))pdf("kmerplot.pdf") plot(1:200,dpois(1:200, pos)*singleC, type = "l", col=3, lwd=2, lty=2,
ylim=c(0,150000), xlab="K-mer coverage",ylab="K-mer counts frequency")lines(file[1:200,],type="l",lwd=2)dev.off()
}poisson_expeact_K(data)
R code to fit Poisson distribution
![Page 25: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/25.jpg)
Using short kmer could underestimate genome size (<20)Low kmer depth could overestimate genome size (<20) - Minghui Wang
![Page 26: De novo whole genome assembly - Cornell Universitycbsuss05.tc.cornell.edu/doc/assembly2016_lecture1.pdfDe novo whole genome assembly Lecture 1 Qi Sun Bioinformatics Facility Cornell](https://reader034.fdocuments.us/reader034/viewer/2022043009/5f9a0931d5c4042c1742f4ec/html5/thumbnails/26.jpg)
Exercise: Run kmer analysis tools to estimate genome size
/programs/jellyfish-2.2.3/bin/jellyfish count -s 1G -m 21 -t 4 -o kmer -C SRR1554178.fastq
/programs/jellyfish-2.2.3/bin/jellyfish dump -c -t kmer > kmer21.txt
You will be provided with a Fastq file. Using Jellyfish to get the kmer count, then estimate the kmer depth