Jen Taylor Bioinformatics Team CSIRO Plant Industry
Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor
Bioinformatics Team
CSIRO Plant Industry
CSIRO. Newton Meeting July 2010 - Sequence coverage
Assumptions
• Every k-mer has equal chance of being sequenced
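As a rough illustration of this assumption, the sketch below draws read start positions uniformly over every valid k-mer start in a genome and compares the expected number of reads per start with what is observed. The function name, the toy genome length and the read count are all hypothetical choices for illustration, not values from the talk.

```python
import random
from collections import Counter

def simulate_uniform_starts(genome_length, k, n_reads, seed=0):
    """Draw read starts uniformly over every valid k-mer start position."""
    rng = random.Random(seed)
    n_positions = genome_length - k + 1  # number of distinct k-mer starts
    starts = Counter(rng.randrange(n_positions) for _ in range(n_reads))
    expected = n_reads / n_positions     # mean reads per start under uniformity
    return starts, expected

# Toy values only (assumed for illustration).
starts, expected = simulate_uniform_starts(genome_length=1000, k=36, n_reads=5000)
print(f"expected reads per start: {expected:.2f}")
print(f"sampled starts: {len(starts)} of {1000 - 36 + 1}")
```

Under the uniform model, departures from the expected value are due to sampling noise alone, which is the baseline against which the biases discussed later can be judged.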
Read density
Deviations from Assumptions?
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
    • Mismatches
    • Multiple read mappings
• Hamming edit distances and k-mer space
Assumptions : Digestion
Illumina SOLiD
http://seq.molbiol.ru/sch_lib_fr.html
ChIPSeq
MNase Linker Digest
Sequence & Align
Remove Nucleosomes
ChIPSeq - Nucleosome
Sample:
MNase digested
Size fractionated
Control:
MNase digested
Random sizes
araTha9 Aligned Reads 36-mer
Monomer Composition
araTha9 Aligned Reads 5’ +/- 16 bp
Monomer Composition
MNase Site Preferencing
Flick et al., J. Mol. Biology 1986
araTha9 Control MNase Site Preferencing
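| Sequence | Occurrences | Sequence Starts | Preference (%) |
|----------|-------------|-----------------|----------------|
| ctataggg | 499 | 245 | 49.10 |
| taataggg | 864 | 424 | 49.10 |
| gtattagg | 1044 | 253 | 24.23 |
| tctttgct | 4902 | 425 | 8.67 |
| cacattac | 1807 | 52 | 2.88 |
| tcccagac | 695 | 20 | 2.88 |
| aaacaaca | 10083 | 159 | 1.58 |
| acacgagc | 810 | 2 | 0.25 |
| tttgtttt | 32186 | 35 | 0.19 |
| tttgcata | 4602 | 5 | 0.11 |
| ttggttta | 7671 | 1 | 0.01 |
| gaggtttt | 3926 | 0 | 0 |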
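The Preference (%) column appears to be the number of read starts at a sequence divided by the number of times that sequence occurs in the genome. That reading is an assumption, checked here against three rows of the table; `preference_percent` is a hypothetical helper name.

```python
def preference_percent(occurrences, starts):
    """Share of a k-mer's genomic occurrences at which a read starts, in percent."""
    if occurrences == 0:
        return 0.0
    return 100.0 * starts / occurrences

# (sequence, occurrences, sequence starts) copied from the table above.
rows = [
    ("ctataggg", 499, 245),
    ("gtattagg", 1044, 253),
    ("tctttgct", 4902, 425),
]
for seq, occ, st in rows:
    print(seq, round(preference_percent(occ, st), 2))
```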
ChIPSeq
MNase Digest
Sequence & Align
Remove Nucleosomes
Nucleosome potentials – Read Density
[Figure: Normalised Read Density (0 to 1) against Base Coordinate; scale bar 1 Kb]
Nucleosome potentials
[Figure: MNase Potential and Normalised Read Density, both on a 0.00 to 1.00 scale]
Nucleosome potentials
[Figure: MNase Potential and Normalised Read Density, both on a 0.00 to 1.00 scale]
Nucleosome potential
MNase biases aiding interpretation?
• Can aid identification in a local sequence?
  • Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
  • Nucleotide composition analyses must take into account digestion preferencing
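One way to take digestion preferencing into account is to compare the per-position base composition of sample windows with that of MNase-digested control windows. The sketch below is only an assumed approach, not the method used in the talk; the function names, the pseudocount and the toy windows are hypothetical.

```python
from collections import Counter

def positional_composition(windows):
    """Per-position base frequencies for equal-length sequence windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        profile.append({b: counts[b] / len(windows) for b in "ACGT"})
    return profile

def enrichment(sample, control, pseudo=1e-6):
    """Sample-to-control frequency ratio per position and base.

    The pseudocount (an arbitrary choice) avoids division by zero.
    """
    return [
        {b: (s[b] + pseudo) / (c[b] + pseudo) for b in "ACGT"}
        for s, c in zip(sample, control)
    ]

# Toy windows around read 5' ends (invented for illustration).
sample = positional_composition(["ATAG", "ATTG", "TTAG", "ATAC"])
control = positional_composition(["ATAG", "TTAG", "CTAG", "GTAC"])
ratio = enrichment(sample, control)
print({b: round(v, 2) for b, v in ratio[0].items()})
```

A ratio near 1 at a position suggests the composition seen in the sample is already explained by the control digestion; larger departures are candidates for signal specific to the sample.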
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
    • Mismatches
    • Multiple read mappings
• Hamming edit distances and k-mer space
Hamming Edit Distances
• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
• For all possible k-mers (k = 36, 65) in the Arabidopsis genome
  • All vs. all, both strands
  • Minimum HE distance

Target Sequence:        C G T A C A T G C
Probe Sequence:         C G T T C A G G C
Substitution Required:  N N N Y N N Y N N
Hamming distance: 2
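The definition above can be sketched in a few lines. The example reuses the target and probe sequences from the slide; the function name `hamming` is just a convenient label.

```python
def hamming(a, b):
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

target = "CGTACATGC"
probe = "CGTTCAGGC"
# Mark each position Y where a substitution is required, N otherwise.
flags = "".join("Y" if x != y else "N" for x, y in zip(target, probe))
print(flags, hamming(target, probe))  # NNNYNNYNN 2
```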
Arabidopsis Minimum Hamming Edit Distances 36mer
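A minimum Hamming edit distance per k-mer, all vs. all on both strands, could be computed along the lines of the brute-force sketch below. This is an assumption about the calculation, practical only for toy sequences; a genome-scale run over 36-mers would need indexing that is not shown here. Skipping the k-mer's own locus on both strands is also an assumed convention.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def min_hamming_per_kmer(genome, k):
    """For each k-mer start, the smallest Hamming distance to any other
    k-mer locus in the genome, considering both strands (O(n^2 * k))."""
    n = len(genome) - k + 1
    fwd = [genome[i:i + k] for i in range(n)]
    rev = [reverse_complement(s) for s in fwd]
    result = []
    for i, query in enumerate(fwd):
        best = k  # upper bound: every position differs
        for j in range(n):
            if j == i:
                continue  # assumed convention: ignore the k-mer's own locus
            best = min(best, hamming(query, fwd[j]), hamming(query, rev[j]))
        result.append(best)
    return result

print(min_hamming_per_kmer("AAAAAC", 3))
```

A minimum distance of 0 marks a k-mer that occurs elsewhere exactly, so a read from that locus cannot be placed uniquely; larger values indicate how many mismatches an aligner could tolerate before the read becomes ambiguous.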
Alignment issues
[Figure: chart comparing hg18, dm3, araTha9, ce6 and sacCer6; axis from 0 to 14]
Alignment artefacts: aligner properties
Columns: Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread
SOAP 0-5 60
SOAP2 0-5 1 ?
Maq 1-3 2 ?
Bowtie 0-3 3 1024
Ubsalign 0-20 1024 4 5
Breakdown of sequencing run

| | Reads | Percentage |
|---|---|---|
| Total Sequences | 76,034,736 | 100% |
| Total Unique Sequences | 33,188,251 | 44% |
| Mapped to unique location | 22,807,050 | 30% |
| Failed mapping | 10,381,201 | 14% |
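The percentages in this breakdown are fractions of the total sequence count, and the mapped and failed counts add up to the unique sequence count. A short check of that arithmetic, using only the figures from the slide:

```python
counts = {
    "Total Sequences": 76_034_736,
    "Total Unique Sequences": 33_188_251,
    "Mapped to unique location": 22_807_050,
    "Failed mapping": 10_381_201,
}
total = counts["Total Sequences"]
for label, n in counts.items():
    # Express each category as a share of all sequences in the run.
    print(f"{label}: {n:,} ({n / total:.0%})")

# Unique sequences split into those that mapped and those that failed.
accounted = counts["Mapped to unique location"] + counts["Failed mapping"]
print(accounted == counts["Total Unique Sequences"])
```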
Hamming edits and Ubsalign HE difference
Read:           AGATTAGCCTGGTACTGCTA
Genome site 1:  …..AGCTTAGCCTGGTACTGGTA….   H = 2
Genome site 2:  …..AGCTTAGCCGGGTACTGGTA….   H = 3
(slide label: 2H)
No Alignment
Hamming edits and Ubsalign HE difference
Read:           AGATTAGCCTGGTACTGCTA
Genome site 1:  …..AGATTAGCCTGGTACTGCTA….   H = 0
Genome site 2:  …..AGCTTAGCCGGGTACTGCTA….   H = 2
(slide label: 2H)
No Alignment
Hamming edits and Ubsalign HE difference
Read:           AGATTAGCCTGGTACTGCTA
Genome site 1:  …..AGCTTAGCCTGGTACTGCTA….   H = 1
Genome site 2:  …..AGCTTAGCCGGGTTCTGGTA….   H = 4
(slide label: 2H)
Alignment !
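The three examples are consistent with a rule that accepts the best site only when the runner-up is sufficiently worse: differences of 1 and 2 give no alignment, while a difference of 3 gives an alignment. The sketch below encodes that reading. The threshold `min_difference=3` is inferred from these three slides alone and is an assumption, as is the function name; it is not a description of how Ubsalign is implemented.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def accept_by_he_difference(read, candidates, min_difference=3):
    """Return the index of the best candidate site, or None if the
    second-best site is fewer than min_difference edits worse.

    min_difference=3 is an assumed value inferred from the slide examples.
    """
    ranked = sorted((hamming(read, c), i) for i, c in enumerate(candidates))
    if len(ranked) == 1:
        return ranked[0][1]
    best, runner_up = ranked[0], ranked[1]
    return best[1] if runner_up[0] - best[0] >= min_difference else None

read = "AGATTAGCCTGGTACTGCTA"
examples = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # H = 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # H = 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # H = 1 and 4
]
for sites in examples:
    print([hamming(read, s) for s in sites], accept_by_he_difference(read, sites))
```

Note that under this rule even a perfect match (second example) is rejected when a near-identical site exists, whereas a read with a mismatch (third example) is accepted because its next-best site is far away.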
Testing Aligner Accuracy
• Simulated reads
  • Known correct location
  • 25 million, 50 million
  • Perfect match, up to 5 mismatches, up to 10 mismatches
  • Error 3’ bias
• Numbers of:
  • Correctly aligned reads
  • Incorrectly aligned reads
  • Unalignable reads
• Speed
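A test of this kind could be set up roughly as follows: sample reads from known positions, introduce a bounded number of substitutions weighted towards the 3’ end, then tally an aligner's reported positions against the truth. The linear 3’ weighting, the function names and the toy genome are assumptions made for illustration; the talk does not specify the error model.

```python
import random

def simulate_reads(genome, read_length, n_reads, max_mismatches, seed=0):
    """Sample reads with known start positions and 3'-biased substitutions."""
    if max_mismatches > read_length:
        raise ValueError("max_mismatches cannot exceed read_length")
    rng = random.Random(seed)
    # Assumed error model: substitution weight rises linearly towards the 3' end.
    weights = list(range(1, read_length + 1))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        n_errors = rng.randint(0, max_mismatches)
        positions = set()
        while len(positions) < n_errors:
            positions.add(rng.choices(range(read_length), weights=weights)[0])
        for p in positions:
            # Always substitute with a different base so each edit is a mismatch.
            bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
        reads.append((start, "".join(bases)))
    return reads

def tally(true_starts, reported_starts):
    """Count correct, incorrect and unaligned reads (None means unaligned)."""
    counts = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for truth, reported in zip(true_starts, reported_starts):
        if reported is None:
            counts["unaligned"] += 1
        elif reported == truth:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts

genome = "ACGTTGCAAGCTTAGGCTAACGTA" * 5  # toy sequence, not real data
reads = simulate_reads(genome, read_length=20, n_reads=10, max_mismatches=5)
print(reads[0])
print(tally([1, 2, 3], [1, None, 5]))
```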
Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads Accuracy - Correct. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads Accuracy - Unaligned. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
How does this affect interpretation?
• Incorporation of edit differentials
  • Leads to gains in the number of alignable reads
• Increased information
  • Determination of the alignment
  • Gains of 5 - 10% in mappable sites
• Hamming edit distributions provide useful information
Impact of MNase digestion on short read sequence coverage
Hamming distance variability
Read Deserts
Read Deserts
Sequence deserts
Impacts on read coverage - Conclusions
• Sample preparation
  • MNase Digestion
    • Local biases present
• Alignment
  • Parameter choices
    • Mismatches – thresholds generally too low relative to the uniqueness of k-mers in the genome
    • Multiple read mappings – can drive ‘absence’ of mapped reads
• Hamming edit distances and k-mer space
  • K-mers have unique and genome-specific properties
  • Can be used to inform results of alignment
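The point about multiple read mappings can be illustrated with a simple uniqueness mask: if an aligner reports only uniquely mapping reads, positions whose k-mer occurs more than once on either strand will show no coverage regardless of the biology. The sketch below is a minimal illustration under that assumption (exact matches only, toy sequence); the function names are hypothetical.

```python
from collections import Counter

def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def unique_kmer_mask(genome, k):
    """True where the k-mer starting at that position occurs exactly once
    when both strands are counted.

    Simplification: a k-mer equal to its own reverse complement is counted
    twice and therefore reported as non-unique.
    """
    n = len(genome) - k + 1
    counts = Counter()
    for i in range(n):
        kmer = genome[i:i + k]
        counts[kmer] += 1
        counts[reverse_complement(kmer)] += 1
    return [counts[genome[i:i + k]] == 1 for i in range(n)]

mask = unique_kmer_mask("AAAAAC", 3)
print(mask)
```

Runs of `False` in such a mask are candidate explanations for apparent deserts in read coverage, which is why k-mer properties can inform how alignment results are read.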
Acknowledgements
CSIRO PI Bioinformatics Team
Andrew Spriggs
Stuart Stephen
Emily Ying
Jose Robles
Michael James
CSIRO Prog X
Chris Helliwell
Frank Gubler
Liz Dennis
CSIRO Transformational Biology Capability Platform
David Lovell
Mark Morrison
CMIS / TBCP
Paul Greenfield
Paired end data – sample preparation
CG
AT
insert
insert
Control and sample read density
Control
Sample