Next generation sequencing course - part 2: sequence mapping
![Page 1: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/1.jpg)
[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing. Part 2: Mapping
Prof Jan Aerts
Faculty of Engineering - ESAT/[email protected]
TA: Alejandro Sifrim ([email protected])
1
![Page 2: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/2.jpg)
Context
2
![Page 3: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/3.jpg)
Assembly vs mapping
3
![Page 4: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/4.jpg)
Trapnell & Salzberg, 2009
challenges:
• how quickly can we align the reads to the genome?
• what do we do with repetitive sequences?
4
![Page 5: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/5.jpg)
Trapnell & Salzberg, 2009
Approaches
hash-based
Burrows-Wheeler transform
5
![Page 6: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/6.jpg)
Hash-based mapping
E.g. MAQ
Steps:
• Index reference genome (or sequence reads) => creates hash index (= big file: >50GB)
• Divide each read into segments (seeds) and look up in table
seed positions
... ...
AAGC 3,473,2738,...
AAGG 34,236,1827,...
AAGT 8,172,782,1921,...
... ...
6
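The seed table above can be sketched in a few lines of Python. This is only an illustration of the idea, not how MAQ is implemented: the function names, the toy reference and the seed length of 4 are assumptions made for the example.

```python
from collections import defaultdict

def build_seed_index(reference, k):
    """Map every k-mer of the reference to the 1-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i + 1)
    return index

def candidate_positions(read, index, k):
    """Look up non-overlapping seeds of the read and count, for each implied
    1-based read start position, how many seeds support it."""
    hits = defaultdict(int)
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            hits[pos - offset] += 1
    return dict(hits)

# Toy reference (hypothetical), seeds of length 4 as in the table above.
reference = "TTAAGCAAGGCTAAGTAAGC"
index = build_seed_index(reference, 4)
print(index["AAGC"])                               # [3, 17]
print(candidate_positions("AAGGCTAA", index, 4))   # {7: 2}
```

Both seeds of the read point to start position 7, so that position would be the candidate to verify with a full alignment.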
![Page 7: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/7.jpg)
Burrows-Wheeler transform
E.g. BWA
Used in data compression (e.g. bzip2) => index: much smaller than hash-based index (<2GB)
Alignment speed: 30x faster than MAQ
Steps:
• Create BWT index of genome
• Align read 1 character at a time to BWT-transformed genome
7
![Page 8: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/8.jpg)
Burrows-Wheeler transform
Figure labels: “Creating Burrows-Wheeler”; “2. Read mapping”
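A minimal sketch of creating the transform by sorting all rotations of the text, using the example text ^GOOGOL$ that appears on the following slides. The symbol order G < L < O < ^ < $ is an assumption read off the sorted first column shown there; `SLIDE_ORDER` and `bwt` are names chosen for this example. Real indexers avoid materialising all rotations, so this is for illustration only.

```python
# Assumption: symbol order taken from the slides' sorted first column.
SLIDE_ORDER = "GLO^$"

def bwt(text, order=SLIDE_ORDER):
    """Return the last column of the sorted rotation matrix of `text`."""
    rank = {c: i for i, c in enumerate(order)}
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda r: [rank[c] for c in r])
    return "".join(r[-1] for r in rotations)

print(bwt("^GOOGOL$"))  # O^OOGG$L
```

The result agrees with the last column listed row by row on the inverse-BWT slide (O ^ O O G G $ L).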
8
![Page 9: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/9.jpg)
Inverse BWT: recreating original text
if BWT = O^OOGG$L => what was original text?
O^OOGG$L = last column L => first column F = L sorted (symbol order used here: G < L < O < ^ < $)
9
Last column L:  O ^ O O G G $ L
First column F: G G L O O O ^ $
![Page 10: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/10.jpg)
Inverse BWT: recreating original text
The i-th occurrence of a character in L is the same text occurrence as the i-th occurrence of that character in F
10
F L
1st G G O 1st O
2nd G G ^ 1st ^
1st L L O 2nd O
1st O O O 3rd O
2nd O O G 1st G
3rd O O G 2nd G
1st ^ ^ $ 1st $
1st $ $ L 1st L
![Page 11: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/11.jpg)
11
$
![Page 12: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/12.jpg)
12
L$
![Page 13: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/13.jpg)
13
OL$
![Page 14: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/14.jpg)
14
GOL$
![Page 15: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/15.jpg)
15
OGOL$
![Page 16: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/16.jpg)
16
OOGOL$
![Page 17: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/17.jpg)
17
GOOGOL$
![Page 18: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/18.jpg)
18
^GOOGOL$
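The walk through the F/L table can be written down as a short function. This is a sketch under two assumptions: the symbol order G < L < O < ^ < $ implied by the sorted first column, and ^ as a start marker that occurs exactly once. The names `SLIDE_ORDER` and `inverse_bwt` are hypothetical.

```python
# Assumption: symbol order taken from the slides' sorted first column.
SLIDE_ORDER = "GLO^$"

def inverse_bwt(last, order=SLIDE_ORDER, start="^"):
    """Rebuild the text from the last column L, one character at a time, right to left."""
    rank = {c: i for i, c in enumerate(order)}
    n = len(last)
    # Stable sort of L's row numbers by character: entry j is the row of L whose
    # character ends up in row j of F, so the i-th X in L pairs with the i-th X in F.
    l_rows_in_f_order = sorted(range(n), key=lambda i: rank[last[i]])
    lf = [0] * n
    for f_row, l_row in enumerate(l_rows_in_f_order):
        lf[l_row] = f_row
    first = [last[i] for i in l_rows_in_f_order]
    row = first.index(start)      # the row that holds the original text
    out = []
    for _ in range(n):
        out.append(last[row])     # character preceding the current row's first character
        row = lf[row]             # jump to the row where that character sits in F
    return "".join(reversed(out))

print(inverse_bwt("O^OOGG$L"))  # ^GOOGOL$
```

The characters are collected in the same order as on the slides ($, L, O, G, O, O, G, ^) and reversed at the end.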
![Page 19: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/19.jpg)
19
Searching using BWT
uses row index and fact that rows are alphabetically sorted => binary search
e.g. at what positions does “GO” occur in “^GOOGOL$”?
take middle position: is “GO” alphabetically before or after this position?
-> if before: take middle position of first half (if after: last half) and discard other half
-> repeat until string found
-> row indices indicate positions of substring: “GO” is at positions 2 and 5
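The binary search described above can be sketched with a sorted list of suffix start positions. Note that this is a plain suffix-array search written for illustration; BWA itself uses a backward search on the BWT index, which is not shown here. The function name `find_all` is an assumption, and Python's default character order is used because the reported positions do not depend on the sort order.

```python
from bisect import bisect_left, bisect_right

def find_all(text, pattern):
    """Return the 1-based positions where `pattern` occurs in `text`."""
    # Row index: start positions of all suffixes, in sorted order of the suffixes.
    rows = sorted(range(len(text)), key=lambda i: text[i:])
    # Truncating sorted suffixes to the pattern length keeps them sorted,
    # so binary search finds the block of rows that start with the pattern.
    prefixes = [text[i:i + len(pattern)] for i in rows]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(i + 1 for i in rows[lo:hi])

print(find_all("^GOOGOL$", "GO"))  # [2, 5]
```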
![Page 20: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/20.jpg)
Issues
• placing reads in regions that do not exist in the reference genome
• sequencing errors and variations: alignment between read and true source in genome may have more differences than alignment with some other copy of repeat
What if many nucleotide differences with closest fully sequenced genome?
• placing reads in repetitive regions: MAQ & bwa return only 1 mapping; If multiple: mapQ = 0
• MAQ & bwa: use paired-end information => might prefer correct distance over correct alignment
20
![Page 21: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/21.jpg)
File formats
SAM (Sequence Alignment/Map) format = unified format for storing read alignments to a reference genome
BAM = binary version of SAM for fast querying
21
![Page 22: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/22.jpg)
22
7172283 83 chr9 139389482 60 90M = 139389330 -242 ACGGGAG... #######...
7172283 163 chr9 139389330 60 90M = 139389482 242 TAGGAGG... EHHHHHH...
7705896 83 chr9 139389513 60 90M = 139389512 -91 GCTGGGG... EBCHHFC...
7705896 163 chr9 139389512 60 90M = 139389513 91 AGCTGGG... HHHHHHH...
1 QNAME query template name
2 FLAG bitwise flag
3 RNAME reference sequence name
4 POS 1-based leftmost mapping position
5 MAPQ mapping quality
6 CIGAR CIGAR string
7 RNEXT reference name of mate
8 PNEXT position of mate
9 TLEN observed template length
10 SEQ sequence
11 QUAL ASCII of Phred-scaled base quality
http://samtools.sourceforge.net/SAM1.pdf
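A minimal sketch of reading the eleven mandatory fields of one alignment line in Python. The class and function names are assumptions for this example, optional tag fields after column 11 are ignored, and the SEQ and QUAL strings are the truncated ones shown on the slide, used here only as placeholders.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # query template name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of mate ("=" means same as rname)
    pnext: int   # position of mate
    tlen: int    # observed template length
    seq: str     # sequence
    qual: str    # ASCII of Phred-scaled base quality

def parse_sam_line(line):
    """Split one tab-separated alignment line into its 11 mandatory fields."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])

# First record from the slide; SEQ/QUAL are truncated placeholders.
line = "7172283\t83\tchr9\t139389482\t60\t90M\t=\t139389330\t-242\tACGGGAG\t#######"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, rec.tlen)
```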
![Page 23: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/23.jpg)
23
7172283 83 chr9 139389482 60 90M = 139389330 -242 ACGGGAG... #######...
7172283 163 chr9 139389330 60 90M = 139389482 242 TAGGAGG... EHHHHHH...
7705896 83 chr9 139389513 60 90M = 139389512 -91 GCTGGGG... EBCHHFC...
7705896 163 chr9 139389512 60 90M = 139389513 91 AGCTGGG... HHHHHHH...
paired data
![Page 24: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/24.jpg)
24
SAM format: FLAG field
numeric binary description
1 00000001 template has multiple fragments in sequencing
2 00000010 each fragment properly mapped according to aligner
4 00000100 fragment is unmapped
8 00001000 mate is unmapped
16 00010000 sequence is reverse complemented
32 00100000 sequence of mate is reversed
64 01000000 is first fragment in template
128 10000000 is second fragment in template
![Page 25: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/25.jpg)
SAM FLAG: examples
• 83 = 64 + 16 + 2 + 1 = 01010011
template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is reverse complemented, sequence of mate is not reversed, this is the first fragment in the template, this is not the second fragment in the template
• 163 = 128 + 32 + 2 + 1 = 10100011
template has multiple fragments, each fragment is properly aligned, fragment is not unmapped, mate is not unmapped, sequence is not reverse complemented, sequence of mate is reversed, this is not the first fragment in the template, this is the second fragment in the template
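The two decompositions above can be checked with a bitwise AND against each flag value. A small sketch, using the eight bits from the FLAG table; the names `FLAG_BITS`, `decode_flag` and `is_mapped` are chosen for this example.

```python
# The eight flag bits from the FLAG table, lowest bit first.
FLAG_BITS = [
    (1, "template has multiple fragments in sequencing"),
    (2, "each fragment properly mapped according to aligner"),
    (4, "fragment is unmapped"),
    (8, "mate is unmapped"),
    (16, "sequence is reverse complemented"),
    (32, "sequence of mate is reversed"),
    (64, "is first fragment in template"),
    (128, "is second fragment in template"),
]

def decode_flag(flag):
    """Return the values of all bits that are set in `flag`."""
    return [value for value, _ in FLAG_BITS if flag & value]

def is_mapped(flag):
    """A read is mapped when the 'fragment is unmapped' bit (4) is not set."""
    return not flag & 4

print(decode_flag(83))    # [1, 2, 16, 64]
print(decode_flag(163))   # [1, 2, 32, 128]
print(is_mapped(83), is_mapped(77))  # True False
```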
25
![Page 26: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/26.jpg)
SAM format: CIGAR string
26
M alignment match (can be sequence match or mismatch)
I insertion to the reference
D deletion to the reference
N skipped region from the reference
S soft clipping (clipped sequence is present in SEQ)
H hard clipping (clipped sequence is not present in SEQ)
P padding (silent deletion from padded reference)
= sequence match
X sequence mismatch
![Page 27: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/27.jpg)
CIGAR string: example
27
read ACGCA-TGCAGTtagacgt
reference ACGCAGTG--GT
CIGAR 5M1D2M2I2M7S
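As a check on the example, the CIGAR string can be split into (length, operation) pairs and summed. In this sketch the operations M, I, S, = and X are taken to consume read bases, and M, D, N, = and X to consume reference bases; the function names are assumptions for the example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
READ_OPS = set("MIS=X")   # operations that consume bases of the read (SEQ)
REF_OPS = set("MDN=X")    # operations that consume bases of the reference

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops

def read_length(cigar):
    return sum(n for n, op in parse_cigar(cigar) if op in READ_OPS)

def reference_span(cigar):
    return sum(n for n, op in parse_cigar(cigar) if op in REF_OPS)

print(parse_cigar("5M1D2M2I2M7S"))
print(read_length("5M1D2M2I2M7S"))     # 18, the length of ACGCATGCAGTtagacgt
print(reference_span("5M1D2M2I2M7S"))  # 10, the length of ACGCAGTGGT
```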
![Page 28: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/28.jpg)
Running bwa (FASTQ -> BAM)
http://bio-bwa.sourceforge.net
Steps:
1. Create index for genome (only has to be done once)
2. Run “bwa aln” to find suffix array coordinates of good hits of each individual read
3. Run “bwa samse/sampe”, which converts suffix array coordinates to chromosomal coordinates and, for sampe, pairs the reads
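The three steps can be written out as shell commands for paired-end data. The sketch below only builds and prints the command lines; nothing is executed. The helper name, the output prefix and the .sai/.sam file names are assumptions for this example, and the input file names are the ones used in the exercises of this course. The `-a is` option is suitable for a single chromosome; according to the bwa manual it does not work for databases larger than 2GB.

```python
import shlex

def bwa_paired_end_commands(reference, fastq1, fastq2, prefix="aln"):
    """Return shell commands for index, aln (once per read file) and sampe."""
    ref, fq1, fq2 = (shlex.quote(p) for p in (reference, fastq1, fastq2))
    sai1 = shlex.quote(f"{prefix}_1.sai")
    sai2 = shlex.quote(f"{prefix}_2.sai")
    sam = shlex.quote(f"{prefix}.sam")
    return [
        f"bwa index -a is {ref}",                               # step 1, once per genome
        f"bwa aln {ref} {fq1} > {sai1}",                        # step 2, first reads
        f"bwa aln {ref} {fq2} > {sai2}",                        # step 2, second reads
        f"bwa sampe {ref} {sai1} {sai2} {fq1} {fq2} > {sam}",   # step 3, paired-end
    ]

commands = bwa_paired_end_commands(
    "chr9.fa", "s_1_sequence_small.txt", "s_2_sequence_small.txt")
for command in commands:
    print(command)
```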
28
![Page 29: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/29.jpg)
29
Running “bwa” without arguments returns help.
![Page 30: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/30.jpg)
bwa: indexing the genome
Only has to be done once!
To index chromosome 17 only:
1. Download chr17.fa.gz from UCSC Genome Browser (downloads section)
2. Unpack it (gunzip chr17.fa.gz) and run bwa index -a is chr17.fa
30
![Page 31: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/31.jpg)
31
![Page 32: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/32.jpg)
bwa: finding suffix array coordinates for reads
32
![Page 33: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/33.jpg)
bwa: converting suffix array coordinates to chromosome coordinates
33
![Page 34: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/34.jpg)
Using Galaxy for read mapping
34
![Page 35: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/35.jpg)
Viewing BAM files
Many options:
• Integrative Genomics Viewer (IGV) by Broad Institute
• samtools tview
• UCSC genome browser
• bamview
• bambino
• ...
35
![Page 36: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/36.jpg)
Viewing BAM files: IGV
http://www.broadinstitute.org/software/igv/
Java WebStart
36
![Page 37: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/37.jpg)
37
coverage
reads
polymorphisms
gene model
![Page 38: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/38.jpg)
38
Is this a known SNP?
![Page 39: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/39.jpg)
39
File -> Load from Server...
![Page 40: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/40.jpg)
40
Yes, it is...
![Page 41: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/41.jpg)
41
Viewing BAM files: samtools tview
http://samtools.sourceforge.net
![Page 42: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/42.jpg)
42
![Page 43: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/43.jpg)
43
Viewing BAM files: UCSC Genome Browser
http://genome.ucsc.edu
-> “Genome Browser”
-> “Manage Custom Tracks”
-> “Add Custom Tracks”
-> In “Edit configuration”:
track type=bam name="My BAM" bigDataUrl=http://med.kuleuven.be/lcb/teaching/aln.sorted.bam
-> “Submit”
aln.sorted.bam contains reads that map to the first 10Mb of chr17
![Page 44: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/44.jpg)
44
whole chromosome
![Page 45: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/45.jpg)
45
zoomed in
![Page 46: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/46.jpg)
46
zoomed in even further
query template names
![Page 47: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/47.jpg)
47
Read details
![Page 48: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/48.jpg)
48
Manipulating SAM/BAM files
• convert SAM <-> BAM
• remove PCR duplicates
• sort BAM file - necessary for loading into tools such as IGV
• index BAM file - necessary for loading into tools such as IGV
• local realignment around indels
• base quality recalibration
• pileup - i.e. convert from read-based to position-based; SNP calling
• ...
![Page 49: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/49.jpg)
Manipulating SAM/BAM files - tools: samtools
49
Li et al, 2009
http://samtools.sourceforge.net
![Page 50: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/50.jpg)
50
convert SAM to BAM
sort
index
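The three operations named on this slide correspond to three samtools subcommands. The sketch below builds the command lines without running them. It assumes the option syntax of samtools 1.x; the older releases in use when these slides were written needed `-S` to read SAM input and took an output prefix for `sort`. The helper and file names are hypothetical.

```python
import shlex

def samtools_commands(sam_path, prefix):
    """Return shell commands to convert SAM to BAM, sort the BAM and index it."""
    sam = shlex.quote(sam_path)
    bam = shlex.quote(f"{prefix}.bam")
    sorted_bam = shlex.quote(f"{prefix}.sorted.bam")
    return [
        f"samtools view -b -o {bam} {sam}",       # convert SAM to BAM
        f"samtools sort -o {sorted_bam} {bam}",   # sort by coordinate
        f"samtools index {sorted_bam}",           # writes the .bai index next to the BAM
    ]

for command in samtools_commands("aln.sam", "aln"):
    print(command)
```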
![Page 51: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/51.jpg)
Manipulating SAM/BAM files - tools: PICARD
http://picard.sourceforge.net
= Java-based command-line utility with functionality similar to samtools
useful commands:
• MarkDuplicates - flags duplicate records (e.g. due to PCR amplification bias)
• CalculateHsMetrics - calculates set of Hybrid Selection specific metrics
• SamToFastq - extracts read sequences and qualities from SAM file
51
![Page 52: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/52.jpg)
52
![Page 53: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/53.jpg)
Duplicate removal
53
PCR amplification bias
some reads: better amplified than others => bias!!
=> keep only one (with highest mapping Q)
Figure labels: PCR went well; PCR didn’t go so well; PCR didn’t work
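The rule on this slide (keep one read per duplicate group, the one with the highest mapping quality) can be sketched as follows. This is a simplification for illustration: reads are treated as duplicates when they share chromosome, start position and strand, and the actual tools apply more elaborate criteria. The tuple layout, the function name and the example reads are made up for this sketch.

```python
def pick_representatives(reads):
    """Keep, per (chromosome, position, strand), the read with the highest MAPQ.

    `reads` is an iterable of (name, chrom, pos, strand, mapq) tuples.
    On ties the read seen first is kept.
    """
    best = {}
    for read in reads:
        name, chrom, pos, strand, mapq = read
        key = (chrom, pos, strand)
        if key not in best or mapq > best[key][4]:
            best[key] = read
    return list(best.values())

# Hypothetical reads: r1 and r2 start at the same place on the same strand.
reads = [
    ("r1", "chr9", 100, "+", 60),
    ("r2", "chr9", 100, "+", 37),
    ("r3", "chr9", 100, "-", 60),
    ("r4", "chr9", 250, "+", 0),
]
print([r[0] for r in pick_representatives(reads)])  # ['r1', 'r3', 'r4']
```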
![Page 54: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/54.jpg)
54
Picard:
java -Xmx2048m \
  -jar /path_to_picard/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools:
samtools rmdup input.bam output.bam
![Page 55: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/55.jpg)
55
Manipulating SAM/BAM files - tools: GATK
GATK = Genome Analysis Toolkit, developed by Broad Institute
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
• Full variant discovery workflow
• Variant evaluation
• ...
![Page 56: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/56.jpg)
Base quality recalibration
56
• Why?
correct for variation in quality with machine cycle, sequence context, lane, baseQ, ...
• Steps:
• Identify what to correct for
• Calculate covariates
• Apply covariates
• Check (create plots)
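The "calculate covariates" step amounts to counting, per bin of covariate values, how often a base disagrees with the reference, and turning that error rate into a Phred-scaled quality, Q = -10 log10(error rate). The sketch below is a simplified illustration of that idea and not GATK's actual procedure: the binning by (reported quality, dinucleotide), the +1/+2 smoothing and all names are assumptions for this example.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """Estimate a Phred-scaled quality per (reported quality, context) bin.

    `observations` is an iterable of (reported_q, context, is_mismatch) tuples,
    one per sequenced base.
    """
    counts = defaultdict(lambda: [0, 0])   # bin -> [mismatches, total bases]
    for reported_q, context, is_mismatch in observations:
        bin_counts = counts[(reported_q, context)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    # Assumed smoothing (+1 / +2) so that a bin without mismatches
    # does not lead to log10(0).
    return {key: -10 * math.log10((m + 1) / (t + 2))
            for key, (m, t) in counts.items()}

# Hypothetical data: 998 bases reported as Q30 in context "AC", 9 of them mismatches.
observations = [(30, "AC", i < 9) for i in range(998)]
table = empirical_qualities(observations)
print(round(table[(30, "AC")], 1))  # 20.0
```

In this made-up bin the bases were reported as Q30 but behave like Q20, which is the kind of discrepancy the "apply" step would correct.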
![Page 57: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/57.jpg)
57
Mapping quality dependent on sequence context
![Page 58: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/58.jpg)
58
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv
![Page 59: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/59.jpg)
59
Local realignment near indels
![Page 60: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/60.jpg)
60
Local realignment near indels
![Page 61: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/61.jpg)
61
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam
![Page 62: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/62.jpg)
62
Exercises
![Page 63: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/63.jpg)
63
Aligning reads to reference on the command line
Login on the server mentioned on Toledo, and:
From directory ~jaerts/i0d51a/: copy the files s_1_sequence_small.txt, s_2_sequence_small.txt and chr9.fa to your own home directory.
If you know that s_1_sequence_small.txt and s_2_sequence_small.txt contain paired reads: align these against chr9. You’ll first have to create an index for chr9 (see slides). Also convert the resulting sam-file to a bam-file.
How many of the reads were mapped? How many could not be mapped? There’s an easy way to do this with grep, but you earn an extra point if you use the bitwise flag.
How many reads mapped over their full length without indels or clipping (i.e. CIGAR string equal to “90M”)? Keep in mind that M can be a sequence match or a mismatch.
![Page 64: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/64.jpg)
Aligning reads to reference using Galaxy
Log into your account on Galaxy.
Align the reads in s_1_sequence_small.txt and s_2_sequence_small.txt (that you uploaded in the last lesson) against hg19. Perform the mapping using BWA for Illumina. Use the built-in index “Human (Homo sapiens): hg19 Full” (type “hg19” in the “Select a reference genome” box). Do not suppress the header in the output SAM file.
Using Galaxy: create a histogram of the insert sizes of this DNA sequencing library (tip: you’ll need some commands from the “Text Manipulation” and “Filter and Sort” groups)
64
![Page 65: Next generation sequencing course - part 2: sequence mapping](https://reader036.fdocuments.us/reader036/viewer/2022081507/554e8cfbb4c90526358b4b93/html5/thumbnails/65.jpg)
Investigating BAM file with IGV
Start the IGV application from http://www.broadinstitute.org/software/igv/download (750MB version) and open the first10Mbchr17.sorted.bam file which you can download from Toledo.
• Is this data from a whole-genome sequencing experiment, or rather from some type of pulldown? If the latter: what type of pulldown (i.e. what were the targets).
• Is the complete CDS of the KIF1C gene covered?
• What is the left-most gene that is also in OMIM (you can find those at “Load from Server -> hg19 -> Phenotype and Disease Associations”)? Are all its exons covered?
• At position 11,928 of chromosome 17: is this a SNP? If it is: is it already known in dbSNP? What about position 13,905?
65