MCB3895-004 Lecture #20 Nov 18/14 Reference alignments.

MCB3895-004 Lecture #20 Nov 18/14

Reference alignments

Today:

1. Align reads to a reference genome

2. Correct for misalignments

3. Analyze variants between reads and the reference genome (i.e., differences between the sequenced genome and the reference)

• We will roughly follow the default samtools protocol: http://www.htslib.org/workflow/

Record keeping

• As you will see, a huge amount of this work is converting between formats so that different software will work

• It is therefore CRUCIAL that you keep records of all the commands that you use

• BEWARE: different versions of these programs can have different syntaxes
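
One way to keep such records automatically is to route every command through a small logging wrapper. The sketch below is only an illustration: `log_run` and the notebook file name `lab_notebook.txt` are hypothetical names chosen here, not part of any of the tools used in this lecture.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bwa, samtools, GATK, or Picard):
# append each command to a notebook file, then run it.
NOTEBOOK="${NOTEBOOK:-lab_notebook.txt}"   # assumed notebook file name

log_run() {
    # Record a timestamp and the command line. Arguments are joined with
    # spaces, so the original quoting is not preserved in the log.
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$NOTEBOOK"
    # Run the command unchanged and pass its exit status through.
    "$@"
}

# Example with a harmless command; real use would look like
#   log_run bwa index E_coli.fasta
log_run echo "pipeline started"
```

Redirections such as `> align.sam` are handled by the shell and never reach `log_run`, so commands that rely on them need to be noted by hand or wrapped, e.g. `log_run bash -c '...'`. Recording the version of each program alongside the commands is also worthwhile, given the syntax differences between versions.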

Read mapping

• Align reads to a reference genome and determine SNPs

• Note: we are aligning reads, not contigs as with nucmer

• Computationally more efficient than doing de novo assembly first

Read mapping tools

• Many different flavors, but overall dominated by two programs:

• bwa (we will use today)

• bowtie

• Note: early versions of bowtie did not align reads containing indels, whereas bwa did

• There is some debate about which is best; the trade-off is between sensitivity (ability to map everything) and specificity (whether the mappings are correct)

• Also speed and memory considerations

Mapping using bwa

• Create an index of the reference genome nucleotide fasta for the alignment software to use for read mapping

• syntax: $ bwa index [ref.fasta file]

• e.g.: $ bwa index E_coli.fasta

• note: use ".fasta" file ending for a later step

• Creates 5 output files: [ref.fasta].amb, .ann, .bwt, .pac, .sa

• "Index": special computer data structure that allows fast searching; software-specific
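
Because later steps fail in confusing ways when the index is incomplete, it can help to check for the five files listed above before mapping. This is a minimal sketch; `check_bwa_index` is a hypothetical helper name, and the check only looks at file presence and size, not at whether the index is valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the five bwa index files are missing.
check_bwa_index() {
    local ref=$1 ext missing=0
    for ext in amb ann bwt pac sa; do
        # -s is true only if the file exists and is not empty.
        if [ ! -s "$ref.$ext" ]; then
            echo "missing or empty: $ref.$ext"
            missing=1
        fi
    done
    return "$missing"
}

# Example: succeeds only if all five index files sit next to the reference.
if check_bwa_index E_coli.fasta; then
    echo "index complete"
else
    echo "run: bwa index E_coli.fasta"
fi
```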

Mapping using bwa

• bwa mem does the actual mapping step

• syntax: $ bwa mem -R '@RG\tID:[name1]\tSM:[name2]\tLB:library1' [ref.fasta file] [read file 1] [read file 2] > [outfile]

• -R: adds a "read group" header line and tags every read with it; required for GATK in later steps

• e.g.: $ bwa mem -R '@RG\tID:all\tSM:all\tLB:library' E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam
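
The read group string is easy to mistype, so it can be built from its three parts instead. The sketch below assumes a hypothetical helper named `make_read_group`; it only prints the string and does not run bwa.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the -R argument for bwa mem.
make_read_group() {
    local id=$1 sample=$2 library=$3
    # '\\t' prints a literal backslash followed by 't', which is exactly
    # what the single-quoted string in the slide example passes to bwa mem.
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\n' "$id" "$sample" "$library"
}

rg="$(make_read_group all all library)"
printf '%s\n' "$rg"
# Intended use (not run here):
#   bwa mem -R "$rg" E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam
```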

samtools: convert .sam to .bam, clean up names

• .sam is the plain text output format of most sequence alignment programs

• Because these can be large, most subsequent programs use the compressed ".bam" format instead

• bwa sometimes does odd things to read pairing information; samtools fixmate can clean this up during conversion
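
Since .sam is plain tab-separated text, ordinary text tools can inspect it. The sketch below counts mapped and unmapped reads using the FLAG column (column 2, where bit 0x4 means "unmapped"). The three-read input is made-up data for illustration, and `sam_demo` and `count_mapped` are hypothetical names.

```shell
#!/usr/bin/env bash
# Made-up SAM fragment: one header line and three reads (tab-separated).
sam_demo() {
    printf '@SQ\tSN:chr\tLN:1000\n'
    printf 'r1\t99\tchr\t10\t60\t5M\t=\t50\t45\tACGTA\tIIIII\n'
    printf 'r1\t147\tchr\t50\t60\t5M\t=\t10\t-45\tACGTA\tIIIII\n'
    printf 'r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII\n'
}

count_mapped() {
    # Skip '@' header lines; test bit 0x4 of FLAG with integer arithmetic
    # so that the script does not depend on awk bitwise extensions.
    awk -F '\t' '!/^@/ { if (int($2 / 4) % 2 == 1) un++; else m++ }
                 END   { printf "mapped=%d unmapped=%d\n", m, un }'
}

sam_demo | count_mapped   # prints: mapped=2 unmapped=1
```

On a real file the same function could read `align.sam` directly, e.g. `count_mapped < align.sam`.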

samtools: convert .sam to .bam, clean up names

• syntax: $ samtools fixmate -O bam [input .sam file] [output .bam file]

• -O: output file type

• e.g.: $ samtools fixmate -O bam align.sam align_fixmate.bam

samtools: sort .bam file

• samtools and related software use .bam files that are sorted by ascending genomic position

• i.e., starts from position #1 on the reference genome and goes to the end

• syntax: $ samtools sort -O bam -o [output sorted .bam] -T [temp file location] [input unsorted .bam]

• -O: output file type

• -o: output file name

• -T: location (file name prefix) for temporary files (required)

• e.g.: $ samtools sort -O bam -o align_sorted.bam -T temp align_fixmate.bam
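
To make "sorted by ascending genomic position" concrete, the sketch below checks that the POS column (column 4) of SAM text never decreases. It is a simplified illustration: `is_coordinate_sorted` is a hypothetical name, and the check assumes a single reference sequence and skips unmapped reads.

```shell
#!/usr/bin/env bash
# Hypothetical check: exit status 0 if SAM text on stdin is in ascending
# position order. Assumes one reference sequence; skips headers and
# unmapped reads (RNAME "*").
is_coordinate_sorted() {
    awk -F '\t' '
        /^@/ || $3 == "*" { next }
        $4 + 0 < prev     { bad = 1; exit }
                          { prev = $4 + 0 }
        END               { exit bad + 0 }'
}

# Made-up two-read examples (only the first four SAM columns are needed here).
printf 'r2\t0\tchr\t5\nr1\t0\tchr\t10\n' | is_coordinate_sorted && echo "sorted"
printf 'r1\t0\tchr\t10\nr2\t0\tchr\t5\n' | is_coordinate_sorted || echo "not sorted"
```

With the files from this lecture, the input could come from `samtools view -h align_sorted.bam`.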

GATK: realign indels

• bwa sometimes misaligns indels in reads

• One way to get rid of these is to use the realignment functions in the GATK package

• More generally: GATK does many of the same things as samtools, with a strong focus on diploid genomes

• Unfortunately, GATK uses java (silly command line syntax)

• Unfortunately, GATK needs its own file formats

Picard: index reference

• syntax: $ java -jar /export/apps/picard-tools-1.124/picard.jar CreateSequenceDictionary REFERENCE=[ref.fasta file] OUTPUT=[output .dict file]

• REFERENCE: reference file name

• OUTPUT: output index file name, must be ".dict"

• e.g.: $ java -jar /export/apps/picard-tools-1.124/picard.jar CreateSequenceDictionary REFERENCE=E_coli.fasta OUTPUT=E_coli.dict

samtools: index reference

• syntax: $ samtools faidx [ref.fasta file]

• e.g.: $ samtools faidx E_coli.fasta

• outputs [ref.fasta].fai index file

samtools: index .bam file

• syntax: $ samtools index [sorted .bam file]

• e.g.: $ samtools index align_sorted.bam

• outputs a [sorted .bam].bai index file

GATK: prepare reads for indel realignment

• syntax: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R [ref.fasta file] -I [sorted & indexed .bam file] -o [output file name]

• -R: reference .fasta file name

• -I: sorted and indexed .bam file

• -o: output intervals file name

• e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R E_coli.fasta -I align_sorted.bam -o align_sorted.intervals

GATK: perform indel realignment

• syntax: $ java -Xmx4g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R [ref.fasta file] -I [sorted & indexed .bam file] -targetIntervals [intervals file] -o [output .bam file]

• -T: program function to use

• -R: reference .fasta file

• -I: sorted and indexed .bam file

• -targetIntervals: intervals file from the last step

• -o: output .bam file name

• e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R E_coli.fasta -I align_sorted.bam -targetIntervals align_sorted.intervals -o align_realigned.bam

Picard: remove duplicates

• Duplicate reads can arise because of PCR artifacts during sequencing

• Because duplicate reads do not provide additional information, it is best to remove them for computational efficiency

• Identified by having identical start and end mapping positions
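
The identification rule above can be illustrated on a toy table of read name, start, and end position. This is a deliberately simplified sketch with made-up data; `mark_dups` is a hypothetical name, and it is not a description of how Picard itself chooses which copy to keep.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: flag reads that share the same start and end
# mapping positions. Input columns: read name, start, end.
mark_dups() {
    awk '{
        key = $2 ":" $3
        # The first read seen at a (start, end) pair is kept; later ones
        # with the same pair are reported as duplicates.
        if (seen[key]++) print $1 "\tduplicate"
        else             print $1 "\tkept"
    }'
}

# Made-up reads: r1 and r2 map to identical coordinates.
printf 'r1 100 250\nr2 100 250\nr3 120 270\n' | mark_dups
```

Here r1 and r3 are reported as kept and r2 as a duplicate.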

Picard: remove duplicates

• syntax: $ java -Xmx2g -jar /export/apps/picard-tools-1.124/picard.jar MarkDuplicates INPUT=[input bam file] OUTPUT=[output bam file] REMOVE_DUPLICATES=true METRICS_FILE=[metrics output file]

• INPUT: input .bam file from GATK

• OUTPUT: output .bam file lacking duplicates

• REMOVE_DUPLICATES: if true, duplicates are removed from the output rather than only flagged

• METRICS_FILE: summary file of duplicate reads removed

• e.g.: $ java -Xmx2g -jar /export/apps/picard-tools-1.124/picard.jar MarkDuplicates INPUT=align_realigned.bam OUTPUT=align_nodups.bam REMOVE_DUPLICATES=true METRICS_FILE=nodups.metrics

samtools: index .bam file

• samtools requires that the new .bam file be indexed before variant calling

• syntax: $ samtools index [.bam file name]

• e.g.: $ samtools index align_nodups.bam

samtools: create .bcf file for variant calling

• bcftools is a package very similar to samtools that handles variant calling

• Of course, it requires its own file format

• syntax: $ samtools mpileup -go [output .bcf] -f [ref.fasta] [1 or more indexed .bam]

• -g: compute genotype likelihoods and write them in .bcf format

• -o: output file name

• -f: reference .fasta file name

• e.g.: $ samtools mpileup -go E_coli.bcf -f E_coli.fasta align_nodups.bam

bcftools: call variants

• The actual variant calling step uses the call function in bcftools

• syntax: $ bcftools call -vmO z -o [output .vcf.gz file] [input .bcf file]

• -v: only output variant sites

• -m: specify variant calling algorithm (multiallelic)

• -O: specify output format, z = .vcf.gz

• -o: output file name

• e.g.: $ bcftools call -vmO z -o E_coli.vcf.gz E_coli.bcf

bcftools: index .vcf.gz file

• tabix is a program included in bcftools that indexes a .vcf.gz file

• syntax: $ tabix -p vcf [input .vcf.gz file]

• -p: specifies file type

• e.g.: $ tabix -p vcf E_coli.vcf.gz

bcftools: analyze .vcf.gz file

• bcftools has a handy function (stats) to analyze the variants that it has identified

• syntax: $ bcftools stats -F [ref.fasta] -s - [input .vcf.gz file] > [output file]

• -F: faidx indexed reference .fasta sequence

• -s: list of samples to analyze, "-" = all samples

• e.g.: $ bcftools stats -F E_coli.fasta -s - E_coli.vcf.gz > E_coli.vcf.gz.stats
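
The stats file is tab-separated text, so the SNP and indel counts can be pulled out with awk. The sketch below assumes the summary lines start with `SN` and carry the key in column 3 and the value in column 4; that layout may differ between bcftools versions, so check it against your own output. The numbers are made up, and `stats_demo` and `count_variants` are hypothetical names.

```shell
#!/usr/bin/env bash
# Made-up excerpt in the assumed layout of the "SN" summary section.
stats_demo() {
    printf '# SN\t[2]id\t[3]key\t[4]value\n'
    printf 'SN\t0\tnumber of records:\t120\n'
    printf 'SN\t0\tnumber of SNPs:\t100\n'
    printf 'SN\t0\tnumber of indels:\t20\n'
}

count_variants() {
    # Keep only the SNP and indel summary lines; print key and value.
    awk -F '\t' '$1 == "SN" && ($3 == "number of SNPs:" || $3 == "number of indels:") {
        print $3 "\t" $4
    }'
}

stats_demo | count_variants
```

With the file from the example above, the call would be `count_variants < E_coli.vcf.gz.stats`.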

[Slide figures: Summary stats, Indel stats, Quality stats, Indel types, Substitution types]

bcftools: filter variants based on quality score

• Generally, one wants to mark low quality variants.

• How to draw a cutoff line is somewhat subjective

• syntax: $ bcftools filter -O z -o [output .vcf.gz file] -s LOWQUAL -i '%QUAL>10' [input .vcf.gz file]

• -O: output type, "z" = .vcf.gz

• -o: output file name

• -s: label to mark failed variants

• -i: condition under which variants pass

• e.g.: $ bcftools filter -O z -o E_coli_filtered.vcf.gz -s LOWQUAL -i '%QUAL>10' E_coli.vcf.gz
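
To see what this kind of soft filter does to each record, the sketch below applies the same rule to made-up VCF lines with awk: QUAL is column 6 and FILTER is column 7. It is an illustration only, not a replacement for bcftools filter, and `vcf_demo` and `soft_filter` are hypothetical names.

```shell
#!/usr/bin/env bash
# Made-up VCF fragment: one header line and two variants (tab-separated).
vcf_demo() {
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr\t101\t.\tA\tG\t221\t.\tDP=40\n'
    printf 'chr\t350\t.\tC\tT\t4.5\t.\tDP=3\n'
}

soft_filter() {
    # $1: QUAL cutoff (default 10). Records above the cutoff are marked
    # PASS; the rest are kept in the output but labelled LOWQUAL.
    awk -F '\t' -v OFS='\t' -v min="${1:-10}" '
        /^#/ { print; next }
        { $7 = ($6 + 0 > min + 0) ? "PASS" : "LOWQUAL"; print }'
}

vcf_demo | soft_filter 10
```

The first variant (QUAL 221) is marked PASS and the second (QUAL 4.5) is marked LOWQUAL. Trying several cutoffs on the toy data is a quick way to see how the choice changes what gets labelled.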

bcftools: calculate stats based on filtered variants

• You can tell bcftools stats to only analyze variants that pass the filter

• syntax: $ bcftools stats -F [ref.fasta] -f PASS -s - [input filtered .vcf.gz file]

• -F: faidx indexed reference .fasta sequence

• -f: FILTER label of the variants to include (here, PASS)

• -s: list of samples to analyze, "-" = all samples

• e.g.: $ bcftools stats -F E_coli.fasta -f PASS -s - E_coli_filtered.vcf.gz

Assignment

• How do our MiSeq and HiSeq E. coli datasets differ from the reference K-12 genome?

• Submit:

1. Number of SNP and indel differences compared to the reference genome

2. Justification for the filtering parameters used

3. Lab notebook file listing all of the exact parameters used

4. Output of the bcftools stats analysis