MCB3895-004 Lecture #20 Nov 18/14 Reference alignments.

Post on 21-Dec-2015

214 views 1 download

Tags:

Transcript of MCB3895-004 Lecture #20 Nov 18/14 Reference alignments.

MCB3895-004 Lecture #20Nov 18/14

Reference alignments

Today:

1. Align reads to a reference genome

2. Correct for misalignments

3. Analyze variants between reads and the reference genome • (i.e., differences between the sequenced genome

and the reference)

• We will roughly follow the default samtools protocol: http://www.htslib.org/workflow/

Record keeping

• As you will see, a huge amount of this work is converting between formats so that different software will work

• It is therefore CRUCIAL that you keep records of all the commands that you use

• BEWARE: different versions of all of these software have different syntaxes

Read mapping

• Align reads to a reference genome and determine SNPs

• Note aligning reads, not contigs as with nucmer

• Computationally more efficient than doing de novo assembly first

Read mapping tools

• Many different flavors, but overall dominated by two programs:

• bwa (we will use today)• bowtie

• Note: early versions of bowtie did not align reads containing indels, whereas bwa did

• Some debate about which is best, trade-offs between sensitivity (ability to map everything) and specificity (are mappings correct)

• Also speed and memory considerations

Mapping using bwa

• Create an index of the reference genome nucleotide fasta for the alignment software to use for read mapping

• syntax: $ bwa index [ref.fasta file]• e.g.: $ bwa index E_coli.fasta• note: use ".fasta" file ending for a later step

• Creates 5 output files: [ref.fasta].amb, .ann, .bwt, .pac, .sa

• "Index": special computer data structure that allows fast searching; software-specific

Mapping using bwa

• bwa mem does the actual mapping step• syntax: $ bwa mem -R '@RG\tID:[name1]\tSM:[name2]\tLB:library1' [ref.fasta file] [read file 1] [read file 2] > [outfile]

• -R: indexes "read groups", required for GATK in later steps

• e.g.: $ bwa mem -R '@RG\tID:all\tSM:all\tLB:library' E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam

samtools: convert .sam to .bam, clean up names• .sam is the plain text output format of most

sequence alignment programs

• Because these can be large, most subsequent programs use the compressed ".bam" format instead

• bwa sometimes does odd things to read pairing information, can clean up during conversion

samtools: convert .sam to .bam, clean up names• syntax: $ samtools fixmate -O bam [input .sam file] [output .bam file]

• -O: output file type

• e.g.: $ samtools fixmate -O bam align.sam align_fixmate.bam

samtools: sort .bam file

• samtools and related software use .bam files that are sorted by ascending genomic position

• i.e., starts from position #1 on the reference genome and goes to the end

• syntax: $ samtools sort -O bam -o [output sorted .bam] -T [temp file location] [input unsorted .bam]

• -O: output file type• -o: output file name• -T: location for temporary files (required)

• e.g.: $ samtools sort -O bam -o align_sorted.bam -T temp align_fixmate.bam

GATK: realign indels

• bwa sometimes misaligns indels in reads

• One way to get rid of these is to use the realignment functions in the GATK package

• More generally: GATK does much of the same thing as samtools, strong focus on diploid genomes

• Unfortunately, GATK uses java (silly command line syntax)

• Unfortunately, GATK needs its own file formats

Picard: index reference

• syntax: $ java -jar /export/apps/picard-tools-1.124/picard.jar CreateSequenceDictionary REFERENCE=[ref.fasta file] OUTPUT=[output .dict file]

• REFERENCE: reference file name• OUTPUT: output index file name, must be ".dict"

• e.g.: $ java -jar /export/apps/picard-tools-1.124/picard.jar CreateSequenceDictionary REFERENCE=E_coli.fasta OUTPUT=E_coli.dict

samtools: index reference

• syntax: $ samtools faidx [ref.fasta file]

• e.g.: $ samtools faidx E_coli.fasta• outputs [ref.fasta].fai index file

samtools: index .bam file

• syntax: $ samtools index [sorted .bam file]

• e.g.: $ samtools index align_sorted.bam

• outputs [sorted bam].bai output file

GATK: prepare reads for indel realignment

• syntax: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R [ref.fasta file] -I [sorted & indexed .bam file] -o [output file name]

• -R: reference .fasta file name• -I: sorted and indexed .bam file• -o: output intervals file name

• e.g.: $ java -Xmx2g -jar /opt/bioinformatics GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R E_coli.fasta -I align_sorted.bam -o align_sorted.intervals

GATK: perform indel realignment• syntax: $ java -Xmx4g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R [ref.fasta file] -I [sorted & indexed .bam file] -targetIntervals [intervals file] -o [output .bam file]

• -T: Program function to use• -R: Reference .fasta file• -I: Intervals file from last step• -o: output .bam file name

• e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R E_coli.fasta -I align_sorted.bam -targetIntervals align_sorted.intervals -o align_realigned.bam

Picard: remove duplicates

• Duplicate reads can arise because of PCR artifacts during sequencing

• Because duplicate reads to not provide additional information, it is best to remove them for computational efficiency

• Identified by having identical start and end mapping positions

Picard: remove duplicates

• syntax: $ java -Xmx2g -jar /export/apps/picard-tools-1.124/picard.jar MarkDuplicates INPUT=[input bam file] OUTPUT=[output bam file] REMOVE_DUPLICATES=true METRICS_FILE=[metrics output file]

• INPUT: input .bam file from GATK• OUTPUT: output .bam file lacking duplicates• METRICS_FILE: summary file of duplicate reads removed

• e.g.: $ java -Xmx2g -jar /export/apps/picard-tools-1.124/picard.jar MarkDuplicates INPUT=align_realigned.bam OUTPUT=align_nodups.bam REMOVE_DUPLICATES=true METRICS_FILE=nodups.metrics

samtools: index .bam file

• samtools requires that the new .bam file be indexed before variant calling

• syntax: $ samtools index [.bam file name]

• e.g.: $ samtools index align_nodups.bam

samtools: create .bcf file for variant calling• bcftools is a package very similar to samtools that handles variant calling

• Of course, it requires its own file format

• syntax: $ samtools mpileup -go [output .bcf] -f [ref.fasta] [1 or more indexed .bam]

• -go: specify output file name and .bcf format• -f: reference .fasta file name

• e.g.: $ samtools mpileup -go E_coli.bcf -f E_coli.fasta align_nodups.bam

bcftools: call variants

• The actual variant calling step uses the call function in bcftools

• syntax: $ bcftools call -vmO z -o [output .vcf.gz file] [input .bcf file]

• -v: only output variant sites• -m: specify variant calling algorithm (multiallelic)• -O: specify output format, z = .vcf.gz• -o: output file name

• e.g.: $ bcftools call -vmO z -o E_coli.vcf.gz E_coli.bcf

bcftools: index .vcf.gz file• tabix is a program included in bcftools that

indexes a .vcf.gz file

• syntax: $ tabix -p vcf [input .vcf.gz file]

• -p: specifies file type

• e.g.: $ tabix -p vcf E_coli.vcf.gz

bcftools: analyze .vcf.gz file• bcftools has handy software to analyze the

variants that it has identified

• syntax: $ bcftools stats -F [ref.fasta] -s - [input .vcf.gz file] > [output file]

• -F: faidx indexed reference .fasta sequence • -s: list of samples to analyze, "-" = all samples

• e.g.: $ bcftools stats -F E_coli.fasta -s - E_coli.vcf.gz > E_coli.vcf.gz.stats

Summary stats

Indel stats

Quality stats

Indel types

Substitution types

bcftools: filter variants based on quality score• Generally, one wants to mark low quality variants.

• How to draw a cutoff line is somewhat subjective

• syntax: $ bcftools filter -O z -o [output .vcf.gz file] -s LOWQUAL -i '%QUAL>10' [input .vcf.gz file]

• -O: output type, "z" = .vcf.gz• -o: output file name• -s: label to mark failed variants• -i: condition under which sequences pass

• e.g.: $ bcftools filter -O z -o E_coli_filtered.vcf.gz -s LOWQUAL -i '%QUAL>10' E_coli.vcf.gz

bcftools: calculate stats based on filtered variants• You can tell bcftools stats to only analyze

variants that pass the filter

• syntax: $ bcftools stats -F [ref.fasta] -f PASS -s - [input filtered .vcf.gz file]

• -F: faidx indexed reference .fasta sequence • -f: how sequences to include are marked• -s: list of samples to analyze, "-" = all samples

• e.g.: $ bcftools stats -F E_coli.fasta -f PASS -s - E_coli_filtered.vcf.gz

Assignment

• How do our MiSeq and HiSeq E.coli datasets differ from the reference K-12 genome?

• Submit: 1. Number of SNP and indel differences compared

to the reference genome2. Justification for the filtering parameters used3. Lab notebook file listing all of the exact

parameters used4. Output of the bcftools stats analysis