2014 Wellcome Trust Advanced Course: NGS Course - Lecture 2

WTAC NGS Course, Hinxton, 12th April 2014. Lecture 2: Identification of SNPs, Indels, and structural variants. Thomas Keane, Sequence Variation Infrastructure Group, WTSI. Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf


Transcript of 2014 Wellcome Trust Advanced Course: NGS Course - Lecture 2

  • Lecture 2: Identification of SNPs, Indels, and structural variants
      • VCF Format
      • SNP/indel Identification
      • Structural Variation
  • VCF: Variant Call Format
      • VCF is a standardised format for storing DNA polymorphism data
          • SNPs, insertions, deletions and structural variants
          • With rich annotations (e.g. context, predicted function, sequence data support)
      • Indexed for fast data retrieval of variants from a range of positions
      • Store variant information across many samples
      • Record meta-data about the site
          • dbSNP accession, filter status, validation status
      • Very flexible format
          • Arbitrary tags can be introduced to describe new types of variants
          • No two VCF files are necessarily the same
          • User extensible annotation fields supported
          • Same event can be expressed in multiple ways by including different numbers
          • Recommendation on VCF format website to ensure consistency
  • VCF Format
      • Header section and a data section
      • Header
          • Arbitrary number of meta-data information lines, starting with the characters ##
          • Column definition line starts with a single #
      • Mandatory columns
          • Chromosome (CHROM)
          • Position of the start of the variant (POS)
          • Unique identifiers of the variant (ID)
          • Reference allele (REF)
          • Comma separated list of alternate non-reference alleles (ALT)
          • Phred-scaled quality score (QUAL)
          • Site filtering information (FILTER)
          • User extensible annotation (INFO)
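The header/data split and the eight mandatory columns can be illustrated with a small parser. This is a minimal sketch, not a replacement for VCFtools or any other established library: the names `VcfRecord`, `parse_vcf` and `EXAMPLE_VCF` are hypothetical, the two data lines are illustrative records written in the style of the VCF specification's example, and per-sample genotype columns are ignored.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    """The eight mandatory VCF columns (hypothetical container for this sketch)."""
    chrom: str
    pos: int
    id: str
    ref: str
    alt: list      # ALT is a comma separated list of alleles
    qual: str
    filter: str
    info: dict     # INFO is a semicolon separated list of KEY=VALUE pairs or flags


def parse_vcf(lines):
    """Split VCF text into meta-data lines, column names and data records."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):        # meta-data information line
            meta.append(line[2:])
        elif line.startswith("#"):       # column definition line
            columns = line[1:].split("\t")
        else:                            # data line
            f = line.split("\t")
            info = {}
            for item in f[7].split(";"):
                if item == ".":
                    continue
                key, sep, value = item.partition("=")
                info[key] = value if sep else True   # bare key is a flag
            records.append(VcfRecord(f[0], int(f[1]), f[2], f[3],
                                     f[4].split(","), f[5], f[6], info))
    return meta, columns, records


# Illustrative input only (assumed data, not taken from the lecture slides).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2",
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tNS=2;DP=10;AF=0.333,0.667;AA=T;DB",
])

meta, columns, records = parse_vcf(EXAMPLE_VCF.splitlines())
for rec in records:
    print(rec.chrom, rec.pos, rec.ref, rec.alt, rec.info.get("DP"))
```

A site with more than one entry in `alt` (the second record here) is a multi-allelic site, which is the kind of detail the trivia questions ask about.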
  • Example VCF (SNPs/indels)
  • VCF Trivia 1
      • What version of the human reference genome was used?
      • What does the DB INFO tag stand for?
      • What does the ALT column contain?
      • At position 17330, what is the total depth? What is the depth for sample NA00002?
      • At position 17330, what is the genotype of NA00002?
      • Which position is a tri-allelic SNP site?
      • What sort of variant is at position 1234567? What is the genotype of NA00002?
  • Functional Annotation
      • VCF can store arbitrary
          • INFO tags per site
          • Genotype FORMAT tags
      • Use tags to describe
          • Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
          • Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
          • Presence of the variant in other large resequencing studies
      • Several tools for annotating a VCF
          • SnpEff: http://snpeff.sourceforge.net/
          • Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
          • FunSeq: http://funseq.gersteinlab.org/
  • Ensembl - VEP
      • "VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions."
      • Species must be included in either Ensembl or Ensembl Genomes
      • Sequence Ontology (SO) terms to describe genomic context
      • PubMed IDs for variants cited
      • Output only the most severe consequence per variation
      • Online or off-line mode
          • Off-line recommended for large numbers of variants (download relevant cache)
      • Human specific annotations
          • SIFT - predicts whether an amino acid substitution affects protein function
          • PolyPhen - predicts impact of an amino acid substitution on the structure of human proteins
          • 1000 Genomes frequencies - global or per population
  • VEP VCF
      • VEP INFO tag: ##INFO=
      • Example: CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
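The CSQ value packs one annotation per transcript, separated by commas, with the fields of each annotation separated by `|`. The sketch below unpacks the example from the slide. The function name `split_csq` is hypothetical, and because the field list in the `##INFO` header line is truncated in this transcript, the positions used here (index 2 for the transcript ID, index 4 for the consequence term, index 16 for the gene symbol) are read off the example itself and are an assumption. In practice the field order should be taken from the header of the VCF being parsed.

```python
def split_csq(csq_value):
    """Return one list of fields per transcript annotation in a CSQ value."""
    return [entry.split("|") for entry in csq_value.split(",")]


# The example from the slide, with the line-wrap spaces removed.
csq = ",".join([
    "T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant"
    "||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17",
    "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant"
    "|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17",
    "T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant"
    "|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17",
])

for fields in split_csq(csq):
    # Assumed positions: 2 = transcript, 4 = consequence, 16 = gene symbol.
    print(fields[2], fields[4], fields[16])
```

One variant can therefore carry several consequences, one per overlapping transcript, which is why VEP offers the option of reporting only the most severe one.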
  • More Information
      • VCF
          • http://bioinformatics.oxfordjournals.org/content/27/15/2156.full
          • http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
      • VCFTools
          • http://vcftools.sourceforge.net
      • GATK
          • http://www.broadinstitute.org/gatk/
          • http://www.broadinstitute.org/gatk/guide/article?id=1268
      • VCF Annotation
          • Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/index.html
          • SnpEff: http://snpeff.sourceforge.net/
          • Anntools: http://anntools.sourceforge.net/
  • Lecture 2: Identification of SNPs, Indels, and structural variants
      • VCF Format
      • SNP/indel Identification
      • Structural Variation
  • SNP Identification
      • SNP - single nucleotide polymorphism
      • Examine the bases aligned to a position and look for differences
      • SNP discovery vs genotyping
          • Discovery: finding new variant sites
          • Genotyping: determining the genotype at a set of already known sites
      • Factors to consider when calling SNPs
          • Base call qualities of each supporting base
          • Proximity to
              • Small indel
              • Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
          • Mapping qualities of the reads supporting the SNP
              • Low mapping quality indicates repetitive sequence
          • Read length
              • Possible to align reads with high confidence to a larger portion of the genome with longer reads
          • Paired reads
          • Sequencing depth
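A site-based caller of the kind described here can be reduced to a few lines: look at the bases aligned to one position, discard those with poor base or mapping quality, and check whether a non-reference base is well supported. The following is a deliberately naive sketch. The function name `candidate_snp` and all thresholds are assumptions chosen for illustration, and it ignores indel proximity, homopolymers, pairing and genotype likelihoods, which real callers take into account.

```python
from collections import Counter


def candidate_snp(ref_base, observations,
                  min_baseq=20, min_mapq=30, min_depth=8, min_alt_fraction=0.2):
    """Naive site-based SNP check (all thresholds are illustrative assumptions).

    observations: iterable of (base, base_quality, mapping_quality) tuples,
    one per read aligned over the site.
    Returns (alt_base, alt_count, usable_depth) or None.
    """
    # Keep only bases with acceptable base call and mapping quality.
    usable = [base for base, baseq, mapq in observations
              if baseq >= min_baseq and mapq >= min_mapq]
    depth = len(usable)
    if depth < min_depth:            # too little sequencing depth to decide
        return None
    alt_counts = Counter(base for base in usable if base != ref_base)
    if not alt_counts:               # every usable base matches the reference
        return None
    alt_base, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth < min_alt_fraction:
        return None
    return alt_base, alt_count, depth


pileup = ([("A", 35, 60)] * 6      # reference-supporting reads
          + [("G", 35, 60)] * 5    # well-supported alternate allele
          + [("G", 5, 60)] * 3     # low base quality, ignored
          + [("T", 35, 0)] * 2)    # mapping quality 0 (repeat), ignored
print(candidate_snp("A", pileup))
```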
  • Mouse SNP
  • Read Length vs. Uniqueness
  • Inaccessible Genome
  • Is this a real SNP?
  • Evaluating SNPs
      • Specificity vs sensitivity
          • False positives vs. false negatives
          • Desirable to have high sensitivity and specificity
      • Sensitivity
          • External sources of validation
      • Specificity
          • Test a random selection of SNPs by another technology, e.g. Sequenom, Sanger sequencing
      • Receiver operating characteristic (ROC) curves to investigate effects of varying parameters
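Both checks come down to simple ratios. The sketch below assumes a call set, an external set of known variant sites, and a list of pass/fail results from retesting a random sample of calls by another technology. The function names and the toy data are hypothetical. Note that the fraction of retested calls that confirm is, strictly speaking, the precision of the call set, used here as the practical handle on false positives.

```python
def sensitivity(called_sites, known_sites):
    """Fraction of externally validated sites that the call set recovers."""
    known = set(known_sites)
    return len(known & set(called_sites)) / len(known)


def confirmed_fraction(validation_results):
    """Fraction of retested calls confirmed by the second technology."""
    results = list(validation_results)
    return sum(results) / len(results)


# Toy data (assumed): sites are (chromosome, position) pairs.
calls = {("chr1", 100), ("chr1", 200), ("chr2", 50)}
known = {("chr1", 100), ("chr2", 50), ("chr3", 7), ("chr3", 9)}
retested = [True, True, False, True]   # e.g. Sanger result per sampled call

print(sensitivity(calls, known))       # 2 of the 4 known sites were called
print(confirmed_fraction(retested))    # 3 of the 4 retested calls confirmed
```

Recomputing these two numbers while sweeping a caller parameter, such as a minimum quality threshold, gives the points of the curve mentioned in the last bullet.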
  • Known Systematic Biases
      • Many biases can be introduced in sample preparation, the sequencing process, computational alignment steps, etc.
          • Can generate false positive SNPs/indels
      • Potential biases
          • Strand bias
          • End distance bias
          • Consistency across replicates/libraries
          • Variant distance bias
      • VCF Tools
          • Soft filter variants file for these biases
          • Variants kept in the file - just annotated with potential bias affecting the variant
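Strand bias is a convenient example of how a soft filter works. One common way to test for it is Fisher's exact test on the 2x2 table of reference/alternate reads by forward/reverse strand. The sketch below implements that test with the standard library and returns a FILTER label instead of discarding the site. It is not the VCFtools implementation: the function names, the `StrandBias` label and the 0.05 threshold are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table."""
    ref_total = ref_fwd + ref_rev
    alt_total = alt_fwd + alt_rev
    fwd_total = ref_fwd + alt_fwd
    n = ref_total + alt_total

    def prob(x):
        # Hypergeometric probability of x forward-strand reference reads.
        return comb(ref_total, x) * comb(alt_total, fwd_total - x) / comb(n, fwd_total)

    p_observed = prob(ref_fwd)
    lowest = max(0, fwd_total - alt_total)
    highest = min(ref_total, fwd_total)
    # Sum every table that is at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def soft_filter(ref_fwd, ref_rev, alt_fwd, alt_rev, p_threshold=0.05):
    """Return a FILTER label; the variant itself is kept either way."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    return "StrandBias" if p < p_threshold else "PASS"


# Alternate allele seen on the forward strand only: flagged, not removed.
print(soft_filter(ref_fwd=10, ref_rev=10, alt_fwd=10, alt_rev=0))
# Alternate allele balanced across strands.
print(soft_filter(ref_fwd=10, ref_rev=10, alt_fwd=5, alt_rev=5))
```

Because the label is only an annotation, downstream users can still decide for themselves whether to trust a flagged site.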
  • Strand Bias
  • End Distance Bias
  • Variant Distance Bias
  • Reproducibility
  • Future of Variant Calling?
      • Current approaches
          • Rely heavily on the supplied alignment
          • Largely site based, don't examine local haplotype
      • Local de novo assembly based variant callers
          • Call SNPs, indels, MNPs and small SVs simultaneously
          • Can remove mapping artifacts
          • e.g. GATK HaplotypeCaller
  • Haplotype Based Calling - GATK
  • Lecture 2: Identification of SNPs, Indels, and structural variants
      • VCF Format
      • SNP/indel Identification
      • Structural Variation
  • Genomic Structural Variation
      • Large DNA rearrangements (>100bp)
      • Frequent causes of disease
          • Referred to as genomic disorders
          • Mendelian diseases or complex traits such as behaviors
          • E.g. increase in gene dosage due to increase in copy number
          • Prevalent in cancer genomes
      • Many types of genomic structural variation (SV)
          • Insertions, deletions, copy number changes, inversions, translocations & complex events
      • Comparative genomic hybridization (CGH) traditionally used for copy number discovery
          • CNVs of 1-50 kb in size have been under-ascertained
      • Next-gen sequencing revolutionised the field of SV discovery
          • Parallel sequencing of ends of large numbers of DNA fragments
          • Examine alignment distance of reads to discover presence of genomic rearrangements
          • Resolution down to ~100bp
  • Human Disease: Stankiewicz and Lupski (2010) Ann. Rev. Med.
  • Structural Variation
      • Several types of structural variations (SVs)
          • Large insertions/deletions
          • Inversions
          • Translocations
      • Read pair information used to detect these events
          • Paired end sequencing of either end of a DNA fragment
          • Observe deviations from the expected fragment size
          • Presence/absence of mate pairs
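The read-pair signals listed above can be sketched as a small classifier. It assumes a standard paired-end library in which the two reads of a pair face each other (forward read leftmost), and a library whose fragment sizes have the given mean and standard deviation. The function name `classify_pair`, the labels and the default values are hypothetical, and the span is approximated from the two leftmost alignment positions, ignoring read length.

```python
def classify_pair(read1, read2, mean_insert=300, sd_insert=30, n_sd=4):
    """Label one read pair by the SV signal it might carry (illustrative only).

    read1, read2: (chromosome, leftmost_position, strand) with strand '+' or '-'.
    """
    chrom1, pos1, strand1 = read1
    chrom2, pos2, strand2 = read2
    if chrom1 != chrom2:
        return "translocation_candidate"     # mates on different chromosomes
    if strand1 == strand2:
        return "inversion_candidate"         # mates point the same way
    (left_pos, left_strand), (right_pos, _) = sorted([(pos1, strand1),
                                                      (pos2, strand2)])
    if left_strand == "-":
        return "everted_pair"                # mates face outwards
    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"          # mates map too far apart
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"         # mates map too close together
    return "concordant"


pairs = [
    (("chr1", 1000, "+"), ("chr1", 1250, "-")),
    (("chr1", 1000, "+"), ("chr1", 6000, "-")),
    (("chr1", 1000, "+"), ("chr1", 1050, "-")),
    (("chr1", 1000, "+"), ("chr1", 1300, "+")),
    (("chr1", 1000, "+"), ("chr5", 1300, "-")),
]
for read1, read2 in pairs:
    print(read1, read2, classify_pair(read1, read2))
```

A single discordant pair is weak evidence. SV callers typically require several pairs with the same signal clustered at one locus before reporting an event.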
  • Structural Variation Types
  • Fragment Size QC
  • What is this?
  • What is this?
  • What is this?
  • Mobile Element Insertions
      • Transposons are segments of DNA that can move within the genome
          • A minimal genome - ability to replicate and change location
          • Relics of ancient viral infections
      • Dominate landscape of mammalian genomes
          • 38-45% of rodent and primate genomes
          • Genome size proportional to number of TEs
      • Class 1 (RNA intermediate) and 2 (DNA intermediate)
      • Potent genetic mutagens
          • Disrupt expression of genes
          • Genome reorganisation and evolution
          • Transduction of flanking sequence
      • Species specific families
          • Human: Alu, L1, SVA
          • Mouse: SINE, LINE, ERV
          • Many other families in other species
  • Human Mobile Elements
  • Mobile Element Insertions
  • Mouse Example - LookSeq
  • Human Alu - IGV
  • Detecting Mobile Element Insertions
      • Most algorithms for locating non-reference mobile elements operate in a similar manner
      • Goal: detect all read pairs where one end is flanking the insertion point and the mate is in the inserted sequence
      • Pseudo algorithm
          • Read through the BAM file and make a list of all discordant read pairs
          • Keep the pairs where one end is similar to your library of mobile elements
          • Remove anchor reads with low mapping quality
          • Cluster the anchor reads and examine the breakpoint
          • Filter out any clusters close to annotated elements of the same type
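The pseudo algorithm can be sketched on pre-extracted data. This is not how RetroSeq or any other specific tool is implemented. It assumes the discordant pairs have already been read from the BAM file and that each mate has already been compared against the element library, so every input item is `(anchor_chromosome, anchor_position, anchor_mapping_quality, element_family_or_None)`. The function name `call_insertions` and all thresholds are hypothetical, and breakpoint refinement is left out: each call simply reports the interval spanned by its anchor reads.

```python
from collections import defaultdict


def call_insertions(pairs, annotated, min_mapq=30, max_gap=500,
                    min_reads=3, exclusion=100):
    """Cluster anchor reads into candidate mobile element insertions.

    pairs: (chrom, anchor_pos, anchor_mapq, family) per discordant pair;
           family is None when the mate matches no library element.
    annotated: (chrom, start, end, family) for elements in the reference.
    """
    # Steps 2 and 3: keep element-matching pairs with a confident anchor.
    anchors = defaultdict(list)
    for chrom, pos, mapq, family in pairs:
        if family is not None and mapq >= min_mapq:
            anchors[(chrom, family)].append(pos)

    calls = []
    for (chrom, family), positions in sorted(anchors.items()):
        positions.sort()
        cluster = [positions[0]]
        # The trailing None flushes the final cluster.
        for pos in positions[1:] + [None]:
            if pos is not None and pos - cluster[-1] <= max_gap:
                cluster.append(pos)          # step 4: extend current cluster
                continue
            if len(cluster) >= min_reads:
                start, end = cluster[0], cluster[-1]
                # Step 5: drop clusters near a known element of the same type.
                near_known = any(
                    c == chrom and f == family
                    and s - exclusion <= end and start <= e + exclusion
                    for c, s, e, f in annotated)
                if not near_known:
                    calls.append((chrom, start, end, family, len(cluster)))
            if pos is not None:
                cluster = [pos]
    return calls


discordant = [
    ("chr1", 1000, 60, "Alu"), ("chr1", 1100, 60, "Alu"), ("chr1", 1200, 55, "Alu"),
    ("chr1", 5000, 60, "L1"), ("chr1", 5100, 60, "L1"), ("chr1", 5150, 60, "L1"),
    ("chr2", 300, 5, "Alu"),      # anchor in a repeat: low mapping quality
    ("chr2", 900, 60, None),      # mate does not match the element library
]
reference_elements = [("chr1", 5200, 11200, "L1")]
print(call_insertions(discordant, reference_elements))
```

The last filter matters because reads from a non-reference-looking pair often originate from an element that is already annotated nearby, which would otherwise be reported as a new insertion.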
  • 1000 Genomes CEU Trio
      • Typical human sample: ~900-1000 non-reference mobile elements
          • ~800 Alu elements, ~100 L1
      • Why are there 44 calls private to the child?
  • Mobile Element Software
      • RetroSeq: https://github.com/tk2/RetroSeq
      • VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
      • T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
      • Tea: http://compbio.med.harvard.edu/Tea/