Post on 03-Jun-2020
Annovar
Variants Analysis
http://www.openbioinformatics.org/annovar/
Marin Vargas, Sergio Paul
December 2013
Variant Analysis for the Diagnosis of Genetic Disease
Workflow:
• DNA extraction
• DNA sequencing (genome or exome; Illumina HiSeq)
• FASTQ files
• Variant calling against a reference genome (BWA + GATK)
• VCF files
• Variant analysis (several software tools)
Annovar description
• Annovar is a program for the functional annotation of genetic variants from high-throughput sequencing data.
• It is an efficient tool for the functional annotation of genetic variants from diverse genomes (human, mouse, worm, fly, yeast, etc.).
Diagram: ANNOVAR inputs and output
• Input: genetic variants (VCF format)
• Input: annotated genomes (GFF3 format) from UCSC and ENSEMBL (human, mouse, cow, etc.)
• Input: biological knowledge (predictors)
• Output: the most likely causal variants and their corresponding candidate genes
Annovar goal
• Variant reduction: a stepwise procedure excludes variants that are unlikely to be disease-causing, which helps identify the putative genes involved in the disease.
• Filtering out synonymous SNPs to leave candidates for further analysis.
• Different prediction algorithms use different information, so we use predictions from multiple algorithms.
• Querying predictions from different databases for different algorithms is both tedious and time-consuming.
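The stepwise reduction described above can be sketched in Python. This is only an illustration of the idea, not ANNOVAR's implementation: the variant records, the field names (`effect`, `max_pop_freq`) and the 1% frequency cut-off are assumptions made for the example.

```python
# Hypothetical variant records; field names and values are illustrative only.
variants = [
    {"id": "v1", "effect": "synonymous", "max_pop_freq": 0.20},
    {"id": "v2", "effect": "nonsynonymous", "max_pop_freq": 0.15},
    {"id": "v3", "effect": "nonsynonymous", "max_pop_freq": 0.001},
    {"id": "v4", "effect": "frameshift", "max_pop_freq": 0.0},
]


def drop_synonymous(v):
    """Keep variants that may change the protein."""
    return v["effect"] != "synonymous"


def drop_common(v, max_freq=0.01):
    """Keep variants that are rare in reference populations (assumed 1% cut-off)."""
    return v["max_pop_freq"] <= max_freq


def reduce_stepwise(variants, steps):
    """Apply each filter in turn and record how many variants survive each step."""
    remaining = list(variants)
    counts = [len(remaining)]
    for step in steps:
        remaining = [v for v in remaining if step(v)]
        counts.append(len(remaining))
    return remaining, counts


candidates, counts = reduce_stepwise(variants, [drop_synonymous, drop_common])
print(counts)                         # variants left after each step
print([v["id"] for v in candidates])  # surviving candidate variants
```

Keeping each step as a separate function makes it easy to reorder the filters or to report how many variants each one removes.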
Annovar functionality
• The principal functionality consists of three types of functional annotation:
  • Gene based: identify whether a single nucleotide variant (SNV), small insertion/deletion (indel) or copy number variation (CNV) causes protein-coding changes.
  • Region based: identify variants in specific genomic regions.
  • Filter based: identify variants by filtering against diverse databases.
• Secondary functionality:
  • Retrieve the nucleotide sequence at any user-specified genomic positions in batch.
  • Identify a candidate gene list for Mendelian diseases from exome data.
  • Other utilities.
Gene based
• Given a list of SNVs and indels from a whole-genome sequencing experiment on a human subject, it is of interest to identify the genes that are disrupted.
• For intergenic variants, we want to know the two flanking genes and the distances between the variant and those genes.
• For exonic variants, we want to know the amino acid changes.
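The flanking-gene lookup for intergenic variants can be illustrated with a short Python sketch. It is not how ANNOVAR is implemented; the gene coordinates and names are hypothetical, and the sketch assumes genes on a single chromosome that are sorted by start and do not overlap.

```python
from bisect import bisect_right

# Hypothetical genes on one chromosome: (start, end, name), sorted, non-overlapping.
genes = [(1000, 2000, "GENE_A"), (5000, 6000, "GENE_B"), (9000, 9500, "GENE_C")]
starts = [g[0] for g in genes]


def annotate_position(pos):
    """Return the containing gene, or the two flanking genes with distances."""
    i = bisect_right(starts, pos)  # number of genes starting at or before pos
    left = genes[i - 1] if i > 0 else None
    right = genes[i] if i < len(genes) else None
    if left is not None and pos <= left[1]:
        return {"region": "genic", "gene": left[2]}
    return {
        "region": "intergenic",
        # distance from the end of the left gene / to the start of the right gene
        "left": (left[2], pos - left[1]) if left else None,
        "right": (right[2], right[0] - pos) if right else None,
    }


print(annotate_position(3500))
print(annotate_position(5500))
```

A binary search over sorted gene starts keeps each lookup logarithmic in the number of genes, which matters when annotating millions of variants.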
Region based
• Identify variants at conserved genomic regions.
• Identify the subset of variants that either fall within the conserved regions (for SNPs and short indels) or overlap with these conserved regions (for large-scale CNVs).
• Use phastCons predictions to annotate variants that fall within conserved genomic regions.
• Use the TFBS (transcription factor binding site) database to annotate the respective region.
• Identify the cytogenetic band of genetic variants.
• Identify variants located in segmental duplications (SegDup).
• Identify previously reported structural variants in DGV (Database of Genomic Variants).
• Identify variants reported in previously published GWAS (genome-wide association studies).
• Identify variants in ENCODE annotated regions.
• Identify non-coding variants that disrupt enhancers, repressors or promoters.
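Region-based annotation comes down to an interval test: a SNP or short indel falls within a region, while a large CNV overlaps it. The sketch below shows that test in Python, with hypothetical region coordinates and labels; a real tool would index regions per chromosome rather than scan a list.

```python
def overlaps(var_start, var_end, region_start, region_end):
    """True if two closed intervals [start, end] share at least one base."""
    return var_start <= region_end and region_start <= var_end


# Hypothetical conserved regions on one chromosome: (start, end, label).
regions = [(100, 200, "conserved_1"), (500, 800, "conserved_2")]


def region_annotation(var_start, var_end):
    """Return labels of all regions hit by a variant spanning var_start..var_end."""
    return [label for start, end, label in regions
            if overlaps(var_start, var_end, start, end)]


print(region_annotation(150, 150))  # a SNP inside one region
print(region_annotation(150, 600))  # a large CNV spanning two regions
print(region_annotation(300, 400))  # a variant outside all regions
```

The same overlap test covers both cases in the slide, because a SNP is simply an interval whose start and end coincide.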
Filter based predictors 1
• Identify subsets of variants based on comparison to other variant databases, for example dbSNP or the 1000 Genomes Project.
• 1000 Genomes Project: started in January 2008, it is an international research effort to establish by far the most detailed catalogue of human genetic variation. Annovar uses the latest version (April 2012).
• dbSNP: the Single Nucleotide Polymorphism Database is a free public archive of genetic variation within and across different species, developed and hosted by the NCBI.
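Comparing variants against a reference database can be pictured as a keyed lookup. The Python sketch below is an assumption-laden illustration rather than ANNOVAR's own logic: the frequency table, its values and the 1% cut-off are all hypothetical.

```python
# Hypothetical allele-frequency table keyed by (chrom, pos, ref, alt).
known_freq = {
    ("1", 1000, "A", "G"): 0.25,
    ("2", 2000, "C", "T"): 0.002,
}


def split_by_database(variants, max_freq=0.01):
    """Split variants into (kept, dropped).

    A variant is dropped when it is present in the table with a frequency
    above max_freq; rare or unlisted variants are kept.
    """
    kept, dropped = [], []
    for v in variants:
        freq = known_freq.get(v)  # None when the variant is absent from the table
        if freq is not None and freq > max_freq:
            dropped.append(v)
        else:
            kept.append(v)
    return kept, dropped


query = [("1", 1000, "A", "G"), ("2", 2000, "C", "T"), ("3", 3000, "G", "A")]
kept, dropped = split_by_database(query)
print(kept)
print(dropped)
```

Matching on the full (chromosome, position, reference, alternate) key avoids discarding a rare allele just because a different, common allele is recorded at the same position.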
Filter based predictors 2
• dbNSFP (provided in Annovar as LJB2, named after its authors Liu, Jian and Boerwinkle, version 2) is a database for functional prediction and annotation of all potential non-synonymous SNVs in the human genome.
• It compiles prediction scores, along with conservation scores, from several popular algorithms, together with other related information.
• dbNSFP therefore covers two types of prediction algorithm:
  • Protein variant functional prediction.
  • Variant conservation prediction.
Filter based predictors 3
• dbNSFP protein variant functional prediction:
  • SIFT: Sorting Intolerant From Tolerant, predicts whether an amino acid substitution is likely to affect protein function based on sequence homology and the physico-chemical similarity between the alternate amino acids.
  • PolyPhen2: prediction of functional effects of human nsSNPs.
  • LRT: Likelihood Ratio Test, identifies a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences.
  • MutationTaster: rapid evaluation of the disease-causing potential of DNA sequence alterations.
  • MutationAssessor: predicts the functional impact of amino-acid substitutions in proteins.
  • FATHMM: Functional Analysis Through Hidden Markov Models.
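One simple way to combine predictions from several of these algorithms is a vote. The Python sketch below assumes each predictor's score has already been reduced to a categorical call; the calls, the "D"/"T" labels and the 50% threshold are hypothetical and are not taken from dbNSFP or ANNOVAR.

```python
# Hypothetical per-variant calls, already reduced to "D" (deleterious) or
# "T" (tolerated). None marks a predictor with no score for that variant.
calls = {
    "v1": {"SIFT": "D", "PolyPhen2": "D", "LRT": "D", "MutationTaster": "T"},
    "v2": {"SIFT": "T", "PolyPhen2": "D", "LRT": None, "MutationTaster": "T"},
}


def deleterious_fraction(predictions):
    """Fraction of available predictors that call the variant deleterious."""
    available = [c for c in predictions.values() if c is not None]
    if not available:
        return None  # no predictor scored this variant
    return sum(c == "D" for c in available) / len(available)


def is_candidate(predictions, min_fraction=0.5):
    """Keep a variant when at least min_fraction of predictors call it deleterious."""
    fraction = deleterious_fraction(predictions)
    return fraction is not None and fraction >= min_fraction


for name, predictions in calls.items():
    print(name, deleterious_fraction(predictions), is_candidate(predictions))
```

Ignoring missing predictions, rather than counting them as tolerated, keeps a variant from being penalised only because one algorithm could not score it.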
Filter based predictors 4
• dbNSFP variant conservation prediction:
  • PhyloP: assigns conservation p-values; scores reflect either conservation (positive scores) or acceleration, i.e. faster evolution than expected under neutral drift (negative scores).
  • GERP++: Genomic Evolutionary Rate Profiling, measures base conservation.
  • SiPhy: models the pattern of substitutions rather than just the rate, capturing biased substitutions (e.g. conserved lysine: AAA <-> AAG).
Filter based predictors 5
• ESP (Exome Sequencing Project) annotations
  • The ESP is an NHLBI-funded exome sequencing project aiming to identify genetic variants in exonic regions from over 6000 individuals, including healthy ones as well as subjects with different diseases.
• GERP++ (Genomic Evolutionary Rate Profiling) annotations
  • GERP identifies constrained elements in multiple alignments by quantifying substitution deficits.
• CG (Complete Genomics) frequency annotations
  • Each technical platform, such as Complete Genomics and Illumina HiSeq, may generate some platform-specific sequencing artifacts. Complete Genomics provides whole-genome data for a relatively small group of healthy subjects, but this data set can be quite useful to filter out technical artifacts for CG users.
• Population frequency ensemble annotations
  • The database popfreq_all integrates PopFreqMax, 1000G2012APR_ALL, 1000G2012APR_AFR, 1000G2012APR_AMR, 1000G2012APR_ASN, 1000G2012APR_EUR, ESP6500si_AA, ESP6500si_EA, CG46, NCI60, SNP137, COSMIC65, DISEASE.
• Generic mutation annotations
  • Annovar users have the flexibility to supply a custom-made annotation file and let ANNOVAR perform filter-based annotation on it.
Annovar result
• Two output files are generated:
  • The first file contains annotation for all variants.
  • The second file contains the amino acid changes that result from exonic variants.
• Annovar uses standardized nomenclature to annotate non-synonymous SNVs and indels on cDNA or on proteins.
Example: NOD2:NM_022162:exon4:p.R702W
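An annotation string like the example above can be split into its parts with a few lines of Python. The sketch handles only the four-field form shown here (gene, transcript, exon, protein change) and only simple single-residue substitutions; the function name and the returned field names are made up for illustration.

```python
import re

# Matches a simple protein-level substitution such as p.R702W.
PROTEIN_CHANGE = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_annotation(text):
    """Split 'GENE:TRANSCRIPT:EXON:p.XnnnY' into named parts."""
    gene, transcript, exon, protein = text.split(":")
    match = PROTEIN_CHANGE.match(protein)
    if match is None:
        raise ValueError(f"unsupported protein change: {protein}")
    ref_aa, position, alt_aa = match.groups()
    return {
        "gene": gene,
        "transcript": transcript,
        "exon": exon,
        "ref_aa": ref_aa,           # original amino acid (one-letter code)
        "position": int(position),  # residue number in the protein
        "alt_aa": alt_aa,           # substituted amino acid
    }


print(parse_annotation("NOD2:NM_022162:exon4:p.R702W"))
```

For the example string this reads as: gene NOD2, transcript NM_022162, exon 4, arginine (R) at residue 702 replaced by tryptophan (W).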