Mutation Analysis
Sebastian Bauer
Institut für Medizinische Genetik, Charité Universitätsmedizin Berlin
2012/03/22
Workflow for Mutation Analysis
Raw Data Generation: Sample preparation and sequencing
Raw Data Analysis: Base calling
Whole Genome Mapping: Alignment to a reference genome
Variant Calling: Detection of genetic variation
Annotation: Linking variants to biological information
Raw Data Generation
Prepare samples
Then sequence
Output is vendor-specific raw data
Raw Data Analysis: Base Calling
Transform raw data into sequences of bases
Exact procedure depends on the sequencing platform used
Most platforms additionally report a quality score for each base, which can be transformed into a Phred score
$Q_{\text{Phred}} = -10 \log_{10} P(\text{error})$
Example: $Q_{\text{Phred}} = 20 \Leftrightarrow P(\text{error}) = 1\%$
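The Phred relation above can be checked numerically. The following is a minimal sketch; the function names `phred_to_error` and `error_to_phred` are chosen here for illustration and do not come from any particular base caller.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p_error: float) -> float:
    """Phred quality score for an error probability (0 < p_error <= 1)."""
    return -10 * math.log10(p_error)


if __name__ == "__main__":
    # Each step of 10 in the score lowers the error probability tenfold.
    for q in (10, 20, 30):
        print(f"Q = {q}: P(error) = {phred_to_error(q)}")
```

A score of 20 corresponds to an error probability of 0.01, which matches the 1% example on the slide.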
One Sequence Entry (Read) in the Output Fastq File
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
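A read like the one above can be decoded with a few lines of code. This is a sketch under two assumptions that the slide does not state: the quality line uses the common Phred+33 ASCII encoding, and the apostrophes in the quality string are plain ASCII. The helper name `parse_fastq_record` is hypothetical.

```python
def parse_fastq_record(lines, offset=33):
    """Split one four-line FASTQ record into (name, sequence, Phred scores).

    `offset` is the assumed ASCII offset of the quality encoding (Phred+33).
    """
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, sequence, scores = parse_fastq_record(record)

if __name__ == "__main__":
    print(name, len(sequence), scores[:5])
```

Under the Phred+33 assumption, the first quality character `!` decodes to a score of 0 and the last character `5` to a score of 20.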
Whole Genome Mapping / Aligning
Current methods assume mapping to a reference genome
Allows finding variants with known disease associations as well as new candidates
Most short-read mappers use hash tables or data structures based on the Burrows-Wheeler transform
Output consists of mapping statistics and a SAM or BAM file
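To make the Burrows-Wheeler transform mentioned above concrete, here is a deliberately naive sketch that sorts all rotations of the text. Real mappers build compressed indexes on top of the transform and never materialize the rotations; the `$` terminator and the function names are assumptions made for this illustration.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (O(n^2 log n), demo only)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    print(bwt("banana"))
    print(inverse_bwt(bwt("GATTACA")))
```

The transform is reversible and tends to group identical characters together, which is what makes it a useful basis for compact, searchable indexes of a reference genome.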
Variant Calling: Genetic Variation
Variant Calling: Identify regions that differ from the reference
Single nucleotide variants (SNVs)
  TGCATTGCGTAGGC
  TGCATTCCGTAGGC
Short indels (= insertion/deletion)
  TGCATT---TAGGC
  TGCATTCCGTAGGC
Microsatellites
  TGCTCATCATCATCAGC
  TGCTCATCA------GC
Minisatellites
≤ 100 bp
Copy number variations (CNVs; large deletions, duplications, inversions)
≥ 1000 bp
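The small aligned examples above can be compared column by column. The sketch below assumes, purely for illustration, that the first row of each pair is the sample and the second row is the reference, with `-` marking a gap; the slide does not say which row is which, and `describe_differences` is a hypothetical helper.

```python
def describe_differences(sample: str, reference: str, gap: str = "-"):
    """List (kind, start, length) events between two aligned sequences.

    Positions are 0-based alignment columns. A gap in the sample is reported
    as a deletion, a gap in the reference as an insertion.
    """
    if len(sample) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    events = []
    i = 0
    while i < len(sample):
        s, r = sample[i], reference[i]
        if s == r:
            i += 1
        elif s == gap or r == gap:
            kind = "deletion" if s == gap else "insertion"
            gapped = sample if s == gap else reference
            start = i
            # Extend the event over the whole run of gap characters.
            while i < len(gapped) and gapped[i] == gap:
                i += 1
            events.append((kind, start, i - start))
        else:
            events.append(("SNV", i, 1))
            i += 1
    return events


if __name__ == "__main__":
    print(describe_differences("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"))
    print(describe_differences("TGCATT---TAGGC", "TGCATTCCGTAGGC"))
    print(describe_differences("TGCTCATCATCATCAGC", "TGCTCATCA------GC"))
```

Real variant callers work on many reads per position rather than on a single pairwise alignment, so this only illustrates the variant classes themselves.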
Variant Calling: Genotype
Easy Approach
Count alleles at each column X in the pileup and use cutoff rules
1 Keep only bases with a Phred score ($Q_{\text{Phred}}$) of at least 20
2 Call a genotype heterozygous if the non-reference allele fraction is between 20% and 80%, otherwise homozygous
Works reasonably well when coverage is > 20 (Nielsen et al. 2011)
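The cutoff rules above can be sketched for a single pileup column as follows. The function name, the argument layout, and the handling of the exact 20% and 80% boundaries (treated as heterozygous) are assumptions for illustration, not a description of any specific tool.

```python
from collections import Counter


def call_genotype(bases, quals, ref, min_qual=20, low=0.2, high=0.8):
    """Call a genotype for one pileup column using simple cutoffs.

    bases: observed bases in the column, quals: their Phred scores,
    ref: the reference base. Returns a two-letter genotype or None.
    """
    kept = [b for b, q in zip(bases, quals) if q >= min_qual]
    if not kept:
        return None  # no usable coverage at this position
    counts = Counter(kept)
    non_ref = len(kept) - counts[ref]
    if non_ref == 0:
        return ref + ref
    fraction = non_ref / len(kept)
    # Most frequent non-reference allele among the retained bases.
    alt = max((b for b in counts if b != ref), key=counts.get)
    if fraction < low:
        return ref + ref
    if fraction > high:
        return alt + alt
    return ref + alt


if __name__ == "__main__":
    print(call_genotype("GGGGGTTTTT", [30] * 10, ref="G"))
    print(call_genotype("GTTTTTTTTT", [30] * 10, ref="G"))
```

With low coverage the observed allele fraction is noisy, which is one reason why the approach is only recommended for coverage above 20.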
More elaborate approaches are based on probabilistic frameworks
$$P(G \mid X) \propto P(X \mid G)\,P(G) = P(G) \prod_i P(X_i \mid G), \qquad G \in \{A, C, T, G\}$$
The likelihood $P(X \mid G)$ is computed from the quality-score-based $P(X_i \mid G)$ of each entry $i$
$P(G)$ allows specifying data-independent prior knowledge
The posterior $0 < P(G \mid X) < 1$ assesses both genotype and confidence
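The posterior above can be evaluated directly for a single pileup column. This sketch assumes a simple error model that the slide does not specify: a base call with error probability e matches the true base with probability 1 - e and shows each of the other three bases with probability e / 3. The name `genotype_posterior` and the uniform default prior are likewise illustrative.

```python
def genotype_posterior(bases, quals, prior=None):
    """Posterior P(G | X) over G in {A, C, G, T} for one pileup column.

    bases: observed bases X_i, quals: their Phred scores,
    prior: optional dict mapping each candidate to P(G) (uniform if omitted).
    """
    alleles = "ACGT"
    if prior is None:
        prior = {g: 1 / len(alleles) for g in alleles}
    unnormalized = {}
    for g in alleles:
        likelihood = 1.0
        for base, qual in zip(bases, quals):
            error = 10 ** (-qual / 10)
            # Assumed error model: errors are spread evenly over the other bases.
            likelihood *= (1 - error) if base == g else error / 3
        unnormalized[g] = likelihood * prior[g]
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


if __name__ == "__main__":
    posterior = genotype_posterior("GGGT", [30, 30, 30, 20])
    for g, p in sorted(posterior.items(), key=lambda item: -item[1]):
        print(g, p)
```

For deep coverage the product of many small probabilities underflows, so a practical implementation would sum log-likelihoods instead of multiplying.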
Variant Calling: Integrating Prior Knowledge
Single Sample Prior for a Given Position X
Suppose that a G/T polymorphism is reported in dbSNP. Then,
Genotype G   GG      TT      GT       Other combinations
P(G)         0.454   0.454   0.0909   $< 10^{-4}$
GATK multi-sample uses allele frequencies estimated from larger sample sets, combined with Hardy-Weinberg equilibrium
GATK-Beagle additionally uses linkage disequilibrium data
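As a rough illustration of how an allele frequency can be turned into genotype priors under Hardy-Weinberg equilibrium, consider the sketch below. It is not GATK's implementation; the function name and the biallelic G/T setting are assumptions made for the example.

```python
def hardy_weinberg_priors(freq_a: float, a: str = "G", b: str = "T"):
    """Genotype priors for a biallelic site under Hardy-Weinberg equilibrium.

    freq_a is the population frequency p of allele `a`; allele `b` has
    frequency q = 1 - p. The priors are p^2, 2pq and q^2.
    """
    if not 0.0 <= freq_a <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p, q = freq_a, 1.0 - freq_a
    return {a + a: p * p, a + b: 2 * p * q, b + b: q * q}


if __name__ == "__main__":
    for frequency in (0.5, 0.9):
        print(frequency, hardy_weinberg_priors(frequency))
```

Such priors would then take the place of P(G) in a Bayesian genotype model, so that common alleles need less read evidence than rare ones.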
Variant Calling: Other Examples of Extensions
(taken from the Illumina Website)
But... none of this is of much use for rare mutations
Annotation
Between 3 and 5 million SNVs per individual
Only a few have a functional impact
Separating these from the rest is a challenge for bioinformatics
Many tools use supervised learning approaches (remember my last talk)
SNV features
cSNVs (protein-coding)
– Amino acid residue substitution properties
– Evolutionary history of AA position
– Sequence-function relationship
– Structure-function relationship
rSNVs (regulatory)
– Transcription
– Pre-mRNA splicing
– MicroRNA binding
– Post-translational modification sites
Annotation: Protein-Sequence-Based
(Cooper et al.)
Annotation: DNA-Sequence-Based
(Cooper et al.)
Flow chart for informed use of SNV function prediction tools
Cline et al.
Final
Thanks for your attention!
References
Nielsen et al. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics (2011)
Cline et al. Using bioinformatics to predict the functional impact of SNVs. Bioinformatics (2011)
Cooper et al. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics (2011)