Software tools for the analysis of medically
important sequence variations
Gabor T. Marth, D.Sc.
Boston College, Department of [email protected]
http://bioinformatics.bc.edu/marthlab
Pfizer visit, March 7, 2006
Our lab focuses on three main projects…
1. software tools for clinical case-control association studies
2. software for SNP discovery in clonal and re-sequencing data
3. connecting HapMap and pharmaco-genetic data
1. We are developing computer software to aid tagSNP selection and association testing
[Figure: software overview. Labels: study specification, user input; tag evaluation, marker selection, association testing; reference samples, representative computational samples, computational sample database; GUI, user control interface; input data views, LD views, gene annotations, tags, association statistics]
[Figure: 5-site computationally generated LD (r2) vs. LA LD (r2), both axes 0 to 1; series by marker separation: 1-4, 5-9, 10-17, 18-26]
(discussed in more detail)
• inherited (germ line) polymorphisms are important as they can predispose to disease
2. We build computer tools for SNP discovery
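$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$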
• we have a 5-year NIH R01 grant to re-develop our computer package, PolyBayes©, our SNP discovery tool originally developed while the PI was at the Washington University Medical School
Marth et al. Nature Genetics 1999
• looking for SNPs and short INDELs
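A minimal sketch of a Bayesian polymorphism calculation of the kind used for SNP discovery: for one alignment column it sums the posterior weight of all base combinations that contain more than one nucleotide and normalizes by the weight of all combinations. The function names, the flat per-base prior, the even split of the joint prior, and the `p_poly` value are illustrative assumptions, not the priors actually used in PolyBayes.

```python
from itertools import product

BASES = "ACGT"


def posterior_from_call(called, error):
    """P(S_i | R_i) for one read: the called base gets 1 - error,
    the three other bases share the error probability equally."""
    return {b: (1 - error if b == called else error / 3) for b in BASES}


def snp_probability(read_posteriors, p_poly=0.001):
    """Posterior probability that an alignment column is polymorphic.

    read_posteriors: one dict per aligned sequence mapping base -> P(S_i | R_i).
    p_poly: assumed prior probability that a site is polymorphic.
    """
    n = len(read_posteriors)
    base_prior = 1.0 / len(BASES)        # flat per-base prior (assumption)
    n_mono = len(BASES)                  # columns in which all reads agree
    n_var = len(BASES) ** n - n_mono     # columns with at least two bases
    total = variable = 0.0
    for column in product(BASES, repeat=n):
        is_variable = len(set(column)) > 1
        # Joint prior: p_poly spread evenly over variable columns and
        # (1 - p_poly) spread evenly over monomorphic ones (simplification).
        term = p_poly / n_var if is_variable else (1 - p_poly) / n_mono
        for base, posterior in zip(column, read_posteriors):
            term *= posterior[base] / base_prior
        total += term
        if is_variable:
            variable += term
    return variable / total


# Hypothetical columns: two A reads plus two G reads, and four A reads.
mixed = [posterior_from_call(b, 0.001) for b in "AAGG"]
uniform = [posterior_from_call(b, 0.001) for b in "AAAA"]
print(snp_probability(mixed), snp_probability(uniform))
```

The enumeration is exponential in the number of reads, so this form is only practical for shallow columns; it is meant to show the structure of the calculation rather than an efficient implementation.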
Apply our tools for genome-scale SNP mining
Sachidanandam et al. Nature 2001
[Figure labels: ~10 million; EST, WGS, BAC; genome reference]
Extend our methods for SNP detection in medical re-sequencing data from traditional Sanger sequencers…
Homozygous T
Homozygous C
Heterozygous C/T
… and in 454 pyrosequence data
454 sequence from the NCBI Trace Archive
• accurate base calling for de novo sequencing
• detection of heterozygotes in medical re-sequencing data
Figure from Nordfors et al. Human Mutation 19:395-401 (2002)
(discussed in more detail)
Developing methods to detect somatic mutations (as distinguished from inherited polymorphisms)
© Brian Stavely, Memorial University of Newfoundland
• the detection of somatic mutations, and their distinction from inherited polymorphism, will be important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer
(discussed in more detail)
Process DNA methylation data obtained with sequencing
DNA methylation is important, e.g. because hypo- and hypermethylation are consistently present in various cancers
Issa. Nature Reviews Cancer, 4, 2004: 988-993
we are developing methods to interpret DNA methylation data obtained with sequencing, in the presence of methodological artifacts such as incomplete bisulfite conversion of unmethylated cytosines
Lewin et al. Bioinformatics, 20:3005-3012, 2004
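To illustrate why incomplete bisulfite conversion matters: if a fraction of unmethylated cytosines fails to convert, they are read as C and inflate the apparent methylation level. The sketch below applies a simple algebraic correction, assuming a single known conversion rate per experiment. The function name and this model are illustrative assumptions, not the method of Lewin et al. or of our lab.

```python
def corrected_methylation(observed_c_fraction, conversion_rate):
    """Estimate the true methylated fraction m at a cytosine.

    Model (assumption): observed C fraction f = m + (1 - m) * (1 - c),
    where c is the bisulfite conversion rate of unmethylated cytosines.
    Solving for m gives m = 1 - (1 - f) / c.
    """
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    m = 1.0 - (1.0 - observed_c_fraction) / conversion_rate
    # Measurement noise can push the estimate slightly outside [0, 1].
    return min(1.0, max(0.0, m))


# With 95% conversion, an observed C fraction of 0.525 corresponds to
# a true methylation level of about one half under this model.
print(corrected_methylation(0.525, 0.95))
```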
… and tools to integrate genetic and epigenetic data from varied sources to find “common themes” during cancer development
chromatin structure
gene expression profiles
copy number changes
methylation profiles
chromosome rearrangements
repeat expansions
somatic mutations
3. We are planning a project to connect multi-marker haplotypes to drug metabolic phenotypes
• predicting metabolic phenotypes (ADR) based on haplotype markers
• evolutionary origin of drug metabolizing enzyme polymorphisms
Computer software to aid case-control association studies: tagSNP selection and association testing (details)
[Figure: 5-site computationally generated LD (r2) vs. LA LD (r2), both axes 0 to 1; series by marker separation: 1-4, 5-9, 10-17, 18-26]
Dr. Eric Tsung
Clinical case-control association studies – concepts
• association studies are designed to find disease-causing genetic variants
• searching “significant” marker allele frequency differences between cases and controls
[Figure: clinical cases and clinical controls; axes AF(cases) and AF(controls)]
• genotyping cases and controls at various polymorphisms
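As a sketch of what testing for a "significant" allele frequency difference can look like in the simplest single-SNP case, the code below runs a 2x2 chi-square test on allele counts in cases and controls. The function name and the example counts are hypothetical; this is the textbook test, not the association statistic implemented in our software.

```python
import math


def allele_chi_square(case_alt, case_total, ctrl_alt, ctrl_total):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Counts are in chromosomes. Returns (chi2, p_value). Assumes both
    alleles are observed at least once overall (no zero column totals).
    """
    table = [[case_alt, case_total - case_alt],
             [ctrl_alt, ctrl_total - ctrl_alt]]
    rows = [case_total, ctrl_total]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    n = case_total + ctrl_total
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical study: allele frequency 0.40 in cases vs. 0.30 in controls,
# 1,000 chromosomes in each group.
chi2, p = allele_chi_square(400, 1000, 300, 1000)
print(chi2, p)
```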
Association study designs
• region(s) interrogated: single gene, list of candidate genes (“candidate gene study”), or entire genome (“genome scan”)
• direct or indirect: genotyping the causative variant itself, or a marker that is co-inherited with the causative variant
• single-SNP marker or multi-SNP haplotype marker
• single-stage or multi-stage
Marker (tag) selection for association studies
for economy, one cannot genotype every SNP in thousands of clinical samples: marker selection is the process where a subset of all available SNPs is chosen
1. hypothesis-driven (i.e. based on gene function)
2. LD-driven: based entirely on the reduction of redundancy presented by the linkage disequilibrium (LD) between SNPs; tags represent other SNPs they are correlated with
[Figure label: causative variant]
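LD-driven tag selection can be sketched as a greedy procedure: compute pairwise r2 between SNPs from phased haplotypes, then repeatedly pick the SNP that represents the most not-yet-covered SNPs. The greedy strategy, the 0.8 threshold, and the SNP names are assumptions chosen for illustration; the source does not specify the selection algorithm.

```python
def r_squared(a, b):
    """r2 between two biallelic SNPs, given 0/1 alleles per haplotype."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:              # at least one SNP is monomorphic
        return 0.0
    d = pab - pa * pb
    return d * d / denom


def greedy_tags(snps, threshold=0.8):
    """Map each chosen tag SNP to the set of SNPs it represents.

    snps: dict of SNP name -> list of 0/1 alleles (one per haplotype).
    """
    untagged = set(snps)
    tags = {}
    while untagged:
        # Choose the SNP in LD (r2 >= threshold) with the most untagged SNPs;
        # sorting makes ties resolve deterministically.
        best = max(sorted(untagged),
                   key=lambda s: sum(r_squared(snps[s], snps[t]) >= threshold
                                     for t in untagged))
        covered = {t for t in untagged
                   if t == best or r_squared(snps[best], snps[t]) >= threshold}
        tags[best] = covered
        untagged -= covered
    return tags


# Hypothetical haplotype data for three SNPs in six chromosomes.
example = {
    "snpA": [0, 0, 1, 1, 0, 1],
    "snpB": [0, 0, 1, 1, 0, 1],   # perfectly correlated with snpA
    "snpC": [1, 0, 1, 0, 1, 0],   # weakly correlated with the others
}
print(greedy_tags(example))
```

Here one of the two perfectly correlated SNPs is enough to represent both, which is the redundancy reduction that makes tagging economical.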
The International HapMap project
http://www.hapmap.org
The international HapMap project was designed to provide a set of physical and informational reagents for association studies by mapping out human LD structure
LD varies across samples
African reference (YRI)
there are large differences in LD between different human populations…
European reference (CEU)
… and even between samples from the same population.
Other European samples
Sample-to-sample LD differences make tagSNP selection problematic
groups of SNPs that are in LD in the HapMap reference samples may not be in a future set of clinical samples…
… and tags that were selected based on LD in the HapMap may no longer work (i.e. represent the SNPs they were supposed to) in the clinical samples…
… possibly resulting in missed disease associations.
Natural marker allele frequency differences confound association testing
reference samples: ~ 120 chromosomes
cases: 500-2,000 chromosomes
controls: 500-2,000 chromosomes
• the HapMap reference samples are much smaller than clinical sample sizes
• it is difficult to assess accurately both the marker allele frequency (single-SNP or haplotype frequency) in the clinical samples and the naturally occurring variation in marker allele frequency differences between cases and controls
[Figure axes: AF(cases), AF(controls)]
• it is therefore difficult to assess the statistical significance of candidate associations
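A back-of-the-envelope view of the sample size problem: under simple binomial sampling, the standard error of an allele frequency estimate shrinks with the square root of the number of chromosomes. The sketch below compares a reference panel of about 120 chromosomes with a clinical sample of 2,000; the allele frequency of 0.2 is a hypothetical value, and binomial sampling alone ignores the population-level differences discussed above.

```python
import math


def af_standard_error(freq, n_chromosomes):
    """Binomial standard error of an allele frequency estimate."""
    return math.sqrt(freq * (1 - freq) / n_chromosomes)


for n in (120, 500, 2000):
    print(n, round(af_standard_error(0.2, n), 4))
```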
We are developing technology for assessing sample-to-sample variance in silico
[Figure labels: reference, cases, controls; tag evaluation, tag selection, association testing]
we estimate LD differences between HapMap and future clinical samples…
“cases”
“controls”
…by generating “computational” samples representing future clinical samples…
… and use computational “proxy” samples for tabulating LD and allele frequency differences.
Two methods of computational sample generation
[Figure labels: HapMap; “cases”, “controls”]
Method 1. “Data-relevant Coalescent”. This algorithm uses a population genetic model to connect mutations in the HapMap reference to mutations in future clinical samples. Full model but computationally slow.
Method 2. The PAC method (product of approximate conditionals, Li & Stephens). This method constructs “new” samples as mosaics of existing haplotypes, mimicking the effects of recombination. An approximation but fast.
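The mosaic idea behind Method 2 can be sketched as follows: a new haplotype copies alleles from one reference haplotype, occasionally switches to another (mimicking recombination), and occasionally miscopies an allele (mimicking mutation). This is a deliberately simplified illustration; in the Li & Stephens model the switch probability depends on genetic distance and new haplotypes can also copy from previously generated ones. The function names and the parameter values are assumptions.

```python
import random


def mosaic_haplotype(reference, switch_prob=0.05, mutation_prob=0.001, rng=None):
    """Build one new haplotype as a mosaic of reference haplotypes.

    reference: list of haplotypes, each a list of 0/1 alleles of equal length.
    switch_prob: per-site probability of switching the copied haplotype.
    mutation_prob: per-site probability of flipping the copied allele.
    """
    rng = rng or random.Random()
    source = rng.randrange(len(reference))
    new = []
    for site in range(len(reference[0])):
        if site > 0 and rng.random() < switch_prob:
            source = rng.randrange(len(reference))   # "recombination"
        allele = reference[source][site]
        if rng.random() < mutation_prob:
            allele = 1 - allele                      # "mutation"
        new.append(allele)
    return new


def computational_sample(reference, size, **kwargs):
    """Generate `size` mosaic haplotypes from the same reference panel."""
    return [mosaic_haplotype(reference, **kwargs) for _ in range(size)]


# Hypothetical four-site reference panel of three haplotypes.
panel = [[0, 0, 1, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
print(computational_sample(panel, 4, rng=random.Random(7)))
```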
Computational samples
HapMap (CEU)
Computational (PAC)
Computational (Coalescent)
Extra genotypes (Estonia)
MARKER EVALUATION with computational samples
test if markers selected from the HapMap continue to “tag” other SNPs in their original LD group
MARKER SELECTION with computational samples
selecting tags in multiple consecutive sets of computational samples and choosing for the association study the best-performing tags
ASSOCIATION TESTING with computational samples
[Figure: multiple consecutive pairs of computational “cases” and “controls” samples]
tabulating ΔAF in “cases” vs. “controls” across multiple consecutive pairs of computational samples provides the natural range of allele frequency differences, against which we can decide whether a candidate association is statistically significant
[Figure axes: AF(cases), AF(controls)]
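One way to turn the tabulated ΔAF values into a significance statement is an empirical p-value: the fraction of computational case/control pairs whose allele frequency difference is at least as large as the one observed in the real study. The sketch below assumes this empirical approach and an add-one correction; both are illustrative choices, and the function names are hypothetical.

```python
def allele_frequency(sample, site):
    """Frequency of allele 1 at `site` in a list of 0/1 haplotypes."""
    return sum(h[site] for h in sample) / len(sample)


def delta_af(cases, controls, site):
    """Allele frequency difference between two samples at one site."""
    return allele_frequency(cases, site) - allele_frequency(controls, site)


def empirical_p_value(observed_diff, null_diffs):
    """Share of null differences at least as extreme as the observed one.

    null_diffs: ΔAF values from computational "cases"/"controls" pairs.
    The add-one correction keeps the estimate away from exactly zero.
    """
    extreme = sum(abs(d) >= abs(observed_diff) for d in null_diffs)
    return (extreme + 1) / (len(null_diffs) + 1)


# Hypothetical null differences from four computational sample pairs.
print(empirical_p_value(0.05, [0.01, -0.02, 0.06, 0.0]))
```

In practice many more computational pairs would be needed than in this toy call, since the smallest attainable p-value is bounded by the number of pairs.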
Do computational samples represent future clinical genotypes realistically?
we quantify the quality of representation by the correlation of LD between corresponding pairs of markers (i.e. if two markers were in strong LD in one set of samples, are they ALSO in strong LD in the other set?)
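This comparison can be sketched by computing r2 for every marker pair in each of two sample sets and then taking the Pearson correlation of the two lists of r2 values. Pearson correlation and the function names are assumptions for illustration; the source does not state which correlation measure was used.

```python
import math
from itertools import combinations


def r_squared(a, b):
    """r2 between two biallelic SNPs, given 0/1 alleles per haplotype."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x & y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0
    return (pab - pa * pb) ** 2 / denom


def pairwise_ld(haplotypes):
    """r2 for all marker pairs, in a fixed (i < j) order."""
    n_sites = len(haplotypes[0])
    columns = [[h[s] for h in haplotypes] for s in range(n_sites)]
    return [r_squared(columns[i], columns[j])
            for i, j in combinations(range(n_sites), 2)]


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ld_correlation(sample_a, sample_b):
    """Correlation of pairwise LD between two samples typed at the same markers."""
    return pearson(pairwise_ld(sample_a), pairwise_ld(sample_b))


# Hypothetical three-marker haplotypes; a sample compared with itself
# should give a correlation of 1.
haps = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
print(ld_correlation(haps, haps))
```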
LD difference -- comparison to extra experimental genotypes
0.949 +/- 0.013
0.978 +/- 0.010
0.963 +/- 0.014
• we have analyzed two extra genotype sets collected at the HapMap SNPs in three genome regions, from our clinical collaborators (Prof. Thomas Hudson, McGill; Prof. Stanley Nelson, UCLA)
AF difference -- comparisons to extra experimental genotypes
[Figure: AF Diff, Comp Samples vs. AF Diff, Estonian Data; both axes 0 to 0.06]
• according to our limited initial test, computational samples can represent future clinical samples well for estimating sample-to-sample variability
A new marker selection and association testing software tool
• data visualization
• representative computational sample generation
• advanced tag selection functionality
• gene annotations overlaid on physical map of SNPs (i.e. the human genome sequence)
• advanced association testing functionality
• multi-level user customization, including user conveniences, e.g. tag prioritization based on SNP assay score
[Figure labels: reference samples, representative computational samples, gene annotations, tags, LD views, association statistics; plot of 5-site computationally generated LD (r2) vs. LA LD (r2), both axes 0 to 1, by marker separation: 1-4, 5-9, 10-17, 18-26]
User community
• companies designing new generations of whole-genome or specialized SNP arrays
• researchers comparing alternative platforms (e.g. Affymetrix 500K and Illumina 300K) to find the one most suitable for their study
• clinical researchers designing candidate gene studies
• researchers designing second-stage follow-up studies in specific genome regions after an initial genome scan (our methods can take advantage of first-stage data already available in the clinical samples)
• the association testing features should be useful for analysts regardless of study design
Base calling and SNP detection in sequence traces including 454 data
Aaron Quinlan
Base calling and SNP detection in sequence traces including 454 “pyrogram” data
• PolyBayes was originally written to find SNPs in clonal sequences in large SNP discovery projects
• medical re-sequencing projects require the detection of SNPs in heterozygous diploid sequence traces
[Figure: sequence diagram with 5’ and 3’ strand labels]
Heterozygote detection in sequence traces
[Figure: individual traces, Ind. 1 to Ind. 4]
• we use a machine learning method (Support Vector Machine, SVM) to recognize characteristic features of homozygous vs. heterozygous positions
Aggregating information from multiple traces
forward/reverse sequences from same individual:
P(GT | Read) = .98
P(GT | Read) = .87
resultant genotype call: P(GT) = .993
Discovery vs. genotyping
discovery: “uninformed prior”, Prior(CT) = .001
• don’t know if site is polymorphic
• have to test each site
genotyping: “informed prior”, Prior(CT) = 0.34
1. site is known to be polymorphic
2. allele frequency estimate
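The effect of the prior can be illustrated with a small Bayesian update: start from the prior odds of a heterozygous genotype and multiply in one likelihood ratio per read. The sketch assumes the reads are conditionally independent and that each read is summarized by a likelihood ratio; the function name and the example ratios are hypothetical, and the code does not attempt to reproduce the probabilities shown on the slides.

```python
def heterozygote_posterior(likelihood_ratios, prior_het):
    """Posterior probability of a heterozygous genotype.

    likelihood_ratios: one value per read,
        P(read | heterozygous) / P(read | homozygous).
    prior_het: prior probability of a heterozygote at this site.
    """
    odds = prior_het / (1 - prior_het)
    for ratio in likelihood_ratios:
        odds *= ratio            # reads treated as independent evidence
    return odds / (1 + odds)


# Same hypothetical evidence from a forward and a reverse read,
# evaluated under the discovery prior and the genotyping prior.
evidence = [100.0, 50.0]
print(heterozygote_posterior(evidence, 0.001))
print(heterozygote_posterior(evidence, 0.34))
```

With identical trace evidence, the informed genotyping prior yields a more confident heterozygote call than the uninformed discovery prior, which is why discovery needs stronger evidence at each site.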
Our heterozygote detection works better than other methods
Performance Measured on ~1000 Alignments covering 500Kb Region of Chromosome 4
Method        Fraction of Data Analyzed   False Discovery Rate   Fraction of Heterozygotes Found   Fraction of Homozygotes Found
PolyBayes+    85.1                        0.0375                 86.60%                            97.8%
Polyphred 5   86.17                       0.0389                 83.16%                            82.63%
Base calling for “pyrograms”
From NCBI Trace Archive
• we have access to standardized data formats
• readout in pyrosequencing is based on instantaneous detection of base incorporation… multiple bases of the same type are incorporated in the same cycle
26 55 24 15 10 7 5 4 2 1 0 0
TCAGGGGGGGGGGGACGACAAGGCGTGGGGA
• the identity of consecutive bases is very reliable, but the length of mono-nucleotide runs (base number) is difficult to quantify (great for re-sequencing, but problematic for de novo sequencing)
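A naive pyrogram base caller shows where the difficulty with mono-nucleotide runs comes from: each flow reports a signal roughly proportional to the number of identical bases incorporated, and the run length is obtained by rounding. The sketch assumes signals already normalized so that 1.0 corresponds to one base and a cyclic T, A, C, G flow order; both are assumptions, and the signal values are made up.

```python
def call_flowgram(signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence.

    signals: one intensity per nucleotide flow, with 1.0 ~ one base.
    flow_order: nucleotides flowed cyclically (assumed order).
    """
    bases = []
    for i, signal in enumerate(signals):
        run_length = int(signal + 0.5)      # round to the nearest integer
        bases.append(flow_order[i % len(flow_order)] * run_length)
    return "".join(bases)


# Short runs are unambiguous, but a signal of 10.6 could plausibly be
# 10 or 11 bases: the identity (G) is certain, the count is not.
flows = [1.02, 0.05, 0.97, 2.1, 0.0, 0.1, 0.0, 10.6]
print(call_flowgram(flows))
```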
SNP genotyping with pyrosequencers
Nordfors et al. Human Mutation 19:395-401 (2002)
we are in the process of identifying discriminating pyrogram features to use in our machine-learning methods to recognize polymorphic positions within traces
Somatic mutation detection
Michael Stromberg
Somatic mutations
© Brian Stavely, Memorial University of Newfoundland
the detection of somatic mutations, and their distinction from inherited polymorphism, is important to separate pre-disposing variants from mutations that occur during disease progression e.g. in cancer
1. detect the mutations
2. classify whether somatic or inherited
Detecting somatic mutations with comparative data
• based on comparison of cancer and normal tissue from the same individual
• cancer tissue is often highly heterogeneous, and the somatic mutant allele may be present at low allele frequency
Detecting somatic mutations with subtraction
• if normal tissue samples are not available, we detect SNPs in cancer tissue against e.g. the human genome reference sequence
• subtract apparent mutations that are present in sequence variation databases
• search for evidence that these mutations are genetic
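The subtraction step amounts to a set difference between variants called in the cancer tissue and variants already catalogued as inherited polymorphisms. A minimal sketch, assuming each variant is represented as a (chromosome, position, reference allele, alternate allele) tuple; the representation, function name, and example variants are hypothetical.

```python
def candidate_somatic(tumor_variants, known_polymorphisms):
    """Variants seen in the tumor but absent from variation databases."""
    return sorted(set(tumor_variants) - set(known_polymorphisms))


# Hypothetical calls against the reference sequence and a hypothetical
# list of database polymorphisms.
tumor = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr7", 42, "G", "A")]
database = [("chr2", 500, "C", "T")]
print(candidate_somatic(tumor, database))
```

The remaining candidates still need supporting evidence, since a variant missing from the databases may simply be a rare inherited polymorphism.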
Detecting somatic mutations with subtraction
• we have applied our methods for somatic mutation detection in murine mitochondrial sequences
heteroplasmy homoplasmy
• we will be applying our methods to human nuclear DNA from our collaborators
Using new haplotype resources to connect genotype and clinical outcome in pharmaco-genetic systems
• the HapMap was designed as a tool to detect high-frequency (common) phenotypic (e.g. disease-causing) alleles
• important drug metabolizing enzymes are relatively few in number, well studied, and at known genome locations; many associated phenotypes are well described
• many functional alleles are known, and of high frequency (common)
• multi-SNP alleles are highly predictive of metabolic phenotype
• clinical phenotype (adverse drug reaction) less predictable
• ideal candidate for applying haplotype resources
Multi-marker haplotypes as accurate markers for ADRs?
functional allele (known metabolic polymorphism)
genetic marker (haplotype) in genome regions of drug metabolizing enzyme (DME) genes
molecular phenotype (drug concentration measured in blood plasma)
clinical endpoint (adverse drug reaction)
computational prediction based on haplotype structure
Resources
• specifics of enzyme-drug interactions
• LD and haplotype structure in the HapMap reference samples, based on high-density SNP map
• functional alleles
• existing DME P genotyping chips
Evolutionary questions
• mutation age?
• mutations single-origin or recurrent?
• geographic origin of mutations?
• analysis based on complete local variation structure and haplotype background of functional mutations
• specifics of the selection process that led to specific functional alleles?
Proposed steps of analysis
• haplotypes vs. metabolic phenotype?
• complete polymorphic structure?
• ethnicity?
• additional functional SNPs?
• haplotypes vs. functional alleles?
[Figure labels: haplotype block?; haplotype; functional allele (genotype); metabolic phenotype; clinical phenotype (ADR)]
• haplotypes vs. ADR phenotype?