Post on 30-Dec-2015
Genome-wide association studies
Usman Roshan
SNP
• Single nucleotide polymorphism
• Specific position and specific chromosome
SNP genotype
Suppose this is the DNA on chromosome 1 starting from position 1.
There is a SNP C/G at position 5, C/T at position 14, and G/T at position 21. This person is heterozygous at the first SNP and homozygous at the other two.
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
SNP genotype representation
The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
is represented as
CG CC GG …
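The representation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `genotypes` and the variable names are illustrative and not part of the slides. It assumes 1-based SNP positions and two aligned copies of equal length.

```python
# Two copies of the same chromosome region, as in the example above.
f_copy = "AACACAATTAGTACAATTATGAC"
m_copy = "AACAGAATTAGTACAATTATGAC"

# 1-based SNP positions taken from the example.
snp_positions = [5, 14, 21]

def genotypes(copy_a, copy_b, positions):
    """Return one two-letter genotype per SNP position (1-based)."""
    result = []
    for pos in positions:
        # Sort the two alleles so that a heterozygous site is always
        # written the same way (CG rather than GC).
        pair = sorted([copy_a[pos - 1], copy_b[pos - 1]])
        result.append("".join(pair))
    return result

print(genotypes(f_copy, m_copy, snp_positions))  # ['CG', 'CC', 'GG']
```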
SNP genotype
• For several individuals
     A/T  C/T  G/T  …
H0:  AA   TT   GG   …
H1:  AT   CC   GT   …
H2:  AA   CT   GT   …
.
.
.
SNP genotype encoding
• If a SNP is A/B (alleles in alphabetical order), count the number of times B appears in the genotype.
• Previous example becomes

     A/T  C/T  G/T  …         A/T  C/T  G/T  …
H0:  AA   TT   GG   …          0    2    0   …
H1:  AT   CC   GT   …    =>    1    0    1   …
H2:  AA   CT   GT   …          0    1    1   …
Now we have data in numerical format
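The encoding rule can be sketched in Python as follows. The helper name `encode` and the dictionary layout are assumptions made for illustration. The genotypes are those of H0, H1 and H2 from the example.

```python
def encode(genotype, snp):
    """Count copies of allele B in a genotype for a SNP written 'A/B'."""
    allele_a, allele_b = snp.split("/")
    for allele in genotype:
        if allele not in (allele_a, allele_b):
            raise ValueError(f"allele {allele!r} does not belong to SNP {snp}")
    return genotype.count(allele_b)

snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

# Encode every genotype of every individual against its SNP.
encoded = {
    name: [encode(g, s) for g, s in zip(gts, snps)]
    for name, gts in individuals.items()
}
print(encoded)  # {'H0': [0, 2, 0], 'H1': [1, 0, 1], 'H2': [0, 1, 1]}
```

Because the function counts occurrences of B, the order of the two letters in a heterozygous genotype (AT or TA) does not matter.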
Genome wide association studies (GWAS)
• Aim to identify which regions (or SNPs) in the genome are associated with a disease or a certain phenotype.
• Design:
– Identify population structure
– Select case subjects (those with disease)
– Select control subjects (healthy)
– Genotype a million SNPs for each subject
– Determine which SNPs are associated with the disease.
Example GWAS
           A/T  C/G  A/G  …
Case 1     AA   CC   AA
Case 2     AT   CG   AA
Case 3     AA   CG   AA
Control 1  TT   GG   GG
Control 2  TT   CC   GG
Control 3  TA   CG   GG
Encoded data
       A/T  C/G  A/G         A/T  C/G  A/G
Case1  AA   CC   AA           0    0    0
Case2  AT   CG   AA           1    1    0
Case3  AA   CG   AA     =>    0    1    0
Con1   TT   GG   GG           2    2    2
Con2   TT   CC   GG           2    0    2
Con3   TA   CG   GG           1    1    2
Ranking SNPs
       SNP1 SNP2 SNP3        SNP1 SNP2 SNP3
       A/T  C/G  A/G         A/T  C/G  A/G
Case1  AA   CC   AA           0    0    0
Case2  AT   CG   AA           1    1    0
Case3  AA   CG   AA     =>    0    1    0
Con1   TT   GG   GG           2    2    2
Con2   TT   CC   GG           2    0    2
Con3   TA   CG   GG           1    1    2
A good ranking strategy would produce SNP3, SNP1, SNP2: SNP3 separates cases from controls perfectly, SNP1 nearly so, and SNP2 hardly at all.
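To make the idea of a ranking concrete, here is a small sketch that scores each SNP by the absolute difference between the mean encoded value in cases and in controls. This score is only a simple stand-in chosen for illustration. It is not the test used in the slides, which rank by chi-square p-values.

```python
# Encoded genotypes from the table above (columns: SNP1, SNP2, SNP3).
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]
snp_names = ["SNP1", "SNP2", "SNP3"]

def column_mean(rows, j):
    """Mean encoded value of SNP column j over a group of subjects."""
    return sum(row[j] for row in rows) / len(rows)

# Illustrative score: |mean in cases - mean in controls| per SNP.
scores = {
    name: abs(column_mean(cases, j) - column_mean(controls, j))
    for j, name in enumerate(snp_names)
}

# Highest score first.
ranking = sorted(snp_names, key=lambda name: scores[name], reverse=True)
print(ranking)  # ['SNP3', 'SNP1', 'SNP2']
```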
Chi-square test
• Gold standard is the univariate non-parametric chi-square test with two degrees of freedom.
• Search for SNPs that deviate from the independence assumption.
• Rank SNPs by p-values
Statistical test of association (P-values)
• P-value = probability of the observed data (or more extreme data) under the null hypothesis
• Example:
– Suppose we are given a series of coin tosses
– We feel that a biased coin produced the tosses
– We can ask the following question: what is the probability that a fair coin would produce tosses at least this extreme?
– If this probability is very small then the observed tosses are unlikely to have come from a fair coin.
– In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin
Binomial distribution
• Bernoulli random variable:
– Two outcomes: success or failure
– Example: coin toss
• Binomial random variable:
– Number of successes in a series of independent Bernoulli trials
• Example:
– Probability of heads = 0.5
– Given four coin tosses what is the probability of three heads?
– Possible outcomes: HHHT, HHTH, HTHH, THHH
– Each outcome has probability = 0.5^4
– Total probability = 4 * 0.5^4 = 0.25
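The four-toss example can be checked by brute-force enumeration. The sketch below lists all 16 equally likely sequences and keeps those with exactly three heads. The variable names are illustrative.

```python
from itertools import product

# All 2^4 sequences of four tosses, e.g. 'HHTH'.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=4)]

# Keep the sequences with exactly three heads.
three_heads = [o for o in outcomes if o.count("H") == 3]

# Each sequence has probability 0.5^4 under a fair coin.
probability = len(three_heads) * 0.5 ** 4

print(three_heads)   # ['HHHT', 'HHTH', 'HTHH', 'THHH']
print(probability)   # 0.25
```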
Binomial distribution
• Bernoulli trial probability of success=p, probability of failure = 1-p
• Given n independent Bernoulli trials what is the probability of k successes?
• Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html
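For reference, the answer to the question above is the standard binomial probability mass function, written here with X for the number of successes. The second line checks it against the four-toss example (n = 4, k = 3, p = 0.5).

```latex
\[
P(X = k) = \binom{n}{k}\, p^{k} (1-p)^{n-k}, \qquad k = 0, 1, \dots, n
\]
\[
P(X = 3) = \binom{4}{3}\,(0.5)^{3}(0.5)^{1} = 4 \cdot 0.5^{4} = 0.25
\]
```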
Hypothesis testing under Binomial hypothesis
• Null hypothesis: fair coin (probability of heads = probability of tails = 0.5)
• Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses)
• P-value under null hypothesis = probability that #heads >= 15
• This probability is 0.021
• Since it is below 0.05 we can reject the null hypothesis
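The p-value quoted above can be recomputed by summing the binomial probabilities for 15 or more heads out of 20. This is a minimal sketch using only the Python standard library. The function name `upper_tail` is an illustrative choice.

```python
from math import comb

tosses = "HHHHTHTHHHHHHHTHTHTH"
n = len(tosses)            # number of tosses
heads = tosses.count("H")  # observed number of heads

def upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p_value = upper_tail(n, heads)
print(n, heads, round(p_value, 3))  # 20 15 0.021
```

This is a one-sided test: it only counts outcomes with at least as many heads as observed.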
Chi-square statistic
• Define four random variables Xi each of which is binomially distributed Xi ~ B(n, pi) where n=c1+c2+c3+c4 is the total number of subjects and pi is the probability of success of Xi.
• Each variable Xi corresponds to one cell of the table below: the count of risk or wildtype alleles observed among case or control subjects.
• The expected value E(Xi) = npi since each Xi is binomial.
          #Allele1 (risk)   #Allele2 (wildtype)
Case      c1 (X1)           c2 (X2)
Control   c3 (X3)           c4 (X4)
Chi-square statistic
Define the statistic:

\[ \chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i} \]

where
c_i = observed frequency for ith outcome
e_i = expected frequency for ith outcome
n = total number of outcomes (here n counts outcomes, not subjects)

When the expected frequencies are fixed in advance, the probability distribution of this statistic is approximately the chi-square distribution with n-1 degrees of freedom.
Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf
Great. But how do we use this to get a SNP p-value?
Null hypothesis for case control contingency table
• We have two random variables:
– D: disease status
– G: allele type.
• Null hypothesis: the two variables are independent of each other (unrelated)
• Under independence
– P(D,G) = P(D)P(G)
– P(D=case) = (c1+c2)/n
– P(G=risk) = (c1+c3)/n
• Expected values
– E(X1) = P(D=case)P(G=risk)n
• We can calculate the chi-square statistic for a given SNP and its p-value: the probability of seeing a statistic at least this large if the SNP were independent of disease status. Because the expected values of this 2x2 table are estimated from its row and column totals, the statistic is compared against the chi-square distribution with 1 degree of freedom.
• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.
          #Allele1 (risk)   #Allele2 (wildtype)
Case      c1                c2
Control   c3                c4
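The same reasoning that gives E(X1) gives the expected value of every cell. Writing e_i for E(Xi) under independence, with n = c1 + c2 + c3 + c4, each expected value is the row total times the column total divided by n. The last line plugs these into the chi-square statistic.

```latex
\[
e_1 = \frac{(c_1 + c_2)(c_1 + c_3)}{n}, \qquad
e_2 = \frac{(c_1 + c_2)(c_2 + c_4)}{n},
\]
\[
e_3 = \frac{(c_3 + c_4)(c_1 + c_3)}{n}, \qquad
e_4 = \frac{(c_3 + c_4)(c_2 + c_4)}{n},
\]
\[
\chi^2 = \sum_{i=1}^{4} \frac{(c_i - e_i)^2}{e_i}
\]
```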
Chi-square statistic exercise
          #Allele1   #Allele2
Case      15         35
Control   2          48
• Compute expected values and chi-square statistic
• Compute chi-square p-value by referring to chi-square distribution
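One way to work the exercise is sketched below. It assumes the flattened table reads Case = 15 and 35, Control = 2 and 48 for #Allele1 and #Allele2 (both rows then sum to 50). It also assumes the statistic of a 2x2 table is compared against the chi-square distribution with 1 degree of freedom, whose upper tail equals erfc(sqrt(x/2)). The function name `chi_square_2x2` is illustrative.

```python
from math import erfc, sqrt

def chi_square_2x2(c1, c2, c3, c4):
    """Expected counts, chi-square statistic and p-value for a 2x2 table.

    Layout: c1, c2 = case row; c3, c4 = control row.
    """
    n = c1 + c2 + c3 + c4
    row_totals = (c1 + c2, c3 + c4)
    col_totals = (c1 + c3, c2 + c4)
    observed = (c1, c2, c3, c4)
    # Under independence: expected = row total * column total / n.
    expected = tuple(r * c / n for r in row_totals for c in col_totals)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return expected, statistic, p_value

expected, statistic, p_value = chi_square_2x2(15, 35, 2, 48)
print(expected)             # (8.5, 41.5, 8.5, 41.5)
print(round(statistic, 2))  # 11.98
print(p_value)              # well below 0.05
```

The closed form for the tail only holds for 1 degree of freedom. Larger tables would need a general chi-square distribution function.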
GWAS problems and applications
• Detect causal SNPs
– Chi-square
– Multivariate approaches
• Predict case and control from genotypes
– Machine learning algorithms
– A simple algorithm based on Euclidean distances
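The slides do not say which distance-based algorithm is meant. One plausible reading, shown here purely as an illustration, is a nearest-mean classifier: assign a new subject to whichever group (case or control) has the closer mean genotype vector in Euclidean distance. The training data are the encoded genotypes from the example GWAS. The new subject and all function names are hypothetical.

```python
from math import dist

# Encoded genotypes from the example GWAS (columns: SNP1, SNP2, SNP3).
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]

def centroid(rows):
    """Mean genotype vector of a group of subjects."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def predict(genotype, case_centroid, control_centroid):
    """Label a subject by the nearer group mean (Euclidean distance)."""
    to_case = dist(genotype, case_centroid)
    to_control = dist(genotype, control_centroid)
    return "case" if to_case < to_control else "control"

case_centroid = centroid(cases)
control_centroid = centroid(controls)

new_subject = [0, 1, 1]  # hypothetical encoded genotype
print(predict(new_subject, case_centroid, control_centroid))  # case
```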