Genome-wide association studies

Usman Roshan

• Single nucleotide polymorphism
• Specific position and specific chromosome

SNP genotype

Suppose this is the DNA on chromosome 1 starting from position 1.

There is a SNP C/G at position 5, C/T at position 14, and G/T at position 21. This person is heterozygous at the first SNP and homozygous at the other two.

F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC

SNP genotype representation

The example

F: AACACAATTAGTACAATTATGAC

M: AACAGAATTAGTACAATTATGAC

is represented as

CG CC GG …

SNP genotype

• For several individuals

A/T C/T G/T …

H0: AA TT GG …

H1: AT CC GT …

H2: AA CT GT …

SNP genotype encoding

• If SNP is A/B (alphabetically ordered) then count number of times we see B.

• Previous example becomes

        A/T  C/T  G/T  …         A/T  C/T  G/T  …
H0:     AA   TT   GG   …          0    2    0   …
H1:     AT   CC   GT   …    =>    1    0    1   …
H2:     AA   CT   GT   …          0    1    1   …

Now we have data in numerical format
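The encoding rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `encode_genotype` and the dictionary layout are assumptions made for the example.

```python
def encode_genotype(genotype, alleles):
    """Count copies of the alphabetically later allele (B in an A/B SNP)."""
    a, b = sorted(alleles.split("/"))  # e.g. "A/T" -> ("A", "T")
    if any(base not in (a, b) for base in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {alleles!r}")
    return genotype.count(b)


# The three SNPs and three individuals from the slide.
snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

encoded = {
    name: [encode_genotype(g, s) for g, s in zip(genotypes, snps)]
    for name, genotypes in individuals.items()
}

for name, row in encoded.items():
    print(name, row)
# H0 [0, 2, 0]
# H1 [1, 0, 1]
# H2 [0, 1, 1]
```

Because the count ignores allele order, a genotype written as TA encodes the same way as AT.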

Genome wide association studies (GWAS)

• Aim to identify which regions (or SNPs) in the genome are associated with a disease or a certain phenotype.

• Design:
– Identify population structure
– Select case subjects (those with disease)
– Select control subjects (healthy)
– Genotype a million SNPs for each subject
– Determine which SNP is associated.

Example GWAS

A/T C/G A/G …

Case 1 AA CC AA

Case 2 AT CG AA

Case 3 AA CG AA

Control 1 TT GG GG

Control 2 TT CC GG

Control 3 TA CG GG

Encoded data

        A/T  C/G  A/G         A/T  C/G  A/G
Case1   AA   CC   AA           0    0    0
Case2   AT   CG   AA           1    1    0
Case3   AA   CG   AA     =>    0    1    0
Con1    TT   GG   GG           2    2    2
Con2    TT   CC   GG           2    0    2
Con3    TA   CG   GG           1    1    2

Ranking SNPs

        SNP1 SNP2 SNP3        SNP1 SNP2 SNP3
        A/T  C/G  A/G         A/T  C/G  A/G
Case1   AA   CC   AA           0    0    0
Case2   AT   CG   AA           1    1    0
Case3   AA   CG   AA     =>    0    1    0
Con1    TT   GG   GG           2    2    2
Con2    TT   CC   GG           2    0    2
Con3    TA   CG   GG           1    1    2

A good ranking strategy would produce SNP3, SNP1, SNP2
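To see why SNP3, SNP1, SNP2 is the intuitive order, one can score each SNP by how far apart the average encoded genotype is in cases and controls. This score is only an illustrative stand-in chosen for this sketch; it is not the chi-square test used later, and the variable names are assumptions.

```python
snp_names = ["SNP1", "SNP2", "SNP3"]

# Encoded genotypes from the slide: one row per subject, one column per SNP.
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]


def column_means(rows):
    """Mean encoded genotype for each SNP (column)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


case_means = column_means(cases)
control_means = column_means(controls)

# Illustrative score: absolute difference between case and control means.
scores = {
    name: abs(c - k)
    for name, c, k in zip(snp_names, case_means, control_means)
}

ranking = sorted(snp_names, key=lambda name: scores[name], reverse=True)
print(ranking)  # ['SNP3', 'SNP1', 'SNP2']
```

SNP3 separates the two groups perfectly (all cases 0, all controls 2), SNP1 separates them partially, and SNP2 hardly at all.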

Chi-square test

• Gold standard is the univariate non-parametric chi-square test with two degrees of freedom.

• Search for SNPs that deviate from the independence assumption.

• Rank SNPs by p-values

Statistical test of association (P-values)

• P-value = probability of the observed data (or something more extreme) under the null hypothesis

• Example:
– Suppose we are given a series of coin tosses
– We feel that a biased coin produced the tosses
– We can ask the following question: what is the probability that a fair coin would produce tosses like these (or more extreme)?
– If this probability is very small then we can say there is a small chance that a fair coin would produce the observed tosses.
– In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin

Binomial distribution

• Bernoulli random variable:
– Two outcomes: success or failure
– Example: coin toss

• Binomial random variable:
– Number of successes in a series of independent Bernoulli trials

• Example:
– Probability of heads = 0.5
– Given four coin tosses what is the probability of three heads?
– Possible outcomes: HHHT, HHTH, HTHH, THHH
– Each outcome has probability = 0.5^4
– Total probability = 4 * 0.5^4 = 0.25

Binomial distribution

• Bernoulli trial probability of success=p, probability of failure = 1-p

• Given n independent Bernoulli trials what is the probability of k successes?

• Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html
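The answer to the question above is the standard binomial probability P(X = k) = C(n, k) p^k (1-p)^(n-k), where C(n, k) counts the orderings with exactly k successes. A minimal sketch, with `binomial_pmf` as a name chosen for this example:

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Three heads in four tosses of a fair coin: 4 orderings, each with probability 0.5^4.
print(binomial_pmf(3, 4, 0.5))  # 0.25
```

This reproduces the coin example: four possible orderings, each with probability 0.5^4.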

Hypothesis testing under Binomial hypothesis

• Null hypothesis: fair coin (probability of heads = probability of tails = 0.5)

• Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses)
• P-value under null hypothesis = probability that #heads >= 15
• This probability is 0.021
• Since it is below 0.05 we can reject the null hypothesis
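The value 0.021 can be checked by summing binomial probabilities for 15 through 20 heads under a fair coin. The sketch below assumes a one-sided test (only an excess of heads counts as extreme), which is how the slide states it; `upper_tail` is a name chosen for this example.

```python
from math import comb

tosses = "HHHHTHTHHHHHHHTHTHTH"
n = len(tosses)            # number of tosses
heads = tosses.count("H")  # observed number of heads


def upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


p_value = upper_tail(heads, n)
print(n, heads, round(p_value, 3))  # 20 15 0.021
```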

Chi-square statistic

• Define four random variables Xi each of which is binomially distributed Xi ~ B(n, pi) where n=c1+c2+c3+c4 is the total number of subjects and pi is the probability of success of Xi.

• Each variable Xi corresponds to one cell of the table below: a combination of disease status (case or control) and allele type (risk or wildtype).

• The expected value E(Xi) = npi since each Xi is binomial.

            #Allele1 (risk)   #Allele2 (wildtype)
Case        c1 (X1)           c2 (X2)
Control     c3 (X3)           c4 (X4)

Chi-square statistic

Define the statistic:

$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$

ci = observed frequency for ith outcome
ei = expected frequency for ith outcome
n = total outcomes

The probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf

Great. But how do we use this to get a SNP p-value?

Null hypothesis for case control contingency table

• We have two random variables:
– D: disease status
– G: allele type

• Null hypothesis: the two variables are independent of each other (unrelated)

• Under independence
– P(D,G) = P(D)P(G)
– P(D=case) = (c1+c2)/n
– P(G=risk) = (c1+c3)/n

• Expected values
– E(X1) = P(D=case)P(G=risk)n

• We can calculate the chi-square statistic for a given SNP and, from its p-value, the probability of seeing a statistic at least this large if the SNP were independent of disease status.

• SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.

            #Allele1 (risk)   #Allele2 (wildtype)
Case        c1                c2
Control     c3                c4

Chi-square statistic exercise

            #Allele1   #Allele2
Case        15         35
Control     2          48

• Compute expected values and chi-square statistic
• Compute chi-square p-value by referring to chi-square distribution
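One way to work the exercise is sketched below. It reads the table as Case = 15 risk and 35 wildtype alleles, Control = 2 risk and 48 wildtype alleles, which is an assumption about the flattened slide layout. It also assumes the usual one degree of freedom for a test of independence on a 2×2 table, which the slides do not state explicitly; with one degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)). The function name `chi_square_2x2` is chosen for this example.

```python
from math import erfc, sqrt


def chi_square_2x2(c1, c2, c3, c4):
    """Pearson chi-square test of independence for the table
    [[c1, c2], [c3, c4]] (rows: case/control, columns: risk/wildtype)."""
    n = c1 + c2 + c3 + c4
    row_totals = (c1 + c2, c3 + c4)
    col_totals = (c1 + c3, c2 + c4)
    observed = ((c1, c2), (c3, c4))

    # Under independence, E(cell) = P(row) * P(column) * n = row * col / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    stat = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )

    # Assumption: 1 degree of freedom, so P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(stat / 2))
    return expected, stat, p_value


expected, stat, p_value = chi_square_2x2(15, 35, 2, 48)
print(expected)        # [[8.5, 41.5], [8.5, 41.5]]
print(round(stat, 3))  # 11.977
print(p_value)
```

Under these assumptions the p-value comes out well below 0.05, so this SNP would rank as strongly associated with disease status.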

GWAS problems and applications

• Detect causal SNPs
– Chi-square
– Multivariate approaches

• Predict case and control from genotypes
– Machine learning algorithms
– A simple algorithm based on Euclidean distances
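The slides do not say which distance-based algorithm is meant. One possible reading is a nearest-neighbour rule: label a new subject with the class of the training subject whose encoded genotype is closest in Euclidean distance. The sketch below uses the encoded example data from earlier; the two query genotypes and the function name `predict` are hypothetical.

```python
from math import dist

# Encoded genotypes (SNP1, SNP2, SNP3) and labels from the example GWAS table.
training = [
    ([0, 0, 0], "case"),
    ([1, 1, 0], "case"),
    ([0, 1, 0], "case"),
    ([2, 2, 2], "control"),
    ([2, 0, 2], "control"),
    ([1, 1, 2], "control"),
]


def predict(genotype):
    """Return the label of the training subject nearest in Euclidean distance."""
    _, label = min(training, key=lambda item: dist(item[0], genotype))
    return label


# Hypothetical new subjects, not taken from the slides.
print(predict([0, 0, 1]))  # case
print(predict([2, 1, 2]))  # control
```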
