Post on 18-Dec-2015
1
Associating Genomic Variations with
Phenotypes
Model comparison, rare variants, and analysis pipeline
Qunyuan Zhang
Division of Statistical Genomics & Genome Institute
Washington University School of Medicine
2
Data & Question
Relationship between X and Y?
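Data layout, for subjects $i = 1, \dots, n$ and markers $1, \dots, m$:

$$
\begin{array}{c|c|cccc}
i & Y & X_1 & X_2 & \cdots & X_m \\
\hline
1 & y_1 & x_{11} & x_{12} & \cdots & x_{1m} \\
2 & y_2 & x_{21} & x_{22} & \cdots & x_{2m} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
n & y_n & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{array}
$$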
Genotypes (X): SNP, insertion, deletion, duplication, inversion, translocation, …
Phenotypes (Y): quantitative, categorical
3
Linkage & Association
Association: (Y,X)
Linkage: (Y, Q), where Q is unobservable
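$$
\begin{array}{c|c|cccc}
i & Y & X_1 & Q & X_2 & \cdots \\
\hline
1 & y_1 & x_{11} & q_1 & x_{12} & \cdots \\
2 & y_2 & x_{21} & q_2 & x_{22} & \cdots \\
\vdots & \vdots & \vdots & \vdots & \vdots & \\
n & y_n & x_{n1} & q_n & x_{n2} & \cdots
\end{array}
$$

Y: phenotype; X: genotypes; Q: putative QTL
r1 Q r2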
4
A Fixed-effect Mixture Model For Linkage
Commonly used in plant genetics
r1 Q r2
P1 × P2 → F1 → F2
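$$
f(y_i) = \sum_{j=1}^{3} P(Q_j \mid X_i, r)\, \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu_j}{\sigma_j}\right)^2\right]
$$

$$
L(Y) = \prod_{i=1}^{n} f(y_i)
$$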
SNP A SNP B
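The mixture likelihood on this slide can be evaluated numerically once the conditional QTL-genotype probabilities are known. The sketch below is a minimal illustration, not the author's implementation: the function names and the toy data are hypothetical, and the probabilities P(Q_j | X_i, r) are supplied as inputs rather than derived from flanking-marker genotypes.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def mixture_log_likelihood(y, qtl_probs, mus, sigmas):
    """log L(Y) = sum_i log sum_j P(Q_j | X_i, r) * N(y_i; mu_j, sigma_j^2).

    y         : n phenotype values
    qtl_probs : n rows; row i holds the 3 probabilities P(Q_j | X_i, r)
    mus       : the 3 QTL-genotype means
    sigmas    : the 3 QTL-genotype standard deviations (may all be equal)
    """
    total = 0.0
    for y_i, probs in zip(y, qtl_probs):
        f_i = sum(p * normal_pdf(y_i, m, s)
                  for p, m, s in zip(probs, mus, sigmas))
        total += math.log(f_i)
    return total


# Hypothetical toy data for three F2 subjects (values are made up).
y = [9.8, 12.1, 14.3]
qtl_probs = [[0.90, 0.09, 0.01],
             [0.05, 0.90, 0.05],
             [0.01, 0.09, 0.90]]
log_l = mixture_log_likelihood(y, qtl_probs,
                               mus=[10.0, 12.0, 14.0],
                               sigmas=[1.0, 1.0, 1.0])
```

In practice the means and variances would be estimated by maximizing this function (for example with an EM algorithm), which is outside the scope of this sketch.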
5
A Variance-component Model For Linkage
Commonly used in human genetics
r1 Q r2
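$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y - \mu)^T V^{-1} (Y - \mu)\right]
$$

$$
\mathrm{Cov}(Y) = V = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$\Delta_Q$: QTL IBD matrix
$\Delta_g$: background IBD matrix
$I$: identity (diagonal unit) matrix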
SNP A SNP B
6
Variance-component Model = Random-effect Linear Model
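$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

$$
\gamma_Q \sim MVN(0, \Delta_Q \sigma_Q^2), \quad \gamma_g \sim MVN(0, \Delta_g \sigma_g^2), \quad e \sim N(0, \sigma_e^2)
$$

$$
V = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y - \mu)^T V^{-1} (Y - \mu)\right]
$$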
Random effects
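To make the variance-component likelihood concrete, the following sketch builds V from two relationship matrices and evaluates the multivariate normal log-likelihood with a hand-written Cholesky factorization. It is a small-scale illustration using only the Python standard library; the function names and the sib-pair matrices are hypothetical, and real analyses would rely on the dedicated programs listed under Tools.

```python
import math


def build_v(delta_q, delta_g, var_q, var_g, var_e):
    """V = Delta_Q * var_q + Delta_g * var_g + I * var_e (all n x n)."""
    n = len(delta_q)
    return [[delta_q[i][j] * var_q + delta_g[i][j] * var_g
             + (var_e if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def mvn_log_likelihood(y, mu, v):
    """log of (2*pi)^(-n/2) |V|^(-1/2) exp(-0.5 (Y-mu)^T V^-1 (Y-mu))."""
    n = len(y)
    low = cholesky(v)
    resid = [y_i - mu for y_i in y]
    # Forward substitution solves low * z = resid, so z.z = resid^T V^-1 resid.
    z = []
    for i in range(n):
        z.append((resid[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(z_i * z_i for z_i in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


# Hypothetical sib pair: both matrices share 0.5 off the diagonal (made-up values).
delta = [[1.0, 0.5], [0.5, 1.0]]
v = build_v(delta, delta, var_q=1.0, var_g=1.0, var_e=1.0)
log_l = mvn_log_likelihood([1.2, 0.7], mu=1.0, v=v)
```

Working on the log scale and using the Cholesky factor avoids forming the inverse of V explicitly, which is the usual design choice for this likelihood.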
7
From Linkage to Association
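Linkage model, QTL effect(s) $\gamma_Q$:

$$
Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e
$$

Family-based association model, marker effect(s) $\beta$ as fixed effect(s):

$$
Y = \mu + X\beta + Z_g \gamma_g + e
$$

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y - \mu - X\beta)^T V^{-1} (Y - \mu - X\beta)\right]
$$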
8
A Simple Association Model For Unrelated Subjects
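$$
Y = \mu + X\beta + e
$$

$$
V = I \sigma_e^2
$$

$$
L(Y) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[-\frac{1}{2}(Y - \mu - X\beta)^T V^{-1} (Y - \mu - X\beta)\right]
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu - X_i\beta}{\sigma_e}\right)^2\right]
$$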
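With V = Iσ_e², maximizing this likelihood over μ and β gives the ordinary least-squares estimates, so a single-marker test reduces to simple linear regression. The sketch below is one way to compute it with the standard library only; the function name and the toy genotype/phenotype values are hypothetical.

```python
import math


def single_marker_ols(y, x):
    """Least-squares fit of y_i = mu + x_i * beta + e_i.

    Returns (mu, beta, se_beta, t), where t = beta / se_beta is the
    statistic for H0: beta = 0 (n - 2 degrees of freedom).
    """
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    mu = y_bar - beta * x_bar
    rss = sum((yi - mu - beta * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_e = rss / (n - 2)  # unbiased estimate of the residual variance
    se_beta = math.sqrt(sigma2_e / sxx)
    return mu, beta, se_beta, beta / se_beta


# Made-up data: genotypes coded as 0/1/2 allele counts, quantitative phenotype.
x = [0, 1, 2, 0, 1, 2]
y = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
mu, beta, se_beta, t_stat = single_marker_ols(y, x)
```

Converting the statistic to a p-value needs the t distribution, which the standard library does not provide; that step is left to a statistics package.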
9
Covariate(s): Adjusting For Confounder(s)
$$
Y = \mu + X\beta + X_C \beta_C + e
$$

Observed confounders: age, sex, etc.
Hidden confounders: population structure
Population structure can be estimated by:
-PCA
-Clustering
-Admixture/ancestry
10
Modeling Hidden Genetic Correlation Between Subjects
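$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

$$
V = \Delta_g \sigma_g^2 + I \sigma_e^2
$$

$\beta$: marker fixed effect(s)
$\beta_C$: covariate fixed effect(s)
$\gamma_g$: genetic background random effects

Family data, pedigree => IBD matrix
Population data, hidden, marker data => IBS matrix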
11
Modeling Rare Variants
$$
Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e
$$

Common variants, tested individually:

$$
Y = \mu + X_1 \beta_1 + \dots
$$

H0: β1 = 0. One p-value per variant.

Rare variants, tested as an entire group (burden test), usually by gene:

$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
$$

H0: β1 = β2 = … = βk = 0. One p-value per group of variants.

Can be combined with variable selection, using loose criteria.
β can be treated as random effects (variance-components test) and can be weighted by prior information.
12
Collapsing Model
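$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
\quad\Longrightarrow\quad
Y = \mu + X\beta + \dots
$$

subject  X1  X2  X3  X
1        0   0   0   0
2        0   1   1   1
3        1   0   0   1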
Collapsing multiple variables into one
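The collapsing step can be sketched in a few lines. The rule assumed here (X = 1 if a subject carries a variant at any site in the group, otherwise 0) is one reading of the slide's small table, not something the slide states explicitly; the function name is hypothetical.

```python
def collapse(genotype_rows):
    """Collapse each subject's variants into one indicator.

    Returns 1 if the subject carries at least one variant allele
    anywhere in the group, else 0.
    """
    return [1 if any(g > 0 for g in row) else 0 for row in genotype_rows]


# Subjects 1-3 with variants X1, X2, X3, as in the slide's table.
genotypes = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
collapsed = collapse(genotypes)
print(collapsed)  # [0, 1, 1]
```

The collapsed column then replaces X1…Xk as a single predictor in the association model.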
13
Weighted Sum Model
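$$
Y = \mu + X_1 \beta_1 + X_2 \beta_2 + \dots + X_k \beta_k + \dots
\quad\Longrightarrow\quad
Y = \mu + \beta \left( \sum_{j=1}^{k} w_j X_j \right) + \dots
$$

subject  X1 (w1 = 0.2)  X2 (w2 = 0.5)  X3 (w3 = 0.3)  S
1        0              0              0              0.0
2        0              1              1              0.8
3        1              0              0              0.2

Weighted sum score S:

$$
Y = \mu + S\beta + \dots
$$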
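The weighted sum score is a dot product of each subject's genotype vector with the weight vector. A minimal sketch, using the weights and genotypes from the slide's table (the function name is hypothetical):

```python
def weighted_sum_scores(genotype_rows, weights):
    """S_i = sum_j w_j * x_ij for every subject i."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotype_rows]


weights = [0.2, 0.5, 0.3]          # w1, w2, w3
genotypes = [
    [0, 0, 0],                     # subject 1
    [0, 1, 1],                     # subject 2
    [1, 0, 0],                     # subject 3
]
scores = weighted_sum_scores(genotypes, weights)
print([round(s, 2) for s in scores])  # [0.0, 0.8, 0.2]
```

Setting every weight to 1 gives a simple count of variant alleles per subject, so the collapsing and weighted-sum models differ only in how the group is summarized.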
14
Weighting Variants
Based on allele frequency: continuous or binary (0, 1) weight, variable threshold;
Based on function annotation/prediction;
Based on sequencing quality (coverage, mapping quality, genotyping quality, validated or not, etc.);
Data-driven: using both genotype and phenotype data, learning weights (including effect directions) from data, requiring a permutation test;
Any combination …
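As an illustration of the first option, allele-frequency-based weights can be computed directly from the genotype data. The sketch below shows a binary threshold weight and one continuous weight; the specific continuous form, 1 / sqrt(q(1 − q)), is an assumption chosen because it up-weights rarer variants, and the slide does not prescribe any particular formula. All function names and the default threshold are hypothetical.

```python
import math


def minor_allele_frequency(genotype_column):
    """Allele frequency for one variant from 0/1/2 allele counts.

    Assumes diploid subjects and that the counted allele is the minor one.
    """
    return sum(genotype_column) / (2.0 * len(genotype_column))


def binary_weight(maf, threshold=0.01):
    """1 if the variant is rarer than the threshold, else 0."""
    return 1.0 if maf < threshold else 0.0


def continuous_weight(maf):
    """Up-weight rarer variants; monomorphic sites get weight 0."""
    if maf <= 0.0 or maf >= 1.0:
        return 0.0
    return 1.0 / math.sqrt(maf * (1.0 - maf))


# Made-up genotypes for one variant in four subjects.
maf = minor_allele_frequency([0, 1, 0, 0])
w_binary = binary_weight(maf, threshold=0.2)
w_continuous = continuous_weight(maf)
```

A variable-threshold analysis would repeat the binary weighting over a grid of thresholds and correct for the multiple thresholds tried, typically by permutation.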
Grouping Variants
By gene
By transcript
By exon
By gene set / pathway
By protein domain
……
15
Modeling More Data Types
Generalized Linear (Mixed) Model
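$$
g(Y) = \mu + X\beta + \dots + e
$$

$g$: link function

For binary Y, logistic model:

$$
g(Y) = \mathrm{logit}(Y) = \log \frac{P(Y = 1)}{P(Y = 0)}
$$

$$
P(Y = 1) = \frac{\exp(\mu + X\beta + \dots + e)}{\exp(\mu + X\beta + \dots + e) + 1}
$$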
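The logit link and its inverse are easy to check numerically. This sketch covers only the link function itself for a single predictor, with hypothetical function names and made-up parameter values; fitting the model is left to the tools listed later in the deck.

```python
import math


def logit(p):
    """log(P(Y=1) / P(Y=0)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """P(Y=1) = exp(eta) / (exp(eta) + 1), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (e + 1.0)


def prob_case(x, mu, beta):
    """P(Y=1) for genotype x under the linear predictor mu + x * beta."""
    return inv_logit(mu + x * beta)


# Made-up parameters: baseline log-odds -1.0, per-allele log-odds ratio 0.4.
probabilities = [prob_case(x, mu=-1.0, beta=0.4) for x in (0, 1, 2)]
```

Because the inverse link maps any real-valued linear predictor into (0, 1), the same right-hand side used for quantitative traits can be reused for binary phenotypes.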
16
Longitudinal Data (quantitative)
Fixed effect, time as covariate
Repeated measures, random effect, correlation within subjects
[Figure; x-axis: Time]
17
Longitudinal Data (binary)
Linear model, time as covariate
Survival analysis, CoxPH model etc.
[Figure; x-axis: Time]
18
Tools
SAS Procedures: REG, LOGISTIC, GENMOD, MIXED, HPMIXED, GLIMMIX, PHREG/LIFETEST
R Functions/Packages: lm(), glm(), gee, nlme, kinship2/coxme, lme4, survival
Other Programs: SOLAR, MMAP, EMMA, EMMAX, SKAT
19
Pipeline
Input (data + options)
Job generating/submitting module (LSF bsub): job1, job2, …, job N
  Options.jobi => self-programmed modules (SAS, R, …)
  Options.jobi => external program modules (MMAP, SKAT, …)
  Result 1, Result 2, …, Result N
Job number controlling module
Job status monitoring module (all done?)
  No: wait …
  Yes: result summarizing module
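The interplay of the job-submitting, job-number-controlling, and status-monitoring modules can be sketched as a single loop. This is an illustration of the control flow only, not the pipeline's actual code: the function, the callbacks, and the `FakeCluster` stand-in are all hypothetical, and a real `submit` callback would hand the job to LSF instead of updating a counter.

```python
from collections import deque


def run_pipeline(job_ids, max_running, submit, is_done, wait):
    """Keep at most max_running jobs in flight; return once all are done."""
    pending = deque(job_ids)
    running = set()
    finished = []
    while pending or running:
        # Job number control: top up to the allowed number of jobs.
        while pending and len(running) < max_running:
            job = pending.popleft()
            submit(job)
            running.add(job)
        # Job status monitoring: collect the jobs that have finished.
        done_now = [job for job in running if is_done(job)]
        for job in done_now:
            running.discard(job)
            finished.append(job)
        if not done_now:
            wait()  # nothing finished yet, pause before polling again
    return finished


class FakeCluster:
    """Stand-in for a scheduler: each job is done on its second status poll."""

    def __init__(self):
        self.polls = {}
        self.in_flight = 0
        self.peak = 0

    def submit(self, job):
        self.polls[job] = 0
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)

    def is_done(self, job):
        self.polls[job] += 1
        if self.polls[job] >= 2:
            self.in_flight -= 1
            return True
        return False


cluster = FakeCluster()
finished = run_pipeline(range(1, 8), 3, cluster.submit, cluster.is_done,
                        wait=lambda: None)
```

Once the loop returns, every job has reported completion, which corresponds to the "all done? Yes" branch that triggers the result summarizing module.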
20
gwas.sh options.gwa
#!/bin/sh
OPFILE=$1
...
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
genotype_file=
phenotype_file=fhs100
markerinfo_file=mapall
marker_selection=MAF>0.01
pedigree_file=pediall
subjectID=subject
pedgreeID=famid
markername=snp
…
[ANALYSIS]
phenolist_file=
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed
[OUTPUT]
output_dir=/dsguser/qunyuan/fhs/bmi
output_file=
output_replace=no
[RUN]
clusterjobname=bmimixed
memsize=1000M
maxjobn=300
…
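An options file with this sectioned key=value layout can be read with a standard INI parser. The slide's own gwas.sh is a shell script whose parsing code is not shown, so the Python sketch below is a substitute technique rather than the author's method. The sample text includes only a subset of the keys, split out of the flattened slide text as a best reading, and omits the elided "…" entries.

```python
import configparser

# A subset of the options shown on the slide (elided entries left out).
OPTIONS_TEXT = """
[DATA]
database=SAS
genotype_dir=/dsg1/gwas/fhsgeno
phenotype_file=fhs100
marker_selection=MAF>0.01
[ANALYSIS]
pheno_list=bmi/qt
covariates=
program=SASGLM
analysis=mixed
[RUN]
memsize=1000M
maxjobn=300
"""


def load_options(text):
    """Parse the options text into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as subjectID case-sensitive
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


options = load_options(OPTIONS_TEXT)
max_jobs = int(options["RUN"]["maxjobn"])
```

Empty entries such as `covariates=` come back as empty strings, so downstream code can treat "not set" and "set to nothing" the same way.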
Pheno  type  covar    program  analysis  run
Bmi    qt    age,sex  SASGLM   mixed     YES
Obes   ql    NA       SASGLM   gee       YES
HD     ql    age      SASGLM   gee       NO
Age    …
Sex    …
…
Program  language  location                Maintainer
SASGLM   SAS       /dsg1/code/sas/glm.sas  Q.Zhang
GSTAT    R         /dsg1/code/R/gstat.R    Q.Zhang
MMAP     C         /dsg1/code/sas/mmap.sh  J. Czajkowski
…
21
Thanks !