Forest Approach to Genetic Studies
-
Upload
ivor-cross -
Category
Documents
-
view
17 -
download
0
description
Transcript of Forest Approach to Genetic Studies
![Page 1: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/1.jpg)
Forest Approach to Genetic StudiesForest Approach to Genetic Studies
Heping ZhangHeping Zhang
Presented at IMS Genomic Workshop, NUS Singapore, June 8, 2009
And Xiang Chen, Ching-Ti Liu, Minghui Wang, Meizhuo Zhang
![Page 2: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/2.jpg)
2
OutlineOutline
Background for genetic studies of complex traits
Recursive partitioning, trees, and forests
Challenges & solutions in genetic studies
A case study
![Page 3: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/3.jpg)
Complex TraitsComplex Traits
Diseases that do not follow Mendelian Inheritance Pattern
Genetic factors, Environment factors, G-G and G-E interactions
Interactions: effects that deviate from the additive effects of single effects
3 of 48
![Page 4: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/4.jpg)
4
Successes in Genetic Studies of Complex TraitsSuccesses in Genetic Studies of Complex Traits
Genetic variants have been identified for Age-related Macular Degeneration, Diabetes, Inflammatory Bowel Disorders, etc.
![Page 5: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/5.jpg)
5
SNP and Complex TraitsSNP and Complex Traits
http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism
![Page 6: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/6.jpg)
6
SNPs and HaplotypesSNPs and Haplotypes
![Page 7: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/7.jpg)
7
Gold MiningGold Mining
![Page 8: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/8.jpg)
8
Regression approachRegression approach
1
2
…
25
26
…
72
*
*
…
*
*
…
*
~
~
…
~
~
…
~
![Page 9: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/9.jpg)
9
Classic Modeling vs Genomic Association AnalysisClassic Modeling vs Genomic Association Analysis
In classic statistical modeling, we tend to have an adequate sample size for estimating parameters of interest. Often, we have hundreds or thousands of observations for the inference on a few parameters. We can try to settle an “optimal” model.In genomic studies, we have more and more variables (gene based) but the access to the number of study subjects remains the same. One model can no longer provide an adequate summary of the information.
![Page 10: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/10.jpg)
Recursive PartitioningRecursive Partitioning
A technique to identify heterogeneity in the data and fit a simple model (such as constant or linear) locally, and this avoids pre-specifying a systematic component.
10 of 48
![Page 11: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/11.jpg)
11
Leukemia DataLeukemia Data
Source: http://www-genome.wi.mit.edu/cancer
Contents:
• 25 mRNA - acute myeloid leukemia (AML)
• 38 - B-cell acute lymphoblastic leukemia (B-ALL)
• 9 - T-cell acute lymphoblastic leukemia (T-ALL)
• 7,129 genes
Question: are the microarray data useful in classifying different types of leukemia?
![Page 12: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/12.jpg)
12
3-D View3-D View
AML
T-ALLB-ALL
![Page 13: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/13.jpg)
13
Click to see the diagram
Node SplittingNode Splitting
![Page 14: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/14.jpg)
14
Tree StructureTree Structure
Node 7Node 6Node 5Node 4
Node 1
Node 3Node 2
![Page 15: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/15.jpg)
15
ForestsForests
To identify a constellation of models that collectively help us understand the data. For example, in gene expression profiling, we can select and rank the genes whose expressions show a great promise of classifying tumor cells.
![Page 16: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/16.jpg)
16
Bagging (Bootstrap Aggregating)Bagging (Bootstrap Aggregating)
Cancer NormalHigh
Low
A random tree
A Random Forest
Repetition
A tree
![Page 17: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/17.jpg)
17
Choose 20 best splits
Choose 3 best splits for each daughter node
1.1
2.13.13.23.3
1.1
2.1
1.1
2.2
1.1
2.3
1.1
2.2 3.43.53.6
1.1
2.3 3.73.83.9
. . .1.1 1.2 1.3 1.4 1.5
For the highlighted daughter nodes, we choose three best splits
Deterministic ForestDeterministic Forest
![Page 18: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/18.jpg)
Challenge I: Memory ConstraintChallenge I: Memory Constraint
• The number of SNPs makes it impossible to conduct a full genomewide association study in standard desktop computers.
• Data security requirements often do not allow the analysis done in computers with huge memory.
• We need a simple but efficient memory management design.
18 of 48
![Page 19: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/19.jpg)
How to Use Memory Efficiently?How to Use Memory Efficiently?
0 (AA), 1 (AB), 2 (BB) & 3 (missing)
2 0 3 1 0 …
1 0 0 0 1 1 0 1 0 0 …
byte bit
Com
pression D
ecompression
19 of 48
![Page 20: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/20.jpg)
WillowsWillows
20 of 48
![Page 21: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/21.jpg)
Williows GUIWilliows GUI
21
![Page 22: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/22.jpg)
Williows OutputWilliows Output
22 of 48
![Page 23: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/23.jpg)
Challenge II: Haplotype CertaintyChallenge II: Haplotype Certainty
SNPs Directly observed No uncertainty
Less informativeTree approaches
Haplotypes Inferred from SNPs
Uncertain More informative Forest approaches
23 of 48
![Page 24: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/24.jpg)
24
Forest Forming SchemeForest Forming Scheme
Unphased data
Estimated haplotype frequencies
Reconstructed phased data 1
Reconstructed phased data 2
Reconstructed phased data 3
Reconstructed phased data 4
Reconstructed phased data n
……
…
Importance index for haplotype 1
Importance index for haplotype 2
Importance index for haplotype 3
Importance index for haplotype k
……
…
Tree 1
Tree 2
Tree 3
Tree 4
Tree n
……
…
![Page 25: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/25.jpg)
25
Haplotype Frequency EstimationHaplotype Frequency Estimation
Existing haplotype frequency estimation software that output a set of haplotype pairs with corresponding frequencies for each subject in each region.
We used SNPHAP (Clayton 2006)
![Page 26: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/26.jpg)
26
Unphased to Phased DataUnphased to Phased Data
One unphased data expands to a large number of phased datasets.
In each region, an individual’s haplotype pair is randomly selected based on the estimated frequencies to account for the uncertainty of the haplotypes.
Haplotypes with low frequencies (~5-10%) should have some representations.
![Page 27: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/27.jpg)
27
Trees Based on Phased DataTrees Based on Phased Data
A tree is grown for each phased data set.
A random forest is formed for all phased data sets.
![Page 28: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/28.jpg)
28
Inference from the ForestInference from the Forest
ce.independen of statistictest - theof value
theis and node ofdepth theis where
,2
in tree haplotype of Importance
2
2t
by split is ,
2
tL
V
Th
t
htTtt
Lh
t
![Page 29: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/29.jpg)
29
Significance LevelSignificance Level
Distribution of the maximum haplotype importance
under null hypothesis is determined by permutation.
First, disease status is permuted among study
subjects while keeping the genome intact for all
individuals.
Then, each of the permuted data set is treated in the
same way as the original data.
![Page 30: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/30.jpg)
Simulation Studies (2 loci)Simulation Studies (2 loci)• 300 cases and 300 controls • Each region has 3 SNPs• 12 interaction models from Knapp et. al. (1994) and Becker et. al. (2005)• 2 additive models with background penetrance• 3 scenarios
• Neither region is in LD with the disease allele• One of the regions is in LD (D’ = 0.5) with the disease allele• Both regions are in LD (D’ = 0.5) with the disease allele
30 of 48
![Page 31: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/31.jpg)
31
Simulation Studies (2 loci)Simulation Studies (2 loci)
00f01f 02f
10f 11f12f
20f 21f 22f
Region 2
0 1 2
Region 1
0
1
2
Penetrance
![Page 32: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/32.jpg)
32
Simulation Studies (2 loci)Simulation Studies (2 loci)00f01f02f10f11f12f20f21f22f
ffModel
Ep-1 0 0 0 0 0 0.707 0.210 0.210
Ep-2 0 0 0 0 0 0 0 0.778 0.600 0.199
Ep-3 0 0 0 0 0 0 0 0 0.900 0.577 0.577
Ep-4 0 0 0 0 0 0.911 0.372 0.243
Ep-5 0 0 0 0 0 0 0.799 0.349 0.349
Ep-6 0 0 0 0 0 1.000 0.190 0.190
Het-1 0 0.495 0.053 0.053
Het-2 0 0 0.660 0.279 0.040
Het-3 0 0 0 0 1.000 0.194 0.194
S-1 0 0.522 0.052 0.052
S-2 1 1 1 0 0 0.574 0.228 0.045
S-3 1 1 1 0 0 0 0.512 0.194 0.194
Ad-1 0.04 0.304 0.02 0.01 0.01 0.01 0.799 0.349 0.349
Ad-2 0.15 0.324 0.10 0.05 0.05 0.05 0.799 0.349 0.349
ff
f
f
ff
f f
ff
f
f
f
ff
f ff f f f
f
ff
f
f
ff
f
fff
fff
ffffff
ff
ffff
ff
f
f
ggg
gggg
1p 2p
22 ffg
![Page 33: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/33.jpg)
33
Simulation Studies (2 loci)Simulation Studies (2 loci)
Benckmark:
FAMHAP software from Becker et. al. (2005)
![Page 34: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/34.jpg)
34
Result for Scenario IResult for Scenario I
False positive rate:
Our method: < 1%
FAMHAP: > 5%
![Page 35: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/35.jpg)
35
Result for Scenario IIResult for Scenario II
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Ep-1 Ep-2 Ep-3 Ep-4 Ep-5 Ep-6 Het-1
Identify the correcthaplotype (Forest)
Identify an incorrecthaplotype (Forest)
Identify SNPs in thecorrect region(FAMHAP)
Identify SNPs in theneutral region(FAMHAP)
![Page 36: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/36.jpg)
36
Result for Scenario IIResult for Scenario II
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Het-2 Het-3 S-1 S-2 S-3 Ad-1 Ad-2
Identify the correcthaplotype (Forest)
Identify an incorrecthaplotype (Forest)
Identify SNPs in thecorrect region(FAMHAP)
Identify SNPs in theneutral region(FAMHAP)
![Page 37: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/37.jpg)
37
Result for Scenario IIIResult for Scenario III
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Po
wer
Ep-1 Ep-2 Ep-3 Ep-4 Ep-5 Ep-6 Het-1
Identify at leastone haplotype(Forest)Identify bothhaplotypes(Forest)Identify SNPs in atleast one region(FAMPHAP)Identify SNPs inboth regions(FAMHAP)
![Page 38: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/38.jpg)
38
Result for Scenario IIIResult for Scenario III
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Po
wer
Het-2 Het-3 S-1 S-2 S-3 Ad-1 Ad-2
Identify at leastone haplotype(Forest)Identify bothhaplotypes(Forest)Identify SNPs in atleast one region(FAMHAP)Identify SNPs inboth regions(FAMHAP)
![Page 39: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/39.jpg)
Real Case StudyReal Case Study
Age-related macular degeneration (AMD)Leading cause of vision loss in elderlyAffects more than 1.75 million individuals in the United StatesProjected to about 3 million by 2020
Klein et al. (2005)Case-control (96 AMD cases, 50 controls)~100,000 SNPs for each individualCFH gene identified
39 of 48
![Page 40: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/40.jpg)
40
Analysis ProcedureAnalysis Procedure
Willows programEach SNP is used as one covariateTwo SNPs identified as potentially associated with AMD (rs1329428 on chromosome 1 and rs10272438 on chromosome 7)
Hapview program: LD block construction6-SNP block for rs132942811-SNP block for rs10272438
![Page 41: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/41.jpg)
41
ResultResult
Two haplotypes are identifiedMost significant: ACTCCG in region 1 (p-value = 2e-6)
Identical to Klein et. al. (2005)Located in CFH gene
Another significant haplotype: TCTGGACGACA, in region 2 (p-value = 0.0024)
Not reported beforeProtectiveLocated in BBS9 gene
![Page 42: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/42.jpg)
42
Expected FrequenciesExpected Frequencies
![Page 43: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/43.jpg)
43
RemarksRemarks
A
A
B
![Page 44: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/44.jpg)
Main ReferencesMain References
• Chen, Liu, Zhang, Zhang, PNAS, 2007• Zhang, Wang, Chen, BMC Bioinformatics,
2009• Chen, et al., BMC Genetics, 2009
44 of 48
![Page 45: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/45.jpg)
45
BooksBooks
Zhang HP and Singer B. Recursive Partitioning in the Health Sciences. Springer, 1999.
![Page 46: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/46.jpg)
46
Trees in Genetic StudiesTrees in Genetic Studies
Zhang and Bonney (2000)Nelson et al. (2001)Bastone et al. (2004)Cook, Zee and Ridker (2004)Foulkes, De Gruttola and Hertogs (2004)
![Page 47: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/47.jpg)
47
References on ForestsReferences on Forests
Breiman L. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
Zhang HP. Classification trees for multiple binary responses. Journal of the American Statistical Association, 93: 180-193, 1998.
Zhang HP et al. Cell and Tumor Classification using Gene Expression Data: Construction of Forests. Proceedings of the National Academy of Sciences USA, 100: 4168-4172, 2003.
![Page 48: Forest Approach to Genetic Studies](https://reader035.fdocuments.us/reader035/viewer/2022062721/568137ef550346895d9fab0d/html5/thumbnails/48.jpg)
Thank you!Thank you!
48 of 48