RandomForests for Biomedical Applications
Random Forests and Archetypal Analysis of Dietary Patterns in the
Cache County Memory Study
Adele Cutler
Department of Mathematics and Statistics
Utah State University
This research is partially supported by NIH 1R15AG037392-01
04/13/2023 ADMC 2012 2
Leo Breiman, 1928 - 2005
1984 CART
1994 Archetypal Analysis
1996 Bagging
2001 Random Forests
Example 1: Cookbooks and nutrition
• 300 recipes from 12 cookbooks
• Nutritional information (33 predictors)
Joint work with Sheryl Aguilar and Michael Lefevre
Center for Advanced Nutrition, Utah State University
Example 2: The Cache County Memory Study
[Image slides: Utah; Cache Valley, Utah; Utah State University]
• Prospective, population-based study, 1995-2006
• 5,092 people aged 65 and over
• Food frequency questionnaire
Joint work with Heidi Wengreen (2), Chris Corcoran (1), Anna Quach (1)
(1) Mathematics and Statistics, Utah State University
(2) Nutrition and Food Sciences, Utah State University
Outline
• RF for cookbooks
• RF for memory study
• Archetypes for cookbooks
• Archetypes for memory
• Current development
Random Forests
Random Forests for Classification
Example 1 (cookbooks):
• Predict the author of a recipe based on the nutritional content of the recipe
• Which variables are important?
Example 2 (memory):
• Predict a person’s dementia status (yes/no) based on their diet
• Which variables are important?
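As a rough illustration of the kind of classifier used in both examples, here is a minimal random forest sketch in plain Python: each tree is grown on a bootstrap sample, each split looks at a random subset of `mtry` predictors, and the error rate is estimated from out-of-bag votes. This is a toy written for illustration, not the Salford or R implementation; the function names and the small two-predictor data set are made up.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) by weighted Gini, or None if nothing improves."""
    best, best_score = None, gini(labels)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, mtry, rng):
    """Grow an unpruned tree; internal nodes are (feature, threshold, left, right)."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), mtry))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], mtry, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], mtry, rng))

def predict_tree(tree, row):
    """Follow the splits down to a leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def random_forest(rows, labels, n_trees=50, mtry=1, seed=0):
    """Return (list of trees, out-of-bag error rate)."""
    rng = random.Random(seed)
    n = len(rows)
    forest, votes = [], [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree([rows[i] for i in bag], [labels[i] for i in bag], mtry, rng)
        forest.append(tree)
        for i in set(range(n)) - set(bag):  # only out-of-bag cases vote
            votes[i][predict_tree(tree, rows[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != labels[i] for i in scored)
    return forest, wrong / len(scored)

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

# Made-up data: predictor 0 separates the classes, predictor 1 is uninformative.
rows = [[i / 20, i % 5] for i in range(20)] + [[2 + i / 20, i % 5] for i in range(20)]
labels = ["low"] * 20 + ["high"] * 20
forest, oob_error = random_forest(rows, labels, n_trees=50, mtry=1)
print("OOB error:", oob_error)
```

The out-of-bag estimate is what lets a forest report an error rate without a separate test set: each case is scored only by the trees whose bootstrap sample did not contain it.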
Example 1: Cookbooks
Cookbooks: Predict the author?
Cookbook       Error rate (%)
AHA              4
Cookbook 2      40
Cookbook 3      59
Cookbook 4      95
Cookbook 5      79
Cookbook 6      65
Cookbook 7      91
Cookbook 8      64
Cookbook 9      15
Cookbook 10     92
Cookbook 11     72
Cookbook 12     85
Overall error rate: 63%
Cookbooks: important variables
Error rate 63%
• Fat (g)
• Saturated fat (g)
• Cholesterol (mg)
• Monounsaturated fat (g)
• Sodium (mg)
• Protein (g)
• Vitamin B6 (mg)
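The slides do not say how the importance ranking was computed. One common choice for random forests is permutation importance: shuffle one predictor's values and see how much the error rate rises. The sketch below shows the idea in a simplified form (over a whole data set with an arbitrary `predict` function, whereas random forests usually do this tree by tree on out-of-bag cases). The stand-in classifier and data are hypothetical.

```python
import random

def error_rate(predict, rows, labels):
    """Fraction of cases the classifier gets wrong."""
    return sum(predict(r) != y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, rng):
    """Increase in error rate after shuffling one predictor's column."""
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
    return error_rate(predict, shuffled, labels) - error_rate(predict, rows, labels)

# Hypothetical stand-in for a fitted classifier: it only looks at predictor 0.
def predict(row):
    return "high" if row[0] > 1.0 else "low"

# Made-up data: predictor 0 carries the signal, predictor 1 does not.
rows = [[i / 20, i % 5] for i in range(20)] + [[2 + i / 20, i % 5] for i in range(20)]
labels = ["low"] * 20 + ["high"] * 20

rng = random.Random(0)
for feature in (0, 1):
    print("predictor", feature, "importance:",
          permutation_importance(predict, rows, labels, feature, rng))
```

A predictor the classifier never uses gets an importance of exactly zero, because shuffling it cannot change any prediction.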
Two Classes: AHA versus the rest
Error rate 2.33%
• Fat (g)
• Monounsaturated fat (g)
• Saturated fat (g)
• Sodium (mg)
• Polyunsaturated fat (g)
• Protein (g)
• Cholesterol (mg)
Two Classes: AHA versus the rest
Error rate 2.33%
Actual \ Predicted   Other   AHA   Error rate (%)
Other                  274     1    0.36
AHA                      6    19   24.00
Class weights!
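The per-class and overall error rates follow directly from the confusion matrix, and they show why class weights are needed: the overall rate is low only because the large "Other" class is almost always right. A small sketch, with a helper name (`error_rates`) chosen for illustration:

```python
def error_rates(confusion):
    """confusion[actual][predicted] = count.
    Returns (per-class error in %, overall error in %)."""
    per_class = {}
    wrong = total = 0
    for actual, row in confusion.items():
        n = sum(row.values())
        missed = n - row.get(actual, 0)
        per_class[actual] = 100.0 * missed / n
        wrong += missed
        total += n
    return per_class, 100.0 * wrong / total

# Counts from the AHA-versus-the-rest table above.
aha_vs_rest = {
    "Other": {"Other": 274, "AHA": 1},
    "AHA": {"Other": 6, "AHA": 19},
}
per_class, overall = error_rates(aha_vs_rest)
for label, rate in per_class.items():
    print(f"{label}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

This reproduces the 0.36%, 24.00%, and 2.33% figures: 1 of 275, 6 of 25, and 7 of 300 recipes misclassified.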
Class Weights
80% weight AHA, 20% weight “Other”
Error rate 5%
Actual \ Predicted   Other   AHA   Error rate (%)
Other                  261    14    5.1
AHA                      1    24    4.0
Salford and R
Different weighting schemes!
• R weights only take a weighted bootstrap sample
• Salford does weighted splits as well
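The two places where weights can enter can be sketched as follows. This is only an illustration of the distinction drawn on the slide, not the code of either package, and it assumes that "80% / 20%" means a per-case weight of 0.8 for AHA recipes and 0.2 for the others, which the slides do not state.

```python
import random
from collections import Counter

def weighted_bootstrap(labels, class_weight, rng):
    """Draw n case indices with replacement, each case weighted by its class."""
    weights = [class_weight[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=len(labels))

def weighted_gini(labels, class_weight):
    """Gini impurity in which each case counts with its class weight."""
    mass = Counter()
    for y in labels:
        mass[y] += class_weight[y]
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

# Class sizes from the cookbook example: 275 "Other" recipes, 25 AHA recipes.
labels = ["Other"] * 275 + ["AHA"] * 25
weights = {"Other": 0.2, "AHA": 0.8}  # assumed per-case weights

rng = random.Random(0)
sample = weighted_bootstrap(labels, weights, rng)
print("AHA cases in weighted bootstrap:", sum(labels[i] == "AHA" for i in sample))
print("Unweighted Gini:", weighted_gini(labels, {"Other": 1.0, "AHA": 1.0}))
print("Weighted Gini:  ", weighted_gini(labels, weights))
```

Weighting only the bootstrap changes which cases each tree sees; weighting the split criterion also changes which splits look best, which is one plausible reason the two schemes can rank variables differently.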
R Weights
[Plot: importance vs. variable number (x-axis 0 to 30, y-axis 0.000 to 0.025)]
Important Variables (R)
• Fat (g)
• Monounsaturated fat (g)
• Saturated fat (g)
• Sodium (mg)
• Polyunsaturated fat (g)
For all weights!
Salford Weights
[Plot: importance vs. variable number (x-axis 0 to 30, y-axis 0 to 14)]
Important Variables (Salford)
• Carb (g)
• Polyunsaturated fat (g)
• Caffeine (mg)
• Cholesterol (mg)
• Fiber (g)
• Protein (g)
• Trans fat (g)
• Fat (g)
Example 2: Memory
Memory: Predict survival
Error rate 28.2%
Actual \ Predicted   Survived   Died   Error rate (%)
Survived                  839    591   41
Died                      359   1584   18
Memory: Predict dementia?
Error rate 28.1%
Actual \ Predicted   Normal   Demented   Error rate (%)
Normal                 2410         24    0.99
Demented                926         13   98.62
Class Weights
30% weight Normal, 70% weight Demented
Error rate 38%
Actual \ Predicted   Normal   Demented   Error rate (%)
Normal                 1646        788   32
Demented                508        431   54
Salford Weights
[Plot: importance vs. variable number (x-axis 0 to 80, y-axis 0.000 to 0.010)]
R Weights
[Plot: importance vs. variable number (x-axis 0 to 80, y-axis 0 to 4)]
Salford Weights
[Plot: importance (y-axis 0 to 10), x-axis 0 to 80]
R Weights
[Plot: importance (y-axis 0.000 to 0.010), x-axis 0 to 80]
Summary
• R weights only take a weighted bootstrap sample
• Salford does weighted splits as well
• Salford weights can give different variable importance
Archetypes
Cutler and Breiman, Technometrics, 1994
• Unsupervised learning, alternative to cluster analysis or PCA
• Summarize data using a fixed number of “archetypes”
• The archetypes are extremes
• Data points are approximated by mixtures of archetypes
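In symbols, the standard formulation from the 1994 paper can be written as below. The notation is chosen here for illustration: data points x_1, ..., x_n, archetypes z_1, ..., z_p, and mixture weights alpha and beta.

```latex
\[
\begin{aligned}
x_i &\approx \sum_{k=1}^{p} \alpha_{ik}\, z_k,
  & \alpha_{ik} &\ge 0,\quad \sum_{k=1}^{p} \alpha_{ik} = 1,\\
z_k &= \sum_{j=1}^{n} \beta_{kj}\, x_j,
  & \beta_{kj} &\ge 0,\quad \sum_{j=1}^{n} \beta_{kj} = 1,\\
\mathrm{RSS} &= \sum_{i=1}^{n} \Bigl\| x_i - \sum_{k=1}^{p} \alpha_{ik}\, z_k \Bigr\|^{2}.
\end{aligned}
\]
```

The weights are chosen to minimize the RSS. Because each archetype is itself a mixture of data points, the minimizing archetypes sit on the boundary of the data, which is the sense in which they are "extremes".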
Archetypes
Example 1 (cookbooks):
• Archetypes represent extreme recipes
• A particular recipe is approximated as a mixture of the extreme recipes
Example 2 (memory):
• Archetypes represent extreme dietary patterns
• A person’s diet is approximated as a mixture of the extreme diets
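To make "approximated as a mixture" concrete, the sketch below finds the mixture weights for one data point when the archetypes are already known. It uses projected gradient descent onto the probability simplex, which is a convenient stand-in chosen for brevity and not necessarily the algorithm used in the talk; the three two-dimensional archetypes are made up.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:  # keep the shift from the largest j that still qualifies
            theta = t
    return [max(x - theta, 0.0) for x in v]

def mixture_weights(x, archetypes, n_iter=2000):
    """Weights w (non-negative, summing to 1) minimizing ||x - sum_k w_k z_k||^2."""
    k, d = len(archetypes), len(x)
    squared_norms = sum(zj * zj for z in archetypes for zj in z)
    step = 1.0 / (2.0 * squared_norms) if squared_norms > 0 else 1.0  # safe step size
    w = [1.0 / k] * k
    for _ in range(n_iter):
        fit = [sum(w[a] * archetypes[a][j] for a in range(k)) for j in range(d)]
        resid = [x[j] - fit[j] for j in range(d)]
        grad = [-2.0 * sum(archetypes[a][j] * resid[j] for j in range(d))
                for a in range(k)]
        w = project_to_simplex([w[a] - step * grad[a] for a in range(k)])
    return w

# Three made-up "extreme" points in two dimensions.
archetypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
inside = mixture_weights([0.2, 0.3], archetypes)   # a point inside the triangle
outside = mixture_weights([2.0, 0.0], archetypes)  # a point beyond archetype 2
print([round(w, 3) for w in inside])
print([round(w, 3) for w in outside])
```

A point inside the hull of the archetypes is reproduced exactly by its weights, while a point outside is matched to the nearest mixture, here the second archetype alone.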
Example 1: Cookbooks
Cookbooks: How many archetypes?
[Plot: RMSE (y-axis, 300 to 450) vs. number of archetypes (x-axis, 2 to 10)]
[Figures: archetype plots for 3, 4, 5, 6, and 7 archetypes, with vertices labeled by archetype number]
[Figures: six-archetype plots (vertices numbered 1 to 6), one panel per cookbook, Cookbook 1 through Cookbook 12]
Example 2: Memory
Memory: How many archetypes?
[Plot: RMSE (y-axis, 2.5 to 4.0) vs. number of archetypes (x-axis, 2 to 10)]
[Figures: four seven-archetype plots (vertices numbered 1 to 7), colored in turn by dementia status, smoking status, drinking status, and age]
Development
Random forests:
• Regression version
• Case weights
• Probability estimates
• Proximities
• Multivariate outcomes
Archetypes:
• Archetypal functions
• Archetypal sets