
Transcript of Biostatistics Case Studies 2009, Peter D. Christenson, Biostatistician, Session 1: Classification Trees

Page 1: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Biostatistics Case Studies 2009

Peter D. Christenson

Biostatistician

http://gcrc.labiomed.org/biostat

Session 1:

Classification Trees

Page 2: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Case Study

Goal of paper: Classify subjects as IR or non-IR using subject characteristics other than a definitive IR measure such as the clamp:

Gender, weight, lean body mass, BMI, waist and hip circumferences, LDL, HDL, total cholesterol, triglycerides, FFA, DBP, SBP, fasting insulin and glucose, HOMA, family history of diabetes, and some ratios derived from these.

Page 3: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Major Conclusion Using All Predictors

[Figure: the HOMA-BMI plane partitioned into IR and non-IR regions, with cutpoints at BMI 27.5 and 28.9 and at HOMA-IR 3.60 and 4.65; see p. 336, 1st column of the paper.]

Page 4: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Broad Approaches to Classification

1. Cluster analyses – geometric.

2. Regression, discriminant analysis – modeling.

3. Trees – subgroup partitioning.

Page 5: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classification Tree Steps

There are several flavors of tree methods, each with many options, but most involve:

• Specifying criteria for predictive accuracy.

• Tree building.

• Tree building stopping rules.

• Pruning.

• Cross-validation.

Page 6: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Overview of Classification Trees

General concept, based on subgroupings:

1. Form subgroups according to combinations of characteristics, each split as High or Low.

2. The splits and selection of variables are determined from IR rates in each subgroup.

3. Classify each subgroup as IR or not.

Notes:

1. No model or statistical assumptions.

2. And so no p-values.

3. Many options are involved in grouping details.

4. Actually implemented hierarchically – next slide.

Page 7: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Figure 2

[Figure 2 from the paper: the tree's branches, one classified as IR (IR rate = 81%) and the other classified as non-IR (IR rate = 13%).]

Page 8: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Subgroups Combined as IR or Not

Graphical representation of Figure 2:

[Figure: the HOMA-BMI plane divided into Subgroups 1-8 by cutpoints at HOMA-IR 3.11, 3.60, and 4.65 and at BMI 27.5 and 28.9; the subgroups are then combined and labeled IR or non-IR.]

Page 9: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Example of Finding Optimal Subgroups

Simulated data:

[Plot: proportion insulin resistant (0.0 to 0.9) vs. HOMA (3.0 to 9.0). Overall IR rate = 700/2138 = 33%.]

Split at 3.5: IR rates are ~ 0.4 vs. 0.1 → 4-fold

Split at 4.0: IR rates are ~ 0.6 vs. 0.12 → 5-fold

Split at 7.0: IR rates are ~ 0.8 vs. 0.4 → 2-fold

Choose the HOMA cutpoint that maximizes the fold difference.

Then repeat for all other variables.

Then choose the variable with the maximum fold difference, as in the sketch below.
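Here is a minimal Python sketch of this cutpoint scan on simulated stand-in data. The variable names, the lognormal distribution, the coefficients, and the minimum group size of 50 are assumptions for illustration only, not the study's data or software.

import numpy as np

# Simulated stand-in data (not the study data): HOMA values and 0/1 IR status.
rng = np.random.default_rng(0)
homa = rng.lognormal(1.4, 0.4, 2138)
ir = (rng.random(2138) < 1 / (1 + np.exp(4.83 - 0.933 * homa))).astype(int)

best_cut, best_fold = None, 0.0
for cut in np.arange(3.0, 9.1, 0.1):
    high, low = ir[homa > cut], ir[homa <= cut]
    if min(len(high), len(low)) < 50 or low.mean() == 0:
        continue                                   # skip splits that leave a very small group
    fold = high.mean() / low.mean()                # fold difference in IR rates between the two groups
    if fold > best_fold:
        best_cut, best_fold = cut, fold

print(f"cutpoint with largest fold difference: {best_cut:.1f} ({best_fold:.1f}-fold)")

The same scan would then be repeated for each candidate predictor, keeping the variable and cutpoint with the largest fold difference.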

Page 10: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Criteria for Optimal Splitting

• The relative rates – fold or difference – as in the previous slide could be used.

• For classification, sensitivities and specificities may be used since they describe misclassifications.

Page 11: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classification Tree with One Predictor

• Choose every possible HOMA value and find sensitivity and specificity (as in creating a ROC curve).

• Assign weights to the relative importance of sensitivity and specificity; often they are weighted equally, or the area under the ROC curve (AUC) is used.

• Using these weights, find a cutpoint h so that we decide:
  – If HOMA > h, classify as IR.
  – If HOMA ≤ h, classify as non-IR.

Page 12: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classification Accuracies by Cutpoints

From SAS (“Event” = IR):

HOMA    ----Correct----    ---Incorrect---    ---------------- Percentages ----------------
Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False Pos  False Neg
 3.5     588     1117       321      112       79.7       84.0        77.7        35.3       9.1
 4.0     542     1237       201      158       83.2       77.4        86.0        27.1      11.3
 4.5     506     1302       136      194       84.6       72.3        90.5        21.2      13.0
 5.0     458     1333       105      242       83.8       65.4        92.7        18.7      15.4
 5.5     400     1350        88      300       81.9       57.1        93.9        18.0      18.2
 6.0     338     1363        75      362       79.6       48.3        94.8        18.2      21.0
 6.5     286     1369        69      414       77.4       40.9        95.2        19.4      23.2
 7.0     233     1390        48      467       75.9       33.3        96.7        17.1      25.1
 7.5     174     1403        35      526       73.8       24.9        97.6        16.7      27.3
 8.0     123     1416        22      577       72.0       17.6        98.5        15.2      29.0
 8.5      60     1428        10      640       69.6        8.6        99.3        14.3      30.9
 9.0       0     1438         0      700       67.3        0.0       100.0          .       32.7

If the overall percentage correct is used to choose the “optimal” cutpoint, then cutpoint=4.61, with % correct=85.2%, sensitivity=71.7% and specificity=91.7%.

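A short Python sketch that computes the same quantities (percent correct, sensitivity, specificity) by cutpoint. It reuses the simulated homa and ir arrays from the sketch on page 9, so its numbers are stand-ins and will not match the SAS output above.

import numpy as np

for cut in np.arange(3.5, 9.5, 0.5):
    pred = homa > cut                                   # classify as IR when HOMA exceeds the cutpoint
    tp = np.sum(pred & (ir == 1)); fn = np.sum(~pred & (ir == 1))
    tn = np.sum(~pred & (ir == 0)); fp = np.sum(pred & (ir == 0))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    correct = (tp + tn) / len(ir)
    print(f"{cut:4.1f}  correct {100*correct:5.1f}%  sensitivity {100*sens:5.1f}%  specificity {100*spec:5.1f}%")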

Page 13: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Final Tree if Only One Variable

Numbers differ slightly from the paper because I simulated the real data.

[Tree diagram: root node 700/2138 = 32.7% IR; splitting at HOMA ≤ 4.61 vs. HOMA > 4.61 gives 198/1517 = 13.1% IR and 502/621 = 80.8% IR.]

Using 4.61 minimizes the percentage misclassified.

Is it OK to find a p-value for 13% vs. 81% with a chi-square or Fisher’s exact test?

Page 14: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Now Repeat for Each of These 2 Subgroups:

[Figure 2 again: the branch classified as IR (IR rate = 81%) and the branch classified as non-IR (IR rate = 13%), each of which is now split further.]

Page 15: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Trees: Major Decisions to Make

• “Difference” measure: fold difference or misclassification rate?

• “Depth” of tree: when to stop?

• Which subgroups to combine.

Page 16: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Alternative

More Standard Analysis:

Logistic Regression

Page 17: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Alternative: Logistic Regression

1. Find equation:

Prob(IR) = function(w1*BMI + w2*HOMA + w3*LDL +...)

where the w’s are weights (coefficients).

2. Classify as IR if Prob(IR) is large enough.

Note:

• Assumes functional specification (“model”).

• Gives p-values (which depend on model being correct).

• Find a Prob(IR) cutoff to satisfy desired tradeoffs between sensitivity and specificity.
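A minimal sketch of this alternative in Python with statsmodels, which reports the coefficient p-values noted above. The data frame df, its column names, and the coefficient values used to simulate the outcome are assumptions for illustration only.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical stand-in data; the real study variables would replace these.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "BMI": rng.normal(28, 4, 500),
    "HOMA": rng.lognormal(1.4, 0.4, 500),
    "LDL": rng.normal(3.5, 0.8, 500),
})
u = -6.5 + 0.87 * df["HOMA"] + 0.07 * df["BMI"]
df["IR"] = (rng.random(500) < np.exp(u) / (1 + np.exp(u))).astype(int)

X = sm.add_constant(df[["BMI", "HOMA", "LDL"]])   # design matrix with an intercept column
fit = sm.Logit(df["IR"], X).fit()                 # maximum-likelihood logistic regression
print(fit.summary())                              # weights (coefficients) with p-values

prob_ir = fit.predict(X)                          # Prob(IR) for each subject
classify_ir = prob_ir > 0.5                       # classify as IR if Prob(IR) is large enough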

Page 18: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Logistic Regression with Only One Predictor

Simulated data:

[Plot: proportion insulin resistant (0.0 to 0.9) vs. HOMA (3.0 to 9.0), with the fitted logistic curve overlaid. A logistic curve has this sigmoidal shape.]

Fitted logistic model predicts probability of IR as:

Prob(IR) = e^u/(1 + e^u), where u = -4.83 + 0.933(HOMA)

Page 19: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Using Logistic Model for Classification

• The logistic model proves that the risk of IR ↑ as HOMA ↑ (significance of the coefficient 0.933 is p<0.0001).

• How can we classify as IR or not based on HOMA?

• Use Prob(IR). Need a cutpoint c so that we classify as:
  – If Prob(IR) > c, then classify as IR.
  – If Prob(IR) ≤ c, then classify as non-IR.

• Regression does not supply c. It is chosen to balance sensitivity and specificity.

Page 20: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

IR and HOMA: Logistic with Arbitrary Cutpoint

[Plot of the simulated data: proportion insulin resistant (0.0 to 0.9) vs. HOMA (3.0 to 9.0), with the fitted logistic curve and the cutpoint c = 0.50 marked.]

If cutpoint c = 0.50 is chosen, then we have:

Assign IR: actual IR N = 440, actual non-IR N = 99.
Assign non-IR: actual IR N = 260, actual non-IR N = 1339.

Sensitivity = 440/(440 + 260) = 62.9%

Specificity = 1339/(1339 + 99) = 93.1%

Page 21: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Logistic with Other Cutpoints

From SAS (“Event” = IR): Classification Table

Prob    ----Correct----    ---Incorrect---    ---------------- Percentages ----------------
Level   Event  Non-Event   Event  Non-Event   Correct  Sensitivity  Specificity  False Pos  False Neg
0.100    700       0       1438       0        32.7      100.0         0.0        67.3        .
0.200    567    1181        257     133        81.8       81.0        82.1        31.2      10.1
0.300    521    1277        161     179        84.1       74.4        88.8        23.6      12.3
0.400    485    1325        113     215        84.7       69.3        92.1        18.9      14.0
0.500    440    1339         99     260        83.2       62.9        93.1        18.4      16.3
0.600    386    1354         84     314        81.4       55.1        94.2        17.9      18.8
0.700    331    1363         75     369        79.2       47.3        94.8        18.5      21.3
0.800    272    1376         62     428        77.1       38.9        95.7        18.6      23.7
0.900    171    1404         34     529        73.7       24.4        97.6        16.6      27.4

Often, the overall percentage correct is used to choose the “optimal” cutpoint. Here, that gives cutpoint=0.37, with % correct=85.2%, sensitivity=71.7% and specificity=91.7%.


These are the data used to create an ROC curve.
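The cutpoint-by-cutpoint sensitivities and specificities above are exactly what an ROC curve traces out. A scikit-learn sketch on simulated stand-in data (the arrays, seed, and coefficients are assumptions, not the study results):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated stand-ins: true IR status and fitted Prob(IR) from a one-predictor logistic model.
rng = np.random.default_rng(2)
homa = rng.lognormal(1.4, 0.4, 2138)
prob_ir = 1 / (1 + np.exp(4.83 - 0.933 * homa))
ir = (rng.random(2138) < prob_ir).astype(int)

fpr, tpr, cutpoints = roc_curve(ir, prob_ir)       # tpr = sensitivity, 1 - fpr = specificity
print("area under the ROC curve:", roc_auc_score(ir, prob_ir))

# One common "optimal" choice: the probability cutpoint maximizing overall percent correct.
n_ir, n_non = ir.sum(), len(ir) - ir.sum()
percent_correct = (tpr * n_ir + (1 - fpr) * n_non) / len(ir)
print("cutpoint maximizing % correct:", round(float(cutpoints[np.argmax(percent_correct)]), 2))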

Page 22: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

One Predictor: Logistic is Equivalent to Tree

[Plot of the simulated data: proportion insulin resistant (0.0 to 0.9) vs. HOMA (3.0 to 9.0). The logistic cutpoint and the tree cutpoint split the subjects identically:]

Assign IR: actual IR N = 502, actual non-IR N = 119.
Assign non-IR: actual IR N = 198, actual non-IR N = 1319.

But no proof or p-value from tree.

Prob(IR) = e^u/(1 + e^u), where u = -4.83 + 0.933(HOMA)

So Prob(IR) ↑ as HOMA ↑ (monotonic)
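As an arithmetic check of this equivalence (using the simulated fit above): at HOMA = 4.61, u = -4.83 + 0.933(4.61) ≈ -0.53, so Prob(IR) = e^(-0.53)/(1 + e^(-0.53)) ≈ 0.37. Because the logistic curve is monotone in HOMA, "Prob(IR) > 0.37" and "HOMA > 4.61" select exactly the same subjects, which is why the tree cutpoint on page 13 and the logistic cutpoint on page 21 give identical classifications.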

Page 23: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Summary: One Predictor

Logistic regression and tree give same conclusion.

Regression can prove a predictor-outcome association by providing a p-value.

This proof depends on a “smooth” equation.

We can verify whether the data follow this smooth relation.

Tree: No p-value, but can validate on other data.

Page 24: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

More Than One Predictor

Page 25: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

IR Rate: BMI and HOMA

[3-D plot: % IR (0 to 100) by BMI (20 to 40) and HOMA (3.0 to 9.0).]

1. %IR may increase non-smoothly with HOMA and BMI.

2. Logistic regression fits a smooth surface to these %s.

Page 26: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classifying IR from HOMA and BMI: Logistic

[Figure: the HOMA-BMI plane partitioned by a straight line into IR and non-IR regions (reference cutpoints 27.5 and 28.9 on BMI, 3.60 and 4.65 on HOMA-IR). IR equation: 0.87(HOMA) + 0.07(BMI) = cutpoint.]

Logistic regression forces a smooth partition such as this straight demarcation line, although adding a HOMA-BMI interaction could give curvature to the line.
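A small Python sketch of this smooth partition, using the coefficients from the IR equation above together with the intercept from the fitted model on page 29; the probability cutpoint of 0.37 is assumed here only for illustration.

import numpy as np

def prob_ir(homa, bmi):
    """Fitted two-predictor logistic model (coefficients from page 29)."""
    u = -6.51 + 0.87 * homa + 0.07 * bmi
    return np.exp(u) / (1 + np.exp(u))

cut = 0.37                                 # assumed probability cutpoint
line = np.log(cut / (1 - cut)) + 6.51      # the boundary 0.87*HOMA + 0.07*BMI = constant

# The two statements below make the same decision for any subject:
print(prob_ir(5.0, 30.0) > cut)            # classify via the probability
print(0.87 * 5.0 + 0.07 * 30.0 > line)     # classify via the linear demarcation line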

Compare this to the tree partitioning on the next slide.

Page 27: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classifying IR from HOMA and BMI: Trees

[Figure: the same HOMA-BMI plane partitioned by the tree into rectangular subgroups (cutpoints 27.5 and 28.9 on BMI, 3.60 and 4.65 on HOMA-IR), combined into IR and non-IR regions.]

Trees partition HOMA-BMI combinations into subgroups, some of which are then combined as IR and non-IR.
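For comparison, a minimal sketch of growing such a partition with scikit-learn's CART-style tree, using the hypothetical df from the page 17 sketch; the depth and minimum leaf size are arbitrary choices, so the split points will not reproduce the paper's.

from sklearn.tree import DecisionTreeClassifier, export_text

X, y = df[["HOMA", "BMI"]], df["IR"]        # hypothetical data from the page 17 sketch
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["HOMA", "BMI"]))   # prints the splits, e.g. "HOMA <= 4.6"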

Page 28: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Disadvantage of Trees

[Figure: the tree's partition of the HOMA-BMI plane again (cutpoints 27.5 and 28.9 on BMI, 3.60 and 4.65 on HOMA-IR), with IR and non-IR regions.]

Can Obtain Implausible Biological Conclusions

Page 29: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Disadvantage of Regression

Potential Modeling Inadequacy

The fitted model

Prob(IR) = e^u/(1 + e^u), where u = -6.51 + 0.87(HOMA) + 0.07(BMI),

is adequate only if Prob(IR) increases smoothly as 0.87(HOMA) + 0.07(BMI) does. The data may not follow the logistic curve:

[Left: 3-D plot of % IR (0 to 100) by BMI (20 to 40) and HOMA (3.0 to 9.0), where the data may not follow a logistic surface. Right: plot of proportion insulin resistant vs. HOMA (3.0 to 9.0), where the simulated data do follow the logistic curve.]

Page 30: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Use Trees or Logistic Regression?

Logistic:

• Requires specification of smooth function and any interrelations among predictors.

• Gives a probabilistic statement (p-value) about the generalizability of whether predictors are associated with IR – “proof”.

• P-value somewhat dependent on correct specification.

• Both technical and heuristic ways to check specifications.

Page 31: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Use Trees or Logistic Regression?

Trees:

• Interrelations not pre-specified, but detected in the analysis.

• Does not prove associations beyond reasonable doubt – i.e., no traditional “proof”.

• Conclusions may be very specific to data in the sample.

• With enough subjects, can generate trees from some subjects and validate in others. Provides credibility as in replicated experiments.

Page 32: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Validation in Paper Appendix?

IR prediction rule for each cross-validation (CV) split, with area under the ROC curve, sensitivity, and specificity:

CV 1: (HOMA-IR > 4.65) or (BMI > 34.8) or (BMI > 27.5 & LDL < 4.77); ROC area 87.02%, sensitivity 82.72%, specificity 71.43%.
CV 2: (HOMA-IR > 4.49) or (BMI > 28.1 & LDL < 4.94); ROC area 87.05%, sensitivity 81.82%, specificity 74.32%.
CV 3: (HOMA-IR > 7.36) or (BMI > 28.1) or (HOMA-IR > 4.49 & (BMI > 23.5 or LDL > 3.26)); ROC area 89.79%, sensitivity 88.41%, specificity 70.34%.
CV 4: (HOMA-IR > 7.40) or (BMI > 27.5 & LDL < 4.91) or (HOMA-IR > 4.49 & (BMI > 23.8 or LDL > 3.22)); ROC area 88.57%, sensitivity 82.54%, specificity 73.51%.
CV 5: (HOMA-IR > 3.09); ROC area 90.54%, sensitivity 85.07%, specificity 73.47%.
CV 6: (HOMA-IR > 3.00); ROC area 87.90%, sensitivity 80.28%, specificity 74.13%.
CV 7: (HOMA-IR > 4.49) or (BMI > 28.4 & LDL < 4.91); ROC area 84.89%, sensitivity 75.00%, specificity 78.08%.
CV 8: (HOMA-IR > 4.65) or (BMI > 28.7) or (HOMA-IR > 3.11 & LDL < 2.85); ROC area 89.61%, sensitivity 78.08%, specificity 83.69%.
CV 9: (HOMA-IR > 4.66) or (BMI > 28.5) or (HOMA-IR > 3.11 & LDL < 2.49); ROC area 89.05%, sensitivity 85.37%, specificity 78.63%.
CV 10: (HOMA-IR > 3.00) or (BMI > 34.6); ROC area 92.68%, sensitivity 96.67%, specificity 67.32%.

Page 33: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Use Trees or Logistic Regression?

Generally:

Small N: Cannot use trees.

Large N: Trees useful if there are unknown but suspected complex interrelations among predictors.

Pre-specifying criteria and validation are critical.

Page 34: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classification Tree Software

• CART from www.salford-systems.com.

• Statistica from www.statsoft.com.

• SAS: in the Enterprise Miner module.

• SPSS: Has Salford CART in their Clementine Data Mining module.

Page 35: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Open Question

Isn’t this paper really:

Glucose Disposal < 28 µmol/min/kg

How could glucose disposal itself be used as the outcome?

Page 36: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Use Trees or Logistic Regression?

Generally:

Small N: Cannot use trees.

Large N: Trees are useful if there are unknown but suspected complex interrelations among predictors. Pre-specifying criteria and validation are critical.

Page 37: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Remaining Slides:

Overview of Decision Details Needed for Classification Trees

Appendix

Page 38: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Classification Tree Steps

There are several flavors of tree methods, each with many options, but most involve:

• Specifying criteria for predictive accuracy.

• Tree building.

• Tree building stopping rules.

• Pruning.

• Cross-validation.

Page 39: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Specifying criteria for predictive accuracy

• Misclassification cost generalizes the concept of misclassification rates so that some types of misclassifying are given greater weight.

• Relative weights, or costs, are assigned to each type of misclassification.

• A prior probability of each outcome is specified, usually as the observed prevalence of the outcome in the data, but it could also come from previous research or apply to other populations.

• The costs and priors together give the criteria for balancing specificity and sensitivity. Using the observed prevalence and equal weights amounts to minimizing overall misclassification; a sketch follows below.
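In scikit-learn's CART-style trees, a rough analogue of these costs and priors is the class_weight argument. The sketch below is illustrative only: the 3:1 cost is an assumed example, and this is not the mechanism of the software used in the paper.

from sklearn.tree import DecisionTreeClassifier

# Assumed example: treat missing an IR subject (outcome 1) as 3x as costly as a false positive.
cost_weighted_tree = DecisionTreeClassifier(class_weight={0: 1, 1: 3}, min_samples_leaf=50)

# class_weight="balanced" re-weights by inverse prevalence, acting like equal prior probabilities
# rather than the observed prevalence.
prior_weighted_tree = DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=50)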

Page 40: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Tree Building

• Recursively apply what we did for HOMA for each of the two resulting partitions, then for the next set, etc.

• Every factor is screened at every step. The same factor may be reused.

• Some algorithms allow certain linear combinations of factors (e.g., as logistic regression provides, called discriminant functions) to be screened.

• An “impurity measure” or “splitting function” specifies the criterion for measuring how different two potential new subgroups are. Some choices are “Gini”, chi-square, and G-square; a Gini sketch follows below.
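A short sketch of the Gini choice: for a two-class subgroup with event proportion p, the impurity is 2p(1 - p), and a candidate split is scored by how much it lowers the size-weighted impurity of the two new subgroups. The example call plugs in the subgroup sizes and IR rates from the one-variable tree on page 13.

def gini(p):
    """Gini impurity of a two-class subgroup with event proportion p."""
    return 2 * p * (1 - p)

def split_improvement(n_left, p_left, n_right, p_right):
    """Reduction in impurity from splitting a parent group into two subgroups."""
    n = n_left + n_right
    p_parent = (n_left * p_left + n_right * p_right) / n
    weighted_children = (n_left * gini(p_left) + n_right * gini(p_right)) / n
    return gini(p_parent) - weighted_children

# Parent group 32.7% IR split into 13.1% IR (n = 1517) and 80.8% IR (n = 621), as on page 13:
print(split_improvement(1517, 0.131, 621, 0.808))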

Page 41: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Tree Building Stopping Rules

• It is possible to continue splitting and building a tree until all subgroups are “pure” with only one type of outcome. This may be too fine to be useful.

• One alternative is “minimum N”: allow only pure subgroups or subgroups of at least a minimum size.

• Another choice is “fraction of objects”, in which splitting stops once a minimum fraction of an outcome class, or a pure class, is obtained; a parameter sketch follows below.
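These stopping rules map onto growing parameters in most software; the scikit-learn parameter names below are stand-ins for the options described above, and the specific numbers are arbitrary.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_leaf=30,          # "minimum N": no subgroup smaller than 30 subjects
    min_samples_split=60,         # do not attempt to split groups smaller than 60
    min_impurity_decrease=0.001,  # stop when a split barely improves subgroup purity
)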

Page 42: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Tree Pruning

• Pruning tries to solve the problem of lack of generalizability due to over-fitting the results to the data at hand.

• Start at the latest splits and measure the magnitude of the reduced misclassification due to that split. Remove the split if it is not large.

• How large is “not large”? This can be made at least objective, if not foolproof, by a complexity parameter related to the depth of the tree, i.e., the number of levels of splits. Combining that with the misclassification cost function gives “cost-complexity pruning”, the method used in this paper (sketched below).
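A minimal scikit-learn sketch of cost-complexity pruning: grow a deep tree, compute the pruning path of candidate complexity parameters (alpha), and refit at a chosen alpha. X and y are the hypothetical HOMA/BMI data from the page 27 sketch, and the alpha chosen here is arbitrary; in practice it is picked by cross-validation.

from sklearn.tree import DecisionTreeClassifier

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)    # grown until subgroups are pure
path = full_tree.cost_complexity_pruning_path(X, y)             # candidate alphas and total impurities
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]              # arbitrary choice for illustration
pruned_tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), "leaves before pruning,", pruned_tree.get_n_leaves(), "after")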

Page 43: Biostatistics Case Studies 2009 Peter D. Christenson Biostatistician  Session 1: Classification Trees.

Cross-Validation

• At least two data sets are used. The decision rule is built with training set(s) and applied to test set(s).

• If the misclassification costs for the test sets are similar to those for the training sets, then the decision rule is considered “validated”.

• With large datasets, as in business data mining, only one training set and one test set are used.

• For smaller datasets, “v-fold cross-validation” is used. The data are randomly split into v sets. Each set serves as the test set once, with the combined remaining v-1 sets as the training set, and serves v-1 times as part of the training set, for v analyses. The average misclassification cost is compared to that for the entire data set, as in the sketch below.
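A sketch of v-fold cross-validation with v = 10 in scikit-learn, again on the hypothetical X and y from the page 27 sketch. The accuracy scoring is an assumption; a real analysis might instead use the misclassification cost described earlier.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")   # each of the 10 folds is the test set once
print("mean accuracy:", scores.mean(), " standard deviation:", scores.std())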