1 High Throughput Target Identification Stan Young, NISS Doug Hawkins, U Minnesota Christophe Lambert, Golden Helix Machine Learning, Statistics, and Discovery 25 June 03


High Throughput Target Identification

Stan Young, NISS

Doug Hawkins, U Minnesota

Christophe Lambert, Golden Helix

Machine Learning, Statistics, and Discovery

25 June 03


Micro Array Literature

Publication Year   All Journals   PNAS
1992               0              0
1993               0              0
1994               0              0
1995               4              0
1996               3              1
1997               8              2
1998               37             1
1999               134            8
2000               409            34
2001               773            46


Guilt by Association: You are known by the company you keep.


Data Matrix

Goal: Associations over the genes.

[Figure: data matrix of genes by tissues, with one gene marked as the guilty gene.]


Goals

1. Associations.

2. Deep associations – beyond 1st level correlations.

3. Uncover multiple mechanisms.


Problems

1. n << p

2. Strong correlations.

3. Missing values.

4. Non-normal distributions.

5. Outliers.

6. Multiple testing.


Technical Approach

1. Recursive partitioning.

2. Resampling-based, adjusted p-values.

3. Multiple trees.


Recursive Partitioning

Tasks

1. Create classes.

2. How to split.

3. How to stop.


Differences:

Recursive Partitioning

• Top-down analysis.

• Can use any type of descriptor.

• Uses biological activities to determine which features matter.

• Produces a classification tree for interpretation and prediction.

• Big N is not a problem!

• Missing values are ok.

• Multiple trees, big p is ok.

Clustering

• Often bottom-up.

• Uses “gestalt” matching.

• Requires an external method for determining the right feature set.

• Difficult to interpret or use for prediction.

• Big N is a severe problem!!


Forming Classes, Categories, Groups

Profession         Av. Income
Baseball Players   1.5M
Football Players   1.2M
Doctors            .8M
Dentists           .5M
Lawyers            .23M
Professors         .09M
. . . . .


Forming Classes from “Continuous” Descriptor

[Figure: number line for the descriptor, with ticks from -3 to 6.]

How many “cuts” and where to make them?
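For the simplest case of a single cut, one possible answer is to scan the midpoints between sorted descriptor values and keep the cut that gives the largest absolute t statistic on the response. The sketch below is only an illustration under that assumption: the function names, the `min_size` parameter, and the data are made up, and the slides do not say how the actual search is done.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic using a pooled standard deviation."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    if sp2 == 0:
        # Both groups are constant: no noise to compare the signal against.
        return math.copysign(math.inf, diff) if diff else 0.0
    return diff / math.sqrt(sp2 * (1 / na + 1 / nb))

def best_single_cut(x, y, min_size=2):
    """Scan midpoints between sorted x values; return (cut, |t|) for the best split."""
    pairs = sorted(zip(x, y))
    best = (None, 0.0)
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between tied descriptor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        t = abs(pooled_t(left, right))
        if t > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, t)
    return best

# Made-up data: the response jumps once the descriptor exceeds about 2.
x = [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
y = [0.1, 0.3, 0.2, 0.0, 0.4, 0.2, 2.5, 2.7, 2.4, 2.6]
cut, t = best_single_cut(x, y)
print(cut, round(t, 2))
```

Because the cut is chosen after looking at many candidates, its raw p-value is optimistic, which is why class formation appears as its own level of adjustment in the multiple testing list.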


Splitting : t-test

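Parent node: n = 1650, ave = 0.34, sd = 0.81

Child node: n = 1614, ave = 0.29, sd = 0.73

Child node: n = 36, ave = 2.60, sd = 0.9

\[
t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734 \sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68
\]

TT: NN-CC

rP = 2.03E-70

aP = 1.30E-66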
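The arithmetic on this slide can be checked from the node summaries alone. The sketch below assumes the noise term uses a pooled standard deviation of the two child nodes, which reproduces the 0.734 in the formula; the function name is made up for illustration.

```python
import math

def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, pooled_sd) for two groups described by n, mean and sd."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    noise = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / noise, pooled_sd

# Child nodes from the slide: n = 36 and n = 1614.
t, pooled_sd = pooled_t_from_summary(36, 2.60, 0.9, 1614, 0.29, 0.73)
print(f"pooled sd = {pooled_sd:.3f}, t = {t:.2f}")  # slide reports 0.734 and 18.68
```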


Splitting : F-test

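Parent node: n = 1650, ave = 0.34, sd = 0.81

Child node: n = 1553, ave = 0.21, sd = 0.73

Child node: n = 36, ave = 2.60, sd = 0.9

Child node: n = 61, ave = 1.29, sd = 0.83

\[
F = \frac{\text{Signal}}{\text{Noise}} = \frac{\text{Among Var}}{\text{Within Var}} = \frac{\sum_i n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 / df_1}{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2 / df_2}
\]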
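The same ratio can be formed from the child-node summaries. This is a minimal sketch, assuming the usual one-way analysis of variance with df1 = (number of nodes - 1) and df2 = (n - number of nodes); the function name is made up, and because the slide figures are rounded, the resulting value is only illustrative.

```python
def f_from_summary(groups):
    """groups: list of (n, mean, sd), one tuple per child node."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    df1 = len(groups) - 1
    df2 = n_total - len(groups)
    # Among-node variance: spread of node means around the grand mean.
    among = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups) / df1
    # Within-node variance: pooled spread of observations around their node mean.
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups) / df2
    return among / within, df1, df2

# Child nodes from the slide.
children = [(1553, 0.21, 0.73), (36, 2.60, 0.9), (61, 1.29, 0.83)]
F, df1, df2 = f_from_summary(children)
print(f"F = {F:.1f} on {df1} and {df2} degrees of freedom")
```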


How to Stop

Examine each current terminal node.

Stop if no variable/class has a

significant split, multiplicity adjusted.
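As a rough sketch, the stopping rule amounts to adjusting the best split of every candidate variable and stopping when none passes the threshold. The code below substitutes a simple Bonferroni adjustment for the resampling-based adjustment used in the talk, purely to keep the example short; the function names, the threshold, and the p-values are hypothetical.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value (a stand-in for a resampling adjustment)."""
    return min(1.0, p * n_tests)

def should_stop(raw_p_by_variable, alpha=0.05):
    """Stop splitting a node if no variable has a significant adjusted split."""
    n_tests = len(raw_p_by_variable)
    return all(bonferroni(p, n_tests) >= alpha for p in raw_p_by_variable.values())

# Hypothetical raw p-values for the best split of each variable at two nodes.
print(should_stop({"G1": 0.20, "G2": 0.03}))   # 0.03 * 2 = 0.06, not significant
print(should_stop({"G1": 1e-6, "G2": 0.50}))   # G1 survives the adjustment
```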


Levels of Multiple Testing

1. Raw p-value.

2. Adjust for class formation, segmentation.

3. Adjust for multiple predictors.

4. Adjust for multiple splits in the tree.

5. Adjust for multiple trees.


Understanding observations

NB: Splitting variables govern the process, linked to response variable.

Multiple Mechanisms

Conditionally important descriptors.


Multiple Mechanisms


Reality: Example Data

60 Tissues

1453 Genes

Gene 510 is the “guilty” gene, the Y.


1st Split of Gene 510 (Guilty Gene)


Split Selection

14 splitters with adjusted p-value < 0.05


Histogram

Non-normal, hence resampling p-values make sense.


Resampling-based Adjusted p-value
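The slide does not spell out the procedure, so the following is only a sketch of one common resampling scheme: a single-step max-statistic permutation adjustment. The response is shuffled, the largest statistic over all candidate splits is recorded for each shuffle, and each observed statistic is compared against that distribution. The statistic (absolute difference in means), the function names, and the data are assumptions for illustration; a t statistic could be substituted.

```python
import random
from statistics import mean

def abs_mean_diff(y, in_group):
    """Absolute difference in mean response between the two sides of a split."""
    a = [v for v, g in zip(y, in_group) if g]
    b = [v for v, g in zip(y, in_group) if not g]
    return abs(mean(a) - mean(b))

def max_stat_adjusted_p(y, splits, n_perm=2000, seed=0):
    """Permutation-based adjusted p-values, one per candidate split."""
    rng = random.Random(seed)
    observed = [abs_mean_diff(y, s) for s in splits]
    exceed = [0] * len(splits)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between response and splits
        max_stat = max(abs_mean_diff(y_perm, s) for s in splits)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]

# Made-up response with one informative split and two arbitrary ones.
y = [0.1, 0.3, 0.2, 0.0, 0.4, 0.2, 2.5, 2.7, 2.4, 2.6, 2.8, 2.3]
splits = [
    [False] * 6 + [True] * 6,         # separates low from high responses
    [True, False] * 6,                # not designed to track the response
    [True, True, False, False] * 3,   # not designed to track the response
]
adjusted = max_stat_adjusted_p(y, splits)
print([round(p, 3) for p in adjusted])
```

No normality assumption enters the calculation, which fits the non-normal histogram shown for these data.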


Single Tree RP Drawbacks

• Data greedy.

• Only one view of the data. May miss other mechanisms.

• Highly correlated variables may be obscured.

• Higher order interactions may be masked.

• No formal mechanisms for follow-up experimental design.

• Disposition of outliers is difficult.


Etc.

Multiple Trees, how and why?


How do you get multiple trees?

1. Bootstrap the sample, one tree per sample.

2. Randomize over valid splitters.

Etc.
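Both ideas listed on this slide can be sketched in a few lines. The helper names, the 0.05 threshold, and the adjusted p-values below are hypothetical; only the row count of 60 tissues and the gene labels G518 and G790 come from the example data in the talk.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Sample row indices with replacement: one bootstrap sample per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def pick_random_valid_splitter(adjusted_p, alpha, rng):
    """Choose uniformly among splitters whose adjusted p-value is below alpha."""
    valid = sorted(name for name, p in adjusted_p.items() if p < alpha)
    return rng.choice(valid) if valid else None

rng = random.Random(1)

# 1. Bootstrap the sample (60 tissues in the example data).
rows = bootstrap_rows(60, rng)

# 2. Randomize over valid splitters at one node (hypothetical p-values).
adjusted_p = {"G518": 0.001, "G790": 0.01, "G12": 0.04, "G99": 0.30}
picks = [pick_random_valid_splitter(adjusted_p, 0.05, rng) for _ in range(1000)]
print({gene: picks.count(gene) for gene in sorted(set(picks))})
```

Randomizing over valid splitters lets a gene that is significant but never the single best splitter still appear in some trees, which a single greedy tree would hide.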


Random Tree Browsing, 1000 Trees.


Example Tree


1st Split


Example Tree, 2nd Split


Conclusion for Gene G510

If G518 < -0.56 and G790 < -1.46, then G510 = 1.10 +/- 0.30
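Read as a prediction rule, this terminal node can be written as a small function. The function name and the choice to return `None` for tissues outside the node are assumptions; the slide gives only this one branch of the tree.

```python
def predict_g510(g518, g790):
    """Return (mean, spread) of G510 for the terminal node on the slide, else None."""
    if g518 < -0.56 and g790 < -1.46:
        return (1.10, 0.30)
    return None  # the other branches of the tree are not given on the slide

print(predict_g510(-0.8, -2.0))  # inside the node
print(predict_g510(0.2, -2.0))   # fails the first split
```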


Using Multiple Trees to Understand variables

• Which variables matter?

• How to rank variables in importance.

• Correlations.

• Synergistic variables.
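One simple way to approach these questions is to count how often each gene is used as a splitter across the trees, and how often two genes appear together in the same tree. This is a sketch under assumptions: the tree contents are invented, and treating co-occurrence as a sign of synergy (and never co-occurring as a hint of correlated, interchangeable genes) is an interpretation, not a method stated in the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical result of a multiple-tree run: the splitter genes used in each tree.
trees = [
    ["G518", "G790"],
    ["G518", "G790"],
    ["G519", "G790"],
    ["G518", "G12"],
    ["G519", "G790"],
]

# Importance: number of trees in which each gene appears as a splitter.
usage = Counter(gene for tree in trees for gene in set(tree))

# Co-occurrence: number of trees in which two genes are used together.
together = Counter(
    pair for tree in trees for pair in combinations(sorted(set(tree)), 2)
)

print(usage.most_common())
print(together.most_common())
```

In this made-up run, G518 and G519 are never used in the same tree but each pairs with G790, the pattern one might expect if the two genes carry similar information.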


Correlation Interaction Matrix

Red = synergistic.


Summary

• Reviewed recursive partitioning.

• Demonstrated multiple tree RP’s capabilities:

– Find associated genes.

– Group correlated predictors (genes).

– Find synergistic predictors (genes that predict together).

• Used to understand a complex data set.


Needed research

• Real data sets with known answers.

• Benchmarking.

• Linking to gene annotations.

• Scale (1,000*10,000).

• Multiple testing in complex data sets.

• Good visualization methods.

• Outlier detection for large data sets.

• Missing values. (see NISS paper 123)


Teams

NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger

U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan

U. Minnesota: Douglas Hawkins

NISS: Alan Karr (Consider post docs)

GSK: Lei Zhu, Ray Lam


References/Contact

1. www.goldenhelix.com.

2. www.recursive-partitioning.com.

3. www.niss.org, papers 122 and 123.

4. [email protected]

5. GSK patent.


Questions