
1

High Throughput Target Identification

Stan Young, NISS

Doug Hawkins, U Minnesota

Christophe Lambert, Golden Helix

Machine Learning, Statistics, and Discovery

25 June 03

2

Microarray Literature

Publication Year   All Journals   PNAS
1992                     0           0
1993                     0           0
1994                     0           0
1995                     4           0
1996                     3           1
1997                     8           2
1998                    37           1
1999                   134           8
2000                   409          34
2001                   773          46

3

Guilt by Association:

You are known by the company you keep.

4

Data Matrix

Goal: Associations over the genes.

[figure: genes x tissues data matrix, with one "guilty" gene highlighted as the response]
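To make the setup concrete, here is a minimal Python/NumPy sketch, assuming the 60 tissue x 1453 gene shape from the example data later in the deck; the random matrix and the index 510 are purely illustrative stand-ins for real expression data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the deck's data matrix:
# 60 tissues (rows) x 1453 genes (columns).
X = rng.normal(size=(60, 1453))

# Treat gene 510 as the "guilty" gene: its column becomes the response Y,
# and the remaining columns are the candidate predictors.
y = X[:, 510]
predictors = np.delete(X, 510, axis=1)
print(predictors.shape, y.shape)  # (60, 1452) (60,)
```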

5

Goals

1. Associations.

2. Deep associations – beyond 1st level correlations.

3. Uncover multiple mechanisms.

6

Problems

1. n << p

2. Strong correlations.

3. Missing values.

4. Non-normal distributions.

5. Outliers.

6. Multiple testing.

7

Technical Approach

1. Recursive partitioning.

2. Resampling-based, adjusted p-values.

3. Multiple trees.

8

Recursive Partitioning

Tasks

1. Create classes.

2. How to split.

3. How to stop.
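The three tasks sketch an algorithm. Below is a minimal, hypothetical rendering of that loop; the t-test splitter and the crude Bonferroni stopping rule here are stand-ins, with the deck's actual split statistics and resampling adjustment detailed on the following slides.

```python
import numpy as np
from scipy import stats

def best_split(x, y):
    """Best binary cut of one predictor, scored by two-sample t-test p-value."""
    best_cut, best_p = None, 1.0
    for cut in np.unique(x)[:-1]:            # candidate cut points (task 1)
        left, right = y[x <= cut], y[x > cut]
        if min(len(left), len(right)) < 2:
            continue
        p = stats.ttest_ind(left, right).pvalue
        if p < best_p:
            best_cut, best_p = cut, p
    return best_cut, best_p

def grow(X, y, alpha=0.05, depth=0):
    """Split (task 2) while some adjusted p-value is significant; stop (task 3) otherwise."""
    splits = [(j, *best_split(X[:, j], y)) for j in range(X.shape[1])]
    j, cut, p = min(splits, key=lambda s: s[2])
    if cut is None or p * X.shape[1] > alpha:  # crude Bonferroni stand-in
        print("  " * depth + f"leaf: n={len(y)}, mean={y.mean():.2f}")
        return
    print("  " * depth + f"split on X{j} at {cut:.2f}")
    mask = X[:, j] <= cut
    grow(X[mask], y[mask], alpha, depth + 1)
    grow(X[~mask], y[~mask], alpha, depth + 1)
```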

9

Differences:

Recursive Partitioning
• Top-down analysis.
• Can use any type of descriptor.
• Uses biological activities to determine which features matter.
• Produces a classification tree for interpretation and prediction.
• Big N is not a problem!
• Missing values are OK.
• Multiple trees, so big p is OK.

Clustering
• Often bottom-up.
• Uses “gestalt” matching.
• Requires an external method for determining the right feature set.
• Difficult to interpret or use for prediction.
• Big N is a severe problem!!

10

Forming Classes, Categories, Groups

Profession         Av. Income
Baseball Players   1.5M
Football Players   1.2M
Doctors            0.8M
Dentists           0.5M
Lawyers            0.23M
Professors         0.09M
. . .

11

Forming Classes from “Continuous” Descriptor

[figure: number line from -3 to 6 showing candidate cut points]

How many “cuts” and where to make them?

12

Splitting: t-test

Parent node: n = 1650, ave = 0.34, sd = 0.81.

Split into: n = 1614, ave = 0.29, sd = 0.73 versus n = 36, ave = 2.60, sd = 0.9.

$$t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734\,\sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68$$

Splitter: TT: NN-CC

Raw p = 2.03E-70
Adjusted p = 1.30E-66
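The slide's t statistic can be verified directly from its summary numbers; a worked check in Python (the pooled sd 0.734 is read off the slide):

```python
import math

# Summary statistics for the two child nodes from the slide.
n1, m1 = 36, 2.60    # small "high" node
n2, m2 = 1614, 0.29  # large node
sp = 0.734           # pooled standard deviation

t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(round(t, 2))  # 18.68, matching the slide
```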

13

Splitting: F-test

Parent node: n = 1650, ave = 0.34, sd = 0.81.

Three-way split:
n = 1553, ave = 0.21, sd = 0.73
n = 36, ave = 2.60, sd = 0.9
n = 61, ave = 1.29, sd = 0.83

$$F = \frac{\text{Signal}}{\text{Noise}} = \frac{\text{Among Var}}{\text{Within Var}} = \frac{\sum_i n_i(\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 / df_1}{\sum_{i,j}(X_{ij} - \bar{X}_{i\cdot})^2 / df_2}$$
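The same check for the F statistic, using the slide's three group summaries and the standard one-way ANOVA decomposition (the slide does not report the resulting F, so the printed value is simply what these summaries imply):

```python
# Group summaries (n, mean, sd) from the slide's three-way split.
groups = [(1553, 0.21, 0.73), (36, 2.60, 0.90), (61, 1.29, 0.83)]

N = sum(n for n, _, _ in groups)                       # 1650
k = len(groups)                                        # 3
grand = sum(n * m for n, m, _ in groups) / N           # grand mean

among = sum(n * (m - grand) ** 2 for n, m, _ in groups) / (k - 1)
within = sum((n - 1) * s ** 2 for n, _, s in groups) / (N - k)
print(round(among / within, 1))  # ~241.3, strongly significant
```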

14

How to Stop

Examine each current terminal node. Stop if no variable/class has a significant split, multiplicity adjusted.

15

Levels of Multiple Testing

1. Raw p-value.

2. Adjust for class formation, segmentation.

3. Adjust for multiple predictors.

4. Adjust for multiple splits in the tree.

5. Adjust for multiple trees.

16

Understanding Observations

NB: Splitting variables govern the process; they are linked to the response variable.

Multiple Mechanisms

Conditionally important descriptors.

17

Multiple Mechanisms

18

Reality: Example Data

60 Tissues

1453 Genes

Gene 510 is the “guilty” gene, the Y.

19

1st Split of Gene 510 (Guilty Gene)

20

Split Selection

14 splitters with adjusted p-value < 0.05.

21

Histogram

Non-normal, hence resampling p-values make sense.

22

Resampling-based Adjusted p-value
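The deck does not spell out the resampling algorithm on this slide, so the following is an illustrative sketch of one standard scheme (a Westfall-Young-style permutation adjustment): permute the response, recompute the best statistic over all candidate splitters each time, and compare the observed maximum with that null distribution.

```python
import numpy as np
from scipy import stats

def median_split_t(x, y):
    """Example statistic: |t| for splitting a predictor at its median."""
    left, right = y[x <= np.median(x)], y[x > np.median(x)]
    if min(len(left), len(right)) < 2:
        return 0.0
    return abs(stats.ttest_ind(left, right).statistic)

def adjusted_pvalue(X, y, stat=median_split_t, n_perm=1000, seed=0):
    """Resampling-based adjusted p-value for the best splitter.

    Comparing the observed maximum statistic with the permutation
    distribution of that maximum accounts for having searched over
    many correlated candidate splitters."""
    rng = np.random.default_rng(seed)
    observed = max(stat(X[:, j], y) for j in range(X.shape[1]))
    exceed = 0
    for _ in range(n_perm):
        yp = rng.permutation(y)  # break the X-y link, keep correlations in X
        exceed += max(stat(X[:, j], yp) for j in range(X.shape[1])) >= observed
    return (exceed + 1) / (n_perm + 1)
```

Because the permutations preserve the correlation structure among the predictors, the adjustment is less conservative than Bonferroni when splitters are strongly correlated.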

23

Single Tree RP Drawbacks

• Data greedy.

• Only one view of the data. May miss other mechanisms.

• Highly correlated variables may be obscured.

• Higher order interactions may be masked.

• No formal mechanisms for follow-up experimental design.

• Disposition of outliers is difficult.

24

Etc.

Multiple Trees: how and why?

25

How do you get multiple trees?

1. Bootstrap the sample, one tree per sample.

2. Randomize over valid splitters.

Etc.
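A sketch of option 1, using scikit-learn's DecisionTreeRegressor as a convenient stand-in for the deck's adjusted-p-value trees:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bootstrap_trees(X, y, n_trees=1000, seed=0):
    """Option 1: bootstrap the sample, fit one tree per resample."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(y), size=len(y))  # sample rows with replacement
        trees.append(DecisionTreeRegressor(max_depth=3).fit(X[rows], y[rows]))
    return trees
```

Each resample sees a slightly different data set, so genes obscured by a stronger, correlated splitter in one tree get a chance to surface in others.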

26

Random Tree Browsing, 1000 Trees.

27

Example Tree

28

1st Split

29

Example Tree, 2nd Split

30

Conclusion for Gene G510

If G518 < -0.56 and G790 < -1.46,
then G510 = 1.10 +/- 0.30.
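Read as a prediction rule, the conclusion is just a conjunction of the two splits; a literal rendering (gene names as in the slide):

```python
def predict_g510(g518, g790):
    """Literal reading of the example tree's rule for this node."""
    if g518 < -0.56 and g790 < -1.46:
        return 1.10   # node mean; the slide reports a spread of +/- 0.30
    return None       # rule does not fire; other nodes of the tree apply
```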

31

Using Multiple Trees to Understand Variables

• Which variables matter?

• How to rank variables in importance.

• Correlations.

• Synergistic variables.
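One simple heuristic for the "which variables matter" question (illustrative, not necessarily the authors' exact scoring): count how often each gene is used as a splitter across the 1000 trees. With sklearn-style fitted trees, splitters can be read off tree_.feature.

```python
from collections import Counter

def splitter_counts(trees):
    """Rank variables by how often they split across a forest of fitted
    sklearn trees; negative entries in tree_.feature mark leaf nodes."""
    counts = Counter()
    for t in trees:
        counts.update(int(f) for f in t.tree_.feature if f >= 0)
    return counts.most_common()  # (gene index, count), most frequent first
```

Co-occurrence of two splitters within the same trees is one simple way to flag correlated or synergistic predictors.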

32

Correlation / Interaction Matrix

Red = synergy.

33

Summary

• Reviewed recursive partitioning.

• Demonstrated multiple-tree RP’s capabilities:

– Find associated genes.

– Group correlated predictors (genes).

– Find synergistic predictors (genes that predict together).

• Used them to understand a complex data set.

34

Needed research

• Real data sets with known answers.

• Benchmarking.

• Linking to gene annotations.

• Scale (1,000 × 10,000).

• Multiple testing in complex data sets.

• Good visualization methods.

• Outlier detection for large data sets.

• Missing values (see NISS paper 123).

35

Teams

NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger

U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan

U Minnesota: Douglas Hawkins

NISS: Alan Karr (consider post docs)

GSK: Lei Zhu, Ray Lam

36

References/Contact

1. www.goldenhelix.com

2. www.recursive-partitioning.com

3. www.niss.org, papers 122 and 123

4. young@niss.org

5. GSK patent.

37

Questions