Part II – with interactions of genes in mind. Min-Te Chao, 2002/10/28

112/04/19 1



• So far, all methods are one-gene-at-a-time

• These methods start out simple and intuitive, but then they become complicated.

• E.g., Efron has to use a tricky logistic regression to estimate the prior density, which is not easy.


• The general problem with microarray data is that, although the setup resembles regression, the “design matrix” is never of full rank.


In the setup

Y = X \beta + error

X is n by p, with n<100, p>1000.

I have seen a case with n=7, but p>6000.
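The rank problem can be checked numerically. The sketch below is only an illustration: the data are simulated Gaussian noise (an assumption, not real microarray data), and all variable names are made up for the example. It shows that with n=7 and p=6000 the rank of X is at most n, so two very different \beta vectors fit Y equally well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 7, 6000                        # the extreme case mentioned above
X = rng.standard_normal((n, p))       # simulated design matrix (assumption)
beta_true = np.zeros(p)
beta_true[:3] = 1.0                   # hypothetical "true" coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# rank(X) <= min(n, p) = n, far below the p columns
rank = np.linalg.matrix_rank(X)

# One exact solution: the minimum-norm least-squares fit
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Another exact solution: add any vector from the null space of X
v = rng.standard_normal(p)
null_vec = v - np.linalg.pinv(X) @ (X @ v)   # remove the row-space part of v
beta_other = beta_min_norm + null_vec

print("rank of X:", rank)
print("both fit y exactly:",
      np.allclose(X @ beta_min_norm, y), np.allclose(X @ beta_other, y))
print("distance between the two solutions:",
      np.linalg.norm(beta_other - beta_min_norm))
```

Since the data cannot distinguish between these solutions, the usual textbook inference on all p coefficients at once is not available.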


• Suppose there is a way to “do the statistical problem” (say, with traditional methods) for a smaller p, say p=p_1=3 or 30, depending on the value of n we have.

• Let us assume a model with the first p_1 parameters only (the other betas are all 0, say).


• With our traditional method, we may find the likelihood function – with n observations and p_1 parameters

• And we go through the textbook method to do inference about the selected p_1 parameters.

• And obtain an estimate of the p_1-dim parameter (together with an sd or p-value)


• Repeat the procedure B times, each time with a

“simple random sample without replacement of size p_1”

from the p genes in the problem.
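A minimal sketch of this repetition, assuming the "traditional method" is ordinary least squares without an intercept and that the data are simulated. The function name `random_subset_fits` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def random_subset_fits(X, y, p1, B, seed=0):
    """Fit B small regressions, each on p1 columns drawn without replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if n <= p1:
        raise ValueError("need n > p1 for a classical least-squares fit")
    fits = []
    for _ in range(B):
        # simple random sample without replacement of size p1 from the p genes
        idx = rng.choice(p, size=p1, replace=False)
        Xs = X[:, idx]
        beta_hat, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta_hat
        sigma2 = resid @ resid / (n - p1)          # residual variance estimate
        se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xs.T @ Xs)))
        fits.append((idx, beta_hat, se))
    return fits

# Toy usage with simulated data (assumption: Gaussian noise, 3 active genes)
rng = np.random.default_rng(1)
n, p = 40, 2000
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + rng.standard_normal(n)

fits = random_subset_fits(X, y, p1=3, B=200)
idx, beta_hat, se = fits[0]
print("genes in first subproblem:", idx)
print("estimates:", beta_hat, "standard errors:", se)
```

Each of the B subproblems has n > p_1, so the design matrix of the subproblem can have full column rank and the classical formulas apply.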


• In this way we change an unsolvable problem (in the classical statistical sense) into B problems, all of which can be done with traditional methods

• It is very time-consuming, but sometimes it works


• Lo, Shaw-Hwa and Tian Zheng (2002) Backward haplotype transmission association algorithm – a fast multi-marker screening method

To appear: Human Heredity


• Instead of genes, they use markers.

• p markers, n patients

• For each patient, we have data from father and mother

• So we have n sets of parents-child data.


• The problem is to identify which are the disease-causing markers


• They pick out r markers at a time, r<<p

• A statistic T(r) is constructed, which measures the “amount of information” in an n-patient, r-marker subproblem

• Markers in this subproblem are deleted one by one, the least important one first, until all markers left are important
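The deletion step can be sketched generically. This is not the statistic from Lo and Zheng's paper: `info` stands in for T(r) as a user-supplied function, and `toy_info` with its "causal" set {2, 5} is entirely made up for illustration. A marker's importance is taken here to be the information lost when it is dropped, which is an assumption of this sketch.

```python
def backward_eliminate(markers, info):
    """Drop the least important marker repeatedly until every
    remaining marker contributes information.

    info: hypothetical stand-in for T(r); maps a frozenset of
    marker indices to a number (larger means more information).
    """
    current = frozenset(markers)
    while current:
        base = info(current)
        # importance of m = information lost when m is removed
        importance = {m: base - info(current - {m}) for m in current}
        least = min(sorted(importance), key=importance.get)
        if importance[least] > 0:
            break                      # all markers left are important
        current = current - {least}
    return current

# Toy information measure (assumption): one unit per "causal" marker
# present, minus a small penalty for every marker carried along.
CAUSAL = frozenset({2, 5})

def toy_info(subset):
    return len(subset & CAUSAL) - 0.01 * len(subset)

kept = backward_eliminate({0, 2, 5, 7, 9}, toy_info)
print(sorted(kept))
```

In the toy run, markers 0, 7 and 9 are deleted in turn because removing them loses nothing, and the loop stops once only markers 2 and 5 remain.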


• This gives us group 1 of important markers.

• We do the same thing for another subset of r markers, and get group 2 of important markers, and so on.

• Do it B times, with B fairly large, say 5000


• Pool all the groups together; the markers returned most frequently are selected.

• More specifically, markers whose return frequencies exceed the 3rd quartile plus 1.8 times the IQR are selected (about 3.1 sd above the mean for a normal distribution)

• About 10^{-3} type I error.
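The numbers on this slide can be checked, assuming the return frequencies of unimportant markers are roughly normal. The sketch first converts "3rd quartile plus 1.8 IQR" into sd units and a tail probability, then applies the same rule to a made-up vector of return counts. `select_markers` and the counts are hypothetical.

```python
from statistics import NormalDist
import numpy as np

# Part 1: the cutoff in sd units under a standard normal
z = NormalDist()
q3_sd = z.inv_cdf(0.75)                  # 3rd quartile, about 0.674 sd
iqr_sd = q3_sd - z.inv_cdf(0.25)         # IQR, about 1.349 sd
cutoff_sd = q3_sd + 1.8 * iqr_sd         # about 3.1 sd above the mean
tail = 1.0 - z.cdf(cutoff_sd)            # upper-tail probability

print(f"cutoff = {cutoff_sd:.3f} sd, tail probability = {tail:.2e}")

# Part 2: the same rule applied to empirical return frequencies
def select_markers(freq):
    """Return indices of markers whose frequency exceeds Q3 + 1.8 * IQR."""
    freq = np.asarray(freq, dtype=float)
    q1, q3 = np.percentile(freq, [25, 75])
    threshold = q3 + 1.8 * (q3 - q1)
    return np.flatnonzero(freq > threshold), threshold

# Made-up counts: ten background markers and two that return often
counts = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 60, 55]
selected, threshold = select_markers(counts)
print("threshold:", threshold, "selected markers:", selected)
```

Under the normal assumption the cutoff sits at about 3.10 sd with an upper-tail probability just under 10^{-3}, which agrees with the figures quoted above.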


• The difficult part of the problem is to formulate a likelihood function for the r selected markers.

• The next problem is to derive a test statistic, together with its properties.

But these are problem-specific…


• It is the generality of the setup that is important.

• Because it considers r markers at a time, the likelihood function is with respect to the r selected markers. If there is any interaction between 2 or 3 markers, this process has the potential to pick it up


• This is not possible with any of the one-gene-at-a-time processes.
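A deliberately extreme toy case, not taken from the talk, shows why. Assume two binary markers a and b and an outcome that is 1 exactly when the markers differ. Looked at one at a time, neither marker is correlated with the outcome; looked at together with an interaction term, they explain it completely.

```python
import numpy as np

# Balanced toy data: every (a, b) combination appears 25 times
a = np.array([0, 0, 1, 1] * 25)
b = np.array([0, 1, 0, 1] * 25)
y = a ^ b                                # outcome depends only on the pair

# One-marker-at-a-time view: marginal correlations are zero
corr_a = np.corrcoef(a, y)[0, 1]
corr_b = np.corrcoef(b, y)[0, 1]

# Two-markers-at-a-time view: regression with an interaction term
X_joint = np.column_stack([np.ones_like(a), a, b, a * b]).astype(float)
coef, *_ = np.linalg.lstsq(X_joint, y.astype(float), rcond=None)
resid = y - X_joint @ coef

print("marginal correlations:", corr_a, corr_b)
print("joint coefficients (1, a, b, a*b):", np.round(coef, 6))
print("largest absolute residual:", np.abs(resid).max())
```

The joint fit recovers y = a + b - 2ab with no residual, while any screening rule based on a or b alone sees no signal at all.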


• All known methods, data mining or not, for analysis of microarray-type data are ad hoc and rather primitive.

• Amount of theory is limited.

• These methods will tend to become statistical in nature eventually, because an assessment of risk is still a very important factor in scientific work


• Subject-matter relevancy is the key

• Other keys:

good data

other scientists

effective computation

don’t wait
