Part II – with interactions of genes in mind. Min-Te Chao, 2002/10/28
112/04/19 1
• So far, all methods are one-gene-at-a-time
• These methods start out simple and intuitive, but they soon become complicated.
• E.g., Efron has to use a tricky logistic regression to estimate the prior density, which is not easy.
• The general problem with microarray data is that, although the setup resembles regression, the “design matrix” is never of full rank.
In the setup
Y = X\beta + error,
X is n by p, with n < 100 and p > 1000.
I have seen a case with n = 7 but p > 6000.
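The rank problem follows from the dimensions alone. A short derivation, where the fitted vector \hat\beta, the null-space vector v, and the scalar c are notation introduced here for illustration:

```latex
\operatorname{rank}(X) \le \min(n, p) = n < p
\;\Longrightarrow\;
\exists\, v \ne 0 :\; Xv = 0
\;\Longrightarrow\;
X(\hat\beta + c\,v) = X\hat\beta \quad \text{for all } c \in \mathbb{R}.
```

So X^T X is singular, and the data cannot distinguish \hat\beta from \hat\beta + c v: the least-squares fit does not determine \beta uniquely.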
• Suppose there is a way to “do the statistical problem” (say, with traditional methods) for a smaller p, say p = p_1 = 3 or 30, depending on the value of n we have.
• Let us assume a model with the first p_1 parameters only (the other betas are all 0, say).
• With our traditional method, we may find the likelihood function – with n observations and p_1 parameters.
• We then go through the textbook method to do inference about the selected p_1 parameters,
• and obtain an estimator of the p_1-dimensional parameter (together with an sd or p-value).
• Repeat the procedure B times, each time with a
“simple random sample without replacement of size p_1”
from the p genes in the problem.
• In this way we change a problem that is unsolvable (in our classical statistical sense) into B problems, each of which can be handled with traditional methods.
• It is very time-consuming, but sometimes it works.
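The resampling loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `random_subset_analysis`, `fit_subproblem`, and `toy_fit` are hypothetical names, and `toy_fit` is a placeholder for whatever traditional method (likelihood, estimator, sd or p-value) is used on each small problem.

```python
import random

def random_subset_analysis(p, p1, B, fit_subproblem, seed=0):
    """Run B small analyses, each on a simple random sample
    (without replacement) of p1 of the p genes."""
    rng = random.Random(seed)
    results = []
    for _ in range(B):
        # rng.sample draws without replacement, so indices never repeat
        subset = sorted(rng.sample(range(p), p1))
        results.append((subset, fit_subproblem(subset)))
    return results

def toy_fit(subset):
    # Placeholder (assumption): a real analysis would fit a p1-parameter
    # model to the n observations restricted to these genes.
    return len(subset)

runs = random_subset_analysis(p=6000, p1=3, B=5, fit_subproblem=toy_fit)
for subset, result in runs:
    print(subset, result)
```

Each of the B runs is an ordinary low-dimensional problem, so the cost is B times that of one traditional fit, which is where the time goes.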
• Lo, Shaw-Hwa and Tian Zheng (2002). Backward haplotype transmission association algorithm – a fast multi-marker screening method.
To appear: Human Heredity.
• Instead of genes, they use markers.
• p markers, n patients.
• For each patient, we have data from the father and the mother.
• So we have n sets of parents–child data.
• The problem is to identify which are the disease-causing markers
• They pick out r markers at a time, r << p.
• A statistic T(r) is constructed, which measures the “amount of information” in an n-patient, r-marker sub-problem.
• Markers in this sub-problem are deleted one by one, the least important one first, until all the markers left are important.
• This gives us group 1 of important markers.
• We do the same thing for another subset of r markers and get group 2 of important markers, and so on.
• Do it B times, with B pretty large, say 5000.
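A minimal sketch of the screen-and-count loop, not the authors' algorithm: the real procedure relies on the statistic T(r), which is problem-specific. Here `marker_importance`, `threshold`, and the toy scoring rule are hypothetical stand-ins. The toy makes markers 0 and 1 informative only when both are present, to mimic an interaction.

```python
import random
from collections import Counter

def backward_screen(subset, marker_importance, threshold):
    """Delete markers one at a time, least important first,
    until every remaining marker scores above the threshold."""
    kept = list(subset)
    while kept:
        scores = {m: marker_importance(m, kept) for m in kept}
        weakest = min(kept, key=scores.get)
        if scores[weakest] > threshold:
            break  # all markers left are important
        kept.remove(weakest)
    return kept

def return_frequencies(p, r, B, marker_importance, threshold, seed=0):
    """Screen B random r-marker subsets and count how often
    each marker survives the backward deletion."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(B):
        subset = rng.sample(range(p), r)
        counts.update(backward_screen(subset, marker_importance, threshold))
    return counts

def toy_importance(marker, kept):
    # Assumption for illustration: markers 0 and 1 matter only jointly.
    return 1.0 if marker in (0, 1) and 0 in kept and 1 in kept else 0.0

counts = return_frequencies(p=20, r=5, B=2000,
                            marker_importance=toy_importance, threshold=0.5)
print(counts)
```

In this toy, a marker survives only in subsets that contain both interacting markers, so a procedure that scores one marker in isolation would never see it.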
• Combine all the groups of markers; those with the highest return frequencies are selected.
• More specifically, markers whose return frequencies exceed the 3rd quartile plus 1.8 times the IQR are selected (about 3.1 sd above the mean under normality).
• This gives a type I error of about 10^{-3}.
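The two numbers quoted above can be checked against the standard normal distribution, and the selection rule itself is a one-liner. The sketch below uses only the Python standard library; `select_by_frequency` and the example frequencies are hypothetical, and normality of the return frequencies is an assumption.

```python
from statistics import NormalDist, quantiles

def select_by_frequency(freqs, k=1.8):
    """Select markers whose return frequency exceeds Q3 + k * IQR."""
    q1, _, q3 = quantiles(freqs.values(), n=4)
    cutoff = q3 + k * (q3 - q1)
    return {m for m, f in freqs.items() if f > cutoff}, cutoff

# Check of the quoted figures under a standard normal:
z = NormalDist()
q3 = z.inv_cdf(0.75)              # third quartile, in sd units
iqr = 2 * q3                      # by symmetry, Q1 = -Q3
cutoff_sd = q3 + 1.8 * iqr        # cutoff in sd units above the mean
tail = 1 - z.cdf(cutoff_sd)       # one-sided exceedance probability
print(round(cutoff_sd, 2), tail)

# Hypothetical return frequencies: one marker stands out.
freqs = {m: 10 for m in range(20)}
freqs[0] = 100
selected, cutoff = select_by_frequency(freqs)
print(selected, cutoff)
```

The cutoff sits a little above 3.1 sd, and the one-sided tail beyond it is slightly under 10^{-3}, in line with the stated type I error.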
• The difficult part of the problem is to formulate a likelihood function for the r selected markers.
• The next problem is to derive a test statistic, together with its properties.
But these are problem-specific…
• It is the generality of the setup that is important.
• Because it considers r markers at a time, the likelihood function is with respect to the r selected markers. If there is any interaction between 2 or 3 markers, this process has the potential to pick it up.
• This is not possible with any of the one-gene-at-a-time procedures.
• All known methods, data mining or not, for the analysis of microarray-type data are ad hoc and rather primitive.
• The amount of theory is limited.
• These methods will tend to become statistical in nature eventually, because an assessment of risk is still a very important factor in scientific work.
• Subject-matter relevancy is the key
• Other keys:
good data
other scientists
effective computation
don’t wait