Diagnosis of multiple cancer types by shrunken centroids of gene expression
Course: 550.635 Topics in Bioinformatics
Presenter: Ting Yang
Teacher: Professor Geman
By Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu
Nearest Centroid Classification
Example: small round blue cell tumors of childhood
• 63 training samples, 25 testing samples
• 4 classes: BL, EWS, NB, RMS
• Figure 1
• Nearest centroid classification
• Disadvantage
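To make the nearest centroid rule concrete, here is a minimal sketch in plain Python. It assumes squared Euclidean distance to each class centroid; the function names and the toy numbers are illustrative only and are not taken from the slides or the data set.

```python
from math import inf

def class_centroids(samples, labels):
    """Return {label: per-gene mean expression} over the training samples."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def nearest_centroid(x, centroids):
    """Assign x to the class whose centroid is closest in squared distance."""
    best, best_dist = None, inf
    for label, c in centroids.items():
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
        if dist < best_dist:
            best, best_dist = label, dist
    return best

# Toy data (made-up numbers): 4 samples, 2 genes, 2 classes.
train = [[1.0, 0.0], [1.2, 0.2], [-1.0, 0.1], [-0.8, -0.1]]
labels = ["EWS", "EWS", "BL", "BL"]
centroids = class_centroids(train, labels)
predicted = nearest_centroid([0.9, 0.0], centroids)
```

Every gene contributes to the distance here, including genes that carry no class information, which is the kind of noise the shrunken version is meant to reduce.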
Nearest Shrunken Centroids
• A modification of the nearest centroid method
• Idea: first normalize the class centroids by the within-class standard deviation for each gene, then shrink each class centroid toward the overall centroid.
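Details:
$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot (s_i + s_0)}$$
• $\bar{x}_{ik}$: mean expression value in class k for gene i
• $\bar{x}_i$: ith component of the overall centroid
• $s_i$: pooled within-class standard deviation for gene i
• $d_{ik}$: t statistic
$$m_k = \sqrt{\frac{1}{n_k} + \frac{1}{n}}$$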
t statistic:
$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot (s_i + s_0)}$$
• It measures the difference between the mean expression of gene i in class k and the mean expression of gene i over all classes combined.
• Idea: a gene that discriminates one class from the rest will have a statistic of large absolute value.
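The statistic can be computed directly from a labelled expression matrix. The sketch below assumes d_ik = (class mean − overall mean) / (m_k · (s_i + s_0)), with m_k = sqrt(1/n_k + 1/n) and s_0 set to the median of the s_i; both of those choices are assumptions taken from the published method rather than from the slides, and the data are made up.

```python
from math import sqrt
from statistics import median

def t_statistics(samples, labels):
    """Return {class: [d_ik for each gene i]} for a samples-by-genes matrix."""
    n, p = len(samples), len(samples[0])
    classes = sorted(set(labels))
    members = {k: [x for x, y in zip(samples, labels) if y == k] for k in classes}
    # Overall centroid and per-class centroids, gene by gene.
    overall = [sum(x[i] for x in samples) / n for i in range(p)]
    centroid = {k: [sum(x[i] for x in xs) / len(xs) for i in range(p)]
                for k, xs in members.items()}
    # Pooled within-class standard deviation for each gene.
    s = [sqrt(sum((x[i] - centroid[k][i]) ** 2
                  for k in classes for x in members[k]) / (n - len(classes)))
         for i in range(p)]
    s0 = median(s)  # assumption: median of the per-gene standard deviations
    m = {k: sqrt(1.0 / len(members[k]) + 1.0 / n) for k in classes}  # assumed form
    return {k: [(centroid[k][i] - overall[i]) / (m[k] * (s[i] + s0))
                for i in range(p)]
            for k in classes}

# Toy data: gene 0 separates the classes, gene 1 does not.
samples = [[2.0, 0.1], [2.2, -0.1], [1.8, 0.0],
           [0.0, 0.0], [0.2, 0.2], [-0.2, -0.2]]
labels = ["A", "A", "A", "B", "B", "B"]
d = t_statistics(samples, labels)
```

In this toy example the discriminating gene gets a large |d_ik| in both classes, while the uninformative gene gets a value near zero.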
• Shrink it toward zero to eliminate the genes that do not provide sufficient information.
• ‘De-noising’ step
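$$d'_{ik} = \operatorname{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+$$
where $\Delta$ is the shrinkage amount and $(\cdot)_+$ denotes the positive part.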
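The de-noising step is soft thresholding: each statistic is reduced in absolute value by the shrinkage amount and set to zero if it would cross zero. A minimal sketch follows; the name `delta` for the shrinkage amount and the helper that rebuilds a centroid component by undoing the standardization are illustrative assumptions, as are the numbers.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; anything with |d| <= delta becomes 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude

def shrunken_centroid_component(overall_i, m_k, s_i, s0, d_shrunk):
    """Rebuild one centroid component from a shrunken statistic (assumed form)."""
    return overall_i + m_k * (s_i + s0) * d_shrunk

# Made-up statistics for four genes in one class.
d_row = [3.7, -0.4, 1.2, -2.5]
shrunk = [soft_threshold(d, 1.0) for d in d_row]
# A gene stays "active" for this class only if its shrunken statistic is nonzero.
active = [i for i, d in enumerate(shrunk) if d != 0.0]
```

A gene whose shrunken statistic is zero in every class has the same centroid value in all classes, so it no longer influences which class is nearest.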
Choosing the amount of shrinkage
• Shrinkage amount is allowed to vary over a wide range.
• 10-fold cross-validation (choose the shrinkage amount that gives the smallest error rate)
• Divide the set of samples (at random) into 10 equal-size parts.
(classes were distributed proportionally among each of the 10 parts)
• Fit the model on 90% of the samples and then predict the class label of the remaining 10% (test samples).
• Repeat for each of the 10 parts and add the errors together (overall error).
• Figure 2
• Figure 1
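The cross-validation procedure above can be sketched as follows. The fold assignment spreads each class across the folds, and the error count is accumulated per candidate shrinkage amount. The `fit_predict` callback is a hypothetical interface, and `toy_fit_predict` is a deliberately trivial stand-in classifier (not nearest shrunken centroids), used only so the example runs on its own.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into folds with classes spread roughly proportionally."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        for idx in members:  # deal indices round-robin across the folds
            folds[position % n_folds].append(idx)
            position += 1
    return folds

def cv_errors(samples, labels, thresholds, fit_predict, n_folds=10, seed=0):
    """Return {threshold: misclassifications summed over all held-out folds}."""
    errors = {t: 0 for t in thresholds}
    for held_out in stratified_folds(labels, n_folds, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        train_x = [samples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [samples[i] for i in held_out]
        for t in thresholds:
            predicted = fit_predict(train_x, train_y, test_x, t)
            errors[t] += sum(p != labels[i] for p, i in zip(predicted, held_out))
    return errors

def toy_fit_predict(train_x, train_y, test_x, threshold):
    """Stand-in classifier: ignores the training data and cuts on gene 0."""
    return ["A" if x[0] > threshold else "B" for x in test_x]

# Made-up data: 10 samples per class, one gene.
samples = [[1.0]] * 10 + [[0.0]] * 10
labels = ["A"] * 10 + ["B"] * 10
folds = stratified_folds(labels)
errors = cv_errors(samples, labels, [0.5, 2.0], toy_fit_predict)
best = min(errors, key=errors.get)
```

In practice `fit_predict` would train the shrunken-centroid classifier on the 90% training portion at the given shrinkage amount and return predicted labels for the held-out 10%.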
More Figures
• Figure 3
• Figure 4
Classification
• A new sample is classified by comparing its expression profile with each shrunken centroid, over those 43 active genes.
• Distance function: prior information included.
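A sketch of a distance function that includes prior information, assuming the form of the discriminant score in the published method: a standardized squared distance to each shrunken centroid, minus twice the log of the class prior. The centroids, standard deviations, and priors below are made-up values.

```python
from math import log

def discriminant_score(x, shrunken_centroid, s, s0, prior):
    """Standardized squared distance to one class centroid, corrected by its prior."""
    distance = sum(((xi - ci) / (si + s0)) ** 2
                   for xi, ci, si in zip(x, shrunken_centroid, s))
    return distance - 2.0 * log(prior)

def classify(x, shrunken_centroids, s, s0, priors):
    """Return (label with the smallest score, all scores)."""
    scores = {k: discriminant_score(x, c, s, s0, priors[k])
              for k, c in shrunken_centroids.items()}
    return min(scores, key=scores.get), scores

# Toy setup: 2 genes, 2 classes, equal priors.
shrunken_centroids = {"BL": [1.0, 0.0], "EWS": [-1.0, 0.0]}
s, s0 = [0.5, 0.5], 0.5
label, scores = classify([0.8, 0.3], shrunken_centroids, s, s0,
                         {"BL": 0.5, "EWS": 0.5})

# With a sample equidistant from both centroids, the prior decides.
tie_label, _ = classify([0.0, 0.0], shrunken_centroids, s, s0,
                        {"BL": 0.1, "EWS": 0.9})
```

Genes whose shrunken centroid equals the overall centroid in every class add the same amount to each score, which is why only the active genes matter for the comparison.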
Statistical details:
• t-statistic
• Estimates of the class probabilities (Figure 5)
$$d_{ik} = \frac{\bar{x}_{ik} - \bar{x}_i}{m_k \cdot (s_i + s_0)}$$
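For the class probability estimates, one possible sketch turns per-class discriminant scores into probabilities with weights proportional to exp(−score/2), by analogy with Gaussian linear discriminant analysis. That form is an assumption taken from the published method, not from the slides, and the scores below are made up.

```python
from math import exp

def class_probabilities(scores):
    """Map {class: discriminant score} to probabilities; smaller score = more likely."""
    best = min(scores.values())
    # Subtract the smallest score before exponentiating, for numerical stability.
    weights = {k: exp(-0.5 * (v - best)) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Made-up scores for the four tumor classes.
probs = class_probabilities({"BL": 1.5, "EWS": 9.0, "NB": 12.0, "RMS": 7.5})
uniform = class_probabilities({"BL": 2.0, "EWS": 2.0, "NB": 2.0, "RMS": 2.0})
```

A sample whose largest estimated probability is well below 1 is one the classifier is unsure about, which is the kind of information a probability plot conveys.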