Design of a Clinical Microarray Chip


Source: compdiag.molgen.mpg.de/docs/jaeger@recomb2003.pdf

J. Jäger and R. Spang

Department of Computational Molecular Biology
Max-Planck-Institute for Molecular Genetics, Ihnestr. 73, D-14195 Berlin (Germany)

E-mail: [email protected], [email protected]

Problem setting

First results

Our goal is to reduce the costs for a clinical diagnostic system based on microarray chips. Currently whole genome chips are used to examine the expression of as many genes as possible. We study the problem of how many samples should be analyzed before moving from a full genome chip approach to a smaller and therefore more cost-efficient custom diagnostic chip.

Figure: a series of whole genome chips passes through gene subset selection to give a new compact diagnostic chip.

How many whole genome chips do we have to look at before we can design a new diagnostic chip?

Experimental design

Diagnostic signature and gene selection

Figure axes: Intensity, Frequency

Questions

Normal scenario: Rank all genes based on a test statistic that evaluates differences between diagnostic groups. Then select the top genes from this list.
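The ranking step can be sketched as follows. This is a minimal illustration, not the authors' code: the poster does not name the test statistic, so Welch's t-statistic is assumed here, and the function names (`welch_t`, `rank_genes`, `select_top_genes`) and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values (assumed statistic)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expr, labels):
    """Return gene names sorted by decreasing |t| between the two groups.

    expr   -- dict: gene name -> list of expression values, one per patient
    labels -- list of 0/1 group labels, in the same patient order
    """
    scores = {}
    for gene, values in expr.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)


def select_top_genes(expr, labels, n_genes):
    """Keep the n_genes highest-ranked genes."""
    return rank_genes(expr, labels)[:n_genes]


# Toy data: three genes measured on three patients per group.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # separates the groups clearly
    "geneB": [2.0, 2.5, 1.5, 2.1, 2.4, 1.6],  # essentially no difference
    "geneC": [1.0, 2.0, 1.5, 2.0, 3.0, 2.5],  # moderate difference
}
top = select_top_genes(expr, labels, 2)
```

On this toy input the two genes with the largest group difference are kept and the uninformative gene is dropped.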

Perfect diagnostic marker gene

Perfect diagnostic signature using more than one gene

Boxplot: k relapse and k control patients sampled 20 times. Number of genes on the new chip fixed at 300. Accuracy using all data: 0.77.

Boxplot: 20 relapse and 20 control patients sampled 10 times, varying the number of genes from 10 to 300 in steps of 10.

Figure labels: Intensity, Frequency; Group 1, Group 2; Gene 1, Gene 2

For the gene subset selection we would like to cover as many of the potential candidates as necessary for a consistent and reliable class prediction. These genes do not necessarily have to be the most discriminative ones. Rather, they should form a reliable subset to which further feature selection can be applied. Chip design therefore does not yet select genes for diagnostic signatures; it limits further feature selection to a subset of genes to choose from, and in this respect differs from the well-known feature/gene and diagnostic signature selection.

To study the effects of a new chip design for clinical trials, we used the St. Jude* acute lymphoblastic leukemia dataset with 335 patient samples (68 of which had a relapse) and simulated chip design from this dataset. To learn a classifier we used support vector machines with feature selection prefiltering. The performance was evaluated using leave-one-out cross-validation (LOOCV).

* E.-J. Yeoh, M.E. Ross et al.: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133-145, March 2002.
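The evaluation scheme (prefilter genes, learn a classifier, score it by leave-one-out) can be illustrated with a small self-contained sketch. Several things here are assumptions rather than the poster's method: a nearest-centroid classifier is swapped in for the support vector machine to keep the example dependency-free, the prefilter scores genes by the gap between group means, and the prefilter is rerun inside every fold. All function names and the toy data are hypothetical.

```python
from statistics import mean


def top_genes(X, y, n_genes):
    """Indices of the n_genes columns with the largest gap between group means."""
    def gap(j):
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        g0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        return abs(mean(g1) - mean(g0))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:n_genes]


def fit_centroids(X, y, genes):
    """Nearest-centroid model (stand-in for the SVM): one mean profile per class."""
    return {
        lab: [mean(row[j] for row, l in zip(X, y) if l == lab) for j in genes]
        for lab in set(y)
    }


def predict(centroids, row, genes):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(genes, centroid))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def loocv_accuracy(X, y, n_genes):
    """Leave each patient out once; select genes and fit on the remaining ones."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        genes = top_genes(X_train, y_train, n_genes)
        model = fit_centroids(X_train, y_train, genes)
        hits += predict(model, X[i], genes) == y[i]
    return hits / len(X)


# Toy data: rows are patients, columns are genes; only column 0 is informative.
X = [
    [5.0, 0.1, 0.3], [5.2, 0.4, 0.2], [4.9, 0.2, 0.1],
    [1.0, 0.3, 0.2], [1.1, 0.1, 0.4], [0.9, 0.2, 0.3],
]
y = [1, 1, 1, 0, 0, 0]
accuracy = loocv_accuracy(X, y, n_genes=1)
```

Running the gene prefilter inside each fold keeps the held-out patient from influencing which genes are chosen; whether the original study did this is not stated on the poster.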

Difference of Chip Design and Feature Selection

A virtual chip holds only a subset of genes of the original chip. To compare the performance of virtual chips we measure the performance of a classifier based on this virtual chip. In order to simulate different study sizes we randomly sample from the study and determine the subset of genes for the virtual chip based on this data.
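A virtual chip of this kind might be simulated roughly as below. This is a sketch under stated assumptions: patients are drawn as k relapse and k control cases without replacement, genes are scored by the absolute difference of group means (a hypothetical stand-in for whatever statistic was actually used), and all names and data are illustrative.

```python
import random
from statistics import mean


def balanced_sample(labels, k, rng):
    """Draw k relapse (1) and k control (0) patient indices without replacement."""
    relapse = [i for i, y in enumerate(labels) if y == 1]
    control = [i for i, y in enumerate(labels) if y == 0]
    return sorted(rng.sample(relapse, k) + rng.sample(control, k))


def mean_difference(values, labels):
    """Assumed gene score: absolute difference of the two group means."""
    g1 = [v for v, y in zip(values, labels) if y == 1]
    g0 = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(g1) - mean(g0))


def design_virtual_chip(expr, labels, k, n_genes, rng):
    """Choose the genes for a virtual chip using only a random balanced sample."""
    idx = balanced_sample(labels, k, rng)
    sub_labels = [labels[i] for i in idx]
    scores = {
        gene: mean_difference([values[i] for i in idx], sub_labels)
        for gene, values in expr.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_genes]


def apply_chip(expr, chip_genes):
    """Restrict the whole study to the genes on the virtual chip."""
    return {gene: expr[gene] for gene in chip_genes}


# Toy study: 5 relapse and 5 control patients, 3 genes on the full chip.
labels = [1] * 5 + [0] * 5
expr = {
    "informative": [5.0, 5.1, 4.9, 5.2, 4.8, 1.0, 1.1, 0.9, 1.2, 0.8],
    "noise_1": [0.1, 0.2, 0.1, 0.3, 0.2, 0.2, 0.1, 0.3, 0.1, 0.2],
    "noise_2": [0.3, 0.1, 0.2, 0.2, 0.1, 0.1, 0.3, 0.2, 0.2, 0.1],
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
chip = design_virtual_chip(expr, labels, k=3, n_genes=2, rng=rng)
virtual_study = apply_chip(expr, chip)
```

The gene subset is chosen from the sample only, but the resulting virtual chip keeps all patients of the study, so classifiers built on different virtual chips can be compared on the same data.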

Workflow (figure):

1. All chips of the whole study
2. Sampling: random sample
3. Select subset of genes for chip design
4. Apply subset to whole study: virtual chips for whole study
5. Feature selection: top genes for all virtual chips
6. Learn SVM classifier
7. Calculate LOOCV performance of this classifier

How many genes should we put on the new chip? (Tradeoff between accuracy and budget.)