HICSS-35: 35th Hawaii International Conference on System Sciences
Interactive Visualization and Analysis for Gene Expression Data
Chun Tang, Li Zhang and Aidong Zhang
Department of Computer Science and Engineering
The State University of New York at Buffalo
Buffalo, NY 14260
{chuntang, lizhang, azhang}@cse.buffalo.edu
Abstract
New technology such as DNA microarrays can be used to measure the expression levels of thousands of genes simultaneously. The raw microarray data are images, which can be transformed into gene expression matrices where usually the rows represent genes, the columns represent various samples, and the number in each cell characterizes the expression level of the particular gene in a particular sample. Now that the cDNA and genomic sequencing projects are progressing at such a rapid rate, more and more data are becoming available to researchers working in the field of bioinformatics. New methods are needed to analyze and visualize the gene data efficiently and effectively.
A key step in the analysis of gene expression data is to detect groups that manifest similar expression patterns and to filter out of the matrix the genes that are inferred to represent noise according to the sample distribution. In this paper, we present a visualization method which maps the samples' $n$-dimensional gene vectors into 2-dimensional points. This mapping preserves correlation coefficient similarity, which is the most suitable similarity measure for analyzing gene expression data. Our analysis method first removes noise genes from the gene expression matrix by sorting genes according to a correlation coefficient measure and then adjusts the weight of each remaining gene. We have integrated our gene analysis algorithm into a visualization tool based on this mapping method. We can use this tool to monitor the analysis procedure, to adjust parameters dynamically, and to evaluate the result of each step. Experiments on two groups of multiple sclerosis (MS) and treatment data demonstrate the effectiveness of this approach.
1 Introduction
DNA microarray technology can be used to measure expression levels for thousands of genes in a single experiment, across different conditions and over time [3, 16, 14, 24, 23, 27, 36, 20, 21, 8, 30]. To use the arrays, labeled cDNA is prepared from total messenger RNA (mRNA) of target cells or tissues, and is hybridized to the array. The amount of label bound is an approximate measure of the level of gene expression. Thus gene microarrays can give a simultaneous, semi-quantitative readout of the expression levels of thousands of genes. Just 4-6 such high-density "gene chips" could allow rapid scanning of the entire human library for genes which are induced or repressed under particular conditions. By preparing cDNA from cells or tissues at intervals following some stimulus, and exposing each to replicate microarrays, it is possible to determine the identity of genes responding to that stimulus, the time course of induction, and the degree of change.
Several methods have been developed, using both standard cluster analysis and new techniques, to extract, analyze and visualize gene expression data generated from DNA microarrays. A gene expression matrix can be studied in two dimensions [2]: comparing the expression profiles of genes by comparing rows in the expression matrix [22, 6, 26, 23, 25, 9, 33, 3], and comparing the expression profiles of samples by comparing columns in the matrix [?, 10, 32]. In addition, both methods can be combined (provided that the data normalization [18, 38] allows it).
A key step in the analysis of gene expression data is to detect groups that manifest similar expression patterns and to filter out of the matrix the genes that are inferred to represent noise according to the sample distribution. The corresponding algorithmic problem is to cluster multi-condition gene expression patterns. Data clustering [6] was used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. DeRisi et al. [17] used a DNA array containing a complete set of yeast genes to study the diauxic shift time course. They selected small groups of genes with similar expression profiles and showed that these genes are functionally related and contain relevant transcription factor binding sites upstream of their open reading frames (ORFs). In [3], a clustering algorithm was introduced for the analysis of gene expression data in which an appropriate stochastic error model on the input has been defined. Self-organizing maps [26, 19], a type of mathematical cluster analysis that is suited for recognizing and classifying features in complex, multidimensional data, were applied to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation. In [12], the authors presented a strategy for the analysis of large-scale quantitative gene expression measurement data from time-course experiments. The correlated patterns of gene expression from time series data suggest an order that conforms to a notion of shared pathways and control processes that can be experimentally verified. Brown et al. [23] applied a method based on the theory of support vector machines (SVMs). The method is considered a supervised learning method because it exploits prior knowledge of gene function to identify unknown genes of similar function from expression data. They applied this algorithm to six functional classes of yeast gene expression matrices from 79 samples [22]. Alter et al. [25] used singular value decomposition to transform genome-wide expression data from genes × arrays space to reduced diagonalized "eigengenes" × "eigenarrays" space, and extracted significant genes by normalizing and sorting the data. Hastie et al. [33] proposed a tree harvesting method for supervised learning from gene expression data. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. The method can discover genes that have strong effects on their own, and genes that interact with other genes.
On the sample dimension, the task is to build a classifier which can predict the sample labels from the expression profile. Golub et al. [32] applied neighborhood analysis to construct class predictors for samples, especially for leukemias. They looked for genes whose expression data are best correlated with two known classes of leukemias, acute myeloid leukemia and acute lymphoblastic leukemia. They constructed a weighted vote classifier based on 50 genes (out of 6817) using 38 samples and applied it to a collection of 34 new samples. The classifier correctly predicted 29 of the 34 samples. In [10], the authors present a neural network model known as Simplified Fuzzy ARTMAP which can identify normal and diffuse large B-cell lymphoma (DLBCL) patients using DNA microarray data generated by a previous study. Many traditional clustering algorithms, such as hierarchical [22, 15, 5] and K-means clustering algorithms [13, 28], have been used for clustering expression profiles. Mathematical and statistical methods such as Fourier and Bayesian analysis have also been used to discover profiles of cell cycle-dependent genes [31, 37, 1]. Our group has developed a maximum entropy approach to classifying gene array datasets [29]. We used part of the pre-known classes of samples as a training set and applied the maximum entropy model to generate an optimal pattern model which can be applied to new samples.
Sample clustering has been combined with gene clustering to identify which genes are the most important for sample clustering [34, 5]. Alon et al. [34] applied a partitioning-based clustering algorithm to study 6500 genes of 40 tumor and 22 normal colon tissues, clustering both genes and samples. Getz et al. [11] present a method applied to colon cancer and leukemia data. By identifying relevant subsets of the data, they were able to discover partitions and correlations that were masked and hidden when the full dataset was used in the analysis. This method is called two-way clustering.
Multiple sclerosis (MS) is a chronic, relapsing, inflammatory disease. Interferon-β (IFN-β) has been the most important treatment for MS for the last decade [35]. DNA microarray technology makes it possible to study the expression levels of thousands of genes simultaneously. The gene expression levels are measured by the intensity levels of the corresponding array spots. In this paper, we present a visualization method which maps the samples' $n$-dimensional gene vectors into 2-dimensional points. This mapping preserves correlation coefficient similarity, which is the most suitable similarity measure for analyzing gene expression data. Our gene expression data analysis method first removes noise genes from the expression matrix by sorting genes according to a correlation coefficient measure and then adjusts the weight of each remaining gene. We have integrated our gene analysis algorithm into a visualization tool based on this mapping method. We can use this tool to monitor the analysis procedure, to adjust parameters, and to evaluate the result of each step. Experiments on healthy control, MS and IFN-treated samples, based on data collected from DNA microarray experiments, demonstrate the effectiveness of this approach.
This paper is organized as follows. Section 2 introduces the visualization method. Section 3 describes the details of our gene data analysis approach with the help of our visualization tool. Section 4 presents the experimental results. Finally, the conclusion is provided in Section 5.
2 Visualization Tool
2.1 Mapping Method
A typical gene expression matrix has thousands of rows (each row represents a gene) and several (usually fewer than 100) columns which represent samples related to a certain kind of disease or other condition. Each sample is marked with a label that indicates which class it belongs to, such as control, patient or drug-treated. While we analyze these samples, we usually want to view the sample distribution during each step. How to visualize arbitrarily high-dimensional data effectively and efficiently is still an open problem. The parallel coordinate system allows the visualization of multidimensional data by a simple two-dimensional representation. But as the number of dimensions grows, the display becomes less effective. Figure 1 shows an example of different visualization methods.

Figure 1: Distribution of our gene expression data of 44 samples, where each sample has 4132 genes. (A) Visualization using the parallel coordinate system, where the horizontal axis represents the gene dimension and each multidimensional line represents a sample. The sample distribution is not clear. (B) Visualization using our tool, where each point represents a sample. The figure shows that after mapping the original data into 2-dimensional space, the sample distribution is clearly represented.
We use the idea of a linear mapping method that maps the multidimensional dataset to 2-dimensional space [7]. Let vector $P_g = (x_{g1}, x_{g2}, \ldots, x_{gn})$ represent a data point in $n$-dimensional space, and let the total number of points in the space be $m$, denoted as $P_1, P_2, \ldots, P_m$. We use Formula (1) to map $P_g$ into a 2-dimensional point $P'_g$:

\[ P'_g = \sum_{i=1}^{n} \left( w_i \cdot \lambda \cdot x_{gi} \right) S_i, \tag{1} \]

where $w_i$ is an adjustable weight for each coordinate, $n$ is the vector length of the original space, $\lambda$ is a ratio to centralize the points, and $S_i$ ($i = 1, 2, \ldots, n$) are unit vectors which divide the center circle of the display screen equally (Figure 2 (B)).
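As an illustration, the mapping of Formula (1) can be sketched in a few lines of Python. This is a minimal sketch rather than the tool's actual implementation: the function names, the `scale` argument (standing in for the centralizing ratio, whose exact value the text does not give), and the choice of placing the first unit vector along the positive x-axis are all assumptions.

```python
import math

def unit_vectors(n):
    """Return n unit vectors that divide the unit circle equally.

    Assumption: the first vector points along the positive x-axis.
    """
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def map_to_2d(point, weights=None, scale=1.0):
    """Map an n-dimensional point to 2-D as a weighted sum of unit vectors.

    `weights` defaults to 0.5 for every coordinate (the initial setting);
    `scale` is a hypothetical stand-in for the centralizing ratio.
    """
    n = len(point)
    if weights is None:
        weights = [0.5] * n
    x2d, y2d = 0.0, 0.0
    for w, value, (sx, sy) in zip(weights, point, unit_vectors(n)):
        x2d += w * scale * value * sx
        y2d += w * scale * value * sy
    return (x2d, y2d)

# A point whose coordinates are all equal lands (numerically) on the center.
center_like = map_to_2d([3.0, 3.0, 3.0, 3.0])
# A point with a single non-zero coordinate lands on that coordinate's axis.
on_axis = map_to_2d([1.0, 0.0, 0.0, 0.0])
```

Because the map is linear in the coordinates, scaling an input vector scales its image by the same factor, which is what the later discussion of "same pattern" vectors relies on.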
Figure 2: Mapping from $n$-dimensional space to 2-dimensional space. (A) A data point $P$ in the original space. (B) $P'$ is the corresponding point, obtained with the mapping function (Formula (1)), in the 2-dimensional display space. The red point marked as (0,0) is the center of the display screen; $S_i$ ($i = 1, 2, \ldots, n$) are unit vectors which divide the unit circle of the display screen equally.
Our initial setting is $w_i = 0.5$ for all $i = 1, 2, \ldots, n$, which means each coordinate of the original space contributes equally in the initial mapping. Under this setting, it is easy to see that the point $(0, 0, \ldots, 0)$ in the original $n$-dimensional space is mapped to $(0,0)$, the center of the 2-dimensional display space, by the mapping function (Formula (1)). In addition, any point of the form $(a, a, \ldots, a)$ is also mapped to the center (Figure 3 (A)), because the equally spaced unit vectors sum to zero. Another property of the initial setting is that it preserves the correlation coefficient similarity of the original data vectors. Because the correlation coefficient depends only on shape and not on the absolute magnitude of the vector, it is a better similarity measure than Euclidean distance [32, 4]. The correlation coefficient between two vectors $X$ and $Y$ is:
\[ cor(X, Y) = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left( n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 \right) \left( n \sum_{i=1}^{n} y_i^2 - \left( \sum_{i=1}^{n} y_i \right)^2 \right)}}, \tag{2} \]

where

$X = (x_1, x_2, \ldots, x_n)$,
$Y = (y_1, y_2, \ldots, y_n)$,
$\sum_{i=1}^{n} x_i y_i$ = sum of the products of corresponding $x$ and $y$ values,
$\sum_{i=1}^{n} x_i$ = sum of the $x$ values,
$\sum_{i=1}^{n} y_i$ = sum of the $y$ values,
$\sum_{i=1}^{n} x_i^2$ = sum of the squared $x$ values,
$\sum_{i=1}^{n} y_i^2$ = sum of the squared $y$ values.

If $X$ and $Y$ have the same pattern, which means the ratios of each pair of coordinates of $X$ and $Y$ are all equal (Equation (3)), their correlation coefficient will be 1. Using our mapping function, these two vectors will be mapped onto a straight line across the center in the 2-dimensional display space, and all other vectors which have the same pattern as $X$ and $Y$ will also be mapped onto that line (Figure 3 (B)), even if their Euclidean distances in the original space are very large.

\[ \frac{x_1}{y_1} = \frac{x_2}{y_2} = \frac{x_3}{y_3} = \cdots = \frac{x_n}{y_n}. \tag{3} \]

Figure 3: Some properties of our mapping function. (A) Every point whose coordinates in the original space can be represented as $(a, a, \ldots, a)$ will be mapped to the center of the 2-dimensional display space. (B) Points that have the same pattern, which means the ratios of each pair of coordinates in the original space are all equal, will be mapped onto a straight line across the center in the 2-dimensional display space.
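The two properties above can be checked numerically. The sketch below is only illustrative: `correlation` follows the standard Pearson form of Formula (2), and `map_to_2d` is a compact, assumed version of the mapping with equal weights of 0.5 and no extra scaling. The vectors `x` and `y` are made-up examples with $y = 3x$.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient written in the sum form of Formula (2)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

def map_to_2d(point, weight=0.5):
    """Assumed equal-weight mapping onto n equally spaced unit vectors."""
    n = len(point)
    px = sum(weight * v * math.cos(2 * math.pi * i / n) for i, v in enumerate(point))
    py = sum(weight * v * math.sin(2 * math.pi * i / n) for i, v in enumerate(point))
    return (px, py)

x = [1.0, 4.0, 2.0, 8.0]
y = [3.0 * v for v in x]        # same pattern: every ratio y_i / x_i equals 3

r = correlation(x, y)           # expected to be 1 up to rounding error
px, py = map_to_2d(x)
qx, qy = map_to_2d(y)
# The 2-D cross product of the two images is zero exactly when they lie on
# one straight line through the center (0, 0).
cross = px * qy - py * qx
```

Although `x` and `y` are far apart in Euclidean terms, their images differ only by the factor 3 along the same line through the center.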
2.2 Parameter Adjustment
Our visualization tool allows the user to adjust the weight of each coordinate from $-1$ to $1$ to change the data distribution in the display space, both manually and automatically. Mapping from a higher-dimensional space to a lower-dimensional space may not preserve all the properties of the dataset. By adjusting the coordinate weights of the dataset, the data's original static state is changed into a dynamic state, which may be used to compensate for the information loss from mapping. For example, the two points $(100, 100, \ldots, 100)$ and $(0, 0, \ldots, 0)$ are far apart in the original space, but under the initial setting they are both mapped to the center of the 2-dimensional display screen. Whenever any weight $w_i$ in Formula (1) is changed, these two points will be separated. That is, $(0, 0, \ldots, 0)$ will still be at the center but $(100, 100, \ldots, 100)$ will no longer be mapped to the center. Figure 4 shows another example, where the original dataset has 268 points (vectors) which can be divided into two clusters, marked as hollow red circles and filled blue circles. When mapping with the initial setting, the cluster boundary is not clear enough: some parts of the two clusters overlap. After the weights of some coordinates are changed, the data distribution also changes, and the two clusters are separated from each other.
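The $(100, \ldots, 100)$ versus $(0, \ldots, 0)$ example can be reproduced with a small sketch. The mapping function below is an assumed, unscaled version of Formula (1), and the choice of four dimensions and of raising the first weight to 1.0 is purely illustrative.

```python
import math

def map_to_2d(point, weights):
    """Assumed form of the mapping: weighted sum over equally spaced unit vectors."""
    n = len(point)
    px = sum(w * v * math.cos(2 * math.pi * i / n)
             for i, (w, v) in enumerate(zip(weights, point)))
    py = sum(w * v * math.sin(2 * math.pi * i / n)
             for i, (w, v) in enumerate(zip(weights, point)))
    return (px, py)

far_point = [100.0, 100.0, 100.0, 100.0]
origin = [0.0, 0.0, 0.0, 0.0]

initial = [0.5, 0.5, 0.5, 0.5]
before = (map_to_2d(far_point, initial), map_to_2d(origin, initial))

# Raise only the first weight; the all-zero point cannot move, but the
# all-100 point is pushed along the first unit vector.
adjusted = [1.0, 0.5, 0.5, 0.5]
after = (map_to_2d(far_point, adjusted), map_to_2d(origin, adjusted))

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

gap_before = distance(*before)   # both points sit at the center
gap_after = distance(*after)     # the two points are now separated
```

With these illustrative numbers the first point moves to roughly (50, 0) while the origin stays at the center.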
Since the original dimension may be very high, up to several thousand, manually changing the weight of each dimension to find the best combination is impractical. For this reason our tool supports automatic changing of weights. The user only needs to set the adjustment direction (ascending or descending) and the changing step for each dimension. The tool then plays an animation that shows the data distribution while the weights are changed automatically. The user can read off all the weights when the ideal distribution is reached.
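One way to picture the automatic mode is as a generator of successive weight vectors, one per animation frame. The sketch below is hypothetical: the function name, the encoding of directions as +1 (ascending) or -1 (descending), and the clipping of weights to the interval [-1, 1] are assumptions made for illustration, not a description of the tool's internals.

```python
def animate_weights(weights, directions, steps, n_frames, low=-1.0, high=1.0):
    """Yield one weight vector per frame.

    weights    -- starting weight of each dimension
    directions -- +1 for ascending, -1 for descending (assumed encoding)
    steps      -- changing step of each dimension
    low, high  -- assumed bounds on the weights
    """
    current = list(weights)
    for _ in range(n_frames):
        current = [
            min(high, max(low, w + d * s))   # move one step, then clip
            for w, d, s in zip(current, directions, steps)
        ]
        yield list(current)

# Two dimensions: the first weight ascends by 0.1 per frame,
# the second descends by 0.2 per frame.
frames = list(animate_weights([0.5, 0.5], [+1, -1], [0.1, 0.2], n_frames=3))
```

In a real tool each yielded vector would be passed to the mapping function and the display redrawn; the user stops the animation and keeps the current weights once the distribution looks right.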
3 Data Analyzing Approach
Given a gene expression matrix which has thousands of genes and several dozen samples, our task is to build a classifier that predicts the sample labels (classes) from the expression profile, using information already known from the experiment, such as the diseased/normal attributes of the samples. A very common method [32] is first to reduce the number of genes in the gene dimension, which means finding important genes that are more closely related to such idealized patterns, and then to assign weights derived from the first step to the remaining genes to construct a "class predictor". How many genes are deleted is usually decided from experience. Researchers usually apply a fixed number to every matrix, but the best number might be different for each dataset. How to choose the best weight is another open problem.

Figure 4: Effect of weight adjustment. The circles on the left show the data distribution in the display space; the sliders on the right show the weight adjustment environment. (A) Mapping result of a dataset using the initial setting. The dataset includes two clusters, marked as hollow red circles and filled blue circles. (B) After the weights of some coordinates are changed, the data distribution also changes, and the two clusters are separated from each other.

Figure 5: Different kinds of gene patterns. Assume that the samples have two classes. If we want to construct a "class predictor" from the gene matrix, genes with a pattern like (a) give the correct information and are called "important genes". Pattern (b) means useless genes. Pattern (c) is noise in the dataset. So our task is to select (a) and to remove (b) and (c) from the set of genes.
3.1 Normalization
In the gene expression matrix, different genes have different ranges of intensity values. The intensity values alone may not have significant meaning, but the relative changing levels are more intrinsic. So we should first normalize the original gene intensity values into relative changing levels. Our general formula is

\[ \Delta x_{ij} = \frac{x_{ij} - (\bar{x}_j \cdot b)}{\bar{x}_j \cdot b}, \quad \text{where } \bar{x}_j = \frac{\sum_{i \in S} x_{ij}}{|S|}, \tag{4} \]

where $\Delta x_{ij}$ denotes the changing level for gene $j$ of sample $i$, $x_{ij}$ represents the original intensity value for gene $j$ of sample $i$, $b$ is a parameter, and $\bar{x}_j$ is the mean of the intensity values for gene $j$ over all samples $S$.
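A minimal sketch of this normalization is given below, assuming that Formula (4) measures each intensity relative to $b$ times the gene's mean over all samples, and that intensities are positive so the mean is never zero. The function name and the layout of the input (one list per sample, one entry per gene) are illustrative choices.

```python
def normalize(samples, b=1.0):
    """Convert raw intensities into relative changing levels.

    samples -- list of sample vectors; samples[i][j] is the intensity of
               gene j in sample i (assumed positive)
    b       -- the adjustable parameter of the normalization formula
    """
    n_samples = len(samples)
    n_genes = len(samples[0])
    # Mean intensity of each gene over all samples.
    gene_means = [sum(sample[j] for sample in samples) / n_samples
                  for j in range(n_genes)]
    return [
        [(sample[j] - gene_means[j] * b) / (gene_means[j] * b)
         for j in range(n_genes)]
        for sample in samples
    ]

# Two samples, two genes with very different intensity ranges.
raw = [[2.0, 10.0],
       [4.0, 30.0]]
levels = normalize(raw)
```

With $b = 1$ each value becomes its relative deviation from the gene mean, so the two genes in the example end up on comparable scales despite their different raw ranges.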
3.2 Selecting Important Genes
Notice that among thousands of genes, not all of them contribute equally to distinguishing the classes. In fact, some genes contribute little or just represent noise in the matrix (Figure 5). We need to remove those genes.
Assume there are $n$ genes and $m$ samples, where each gene vector (after normalization) is denoted as

\[ g_j = (\Delta x_{1j}, \Delta x_{2j}, \ldots, \Delta x_{kj}, \Delta x_{(k+1)j}, \ldots, \Delta x_{mj}), \quad \text{where } j = 1, 2, \ldots, n \text{ for each gene.} \tag{5} \]

Without loss of generality, we assume the first $k$ samples belong to one class while the remaining samples belong to another class. The ideal gene, which is highly correlated with the sample distribution, should match the pattern $\rho = (1, 1, \ldots, 1, 0, \ldots, 0)$ (the first $k$ numbers are "1", followed by $(m-k)$ "0"s) or $\rho' = (0, 0, \ldots, 0, 1, \ldots, 1)$ (the first $k$ numbers are "0", followed by $(m-k)$ "1"s). We calculate the correlation coefficient (Formula (2)) between each gene vector and the predefined stable pattern $\rho$, then sort the genes by these correlation coefficients in descending order, which exactly matches the ascending order obtained when sorting by the correlation coefficients with the other pattern $\rho'$. We know that a certain number of genes from the top and the bottom of this sequence should be chosen as the "important genes", but the number is usually decided from experience, and the best number differs between datasets. Using our tool, we can make this judgment flexibly. First, we map the whole dataset into the 2-dimensional display space. We then remove "unimportant" genes from the middle of the sequence one by one. When the boundary between the clusters of samples can be clearly distinguished, we stop this procedure. The remaining genes are chosen as "important" genes (Figure 6). The whole procedure is integrated into our visualization tool. Because our mapping function preserves correlation coefficient similarity, samples with similar patterns will be mapped close to each other.
Figure 6: Procedure of removing "unimportant" genes from gene expression data of 28 samples which belong to two clusters. (A) Original distribution of the 28 samples mapped from 4132-gene vectors. (B) Sample distribution after reduction to 120 genes. (C) Sample distribution after reduction to 70 genes. There is a clear boundary between the two clusters.
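The ranking-and-pruning step can be sketched as follows. This is a simplified illustration under stated assumptions: in the tool the stopping point is chosen visually, so the fixed target `n_keep` used here is only a stand-in, and treating a constant gene vector as having correlation 0 is a convention introduced for this sketch. All function names are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation; returns 0.0 for a constant vector (assumed convention)."""
    n = len(x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x, sum_y = sum(x), sum(y)
    var_x = n * sum(a * a for a in x) - sum_x ** 2
    var_y = n * sum(b * b for b in y) - sum_y ** 2
    if var_x <= 0 or var_y <= 0:
        return 0.0
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(var_x * var_y)

def rank_genes(genes, k):
    """Sort gene indices by correlation with (1,...,1,0,...,0), descending.

    genes -- list of gene vectors, each holding one value per sample
    k     -- number of samples in the first class
    """
    m = len(genes[0])
    pattern = [1.0] * k + [0.0] * (m - k)
    scores = [correlation(g, pattern) for g in genes]
    return sorted(range(len(genes)), key=lambda j: scores[j], reverse=True)

def select_important(genes, k, n_keep):
    """Drop genes from the middle of the ranking until n_keep remain."""
    order = rank_genes(genes, k)
    while len(order) > n_keep:
        order.pop(len(order) // 2)   # the middle holds the least informative genes
    return order

genes = [
    [5.0, 6.0, 1.0, 0.0],   # high in class 1, low in class 2
    [0.0, 1.0, 5.0, 6.0],   # low in class 1, high in class 2
    [1.0, 2.0, 1.0, 2.0],   # unrelated to the class split
]
kept = select_important(genes, k=2, n_keep=2)
```

The two genes that survive are the ones at the extremes of the ranking, matching the idea that both strongly positive and strongly negative correlations with the class pattern are informative.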
3.3 Weight Adjustment
After selecting a subset of important genes, the next step is to build a class predictor based on these genes. Usually this predictor has the form

\[ (w_1 x_1, w_2 x_2, \ldots, w_q x_q), \tag{6} \]

where $x_i$ is the value of the $i$-th selected gene in the sample to classify, $w_i$ is the weight of that gene, and $q$ is the number of selected genes.

Now the problem is how to decide the weight of each gene. We can directly use the correlation coefficient with the pattern $\rho = (1, 1, \ldots, 1, 0, \ldots, 0)$ or $\rho' = (0, 0, \ldots, 0, 1, \ldots, 1)$ as the weight, or obtain the weights from the parameter adjustment function of our tool. With our tool, we can manually adjust the weight of each gene as illustrated in Figure 4, or let the tool perform automatic adjustment. For automatic adjustment, we present a measure that evaluates the quality of the distribution of two clusters, and use this measure to decide when to stop the adjustment procedure.
First we define the center of each sample cluster $S$ as $\bar{S}$ using Formula (7):

\[ \bar{S} = (cs_1, cs_2, \ldots, cs_n), \quad \text{where } cs_i = \frac{1}{m} \sum_{P_g \in S} x_{gi}. \tag{7} \]

Then for each sample cluster, we define a value $\sigma(S)$ to measure how tightly the points in this cluster gather, using Formula (8):

\[ \sigma(S) = \sqrt{\frac{1}{m-1} \sum_{P_g \in S} \left| P_g - \bar{S} \right|^2} = \sqrt{\frac{1}{m-1} \sum_{P_g \in S} \sum_{i=1}^{n} (x_{gi} - cs_i)^2}, \tag{8} \]

where $m = |S|$ is the number of samples in this cluster.

This measure is similar to the standard deviation in one-dimensional space. If the distribution of points in sample cluster $S$ is sparse, $\sigma(S)$ will be large; otherwise it is small. If we have two clusters of samples, denoted as $S_1$ and $S_2$, we want each cluster to gather tightly and the distance between the two clusters to be as large as possible. So we present another measure to describe the quality of the cluster distribution, Formula (9):

\[ c = \frac{\sigma(S_1) + \sigma(S_2)}{\left| \bar{S}_1 \bar{S}_2 \right|}, \tag{9} \]

where $\bar{S}_1$ is the center of $S_1$, $\bar{S}_2$ is the center of $S_2$, and $|\bar{S}_1 \bar{S}_2|$ is the distance between $\bar{S}_1$ and $\bar{S}_2$.

We want the distance between the two clusters to be as large as possible and the sparseness to be as low as possible, so the best distribution is reached when the value of $c$ is smallest. We apply this measure in the 2-dimensional display space to evaluate the sample distribution while adjusting the weights. Note that trying all combinations of the weights is impractical because of the exponential time complexity. Instead, we stop the adjustment procedure when the point distribution reaches a local minimum of $c$, and keep that combination of weights.
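The cluster quality measure can be written down directly for points in the 2-dimensional display space. The sketch below assumes the measure is the sum of the two clusters' spreads divided by the distance between their centers, with the spread normalized by $m - 1$; the function names and the toy clusters are illustrative.

```python
import math

def center(points):
    """Coordinate-wise mean of a cluster of points."""
    m = len(points)
    return tuple(sum(p[i] for p in points) / m for i in range(len(points[0])))

def spread(points):
    """Gather degree of a cluster: an analogue of the sample standard deviation."""
    m = len(points)
    c = center(points)
    total = sum(sum((p[i] - c[i]) ** 2 for i in range(len(c))) for p in points)
    return math.sqrt(total / (m - 1))

def quality(cluster_1, cluster_2):
    """Smaller is better: tight clusters whose centers are far apart."""
    c1, c2 = center(cluster_1), center(cluster_2)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
    return (spread(cluster_1) + spread(cluster_2)) / gap

# Two toy clusters in display space, centered at (0, 1) and (4, 1).
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(4.0, 0.0), (4.0, 2.0)]
score = quality(left, right)

# Moving the second cluster farther away lowers (improves) the score.
farther = [(8.0, 0.0), (8.0, 2.0)]
better_score = quality(left, farther)
```

An automatic adjustment loop would recompute this score after each weight change and stop at the first point where no single step lowers it further, which corresponds to the local minimum described above.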
Figure 7: Experimental results for the MS IFN group and the CONTROL MS group. Red circles denote samples belonging to the MS group, green circles represent samples in the IFN group, and blue circles denote CONTROL samples. Samples pointed to by arrows are wrongly classified. The straight lines across the big circle denote the "class predictors", which are built using the cross-validation method.
4 Experimental Results
The experiments are based on two datasets: the MS IFN group and the CONTROL MS group. The MS IFN group contains 28 samples (14 MS samples and 14 IFN samples) while the CONTROL MS group contains 30 samples (15 control samples and 15 MS samples). Each sample has 4132 genes.
During the data processing procedure, by sorting genes using the correlation coefficients of Formula (2), we select 70 genes (Figure 6 (C)) for the MS IFN group and 58 genes for the CONTROL MS group. We then adjust the weights of the genes in these two groups using the automatic adjustment function of our tool.
We use the K-means clustering method to evaluate the "weighted class predictor" obtained from the analysis procedure. Here we choose the cross-validation method [32] to evaluate each group. In each group, we withhold one sample, use the remaining samples of the group to select important genes, and build a class predictor. We then predict the class of the withheld sample. The process is repeated for each sample, and the cumulative error rate is calculated. For the MS IFN group, the samples in the IFN group were all predicted correctly, but one sample in the MS group was incorrectly classified. For the CONTROL MS group, the samples in the MS group were all predicted correctly, but five samples in the CONTROL group were wrongly classified. Figure 7 shows the evaluation result of our "weighted class predictor" for these two groups using the cross-validation method.
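The leave-one-out procedure can be sketched generically. In the code below, `leave_one_out_errors` follows the withheld-sample loop described above, while the nearest-centroid `train_centroids` and `predict_nearest` functions are a deliberately simple stand-in for the gene selection and weighted class predictor, not the predictor used in the experiments. The toy data are made up.

```python
def leave_one_out_errors(samples, labels, train, predict):
    """Count misclassified samples when each sample is withheld in turn."""
    errors = 0
    for i in range(len(samples)):
        # Train on every sample except the withheld one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors

def train_centroids(xs, ys):
    """Stand-in trainer: one centroid per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(col) for col in zip(*pts)]
            for y, pts in groups.items()}

def predict_nearest(model, x):
    """Stand-in predictor: the class whose centroid is closest to x."""
    return min(model,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

samples = [[0.0], [0.2], [1.0], [1.2]]
labels = ["MS", "MS", "IFN", "IFN"]
n_errors = leave_one_out_errors(samples, labels, train_centroids, predict_nearest)
error_rate = n_errors / len(samples)
```

The important detail is that model building, including any gene selection, happens inside the loop on the remaining samples only, so the withheld sample never influences its own prediction.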
5 Conclusion
In this paper, we presented a visualization method which maps the samples' $n$-dimensional gene vectors into 2-dimensional points. This mapping preserves correlation coefficient similarity, which is the most suitable similarity measure for analyzing gene expression data. Our analysis method first removes noise genes from the gene expression matrix by sorting genes according to a correlation coefficient measure, and then adjusts the weight of each remaining gene. We also presented a measure to judge the quality of the cluster distribution. We have integrated our gene analysis algorithm into a visualization tool based on the mapping method. We can use this tool to monitor the analysis procedure, to adjust parameters dynamically, and to evaluate the result of each adjustment step. Our approach takes advantage of data analysis and dynamic visualization methods to reveal correlated patterns in gene expression data. In particular, we used the above approach to distinguish healthy control, MS and IFN-treated samples based on data collected from DNA microarray experiments. Our experiments demonstrate that this is a promising approach for the analysis and visualization of gene array datasets.
References
[1] A. Ben-Dor, N. Friedman, and Z. Yakhini. Class discovery in gene expression data. In Proc. Fifth Annual Inter. Conf. on Computational Molecular Biology (RECOMB 2001), 2001.
[2] Alvis Brazma and Jaak Vilo. Minireview: Gene expression data analysis. Federation of European Biochemical Societies, 480:17–24, June 2000.
[3] Amir Ben-Dor, Ron Shamir and Zohar Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6(3/4):281–297, 1999.
[4] Anna Jorgensen. Clustering excipient near infrared spectra using different chemometric methods. Technical report, Dept. of Pharmacy, University of Helsinki, 2000.
[5] Ash A. Alizadeh, Michael B. Eisen, R. Eric Davis, Chi Ma, Izidore S. Lossos, Andreas Rosenwald, Jennifer C. Boldrick, Hajeer Sabet, Truc Tran, Xin Yu, John I. Powell, Liming Yang, Gerald E. Marti et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, Vol. 403:503–511, February 2000.
[6] Charles M. Perou, Stefanie S. Jeffrey, Matt Van De Rijn, Christian A. Rees, Michael B. Eisen, Douglas T. Ross, Alexander Pergamenschikov, Cheryl F. Williams, Shirley X. Zhu, Jeffrey C. F. Lee, Deval Lashkari, Dari Shalon, Patrick O. Brown, and David Botstein. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA, Vol. 96(16):9212–9217, August 1999.
[7] D. Bhadra and A. Garg. An interactive visual framework for detecting clusters of a multidimensional dataset. Technical Report 2001-03, Dept. of Computer Science and Engineering, University at Buffalo, NY, 2001.
[8] D. Shalon, S.J. Smith, P.O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6:639–645, 1996.
[9] Elisabetta Manduchi, Gregory R. Grant, Steven E. McKenzie, G. Christian Overton, Saul Surrey and Christian J. Stoeckert Jr. Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics, Vol. 16(8):685–698, 2000.
[10] Francisco Azuaje. Making genome expression data meaningful: Prediction and discovery of classes of cancer through a connectionist learning approach, 2000.
[11] Gad Getz, Erel Levine and Eytan Domany. Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, Vol. 97(22):12079–12084, October 2000.
[12] G.S. Michaels, D.B. Carr, M. Askenazi, S. Fuhrman, X. Wen and R. Somogyi. Cluster analysis and data visualization of large-scale expression data. In Pac Symposium of Biocomputing, volume 3, pages 42–53, 1998.
[13] Hartigan J.A. Clustering Algorithm. John Wiley and Sons, New York, 1975.
[14] J. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, J.M. Trent. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics, 14:457–460, 1996.
[15] Javier Herrero, Alfonso Valencia, and Joaquin Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17:126–136, 2001.
[16] J.J. Chen, R. Wu, P.C. Yang, J.Y. Huang, Y.P. Sher, M.H. Han, W.C. Kao, P.J. Lee, T.F. Chiu, F. Chang, Y.W. Chu, C.W. Wu, K. Peck. Profiling expression patterns and isolating differentially expressed genes by cDNA microarray system with colorimetry detection. Genomics, 51:313–324, 1998.
[17] J.L. DeRisi, V.R. Iyer and P.O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, pages 680–686, 1997.
[18] Johannes Schuchhardt, Dieter Beule, Arif Malik, Eryc Wolski, Holger Eickhoff, Hans Lehrach and Hanspeter Herzel. Normalization strategies for cDNA microarrays. Nucleic Acids Research, Vol. 28(10), 2000.
[19] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 1984.
[20] M. Schena, D. Shalon, R.W. Davis, P.O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467–470, 1995.
[21] Mark Schena, Dari Shalon, Renu Heller, Andrew Chai, Patrick O. Brown, and Ronald W. Davis. Parallel human genome analysis: Microarray-based expression monitoring of 1000 genes. Proc. Natl. Acad. Sci. USA, Vol. 93(20):10614–10619, October 1996.
[22] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868, 1998.
[23] Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Terrence S. Furey, Manuel Ares Jr. and David Haussler. Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl. Acad. Sci., 97(1):262–267, January 2000.
[24] O. Ermolaeva, M. Rastogi, K.D. Pruitt, G.D. Schuler, M.L. Bittner, Y. Chen, R. Simon, P. Meltzer, J.M. Trent, M.S. Boguski. Data management and analysis for gene expression arrays. Nature Genetics, 20:19–23, 1998.
[25] Orly Alter, Patrick O. Brown and David Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA, Vol. 97(18):10101–10106, August 2000.
[26] Pablo Tamayo, Donna Slonim, Jill Mesirov, Qing Zhu, Sutisak Kitareewan, Ethan Dmitrovsky, Eric S. Lander and Todd R. Golub. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA, Vol. 96(6):2907–2912, March 1999.
[27] R.A. Heller, M. Schena, A. Chai, D. Shalon, T. Bedilion, J. Gilmore, D.E. Woolley, R.W. Davis. Discovery and analysis of inflammatory disease-related genes using cDNA microarrays. Proc. Natl. Acad. Sci. USA, 94:2150–2155, 1997.
[28] S. Tavazoie, D. Hughes, M.J. Campbell, R.J. Cho and G.M. Church. Systematic determination of genetic network architecture. Nature Genet, pages 281–285, 1999.
[29] Shumei Jiang, Chun Tang, Li Zhang, Aidong Zhang and Murali Ramanathan. A maximum entropy approach to classifying gene array datasets. In Proc. of Workshop on Data Mining for Genomics, First SIAM International Conference on Data Mining, 2001.
[30] S.M. Welford, J. Gregg, E. Chen, D. Garrison, P.H. Sorensen, C.T. Denny, S.F. Nelson. Detection of differentially expressed genes in primary tumor tissues using representational differences analysis coupled to microarray hybridization. Nucleic Acids Research, 26:3059–3065, 1998.
[31] Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen M.B., Brown P.O., Botstein D., Futcher B. Exploring the metabolic and genetic control of gene expression on a genomic scale. Mol. Biol. Cell, page 3273, 1998.
[32] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gassenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, D.D. Bloomfield and E.S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, Vol. 286(15):531–537, October 1999.
[33] Trevor Hastie, Robert Tibshirani, David Botstein and Patrick Brown. Supervised harvesting of expression trees. Genome Biology, Vol. 2(1):0003.1–0003.12, January 2001.
[34] U. Alon, N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack and A.J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc. Natl. Acad. Sci. USA, Vol. 96(12):6745–6750, June 1999.
[35] V. Yong, S. Chabot, Q. Stuve and G. Williams. Interferon beta in the treatment of multiple sclerosis: mechanisms of action. Neurology, 51:682–689, 1998.
[36] V.R. Iyer, M.B. Eisen, D.T. Ross, G. Schuler, T. Moore, J.C.F. Lee, J.M. Trent, L.M. Staudt, J. Hudson Jr., M.S. Boguski, D. Lashkari, D. Shalon, D. Botstein, P.O. Brown. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.
[37] Y. Barash and N. Friedman. Context-specific Bayesian clustering for gene expression data. Bioinformatics, RECOM01, 2001.
[38] Yang Y.H., Dudoit S., Luu P. and Speed T.P. Normalization for cDNA microarray data. In Proceedings of SPIE BiOS 2001, San Jose, California, January 2001.