TimesVector: A vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes
Inuk Jung, Hongryul Ahn, Kyuri Jo, Hyejin Kang, Youngjae Yu and Sun Kim
Inuk Jung ([email protected]), Bio and Health Informatics Lab, Seoul National University
Goal of this study
Identify biologically meaningful gene clusters (triclusters) that have significantly similar or differential expression patterns from 3-dimensional time series data (Gene-Time-Condition)
Example
Organism: Mouse (18,117 genes)
Time points: day 0, day 3, day 7, day 14
Conditions: malaria-infected intact female, gonadectomized* (gdx) female, intact male, gdx male
289,872 expression values (G×T×C)
[Figure: example clusters of 100, 80 and 200 genes showing Differentially Expressed Patterns (DEP) and a Similarly Expressed Pattern (SEP)]
*Removal of ovaries or testes
Two technical problem statements
1. High clustering complexity by dimensions
2. Technical difficulty to capture differential expression patterns between two or more conditions (What are DEGs in time series data?)
P1. High clustering complexity by dimensions
• One dimension (C): DEG analysis used for time series analysis [1] (2000)
  • Does not take into account the sequential nature of time series expression data
• Two dimensions (GT, or GC): Biclustering algorithm developed for time series data [2] (2000)
  • Biclustering is NP-hard and is bound to 2-dimensional clustering (either gene-time or gene-condition)
• Three dimensions (GCT): First triclustering algorithm developed, TriCluster [3] (2005)
  • Only able to identify triclusters with similar expression patterns (SEP)
• Three dimensions (GCT): Triclustering tool that is able to identify DEPs [4] (2012)
  • Identification process of DEP is based on similarity measures, which gives poor performance
[1] Alizadeh et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 2000
[2] Cheng et al., Biclustering of expression data, ISMB 2000
[3] Zhao et al., The Tricluster algorithm, ACM SIGMOD 2005
[4] Tchagang et al., The OPTricluster algorithm, BMC Bioinformatics 2012
P2. Capturing differential expression patterns between two or more conditions
• Divergent pattern recognition is not available
• Expression pattern differs between all patterns
• OPTricluster performs a pairwise comparison for detecting divergent expression pattern clusters
• In case of four conditions (A, B, C, D):
  • A vs BCD, B vs ACD, C vs ABD, D vs ABC
  • Hence A != B != C != D is not supported
TimesVector Framework
Clustering
Detecting patterns
Clustering – Dimension reduction
• Dimension reduction by stripping away the sample dimension and concatenating it to the time dimension
• Takes the burden off the clustering and post-processing procedures
• No information is lost
[Figure: the 3-dimensional G×C×T matrix (genes (i) × time (j), one slice per sample s1, s2, …, sk) is concatenated by sample into a 2-dimensional G×CT matrix (genes (i) × time (j)·conditions (k))]

G×CT matrix
        s1              s2              …    sk
        t1   t2   t3    t1   t2   t3         t1   t2   t3
g1      15   20   10    15   10    5    …    25   23   22
g2      39   52   31    35   22   12    …    55   52   48
g3       8   16    6     7    3    1    …    20   18   17
…        …    …    …     …    …    …    …     …    …    …
gi      25   23   25    14   15   13    …    17   16   16
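The concatenation step can be sketched in a few lines of plain Python. This is only an illustration, not the TimesVector implementation: the function name `concatenate_conditions` and the nested-list layout (gene, then sample, then time point) are assumptions, and the numbers are the s1 and s2 values of g1 to g3 from the example matrix.

```python
def concatenate_conditions(cube):
    """Flatten a G x C x T nested list into G row vectors of length C*T."""
    return [[value for sample in gene for value in sample] for gene in cube]


# Hypothetical layout: cube[gene][sample][time point]
cube = [
    [[15, 20, 10], [15, 10, 5]],   # g1: s1, s2
    [[39, 52, 31], [35, 22, 12]],  # g2: s1, s2
    [[8, 16, 6], [7, 3, 1]],       # g3: s1, s2
]

flat = concatenate_conditions(cube)
print(flat[0])  # [15, 20, 10, 15, 10, 5]
```

Every value of the 3-dimensional input appears exactly once in the 2-dimensional output, which is why no information is lost.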
Spherical K-means clustering
• Spherical K-means (skmeans) for clustering the vectors
  • A K-means clustering algorithm with cosine similarity as its distance metric
  • Vectors are normalized to unit vectors; this causes projection of the vectors to a sphere
• Minimize the cosine dissimilarity in all clusters
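$\min \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik} \left(1 - \cos(x_i, p_k)\right)$
$\mu_{ik}$: indicator of gene $i$ having membership to cluster $k$
$p_k$: the centroid of cluster $k$
$x_i$: expression level vector of gene $i$
$N$: total number of genes; $K$: total number of clusters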
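A minimal sketch of spherical K-means, assuming the usual formulation: normalize every vector to unit length, assign each vector to the centroid with the highest cosine similarity, and re-normalize the cluster means. The function names, the random initialization and the toy data are illustrative assumptions, not the skmeans package used in the study.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, n_iter=50, seed=0):
    """Cluster unit vectors by cosine similarity; returns (labels, centroids)."""
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    centroids = [list(c) for c in rng.sample(data, k)]
    labels = [-1] * len(data)
    for _ in range(n_iter):
        # For unit vectors the dot product equals the cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(x, centroids[j])) for x in data]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = normalize([sum(col) / len(members) for col in zip(*members)])
    return labels, centroids


# Toy data: three rising and three falling profiles with different magnitudes.
genes = [[1, 2, 3], [2, 4, 6.5], [10, 20, 31], [3, 2, 1], [6, 4, 2.2], [30, 21, 10]]
labels, centroids = spherical_kmeans(genes, k=2)
print(labels)
```

Because only the direction of each vector matters, genes with the same trend but very different expression magnitudes end up in the same cluster.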
Selecting K by silhouette score
• Using four microarray and RNA-seq time-series data, the K with the highest silhouette score was chosen
[Figure: optimal K (y axis, 0 to 700) plotted against Condition × Time points (C×T, x axis, 0 to 30)]

Data                C    T    C×T    K
GSE74465 (Rice)     2    3    6      100
GSE11651 (Yeast)    5    3    15     200
GSE4324 (Mouse)     4    4    16     500
GSE39429 (Rice)     4    6    24     600
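Choosing K by silhouette score can be sketched as follows. This assumes the standard silhouette definition with cosine distance; the candidate labelings are hand-written stand-ins for the output of a clustering run at each K, and all names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_score(vectors, labels):
    """Mean silhouette width with cosine distance (needs at least two clusters)."""
    widths = []
    for i, v in enumerate(vectors):
        per_cluster = {}
        for j, w in enumerate(vectors):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(cosine_distance(v, w))
        own = per_cluster.get(labels[i])
        if not own:  # singleton cluster: its width is defined as 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(sum(d) / len(d) for c, d in per_cluster.items() if c != labels[i])
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)


genes = [[1, 2, 3], [2, 4, 6.5], [10, 20, 31], [3, 2, 1], [6, 4, 2.2], [30, 21, 10]]
# Hypothetical cluster assignments for K=2 and K=3.
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 2, 1, 1, 1]}
scores = {k: silhouette_score(genes, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 2
```

In practice the loop would run the clustering once per candidate K and keep the K whose labeling scores highest.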
Detecting clusters with distinct expression patterns
• Re-introduce the condition dimension by splitting vectors by conditions
• The bZIP gene vector is dissected into the number of conditions
v(bZIP) = <1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3>
[Figure: the bZIP vector dissected into conditions A (0h, 1h, 6h), B (0h, 1h, 6h), C (0h, 1h, 6h) and D (0h, 1h, 6h), plotted together with the centroid]
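Dissecting the bZIP vector back into per-condition vectors can be sketched as below. The sketch assumes the vector is ordered condition by condition (all time points of A first, then B, and so on), which matches the concatenation used for clustering; the function name `dissect` is hypothetical.

```python
def dissect(vector, conditions, n_timepoints):
    """Split a concatenated vector into one sub-vector per condition."""
    if len(vector) != len(conditions) * n_timepoints:
        raise ValueError("vector length must equal conditions x time points")
    return {
        cond: vector[i * n_timepoints:(i + 1) * n_timepoints]
        for i, cond in enumerate(conditions)
    }


v_bzip = [1, 1, 1, 3, 3, 3, 3.5, 2.5, 3, 3.7, 2.2, 3]
parts = dissect(v_bzip, ["A", "B", "C", "D"], n_timepoints=3)  # 0h, 1h, 6h
print(parts["A"])  # [1, 1, 1]
print(parts["C"])  # [3.5, 2.5, 3]
```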
Three types of patterns are defined
• DEP (Differentially Expressed Pattern)
  • All samples in a cluster have different expression patterns
• ODEP (One Differentially Expressed Pattern)
  • One sample in a cluster has a different expression pattern from the others
• SEP (Similarly Expressed Pattern)
  • All samples have a similar expression pattern in a cluster
Method – DEP pattern recognition
• Objective: Test whether the expression patterns of conditions A, B, C satisfy A != B != C
• Build a centroid for each condition within each cluster
• Select the most outer centroid as the base centroid
[Figure: clusters C1, C2 and C3 with the A, B and C centroids over time points 0h, 1h, 6h]
1. Compute the cosine distance from each dissected vector to the base centroid for each cluster
2. Rank the dissected vectors by cosine distance
3. Measure Mutual Information (MI) with X as the distance to the base centroid and Y as the condition
4. Measure the significance of MI by 1000 random permutation tests
[Figure: cluster C2 with the A, B and C centroids and the base centroid over time points 0h, 1h, 6h]

Example for cluster C2 (cosine distance of each dissected vector to the base centroid):
Phenotype          A     A     A     A     B     B     B     B     C     C     C     C
clid               G1_A  G2_A  G3_A  G4_A  G1_B  G2_B  G3_B  G4_B  G1_C  G2_C  G3_C  G4_C
C2                 0.9   0.87  0.96  0.99  0.1   0.05  0.2   0.18  0.5   0.6   0.57  0.61
Rank               10    9     11    12    2     1     4     3     5     7     6     8
Discretized Rank   3     3     3     3     1     1     1     1     2     2     2     2
MI = log2(3) ≈ 1.585
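The ranking, discretization, mutual information and permutation steps can be sketched with the cluster C2 distances from the table. This assumes the standard definition of mutual information and as many rank bins as there are conditions; the function names are hypothetical and the permutation test is a generic label shuffle, not necessarily the exact procedure used by TimesVector. Under the standard definition, three equally sized conditions that are perfectly separated by rank give MI = log2(3).

```python
import math
import random
from collections import Counter


def rank_bins(distances, n_bins):
    """Rank distances (1 = smallest) and discretize the ranks into n_bins bins."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    ranks = [0] * len(distances)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    size = math.ceil(len(distances) / n_bins)
    return [math.ceil(r / size) for r in ranks]


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Share of label shuffles whose MI reaches the observed MI."""
    rng = random.Random(seed)
    observed = mutual_information(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(xs, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


phenotype = list("AAAABBBBCCCC")
distances = [0.9, 0.87, 0.96, 0.99, 0.1, 0.05, 0.2, 0.18, 0.5, 0.6, 0.57, 0.61]

bins = rank_bins(distances, n_bins=3)
mi = mutual_information(bins, phenotype)
p_value = permutation_p_value(bins, phenotype)
print(bins)  # [3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2]
print(round(mi, 3))  # 1.585
```

A small permutation p-value indicates that the separation of conditions by distance to the base centroid is unlikely to arise by chance.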
Method – ODEP pattern recognition
Objective: Test whether the expression pattern of one condition among A, B, C is A != BC (B = C) or B != AC (A = C) or C != AB (A = B)
1. Compute a base centroid of the comparing conditions (BC, AC, AB)
2. Compute the cosine distance of the dissected vectors to the centroid for each combination
3. Perform ANOVA on the computed cosine distance combinations
[Figure: clusters C1, C2 and C3 with the A, B and C centroids over time points 0h, 1h, 6h]
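One possible reading of the ODEP steps is sketched below. It is an assumption, not the confirmed TimesVector procedure: for each candidate odd condition, the base centroid is built from the other conditions, and a one-way ANOVA F statistic compares the distances of the candidate's vectors with the distances of the remaining vectors. Only the F statistic is computed; turning it into a p-value would need the F distribution, which is left out here. All names and the toy cluster are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid(vectors):
    """Mean of the unit-normalized vectors."""
    units = [[x / math.sqrt(sum(y * y for y in v)) for x in v] for v in vectors]
    return [sum(col) / len(units) for col in zip(*units)]


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return math.inf
    return (ss_between / (len(groups) - 1)) / (ss_within / (len(values) - len(groups)))


def odep_f(cluster, odd):
    """F statistic for 'odd' versus the rest, using the rest's centroid as base."""
    rest = [v for cond, vs in cluster.items() if cond != odd for v in vs]
    base = centroid(rest)
    odd_dist = [cosine_distance(v, base) for v in cluster[odd]]
    rest_dist = [cosine_distance(v, base) for v in rest]
    return anova_f([odd_dist, rest_dist])


# Toy cluster: condition A rises over time, B and C fall.
cluster = {
    "A": [[1, 2, 3], [2, 4, 6.5], [10, 20, 31]],
    "B": [[3, 2, 1], [6, 4, 2.2], [30, 21, 10]],
    "C": [[3, 2, 1.2], [9, 6, 3.1], [15, 10.5, 5]],
}
f_scores = {cond: odep_f(cluster, cond) for cond in cluster}
odd_one = max(f_scores, key=f_scores.get)
print(odd_one)  # A
```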
Method – SEP pattern recognition
Objective: Test whether the expression patterns of conditions A, B, C satisfy A = B = C
1. Compute a base centroid of all conditions within a cluster
2. Compute the cosine distance of the dissected vectors to the base centroid
3. Tightness threshold: the lower bound of the 99% confidence interval of tightness over all clusters
4. Clusters with tightness below this lower bound are SEP clusters
[Figure: clusters C1, C2 and C3 with the A, B and C centroids over time points 0h, 1h, 6h]
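The SEP threshold can be sketched as follows, under two stated assumptions: tightness is the average cosine distance of a cluster's dissected vectors to its base centroid, and the 99% confidence interval is a normal-approximation interval for the mean tightness over all clusters. The function name and the tightness values are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def sep_clusters(tightness, confidence=0.99):
    """Return the CI lower bound and the clusters whose tightness falls below it."""
    values = list(tightness.values())
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    lower = mean(values) - z * stdev(values) / math.sqrt(len(values))
    return lower, [cid for cid, t in tightness.items() if t < lower]


# Hypothetical average cosine distances to the base centroid, one per cluster.
tightness = {
    "C1": 0.02, "C2": 0.30, "C3": 0.25, "C4": 0.28,
    "C5": 0.01, "C6": 0.27, "C7": 0.31, "C8": 0.26,
}
lower, sep = sep_clusters(tightness)
print(sep)  # ['C1', 'C5']
```

Clusters with unusually small tightness are those in which all conditions stay close to the shared centroid, which is what an SEP cluster should look like.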
Results
• Data
• Biologically significant clusters detected
• Performance compared with Tricluster and OPTricluster
Data sets:
• Malaria infected / gonadectomized male and female mice
• Rice plants treated with 4 phytohormones
• Dehydration stress treated rice plants
• Fermentation of five yeast strains
Results – Cluster patterns
[Figure: cluster patterns for C=4, T=4 and for C=4, T=6]
Results – Malaria infected mouse data
(a) DEP cluster 51
(b) ODEP cluster 20
(c) SEP cluster 357
Results – Phytohormone treated rice plants
• 5 clusters were found that responded to the ABA (abscisic acid) phytohormone
• Genes were gradually induced over time.
• Enriched GO terms in these clusters were related to 'Response to abscisic acid'
Results – Comparison with other tools
[Figure panels: average number of genes per cluster; tightness (average within cosine distance of clusters); weighted silhouette score]
Conclusion
• TimesVector is able to detect gene clusters in 3D time-series data that exhibit distinct expression patterns
• In particular, it is able to detect clusters with distinctly different expression patterns across conditions
• It showed significantly improved clustering quality compared to recent triclustering tools
Funding
• The Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01121102), Rural Development Administration (RDA), Republic of Korea
• The Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Ministry of Science, ICT & Future Planning (2012M3A9D1054622)
• The Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C3224)
Thank you for your attention