Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao

mixOmics and intro Data integration Time course data Cross-platform Conclusions

Integration of highly dimensionalbiological data sets with multivariate

analysis

Kim-Anh Lê Cao

Queensland Facility for Advanced BioinformaticsThe University of Queensland

Kim-Anh Lê Cao

AMATA 2013


In God we trust, all others must bringdata.(W. Edwards Deming)

Kim-Anh Lê Cao

AMATA 2013


Outline of this talk and research questions

1 It is all about mixOmics2 What is the common information contained in different

experiments?3 What are the correlated features across different time course

experiments?4 Can we combine similar experiments performed in different

labs or/and on different platforms?

Kim-Anh Lê Cao

AMATA 2013


Philosophy

mixOmics: philosophy

is originally an R package developed for thestatistical exploration and integration of large biological data sets.

Development of statistical multivariate approachesVariable selection included in the methodologiesGraphical outputsUser-friendly use: website, R package, web interface

First R CRAN release in May 2009.

Lê Cao et al. mixOmics: an R package to unravel relationships between two omics

data sets, Bioinformatics, 25 (21).

Kim-Anh Lê Cao

AMATA 2013


Framework

Framework

One$data$set$PCA$IPCA$

PLS/DA$

MULTIVARIATE$$APPROACH$

GRAPHICAL$$OUTPUTS$

Sample$plots$

Variable$plots$

PRE$PROCESSING$

Two$matching$data$sets$

NIPALS$

mulF/group$

Smoothing$splines$

rGCCA$

ExploraF

on$

IntegraF

on$

$INPUT$DATA$

Missing$values$

Time$Course$

Comparison$across$$

plaIorms$

rCCA$

PLS$

PLS$MulFlevel$

Repeated$measurements$

More$than$two$data$sets$

Kim-Anh Lê Cao

AMATA 2013


Web interface

Web interface

http://www.qfab.org/mixomics/

Kim-Anh Lê Cao

AMATA 2013


Introduction with PCA

Principal Component Analysis: an introductionSeek the best directions in the data that account for most of thevariability

→ principal components: artificial variablesthat are linear combinations of the originalvariables:

c1 = w1x1 + w2x2 + · · ·+ wpxp −1.0 −0.5 0.0 0.5 1.0

−0.4

−0.2

0.0

0.2

0.4

x1

x2

wherec1 is the first principal component with max. variance{w1, . . . ,wp} are the weights in the linear combination{x1, . . . , xp} are the gene expression profiles.

All PCs are mutually orthogonal. (c1, c2, . . . )

Kim-Anh Lê Cao

AMATA 2013


Introduction with PCA

The new PCs form a a smaller subspaceof dimension < p

Project the data on these new axes tosummarize the information related tothe variance.

−1.0 −0.5 0.0 0.5 1.0

−0.4

−0.2

0.0

0.2

0.4

z1

z2

→ approximate representation of the data points in a lowerdimensional space

PCA should be a compulsory first step in exploratory data analysisto:

Have a first understanding of the underlying data structureIdentify bias, experimental errors, batch effects

Kim-Anh Lê Cao

AMATA 2013


Dealing with high dimensional data

Problem with PCA: interpretation can be difficult with very largenumber of (possibly) irrelevant variables.

Remember that the principal components are linear combinations ofthe original variables:

c = w1x1 + w2x2 + · · ·+ wpxp

A clearer signal could be observed if some of the variable weights{w1, . . . ,wp} could be set to 0 for the irrelevant variables:

c = 0 ∗ x1 + w2x2 + · · ·+ 0 ∗ xp

These variables weights are defined in the loading vectors.Important weights = important contribution to the PC.Similar weights = correlated variables.

Kim-Anh Lê Cao

AMATA 2013


Variable selection

sparse Principal Component Analysis: sPCA

Principal components

−1.0 −0.5 0.0 0.5 1.0

−0.4

−0.2

0.0

0.2

0.4

x1

x2

loading vectors(PCA)

0 20 40 60 80 100

0.0

00

.05

0.1

00

.15

loadin

g 1

0 20 40 60 80 100

0.0

00

.05

0.1

00

.15

loadin

g 2

sparse loading vectors(sPCA)

0 20 40 60 80 100

0.0

00

.05

0.1

00

.15

0 20 40 60 80 100

0.0

00

.05

0.1

00

.15

loadin

g 1

loadin

g 2

The principal components are linear combinations of the originalvariables, variables weights are defined in the associated loadingvectors.sparse PCA computes the sparse loading vectors to removeirrelevant variables using lasso penalizations (Shen & Huang 2008, J.

Multivariate Analysis).

Kim-Anh Lê Cao

AMATA 2013


Motivation

Integration of multiple data sets: what is the commoninformation contained in different experiments?

0"1"0"1"1"."."0"

."."."

1"2"."."."."n"

1"".".".".".".".".".".".".".".""."."p" 1""."."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s"

Statistical data integration, how to? Define the best components(linear combination of each data set variables) that are bestcorrelated with each other (RGCCA: Regularized Generalised CCA).

The same samples are measured across different experimentsSelect relevant biological entities which are correlated acrossthe different data setsEmphasis on graphical outputs to ease results interpretation

Kim-Anh Lê Cao

AMATA 2013


Example

ExampleKidney transplant study

Transcriptomics and proteomics studyof 40 patients with kidney transplant,rejecting (n1 = 20) or not (n2 = 20)the transplant, PROOF Centre, UBC.

Kim-Anh Lê Cao

AMATA 2013


Example

New developments include a sparse RGGCA (sGCCA) to selectcorrelated variables across the different platforms

Relevance networks Gonzales, Lê Caoet al. (2012), J. Data Mining

Kim-Anh Lê Cao

AMATA 2013


Motivation

Integration of time-course data: what are the correlatedfeatures across different time course experiments?

0"1"0"1"1"."."0"

1"2"."."."."n"

1""."."."."."."."."."..".".""."."p" 1""."."."."."."."."."."."."."."."."."."."."."."."."q" 1""."."."."."."."."."r" 1"".".".".".".".".".".".".".""."s"

Requires the same samples measured at several time pointVariable selection across platforms and timeHypothesize that correlated biological entities belong to thesame biological pathway

Kim-Anh Lê Cao

AMATA 2013


Modelling trajectories

Modelling trajectoriesUse cubic smoothing splinesto summarize each profile

Enables interpolation ofmissing values

There exists a theoretical link between smoothing splinesand linear mixed models (Verbyla et al. 1999).

Appropriate for correlated repeatedmeasuresNo parameters to tuneFlexibility of the modelAssesses the variability for eachfeature (technical , biologicalvariability)

●●

●

●

●

●

●

●●

●

●●

time

TP

M E

xpr

2025

3035

4045

50

d0 d6 d12 d18

Example on one CAGE cluster

● tech rep 1tech rep 2

Kim-Anh Lê Cao

AMATA 2013


Modelling trajectories

Step 1: use cubic smoothing splines to reduce one dimension(samples dimension)

Step 2: apply sGCCA on the estimated splines to identify correlatedprofiles both within and between the two data sets

→ modelisation of the trajectories→ filtering of the profiles→ integration of two types of data→ selection and clustering of the correlated time profiles

Kidney transplant study Transcriptomics and proteomics study of 40patients with kidney transplant, rejecting (n1 = 20) or not (n2 = 20) thetransplant. Follow up on 5 time points (weeks), PROOF Centre, UBC.

Kim-Anh Lê Cao

AMATA 2013


Some results

Variable representation: time profile clusters are selected

2 4 6 8 10

−1.

5−

1.0

−0.

50.

00.

51.

01.

5

time

expr

essi

on

dimension 1

2 4 6 8 10

−2

−1

01

2

time

expr

essi

on

dimension 2

2 4 6 8 10

−2

−1

01

2

time

expr

essi

on

dimension 3

The approach selects both transcripts and proteins which arepositively or negatively correlated across time

Quality of clusters decreases with the number of PLS components(dimensions) as obvious patterns cannot be extracted anymore

Kim-Anh Lê Cao

AMATA 2013


Example: sample representation

Sample representation: how the approach can extractcommon information across these different platforms

Project GrandioseLongitudinal study of cellreprogramming- 8 time points- 10 platforms: microarray,cell surface proteome, totalproteome, RNA-seq,miRNA...

See also keynote speaker Session 11 Nicole CloonanKim-Anh Lê Cao

AMATA 2013


Motivation

Cross-platform comparison: can we combine similarexperiments performed in different labs or with differentplatforms?

Not the same samples measured across different experimentsExperimental biases (batch effects)Well known fact: microarray experiments across studies bringdifferent results! “But I would very much like to compare mygene signature from my experiment to my colleagues’ genesignatures from their experiments!"

Aim: Identify genes diagnostic of the key characteristics of stemcells derived from independent studies

See contributed paper Session 9 Celena Heazlewood and Poster 46, Florian

RohartKim-Anh Lê Cao

AMATA 2013


The challenges

What is a ‘batch’ effect?

Boxplots of the samples (data normalised

per experiment)

0.6 0.8 1.0 1.2 1.4

0.6

0.8

1.0

1.2

1.4

1

1

●FibroblasthESChiPSC

BrennandJiaMaheraliNaylerZaehres

−50 0 50 100 150 200

−40

−20

020

40

PC1: 58.76 %

PC

2: 1

5.87

%

Classic PCA: Across platform 3 classes

PCA highlights a ‘lab effect’.→ Combining different experiments is a real challenge

Kim-Anh Lê Cao

AMATA 2013


The challenges

1""2"1"3"3"1"3"

1"2"."."."."n1"

1"".".".".".".".".".".""."."p"

1"2"."n2"1"2"."."."n3"1"2"."."."n4"

1""3"3"2"

1""2"1"1"2"3"

1""2"1"3"3"1"

."

."

."

X = several microarray experiments,performed in different labs but studyingthe same biological conditionsNot the same samples, but the samevariablesGroup = experiment/platform/labY = biological conditions→ Supervised problem

We propose: multi-group analysis for theselection of common biomarkers across thedata sets.

Kim-Anh Lê Cao

AMATA 2013


The challenges

Multi-group Analysis

Fig from A. Eslami

Individuals are a priori structuredinto several groupswithin group part: group structureeffect is removedbetween-group part: groupstructure effect is taken intoaccountPartial and global components andloading vectors

→ Understand and model the common stucture between thegroups and within each group→Two block multi-group analysis by the means of Partial LeastSquares regression (PLS)

Kim-Anh Lê Cao

AMATA 2013


The challenges

Multi-group PLS-DA

−50 0 50

−50

050

100

X−variate 1

X−

varia

te 2

PLS−DA, diff platforms 3 classes

●●●

●●

●●●

●●● ●●

●●

●

●●●●●

−50 0 50 100 150

−10

010

2030

comp 1

com

p 2

mg PLS−DA, Across platform 3 classes

0.6 0.8 1.0 1.2 1.4

0.6

0.8

1.0

1.2

1.4

1

1

●FibroblasthESChiPSC

BrennandJiaMaheraliNaylerZaehres

Classical PLSDA (LHS) vs multi-group PLSDA(RHS), the latter gives better (visual) results.

Kim-Anh Lê Cao

AMATA 2013


The challenges

Multi-group PLS-DA, more data sets

●

●●

●

●

●●

●●●●●

●●

●●

−100 0 100 200 300

−60

−40

−20

020

Comp 1

Com

p 2

Multi group PLS−DA, same platform

●

●

●

●

●

●

●

●●

●

●

●

−40 −20 0 20 40 60

−60

−40

−20

020

4060

Comp 1

Com

p 2

Multi group PLS−DA, MAQC

●●

●●●

●●

●●●

●

●

●

●●●

●●

−10 0 10 20 30

−15

−10

−5

05

1015

Comp 1

Com

p 2

Multi group PLS−DA, diff platform 2 classes

Same Affymetrix platform; MAQC Illumina + Affymetrix; Illumina + Affymetrix

→ In order to select common biomarkers across different data sets,variable selection is needed

Kim-Anh Lê Cao

AMATA 2013


sparse multi-group PLSDA

sparse multi-group PLSDA and performance

●●●

●●

●●●

●●● ●●

●●

●

●●●●●

−50 0 50 100 150

−10

010

2030

comp 1

com

p 2

mg PLS−DA, Across platform 3 classes

●●●

●●

●●●

●●●

●●●

●●

●●●●●

−5 0 5 10 15 20

−5

05

10

comp 1

com

p 2

sparse mg PLS−DA, Across platform 3 classes

training set: circles,testing set: triangles withcolors as predicted

Table: Classif. error rate (10-fold CV)sameplatformK=3

diff.platformK=2

diff.platformK=3

#samples 53 70 82

#variables 22K 9K 9K

#groups 5 5 5

multi grp 8.9% 15.3% 16.1%

sparsemulti grp

7.7% 4.5% 10.6%

#selectedmarkers

40 30 20

Kim-Anh Lê Cao

AMATA 2013


Comparisons

Comparisons

●● ●

●●

●

●●

●●●

●●

●●

●

●●●

●

●

−0.1 0.0 0.1 0.2 0.3

−0.

10.

00.

10.

20.

30.

4

X−variate 1

X−

varia

te 2

(1)

●●●

●●

●●●

●●●

●●●

●●

●●●●●

−5 0 5 10 15 20

−5

05

10

comp 1

com

p 2

(2)

●●●

●●

●

●●

●●

●

●●

●●

●

●●●

●

●

0.0 0.1 0.2 0.3

−0.

10.

00.

10.

20.

3

comp 1

com

p 2

(3)

3 33 4 3

smgPLS sPLS

sPLS + ComBat

3 7 3

10

sPLS sPLS + ComBat

smgPLS

1 Data centered per group2 Multi group PLSDA3 ComBat Combating Batches of gene

expression microarray data (Johnson and Li,2007)

Kim-Anh Lê Cao

AMATA 2013


Conclusions

Multivariate integrative approaches are flexible and can answervarious types of questionsmixOmics implements 6 different methodologies plus theirsparse variantsData integration, variable selection, graphical outputsIncludes a web-associated interface (and soon through Galaxytoo)Help generating new hypotheses, further statistical tests

New additions! (still in development)Smoothing spline/LMM framework for the integration oftime-course dataMulti-group analysis to address a challenging current problemin molecular biology.

Kim-Anh Lê Cao

AMATA 2013


Un GRAND MERCI à ...

Data providersKidney transplant study

Oliver Günther UBCScott Tebutt UBCGrandiose project

Andras Nagy Univ. TorontoStemformatics team

Christine Wells UQ, Univ. GlasgowRowland Mosbergen UQOthmar Korn UQ

mixOmics teamSébastien Déjean Univ. TlseIgnacio González Univ. TlseXin Yi Chua QFAB

Methods developmentMultiple data integration

Arthur Tenenhaus Supelec ParisTime course

Jasmin Straube QFABKathy Ruggiero Univ. AucklandChristèle Robert INRA ToulouseCross-platform comparison

Stéphanie Bougeard ANSESAida Eslami ANSES, ONIRISFlorian Rohart AIBN UQ

Kim-Anh Lê Cao

AMATA 2013


Un GRAND MERCI à ...

Questions?

Springtime Bioinformatics TrainingWorkshops by QFAB (University ofQueensland)

RNA seq analysis, 4 Nov.Introduction to statististics forbiologists, 19-21 Nov.Variant detection, 22 Nov.Introduction to De Novo GenomeAssembly, 9 Dec.

[email protected]

http://www.math.univ-toulouse.fr/∼biostat/mixOmics

Kim-Anh Lê Cao

AMATA 2013

Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao

Technology

Transcript of Integration of highly dimensional biological data sets with multivariate analysis - Kim-Anh Lê Cao