Managing and analysing (next-generation) multivariate ecological … · 2012. 11. 28. · November...

Managing and analysing (next-generation) multivariate ecological data: new concepts and tools

Steve C. Walker

McMaster University, Department of Mathematics and Statistics

Bolker lab

November 21, 2012

EEB seminar, McMaster University

Introduction
  Motivation from observational community ecology
  Illustrating the basic issue

Previous work on the data management-analysis interface, both inside and outside of community ecology
  The ‘old’ school
  The ‘middle’ school
  The ‘new’ school

The R multitable package

Thermocline deepening experiment

Bythotrephes longimanus

Wisconsin Dept. of Natural Resources

Bythotrephes longimanus

Yan et al. 2002

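Fourth-corner problem

[Figure: the fourth-corner problem. An abundance table (sites × species) sits beside a site-properties table (sites × properties) and a species-properties table (species × properties); the remaining "fourth corner" relates species properties to site properties.]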

Legendre et al. 1997


Statistical methods for analyzing ‘fourth-corner’-esque data

- Chessel et al. (1996): RLQ analysis
- Legendre et al. (1997): coined the term ‘fourth-corner’
- Ives and Godfray (2006): mixed models of phylogenetically-structured food webs
- Dray and Legendre (2008): extends Legendre et al.
- Pillar and Duarte (2010): phylogenetic null models
- Leibold et al. (2010): semi-partial correlations
- Ives and Helmus (2011): phylogenetic generalized linear mixed models (PGLMMs)
- ter Braak et al. (2012): multiple comparison tests

The data frame

[Figure: a data frame drawn as a table, with replicates as rows and variables as columns.]

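Thermocline manipulation experiment

[Figure: the data tables of the thermocline manipulation experiment: abundance, species traits, environmental variables, and time scales, indexed by combinations of time, basins, and species.]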

Cantin et al. (2011)

How do we convert this into a data frame?

[Figure: the fourth-corner data tables (abundance, site properties, species properties), shown again.]

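Summarisation → lost information

[Figure: a (sites × mean species properties) table is obtained by multiplying the (sites × species) abundance table by the (species × species properties) table; e.g. Leibold et al. 2010. The summarised table, or alternatively a (sites × functional diversity indices) table, can then sit beside the (sites × site properties) table in a single data frame.]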


Repetition → redundant information

[Figure: a single long data frame with one row per species-site combination (species 1 to 3 by sites 1 to 6, 18 rows in all) and columns Abundance, Environment, and Traits.]

When converting a fourth-corner problem into a single data frame, you have two choices:

- Summarisation → lost information
- Repetition → redundant information
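
The two choices can be sketched in base R. This is only an illustration under stated assumptions: the toy values, the object names (`abundance`, `environment`, `trait`, `summarised`, `repeated`) and the use of an abundance-weighted mean as the summary are all hypothetical and do not come from the talk.

```r
# Hypothetical toy data: 3 sites x 2 species (values are illustrative only)
abundance <- matrix(c(2, 0,
                      1, 3,
                      0, 4),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(sites = c("s1", "s2", "s3"),
                                    species = c("spA", "spB")))
environment <- c(s1 = 10, s2 = 15, s3 = 20)  # one value per site
trait       <- c(spA = 1, spB = 2)           # one value per species

# Choice 1: summarisation. Abundance-weighted mean trait per site:
# (sites x species) %*% (species x 1), divided by total abundance per site.
mean_trait <- as.vector(abundance %*% trait) / rowSums(abundance)
summarised <- data.frame(environment = unname(environment),
                         mean_trait  = unname(mean_trait))

# Choice 2: repetition. One row per site-species combination.
repeated <- data.frame(
  abundance   = as.vector(abundance),                       # sites vary fastest
  environment = rep(unname(environment), times = ncol(abundance)),
  trait       = rep(unname(trait), each = nrow(abundance))
)
```

`summarised` has one row per site, so the species-level detail is gone; `repeated` keeps everything but stores each environment value once per species and each trait value once per site.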

Linear algebra as data management

Ancient Chinese text (∼150 BCE)


Hart (2009)


Solve for the b’s

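\[
\begin{aligned}
y_1 &= b_1 x_{11} + b_2 x_{12} + \dots + b_m x_{1m} \\
y_2 &= b_1 x_{21} + b_2 x_{22} + \dots + b_m x_{2m} \\
&\;\;\vdots \\
y_n &= b_1 x_{n1} + b_2 x_{n2} + \dots + b_m x_{nm}
\end{aligned}
\tag{1}
\]

\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1m} \\
x_{21} & x_{22} & \dots & x_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nm}
\end{pmatrix}, \quad
b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}
\]

\[
\begin{aligned}
y &= X b \\
X^T y &= X^T X b \\
(X^T X)^{-1} X^T y &= b
\end{aligned}
\tag{2}
\]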
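
Equation (2) can be checked numerically. The following is a minimal sketch in base R; the simulated data, the seed, and the names `b_true`, `b_hat`, and `b_lm` are assumptions made for illustration. There is no intercept column, matching equation (1) as written.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 20
m <- 3
X <- matrix(rnorm(n * m), n, m)    # hypothetical predictors (no intercept column)
b_true <- c(1.5, -2, 0.5)          # hypothetical coefficients
y <- as.vector(X %*% b_true) + rnorm(n, sd = 0.1)

# Equation (2): b = (X^T X)^{-1} X^T y.
# solve(A, B) solves the linear system without forming the inverse explicitly.
b_hat <- solve(crossprod(X), crossprod(X, y))

# lm() with the intercept removed fits the same model
b_lm <- coef(lm(y ~ X - 1))
```

The two estimates agree up to floating-point error, which is the point of the slide: once the problem is written as y = Xb, the solution step needs no further thought.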

The importance of data management to science

- Good theories of data management (e.g. matrix algebra) allow us to think at a higher level of abstraction, thereby allowing us to focus on the interesting new parts of the problem (e.g. the meaning of Y, X, B).
- This is because the uninteresting old details (e.g. how to solve the linear equation) are automatically correct if the theory is correctly applied (e.g. because it has been previously learned).
- Therefore, we don’t need to actively think about such details until we step outside of the domain of the theory.



Ihaka and Gentleman 1996

The data frame

[Figure: the data frame, with replicates as rows and variables as columns.]

The R framework for data management

[Figure: a data frame with replicates as rows and the variables den, temp, and precip as columns.]

Chambers and Hastie 1991

[Figure, built up over three slides: data frame + formula (den ~ temp + precip) + function (lm / glmer / plot / xyplot) = analysis, illustrated by a scatterplot of den against temp (p < 0.0001) and the coefficient table below.]

           coef   s.e.
(intcpt)   -1.2    0.4
temp        2.1    0.1
precip     -0.1    0.1


> dataset
  den  temp precip
1 0.2  24.5   36.5
2 0.5 -26.4   36.0
3 0.8   4.9   15.5
4 1.5  12.2   34.8
5 0.6  18.7   99.3

> dataset[1:2, ]
  den  temp precip
1 0.2  24.5   36.5
2 0.5 -26.4   36.0

> lm(den ~ temp + precip, data = dataset)
Coefficients:
(Intercept)         temp       precip
   0.837385     0.001930    -0.002937

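The data frame

[Figure: the data frame, with replicates as rows and variables as columns.]

Fourth corner problem

[Figure: the fourth-corner data tables (abundance, site properties, species properties), shown again.]

[Figure: the thermocline manipulation experiment data tables (abundance, species traits, environmental variables, time scales), shown again.]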


Goal: Analyze next-generation multiple-table data sets using this framework

Problem: R doesn’t do multiple tables ‘out-of-the-box’

Strategy: Develop some theory to better understand multiple-table data management, and then use that theory to extend the R framework to allow multiple-table data sets

[Figure: data sources → data list → data frame + formula + function = analysis]


Rock star: Hadley Wickham

reshape2

plyr

ggplot2

...

reshape2

[Figure: melting converts the three-dimensional abundance array (time × space × species) into a data frame of replicates by variables; casting converts the data frame back into the array.]

reshape2

> X

, , capybara

     midlatitude subtropical tropical equatorial arctic subarctic
2009           4           0        8          0      0         0
2008           0          10        0          7      0         0
1537           0           0        0          0      0         0

, , moss

     midlatitude subtropical tropical equatorial arctic subarctic
2009           0           0        9          0      5         0
2008           6           0        0          3      0         0
1537           0           0        0          0      0         0

, , vampire

     midlatitude subtropical tropical equatorial arctic subarctic
2009           0           0        0          0      0         0
2008           0           0        0          0      0         0
1537           0           1        0          0      0         0

reshape2

> Xmelt <- melt(X, varnames = c('year', 'biome', 'species'),
                value.name = 'abundance')

> Xmelt

year biome species abundance

1 2009 midlatitude capybara 4



4 2009 subtropical capybara 0



7 2009 tropical capybara 8

...

48 1537 equatorial vampire 0

49 2009 arctic vampire 0



52 2009 subarctic vampire 0



reshape2

> acast(Xmelt, year ~ biome ~ species)

, , capybara

     arctic equatorial midlatitude subarctic subtropical tropical
1537      0          0           0         0           0        0
2008      0          7           0         0          10        0
2009      0          0           4         0           0        8

, , moss

     arctic equatorial midlatitude subarctic subtropical tropical
1537      0          0           0         0           0        0
2008      0          3           6         0           0        0
2009      5          0           0         0           0        9

, , vampire

     arctic equatorial midlatitude subarctic subtropical tropical
1537      0          0           0         0           1        0
2008      0          0           0         0           0        0
2009      0          0           0         0           0        0
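
The melt/cast round trip can also be imitated with base R alone. The sketch below is an analogue, not reshape2's implementation: it uses `as.data.frame.table()` in place of `melt()` and `tapply()` in place of `acast()`, on a small hypothetical array (the values and the names `Xlong` and `Xwide` are assumptions made for illustration).

```r
# Small hypothetical abundance array: 2 years x 2 biomes x 2 species
X <- array(c(4, 0, 0, 10, 0, 6, 9, 0),
           dim = c(2, 2, 2),
           dimnames = list(year    = c("2009", "2008"),
                           biome   = c("midlatitude", "subtropical"),
                           species = c("capybara", "moss")))

# "Melting": one row per array cell, one column per dimension plus the value
Xlong <- as.data.frame.table(X, responseName = "abundance")

# "Casting": rebuild the array from the long data frame.
# Each combination of year, biome, and species occurs once, so sum() just
# returns the single stored value.
Xwide <- tapply(Xlong$abundance, Xlong[c("year", "biome", "species")], sum)
```

Because the factor levels keep the original dimnames order here, `Xwide` has the same layout as `X`; `acast()` in the transcript instead sorts the levels, which is why its output lists 1537 first.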


[Figure: the thermocline manipulation experiment data tables (abundance, species traits, environmental variables, time scales), shown again.]


Peter Solymos

mefa / mefa4

vegan

dclone

...

mefa / mefa4

count data matrix (x$xtab)

segments (x$segm)

data frame for samples (x$samp)

data frame for taxa (x$taxa)

Solymos 2009


[Figure: the thermocline manipulation experiment data tables (abundance, species traits, environmental variables, time scales), shown again.]


multitable



The central distinction of multitable

Variables
- Things that can be related
- Axes on a scatterplot
- Columns in a data frame (or database)

Replicates
- Information about relationships
- Points on a scatterplot
- Rows in a data frame (or database)


The scatterplot

[Figure: a scatterplot of points, with axes labelled "x variable" and "y variable".]

The data frame

[Figure: the data frame, with replicates as rows and variables as columns.]

The data frame — bipartite graph

replicates variables

Variables and replicates in the fourth corner problem?

[Figure: the fourth-corner data tables (abundance, site properties, species properties), shown again.]


Fourth corner problem — bipartite graph

sites

species

environment

abundance

traits


[Figure: the thermocline manipulation experiment data tables (abundance, species traits, environmental variables, time scales), shown again.]


Thermocline manipulation experiment — bipartite graph

sites

time

species

environment

abundance

time scales

traits

Biadjacency matrices

sites

species

environment

abundance

traits

          abundance  environment  traits
sites             1            1       0
species           1            0       1
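
A biadjacency matrix like this one can be derived from the tables themselves. The sketch below assumes each table is stored as an array whose dimnames are named after the replication dimensions it is replicated along; the list `tables`, the placeholder values, and the name `biadj` are hypothetical and are not part of multitable.

```r
# Hypothetical tables: only the named dimensions matter, values are placeholders
tables <- list(
  abundance   = matrix(0, 2, 3,
                       dimnames = list(sites = c("s1", "s2"),
                                       species = c("a", "b", "c"))),
  environment = array(0, dim = 2, dimnames = list(sites = c("s1", "s2"))),
  traits      = array(0, dim = 3, dimnames = list(species = c("a", "b", "c")))
)
rep_dims <- c("sites", "species")

# Entry is 1 when a variable is replicated along a dimension, 0 otherwise
biadj <- sapply(tables, function(tab) {
  as.integer(rep_dims %in% names(dimnames(tab)))
})
rownames(biadj) <- rep_dims
biadj
```

Rows are replication dimensions and columns are variables, so a 1 marks an edge of the bipartite graph.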

> install.packages('multitable')
> library(multitable)

> dl
abundance:
---------
      sppA sppB sppC
siteA    0    1   10
siteB    0    2   12
siteC    2    1    1
siteD    0    7    0
siteE    2    0    0
Replicated along: || sites || species ||

temperature:
-----------
siteA siteB siteC siteD siteE
-0.24  0.40  2.12 -0.72  5.95
Replicated along: || sites ||

continued...

bodysize:
--------
sppA sppB sppC
0.87 1.52 2.67
Replicated along: || species ||

REPLICATION DIMENSIONS:
sites species
    5       3

> summary(dl)
        abundance temperature bodysize
sites        TRUE        TRUE    FALSE
species      TRUE       FALSE     TRUE

> dl[1:3, ]
abundance:
---------
      sppA sppB sppC
siteA    0    1   10
siteB    0    2   12
siteC    2    1    1
Replicated along: || sites || species ||

temperature:
-----------
siteA siteB siteC
-0.24  0.40  2.12
Replicated along: || sites ||

continued...

REPLICATION DIMENSIONS:
sites species
    3       3

> df <- as.data.frame(dl)
> df
           abundance temperature bodysize
siteA.sppA         0       -0.24     0.87
siteB.sppA         0        0.40     0.87
siteC.sppA         2        2.12     0.87
siteD.sppA         0       -0.72     0.87
siteE.sppA         2        5.95     0.87
siteA.sppB         1       -0.24     1.52
siteB.sppB         2        0.40     1.52
siteC.sppB         1        2.12     1.52
siteD.sppB         7       -0.72     1.52
siteE.sppB         0        5.95     1.52
siteA.sppC        10       -0.24     2.67
siteB.sppC        12        0.40     2.67
siteC.sppC         1        2.12     2.67
siteD.sppC         0       -0.72     2.67
siteE.sppC         0        5.95     2.67

> lm(abundance ~ temperature + bodysize, data = df)

Coefficients:
(Intercept)  temperature     bodysize
    -0.3613      -0.4403       2.1083

> lm(abundance ~ temperature * bodysize, data = df)

Coefficients:
         (Intercept)           temperature              bodysize
             -2.1612                0.7580                3.1755
temperature:bodysize
             -0.7105

> df <- as.data.frame(dims_to_vars(dl))
> df
           abundance temperature bodysize sites species
siteA.sppA         0       -0.24     0.87 siteA    sppA
siteB.sppA         0        0.40     0.87 siteB    sppA
siteC.sppA         2        2.12     0.87 siteC    sppA
siteD.sppA         0       -0.72     0.87 siteD    sppA
siteE.sppA         2        5.95     0.87 siteE    sppA
siteA.sppB         1       -0.24     1.52 siteA    sppB
siteB.sppB         2        0.40     1.52 siteB    sppB
siteC.sppB         1        2.12     1.52 siteC    sppB
siteD.sppB         7       -0.72     1.52 siteD    sppB
siteE.sppB         0        5.95     1.52 siteE    sppB
siteA.sppC        10       -0.24     2.67 siteA    sppC
siteB.sppC        12        0.40     2.67 siteB    sppC
siteC.sppC         1        2.12     2.67 siteC    sppC
siteD.sppC         0       -0.72     2.67 siteD    sppC
siteE.sppC         0        5.95     2.67 siteE    sppC

> library(lme4)
> form <- abundance ~ (temperature * bodysize) +
      (-1 + temperature | species)
> glmer(form, data = df, family = 'poisson')

Bates, Maechler, and Bolker

Generalized linear mixed model fit by maximum likelihood:

Random effects:
 Groups  Name        Variance Std.Dev.
 species temperature 0.003439 0.05864

Number of obs: 15, groups: species, 3

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8456     0.5988  -1.412   0.1579
tmprt         0.2845     0.2491   1.142   0.2535
bdys          1.0077     0.2566   3.928 8.57e-05 ***
tmprtr:bdys  -0.2848     0.1391  -2.049   0.0405 *

Bates, Maechler, and Bolker

library(ggplot2)
ggplot(df) +
  facet_wrap(~ species) +
  aes(x = temperature, y = abundance, size = bodysize) +
  geom_point()

[Figure: abundance against temperature in three panels (sppA, sppB, sppC), with point size showing bodysize.]

> cor(as.data.frame(dl))
             abundance temperature  bodysize
abundance    1.0000000  -0.2839765 0.4176379
temperature -0.2839765   1.0000000 0.0000000
bodysize     0.4176379   0.0000000 1.0000000

> summary(dl)
        abundance temperature bodysize
sites        TRUE        TRUE    FALSE
species      TRUE       FALSE     TRUE
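
The zero correlation between temperature and bodysize is a consequence of the crossed layout: every site value is paired with every species value. A minimal base R sketch, using the temperature and bodysize values from the example (the names `temp_rep` and `size_rep` are assumptions, and this is not how multitable itself builds the data frame):

```r
temperature <- c(siteA = -0.24, siteB = 0.40, siteC = 2.12,
                 siteD = -0.72, siteE = 5.95)
bodysize    <- c(sppA = 0.87, sppB = 1.52, sppC = 2.67)

# Repeat each site value for every species, and each species value for
# every site, giving one entry per site-species combination
temp_rep <- rep(unname(temperature), times = length(bodysize))
size_rep <- rep(unname(bodysize), each = length(temperature))

# The cross-product of deviations factorises into
# sum(temperature deviations) * sum(bodysize deviations), and each sum is zero
cor(temp_rep, size_rep)
```

So the 0.0000000 entries are structural: temperature varies only along sites and bodysize only along species, and the two dimensions are fully crossed.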

> dlmelt(dl)

$sites.species

abndnc sites species

siteA.sppA 0 siteA sppA

siteB.sppA 0 siteB sppA

siteC.sppA 2 siteC sppA

siteD.sppA 0 siteD sppA

siteE.sppA 2 siteE sppA

siteA.sppB 1 siteA sppB

siteB.sppB 2 siteB sppB

siteC.sppB 1 siteC sppB

siteD.sppB 7 siteD sppB

siteE.sppB 0 siteE sppB

siteA.sppC 10 siteA sppC

siteB.sppC 12 siteB sppC

siteC.sppC 1 siteC sppC

siteD.sppC 0 siteD sppC

siteE.sppC 0 siteE sppC

$sites

temp sites

siteA -0.24 siteA

siteB 0.40 siteB

siteC 2.12 siteC

siteD -0.72 siteD

siteE 5.95 siteE

$species

bodysize species

sppA 0.87 sppA

sppB 1.52 sppB

sppC 2.67 sppC

> identical(dl, dlcast(dlmelt(dl)))
TRUE

> dlapply(dl, 2, mean)
abundance:
---------
sppA sppB sppC
 0.8  2.2  4.6

Replicated along: || species ||

REPLICATION DIMENSIONS:
species
      3
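
For the abundance table alone, the same species means can be obtained with base R's `apply()`. This is an analogue shown for comparison, not multitable's implementation; the name `species_means` is an assumption.

```r
# Abundance table from the example: 5 sites x 3 species
abundance <- matrix(c(0, 1, 10,
                      0, 2, 12,
                      2, 1,  1,
                      0, 7,  0,
                      2, 0,  0),
                    nrow = 5, byrow = TRUE,
                    dimnames = list(sites   = paste0("site", LETTERS[1:5]),
                                    species = paste0("spp", LETTERS[1:3])))

# MARGIN = 2 keeps the species dimension and averages over sites
species_means <- apply(abundance, 2, mean)
species_means
```

The result matches the `dlapply(dl, 2, mean)` output above: 0.8, 2.2, and 4.6 for sppA, sppB, and sppC.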

[Figure: Thermocline.Depth (about 2.5 to 10) against week (about 200 to 280), with points distinguished by basin (B1, B2, B3).]

[Figure: sqrt(abundance) against Thermocline.Depth (about 4 to 10) in twelve panels, with legends for Length (0.25 to 1.25) and basin (B1, B2, B3). Panel titles: (0.18) armoured rot; (0.18) nauplii; (0.18) unprotected rot; (0.33) Bosmina; (0.36) colonial rot; (0.45) Cycl adults; (0.75) Cal cope; (0.77) Holopedium; (0.80) Daphnia l&d; (0.96) Daphnia cat; (1.23) Cycl cope; (1.31) Cal adults. The figure appears twice in the slides, the second time with different y-axis ranges.]

Conclusion

- The fundamental distinction between variables and replicates that unifies most statistical software also applies to multiple-table next-generation data
- Therefore, we may not need all of the new statistical techniques being developed specifically for next-generation data in community ecology
- Although my field is observational community ecology, I think that many fields may benefit from more systematic and formal treatment of the distinction between variables and replicates
- Current limitations:
  - multitable only deals with arrays (not phylogenies, distance matrices, etc.)
  - although data lists can be coerced to data frames, which can be used in virtually any R analysis function, it may be more efficient to pass data lists directly







Acknowledgements

- Ben Bolker (for being my new postdoc supervisor... and for extremely useful and encouraging discussions on this topic long before that)
- Collaborators on the multitable project:
  - Pierre Legendre (previous postdoc supervisor)
  - Guillaume Guenard (Universite de Montreal)
  - Peter Solymos (University of Alberta)
  - Beatrix Beisner (Universite du Quebec a Montreal)
- Contributors to the free software I use
- Collectors of the free data I use
- Funding (NSERC, OGS, Pierre, Don, and the U of T)
- Laura Timms (McGill / ROM)
