Large scale functional data mining: What can we find in the data we have?

Curtis Huttenhower

08-12-10, Harvard School of Public Health, Department of Biostatistics

2

Greatest biological discoveries?

Our job is to create computational microscopes:

To ask and answer specific biological questions using

millions of experimental results

3

A computational definition of functional genomics

Genomic data, Prior knowledge

Data → Function

Function → Function

Gene → Gene

Gene → Function

4

A framework for functional genomics

[Figure: integrating many datasets to score gene pairs]

Labeled gene pairs: G1-G2 (+), G4-G9 (+), G3-G6 (-), G7-G8 (-); query pair: G2-G5 (?)

Example dataset values across gene pairs: 0.9 0.7 … 0.1 0.2 … 0.8; + - … - - … +; 0.8 0.5 … 0.05 0.1 … 0.6

Frequency histograms per dataset: High Correlation / Low Correlation; High Similarity / Low Similarity; Let. / Not let.; Similar / Dissim.

P(G2-G5|Data) = 0.85

Scale: 100Ms of gene pairs × 1Ks of datasets
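One possible reading of this framework is a naive Bayes combination of per-dataset evidence. The sketch below is an illustration under that assumption, not the talk's implementation: the function names, the 0.5 threshold, the pseudocount, and the uniform prior are all hypothetical, and the toy values only loosely echo the numbers on the slide.

```python
def discretize(value, threshold=0.5):
    """Bin a continuous score (e.g. a correlation) as 'high' or 'low'."""
    return "high" if value >= threshold else "low"


def learn_likelihoods(labeled_pairs, bins=("high", "low"), pseudocount=1.0):
    """Estimate P(bin | label) for each dataset from labeled gene pairs.

    labeled_pairs: list of (is_related, [bin in dataset 0, bin in dataset 1, ...]).
    """
    n_datasets = len(labeled_pairs[0][1])
    counts = {
        label: [dict.fromkeys(bins, pseudocount) for _ in range(n_datasets)]
        for label in (True, False)
    }
    for is_related, observed in labeled_pairs:
        for d, b in enumerate(observed):
            counts[is_related][d][b] += 1
    # Normalize each count table into a probability table.
    return {
        label: [
            {b: c / sum(table.values()) for b, c in table.items()}
            for table in tables
        ]
        for label, tables in counts.items()
    }


def posterior_related(observed, likelihoods, prior=0.5):
    """P(related | data), treating datasets as conditionally independent."""
    p_related, p_unrelated = prior, 1.0 - prior
    for d, b in enumerate(observed):
        p_related *= likelihoods[True][d][b]
        p_unrelated *= likelihoods[False][d][b]
    return p_related / (p_related + p_unrelated)


# Toy training data: two datasets, two related (+) and two unrelated (-) pairs.
labeled = [
    (True, [discretize(0.9), discretize(0.8)]),
    (True, [discretize(0.7), discretize(0.5)]),
    (False, [discretize(0.1), discretize(0.05)]),
    (False, [discretize(0.2), discretize(0.1)]),
]
likelihoods = learn_likelihoods(labeled)
query = [discretize(0.8), discretize(0.6)]  # an unlabeled pair
posterior = posterior_related(query, likelihoods)
```

With hundreds of millions of gene pairs, the per-dataset probability tables are the only state that must stay in memory, which is one reason a factorized model of this kind scales.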

5

Functional network prediction and analysis

Global interaction network

Carbon metabolism network Extracellular signaling network Gut community network

Currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

HEFalMp

6

Functional network prediction from diverse microbial data

Expression data: 486 bacterial expression experiments → 876 raw datasets → 310 postprocessed datasets → 304 normalized coexpression networks in 27 species

Interaction data: 307 bacterial interaction experiments → 154796 raw interactions → 114786 postprocessed interactions

Combined: integrated functional interaction networks in 15 species

[Figure: E. coli integration; axes: precision, recall]

7

Cross-species knowledge transfer using functional data

Pinaki Sarder

$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\,P(FR_s)$

$P(FR_s, FR_t)$

$P(FR_s \mid D) = P(FR_s \mid \{FR_t, D_s\}) \propto P(\{FR_t, D_s\} \mid FR_s)\,P(FR_s) = P(FR_s)\,P(D_s \mid FR_s) \prod_{t \neq s} P(FR_t \mid FR_s)$

TaFTan

8

TaFTan: Cross-species knowledge transfer using functional data

[Figure: log(precision/random) vs. log(recall) for E. coli, B. subtilis, P. aeruginosa, and M. tuberculosis, comparing three integrations: species-specific data, species' data excluded, and all species' data]

• Important to take advantage of all available data for any one organism

• Important to take advantage of all available data for every organism

• Scalable to dozens of organisms with hundreds of functional datasets

• Currently working on making this more context-specific

9

Meta-analysis for unsupervised functional data integration

Evangelou 2007

Huttenhower 2006, Hibbs 2007

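$z = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

$y_{e,i} = \mu_e + \epsilon_{e,i}$

$y_{e,i} = \mu_e + \delta_{e,i} + \epsilon_{e,i}$

$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$

$w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\Delta}^2_e}$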

Simple regression: All datasets are equally accurate

Random effects: Variation within and among datasets and interactions
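A minimal sketch of how such a random-effects combination could be computed for a single edge: Fisher z-transform each correlation, estimate the variation among datasets, and weight each dataset by the inverse of its total variance. The between-dataset variance here uses the DerSimonian-Laird moment estimator, which is an assumption swapped in for illustration; the slide does not say which estimator the talk used. Function and variable names are hypothetical, and the weights are normalized to sum to one.

```python
import math


def fisher_z(rho):
    """Fisher z-transform of a correlation coefficient in (-1, 1)."""
    return 0.5 * math.log((1 + rho) / (1 - rho))


def random_effects_estimate(y, s2):
    """Combine per-dataset scores y (one per dataset) for a single edge.

    y:  effect per dataset, e.g. a z-transformed correlation
    s2: within-dataset variance of each effect
    Returns (combined estimate, estimated between-dataset variance).
    """
    # Fixed-effect weights and mean: every dataset measures the same effect.
    w = [1.0 / v for v in s2]
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

    # DerSimonian-Laird estimate of the between-dataset variance.
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    between = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights add the between-dataset variance to each dataset.
    w_star = [1.0 / (v + between) for v in s2]
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu, between


# Toy example: one gene pair's correlation in three datasets.
correlations = [0.8, 0.6, 0.1]
within_variances = [0.05, 0.05, 0.05]
mu, between = random_effects_estimate(
    [fisher_z(r) for r in correlations], within_variances
)
```

When the datasets agree, the between-dataset term shrinks to zero and the estimate reduces to the fixed-effect (simple) weighted mean.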

10

Meta-analysis for unsupervised functional data integration

Evangelou 2007

Huttenhower 2006, Hibbs 2007

$z = \frac{1}{2}\log\frac{1+\rho'}{1-\rho'}$

11

~2000

AML/ALL, Temperature

DNA damage

Gene expression

Batch effects

Functional modules

So what does all of this have to do with microbial communities?

12

2010

Healthy/IBD, Temperature

Location

Taxa & Orthologs

???

Niches & Phylogeny

Test for correlates

Multiple hypothesis correction

Feature selection

p >> n

Confounds/stratification/environment

Cross-validate

Biological story?

Independent sample

Intervention/perturbation

13

What features to test?

16S reads

WGS reads

Taxa

Orthologous clusters

Pathways/modules

Functional roles

Pathway activity

Genomic data (Reference genomes)

Functional data (Experimental models)

Binning

Clustering

Microbiome data

14

MetaHIT: Data features

WGS reads

Pathways/modules

KO clusters

KEGG pathways

85 healthy, 15 IBD + 12 healthy, 12 IBD

ReBLASTed against KEGG since published data obfuscates read counts

10x bootstrap within training cohort, test on 12+12 as validation

Taxa

Phymm (Brady 2009)

15

MetaHIT: Taxonomic CD biomarkers

Bacteroidetes

Firmicutes

Methanomicrobia

Enterobacteriaceae

Chromatiales

Desulfobacterales

Oxalobacteraceae

Rhodobacteraceae

Bradyrhizobiaceae

iTOL (Letunic 2007)

16

MetaHIT: Taxonomic CD biomarkers

Down in CD

Up in CD

17

MetaHIT: Functional CD biomarkers

Growth/replication Motility Transporters Sugar metabolism

Down in CD

Up in CD

18

MetaHIT: KO IBD biomarkers

Transporters

Growth/replication

Motility

Sugar metabolism

Down in IBD

Up in IBD

LEfSe

Nicola Segata

Metagenomic differential analysis: LEfSe

1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…

2. Is the difference biologically significant? Expert supervision, specific post-hoc tests…

3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…

LEfSe: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68
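The first and third questions can be sketched for a single feature (one taxon or pathway) measured in two classes. This is a simplified stand-in, not the LEfSe implementation: it uses a two-class Kruskal–Wallis test without tie correction, and it reports the absolute difference in class means in place of an LDA score. All names are hypothetical, and the pairwise post-hoc step is omitted.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_two_class(a, b):
    """Kruskal-Wallis H statistic and p-value for exactly two classes.

    With two classes H has one degree of freedom, so the chi-square tail
    probability can be written with erfc. No tie correction is applied.
    """
    ranks = average_ranks(list(a) + list(b))
    n = len(ranks)
    rank_sum_a = sum(ranks[: len(a)])
    rank_sum_b = sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (
        rank_sum_a ** 2 / len(a) + rank_sum_b ** 2 / len(b)
    ) - 3 * (n + 1)
    h = max(h, 0.0)  # guard against tiny negative rounding error
    return h, math.erfc(math.sqrt(h / 2))


def screen_feature(class_a, class_b, alpha=0.05):
    """Step 1: significance test. Step 3: a simple effect size."""
    h, p = kruskal_wallis_two_class(class_a, class_b)
    effect = abs(sum(class_a) / len(class_a) - sum(class_b) / len(class_b))
    return {"H": h, "p": p, "significant": p < alpha, "mean_difference": effect}


# Toy abundances of one feature in two classes of samples.
result = screen_feature([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

Separating the significance test from the effect size keeps a statistically significant but tiny difference from being reported as a biomarker.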

19

20

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001

ANOVA: p < 0.05

LEfSe: DIFF!

Hi-level functional category: Carbohydrates

Hi-level functional category: Transporters

Hi-level functional category: Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial Viral

Dinsdale 2008

21

Sleipnir: Software for scalable functional genomics

Massive datasets require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics

• Data types for biological entities

• Microarray data, interaction data, genes and gene sets, functional catalogs, etc.

• Network communication, parallelization

• Efficient machine learning algorithms

• Generative (Bayesian) and discriminative (SVM)

• And it’s fully documented!

It’s also speedy: microbial data integration computation takes <3 hrs.

22

Recap

TaFTan Meta-analytic integration

LEfSe

• Unsupervised system for data mining without curated prior knowledge

• Comparative microbiome analysis by taxa, orthologs, and pathways

• Sleipnir software for scalable functional genomics

• Network framework for scalable data integration

• Cross-species knowledge transfer from functional data

23

Thanks!

http://huttenhower.sph.harvard.edu/sleipnir

Jacques Izard

Wendy Garrett

Sarah Fortune

Pinaki Sarder, Nicola Segata

Levi Waldron, Larisa Miropolsky

Willythssa Pierre-Louis

25

Predicting Gene Function

Cell cycle genes

Predicted relationships between genes

High Confidence

Low Confidence

26

Predicting Gene Function

Predicted relationships between genes

High Confidence

Low Confidence

Cell cycle genes

27

Cell cycle genes

Predicting Gene Function

Predicted relationships between genes

High Confidence

Low Confidence

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

28

Comprehensive Validation of Computational Predictions

Genomic data

Computational Predictions of Gene Function

MEFIT; SPELL (Hibbs et al 2007)

bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory experiments: petite frequency, growth curves, confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

Prior knowledge

29

Evaluating the Performance of Computational Predictions

106 Original GO Annotations

Genes involved in mitochondrion organization and biogenesis

135 Under-annotations

82 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

30

Evaluating the Performance of Computational Predictions

106 Original GO Annotations

Genes involved in mitochondrion organization and biogenesis

95 Under-annotations

40 Confirmed Under-annotations

80 Novel Confirmations, First Iteration

17 Novel Confirmations, Second Iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

31

Validating Human Predictions

Autophagy

Luciferase (Negative control)

ATG5 (Positive control), LAMP2, RAB11A

Not Starved

Starved (Autophagic)

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

32

Functional mapping: mining integrated networks

Predicted relationships between genes

High Confidence

Low Confidence

The strength of these relationships indicates how cohesive a process is.

Chemotaxis

33

Functional mapping: mining integrated networks

Predicted relationships between genes

High Confidence

Low Confidence

Chemotaxis

34

Functional mapping: mining integrated networks

Flagellar assembly

The strength of these relationships indicates how associated two processes are.

Predicted relationships between genes

High Confidence

Low Confidence

Chemotaxis

35

Functional Mapping: Scoring Functional Associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes

• Edges within each set

• The background edges incident to each set

• The baseline of all edges in the network

$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
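A minimal sketch of such a score, assuming each of the four measures is the mean edge weight over the relevant gene pairs and that the score is the between-to-background ratio multiplied by the baseline-to-within ratio. The slide does not define the measures precisely, so these definitions, the function name `fa_score`, and the dictionary-of-edge-weights representation are illustrative assumptions.

```python
from itertools import combinations, product


def fa_score(weights, genes, g1, g2):
    """Functional association between gene sets g1 and g2.

    weights: dict mapping frozenset({a, b}) to an edge weight (missing = 0)
    genes:   list of all genes in the network
    Assumes g1 and g2 together contain at least one within-set pair.
    """
    def mean_weight(pairs):
        vals = [weights.get(frozenset((a, b)), 0.0) for a, b in pairs if a != b]
        if not vals:
            raise ValueError("no gene pairs to average over")
        return sum(vals) / len(vals)

    between = mean_weight(product(g1, g2))
    within = mean_weight(list(combinations(g1, 2)) + list(combinations(g2, 2)))
    # Background: every edge touching either set, to any gene in the network.
    background = mean_weight((a, b) for a in set(g1) | set(g2) for b in genes)
    baseline = mean_weight(combinations(genes, 2))
    return (between / background) * (baseline / within)


# Toy network: four genes, with strong edges only between the two sets.
genes = ["a", "b", "c", "d"]
weights = {frozenset(p): 0.1 for p in combinations(genes, 2)}
for pair in product(["a", "b"], ["c", "d"]):
    weights[frozenset(pair)] = 0.9
score = fa_score(weights, genes, ["a", "b"], ["c", "d"])
```

In a network where every edge has the same weight, all four measures coincide and the score is 1, which matches the intuition that 1 means no association beyond background.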

36

Functional Mapping: Bootstrap p-values

• Scoring functional associations is great…

…how do you interpret an association score?

– For gene sets of arbitrary sizes?

– In arbitrary graphs?

– Each with its own bizarre distribution of edges?

Empirically!

[Figure: histograms of FAs for random sets, by number of genes in each set (1, 5, 10, 50)]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1.

Standard deviation is asymptotic in the sizes of both gene sets.

Maps FA scores to p-values for any gene sets and underlying graph.

[Figure: null distribution σs for one graph, as a function of |G1| and |G2|]

$\hat{\mu}_{FA}(G_i, G_j) = 1$

$\hat{\sigma}_{FA}(G_i, G_j)$: a function of $|G_i|$ and $|G_j|$ with graph-specific terms $A(G)$, $B(G)$, and $C(G)$

$P_{FA,G_1,G_2}(x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
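The empirical procedure can be sketched as follows: draw many random gene sets of the sizes of interest, score each pair, and use the resulting mean and standard deviation in a normal tail probability. This is an illustration only. The scoring function `toy_score` is a hypothetical stand-in for the FA score, the random sets are drawn disjointly, and the sketch estimates σ directly for one pair of sizes rather than fitting the size-dependent form on the slide.

```python
import itertools
import math
import random
import statistics


def empirical_null(score, genes, size1, size2, n_draws=200, seed=0):
    """Mean and standard deviation of scores for random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_draws):
        drawn = rng.sample(genes, size1 + size2)
        scores.append(score(drawn[:size1], drawn[size1:]))
    return statistics.mean(scores), statistics.stdev(scores)


def p_value(x, mu, sigma):
    """Upper-tail probability 1 - Phi((x - mu) / sigma) of a normal null."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))


# Toy network with random edge weights (illustrative data only).
genes = [f"g{i}" for i in range(20)]
weight_rng = random.Random(1)
weights = {
    frozenset(pair): weight_rng.random()
    for pair in itertools.combinations(genes, 2)
}
overall = statistics.mean(weights.values())


def toy_score(g1, g2):
    """Stand-in score: mean between-set weight relative to the network mean."""
    between = statistics.mean(
        weights[frozenset((a, b))] for a in g1 for b in g2
    )
    return between / overall


mu, sigma = empirical_null(toy_score, genes, size1=3, size2=3)
p = p_value(mu + 3 * sigma, mu, sigma)  # a score three σ above the null mean
```

Because the null depends on both set sizes and on the graph, the draw has to be repeated (or a size-dependent fit used) for each combination of sizes being tested.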

37

Functional Mapping: Functional Associations Between Processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)

Borders: Data coverage of processes (Well Covered to Sparsely Covered)

Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance

38

Functional Mapping: Functional Associations Between Processes

Edges: Associations between processes (Very Strong to Moderately Strong)

Nodes: Cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)

Borders: Data coverage of processes (Well Covered to Sparsely Covered)

39

Functional maps for cross-species knowledge transfer

[Figure: network of genes G1 to G17 mapped onto a network of nodes O1 to O9]

O1: G1, G2, G3; O2: G4; O3: G6; …

ECG1, ECG2; BSG1; ECG3, BSG2; …

40

Functional maps for functional metagenomics

GOS 4441599.3, Hypersaline Lagoon, Ecuador

KEGG Pathways

Organisms

Pathogens

Env.

Mapping genes into pathways

Mapping pathways into organisms

+ Integrated functional interaction networks in 27 species

Mapping organisms into phyla

=

41

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

42

Functional Maps: Focused Data Summarization

ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping

• Very large collections of genomic data

• Specific predicted molecular interactions

• Pathway, process, or disease associations

• Underlying experimental results and functional activities in data

43

Functional maps for cross-species knowledge transfer

← Precision ↑, Recall ↓

Following up with unsupervised and partially anchored network alignment

44

LEfSe: A non-human example

Viromes vs. bacterial metagenomes

Metastats (White 2009): p < 0.001

ANOVA: p < 0.05

LEfSe: DIFF!

Hi-level functional category: Carbohydrates

Hi-level functional category: Membrane Transport

Hi-level functional category: Nitrogen Metabolism

Hi-level functional category: Nucleosides and Nucleotides

LEfSe: NO DIFF!

Microbial, Viral