Scalable data mining for functional genomics and metagenomics
![Page 1: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/1.jpg)
Scalable data mining for functional genomics and metagenomics
Curtis Huttenhower
09-16-10
Harvard School of Public Health
Department of Biostatistics
![Page 2: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/2.jpg)
2
Greatest discoveries in biology?
Our job is to create computational microscopes:
To ask and answer specific biological questions using
millions of experimental results
![Page 3: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/3.jpg)
3
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities
![Page 4: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/4.jpg)
4
A computational definition of functional genomics
Genomic data Prior knowledge
Data → Function
Function → Function
Gene → Gene
Gene → Function
![Page 5: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/5.jpg)
5
A framework for functional genomics
[Figure: integrating many datasets over gold-standard gene pairs; 100Ms of gene pairs × 1Ks of datasets]
Gold-standard gene pairs: G1-G2 +, G4-G9 +, …, G3-G6 -, G7-G8 -, …, G2-G5 ?
Example dataset values per pair:
0.9 0.7 … 0.1 0.2 … 0.8
+ - … - - … +
0.8 0.5 … 0.05 0.1 … 0.6
Legends: High/Low similarity; High/Low correlation
Histograms (y-axis: frequency): High vs. Low correlation; Let. vs. Not let.; Similar vs. Dissim.
P(G2-G5|Data) = 0.85
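The slide's result, P(G2-G5|Data) = 0.85, is a posterior probability that a gene pair is functionally related given its values across datasets. A minimal sketch of how such a posterior could be computed follows, assuming a naive Bayes combination in which each dataset is discretized and contributes independently. The function name, the dataset names, and every probability below are hypothetical and are not taken from the talk.

```python
def posterior_related(prior, likelihoods, observed):
    """Naive Bayes posterior that a gene pair is functionally related.

    prior       -- P(related) before seeing any data (hypothetical value)
    likelihoods -- {dataset: {bin: (P(bin | related), P(bin | unrelated))}}
    observed    -- {dataset: bin} for the pair; absent datasets are skipped
    """
    related = prior
    unrelated = 1.0 - prior
    for dataset, value in observed.items():
        p_rel, p_unrel = likelihoods[dataset][value]
        related *= p_rel
        unrelated *= p_unrel
    # Normalize over the two hypotheses.
    return related / (related + unrelated)


# Hypothetical conditional probabilities, as if learned from +/- gold-standard pairs.
likelihoods = {
    "coexpression": {"high": (0.6, 0.1), "low": (0.4, 0.9)},
    "similarity": {"similar": (0.5, 0.2), "dissimilar": (0.5, 0.8)},
}
p = posterior_related(
    0.1, likelihoods, {"coexpression": "high", "similarity": "similar"}
)
print(round(p, 3))
```

Skipping datasets in which a pair was not measured keeps the sketch usable when coverage is uneven, which is the usual situation across thousands of datasets.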
![Page 6: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/6.jpg)
6
Functional network prediction and analysis
Global interaction network
Carbon metabolism network, Extracellular signaling network, Gut community network
Currently includes data from 30,000 human experimental results, 15,000 expression conditions + 15,000 diverse others, analyzed for 200 biological functions and 150 diseases
HEFalMp
![Page 7: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/7.jpg)
7
Functional network prediction from diverse microbial data
486 bacterial expression experiments
876 raw datasets
310 postprocessed datasets
304 normalized coexpression networks in 27 species
Integrated functional interaction networks in 15 species
307 bacterial interaction experiments
154796 raw interactions
114786 postprocessed interactions
E. coli integration
← Precision ↑, Recall ↓
![Page 8: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/8.jpg)
8
Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
Fisher z-transform: $z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$
Simple regression: All datasets are equally accurate
Random effects: Variation within and among datasets and interactions
[Model equations for $y_{e,i}$ garbled in extraction]
$\hat{\mu}_e = \sum_i w^*_{e,i}\, y_{e,i}$
$w^*_{e,i} = \dfrac{1}{s^2_{e,i} + \hat{\tau}^2_e}$
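The two ingredients named on this slide, a z-transform of correlations and a random-effects combination across datasets, can be sketched as follows. This is a minimal illustration and not the talk's implementation: the between-dataset variance is estimated here with the DerSimonian-Laird moment estimator, which is an assumption, and the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher z-transform of a correlation r in (-1, 1); equal to atanh(r)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def random_effects_mean(effects, variances):
    """Combine per-dataset effects with weights 1 / (s_i^2 + tau^2).

    tau^2 (between-dataset variance) uses the DerSimonian-Laird estimator,
    chosen for this sketch only. Returns (pooled mean, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(ws * yi for ws, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, tau2


# Correlations for one gene pair in three datasets with n conditions each
# (hypothetical values); the sampling variance of z is about 1 / (n - 3).
correlations = [0.8, 0.6, 0.1]
n_conditions = [20, 50, 12]
z_scores = [fisher_z(r) for r in correlations]
z_variances = [1.0 / (n - 3) for n in n_conditions]
print(random_effects_mean(z_scores, z_variances))
```

With tau^2 = 0 the weights reduce to plain inverse-variance weights, so the random-effects estimate coincides with the fixed-effects one whenever the datasets agree.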
![Page 9: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/9.jpg)
9
Meta-analysis for unsupervised functional data integration
Evangelou 2007
Huttenhower 2006, Hibbs 2007
Fisher z-transform: $z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$
![Page 10: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/10.jpg)
10
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
![Page 11: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/11.jpg)
11
Unsupervised data integration: TB virulence and ESX-1 secretion
With Sarah Fortune
Graphle http://huttenhower.sph.harvard.edu/graphle/
X?
![Page 12: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/12.jpg)
12
Predicting gene function
Cell cycle genes
Predicted relationships between genes
High confidence
Low confidence
![Page 13: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/13.jpg)
13
Predicting gene function
Predicted relationships between genes
High confidence
Low confidence
Cell cycle genes
![Page 14: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/14.jpg)
14
Cell cycle genes
Predicting gene function
Predicted relationships between genes
High confidence
Low confidence
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
![Page 15: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/15.jpg)
15
Comprehensive validation of computational predictions
Genomic data
Computational Predictions of Gene Function
MEFIT
SPELL (Hibbs et al. 2007)
bioPIXIE (Myers et al. 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory experiments: petite frequency, growth curves, confocal microscopy
New known functions for correctly predicted genes
Retraining
With David Hess, Amy Caudy
Prior knowledge
![Page 16: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/16.jpg)
16
Evaluating the performance of computational predictions
106 Original GO Annotations
Genes involved in mitochondrion organization and biogenesis
135 Under-annotations
82 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
![Page 17: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/17.jpg)
17
Evaluating the performance of computational predictions
106 Original GO Annotations
Genes involved in mitochondrion organization and biogenesis
95 Under-annotations
40 Confirmed Under-annotations
80 Novel Confirmations, First Iteration
17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
![Page 18: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/18.jpg)
18
Functional mapping: mining integrated networks
Predicted relationships between genes
High confidence
Low confidence
The strength of these relationships indicates how cohesive a process is.
Chemotaxis
![Page 19: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/19.jpg)
19
Functional mapping: mining integrated networks
Predicted relationships between genes
High confidence
Low confidence
Chemotaxis
![Page 20: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/20.jpg)
20
Functional mapping: mining integrated networks
Flagellar assembly
The strength of these relationships indicates how associated two processes are.
Predicted relationships between genes
High confidence
Low confidence
Chemotaxis
![Page 21: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/21.jpg)
21
Functional mapping: Associations among processes
Edges: Associations between processes (Very strong / Moderately strong)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 22: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/22.jpg)
22
Functional mapping: Associations among processes
Edges: Associations between processes (Very strong / Moderately strong)
Borders: Data coverage of processes (Well covered / Sparsely covered)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 23: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/23.jpg)
23
Functional mapping: Associations among processes
Edges: Associations between processes (Very strong / Moderately strong)
Nodes: Cohesiveness of processes (Below baseline / Baseline (genomic background) / Very cohesive)
Borders: Data coverage of processes (Well covered / Sparsely covered)
Hydrogen Transport
Electron Transport
Cellular Respiration
Protein Processing
Peptide Metabolism
Cell Redox Homeostasis
Aldehyde Metabolism
Energy Reserve Metabolism
Vacuolar Protein Catabolism
Negative Regulation of Protein Metabolism
Organelle Fusion
Protein Depolymerization
Organelle Inheritance
![Page 24: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/24.jpg)
24
Functional mapping: Associations among processes
Edges: Associations between processes (Very strong / Moderately strong)
Nodes: Cohesiveness of processes (Below baseline / Baseline (genomic background) / Very cohesive)
Borders: Data coverage of processes (Well covered / Sparsely covered)
![Page 25: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/25.jpg)
25
Cross-species knowledge transfer using functional data
Pinaki Sarder
$P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\,P(FR_s)$
$P(FR_s, FR_t)$
$P(FR_s \mid D)$
[Remaining equations, which condition on other species' relationships $FR_t$ and data $D$, garbled in extraction]
TaFTan
![Page 26: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/26.jpg)
26
TaFTan: Cross-species knowledge transfer using functional data
E. coli
B. subtilis
P. aeruginosa
M. tuberculosis
Species-specific data
Species’ data excluded
All species’ data
[Figure axes: log(precision/random) vs. log(recall)]
• Important to take advantage of all
available data for any one organism
• Important to take advantage of all
available data for every organism
• Scalable to dozens of organisms with
hundreds of functional datasets
• Currently working on making this
more context-specific
![Page 27: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/27.jpg)
27
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities
![Page 28: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/28.jpg)
28
~2000
AML/ALL
Survival
Mutation
Gene expression
Batch effects
Functional modules
So what does all of this have to do with microbial communities?
![Page 29: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/29.jpg)
29
~2005
Healthy/Diabetes
BMI
M/F
SNP genotypes
Population structure
LD
![Page 30: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/30.jpg)
30
2010
Healthy/IBD
Temperature
Location
Taxa & Orthologs
???
Niches & Phylogeny
Test for correlates
Multiple hypothesis correction
Feature selection
p >> n
Confounds/stratification/environment
Cross-validate
Biological story?
Independent sample
Intervention/perturbation
![Page 31: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/31.jpg)
31
What’s metagenomics?
Total collection of microorganisms within a community
Also microbial community or microbiota
Total genomic potential of a microbial community
Total biomolecular repertoire of a microbial community
Study of uncultured microorganisms from the environment, which can include humans or other living hosts
![Page 32: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/32.jpg)
32
The Human Microbiome Project
2006 - ongoing
• 300 “normal” adults, 18-40
• 16S rDNA + WGS
• 5 sites/18 samples + blood
• Oral cavity: saliva, tongue, palate, buccal mucosa, gingiva, tonsils, throat, teeth
• Skin: ears, inner elbows
• Nasal cavity
• Gut: stool
• Vagina: introitus, mid, fornix
• Reference genomes (~200-800)
All healthy subjects; followup projects in psoriasis, Crohn’s, colitis, obesity, acne, cancer, resistant infection…
Hamady, 2009
![Page 33: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/33.jpg)
33
What features to test?
16S reads
WGS reads
Taxa
Orthologous clusters
Pathways/modules
Functional roles
Pathway activity
Genomic data (Reference genomes)
Functional data (Experimental models)
Binning
Clustering
Microbiome data
![Page 34: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/34.jpg)
34
HMP: Data features
16S reads
Orthologous clusters
Pathways/modules
Taxa
Genes (KOs)
Pathways (KEGGs)
![Page 35: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/35.jpg)
35
HMP: Body sites
Taxa
KOs
KEGGs
Vanilla linear SVM
![Page 36: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/36.jpg)
36
HMP: Subjects
Taxa
KEGGs
We can tell who you are by the bugs in
your mouth!
![Page 37: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/37.jpg)
37
HMP: Metabolic reconstruction
WGS reads
Pathways/modules
Genes (KOs)
Pathways (KEGGs)
Functional seq.: KEGG + MetaCyc, CAZy, TCDB, VFDB, MEROPS…
BLAST → Genes
[Equation garbled in extraction; recoverable symbols: c(g), a, p, r]
Genes → Pathways: MinPath (Ye 2009)
Smoothing: Witten-Bell
$c'(g) = \begin{cases} \dfrac{T}{V-T}\cdot\dfrac{N}{N+T} & \text{if } c(g) = 0 \\ c(g)\,\dfrac{N}{N+T} & \text{otherwise} \end{cases}$
Gap filling
300 subjects
1-3 visits/subject
15-18 body sites/visit
10-20M reads/sample
100bp reads
BLAST
?
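The Witten-Bell smoothing step named on this slide can be sketched as below. The sketch assumes the standard Witten-Bell definitions, since the slide does not define its symbols: N is the total count, T the number of gene families observed at least once, and V the size of the full catalog. The function name and the KO identifiers are hypothetical.

```python
def witten_bell_smooth(counts, catalog):
    """Witten-Bell smoothing expressed on counts rather than probabilities.

    counts  -- {gene family: observed count}; keys are assumed to be in catalog
    catalog -- every gene family that could be observed (size V)
    Observed families are scaled by N / (N + T); the mass this frees up is
    shared equally among the V - T unobserved families.
    """
    n = sum(counts.values())                                # N
    t = sum(1 for g in catalog if counts.get(g, 0) > 0)     # T
    v = len(catalog)                                        # V
    if n == 0:
        return {g: 0.0 for g in catalog}
    scale = n / (n + t)
    smoothed = {}
    for g in catalog:
        c = counts.get(g, 0)
        smoothed[g] = c * scale if c > 0 else (t / (v - t)) * scale
    return smoothed


smoothed = witten_bell_smooth({"K1": 3, "K2": 1}, ["K1", "K2", "K3", "K4"])
print(smoothed)
```

The total count is preserved whenever at least one catalog entry is unobserved: observed families sum to N·N/(N+T) and unobserved ones to T·N/(N+T).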
![Page 38: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/38.jpg)
38
HMP: Metabolic reconstruction
Pathway coverage Pathway abundance
![Page 39: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/39.jpg)
39
HMP: Metabolic reconstruction
Pathway coverage
Pathway abundance
[Figure: heatmaps with samples as columns and pathways as rows]
Aerobic body sites
Gastrointestinal body sites
All body sites (“core”)
![Page 40: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/40.jpg)
40
MetaHIT: Data features
WGS reads
Pathways/modules
85 healthy, 15 IBD + 12 healthy, 12 IBD
ReBLASTed against KEGG since published data obfuscates read counts
10x bootstrap within training cohort, test on 12+12 as validation
Taxa
Phymm (Brady 2009)
Genes (KOs)
Pathways (KEGGs)
![Page 41: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/41.jpg)
41
MetaHIT: Taxonomic CD biomarkers
Bacteroidetes
Firmicutes
Methanomicrobia
Enterobacteriaceae
Chromatiales
Desulfobacterales
Oxalobacteraceae
Rhodobacteraceae
Bradyrhizobiaceae
iTOL (Letunic 2007)
![Page 42: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/42.jpg)
42
MetaHIT: Taxonomic CD biomarkers
Down in CD
Up in CD
![Page 43: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/43.jpg)
43
MetaHIT: Functional CD biomarkers
Growth/replication Motility Transporters Sugar metabolism
Down in CD
Up in CD
![Page 44: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/44.jpg)
44
MetaHIT: KO IBD biomarkers
Transporters
Growth/replication
Motility
Sugar metabolism
Down in IBD
Up in IBD
LEfSe
Nicola Segata
![Page 45: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/45.jpg)
45
Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant? Expert supervision, specific post-hoc tests…
3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…
LEfSe:
p(ANOVA) < 0.05
pairwise post-hoc Wilcoxon OK
Log(Score(LDA)) = 3.68
![Page 46: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/46.jpg)
46
LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Metastats (White 2009): p < 0.001
ANOVA: p < 0.05
LEfSe: DIFF!
Hi-level functional category: Carbohydrates
Hi-level functional category: Transporters
Hi-level functional category: Nucleosides and Nucleotides
LEfSe: NO DIFF!
Microbial Viral
Dinsdale 2008
![Page 47: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/47.jpg)
47
Sleipnir: Software for scalable functional genomics
Massive datasets require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
• Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
• Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: microbial data integration computation takes <3hrs.
![Page 48: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/48.jpg)
48
Outline
1. Data mining: Integrating very large genomic data compendia
2. Metagenomics: Network models of microbial communities
• Network framework for scalable data integration
• HEFalMp: human data integration
• TaFTan: cross-species knowledge transfer from functional data
• 16S and WGS community metabolic reconstruction
• LEfSe: biologically relevant community differences
• Sleipnir: software for scalable genomic data mining
![Page 49: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/49.jpg)
49
Thanks!
http://huttenhower.sph.harvard.edu/sleipnir
Jacques Izard
Wendy Garrett
Sarah Fortune
Pinaki Sarder, Nicola Segata
Levi Waldron, Larisa Miropolsky
Willythssa Pierre-Louis
Interested? We’re looking for postdocs!
http://huttenhower.sph.harvard.edu
Olga Troyanskaya, Chris Park, David Hess, Matt Hibbs, Chad Myers, Ana Pop, Aaron Wong
Hilary Coller, Erin Haley
![Page 50: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/50.jpg)
![Page 51: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/51.jpg)
51
HEFalMp: Predicting human gene function
HEFalMp
![Page 52: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/52.jpg)
52
HEFalMp: Predicting human genetic interactions
HEFalMp
![Page 53: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/53.jpg)
53
HEFalMp: Analyzing human genomic data
HEFalMp
![Page 54: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/54.jpg)
54
HEFalMp: Understanding human disease
HEFalMp
![Page 55: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/55.jpg)
55
Validating Human Predictions
Autophagy
Luciferase (Negative control)
ATG5 (Positive control), LAMP2, RAB11A
Not starved
Starved (Autophagic)
Predicted novel autophagy proteins
5½ of 7 predictions currently confirmed
With Erin Haley, Hilary Coller
![Page 56: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/56.jpg)
56
Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
$FA_{G_1,G_2} = \dfrac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \dfrac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
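A minimal sketch of the association score described on this slide follows. It assumes each of the four measures is a mean edge weight over the corresponding set of gene pairs, which the slide does not state explicitly. The graph representation (a dict keyed by unordered gene pairs, with missing pairs treated as weight 0) and all names are hypothetical.

```python
from itertools import combinations, product
from statistics import fmean


def mean_weight(weights, pairs, default):
    """Mean edge weight over unordered gene pairs; `default` if there are none."""
    unique = {frozenset(p) for p in pairs if p[0] != p[1]}
    if not unique:
        return default
    return fmean(weights.get(pair, 0.0) for pair in unique)


def functional_association(weights, genes, g1, g2):
    """FA = (between / background) * (baseline / within), per the slide's formula.

    Raises ZeroDivisionError if the background or within mean is zero.
    """
    baseline = mean_weight(weights, combinations(genes, 2), 0.0)
    between = mean_weight(weights, product(g1, g2), baseline)
    # Assumption: sets too small to have internal edges fall back to baseline.
    within = mean_weight(
        weights, list(combinations(g1, 2)) + list(combinations(g2, 2)), baseline
    )
    union = set(g1) | set(g2)
    background = mean_weight(weights, product(union, genes), baseline)
    return (between / background) * (baseline / within)


genes = ["a", "b", "c", "d", "e", "f"]
weights = {frozenset(p): 0.5 for p in combinations(genes, 2)}
for u, v in product("ab", "cd"):
    weights[frozenset((u, v))] = 0.9  # strengthen edges between the two sets
print(functional_association(weights, genes, {"a", "b"}, {"c", "d"}))
```

In a graph where every edge has the same weight, all four means coincide and the score is 1, which matches the null mean of 1 used for the bootstrap p-values.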
![Page 57: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/57.jpg)
57
Functional Mapping: Bootstrap p-values
• Scoring functional associations is great… …how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Empirically!
[Figure: Histograms of FAs for random sets, for gene set sizes 1, 5, 10, 50]
For any graph, compute FA scores for many randomly chosen gene sets of different sizes. Null distribution is approximately normal with mean 1.
Standard deviation is asymptotic in the sizes of both gene sets.
Maps FA scores to p-values for any gene sets and underlying graph.
[Figure: Null distribution σs for one graph; axes |G1| and |G2| on log scales, σ from 0 to 0.45]
$\hat{\mu}_{FA}(G_i,G_j) = 1$
[Equation for $\hat{\sigma}_{FA}(G_i,G_j)$ in terms of $|G_i|$, $|G_j|$ and constants A, B, C garbled in extraction]
$P_{G_1,G_2}(FA > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
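The empirical procedure on this slide can be sketched in two steps: estimate the null standard deviation from random gene sets of the given sizes, then map an observed score to a p-value under a normal null with mean 1. This is an illustration only. The scoring function is passed in by the caller, random sets are drawn without overlap (an assumption), and the function names are hypothetical.

```python
import random
from statistics import NormalDist, stdev


def null_sigma(score_fn, genes, size1, size2, n_samples=200, seed=0):
    """Standard deviation of score_fn over random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(genes, size1 + size2)
        scores.append(score_fn(sample[:size1], sample[size1:]))
    return stdev(scores)


def fa_p_value(fa, sigma, mu=1.0):
    """One-sided p-value P(FA > fa) under a Normal(mu, sigma) null."""
    return 1.0 - NormalDist(mu, sigma).cdf(fa)


# Toy scoring function standing in for a real FA score (hypothetical).
genes = list(range(20))
toy_score = lambda a, b: 1.0 + (sum(a) - sum(b)) / 100.0
sigma = null_sigma(toy_score, genes, 3, 3)
print(fa_p_value(1.5, sigma))
```

The slide instead fits σ as a smooth function of the two set sizes, so that p-values can be looked up without resampling for every query; the resampling above is the simpler way to obtain the same quantity for a single pair of sizes.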
![Page 58: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/58.jpg)
58
Functional maps for cross-species knowledge transfer
[Figure: network of genes G1 to G17 collapsed into clusters O1 to O9]
O1: G1, G2, G3
O2: G4
O3: G6
…
ECG1, ECG2
BSG1
ECG3, BSG2
…
![Page 59: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/59.jpg)
59
Functional maps for functional metagenomics
GOS 4441599.3
Hypersaline Lagoon, Ecuador
KEGG Pathways
Organisms
Pathogens
Env.
Mapping genes into pathways
Mapping pathways into organisms
+ Integrated functional interaction networks in 27 species
Mapping organisms into phyla
=
![Page 60: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/60.jpg)
60
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
![Page 61: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/61.jpg)
61
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
![Page 62: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/62.jpg)
62
Functional maps for cross-species knowledge transfer
← Precision ↑, Recall ↓
Following up with unsupervised and partially anchored network alignment
![Page 63: Scalable data mining for functional genomics and metagenomics](https://reader035.fdocuments.us/reader035/viewer/2022062314/56812bc9550346895d901dd4/html5/thumbnails/63.jpg)
63
LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Metastats (White 2009): p < 0.001
ANOVA: p < 0.05
LEfSe: DIFF!
Hi-level functional category: Carbohydrates
Hi-level functional category: Membrane Transport
Hi-level functional category: Nitrogen Metabolism
Hi-level functional category: Nucleosides and Nucleotides
LEfSe: NO DIFF!
Microbial Viral