Computation for Large Systems II: Applications and...

64
ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY OF TENNESSEE Computation for Large Systems II: Applications and Analysis ARC Winter School in Mathematical and Computational Biology IMB, UQ, 5-9 July, 2010 Mike Langston Professor Department of Electrical Engineering and Computer Science University of Tennessee USA 8 July 2010

Transcript of Computation for Large Systems II: Applications and...

Page 1: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

Computation for Large Systems II:Applications and Analysis

ARC Winter School in Mathematicaland Computational Biology

IMB, UQ, 5-9 July, 2010

Mike LangstonProfessor

Department of Electrical Engineering and Computer ScienceUniversity of Tennessee

USA8 July 2010

Page 2: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

2

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 3: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

3

Foundations

Systems Biology

• How do biological entities function in unison and atall levels of scale?

• Linkage, communication and networks (graphs!)

Page 4: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

4

Foundations

Systems Biology

Correlation

Here are five mouse geneswith Pearson correlations of at least 0.65. What of

• noise?• experimental design?• circadian rhythms?• sex, tameness, etc?• other confounds?• other metrics?

Page 5: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

5

Foundations

Systems Biology

CorrelationCoefficient ProfilesSometimes via• Pearson• Spearman• Mutual Information• Etc

Other times we need• p-values• Bonferroni corrections• q-values• false discovery rates...

Page 6: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

6

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systems

Page 7: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

7

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systemsHumans: 1014+ cells, 200+ cell types

Page 8: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

8

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systemsHumans: 1013+ cells, 200+ cell typesGenome (blueprint, 20K+ genes, 10M+ polymorphisms)

Page 9: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

9

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systemsHumans: 1013+ cells, 200+ cell typesGenome (blueprint, 20K+ genes, 10M+ polymorphisms)Proteome (functional units, unknown # of proteins)

Page 10: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

10

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systemsHumans: 1013+ cells, 200+ cell typesGenome (blueprint, 20K+ genes, 10M+ polymorphisms)Proteome (functional units, unknown # of proteins)Transcriptome

Translation (tRNA) via transcription (mRNA)Function and Signaling (siRNA, miRNA, etc)

Page 11: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

11

Foundations

Systems Biology

Correlation

Omics: key to deciphering complex systemsHumans: 1013+ cells, 200+ cell typesGenome (blueprint, 20K+ genes, 10M+ polymorphisms)Proteome (functional units, unknown # of proteins)Transcriptome

Translation (tRNA) via transcription (mRNA)Function and Signaling (siRNA, miRNA, etc)

Other: metabalome, lipidome, interactome, omeome!

Page 12: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

12

Foundations

Systems Biology

Correlation

Omics

Visualization- highly dependenton scale

Page 13: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

13

Foundations

Systems Biology

Correlation

Omics

Visualization

Computational Tools - focus usually on dense subgraphs

Page 14: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

14

Foundations

Systems Biology

Correlation

Omics

Visualization

Computational ToolsMaximum Clique

• must run often• time is a limiting factor• exploit fixed-parameter tractability (FPT)

Page 15: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

15

Foundations

Systems Biology

Correlation

Omics

Visualization

Computational ToolsMaximum CliqueMaximal Clique

• huge outputs• various orderings• memory is often the limiting factor

Page 16: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

16

Foundations

Systems Biology

Correlation

Omics

Visualization

Computational ToolsMaximum CliqueMaximal CliqueBiclique

• new algorithms• bipartite graphs

Page 17: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

17

Foundations

Systems Biology

Correlation

Omics

Visualization

Computational ToolsMaximum CliqueMaximal CliqueBicliqueParaclique

• noisy data and/or soft clustering

Page 18: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

18

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 19: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

19

Coexpression Analysis

Raw Data

Gene Expression Profiles

Edge-Weighted Complete Graph

cDNA or mRNA Microarrays

Correlation Computation

High-Pass Filtering

Normalization

Real-Valued Matrix

Graph Transforms

Unweighted Incomplete Graph

Clique-CentricMethods

k-Cores k-ConnectedComponents

Principal Component Analysis

k-MeansClustering

… . . . . . . . .

Paraclique

. . . . . . .MaximalClique

MaximumClique

...Increasing Edge Density(and Increasing Problem Complexity)

NP-completeProblems

Unsupervised Methods

Biclique...

HCSSubgraphs. .

. . . . .FPT VCCodes

HPC &Novel

Methods

Toolchain

Thresholding

Page 20: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

Coexpression Analysis

Thresholding

20

Page 21: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

Coexpression Analysis

Thresholding

21

Method Anoxia Reoxygen-ation Alpha Absolute deviations

from GO threshold

GO Functional Similarity 0.97 0.92 0.85

Spectral Clustering 0.93 0.97 0.89 0.04+0.05+0.04=0.13

Maximal Clique-2 0.90 0.91 0.74 0.07+0.01+0.11=0.19

Power 0.88 0.94 0.96 0.09+0.02+0.11=0.22

Bonferroni adjustment 0.85 0.93 0.95 0.12+0.01+0.10=0.23

Control-Spot 0.93 0.83 0.70 0.04+0.09+0.15=0.28

Maximal Clique-3 0.87 0.89 0.60 0.10+0.03+0.25=0.38

Top 1 Percent 0.81 0.81 0.72 0.16+0.11+0.13=0.40

Estimated threshold for each dataset, sorted by performance of the methods.GO functional similarity thresholds are the standard against which the methods are

compared, summing absolute deviations across datasets (thresholds above GO are in bold).

Page 22: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

Coexpression Analysis

Thresholding

22

Method Dataset Estimated Threshold

BootstrapMean Differencea

Bootstrap Standard Deviation

Maximal Clique-2

Anoxia 0.90 0.91 -0.01 0.015Reoxy 0.91 0.93 -0.02 0.009Alpha 0.74 0.78 -0.04 0.057

Spectral Clustering

Anoxia 0.93 0.95 -0.02 0.012Reoxy 0.97 0.97 0.00 0.011Alpha 0.89 **0.95 -0.06 0.017

Top 1%Anoxia 0.81 0.83 -0.02 0.011Reoxy 0.81 0.84 -0.03 0.016Alpha 0.72 **0.79 -0.07 0.027

Control Spot

Anoxia 0.93 0.95 -0.02 0.015Reoxy 0.83 **0.90 -0.07 0.034Alpha 0.70 **0.82 -0.08 0.043

Summary of bootstrap results. Estimated threshold is compared with the bootstrapdistribution for selected methods.a Estimated threshold minus bootstrap mean.** Estimated threshold is more than 2 std. deviations from bootstrap mean.

Page 23: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

23

Coexpression Analysis

Gene (vertex) comparisons:• differential expression• does not require multiple conditions • compare the two lists of gene expression levels

Page 24: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

24

Coexpression Analysis

Correlate (edge) comparisons• differential correlation• requires multiple conditions in control versus stimulus• compare two lists of gene-gene correlations

Page 25: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

25

Coexpression Analysis

Putative network (clique) comparisons• differential topology• compare cliques, sort by ontology, CREs, etc• consider granularity, for example, with the clique intersection graph

Page 26: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

26

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 27: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

27

Clustering

27

Clique is the gold standard. But data is seldom without errors:• false-discovery rate low via clique• biggest problem is missing edges

What we really want are very dense subgraphs. It’s straightforwardenough to use neighborhoods, but on real data:

• 1-neighborhoods produce edge densities of only around 16%.• 2-neighborhoods produce edge densities of only around 6%

The Paraclique Algorithm:• A clique gloms onto highly connected vertices.• Edge density stays north of 96%.• Lift and separate.

466-paraclique

280-clique

Page 28: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

28

Clustering

2828

Allows Overlapping Clusters

Pre-specified Number of Clusters (k)

Thresholded Correlations

Method Type Tool Result Range Other Parameters Tested

Ward Hierarchical R k Y

Average Hierarchical R k Y

McQuitty Hierarchical R k Y

Complete Hierarchical R k Y

K-Means Partitioning R k Y

SOM Neural network MeV k Y Grid size/type

QT Clust Partitioning MeV 24-385 Maximum cluster diameters

CAST Graph-based MeV 1-6162 Y

CLICK Graph-based Expander 4-32 Cluster homogeneity

SAMBA Graph-based Expander 7-30 Overlap prior factor

WGCNA Graph-based stand-alone 4-160 Power, Module detection method

NNN Graph-based stand-alone 23-52 (Y) Minimum neighborhood size

K-Clique Communities Graph-based CFinder/Ours 1-68 Y Y Clique size

Maximal Clique Graph-based Ours 1,000-64,000 Y Y

Paraclique Graph-based Ours 8-615 Y Y Glom factor

Methods Tested

Page 29: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

29

Clustering

2929

Page 30: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

30

Clustering

303030

Average Quartile

Small (3-10 genes) Medium (11-100 genes) Large (101-1000 genes)

Clustering Method QuartileBAT5

Jaccard Quartile BAT5 Jaccard Quartile BAT5 Jaccard

K-Clique Communities 1.00 1 0.7531 1 0.4465 1 0.4915

Maximal Clique 1.00 1 0.8433 1 0.4081 0.0000

Paraclique 1.00 1 0.7576 1 0.4285 1 0.4169

Ward (H) 1.33 2 0.5782 1 0.4011 1 0.5723

CAST 1.67 1 0.7455 3 0.3146 1 0.4994

QT Clust 2.00 2 0.5473 2 0.3670 2 0.3944

Complete (H) 2.33 3 0.3933 2 0.3677 2 0.3419

NNN 2.67 2 0.5521 2 0.3705 4 0.2406

K-Means 3.00 4 0.2573 3 0.3015 2 0.3463

SOM 3.00 4 0.3260 2 0.3286 3 0.3282

WGCNA 3.00 3 0.4391 3 0.3106 3 0.2949

Average (H) 3.33 3 0.4087 4 0.2792 3 0.3037

McQuitty (H) 3.33 3 0.4594 3 0.3065 4 0.2868

SAMBA 3.50 0.0000 4 0.1860 3 0.3298

CLICK 4.00 4 0.0339 4 0.1453 4 0.2817

Algorithms Ranked by Quartile Comparisons

Page 31: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

31

Clustering

313131

Algorithm Pipelining, Clustering Suites, Teragrid

Page 32: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

32

Clustering

SevenQuantativeTrait Loci

There’s a high probability that somewhere in here is a polymorphism controlling this trait.

Transcriptabundance can be the phenotype!

Relationship to QTLs

Page 33: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

33

Concentrated Parental Alleles

TwoParacliques

Clustering

Page 34: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

34

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 35: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

35

Data Integration

Phenotypic Data (e. g., diseased versus healthy patients)

Page 36: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

36

Data Integration

Phenotypic Data (e. g., diseased versus healthy patients)Proteomic Data (e. g., amino acid peaks from mass spec)

Page 37: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

37

Data Integration

Phenotypic Data (e. g., diseased versus healthy patients)Proteomic Data (e. g., amino acid peaks from mass spec)Transcriptomic Data (e.g., gene expression from µarrays)

Page 38: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

38

Data Integration

Phenotypic Data (e. g., diseased versus healthy patients)Proteomic Data (e. g., amino acid peaks from mass spec)Transcriptomic Data (e.g., gene expression from µarrays)Genotypic Data: SNPs

• DNA sequence variations, each occurringwhen a single nucleotide in the genomediffers between members of a species

• highly conserved throughout evolution and within population

• almost always just two alleles

• detected with SNP arrays designed to detect polymorphisms

Page 39: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

39

Data Integration

Proteins

AT

TCCG

TCACGT

AGCTGT

mRNA Co-expression

Network

Multi-LocusGenetic

RegulatoryNetwork Models

Natural Allelic Perturbations

(SNPs)

Protein-GeneRelationships

Proteins

Proteins

Protein PeakFactors

T/C

C/G

A/T

G/G

C/TPutative

Biomarkers

Diseased

Healthy

Data Integration

Page 40: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

40

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 41: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

41

Application, Allergy

Data Description• Mikael Benson, Göteborg, Sweden: 56 patients and 39 controls• Affymetrix HU133 arrays• roughly 33,000 genes• hay fever, eczema• nasal secretions, lymphocytes, skin

Page 42: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

42

Data Description• Mikael Benson, Göteborg, Sweden: 56 patients and 39 controls• Affymetrix HU133 arrays• roughly 33,000 genes• hay fever, eczema• nasal secretions, lymphocytes, skin

Preprocessing• MAS5.0• log transformed• centered around zero with z scores• probesets with consistently low expression levels removed• replicates averaged

Application, Allergy

Page 43: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

43

Data Description• Mikael Benson, Göteborg, Sweden, 56 patients and 39 controls• Affymetrix HU133 arrays• roughly 33,000 genes• hay fever, eczema• nasal secretions, lymphocytes, skin

Preprocessing• MAS5.0• log transformed• centered around zero with z scores• probesets with consistently low expression levels removed• replicates averaged

Threshold Selection• chosen to balance graph densities• AFFX spots retained for quality control

Application, Allergy

Page 44: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

44

0

500000

1000000

1500000

2000000

2500000

Correlation Value

Freq

uenc

y

Patient

Control

Correlation Coefficient Distribution

Application, Allergy

Page 45: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

45

Threshold Vertices Edges Maximal Cliques Maximum Size

0.88 8009 256346 240146378 84

0.89 7169 178144 15067064 79

0.90 6254 118900 1579041 71

0.91 5317 75541 243232 66

0.92 4415 45471 51315 59

Control

Threshold Vertices Edges Maximal Cliques Maximum Size

0.88 5809 91152 2298595 61

0.89 4999 62271 447176 52

0.90 4146 40933 114030 450.91 3405 26031 41605 35

0.92 2628 11322 11322 28

Patientribosomal or RNA-related

T-lymphocytes or epithelial cells

Graph Properties

Application, Allergy

Page 46: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

46

Clique profiles using the five most highly represented genes:

Control Patient

Gene Symbol Clique membership Gene Symbol Clique membership

UBE1C 29% FGFR2 66%

RANBP6 27% NFIB 65%

DKFZP564O123 26% PPL 64%

SLC25A13 24% FGFR3 64%

GTPBP4 21% CDH3 56%

Application, Allergy

Page 47: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

47

Clique profiles using the five most highly represented genes:

Control Patient

Gene Symbol Clique membership Gene Symbol Clique membership

UBE1C 29% FGFR2 66%

RANBP6 27% NFIB 65%

DKFZP564O123 26% PPL 64%

SLC25A13 24% FGFR3 64%

GTPBP4 21% CDH3 56%

Of course gene representation is only a small part of the story.

Application, Allergy

Page 48: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

48

We can use traditional algorithmic tools• extract cores, cliques and other dense subgraphs• check for scale-freeness, putative TFs, hubs, etc

Application, Allergy

Page 49: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

49

We can use traditional algorithmic tools• extract cores, cliques and other dense subgraphs• check for scale-freeness, putative TFs, hubs, etc

We can use commercial and other tools• sort subgraphs by ontological enrichment, CREs, etc• compare to literature, databases, etc• match genes and gene products with known interactions

Application, Allergy

Page 50: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

50

We can use traditional algorithmic tools• extract cores, cliques and other dense subgraphs• check for scale-freeness, putative TFs, hubs, etc

We can use commercial and other tools• sort subgraphs by ontological enrichment, CREs, etc• compare to literature, databases, etc• match genes and gene products with known interactions

It’s tempting to scan for your favorites...

Application, Allergy

Page 51: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

51

We can use traditional algorithmic tools• extract cores, cliques and other dense subgraphs• check for scale-freeness, putative TFs, hubs, etc

We can use commercial and other tools• sort subgraphs by ontological enrichment, CREs, etc• compare to literature, databases, etc• match genes and gene products with known interactions

It’s tempting to scan for your favorites...

But our goal is to identify altered interactions

Application, Allergy

Page 52: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

52

Differential AnalysisGene (vertex) comparisons:

• differential expression• does not require multiple conditions • compare the two lists of gene expression levels

Correlate (edge) comparisons • differential correlation• requires multiple conditions in control, in dose• compare the two lists of gene-gene correlations

Putative network (clique) comparisons• differential topology• focus on network aka clique differences• consider the clique intersection graph

Application, Allergy

Page 53: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

53

Differential AnalysisGene (vertex) comparisons:

• differential expression• does not require multiple conditions • compare the two lists of gene expression levels

Correlate (edge) comparisons • differential correlation• requires multiple conditions in control, in dose• compare the two lists of gene-gene correlations

Putative network (clique) comparisons• differential topology• focus on network aka clique differences• consider the clique intersection graph

Ongoing Work• 62 genes pass all three screens, 6 match a known pathway• ITK (IL2-inducible T-cell kinase), studying in depth...moved on to Illumina

Application, Allergy

Page 54: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

54

Differential AnalysisGene (vertex) comparisons:

• differential expression• does not require multiple conditions • compare the two lists of gene expression levels

Correlate (edge) comparisons • differential correlation• requires multiple conditions in control, in dose• compare the two lists of gene-gene correlations

Putative network (clique) comparisons• differential topology• focus on network aka clique differences• consider the clique intersection graph

Ongoing Work• 62 genes pass all three screens, 6 match a known pathway• ITK (IL2-inducible T-cell kinase), studying in depth...moving to Illumina

For Impact• concentrate on real data, and working with bench biologists• strategic publications (e.g., Nature Genetics, PLoS Comp Bio, etc)

Application, Allergy

Page 55: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

55

Outline of Talk

Foundations

Gene Coexpression Analysis

Clustering

Data Integration

Sample Applications to Human Health

Application to Model Organisms

Page 56: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

56

Application,Model Organisms

LD: a measure of statistical dependence between genetic markers• non-random association of alleles at two or more loci• the occurrence in a population of two linked alleles at a frequencyhigher or lower than expected on the basis of the individual frequencies

• not necessarily on the same chromosome

Page 57: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

57

Application,Model Organisms

LD: a measure of statistical dependence between genetic markers• non-random association of alleles at two or more loci• the occurrence in a population of two linked alleles at a frequencyhigher or lower than expected on the basis of the individual frequencies

• not necessarily on the same chromosome

Reflects biologically meaningful association of loci

Page 58: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

58

Application,Model Organisms

LD: a measure of statistical dependence between genetic markers• non-random association of alleles at two or more loci• the occurrence in a population of two linked alleles at a frequencyhigher or lower than expected on the basis of the individual frequencies

• not necessarily on the same chromosome

Reflects biologically meaningful association of loci

Generally a result of population history• population genealogy• recombination frequency• co-adaptive allele selection• natural selection• other factors

LD: a measure of statistical dependence between genetic markers• non-random association of alleles at two or more loci• the occurrence in a population of two linked alleles at a frequencyhigher or lower than expected on the basis of the individual frequencies

• not necessarily on the same chromosome

Reflects biologically meaningful association of loci

Page 59: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

59

Application,Model Organisms

Evaluation of Mus musculus breeding strategies

Solution: Use SNPs, correlation, paraclique and proximity

Standard Inbred (SI)Recombinant Inbred (RI)

BXD, LXS, etcThe Collaborative Cross

Page 60: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

60

Application,Model Organisms

0

200

400

600

800

1000

1200

0.90.850.80.750.70.650.60.550.50.450.40.350.30.250.20.150.1

Mutual Information

Num

ber o

f Par

acliq

ues

cont

aini

ng m

ore

than

3 S

NPs

that

cro

ss m

ultip

le

chro

mos

omes

67SI

89BXD

0

200

400

600

800

1000

1200

1400

0.90.850.80.750.70.650.60.550.50.450.40.350.30.250.20.150.1

Mutual Information

Nu

mb

er o

f P

arac

liqu

es c

on

tain

ing

m

ore

th

an 3

SN

Ps

67SI

89BXD

Number of LD Networks

Number of Non-Syntenic LD Networks

67 Inbred Strains

0

200

400

600

800

1000

1200

1400

0.90.850.80.750.70.650.60.550.50.450.40.350.30.250.20.150.1

Mutual Information

Num

ber

of P

arac

lique

s (s

ize

> 3)

1 chrs 2 chrs 3 chrs4 chrs 5 chrs 6 chrs7 chrs 8 chrs 9 chrs10 chrs 11 chrs 12 chrs13 chrs 14 chrs 15 chrs16 chrs 17 chrs

89 BXD Strains

0

200

400

600

800

1000

1200

1400

0.90.850.80.750.70.650.60.550.50.450.40.350.30.250.20.150.1

Mutual Information

Num

ber o

f Par

acliq

ues

(Siz

e>3)

1 chrs 2 chrs3 chrs 4 chrsRecombinant Inbred

Standard Inbred

Chromosome Coverage

Page 61: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

61

Application,Model Organisms

rs13476024

CEL-1_103029662

Chr 1

Chr 7

StandardInbred

Chr 11rs3664950

rs3724175

rs3674958rs13480968

Chr 4

rs3718552Chr 7 rs8243991

UT_7_136.88578

rs4226997

rs3714636

rs3694146rs6334210 rs1347955

3 mCV22291963rs13479554 rs13479555

rs6392543 CEL-

7_126142971

rs3663988

rs6303477 rs366012

2 rs3659292 rs1347956

6 rs13479567 rs62121

86 rs13479569rs13479570 rs366616

0 rs13479571

CEL-7_126301023

rs13479559

CEL-7_126570687

rs6216320

Recombinant Inbred

Chr 7

Example of ContrastingParaclique Profiles

Page 62: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

62

CollaboratorsComputer Science, Mathematics, Molecular Biology, Statistics

Research Scientists (Incomplete!):Mikael BensonElissa Chesler Frank DehneVlad Estivill-CastroMike FellowsIvan GerlingFrans HenskensReza MobiniPablo MoscatoMark RaganFran RosamondArnold SaxtonBrynn VoyRob Williams

Current/Recent Students:Bhavesh BoratePatricia CareyJohn EblenJeremy JayDenise KoesslerZuopan LiZahra MahoorSudhir NaswaClinton NolanAndy PerkinsCharles PhillipsGary RogersPeter ShawDinesh WeerapurageYun Zhang

Page 63: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

63

Come Join Us!

Biology, Computer Science, Mathematics, StatisticsThe University of Tennessee and Oak Ridge National Laboratory

• Analysis of Algorithms• Applications

o Alcoholism, Allergy, Cancer, Diabetes, Neuroscience, Radiology...• Complex Cross (210 RI Mouse Lines)• High Performance Computing• Metabolomics, Proteomics, Transcriptomics, ...• Ontological Discovery• RNAi and µRNA Interference• Threshold Selection• Time Series Analysis• Work Alongside Both Computer and Domain Scientists!

Page 64: Computation for Large Systems II: Applications and Analysisbioinformatics.org.au/resources/ws10/presentations/... · 2014-03-20 · ELECTRICAL ENGINEERING & COMPUTER SCIENCE UNIVERSITY

ELECTRICAL ENGINEERING & COMPUTER SCIENCEUNIVERSITY OF TENNESSEE

64

The Langston Lab(Geeks Я Us)