Post on 29-Jun-2020
ELECTRICAL ENGINEERING & COMPUTER SCIENCE, UNIVERSITY OF TENNESSEE
Computation for Large Systems II: Applications and Analysis
ARC Winter School in Mathematical and Computational Biology
IMB, UQ, 5-9 July, 2010
Mike Langston, Professor
Department of Electrical Engineering and Computer Science, University of Tennessee, USA
8 July 2010
Outline of Talk
Foundations
Gene Coexpression Analysis
Clustering
Data Integration
Sample Applications to Human Health
Application to Model Organisms
Foundations
Systems Biology
• How do biological entities function in unison and at all levels of scale?
• Linkage, communication and networks (graphs!)
Correlation
Here are five mouse genes with Pearson correlations of at least 0.65. What of
• noise?
• experimental design?
• circadian rhythms?
• sex, tameness, etc?
• other confounds?
• other metrics?
Correlation Coefficient Profiles
Sometimes via
• Pearson
• Spearman
• Mutual Information
• Etc
Other times we need
• p-values
• Bonferroni corrections
• q-values
• false discovery rates...
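A minimal sketch of two of the metrics above, plus the Bonferroni-adjusted significance level when every gene pair is tested. The function names and the toy expression vectors are illustrative assumptions, not taken from the talk.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bonferroni_alpha(alpha, n_genes):
    """Per-test significance level when all gene pairs are tested."""
    n_tests = n_genes * (n_genes - 1) // 2
    return alpha / n_tests

# Hypothetical expression profiles: gene_b is monotone but nonlinear in gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(gene_a, gene_b), spearman(gene_a, gene_b))
print(bonferroni_alpha(0.05, 1000))
```

Spearman scores the monotone relationship as perfect while Pearson does not, which is one reason the choice of metric matters before any thresholding.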
Omics: key to deciphering complex systems
Humans: 10^13+ cells, 200+ cell types
Genome (blueprint, 20K+ genes, 10M+ polymorphisms)
Proteome (functional units, unknown # of proteins)
Transcriptome
• Translation (tRNA) via transcription (mRNA)
• Function and Signaling (siRNA, miRNA, etc)
Other: metabolome, lipidome, interactome, omeome!
Visualization - highly dependent on scale
Computational Tools - focus usually on dense subgraphs
Maximum Clique
• must run often
• time is a limiting factor
• exploit fixed-parameter tractability (FPT)
Maximal Clique
• huge outputs
• various orderings
• memory is often the limiting factor
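A minimal sketch of maximal clique enumeration, using the textbook Bron-Kerbosch algorithm with pivoting. This is a stand-in for illustration, not the FPT-based codes referred to in the talk; the toy graph and helper names are assumptions.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a vertex -> neighbor-set map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting).

    Cliques are yielded one at a time rather than stored, since the
    number of maximal cliques can be exponential in the graph size.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

# Toy graph: one triangle plus a short path hanging off it.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
cliques = list(maximal_cliques(adj))
maximum = max(cliques, key=len)   # a maximum clique is a largest maximal clique
print(sorted(sorted(c) for c in cliques), sorted(maximum))
```

Streaming the output through a generator reflects the memory concern above: on real coexpression graphs the clique list itself is usually too large to hold at once.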
Biclique
• new algorithms
• bipartite graphs
Paraclique
• noisy data and/or soft clustering
Coexpression Analysis
Toolchain
[Figure: toolchain from raw data (cDNA or mRNA microarrays) through normalization, gene expression profiles (real-valued matrix), correlation computation (edge-weighted complete graph) and high-pass filtering, i.e. thresholding (unweighted incomplete graph), to graph transforms. Methods are ordered by increasing edge density (and increasing problem complexity): unsupervised methods (principal component analysis, k-means clustering), k-cores, k-connected components, HCS subgraphs, biclique, paraclique, maximal clique, maximum clique. The clique-centric methods are NP-complete problems, addressed with FPT VC codes, HPC and novel methods.]
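The high-pass filtering step can be sketched as follows: keep an unweighted edge wherever a correlation clears the threshold. Using the absolute value of the correlation is an assumption here, and the matrix and gene names are made up for illustration.

```python
def threshold_graph(genes, corr, threshold):
    """Build an unweighted graph: edge where |correlation| >= threshold."""
    adj = {g: set() for g in genes}
    for i, gi in enumerate(genes):
        for j in range(i + 1, len(genes)):
            if abs(corr[i][j]) >= threshold:
                adj[gi].add(genes[j])
                adj[genes[j]].add(gi)
    return adj

def graph_summary(adj):
    """Count non-isolated vertices and edges of the thresholded graph."""
    vertices = sum(1 for nbrs in adj.values() if nbrs)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return vertices, edges

# Hypothetical 4-gene correlation matrix (symmetric, unit diagonal).
genes = ["g1", "g2", "g3", "g4"]
corr = [
    [1.00, 0.93, -0.91, 0.10],
    [0.93, 1.00, 0.40, 0.05],
    [-0.91, 0.40, 1.00, 0.20],
    [0.10, 0.05, 0.20, 1.00],
]
for t in (0.90, 0.92):
    print(t, graph_summary(threshold_graph(genes, corr, t)))
```

Raising the threshold shrinks both the vertex and edge counts, which is the trade-off any threshold selection method has to balance.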
Thresholding
Method                    Anoxia  Reoxygenation  Alpha  Absolute deviations from GO threshold
GO Functional Similarity  0.97    0.92           0.85
Spectral Clustering       0.93    0.97           0.89   0.04+0.05+0.04=0.13
Maximal Clique-2          0.90    0.91           0.74   0.07+0.01+0.11=0.19
Power                     0.88    0.94           0.96   0.09+0.02+0.11=0.22
Bonferroni adjustment     0.85    0.93           0.95   0.12+0.01+0.10=0.23
Control-Spot              0.93    0.83           0.70   0.04+0.09+0.15=0.28
Maximal Clique-3          0.87    0.89           0.60   0.10+0.03+0.25=0.38
Top 1 Percent             0.81    0.81           0.72   0.16+0.11+0.13=0.40
Estimated threshold for each dataset, sorted by performance of the methods. GO functional similarity thresholds are the standard against which the methods are compared, summing absolute deviations across datasets (thresholds above GO are in bold).
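The ranking in the table can be reproduced directly from its threshold columns. The sketch below sums the absolute deviations from the GO thresholds for each method and sorts by that total; only the variable and function names are assumptions.

```python
# GO functional similarity thresholds, as listed in the table.
go = {"Anoxia": 0.97, "Reoxygenation": 0.92, "Alpha": 0.85}

# Estimated thresholds per method, as listed in the table.
methods = {
    "Spectral Clustering": {"Anoxia": 0.93, "Reoxygenation": 0.97, "Alpha": 0.89},
    "Maximal Clique-2": {"Anoxia": 0.90, "Reoxygenation": 0.91, "Alpha": 0.74},
    "Power": {"Anoxia": 0.88, "Reoxygenation": 0.94, "Alpha": 0.96},
    "Bonferroni adjustment": {"Anoxia": 0.85, "Reoxygenation": 0.93, "Alpha": 0.95},
    "Control-Spot": {"Anoxia": 0.93, "Reoxygenation": 0.83, "Alpha": 0.70},
    "Maximal Clique-3": {"Anoxia": 0.87, "Reoxygenation": 0.89, "Alpha": 0.60},
    "Top 1 Percent": {"Anoxia": 0.81, "Reoxygenation": 0.81, "Alpha": 0.72},
}

def total_abs_deviation(estimates):
    """Sum of |estimate - GO threshold| over the three datasets."""
    return round(sum(abs(estimates[d] - go[d]) for d in go), 2)

totals = {name: total_abs_deviation(est) for name, est in methods.items()}
ranking = sorted(totals, key=totals.get)
for name in ranking:
    print(f"{name:24s}{totals[name]:.2f}")
```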
Method               Dataset  Estimated Threshold  Bootstrap Mean  Difference (a)  Bootstrap Standard Deviation
Maximal Clique-2     Anoxia   0.90                 0.91            -0.01           0.015
                     Reoxy    0.91                 0.93            -0.02           0.009
                     Alpha    0.74                 0.78            -0.04           0.057
Spectral Clustering  Anoxia   0.93                 0.95            -0.02           0.012
                     Reoxy    0.97                 0.97             0.00           0.011
                     Alpha    0.89 **              0.95            -0.06           0.017
Top 1%               Anoxia   0.81                 0.83            -0.02           0.011
                     Reoxy    0.81                 0.84            -0.03           0.016
                     Alpha    0.72 **              0.79            -0.07           0.027
Control Spot         Anoxia   0.93                 0.95            -0.02           0.015
                     Reoxy    0.83 **              0.90            -0.07           0.034
                     Alpha    0.70 **              0.82            -0.08           0.043
Summary of bootstrap results. Estimated threshold is compared with the bootstrap distribution for selected methods.
(a) Estimated threshold minus bootstrap mean.
** Estimated threshold is more than 2 std. deviations from bootstrap mean.
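A rough sketch of the kind of bootstrap comparison summarized above, under strong simplifying assumptions: it resamples a flat list of correlation values (the actual study may well have resampled arrays and recomputed correlations), and it uses a hypothetical "top fraction" estimator. All names and the synthetic data are illustrative.

```python
import random
import statistics

def top_fraction_threshold(correlations, fraction=0.01):
    """Threshold that keeps roughly the top `fraction` of the values."""
    ordered = sorted(correlations, reverse=True)
    keep = max(1, int(len(ordered) * fraction))
    return ordered[keep - 1]

def bootstrap_threshold(correlations, estimator, n_boot=200, seed=0):
    """Compare an estimated threshold with its bootstrap distribution."""
    rng = random.Random(seed)
    estimate = estimator(correlations)
    boots = []
    for _ in range(n_boot):
        # Resample with replacement, same size as the original list.
        resample = rng.choices(correlations, k=len(correlations))
        boots.append(estimator(resample))
    mean = statistics.mean(boots)
    sd = statistics.stdev(boots)
    return {
        "estimate": estimate,
        "mean": mean,
        "difference": estimate - mean,          # estimate minus bootstrap mean
        "sd": sd,
        "flagged": abs(estimate - mean) > 2 * sd,  # the "**" rule
    }

# Synthetic correlations standing in for real data.
data_rng = random.Random(1)
correlations = [data_rng.uniform(-1, 1) for _ in range(2000)]
result = bootstrap_threshold(correlations, top_fraction_threshold)
print(result)
```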
Coexpression Analysis
Gene (vertex) comparisons:
• differential expression
• does not require multiple conditions
• compare the two lists of gene expression levels
Correlate (edge) comparisons:
• differential correlation
• requires multiple conditions in control versus stimulus
• compare two lists of gene-gene correlations
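Edge comparison can be sketched as a difference of two correlation lists keyed by gene pair. The cutoff `min_change`, the gene names and the correlation values below are all hypothetical.

```python
def differential_edges(corr_control, corr_stimulus, min_change):
    """Return gene pairs whose correlation changes by at least min_change.

    Both inputs map a (gene, gene) tuple to a correlation; only pairs
    present in both conditions are compared.
    """
    changed = {}
    for pair in corr_control.keys() & corr_stimulus.keys():
        delta = corr_stimulus[pair] - corr_control[pair]
        if abs(delta) >= min_change:
            changed[pair] = delta
    return changed

# Hypothetical gene-gene correlations in two conditions.
control = {("g1", "g2"): 0.91, ("g1", "g3"): 0.15, ("g2", "g3"): -0.40}
stimulus = {("g1", "g2"): 0.88, ("g1", "g3"): 0.86, ("g2", "g3"): 0.45}

changed = differential_edges(control, stimulus, min_change=0.5)
for pair in sorted(changed):
    print(pair, round(changed[pair], 2))
```

The pair (g1, g2) is strongly correlated in both conditions and so drops out: differential correlation looks for relationships that change, not ones that are merely strong.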
Putative network (clique) comparisons:
• differential topology
• compare cliques, sort by ontology, CREs, etc
• consider granularity, for example, with the clique intersection graph
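One way to read the clique intersection graph: each clique becomes a vertex, and two cliques are joined when they share genes. The sketch below assumes that reading; `min_shared` is an illustrative granularity knob, not a parameter from the talk.

```python
from itertools import combinations

def clique_intersection_graph(cliques, min_shared=1):
    """Join two cliques (by index) when they share >= min_shared genes.

    Returns a map from index pair (i, j) to the set of shared genes.
    """
    edges = {}
    for (i, a), (j, b) in combinations(enumerate(cliques), 2):
        shared = a & b
        if len(shared) >= min_shared:
            edges[(i, j)] = shared
    return edges

# Toy cliques: the first two overlap heavily, the third stands alone.
cliques = [{"a", "b", "c"}, {"b", "c", "d"}, {"e", "f"}]
print(clique_intersection_graph(cliques))
print(clique_intersection_graph(cliques, min_shared=3))
```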
Clustering
Clique is the gold standard. But data is seldom without errors:
• false-discovery rate low via clique
• biggest problem is missing edges
What we really want are very dense subgraphs. It’s straightforward enough to use neighborhoods, but on real data:
• 1-neighborhoods produce edge densities of only around 16%
• 2-neighborhoods produce edge densities of only around 6%
The Paraclique Algorithm:
• A clique gloms onto highly connected vertices.
• Edge density stays north of 96%.
• Lift and separate.
[Figure labels: 466-paraclique, 280-clique]
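A minimal sketch of the glom step, assuming one common formulation: a vertex outside the current set joins when it is adjacent to at least `glom` members. The exact rule and parameterization used in the talk are not stated, so treat this as an assumption; the toy graph is made up.

```python
def build_adjacency(edges):
    """Turn an undirected edge list into a vertex -> neighbor-set map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def paraclique(adj, clique, glom):
    """Grow a seed clique by absorbing well-connected outside vertices."""
    members = set(clique)
    grew = True
    while grew:
        grew = False
        for v in sorted(set(adj) - members):
            # Assumed rule: join if adjacent to at least `glom` members.
            if len(adj[v] & members) >= glom:
                members.add(v)
                grew = True
    return members

def edge_density(adj, members):
    """Fraction of possible edges present among the given vertices."""
    n = len(members)
    if n < 2:
        return 1.0
    edges = sum(len(adj[v] & members) for v in members) // 2
    return edges / (n * (n - 1) / 2)

# Toy graph: a 4-clique, vertex e missing one edge to it, vertex f barely attached.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("e", "a"), ("e", "b"), ("e", "c"), ("f", "a")]
adj = build_adjacency(edges)
para = paraclique(adj, {"a", "b", "c", "d"}, glom=3)
print(sorted(para), edge_density(adj, para))
```

"Lift and separate" would then remove the paraclique from the graph and repeat on what remains, which this sketch leaves out.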
Methods Tested
Y columns: Allows Overlapping Clusters, Pre-specified Number of Clusters (k), Thresholded Correlations
Method                Type            Tool          Result Range  Y     Other Parameters Tested
Ward                  Hierarchical    R             k             Y
Average               Hierarchical    R             k             Y
McQuitty              Hierarchical    R             k             Y
Complete              Hierarchical    R             k             Y
K-Means               Partitioning    R             k             Y
SOM                   Neural network  MeV           k             Y     Grid size/type
QT Clust              Partitioning    MeV           24-385              Maximum cluster diameters
CAST                  Graph-based     MeV           1-6162        Y
CLICK                 Graph-based     Expander      4-32                Cluster homogeneity
SAMBA                 Graph-based     Expander      7-30                Overlap prior factor
WGCNA                 Graph-based     stand-alone   4-160               Power, Module detection method
NNN                   Graph-based     stand-alone   23-52         (Y)   Minimum neighborhood size
K-Clique Communities  Graph-based     CFinder/Ours  1-68          Y Y   Clique size
Maximal Clique        Graph-based     Ours          1,000-64,000  Y Y
Paraclique            Graph-based     Ours          8-615         Y Y   Glom factor
Algorithms Ranked by Quartile Comparisons
                                        Small (3-10 genes)      Medium (11-100 genes)   Large (101-1000 genes)
Clustering Method     Average Quartile  Quartile  BAT5 Jaccard  Quartile  BAT5 Jaccard  Quartile  BAT5 Jaccard
K-Clique Communities  1.00              1         0.7531        1         0.4465        1         0.4915
Maximal Clique        1.00              1         0.8433        1         0.4081        -         0.0000
Paraclique            1.00              1         0.7576        1         0.4285        1         0.4169
Ward (H)              1.33              2         0.5782        1         0.4011        1         0.5723
CAST                  1.67              1         0.7455        3         0.3146        1         0.4994
QT Clust              2.00              2         0.5473        2         0.3670        2         0.3944
Complete (H)          2.33              3         0.3933        2         0.3677        2         0.3419
NNN                   2.67              2         0.5521        2         0.3705        4         0.2406
K-Means               3.00              4         0.2573        3         0.3015        2         0.3463
SOM                   3.00              4         0.3260        2         0.3286        3         0.3282
WGCNA                 3.00              3         0.4391        3         0.3106        3         0.2949
Average (H)           3.33              3         0.4087        4         0.2792        3         0.3037
McQuitty (H)          3.33              3         0.4594        3         0.3065        4         0.2868
SAMBA                 3.50              -         0.0000        4         0.1860        3         0.3298
CLICK                 4.00              4         0.0339        4         0.1453        4         0.2817
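Two pieces of arithmetic behind a table like this can be sketched briefly: the Jaccard index between a cluster and a reference gene set, and the average quartile over the size classes in which a method produced clusters. How the BAT5 score itself is aggregated is not given in the slides, so it is not reproduced; the gene sets are illustrative.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def average_quartile(quartiles):
    """Mean quartile over size classes, skipping classes with no entry."""
    present = [q for q in quartiles if q is not None]
    return round(sum(present) / len(present), 2)

# Hypothetical cluster and reference annotation set.
cluster = {"g1", "g2", "g3"}
reference = {"g2", "g3", "g4"}
print(jaccard(cluster, reference))

# Quartiles (small, medium, large) as listed in the table.
print(average_quartile([2, 1, 1]))       # Ward (H)
print(average_quartile([1, 3, 1]))       # CAST
print(average_quartile([None, 4, 3]))    # SAMBA, no quartile for small clusters
```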
Algorithm Pipelining, Clustering Suites, Teragrid
Relationship to QTLs
[Figure label: Seven Quantitative Trait Loci]
There’s a high probability that somewhere in here is a polymorphism controlling this trait.
Transcript abundance can be the phenotype!
Concentrated Parental Alleles
Two Paracliques
Data Integration
Phenotypic Data (e.g., diseased versus healthy patients)
Proteomic Data (e.g., amino acid peaks from mass spec)
Transcriptomic Data (e.g., gene expression from µarrays)
Genotypic Data: SNPs
• DNA sequence variations, each occurring when a single nucleotide in the genome differs between members of a species
• highly conserved throughout evolution and within population
• almost always just two alleles
• detected with SNP arrays designed to detect polymorphisms
[Figure: Data Integration. Labels: Natural Allelic Perturbations (SNPs), mRNA Co-expression Network, Proteins, Protein Peak Factors, Protein-Gene Relationships, Multi-Locus Genetic Regulatory Network Models, Putative Biomarkers, Diseased, Healthy]
Application, Allergy
Data Description
• Mikael Benson, Göteborg, Sweden: 56 patients and 39 controls
• Affymetrix HU133 arrays
• roughly 33,000 genes
• hay fever, eczema
• nasal secretions, lymphocytes, skin
Preprocessing
• MAS5.0
• log transformed
• centered around zero with z scores
• probesets with consistently low expression levels removed
• replicates averaged
Threshold Selection
• chosen to balance graph densities
• AFFX spots retained for quality control
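A minimal sketch of the generic preprocessing steps listed above (not MAS5.0 itself). The log base, the expression floor and the reading of "consistently low" as "below the floor in every sample" are all assumptions, and the intensities are made up.

```python
import math
import statistics

def average_replicates(replicates):
    """Average replicate arrays sample by sample."""
    return [statistics.mean(vals) for vals in zip(*replicates)]

def keep_probeset(values, floor):
    """Keep a probeset unless every sample falls below the floor (assumed rule)."""
    return any(v >= floor for v in values)

def log_zscore(values):
    """Log-transform (base 2 assumed), then center and scale to z scores."""
    logged = [math.log2(v) for v in values]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)
    return [(x - mean) / sd for x in logged]

# Hypothetical intensities for one probeset: two replicates, four samples.
rep1 = [60, 130, 250, 500]
rep2 = [68, 126, 262, 524]
averaged = average_replicates([rep1, rep2])
if keep_probeset(averaged, floor=50):
    z = log_zscore(averaged)
    print([round(v, 3) for v in z])
```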
[Figure: Correlation Coefficient Distribution. Frequency (0 to 2,500,000) versus correlation value, plotted for Patient and Control]
Graph Properties
Control
Threshold  Vertices  Edges   Maximal Cliques  Maximum Size
0.88       8009      256346  240146378        84
0.89       7169      178144  15067064         79
0.90       6254      118900  1579041          71
0.91       5317      75541   243232           66
0.92       4415      45471   51315            59
Patient
Threshold  Vertices  Edges   Maximal Cliques  Maximum Size
0.88       5809      91152   2298595          61
0.89       4999      62271   447176           52
0.90       4146      40933   114030           45
0.91       3405      26031   41605            35
0.92       2628      11322   11322            28
[Callouts: ribosomal or RNA-related; T-lymphocytes or epithelial cells]
Clique profiles using the five most highly represented genes:
Control                            Patient
Gene Symbol    Clique membership   Gene Symbol  Clique membership
UBE1C          29%                 FGFR2        66%
RANBP6         27%                 NFIB         65%
DKFZP564O123   26%                 PPL          64%
SLC25A13       24%                 FGFR3        64%
GTPBP4         21%                 CDH3         56%
Of course gene representation is only a small part of the story.
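Clique membership of this kind can be computed as the fraction of maximal cliques that contain each gene. The sketch assumes that definition; the cliques and gene labels are toy data, not the allergy results.

```python
from collections import Counter

def clique_membership(cliques):
    """Fraction of cliques that contain each gene."""
    counts = Counter(gene for clique in cliques for gene in set(clique))
    total = len(cliques)
    return {gene: counts[gene] / total for gene in counts}

def top_genes(cliques, n):
    """The n most highly represented genes, with their membership fractions."""
    membership = clique_membership(cliques)
    ranked = sorted(membership, key=lambda g: (-membership[g], g))
    return [(g, membership[g]) for g in ranked[:n]]

# Toy maximal cliques over hypothetical genes.
cliques = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneD"},
    {"geneC", "geneD"},
]
for gene, frac in top_genes(cliques, 2):
    print(f"{gene}: {frac:.0%}")
```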
We can use traditional algorithmic tools
• extract cores, cliques and other dense subgraphs
• check for scale-freeness, putative TFs, hubs, etc
We can use commercial and other tools
• sort subgraphs by ontological enrichment, CREs, etc
• compare to literature, databases, etc
• match genes and gene products with known interactions
It’s tempting to scan for your favorites...
But our goal is to identify altered interactions
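As one example of the traditional tools above, a k-core can be extracted by repeatedly peeling vertices of degree below k, and hubs can be read off by degree. This is the standard peeling procedure, sketched on a made-up graph.

```python
def k_core(adj, k):
    """Return the k-core: peel vertices of degree < k until none remain."""
    core = {v: set(nbrs) for v, nbrs in adj.items()}
    low = [v for v in core if len(core[v]) < k]
    while low:
        v = low.pop()
        if v not in core:
            continue  # already peeled via an earlier entry
        for u in core.pop(v):
            core[u].discard(v)
            if len(core[u]) < k:
                low.append(u)
    return core

def hubs(adj, n):
    """The n highest-degree vertices (ties broken by name)."""
    return sorted(adj, key=lambda v: (-len(adj[v]), v))[:n]

# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(k_core(adj, 2)), hubs(adj, 1))
```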
Differential Analysis
Gene (vertex) comparisons:
• differential expression
• does not require multiple conditions
• compare the two lists of gene expression levels
Correlate (edge) comparisons:
• differential correlation
• requires multiple conditions in control, in dose
• compare the two lists of gene-gene correlations
Putative network (clique) comparisons:
• differential topology
• focus on network aka clique differences
• consider the clique intersection graph
Ongoing Work
• 62 genes pass all three screens, 6 match a known pathway
• ITK (IL2-inducible T-cell kinase), studying in depth...moving to Illumina
For Impact
• concentrate on real data, and working with bench biologists
• strategic publications (e.g., Nature Genetics, PLoS Comp Bio, etc)
Application, Model Organisms
LD (linkage disequilibrium): a measure of statistical dependence between genetic markers
• non-random association of alleles at two or more loci
• the occurrence in a population of two linked alleles at a frequency higher or lower than expected on the basis of the individual frequencies
• not necessarily on the same chromosome
Reflects biologically meaningful association of loci
Generally a result of population history
• population genealogy
• recombination frequency
• co-adaptive allele selection
• natural selection
• other factors
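The "higher or lower than expected" idea can be made concrete with the standard LD coefficient D = p_AB - p_A * p_B and its normalized form r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The sketch treats each inbred strain as contributing one haplotype per locus, which is an assumption, and the genotype vectors are made up.

```python
def linkage_disequilibrium(locus1, locus2, allele1, allele2):
    """Return (D, r^2) for allele1 at locus1 and allele2 at locus2.

    locus1 and locus2 hold one allele per strain, in the same strain order.
    """
    n = len(locus1)
    p_a = sum(x == allele1 for x in locus1) / n
    p_b = sum(y == allele2 for y in locus2) / n
    p_ab = sum(x == allele1 and y == allele2
               for x, y in zip(locus1, locus2)) / n
    d = p_ab - p_a * p_b                      # observed minus expected
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0  # undefined for monomorphic loci
    return d, r2

# Hypothetical alleles at two SNPs across eight strains.
snp1 = list("TTTTCCCC")
snp2 = list("AAAGGGGG")
d, r2 = linkage_disequilibrium(snp1, snp2, "T", "A")
print(d, r2)
```

Nothing in the calculation refers to chromosomal position, which is why LD can be measured between loci on different chromosomes.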
Evaluation of Mus musculus breeding strategies
Standard Inbred (SI)
Recombinant Inbred (RI): BXD, LXS, etc
The Collaborative Cross
Solution: Use SNPs, correlation, paraclique and proximity
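For the correlation step on SNP data, mutual information between two genotype vectors is one natural measure. A minimal sketch, assuming discrete allele calls per strain and measuring in bits; whether the talk normalized the score is not stated, and the vectors below are invented.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

# Hypothetical allele calls across four strains.
snp_x = ["A", "A", "G", "G"]
snp_same = ["C", "C", "T", "T"]    # perfectly associated with snp_x
snp_indep = ["C", "T", "C", "T"]   # independent of snp_x
print(mutual_information(snp_x, snp_same))
print(mutual_information(snp_x, snp_indep))
```

Pairs scoring above a chosen cutoff would become edges of the SNP graph on which paracliques are then sought.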
[Figure: Number of LD Networks and Number of Non-Syntenic LD Networks. Number of paracliques containing more than 3 SNPs, and number of those that cross multiple chromosomes, versus mutual information (0.9 down to 0.1), for 67 SI and 89 BXD strains]
[Figure: Chromosome Coverage. Number of paracliques (size > 3) versus mutual information (0.9 down to 0.1), broken down by number of chromosomes spanned: 1 to 17 chromosomes for the 67 Standard Inbred strains, 1 to 4 chromosomes for the 89 BXD Recombinant Inbred strains]
[Figure: Example of Contrasting Paraclique Profiles. Standard Inbred panel with SNPs labeled on Chr 1, Chr 4, Chr 7 and Chr 11; Recombinant Inbred panel with SNPs labeled on Chr 7]
Collaborators
Computer Science, Mathematics, Molecular Biology, Statistics
Research Scientists (Incomplete!): Mikael Benson, Elissa Chesler, Frank Dehne, Vlad Estivill-Castro, Mike Fellows, Ivan Gerling, Frans Henskens, Reza Mobini, Pablo Moscato, Mark Ragan, Fran Rosamond, Arnold Saxton, Brynn Voy, Rob Williams
Current/Recent Students: Bhavesh Borate, Patricia Carey, John Eblen, Jeremy Jay, Denise Koessler, Zuopan Li, Zahra Mahoor, Sudhir Naswa, Clinton Nolan, Andy Perkins, Charles Phillips, Gary Rogers, Peter Shaw, Dinesh Weerapurage, Yun Zhang
Come Join Us!
Biology, Computer Science, Mathematics, Statistics
The University of Tennessee and Oak Ridge National Laboratory
• Analysis of Algorithms
• Applications
  o Alcoholism, Allergy, Cancer, Diabetes, Neuroscience, Radiology...
• Complex Cross (210 RI Mouse Lines)
• High Performance Computing
• Metabolomics, Proteomics, Transcriptomics, ...
• Ontological Discovery
• RNAi and µRNA Interference
• Threshold Selection
• Time Series Analysis
• Work Alongside Both Computer and Domain Scientists!
The Langston Lab (Geeks Я Us)