L8: Part 1 Hierarchical trees Representing time
Kirill Bessonov
Nov 10th 2015
Talk Plan
• Trees
– Similarity assessment via trees
– Phylogenetic trees vocabulary and types
• Practical on phylogenetic trees and sequence alignment
– Identifying source viral sequences
• Networks
– examples
– main definitions
– biological examples
• Practical on WGCNA package
– main protocol steps
– interpretation of network modules
– WGCNA demo
Decision Trees (DTs)
• A data structure type used in computer science
• A data model
– Purpose 1: recursively partition data
• cut the data space with hyperplanes (w) perpendicular to the feature axes
– Purpose 2: classify data
• DTs with class label at the leaf node
• E.g. a decision tree that estimates whether or not a potential customer will respond to a direct mailing
– predicted binary class: YES or NO
Source: DECISION TREES by Lior Rokach
DT growth and splitting
• In the top-down approach
– assign all data to the root node
• Select attribute(s)/feature(s) to split the node
• Splitting based on
– 1 feature: univariate split
– ≥2 features: multivariate split
• Stop tree growth when
– the maximum depth is reached
– the splitting criterion is not met
[Figure: example tree that splits on the selected feature(s) (X>x vs X<x, then Y>y vs Y<y) and ends in leaf/terminal nodes]
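A minimal sketch of a univariate split, assuming Gini impurity as the splitting criterion (the slides do not name one). The helper names gini() and best_split() and the toy data are hypothetical. The code tries every midpoint between sorted feature values and keeps the threshold with the lowest weighted impurity.

```r
# Gini impurity of a vector of class labels (0 = pure node)
gini <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Best univariate split of one numeric feature:
# candidate thresholds are the midpoints between sorted unique values
best_split <- function(x, labels) {
  values <- sort(unique(x))
  if (length(values) < 2) return(NULL)
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  scores <- sapply(thresholds, function(t) {
    left <- labels[x < t]
    right <- labels[x >= t]
    # impurity of the two children, weighted by their sizes
    (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
  })
  list(threshold = thresholds[which.min(scores)], impurity = min(scores))
}

# Toy data (hypothetical): one feature, binary class YES/NO
x <- c(1, 2, 3, 10, 11, 12)
labels <- c("NO", "NO", "NO", "YES", "YES", "YES")
split <- best_split(x, labels)
print(split)
```

Growing a full tree would apply best_split() recursively to the two children until a stopping rule such as maximum depth is met.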
Hierarchical Trees
• Trees can also be used for
– Clustering
– Hierarchy determination
• E.g. phylogenetic trees
• Convenient visualization
– effective visual condensation of the clustering results
• Gene Ontology
– Directed acyclic graph (DAG)
– Example of functional hierarchy
GO tree example
Phylogenetic trees
• Show evolutionary relationships
• Taxon (plural: taxa)
– Group of organisms
• Clade
– A group of organisms having a common ancestor
• Common ancestor
– an ancestor that given organisms have in common
[Figure: phylogenetic tree with a clade highlighted]
Building a phylo tree using ape
• ape – Analyses of Phylogenetics and Evolution (R package)
– Functions to create and manipulate phylo trees
– Graphical exploration of phylogenetic data
• To build a phylogenetic tree
1. Download protein sequences from DB
2. Align sequences
3. Calculate pairwise distances (the code on the following slides uses dist.alignment() from seqinr)
4. Build and visualize the phylogenetic tree using ape
Building an unrooted phylogenetic tree (1)
#install required libraries
install.packages("seqinr");
install.packages("ape");
#biocLite() has been retired; current Bioconductor releases install packages via BiocManager
install.packages("BiocManager");
BiocManager::install("muscle");
library("seqinr");
library("muscle");
library("ape");
#aligns the sequences with MUSCLE and returns an alignment object usable by dist.alignment()
multipleSeqAlignment <- function (seqnames, seqs){
  #collapse each sequence (vector of residues) into a single string
  seq_strings = sapply(seqs, paste, collapse="");
  fasta_seqs_Object = AAStringSet(seq_strings);
  names(fasta_seqs_Object) = seqnames;
  #multiple sequence alignment
  alignment = muscle::muscle(fasta_seqs_Object); #muscle format
  alignment_ape = ape::as.alignment(as.matrix(alignment));
  return (alignment_ape)
}
Building an unrooted phylogenetic tree (2)
#main part of the code
choosebank("swissprot") #selects database for query
seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1");
seqs=list();
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""));
  seqs[i]=getSequence(query);
}
#multipleSeqAlignment() is defined on previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);
#nj() performs the neighbor-joining tree estimation by Saitou and Nei
mytree <- nj(mydist);
#replace the tip labels with descriptive ones
mytree$tip.label=c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein","P06747-\nrabies virus\nphosphoprotein","P0C569-\nMokola virus\nphosphoprotein","O56773-\nLagos bat virus\nphosphoprotein");
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=0.8, no.margin=T, srt=50);
Unrooted Phylogenetic Tree
• Phylogenetic tree showing the distances between 4 viral protein sequences
• The genetic distance between O56773 and P0C569 is the smallest
Unrooted phylogenetic tree (1)
• The lengths of the branches
– proportional to the amount of evolutionary change
• estimated by number of mutations
• This is an unrooted phylogenetic tree
– it does not contain an outgroup sequence
• an outgroup is a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other
• it is used to place the root, i.e. the common ancestor of all taxa
Unrooted phylogenetic tree (2)
• We cannot tell which direction evolutionary time ran in along the internal branches of the tree
• We cannot tell whether
– the node representing the common ancestor of (O56773, P0C569) was an ancestor of the node representing the common ancestor of (Q5VKP1, P06747),
– or the other way around
Distance matrix
• Inspecting the calculated distance matrix between the aligned sequences confirms the results seen in the phylogenetic tree
• The closest pair is the O56773 and P0C569 proteins (distance 0.41)
        Q5VKP1  P06747  P0C569
P06747  0.49
P0C569  0.48    0.45
O56773  0.50    0.46    0.41
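The closest pair can also be read off programmatically. The sketch below re-enters the distances from the matrix above by hand (it does not query the database) and uses base R only; the variable names are illustrative.

```r
# Lower-triangle distances copied from the slide, filled column by column
ids <- c("Q5VKP1", "P06747", "P0C569", "O56773")
d <- matrix(0, 4, 4, dimnames = list(ids, ids))
d[lower.tri(d)] <- c(0.49, 0.48, 0.50, 0.45, 0.46, 0.41)
d <- d + t(d)                     # make the matrix symmetric

# Ignore the zero diagonal, then locate the smallest distance
masked <- d
diag(masked) <- Inf
idx <- which(masked == min(masked), arr.ind = TRUE)[1, ]
closest_pair <- sort(ids[idx])
closest_dist <- min(masked)
print(closest_pair)
print(closest_dist)
```

A matrix entered this way can be converted with as.dist(d) if it is to be passed on to a tree-building function.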
Rooted phylogenetic tree
• To convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
• Outgroup
– a taxon outside the group of interest
– will branch off at the base of the phylogeny
– represented by
• Caenorhabditis elegans (UniProt accession Q10572) and
• Caenorhabditis remanei (UniProt accession E3M2K8)
• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup
– this protein is from a different taxon/group (worms)
Building a rooted phylogenetic tree (1)
#BUILDING ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
#Q9NWB1 - Human
#Q17QD3 - Cow
#Q95KI0 - Monkey
#A1A5R1 - Rat
#Q10572 - Worm C.elegans(Root)
#E1G4K8 - Eye worm
seqnames <- c("Q9NWB1","Q17QD3","Q95KI0","A1A5R1","Q10572","E1G4K8")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape)
Building a rooted phylogenetic tree (2)
library("ape")
mytree <- nj(mydist)
mytree$tip.label=c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)",
"A1A5R1-Rat", "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")
#root the tree on the outgroup; resolve.root=TRUE resolves the new root as a bifurcating node
myrootedtree <- root(mytree, outgroup="Q10572-C.elegans(Root)",
resolve.root=TRUE)
#Phylogenetic tree with 6 tips and 5 internal nodes.
#Tip labels:
#[1] "E1G4K8" "Q8WS01" "Q9VT99" "A8NSK3" "Q10572" "E3M2K8"
#Rooted; includes branch lengths.
plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3 ,
type="p")
Rooted tree of FOX1 proteins
• The invertebrates are grouped together
• Worms form a distinct group, yet with a large genetic distance
• Human FOX1 is closest to the monkey and cow sequences
[Figure: rooted tree with the worm sequences marked as the outgroup]
Distance matrix
        E1G4K8  Q10572  A1A5R1  Q9NWB1  Q17QD3
Q10572  0.72
A1A5R1  0.75    0.63
Q9NWB1  0.72    0.62    0.44
Q17QD3  0.73    0.62    0.50    0.28
Q95KI0  0.73    0.61    0.49    0.28    0.14
• As expected, eye worms are the most distantly related species to vertebrates
• Cow and monkey have the closest relationship and the lowest genetic distance (0.14)
Table legend:
Q9NWB1 – Human
Q95KI0 – Monkey
Q10572 – Worm C.elegans (Root)
Q17QD3 – Cow
A1A5R1 – Rat
E1G4K8 – Eye worm
Rooted tree
• Time runs from left to right
• Monkey, Cow and Human have common ancestor 3
• Ancestor 1 is common to ancestors 2 and 3
[Figure: rooted tree with a TIME axis]
Exercises on phylogenetic tree building
• Q1. Calculate the genetic distances between the following NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein (UniProt: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt: Q6TFL5). Which viruses are the most closely related, and which are the least closely related, based on the genetic distances? Note: Dengue virus causes Dengue fever, which is classified by the WHO as a neglected tropical disease. There are four main types of Dengue virus: Dengue virus 1, Dengue virus 2, Dengue virus 3, and Dengue virus 4.
• Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1, Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?
• Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and so therefore can be used as an outgroup in phylogenetic trees of Dengue virus sequences. UniProt accession Q32ZE1 consists of a sequence with similarity to the Dengue NS1 protein, so seems to be a related protein from Zika virus. Build a rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using the Zika virus protein as the outgroup. Which are the most closely related Dengue virus proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?
Exercises on phylogenetic tree building
Answers
Question 1:
Summary of viral proteins and UniProt accession numbers:
Q9YRR4 – Dengue virus 1 NS1 protein
Q9YP96 – Dengue virus 2 NS1 protein
B0LSS3 – Dengue virus 3 NS1 protein
Q6TFL5 – Dengue virus 4 NS1 protein
seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);
mydist
Answers
• Q1. The distance matrix is as follows
        Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4  0.306
Q9YP96  0.333   0.254
B0LSS3  0.297   0.230   0.227
The most distant are Q9YP96 (V2) and Q6TFL5 (V4) with a genetic distance of 0.333, while the most closely related are Q9YP96 (V2) and B0LSS3 (V3) with a genetic distance of 0.227
Answers
Question 2:
library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on alignment of whole protein sequences
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
no.margin=T, srt=0)
Question 2 (continued):
#trim the sequences to the well aligned region to avoid gaps
#(from the first "DMGY" motif to the start of the first "GEDG" motif)
seqs_trim=seqs
for(i in 1:length(seqs)){
  seq_string=paste(seqs_trim[[i]],collapse="")
  start_pos=regexpr("DMGY", seq_string)[1]
  stop_pos=regexpr("GEDG", seq_string)[1]
  seqs_trim[[i]]=seqs_trim[[i]][start_pos:stop_pos]
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim);
mydist <- dist.alignment(alignment_ape);mydist
mytree <- nj(mydist)
#tree based on the best aligned portion
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
no.margin=T, srt=0)
Answers
• The resulting Q2 unrooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins.
• To improve the quality of the tree, it is best to select a region that has a minimal number of gaps between the protein sequences. For how gap cleaning affects phylogenetic tree performance, please see reference [2].
• Below you can see that there are regions with lots of gaps. Let’s build another tree based on the most conserved region (shown in bold on the slide) to see if it is the same
Q6TFL5 DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA
Q9YRR4 ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI-----------------
Q9YP96 DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA
B0LSS3 --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------
Alignment of proteins: Built using the full lengths of proteins
Answers
• The resulting tree looks the same, but we achieved overall better resolution between the proteins
[Figures: trees built from the whole protein sequences and from the best aligned portion of the protein sequences]
Best aligned portion of protein sequences used (built using the bolded region)
        Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4  0.317
Q9YP96  0.317   0.264
B0LSS3  0.292   0.233   0.216
Whole protein sequences used
        Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4  0.306
Q9YP96  0.332   0.254
B0LSS3  0.297   0.230   0.227
Answers
Question 3:
#Q3 building rooted tree based on Q89277 (yellow fever virus) as out group
library("seqinr")
library("muscle")
library("ape")
seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5", "Q89277")
choosebank("swissprot") #selects database for query
seqs=list()
for(i in 1:length(seqnames)){
query <- query(paste("AC=",seqnames[i],sep=""))
seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);mydist
library("ape")
mytree <- nj(mydist)
myrootedtree <- root(mytree, outgroup="Q89277", resolve.root=TRUE)
plot.phylo(myrootedtree ,type="p", edge.color = "blue", edge.width = 3,
cex=1.2, no.margin=T, srt=0)
Answers
• Q3 asks to build a rooted tree using the yellow fever virus (Q89277) as the outgroup
• Most closely related viruses:
– B0LSS3 and Q9YP96
• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. An unrooted tree does not provide ancestry information (i.e. time sequence)
        Q89277  Q6TFL5  Q9YRR4  Q9YP96
Q6TFL5  0.523
Q9YRR4  0.511   0.306
Q9YP96  0.486   0.333   0.254
B0LSS3  0.487   0.297   0.230   0.227
[Figure: rooted tree with Q89277 marked as the outgroup]
References
1. ape library for phylogenetic trees and ancestry with bootstrap methods. http://cran.r-project.org/web/packages/ape/ape.pdf
2. Gerard Talavera and Jose Castresana (2007). Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Systematic Biology 56(4): 564-577.
L8: Part 2 Networks of Biological interactions
Kirill Bessonov
Nov 10th 2015
We are surrounded by networks
Transportation Networks
Computer Networks
Social networks
Internet submarine cable map
From describing to engineering
• In 1950
– Alex Bavelas founds the Networks Laboratory Group at M.I.T. to study effectiveness of different communication patterns
Social interaction patterns
PPI (Protein Interaction Networks)
• Nodes – protein names
• Links – physical binding event
Network Definitions
Network components
• Networks also called graphs
– Graph (G) contains
• Nodes (N): genes, SNPs, cities, PCs, etc.
• Edges (E): links connecting two nodes
Some characteristics
• Networks are
– Complex
– Dynamic
– Can be used to reduce data dimensionality
[Figure: the same network at time = t0 and at time = t]
Topology
• Refers to connection pattern
– The pattern of links
Small-world networks
• Six degrees of separation
– everyone is 6 or fewer steps away from each other
• Reference: Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442.
Scale-free networks
• Biological processes are characterized by this topology
– Few hubs (highly connected nodes)
– Predominance of poorly connected nodes
– New vertices attach preferentially to highly connected ones
• Reference: Barabási, Albert-László, and Réka Albert. "Emergence of scaling in random networks." Science 286.5439 (1999): 509-512.
Modules
• Sub-networks with
– Specific topology
– Function
• Biological context
– Protein complex
– Common function
• E.g. energy production
[Figure: example of a module, a clique]
Edge Types
[Figure: a graph with N nodes and E edges, drawn as a directed and as an undirected graph]
Network types
• Directed
– Edges have directionality
– Some links are unidirectional
– Direction matters
• Going A → B is not the same as B → A
– Analogous to chemical reactions
• Forward rate might not be the same as reverse
– E.g. directed gene regulatory networks (TF → gene)
• Undirected
– Edges have no directionality
– Simpler to describe and work with
– E.g. co-expression networks
Neighbours of node(s)
• Neighbours(node, order) = {node1 … nodep}
• Neighbours(3,1) = {2,4}
• Neighbours(2,2) = {1,3,5,4}
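A minimal sketch of the Neighbours(node, order) lookup in base R. The 6-node edge list is an assumption inferred from the examples on these slides (neighbours, degrees, walks and paths), since the figure itself is not part of the transcript. The function name neighbours() is illustrative.

```r
# Edge list inferred from the slide examples (assumption, figure not available)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
n <- 6
A <- matrix(0, n, n)
A[edges] <- 1            # undirected graph: fill both directions
A[edges[, 2:1]] <- 1

# Nodes reachable from `node` in at most `order` steps, excluding the node itself
neighbours <- function(A, node, order) {
  reached <- node
  frontier <- node
  for (step in seq_len(order)) {
    nxt <- which(colSums(A[frontier, , drop = FALSE]) > 0)
    frontier <- setdiff(nxt, reached)
    reached <- union(reached, frontier)
  }
  sort(setdiff(reached, node))
}

print(neighbours(A, 3, 1))
print(neighbours(A, 2, 2))
```

With this assumed graph the two calls reproduce the sets listed on the slide.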
Reachability of two nodes i and j
• Walk
– Sequence of visited nodes on a path from node i to j
– e.g. nodes(1,2) = {5,2,1,2,3,4,5,2}
• Trail
– a walk with no repeated edges
– e.g. nodes(1,4)={5,4}
• Path
– a walk with no repeated nodes
– e.g. nodes(1,6)={5,4,6}
[Figure: example graph with the visited nodes highlighted]
Connectivity
• Line (edge) connectivity (λ)
– Minimum number of lines (edges) that need to be removed to disconnect graph G
• i.e. no other links would be able to connect a node
• Node connectivity (κ)
– Minimum number of nodes that need to be removed to disconnect graph G
[Figure: two example graphs, s with λs = 3 and κs = 2, and t with λt = 3 and κt = 2]
Connectivity matrix (also known as adjacency matrix)
• A = [aij], a square matrix of size N × N for a graph with N nodes
• Entries are binary or weighted
Node degree (k)
• the number of edges connected to the node
• k(6) = 1
• k(4) = 3
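A short base R sketch of the adjacency matrix and node degree. The edge list is an assumption inferred from the examples on these slides (the figure is not in the transcript). For a binary, undirected adjacency matrix the degree of a node is its row sum.

```r
# Edge list inferred from the slide examples (assumption, figure not available)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
n <- 6

# Binary, symmetric adjacency (connectivity) matrix of size N x N
A <- matrix(0, n, n, dimnames = list(1:n, 1:n))
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Node degree k = number of edges attached to each node
k <- rowSums(A)
print(k)

# Empirical degree distribution P(k): share of nodes with each degree
P_k <- table(k) / n
print(P_k)
```

For a weighted adjacency matrix the same row sum gives the weighted connectivity used later in the WGCNA part.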
Degree distribution (P(k))
• Determines the statistical properties of uncorrelated networks
source: http://www.network-science.org/powerlaw_scalefree_node_degree_distribution.html
Topologies: scale-free
Most real networks have a degree distribution that follows a power law
Examples of other quantities that follow a power law:
• the sizes of earthquakes
• craters on the moon
• solar flares
• the sizes of activity patterns of neuronal populations
• the frequencies of words in most languages
• frequencies of family names
• sizes of power outages
• criminal charges per convict
• and many more
Topology: random
Degree distribution of nodes is statistically independent
Shortest path (p)
• Indicates the distance between i and j in terms of geodesics (unweighted)
• p(1,3): paths from node 1 to node 3
– {1-5-4-3}
– {1-5-2-3}
– {1-2-5-4-3}
– {1-2-3}, the shortest one (2 edges)
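A minimal sketch of an unweighted shortest path search using breadth-first search (BFS), which is one standard way to find geodesics and not necessarily what the lecture used. The edge list is an assumption inferred from the slide examples, and shortest_path() is an illustrative helper name.

```r
# Edge list inferred from the slide examples (assumption, figure not available)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
n <- 6
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Breadth-first search: returns one shortest path from `from` to `to`,
# or NULL when the two nodes are not connected
shortest_path <- function(A, from, to) {
  parent <- rep(NA, nrow(A))
  visited <- rep(FALSE, nrow(A))
  visited[from] <- TRUE
  queue <- from
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (current == to) break
    for (nb in which(A[current, ] > 0)) {
      if (!visited[nb]) {
        visited[nb] <- TRUE
        parent[nb] <- current
        queue <- c(queue, nb)
      }
    }
  }
  if (!visited[to]) return(NULL)
  # walk back from the target to the source through the parent pointers
  path <- to
  while (path[1] != from) path <- c(parent[path[1]], path)
  path
}

p13 <- shortest_path(A, 1, 3)
print(p13)
print(length(p13) - 1)   # path length in edges
```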
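Betweenness centrality
• b(i) = \sum_{j \neq i \neq k} \frac{b_{jk}(i)}{b_{jk}}
– b_{jk}(i): number of shortest paths (SPs) from j to k via i
– b_{jk}: number of SPs from j to k
• For each pair of nodes j and k, the ratio between the number of shortest paths that pass through node i and the number of all shortest paths between j and k, summed over all pairs in the graph G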
Facebook academic network
[Figure: Facebook academic network; blue = low betweenness, red = high betweenness]
Betweenness centrality
• Reflects the amount of control over the interactions of other nodes in the network
• b(c) = \frac{b_{ab}(c)}{b_{ab}} + \frac{b_{ae}(c)}{b_{ae}} + \frac{b_{ad}(c)}{b_{ad}} + \frac{b_{be}(c)}{b_{be}} + \frac{b_{bd}(c)}{b_{bd}} + \frac{b_{de}(c)}{b_{de}} = \frac{0}{1} + \frac{1}{2} + \frac{0}{1} + \frac{1}{2} + \frac{0}{1} + \frac{0}{1}
• b(c) = 1
Possible node combinations: {AB, AD, AE, AC, BD, BE, BC, CD, CE, DE}
Betweenness centrality standardized
• For standardization
– the denominator is (n-1)(n-2)/2, which is 15 for n = 7 nodes
– this is the maximum possible number of pairs (edges) among the other n-1 nodes
Node | b | b standardized
1 | 0 | 0
2 | 0 | 0
3 | 9 | 9/15
4 | 9 | 9/15
5 | 9 | 9/15
6 | 0 | 0
7 | 0 | 0
Possible node pairs (21):
12 23 34 45 56 67
13 24 35 46 57
14 25 36 47
15 26 37
16 27
17
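A sketch of the betweenness calculation in base R. The slide's 7-node figure is not in the transcript, so the edge list below is a hypothetical graph (two leaves on node 3, a chain 3-4-5, two leaves on node 5) chosen because it gives zero betweenness for nodes 1, 2, 6, 7 and equal values for nodes 3, 4, 5. Shortest paths are counted with powers of the adjacency matrix: the number of walks of length d between two nodes equals the number of shortest paths when d is their shortest distance.

```r
# Hypothetical 7-node graph (assumption, the slide's figure is not available)
edges <- rbind(c(1, 3), c(2, 3), c(3, 4), c(4, 5), c(5, 6), c(5, 7))
n <- 7
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# dist[s, t]  = length of the shortest path between s and t
# sigma[s, t] = number of shortest paths between s and t
dist <- matrix(Inf, n, n)
diag(dist) <- 0
sigma <- diag(n)
walks <- diag(n)
for (d in 1:(n - 1)) {
  walks <- walks %*% A                  # walks[s, t] = number of walks of length d
  new <- is.infinite(dist) & walks > 0  # pairs reached for the first time
  dist[new] <- d
  sigma[new] <- walks[new]
}

# b(v) = sum over pairs (s, t), both different from v, of the share of
# shortest s-t paths that pass through v
betweenness <- sapply(1:n, function(v) {
  total <- 0
  for (s in 1:(n - 1)) for (t in (s + 1):n) {
    if (s != v && t != v && dist[s, v] + dist[v, t] == dist[s, t]) {
      total <- total + sigma[s, v] * sigma[v, t] / sigma[s, t]
    }
  }
  total
})

standardized <- betweenness / ((n - 1) * (n - 2) / 2)
print(data.frame(node = 1:n, b = betweenness, b_standardized = standardized))
```

The matrix-power approach is only practical for small graphs; it is used here because it keeps the counting explicit.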
Cliques
• A clique of a graph G is a complete subgraph of G
– i.e. maximally interconnected subgraph
• The highlighted clique is the maximal clique of size 4 (nodes)
“The richest people in the world look for and build networks. Everyone else looks for work.”
– Robert Kiyosaki
Biological context
Biological Networks
Biological examples
• Co-expression
– For genes that have similar expression profile
• Directed gene regulatory networks (GRNs)
– Show directionality between gene interactions
• Transcription factor → target gene expression
– Show direction of information flow
– E.g. transcription factor activating target gene
• Protein-Protein Interaction Networks (PPI)
– Show physical interaction between proteins
– Concentrate on binding events
• Others
– Metabolic, differential, Bayesian, etc.
Biological networks
• Three main classes
Type | Name | Nodes | Edges | Resource
molecular interactions | PPI | proteins | physical bonds | BioGRID
molecular interactions | DTI | drugs/targets | physical bonds | PubChem
functional associations | GI | genes | genetic interactions | BioGRID
functional associations | ON | Gene Ontology | functional relations | GO
functional associations | GDA | genes/diseases | associations | OMIM
functional/structural similarities | Co-Ex | genes | expression profile similarity | GEO, ArrayExpress
functional/structural similarities | PStrS | proteins | structural similarities | PDB
Source: Gligorijević, Vladimir, and Nataša Pržulj. "Methods for biological data integration: perspectives and challenges." Journal of The Royal Society Interface 12.112 (2015): 20150571.
Inferring co-expression networks in R
WGCNA package (Weighted Gene Co-expression Network Analysis)
Main features
• Builds correlation networks
• Correlations are
– simple to calculate
– fast on large scale data
• Supports the sign of an association (not its direction)
• Lots of network metrics (e.g. connectivity)
• Easy identification of modules
– Good for reducing dataset dimensionality
1. Construct a network
– Search for genes with similar expression profile
2. Identify modules in predicted network
– Reduce data into gene sets / groups
3. Relate modules to external information and find biologically interesting modules
– E.g.: clinical data, biological function (gene ontology, pathways)
4. Find the key drivers in interesting modules
– Experimental validation, therapeutics, biomarkers
5. Study module preservation across different data
– Check robustness of module definition
Steps for constructing a co-expression network
A) Obtain gene expression data
B) Measure co-expression between genes via a correlation coefficient
C) Build the correlation matrix = network (adjacency matrix)
D) Transform the correlation matrix with the power adjacency function → new adjacency matrix → weighted network
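Steps B to D can be sketched in base R without WGCNA. The power adjacency function is a_ij = |cor(x_i, x_j)|^beta; beta = 6 matches the value used later in the demo. The tiny expression matrix below is simulated and purely illustrative (samples in rows, genes in columns).

```r
set.seed(1)
n_samples <- 20

# Hypothetical expression data: gene1 and gene2 share a common driver,
# gene3 and gene4 are independent noise
driver <- rnorm(n_samples)
expr <- cbind(
  gene1 = driver + rnorm(n_samples, sd = 0.3),
  gene2 = driver + rnorm(n_samples, sd = 0.3),
  gene3 = rnorm(n_samples),
  gene4 = rnorm(n_samples)
)

# B) co-expression via Pearson correlation, C) correlation matrix
cor_matrix <- cor(expr)

# D) power adjacency function (soft thresholding): a_ij = |cor_ij|^beta
beta <- 6
adjacency <- abs(cor_matrix)^beta

# Weighted connectivity of each gene (row sum without the self-connection)
connectivity <- rowSums(adjacency) - 1
print(round(adjacency, 3))
print(round(connectivity, 3))
```

Raising the absolute correlation to a power keeps strong correlations close to 1 and pushes weak ones toward 0, which is what makes the resulting network weighted and not binary.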
Network=Adjacency Matrix
• Adjacency matrix, A=[aij], encodes how a pair of nodes is connected (if at all)
– Weighted networks = aij is edge value (weight)
– Unweighted networks = aij presence or absence of edge
Scale Free Network Topology
• Scale free topology means
– presence of hub nodes highly connected to other nodes
– metabolic networks exhibit scale free topology at least approximately
– Node connectivity (k), degree, follows power law
– p(k)=proportion of nodes that have connectivity k
[Figure: frequency distribution of connectivity; x axis: connectivity k (0 to 700), y axis: frequency (0.000 to 0.035)]
How to check Scale Free Topology?
• Only a few nodes display high connectivity
• Check whether the obtained network follows scale free topology
– Idea: log-transform p(k) and k and look at the scatter plot
– Answer: the R^2 of a linear fit can be used to quantify goodness of fit
– R^2 > 0.6 means that the network follows scale free topology
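The check can be sketched in base R: under a power law p(k) ~ k^(-gamma), log10 p(k) is linear in log10 k, so the R^2 of a straight-line fit measures how well the network matches scale free topology. The degree counts below are made-up illustrative numbers, not data from the lecture.

```r
# Hypothetical counts of nodes per connectivity value k (illustration only)
k <- 1:8
counts <- c(400, 110, 45, 26, 16, 11, 8, 6)
p_k <- counts / sum(counts)          # p(k) = proportion of nodes with connectivity k

# Linear fit on the log-log scale: log10 p(k) = intercept + slope * log10 k
fit <- lm(log10(p_k) ~ log10(k))
slope <- unname(coef(fit)[2])        # estimate of -gamma
r_squared <- summary(fit)$r.squared  # goodness of fit

print(slope)
print(r_squared)
plot(log10(k), log10(p_k), xlab = "log10(k)", ylab = "log10(p(k))")
abline(fit, col = "red")
```

In WGCNA the functions scaleFreePlot() and pickSoftThreshold(), used in the demo later in this deck, report this kind of fit index.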
Power function transformation
• Idea:
– transform correlation matrix via power function
– Impose scale free topology
– Select the best beta (β)
• Pick the largest beta
• Corresponds to largest R^2
[Figure: R^2 of the scale free topology fit plotted against the power (beta)]
Defining modules
• Based on a hierarchical cluster tree
– Build a tree and cut it
– Dynamic tree cutting at optimal height [1]
• Module = branch of a cluster tree
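A minimal base R sketch of module detection by cutting a hierarchical cluster tree. It uses cutree() with a fixed number of clusters as a simple stand-in for the static and dynamic tree cutting functions shown in the WGCNA demo. The two simulated modules and all variable names are hypothetical.

```r
set.seed(1)
n_samples <- 50

# Two nearly uncorrelated "module eigengene" patterns (deterministic, illustrative)
e1 <- rep(c(1, -1), length.out = n_samples)
e2 <- rep(c(1, 1, -1, -1), length.out = n_samples)

# Each module = 5 genes that follow one pattern plus a little noise
make_module <- function(eigengene, n_genes) {
  sapply(seq_len(n_genes), function(i) eigengene + rnorm(length(eigengene), sd = 0.1))
}
expr <- cbind(make_module(e1, 5), make_module(e2, 5))
colnames(expr) <- paste0("Gene", 1:10)

# Weighted adjacency, turned into a dissimilarity for clustering
adjacency <- abs(cor(expr))^6
tree <- hclust(as.dist(1 - adjacency), method = "average")

# Cut the tree into 2 branches; each branch is one module
modules <- cutree(tree, k = 2)
print(modules)
plot(tree, main = "Gene clustering tree (simulated data)")
```

On real data the number of modules is not known in advance, which is why a height-based or dynamic cut is used in practice.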
Analysis of modules
• Perform gene ontology analysis on genes from each module (e.g. yellow = “genes 1”)
• Link modules to clinical data (e.g. weight) – Via module eigengene e.g. cor(trait, eigengene)
[Figure: modules shown as gene sets (genes 1, genes 2, genes 3, genes 4)]
Heatmap view of module
[Figure: heatmap with genes as rows (grouped into modules) and tissue samples as columns; one module of co-expressed genes is highlighted]
• Vertical bands indicate tight co-expression of module genes
Modules as eigengenes
• All genes in a module can be summarized by one eigengene (i.e. a virtual gene)
• Eigengenes allow one
– to relate modules to each other
• allows calculating the distance between modules
– to relate modules to clinical traits and SNPs
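A sketch of a module eigengene in base R, taking it as the first principal component of the scaled expression of the module's genes (the definition WGCNA's moduleEigengenes() is based on). The data are simulated; `signal` and `trait` are hypothetical stand-ins for the module's shared expression pattern and a clinical trait.

```r
set.seed(1)
n_samples <- 50

# Hypothetical module: 8 genes driven by one shared signal plus noise
signal <- rnorm(n_samples)
module_expr <- sapply(1:8, function(i) signal + rnorm(n_samples, sd = 0.3))

# Scale each gene, then take the first singular vector (one value per sample)
scaled <- scale(module_expr)
sv <- svd(scaled)
eigengene <- sv$u[, 1]

# The sign of a singular vector is arbitrary: align it with the average expression
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

# Share of the module's variance captured by the eigengene
variance_explained <- sv$d[1]^2 / sum(sv$d^2)
print(variance_explained)

# Relating the module to a (hypothetical) clinical trait via its eigengene
trait <- signal + rnorm(n_samples, sd = 0.5)
print(cor.test(trait, eigengene)$estimate)
```

Because one vector stands in for all genes of the module, modules can be compared with each other or with traits through ordinary correlations.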
[Figure: heatmap of the brown module (rows = genes, columns = microarray samples, numbered 1 to 185) above a bar plot of the brown module eigengene across samples (values from -0.1 to 0.4)]
• Module eigengene = measure of over-expression = average redness
Analysis of modules
• Relate modules to traits
• Interested in modules with correlation > 0.75 (red)
WGCNA Demo
Simulated data - 5 modules
Simulating expression data (1)
Note: install the Hmisc library first, otherwise the WGCNA installation fails
install.packages("Hmisc");
#biocLite() has been retired; current Bioconductor releases install packages via BiocManager
install.packages("BiocManager");
BiocManager::install(c("GO.db", "preprocessCore", "impute"));
install.packages("WGCNA");
#Simulate data
# Load WGCNA package
library(WGCNA)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Here are input parameters of the simulation model
# number of samples or microarrays in the training data
no.obs=50
# now we specify the true measures of eigengene significance
# recall that ESturquoise=cor(y,MEturquoise)
ESturquoise=0; ESbrown= -.6;
ESgreen=.6;ESyellow=0
# Note that we dont specify the eigengene significance of the blue module
# since it is highly correlated with the turquoise module.
ESvector=c(ESturquoise,ESbrown,ESgreen,ESyellow)
# number of genes
nGenes1=3000
# proportion of genes in the turquoise, blue, brown, green, and yellow module #respectively.
simulateProportions1=c(0.2,0.15, 0.08, 0.06, 0.04)
# Note that the proportions dont add up to 1. The remaining genes will be colored grey,
# ie the grey genes are non-module genes.
# set the seed of the random number generator. As a homework exercise change this seed.
set.seed(1)
Simulating expression data (2)
#Step 1: simulate a module eigengene network.
# Training Data Set I
MEgreen=rnorm(no.obs)
scaledy=MEgreen*ESgreen+sqrt(1-ESgreen^2)*rnorm(no.obs)
y=ifelse( scaledy>median(scaledy),2,1)
MEturquoise= ESturquoise*scaledy+sqrt(1-ESturquoise^2)*rnorm(no.obs)
# we simulate a strong dependence between MEblue and MEturquoise
MEblue= 0.6*MEturquoise+ sqrt(1-.6^2) *rnorm(no.obs)
MEbrown= ESbrown*scaledy+sqrt(1-ESbrown^2)*rnorm(no.obs)
MEyellow= ESyellow*scaledy+sqrt(1-ESyellow^2)*rnorm(no.obs)
ModuleEigengeneNetwork1=data.frame(y,MEturquoise,MEblue,MEbrown,MEgreen, MEyellow)
Simulating expression data (3)
dat1=simulateDatExpr5Modules(MEturquoise=ModuleEigengeneNetwork1$MEturquoise,
MEblue=ModuleEigengeneNetwork1$MEblue,
MEbrown=ModuleEigengeneNetwork1$MEbrown,
MEyellow=ModuleEigengeneNetwork1$MEyellow,
MEgreen=ModuleEigengeneNetwork1$MEgreen,
nGenes=nGenes1,
simulateProportions=simulateProportions1)
datExpr = dat1$datExpr;
truemodules = dat1$truemodule;
datME = dat1$datME;
attach(ModuleEigengeneNetwork1)
datExpr=data.frame(datExpr)
ArrayName=paste("Sample",1:dim(datExpr)[[1]], sep="" )
# The following code is useful for outputting the simulated data
GeneName=paste("Gene",1:dim(datExpr)[[2]], sep="" )
dimnames(datExpr)[[1]]=ArrayName
dimnames(datExpr)[[2]]=GeneName
rm(dat1); collectGarbage();
# The following command will save all variables defined in the current session.
save.image("Simulated-dataSimulation.RData");
cat("Note: *.RData file written in ",getwd(), "\n")
Construction of a weighted gene co-expression network (1)
# Load WGCNA package
library(WGCNA)
# Load additional necessary packages
library(cluster)
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
# Load the previously saved data (the file written by save.image() in the simulation step)
load("Simulated-dataSimulation.RData");
attach(ModuleEigengeneNetwork1)
# Scale free topology fit for soft-thresholding powers 1 to 20
sft=pickSoftThreshold(datExpr,powerVector=1:20)
plot(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="SFT, signed R^2", type="o")
abline(h=0.90,col="red")
Construction of a weighted gene co-expression network (2)
# here we define the adjacency matrix using soft thresholding with beta=6
ADJ1=abs(cor(datExpr,use="p"))^6
# When you have relatively few genes (<5000) use the following code
k=as.vector(apply(ADJ1,2,sum, na.rm=T))
# When you have a lot of genes use the following code
#k=softConnectivity(datE=datExpr,power=6)
# Plot a histogram of k and a scale free topology plot
sizeGrWindow(10,5)
par(mfrow=c(1,2))
hist(k)
scaleFreePlot(k, main="Check scale free topology\n")
Definition of co-expression modules (1)
#Many clustering procedures require a dissimilarity matrix as input. We define a dissimilarity based on adjacency
# Turn adjacency into a measure of dissimilarity
dissADJ=1-ADJ1
hierADJ=hclust(as.dist(dissADJ), method="average" )
# Plot the resulting clustering tree together with the true color assignment
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules), dendroLabels = FALSE, hang = 0.03,
main = "Gene hierarchical clustering dendrogram and simulated module colors" )
Definition of co-expression modules (2)
#static tree cutting
colorStaticADJ=as.character(cutreeStaticColor(hierADJ, cutHeight=.99, minSize=20))
# Plot the dendrogram with module colors
sizeGrWindow(10,5);
plotDendroAndColors(hierADJ, colors = data.frame(truemodules, colorStaticADJ),
dendroLabels = FALSE, abHeight = 0.99,
main = "Gene dendrogram and module colors")
#dynamic tree cutting
branch.number=cutreeDynamic(hierADJ,method="tree")
# This function transforms the branch numbers into colors
colorDynamicADJ=labels2colors(branch.number)
sizeGrWindow(10,5)
plotDendroAndColors(dendro = hierADJ,
colors=data.frame(truemodules, colorStaticADJ, colorDynamicADJ),
dendroLabels = FALSE, marAll = c(0.2, 8, 2.7, 0.2),
main = "Gene dendrogram and module colors")
Calculating module eigengenes
#calculate eigengenes for each module
datME=moduleEigengenes(datExpr,colorStaticADJ)$eigengenes
#correlation between modules based on their eigengenes
signif(cor(datME, use="p"), 2)
#dendrogram
dissimME=(1-t(cor(datME, method="p")))/2
hclustdatME=hclust(as.dist(dissimME), method="average" )
# Plot the eigengene dendrogram
par(mfrow=c(1,1))
plot(hclustdatME, main="Clustering tree based of the module eigengenes")
#see expression profiles - diagnostic plots
#show available modules
levels(as.factor(colorStaticADJ))
sizeGrWindow(8,9)
par(mfrow=c(3,1), mar=c(1, 2, 4, 1))
which.module="blue";
plotMat(t(scale(datExpr[,colorStaticADJ==which.module ]) ),nrgcols=30,rlabels=T,
clabels=T,rcols=which.module,
title=which.module )
ME=datME[, paste("ME",which.module, sep="")]
barplot(ME, col=which.module, main="", cex.main=2,
ylab="eigengene expression",xlab="array sample")
Relating modules to trait
#all modules (green and brown modules look interesting)
signif(cor(y,datME, use="p"),2)
#get statistical significance of module association to trait
cor.test(y, datME$MEbrown)
cor.test(y, datME$MEgreen)
References
[1] Langfelder P, Zhang B, Horvath S (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 24(5):719-720
[2] Steve Horvath, Tutorials for the WGCNA package