L8: Part 1 Hierarchical trees Representing time

Kirill Bessonov

Nov 10th 2015


Talk Plan

• Trees
  – Similarity assessment via trees
  – Phylogenetic trees vocabulary and types

• Practical on phylogenetic trees and sequence alignment
  – Identifying source viral sequences

• Networks
  – examples
  – main definitions
  – biological examples

• Practical on WGCNA package
  – main protocol steps
  – interpretation of network modules
  – WGCNA demo


Decision Trees (DTs)

• A data structure type used in CS

• A data model
  – Purpose 1: recursively partition data
    • split the data space with axis-perpendicular hyperplanes (w)
  – Purpose 2: classify data
    • DTs with a class label at each leaf node

• E.g. a decision tree that estimates whether or not a potential customer will respond to a direct mailing
  – predicted binary class: YES or NO

Source: DECISION TREES by Lior Rokach

DT growth and splitting

• In the top-down approach, assign all data to the root node

• Select attribute(s)/feature(s) to split the node

• Splitting is based on
  – 1 feature: univariate split
  – ≥2 features: multivariate split

• Stop tree growth when
  – the maximum depth is reached
  – the splitting criterion is not met
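The univariate split step above can be sketched in base R. This is a minimal illustration under stated assumptions, not the method used on the slide: the helper names `gini` and `best_split`, the choice of Gini impurity as the splitting criterion, and the toy mailing data are all hypothetical.

```r
# Gini impurity of a vector of class labels (0 means the node is pure).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Univariate split: try every midpoint between sorted unique feature values
# and keep the threshold with the lowest weighted impurity of the two children.
best_split <- function(x, labels) {
  v <- sort(unique(x))
  thresholds <- (head(v, -1) + tail(v, -1)) / 2
  score <- sapply(thresholds, function(t) {
    left <- labels[x < t]
    right <- labels[x >= t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
  })
  list(threshold = thresholds[which.min(score)], impurity = min(score))
}

# Toy data (hypothetical): response to a direct mailing versus customer age.
age <- c(22, 25, 31, 38, 45, 52, 60, 67)
response <- c("NO", "NO", "NO", "NO", "YES", "YES", "YES", "YES")
res <- best_split(age, response)
print(res)
```

A full tree would apply `best_split` recursively to each child node until one of the stopping rules is reached.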

[Figure: example decision tree. The selected feature(s) define the splits (X>x / X<x, then Y>y / Y<y); the branches end in leaves (terminal nodes).]

Hierarchical Trees

• Trees can also be used for
  – Clustering
  – Hierarchy determination
    • E.g. phylogenetic trees

• Convenient visualization
  – effective visual condensation of the clustering results

• Gene Ontology
  – Directed acyclic graph (DAG)
  – Example of functional hierarchy

GO tree example


Phylogenetic trees

• Show evolutionary relationships

• Taxa (taxon)
  – Group of organisms

• Clade
  – A group of organisms having a common ancestor

• Common ancestor
  – an ancestor that given organisms have in common

clade

Building a phylo tree using ape

• Ape - Analyses of Phylogenetics and Evolution

– Functions to create and manipulate phylo trees

– Graphical exploration of phylogenetic data

• To build a phylogenetic tree

1. Download protein sequences from DB

2. Align sequences

3. Calculate pairwise distances (dist.alignment() from seqinr)

4. Build and visualize a phylogenetic tree using ape

Building an unrooted phylogenetic tree (1)

#install required libraries
install.packages("seqinr");
#biocLite() was the Bioconductor installer when these slides were written;
#recent Bioconductor releases use BiocManager::install("muscle") instead
source("https://bioconductor.org/biocLite.R");
biocLite("muscle");
install.packages("ape");

library("seqinr");
library("muscle");
library("ape");

#aligns the given sequences with MUSCLE and returns an alignment object for dist.alignment()
multipleSeqAlignment <- function (seqnames, seqs){
  #collapse each sequence (vector of residues) into a single string
  tmp=data.frame(V1=rep(0,length(seqs)),V2=rep(0,length(seqs)));
  for(i in 1:length(seqs)){
    tmp[i,1]=seqnames[i]
    tmp[i,2]=paste(seqs[[i]],collapse="")
  }
  fasta_seqs_Object = AAStringSet(tmp[,2]); names(fasta_seqs_Object) = seqnames;
  #multiple sequence alignment
  alignment=muscle::muscle(fasta_seqs_Object); #muscle format
  alignment_ape=ape::as.alignment(as.matrix(alignment));
  return (alignment_ape)
}

Building an unrooted phylogenetic tree (2)

#main part of the code
choosebank("swissprot") #selects database for query
seqnames <- c("P06747", "P0C569", "O56773", "Q5VKP1");
seqs=list();
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""));
  seqs[i]=getSequence(query);
}

#multipleSeqAlignment() is defined on previous slide
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);

#nj() performs the neighbor-joining tree estimation by Saitou and Nei
mytree <- nj(mydist);
mytree$tip.label=c("Q5VKP1-\nWestern Caucasian bat virus\nphosphoprotein","P06747-\nrabies virus\nphosphoprotein","P0C569-\nMokola virus\nphosphoprotein","O56773-\nLagos bat virus\nphosphoprotein");
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=0.8, no.margin=T, srt=50);

Unrooted Phylogenetic Tree

• Phylogenetic tree showing distance between 4 protein viral sequences

• the genetic distance between O56773 and P0C569 is the smallest

Unrooted phylogenetic tree (1)

• The lengths of the branches

– proportional to the amount of evolutionary change

• estimated by number of mutations

• This is an unrooted phylogenetic tree – it does not contain an outgroup sequence

• an outgroup is a sequence of a protein that is known to be more distantly related to the other proteins in the tree than they are to each other

• without an outgroup, the root (i.e. the common ancestor of all taxa) cannot be placed

Unrooted phylogenetic tree (2)

• We cannot tell which direction evolutionary time ran in along the internal branches of the tree.

• We cannot tell whether the node representing the common ancestor of (O56773, P0C569) was
  – an ancestor of the node representing the common ancestor of (Q5VKP1, P06747),
  – or the other way around…

Distance matrix

• Inspecting calculated distance matrix between aligned sequences confirms results seen in phylogenetic tree

• Closest pair is O56773 and P0C569 proteins

         Q5VKP1  P06747  P0C569
P06747   0.49
P0C569   0.48    0.45
O56773   0.50    0.46    0.41
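The inspection of the distance matrix can also be done programmatically. The sketch below re-enters the values from the matrix above by hand (in the practical they would come from `dist.alignment()`); the helper name `extreme_pair` is hypothetical and uses base R only.

```r
# Distance values copied from the slide (lower triangle, filled column by column).
ids <- c("Q5VKP1", "P06747", "P0C569", "O56773")
d <- matrix(0, 4, 4, dimnames = list(ids, ids))
d[lower.tri(d)] <- c(0.49, 0.48, 0.50, 0.45, 0.46, 0.41)
d <- d + t(d)  # mirror the lower triangle to obtain a symmetric matrix

# Return the pair of sequences selected by `fun` (which.min or which.max)
# among all pairwise distances, each pair counted once.
extreme_pair <- function(d, fun = which.min) {
  idx <- which(lower.tri(d), arr.ind = TRUE)
  best <- idx[fun(d[lower.tri(d)]), ]
  c(rownames(d)[best["row"]], colnames(d)[best["col"]])
}

closest <- extreme_pair(d, which.min)
farthest <- extreme_pair(d, which.max)
print(closest)
print(farthest)
```

The same helper can be applied to any symmetric distance matrix, for example `as.matrix(mydist)`.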

Rooted phylogenetic tree

• In order to convert the unrooted tree into a rooted tree, we need to add an outgroup sequence
  – Outgroup
    • a taxon outside the group of interest
    • will branch off at the base of the phylogeny
    • Represented by
      – Caenorhabditis elegans (UniProt accession Q10572) and
      – Caenorhabditis remanei (UniProt E3M2K8)

• If we were to build a phylogenetic tree of the Fox-1 homologues in vertebrates, the distantly related sequence from worms would probably be a good choice of outgroup
  – this protein is from a different taxon/group (worms)

Building a rooted phylogenetic tree (1)

#BUILDING ROOTED TREE OF PROTEIN SEQUENCES (FOX1)
#Q9NWB1 - Human
#Q17QD3 - Cow
#Q95KI0 - Monkey
#A1A5R1 - Rat
#Q10572 - Worm C.elegans(Root)
#E1G4K8 - Eye worm

seqnames <- c("Q9NWB1","Q17QD3","Q95KI0","A1A5R1","Q10572","E1G4K8")
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}

#multipleSeqAlignment() is the helper defined earlier
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)

Building a rooted phylogenetic tree (2)

library("ape")

mytree <- nj(mydist)

mytree$tip.label=c("E1G4K8-Eye worm ", "Q10572-C.elegans(Root)",

"A1A5R1-Rat", "Q9NWB1-Human", "Q17QD3-Cow", "Q95KI0-Monkey")

myrootedtree <- root(mytree, outgroup="Q10572-C.elegans(Root)", resolve.root=TRUE)

#Phylogenetic tree with 6 tips and 5 internal nodes.

#Tip labels:

#[1] "E1G4K8" "Q8WS01" "Q9VT99" "A8NSK3" "Q10572" "E3M2K8"

#Rooted; includes branch lengths.

plot.phylo(myrootedtree, edge.color = "blue", edge.width = 3 ,

type="p")

Rooted tree of FOX1 proteins

• The invertebrates are grouped together

• Worms form a distinct group yet with large genetic distance

• Human FOX1 is closest to monkey and cow sequences

outgroup (worms)

Distance matrix

         E1G4K8  Q10572  A1A5R1  Q9NWB1  Q17QD3
Q10572   0.72
A1A5R1   0.75    0.63
Q9NWB1   0.72    0.62    0.44
Q17QD3   0.73    0.62    0.50    0.28
Q95KI0   0.73    0.61    0.49    0.28    0.14

• As expected, eye worms are the most distantly related species to vertebrates

• Cow and monkey have the closest relationship and the lowest genetic distance

Table legend:
Q9NWB1 – Human
Q95KI0 – Monkey
Q10572 – Worm C.elegans (Root)
Q17QD3 – Cow
A1A5R1 – Rat
E1G4K8 – Eye worm

Rooted tree

• Time runs from left to right

• Monkey, Cow and Human have common ancestor 3

• Ancestor 1 is common to ancestors 2 and 3

TIME

Exercises on phylogenetic tree building

• Q1. Calculate the genetic distances between the following NS1 proteins from different Dengue virus strains: Dengue virus 1 NS1 protein (UniProt: Q9YRR4), Dengue virus 2 NS1 protein (UniProt: Q9YP96), Dengue virus 3 NS1 protein (UniProt: B0LSS3), and Dengue virus 4 NS1 protein (UniProt: Q6TFL5). Which viruses are the most closely related, and which are the least closely related, based on the genetic distances? Note: Dengue virus causes Dengue fever, which is classified by the WHO as a neglected tropical disease. There are four main types of Dengue virus: Dengue virus 1, Dengue virus 2, Dengue virus 3, and Dengue virus 4.

• Q2. Build an unrooted phylogenetic tree of the NS1 proteins from Dengue virus 1, Dengue virus 2, Dengue virus 3 and Dengue virus 4, using the neighbour-joining algorithm. Which are the most closely related proteins, based on the tree?

• Q3. The Zika virus is related to Dengue viruses, but is not a Dengue virus, and so therefore can be used as an outgroup in phylogenetic trees of Dengue virus sequences. UniProt accession Q32ZE1 consists of a sequence with similarity to the Dengue NS1 protein, so seems to be a related protein from Zika virus. Build a rooted phylogenetic tree of the Dengue NS1 proteins based on an alignment, using the Zika virus protein as the outgroup. Which are the most closely related Dengue virus proteins, based on the tree? What extra information does this tree tell you, compared to the unrooted tree in Q2?

Exercises on phylogenetic tree building

Answers

Question 1: Summary of viral proteins and UniProt accession numbers:
Q9YRR4 – Dengue virus 1 NS1 protein
Q9YP96 – Dengue virus 2 NS1 protein
B0LSS3 – Dengue virus 3 NS1 protein
Q6TFL5 – Dengue virus 4 NS1 protein

seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5")
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs);
mydist <- dist.alignment(alignment_ape);
mydist

Answers

• Q1. The distance matrix is as follows

The most distant are Q9YP96 (V2) and Q6TFL5 (V4) with a genetic distance of 0.333, while the most closely related are Q9YP96 (V2) and B0LSS3 (V3) with a genetic distance of 0.227

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.333   0.254
B0LSS3   0.297   0.230   0.227

Answers

Question 2:

library("ape")
mytree <- nj(mydist)
#plotting unrooted tree based on alignment of whole protein sequences
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
           no.margin=T, srt=0)

Question 2 (continued):

#clean the sequences from gaps: keep only the best aligned region,
#from the motif DMGY to the motif GEDG
seqs_trim=seqs
for(i in 1:length(seqs)){
  start=regexpr("DMGY", paste(seqs_trim[[i]],collapse="") ) [1]
  stop=regexpr("GEDG", paste(seqs_trim[[i]],collapse="") ) [1]
  seqs_trim[[i]]=seqs_trim[[i]][start:stop]
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs_trim);
mydist <- dist.alignment(alignment_ape);mydist

library("ape")
mytree <- nj(mydist)
#tree based on the best aligned portion
plot.phylo(mytree,type="u", edge.color = "blue", edge.width = 3, cex=1.2,
           no.margin=T, srt=0)

Answers

• The resulting Q2 unrooted tree agrees with the genetic distance matrix calculated in Q1. The tree suggests that B0LSS3 and Q9YP96 are the most closely related proteins. To improve the quality of the tree it is best to select a region that has a minimal number of gaps between protein sequences. For how gap cleaning affects phylogenetic tree performance, see reference [2].

Below you can see that there are regions with lots of gaps. Let's build another tree based on the most conserved region (shown in bold on the original slide) to see if it is the same

Q6TFL5 DMGCVVSWNGKELKC…KDQKAVHADMGYWIESSKNQTWQIEKASLIEVKTCLWPKTHTL…GMEIRPLSEKEENMVKSQVTA

Q9YRR4 ------------------------DMGYWIESEKNETWKLARASFIEVKTCIWPKSHTL…GMEI-----------------

Q9YP96 DSGCVVSWKNKELKC…KDNRAVHADMGYWIESALNDTWKIEKASFIEVKNCHWPKSHTL…GMEIRPLKEKEENLVNSLVTA

B0LSS3 --------------------ASHADMGYWIESQKNGSWKLEKASLIEVKTCTWPKSHTL…------------------------

Alignment of proteins: Built using the full lengths of proteins

Answers

• The resulting tree looks the same, but we achieved overall better resolution between proteins

Whole protein sequences used:

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.306
Q9YP96   0.332   0.254
B0LSS3   0.297   0.230   0.227

Best aligned portion of protein sequences used (built using the bolded region):

         Q6TFL5  Q9YRR4  Q9YP96
Q9YRR4   0.317
Q9YP96   0.317   0.264
B0LSS3   0.292   0.233   0.216

Answers

Question 3:

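#Q3 building rooted tree based on Q89277 (yellow fever virus) as out group
library("seqinr")
library("muscle")
library("ape")

seqnames <- c("Q9YRR4","Q9YP96","B0LSS3","Q6TFL5", "Q89277")
seqs=list()
for(i in 1:length(seqnames)){
  query <- query(paste("AC=",seqnames[i],sep=""))
  seqs[i]=getSequence(query)
}
alignment_ape <- multipleSeqAlignment(seqnames, seqs)
mydist <- dist.alignment(alignment_ape)

mytree <- nj(mydist)
myrootedtree <- root(mytree, outgroup="Q89277", r=TRUE)
plot.phylo(myrootedtree ,type="p", edge.color = "blue", edge.width = 3,
           cex=1.2, no.margin=T, srt=0)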

Answers

• The answer to Q3 builds a rooted tree using yellow fever virus (Q89277) as the outgroup (the question itself suggests the Zika virus protein Q32ZE1)

• Most closely related viruses:
  – B0LSS3 and Q9YP96

• This rooted tree tells you which of the Dengue virus NS1 proteins branched off the earliest from the ancestors. An unrooted tree does not provide ancestry information (i.e. time sequence)

         Q89277  Q6TFL5  Q9YRR4  Q9YP96
Q6TFL5   0.523
Q9YRR4   0.511   0.306
Q9YP96   0.486   0.333   0.254
B0LSS3   0.487   0.297   0.230   0.227

outgroup

References

1. Ape library for phylogenetic trees and ancestry with bootstrap methods. http://cran.r-project.org/web/packages/ape/ape.pdf

2. Gerard Talavera and Jose Castresana. Improvement of Phylogenies after Removing Divergent and Ambiguously Aligned Blocks from Protein Sequence Alignments. Systematic Biology, Volume 56, Issue 4, p. 564-577. http://sysbio.oxfordjournals.org/content/56/4/564.long

L8: Part 2 Networks of Biological interactions

Kirill Bessonov

Nov 10th 2015

We are surrounded by networks


Transportation Networks


Computer Networks


Social networks


Internet submarine cable map


From describing to engineering

• In 1950

– Alex Bavelas founds the Networks Laboratory Group at M.I.T. to study effectiveness of different communication patterns


Social interaction patterns


PPI (Protein Interaction Networks)

• Nodes – protein names

• Links – physical binding event

Network Definitions


Network components

• Networks also called graphs

– Graph (G) contains

• Nodes (N): genes, SNPs, cities, PCs, etc.

• Edges (E): links connecting two nodes


Some characteristics

• Networks are

– Complex

– Dynamic

– Can be used to reduce data dimensionality

[Figure: the same network at time = t0 and at time = t]

Topology

• Refers to connection pattern

– The pattern of links


Small – world networks

• Six degrees of separation – everyone is 6 or fewer steps away from each other

• Reference: Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442. http://www.nature.com/nature/journal/v393/n6684/full/393440a0.html

Scale-free networks

• Biological networks are often characterized by this topology
  – Few hubs (highly connected nodes)
  – Predominance of poorly connected nodes
  – New vertices attach preferentially to highly connected ones

• Reference: Barabási, Albert-László, and Réka Albert. "Emergence of scaling in random networks." Science 286.5439 (1999): 509-512. http://www.sciencemag.org/content/286/5439/509.full

Modules

• Sub-networks with

– Specific topology

– Function

• Biological context

– Protein complex

– Common function

• E.g. energy production

[Figure label: clique]

Edge types

[Figure: a graph with N nodes and E edges, drawn as directed and as undirected]

Network types

• Directed
  – Edges have directionality
  – Some links are unidirectional
  – Direction matters
    • Going A → B is not the same as B → A
  – Analogous to chemical reactions
    • Forward rate might not be the same as reverse
  – E.g. directed gene regulatory networks (TF → gene)

• Undirected
  – Edges have no directionality
  – Simpler to describe and work with
  – E.g. co-expression networks

Neighbours of node(s)

• Neighbours(node, order) = {node1 … nodep}

• Neighbours(3,1) = {2,4}

• Neighbours(2,2) = {1,3,5,4}
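The neighbourhood examples above can be reproduced in base R. The figure is not part of this transcript, so the sketch assumes a six-node undirected graph with edges 1-2, 1-5, 2-3, 2-5, 3-4, 4-5 and 4-6, which is inferred from the examples on these slides; the function name `neighbours` is hypothetical.

```r
# Assumed example graph (inferred from the slide examples, not given explicitly).
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1  # undirected graph: mirror every edge

# All nodes reachable from `node` in at most `order` steps, excluding `node` itself.
neighbours <- function(A, node, order) {
  reached <- node
  for (step in seq_len(order)) {
    frontier <- which(colSums(A[reached, , drop = FALSE]) > 0)
    reached <- union(reached, frontier)
  }
  sort(setdiff(reached, node))
}

print(neighbours(A, 3, 1))
print(neighbours(A, 2, 2))
```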


Reachability of two nodes i and j

• Walk – a sequence of nodes visited on the way from node i to node j (nodes and edges may repeat)

– e.g. nodes(1,2) = {5,2,1,2,3,4,5,2}

• Trail – a walk with no repeated edges

– e.g. nodes(1,4)={5,4}

• Path – a walk with no repeated nodes

– e.g. nodes(1,6)={5,4,6}


visited nodes

Connectivity

• Line (edge) connectivity (λ)

– Minimum number of lines (edges) that need to be removed to disconnect graph G

• i.e. no other links would be able to connect a node

• Node connectivity (κ)

– Minimum number of nodes that need to be removed to disconnect graph G


λs = 3 and κs = 2

λt = 3 and κt = 2

Connectivity matrix (also known as adjacency matrix)

[Figure: the matrix A with its size annotated; entries are binary or weighted]

Node degree (k)

• the number of edges connected to the node

• k(6) = 1

• k(4) = 3


Degree distribution (P(k))

• Determines the statistical properties of uncorrelated networks
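The adjacency matrix, node degree and degree distribution can be computed together in a few lines of base R. As before, the six-node graph (edges 1-2, 1-5, 2-3, 2-5, 3-4, 4-5, 4-6) is an assumption inferred from the examples on these slides, not something the transcript states.

```r
# Assumed example graph (inferred from the slide examples).
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
n <- 6
A <- matrix(0, n, n, dimnames = list(1:n, 1:n))
A[edges] <- 1
A[edges[, 2:1]] <- 1   # binary, symmetric adjacency matrix of an undirected graph

k <- rowSums(A)        # node degree: number of edges attached to each node
P_k <- table(k) / n    # degree distribution P(k): fraction of nodes with degree k

print(A)
print(k)
print(P_k)
```

For a weighted network the same `rowSums(A)` call gives the weighted connectivity of each node.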


source: http://www.network-science.org/powerlaw_scalefree_node_degree_distribution.html

Topologies: scale-free

Many real networks have a degree distribution that follows a power law. Other quantities reported to follow a power law include:

• the sizes of earthquakes
• craters on the moon
• solar flares
• the sizes of activity patterns of neuronal populations
• the frequencies of words in most languages
• frequencies of family names
• sizes of power outages
• criminal charges per convict
• and many more

Topology: random

Degree distribution of nodes is statistically independent

Shortest path (p)

• Indicates the distance between i and j in terms of geodesics (unweighted)

• Paths between nodes 1 and 3:
  – {1-5-4-3}
  – {1-5-2-3}
  – {1-2-5-4-3}
  – {1-2-3}

• p(1,3) = {1-2-3}, the path with the fewest edges (2)
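A shortest path in an unweighted graph can be found with breadth-first search. The sketch below is one possible base R implementation; the function name `shortest_path` is hypothetical and the six-node graph (edges 1-2, 1-5, 2-3, 2-5, 3-4, 4-5, 4-6) is again an assumption inferred from the slide examples.

```r
# Assumed example graph (inferred from the slide examples).
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Breadth-first search: returns one shortest path from `from` to `to`,
# or NULL when the two nodes are not connected.
shortest_path <- function(A, from, to) {
  parent <- rep(NA_integer_, nrow(A))
  visited <- rep(FALSE, nrow(A))
  visited[from] <- TRUE
  queue <- from
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    if (v == to) break
    for (w in which(A[v, ] > 0 & !visited)) {
      visited[w] <- TRUE
      parent[w] <- v
      queue <- c(queue, w)
    }
  }
  if (!visited[to]) return(NULL)
  path <- to
  while (path[1] != from) path <- c(parent[path[1]], path)  # walk back to the start
  path
}

p <- shortest_path(A, 1, 3)
print(p)
print(length(p) - 1)  # geodesic distance = number of edges on the path
```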


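Betweenness centrality

• For a node i: the ratio between the number of shortest paths (SPs) from j to k that pass through node i and the number of all shortest paths from j to k, summed over all node pairs (j, k) of the graph G that do not contain i

$$ b_i = \sum_{j<k;\; j,k \neq i} \frac{b_{jk}(i)}{b_{jk}} $$

where $b_{jk}$ is the number of SPs from j to k and $b_{jk}(i)$ is the number of SPs from j to k via i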

Facebook academic network

Blue is low and red is high betweenness

Betweenness centrality

• reflects the amount of control over the interactions of other nodes in the network

• $b_c = \frac{b_{ab}(c)}{b_{ab}} + \frac{b_{ae}(c)}{b_{ae}} + \frac{b_{ad}(c)}{b_{ad}} + \frac{b_{be}(c)}{b_{be}} + \frac{b_{bd}(c)}{b_{bd}} + \frac{b_{de}(c)}{b_{de}} = \frac{0}{1} + \frac{1}{2} + \frac{0}{1} + \frac{1}{2} + \frac{0}{1} + \frac{0}{1}$

• $b_c = 1$

Possible node combinations: {AB, AD, AE, AC, BD, BE, BC, CD, CE, DE}; only the six pairs that do not contain C enter the sum for $b_c$

Betweenness centrality standardized

• For standardization
  – the denominator is (n-1)(n-2)/2 (15 for n = 7)
  – i.e. the number of node pairs that do not include the node itself

Node   b   b - standardized
1      0   0
2      0   0
3      9   9/15
4      9   9/15
5      8   9/15
6      0   0
7      0   0

Possible node pairs (21):
12 23 34 45 56 67
13 24 35 46 57
14 25 36 47
15 26 37
16 27
17
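The definition of betweenness and its standardization can be sketched in base R by counting shortest paths with breadth-first search. This is an illustrative implementation only: the helper names `bfs_counts` and `betweenness` are hypothetical, and the graph is the six-node graph assumed from earlier slide examples (edges 1-2, 1-5, 2-3, 2-5, 3-4, 4-5, 4-6), not the seven-node graph of the table above.

```r
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# BFS from source s: distance to every node and number of shortest paths to it.
bfs_counts <- function(A, s) {
  n <- nrow(A)
  dist <- rep(Inf, n)
  sigma <- rep(0, n)
  dist[s] <- 0
  sigma[s] <- 1
  queue <- s
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    for (w in which(A[v, ] > 0)) {
      if (is.infinite(dist[w])) {        # first time w is reached
        dist[w] <- dist[v] + 1
        queue <- c(queue, w)
      }
      if (dist[w] == dist[v] + 1) sigma[w] <- sigma[w] + sigma[v]
    }
  }
  list(dist = dist, sigma = sigma)
}

betweenness <- function(A) {
  n <- nrow(A)
  res <- lapply(seq_len(n), function(s) bfs_counts(A, s))
  D <- t(sapply(res, `[[`, "dist"))    # D[j, k]: shortest distance from j to k
  S <- t(sapply(res, `[[`, "sigma"))   # S[j, k]: number of shortest paths from j to k
  b <- numeric(n)
  for (i in seq_len(n)) for (j in seq_len(n)) for (k in seq_len(n)) {
    # node i lies on a shortest j-k path when the two distances add up exactly
    if (j < k && i != j && i != k && is.finite(D[j, k]) &&
        D[j, i] + D[i, k] == D[j, k]) {
      b[i] <- b[i] + S[j, i] * S[i, k] / S[j, k]
    }
  }
  b
}

n <- nrow(A)
b <- betweenness(A)
b_std <- b / ((n - 1) * (n - 2) / 2)   # standardized betweenness
print(b)
print(b_std)
```

The triple loop is cubic in the number of nodes, which is fine for small teaching examples; larger networks would normally use a dedicated graph library.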

Cliques

• A clique of a graph G is a complete subgraph of G

– i.e. maximally interconnected subgraph

• The highlighted clique is the maximal clique of size 4 (nodes)
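Whether a set of nodes forms a clique can be checked directly on the adjacency matrix: every pair inside the set must be connected. The helper `is_clique` below is a hypothetical name, and the graph is the six-node graph assumed from earlier slide examples, not the graph highlighted in the figure.

```r
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# A node set is a clique when the induced subgraph is complete,
# i.e. all entries above the diagonal of the sub-matrix are edges.
is_clique <- function(A, nodes) {
  sub <- A[nodes, nodes, drop = FALSE]
  all(sub[upper.tri(sub)] > 0)
}

print(is_clique(A, c(1, 2, 5)))
print(is_clique(A, c(2, 3, 4)))
```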

“The richest people in the world look for and build networks. Everyone else looks for work.”

– Robert Kiyosaki

Biological context


Biological Networks


Biological examples

• Co-expression
  – For genes that have similar expression profile

• Directed gene regulatory networks (GRNs)
  – show directionality between gene interactions
    • Transcription factor → target gene expression
  – Show direction of information flow
  – E.g. transcription factor activating target gene

• Protein-Protein Interaction Networks (PPI)
  – Show physical interaction between proteins
  – Concentrate on binding events

• Others
  – Metabolic, differential, Bayesian, etc.

Biological networks

• Three main classes

Type                                 Name    Nodes            Edges                           Resource
molecular interactions               PPI     proteins         physical bonds                  BioGRID
                                     DTI     drugs/targets    physical bonds                  PubChem
functional associations              GI      genes            genetic interactions            BioGRID
                                     ON      Gene Ontology    functional relations            GO
                                     GDA     genes/diseases   associations                    OMIM
functional/structural similarities   Co-Ex   genes            expression profile similarity   GEO, ArrayExpress
                                     PStrS   proteins         structural similarities         PDB

Source: Gligorijević, Vladimir, and Nataša Pržulj. "Methods for biological data integration: perspectives and challenges." Journal of The Royal Society Interface 12.112 (2015): 20150571. http://rsif.royalsocietypublishing.org/content/12/112/20150571.full

Inferring co-expression networks in R

WGCNA package (Weighted Gene Correlation Network Analysis)


http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/

Main features

• Builds correlation networks

• Correlations are

– simple to calculate

– fast on large scale data

• Supports the sign of an association (not its direction)

• Lots of network metrics (e.g. connectivity)

• Easy identification of modules
  – Reduces dataset dimensionality

1. Construct a network: search for genes with similar expression profile

2. Identify modules in the predicted network: reduce data into gene sets / groups

3. Relate modules to external information: find biologically interesting modules, e.g. using clinical data, biological function (gene ontology, pathways)

4. Find the key drivers in interesting modules: experimental validation, therapeutics, biomarkers

5. Study module preservation across different data: check robustness of module definition

Steps for constructing a co-expression network

A) Obtain gene expression data

B) Measure co-expression between genes via a correlation coefficient

C) Build correlation matrix = network (adjacency matrix)

D) Transform correlation matrix with the power adjacency function → new adjacency matrix → weighted network
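Steps A) to D) can be sketched on a small simulated data set in base R. The data, the gene names and the choice beta = 6 are assumptions made for illustration (beta = 6 matches the value used later in the demo); this is not the WGCNA implementation itself.

```r
set.seed(1)
n_samples <- 30

# A) toy expression data: genes 1-3 share a hidden driver, genes 4-5 are noise
driver <- rnorm(n_samples)
expr <- cbind(
  gene1 = driver + rnorm(n_samples, sd = 0.3),
  gene2 = driver + rnorm(n_samples, sd = 0.3),
  gene3 = driver + rnorm(n_samples, sd = 0.3),
  gene4 = rnorm(n_samples),
  gene5 = rnorm(n_samples)
)

# B) + C) pairwise Pearson correlation between genes (columns)
cor_mat <- cor(expr, use = "p")

# D) power adjacency function: a_ij = |cor_ij|^beta gives a weighted network
beta <- 6
adj <- abs(cor_mat)^beta

# weighted connectivity of each gene (the diagonal self-connection is removed)
connectivity <- rowSums(adj) - 1
print(round(adj, 3))
print(round(connectivity, 3))
```

Raising the absolute correlation to a power keeps strong correlations close to their original strength and pushes weak ones towards zero, without imposing a hard cut-off.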


Network=Adjacency Matrix

• Adjacency matrix, A=[aij], encodes how a pair of nodes is connected (if at all)

– Weighted networks = aij is edge value (weight)

– Unweighted networks = aij presence or absence of edge


Scale Free Network Topology

• Scale free topology means

– presence of hub nodes highly connected to other nodes

– metabolic networks exhibit scale free topology at least approximately

– Node connectivity (k), degree, follows power law

– p(k)=proportion of nodes that have connectivity k

[Figure: frequency distribution of connectivity; x-axis: connectivity k, y-axis: frequency]

How to check Scale Free Topology?

• Only few nodes display high connectivity

• Check if the obtained network follows scale free topology
  – Idea: log-transform p(k) and k and look at scatter plots
  – Answer: R^2 of the linear fit can be used to quantify goodness of fit
  – R^2 > 0.6 means that the network follows scale free topology
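The log-log check can be sketched in base R. The connectivity values below are hypothetical and were chosen to follow an exact power law, so the fit is perfect; real data would be noisier, and WGCNA's own `scaleFreePlot()` additionally bins the connectivities, which this sketch does not do.

```r
# Hypothetical connectivities: 64 nodes with k = 1, 16 with k = 2, 4 with k = 4, 1 with k = 8
k <- rep(c(1, 2, 4, 8), times = c(64, 16, 4, 1))

# p(k): proportion of nodes that have connectivity k
pk <- table(k) / length(k)

# log-transform both axes and fit a straight line
log_k <- log10(as.numeric(names(pk)))
log_pk <- log10(as.numeric(pk))
fit <- lm(log_pk ~ log_k)

slope <- unname(coef(fit)[2])   # exponent of the power law
r2 <- cor(log_k, log_pk)^2      # R^2 of the linear fit in log-log space

print(slope)
print(r2)
plot(log_k, log_pk, xlab = "log10(k)", ylab = "log10(p(k))")
abline(fit)
```

A negative slope together with a high R^2 is what the scatter plot is expected to show for an approximately scale-free network.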


Power function transformation

• Idea:

– transform correlation matrix via power function

– Impose scale free topology

– Select the best beta (β)

• Pick the beta that corresponds to the largest R^2

[Figure: R^2 plotted against the power (beta) of the power function]

Defining modules

• based on a hierarchical cluster tree
  – Build a tree and cut it
  – Dynamic tree cutting at optimal height [1]

• Module = branch of a cluster tree
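The idea of cutting a cluster tree into modules can be illustrated with base R alone. Note the substitution: the sketch uses a static cut with `cutree()` rather than the dynamic tree cut of reference [1], and the simulated two-module data set is hypothetical.

```r
set.seed(2)
n_samples <- 40

# Two simulated modules of 5 genes each; every module follows its own hidden signal.
sig_a <- rnorm(n_samples)
sig_b <- rnorm(n_samples)
expr <- cbind(
  sapply(1:5, function(i) sig_a + rnorm(n_samples, sd = 0.2)),
  sapply(1:5, function(i) sig_b + rnorm(n_samples, sd = 0.2))
)
colnames(expr) <- paste0("gene", 1:10)

# Weighted adjacency, turned into a dissimilarity for clustering.
adj <- abs(cor(expr))^6
tree <- hclust(as.dist(1 - adj), method = "average")

# Static cut into two branches; each branch is one module.
modules <- cutree(tree, k = 2)
print(modules)
plot(tree, main = "Gene clustering tree")
```

With real data the number of branches is not known in advance, which is the situation the dynamic tree cut method is designed for.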


Analysis of modules

• Perform gene ontology analysis on genes from each module (e.g. yellow = “genes 1”)

• Link modules to clinical data (e.g. weight) – Via module eigengene e.g. cor(trait, eigengene)

genes 1 genes 2 genes 3 genes 4

Modules


Heatmap view of module

[Figure: heatmap of a module of co-expressed genes; axis labels: genes, tissue samples, modules]

• vertical bands indicate tight co-expression of module genes

Modules as eigengenes

• All genes in a module can be summarized by one eigengene (i.e. virtual gene)

• Allows one to relate modules to each other
  – Allows calculating the distance between modules

• Allows one to relate modules to clinical traits and SNPs
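A module eigengene can be sketched in base R, assuming it is taken as the first principal component of the standardized module expression matrix (this is, as far as we know, what WGCNA's `moduleEigengenes()` computes). The simulated module and all variable names are hypothetical.

```r
set.seed(3)
n_samples <- 20

# Simulated module: 8 genes that follow the same hidden signal with different strengths.
signal <- rnorm(n_samples)
module_expr <- sapply(1:8, function(i) {
  signal * runif(1, 0.5, 1.5) + rnorm(n_samples, sd = 0.1)
})  # rows = samples, columns = genes

# Standardize every gene, then take the first left singular vector:
# one value per sample that summarizes the whole module.
scaled <- scale(module_expr)
s <- svd(scaled)
eigengene <- s$u[, 1]

# The sign of a singular vector is arbitrary; orient it along the average expression.
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

var_explained <- s$d[1]^2 / sum(s$d^2)
print(round(eigengene, 3))
print(var_explained)
```

Because the eigengene is a single vector over samples, it can be correlated with a clinical trait or with the eigengenes of other modules.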


[Figure: heatmap of the brown module (rows = genes, columns = microarray samples) above a bar plot of the brown module eigengene across samples]

Module eigengene = measure of over-expression = average redness

Analysis of modules

• Relate modules to traits

• Interested in modules with correlation > 0.75 (red)


WGCNA Demo Simulated data - 5 modules


Simulating expression data (1)

#Note: install the Hmisc library first, otherwise WGCNA installation fails

install.packages("Hmisc");

install.packages("WGCNA");

source("https://bioconductor.org/biocLite.R") ;

biocLite(c("GO.db", "preprocessCore", "impute"));

#Simulate data

# Load WGCNA package

library(WGCNA)

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);

# Here are input parameters of the simulation model

# number of samples or microarrays in the training data

no.obs=50

# now we specify the true measures of eigengene significance

# recall that ESturquoise=cor(y,MEturquoise)

ESturquoise=0; ESbrown= -.6;

ESgreen=.6;ESyellow=0

# Note that we dont specify the eigengene significance of the blue module

# since it is highly correlated with the turquoise module.

ESvector=c(ESturquoise,ESbrown,ESgreen,ESyellow)

# number of genes

nGenes1=3000

# proportion of genes in the turquoise, blue, brown, green, and yellow module, respectively.

simulateProportions1=c(0.2,0.15, 0.08, 0.06, 0.04)

# Note that the proportions dont add up to 1. The remaining genes will be colored grey,

# ie the grey genes are non-module genes.

# set the seed of the random number generator. As a homework exercise change this seed.

set.seed(1)

Simulating expression data (2)

#Step 1: simulate a module eigengene network.

# Training Data Set I

MEgreen=rnorm(no.obs)

scaledy=MEgreen*ESgreen+sqrt(1-ESgreen^2)*rnorm(no.obs)

y=ifelse( scaledy>median(scaledy),2,1)

MEturquoise= ESturquoise*scaledy+sqrt(1-ESturquoise^2)*rnorm(no.obs)

# we simulate a strong dependence between MEblue and MEturquoise

MEblue= 0.6*MEturquoise+ sqrt(1-.6^2) *rnorm(no.obs)

MEbrown= ESbrown*scaledy+sqrt(1-ESbrown^2)*rnorm(no.obs)

MEyellow= ESyellow*scaledy+sqrt(1-ESyellow^2)*rnorm(no.obs)

ModuleEigengeneNetwork1=data.frame(y,MEturquoise,MEblue,MEbrown,MEgreen, MEyellow)


Simulating expression data (3)

dat1=simulateDatExpr5Modules(MEturquoise=ModuleEigengeneNetwork1$MEturquoise,

MEblue=ModuleEigengeneNetwork1$MEblue,

MEbrown=ModuleEigengeneNetwork1$MEbrown,

MEyellow=ModuleEigengeneNetwork1$MEyellow,

MEgreen=ModuleEigengeneNetwork1$MEgreen,

nGenes=nGenes1,

simulateProportions=simulateProportions1)

datExpr = dat1$datExpr;

truemodules = dat1$truemodule;

datME = dat1$datME;

attach(ModuleEigengeneNetwork1)

datExpr=data.frame(datExpr)

ArrayName=paste("Sample",1:dim(datExpr)[[1]], sep="" )

# The following code is useful for outputting the simulated data

GeneName=paste("Gene",1:dim(datExpr)[[2]], sep="" )

dimnames(datExpr)[[1]]=ArrayName

dimnames(datExpr)[[2]]=GeneName

rm(dat1); collectGarbage();

# The following command will save all variables defined in the current session.

save.image("Simulated-dataSimulation.RData");

cat("Note: *.RData file written in ",getwd(), "\n")

Construction of a weighted gene co-expression network (1)

# Load WGCNA package

library(WGCNA)

# Load additional necessary packages

library(cluster)

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);

# Load the previously saved data

load("Simulated-dataSimulation.RData"); # file written by save.image() in the simulation step

attach(ModuleEigengeneNetwork1)

sft=pickSoftThreshold(datExpr,powerVector=1:20)

plot(sft$fitIndices[,1],-sign(sft$fitIndices[,3])*sft$fitIndices[,2], xlab="Soft Threshold (power)",ylab="SFT, signed R^2", type="o")

abline(h=0.90,col="red")


Construction of a weighted gene co-expression network (2)

# here we define the adjacency matrix using soft thresholding with beta=6

ADJ1=abs(cor(datExpr,use="p"))^6

# When you have relatively few genes (<5000) use the following code

k=as.vector(apply(ADJ1,2,sum, na.rm=T))

# When you have a lot of genes use the following code

#k=softConnectivity(datE=datExpr,power=6)

# Plot a histogram of k and a scale free topology plot

sizeGrWindow(10,5)

par(mfrow=c(1,2))

hist(k)

scaleFreePlot(k, main="Check scale free topology\n")


Definition of co-expression modules (1)

#Many clustering procedures require a dissimilarity matrix as input. We define a dissimilarity based on adjacency

# Turn adjacency into a measure of dissimilarity

dissADJ=1-ADJ1

hierADJ=hclust(as.dist(dissADJ), method="average" )

# Plot the resulting clustering tree together with the true color assignment

sizeGrWindow(10,5);

plotDendroAndColors(hierADJ, colors = data.frame(truemodules), dendroLabels = FALSE, hang = 0.03,

main = "Gene hierarchical clustering dendrogram and simulated module colors" )


Definition of co-expression modules (2)

#static tree cutting

colorStaticADJ=as.character(cutreeStaticColor(hierADJ, cutHeight=.99, minSize=20))

# Plot the dendrogram with module colors

sizeGrWindow(10,5);

plotDendroAndColors(hierADJ, colors = data.frame(truemodules, colorStaticADJ),

dendroLabels = FALSE, abHeight = 0.99,

main = "Gene dendrogram and module colors")

#dynamic tree cutting

branch.number=cutreeDynamic(hierADJ,method="tree")

# This function transforms the branch numbers into colors

colorDynamicADJ=labels2colors(branch.number)

sizeGrWindow(10,5)

plotDendroAndColors(dendro = hierADJ,

colors=data.frame(truemodules, colorStaticADJ,

colorDynamicADJ),

dendroLabels = FALSE, marAll = c(0.2, 8, 2.7, 0.2),

main = "Gene dendrogram and module colors")


Calculating module eigengenes

#calculate eigengenes for each module

datME=moduleEigengenes(datExpr,colorStaticADJ)$eigengenes

#correlation between modules based on their eigengenes

signif(cor(datME, use="p"), 2)

#dendrogram

dissimME=(1-t(cor(datME, method="p")))/2

hclustdatME=hclust(as.dist(dissimME), method="average" )

# Plot the eigengene dendrogram

par(mfrow=c(1,1))

plot(hclustdatME, main="Clustering tree based on the module eigengenes")

#see expression profiles - diagnostic plots

#show available modules

levels(as.factor(colorStaticADJ))

sizeGrWindow(8,9)

par(mfrow=c(3,1), mar=c(1, 2, 4, 1))

which.module="blue";

plotMat(t(scale(datExpr[,colorStaticADJ==which.module ]) ),nrgcols=30,rlabels=T,

clabels=T,rcols=which.module,

title=which.module )

ME=datME[, paste("ME",which.module, sep="")]

barplot(ME, col=which.module, main="", cex.main=2,

ylab="eigengene expression",xlab="array sample")


Relating modules to trait

#all modules (green and brown modules look interesting)

signif(cor(y,datME, use="p"),2)

#get statistical significance of module association to trait

cor.test(y, datME$MEbrown)

cor.test(y, datME$MEgreen)


References

[1] Langfelder P, Zhang B et al (2008) Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut library for R. Bioinformatics 24(5):719-720. http://bioinformatics.oxfordjournals.org/content/24/5/719.full

[2] Steve Horvath, Tutorials for the WGCNA package. http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html


