Network analysis of biological data

S L I D E 1

Network analysis of biological data

A Jeremy Willsey
Gene760 - April 08, 2013

General theory, problems, and potential solutions.

S L I D E 2

Overview

• Goal of network analysis
• Types of biological networks
• Network analysis concepts
• Properties of biological networks
• Issues with ‘conventional’ (database-reliant) network analysis
• Co-expression analysis – general concepts & implementation
• Co-expression analysis – WGCNA
• Successful applications of WGCNA
• Pitfalls of co-expression analysis
• Appendix: Network analysis tools and software

S L I D E 3

Network analysis converts biological information into network structure

• The goal of network analysis is to connect genes or proteins meaningfully in order to elucidate the underlying biology
  – Actionable understanding of gene-gene or protein-protein relationships
  – Identification of key genes
• Network analysis is becoming common in biology
  – Explosion of publicly available biological data
  – Biological activities depend on the coordinated effects of many interacting species; the study of these interactions is fundamental to understanding biological systems
  – Understanding the complexity of most human diseases requires pathway-level knowledge
  – Developments in systems biology network theory (e.g. ubiquity of scale-free topology)

A.-L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: a network-based approach to human disease, Nat Rev Genet 12, 56–68 (2011).

S L I D E 4

Types of biological networks

• Protein-protein interaction networks
  – Yeast two-hybrid
  – Immunoprecipitation and high-throughput mass-spectrometry
  – Individually validated interactions (mined from databases)
  – Predicted function (orthology, paralogy)
  – Text mining
• Metabolic networks
  – System of connected enzymatic/chemical reactions
  – Generally very well characterized
• Regulatory networks
  – ChIP-on-chip
  – ChIP-seq
• RNA networks
  – RNA-RNA and RNA-DNA interactions
• Gene co-expression networks
  – Patterns of gene expression connect genes


S L I D E 5

Networks are composed of nodes and edges (connections between nodes)

• In biological networks (graphs), nodes (vertices) typically represent genes, proteins, or metabolites whereas edges represent relationships

• Formally, a graph G can be defined as a pair (V, E), where V is a set of vertices representing the nodes and E is a set of edges representing the connections between the nodes
  – Define E = {(i, j) | i, j ∈ V}, where each pair (i, j) is a single connection between two nodes (e.g. (1, 2))
  – An undirected, unweighted graph can be represented as a symmetric adjacency matrix of 0’s and 1’s, where the rows and columns are the nodes and a 1 represents a connection between two nodes

G. A. Pavlopoulos et al., Using graph theory to analyze biological networks, BioData Mining 4, 10 (2011).

[Figure: example graph of five nodes (1-5), with nodes, edges, and the hub labeled]

Corresponding adjacency matrix:

     1  2  3  4  5
1    0  1  1  1  1
2    1  0  0  0  0
3    1  0  0  0  1
4    1  0  0  0  0
5    1  0  1  0  0
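As a minimal sketch of the definitions on this slide, the adjacency matrix can be built from an edge list and the hub read off as the node with the highest degree. The edge list below is inferred from the matrix on the slide; the variable names are illustrative only.

```python
# Edge list of the five-node example graph (undirected), read off the slide's matrix.
edges = [(1, 2), (1, 3), (1, 4), (1, 5), (3, 5)]
n_nodes = 5

# Build the symmetric 0/1 adjacency matrix (nodes are 1-based, lists are 0-based).
adjacency = [[0] * n_nodes for _ in range(n_nodes)]
for i, j in edges:
    adjacency[i - 1][j - 1] = 1
    adjacency[j - 1][i - 1] = 1

# Degree of a node = sum of its row; the hub is the node with the largest degree.
degrees = {node: sum(adjacency[node - 1]) for node in range(1, n_nodes + 1)}
hub = max(degrees, key=degrees.get)

for row in adjacency:
    print(row)
print("degrees:", degrees)
print("hub:", hub)
```

Storing each undirected edge in both a[i][j] and a[j][i] is what makes the matrix symmetric.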

S L I D E 6

Networks can be undirected, directed, or weighted


Undirected

• Edges represent biological relationships
• Multi-edge connections are possible, used to represent multiple relationships

[Figure: undirected graph of nodes 1-5]

S L I D E 7



Undirected

[Figure: undirected graph of nodes 1-5]

Adjacency matrix:

     1  2  3  4  5
1    0  1  1  1  1
2    1  0  0  0  0
3    1  0  0  0  1
4    1  0  0  0  0
5    1  0  1  0  0


S L I D E 8

• Example: PPI database String (http://string-db.org/) - evidence view
  – Edges represent associations based on several forms of evidence

Different colors represent different types of evidence for association



S L I D E 9

Directed

• Edges retain directionality
• Commonly used for metabolic, signal transduction, or regulatory networks

[Figure: directed graph of nodes 1-5]

S L I D E 10



Directed

[Figure: directed graph of nodes 1-5]

Adjacency matrix:

     1   2   3   4   5
1    0  -1   1   0   1
2    0   0   0   0   0
3    0   0   0   0   1
4    1   0   0   0   0
5    0   0   1   0   0


S L I D E 11

• Example: PPI database String (http://string-db.org/) - action view
  – Edges represent connection and type of relationship

Modes of action are shown in different colors


S L I D E 12

• Example: KEGG (http://www.genome.jp/kegg/)
  – Edges represent activating or inhibiting interactions



S L I D E 13

Weighted

• Most widely used type of network in bioinformatics
• Weight of edge indicates strength of connection (or confidence, relevance, etc.)

[Figure: weighted graph of nodes 1-5]

S L I D E 14

Weighted

[Figure: weighted graph of nodes 1-5]

Adjacency matrix:

     1    2    3    4    5
1    0    0.2  1    0.5  0.3
2    0.2  0    0    0    0
3    1    0    0    0    0.1
4    0.5  0    0    0    0
5    0.3  0    0.1  0    0
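A short sketch of how the weighted matrix on this slide might be handled in code. "Node strength" (the sum of a node's edge weights) is used here as the weighted analogue of degree; the function names are illustrative and not taken from any particular package.

```python
# Weighted adjacency matrix copied from the slide (nodes 1-5).
weights = [
    [0.0, 0.2, 1.0, 0.5, 0.3],
    [0.2, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.1],
    [0.5, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.1, 0.0, 0.0],
]


def node_strength(matrix):
    """Sum of the edge weights attached to each node."""
    return [sum(row) for row in matrix]


def is_symmetric(matrix):
    """True when a[i][j] == a[j][i] for every pair, i.e. the network is undirected."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


strengths = node_strength(weights)
print("symmetric:", is_symmetric(weights))
print("strengths:", [round(s, 2) for s in strengths])
```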


S L I D E 15

• Example: PPI database String (http://string-db.org/) - confidence view
  – Edges represent strength of association (based on strength of evidence)

Stronger associations are represented by thicker lines


S L I D E 16

Properties of biological networks

• Biological networks tend to follow a series of basic organizing principles that distinguish them from random networks
  – Modules
    • Highly interlinked (connected) local regions in the network
  – Degree distribution and hubs – scale-free topology
    • Degree distribution (fraction of nodes with a given degree) decays according to a power law (as opposed to a Poisson distribution)
    • A few highly connected genes (hubs) hold the network together
  – Small-world phenomenon
    • Short path between any pair of nodes
  – Motifs
    • Subgraphs repeated within or across multiple networks
  – Betweenness centrality
    • Some genes mediate connections between subnetworks
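Two of these quantities, the degree distribution and the typical path length between nodes, can be sketched on the five-node example graph from Slide 5. A graph this small cannot show a power law; the code only illustrates how the quantities are computed, and the function names are illustrative.

```python
from collections import Counter, deque

# Five-node example graph from Slide 5 as an adjacency list.
graph = {1: [2, 3, 4, 5], 2: [1], 3: [1, 5], 4: [1], 5: [1, 3]}


def degree_distribution(g):
    """Fraction of nodes having each degree."""
    counts = Counter(len(neighbors) for neighbors in g.values())
    return {degree: count / len(g) for degree, count in sorted(counts.items())}


def shortest_path_lengths(g, source):
    """Breadth-first search: number of edges from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in g[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_path_length(g):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in g:
        for target, d in shortest_path_lengths(g, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs


print(degree_distribution(graph))
print(average_path_length(graph))
```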


S L I D E 17

What do these properties mean for biological network analysis?

• Modules
  – Correspond to ‘functional’ units
• Degree distribution and hubs – scale-free topology
  – Some genes (hubs) contribute more to network structure; these are likely more important
• Small-world phenomenon
  – Perturbing the state of a given node can perturb other nodes and have consequences for the entire network
• Motifs
  – Likely associated with optimized biological function (e.g. negative feedback)
• Betweenness centrality
  – Nodes with high betweenness centrality tend to correlate with essentiality


S L I D E 18

Conventional network analysis is fraught with problems

• Databases are incomplete
• Some data is incorrect
• Investigative biases
• Annotation biases
• Inability to determine novel relationships
• Lack of spatiotemporal consideration
• Which databases to use? Which tools/methods to use?
• Consistency / reproducibility across methods

http://clair.si.umich.edu/~radev/cs6998/papers_to_replicate/nbt0108-69.pdf

S L I D E 19

• Both methods use the same general set of databases

• 2/10 String network nodes are found in the GeneMANIA network

• Different methods of weighting evidence

GeneMANIA (http://genemania.org/)

String (http://string-db.org/)


S L I D E 20

Building networks from expression data

• Genes with similar expression patterns are connected
  – Hypotheses:
    • Co-expressed genes function together
    • Co-expressed genes are likely co-regulated
• Overcomes many of the aforementioned issues with network analysis
  – Does not rely on divergent or heterogeneous databases
  – Ability to determine novel relationships
  – Spatiotemporal information utilized
  – Methods for determining co-expression networks are relatively simple, well established, and consistent (Pearson’s correlation)

S L I D E 21

Co-expression analysis seeks to group genes based on similarity of expression profiles

• Determine pairwise correlations between genes across a set of samples
• Connect genes with similar expression profiles (co-expressed genes)
• Group sets of highly connected genes

P. Langfelder, S. Horvath, WGCNA: an R package for weighted correlation network analysis, BMC Bioinformatics 9, 559 (2008).

S L I D E 22



S L I D E 23

Co-expression analysis can be bottom-up or top-down

• Bottom-up approach
  – Co-expressed genes are connected and grouped together by interconnectedness (unsupervised clustering)
  – Determine global system structure, emergent properties of the data
  – Useful for hypothesis-naïve approach to network construction
• Top-down approach
  – Start with a set of ‘seed’ genes and build outwards to determine local system
  – Useful for hypothesis-driven approach to network construction

S L I D E 24

Weighted gene co-expression network analysis (WGCNA)


S L I D E 25

WGCNA – Step 1 Network Construction

• Define the n × m matrix X = [x_il], where the row indices (i = 1, …, n) correspond to genes (nodes) and the column indices (l = 1, …, m) correspond to sample measurements

Matrix X of expression level (each row is a node profile):

          Samples 1 … m
Gene 1    2.5   5     10    15    20
Gene 2    20    15    10    5     2.5
Gene 3    2.5   5     10    15    20
Gene n    1     1     1     1     1

S L I D E 26


• Define the n × m matrix X = [x_il], where the row indices (i = 1, …, n) correspond to genes (nodes) and the column indices (l = 1, …, m) correspond to sample measurements
  – Correlation network methodology describes pairwise relationships (correlations) between the rows of X

[Matrix X of expression level, as on Slide 25]

Node profiles highlighted on this slide: positively correlated

S L I D E 27




[Matrix X of expression level, as on Slide 25]

Node profiles highlighted on this slide: negatively correlated

S L I D E 28




[Matrix X of expression level, as on Slide 25]

Node profiles highlighted on this slide: not correlated

S L I D E 29


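• Define co-expression similarity s_ij between genes i and j as
  – s_ij = |cor(x_i, x_j)|
    • e.g. cor(x_1, x_2) = -0.98, cor(x_1, x_3) = 1.00, cor(x_1, x_n) = -0.06, so s_1,2 = 0.98, s_1,3 = 1.00, s_1,n = 0.06

Matrix X of expression level (each row is a node profile):

          Samples 1 … m
Gene 1    2.5   5     10    15    20
Gene 2    20    15    10    5     2.5
Gene 3    2.5   5     10    15    20
Gene n    1     2     1     2     1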
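The correlations quoted on this slide can be reproduced with a few lines of code. This is a minimal sketch using a hand-written Pearson correlation on the four node profiles shown; the function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Fails with a division by zero if either profile is constant (zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def similarity(x, y):
    """Unsigned co-expression similarity s_ij = |cor(x_i, x_j)|."""
    return abs(pearson(x, y))


# Node profiles from the slide.
expression = {
    "Gene 1": [2.5, 5, 10, 15, 20],
    "Gene 2": [20, 15, 10, 5, 2.5],
    "Gene 3": [2.5, 5, 10, 15, 20],
    "Gene n": [1, 2, 1, 2, 1],
}

for other in ("Gene 2", "Gene 3", "Gene n"):
    r = pearson(expression["Gene 1"], expression[other])
    print(f"Gene 1 vs {other}: cor = {r:.2f}, s = {abs(r):.2f}")
```

Taking the absolute value treats strongly negative and strongly positive correlations as equally similar, which is why Gene 1 and Gene 2 end up connected.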

S L I D E 30

WGCNA – Step 1 Network Construction - unweighted


• Create adjacency matrix [a_ij] from all s_ij
  – Unweighted:
    a_ij = 1 if s_ij ≥ τ, 0 otherwise

Unweighted adjacency matrix:

     1  2  3  n
1    0  1  1  0
2    1  0  1  0
3    1  1  0  0
n    0  0  0  0
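A sketch of the hard-thresholding rule, applied to the similarities from Slide 29. The slide does not state a value for τ; τ = 0.8 is an assumption chosen only because it reproduces the matrix shown. The diagonal is set to 0, as in the slide's matrix.

```python
def unweighted_adjacency(similarity, tau):
    """a_ij = 1 if s_ij >= tau, else 0; the diagonal is set to 0 as on the slide."""
    n = len(similarity)
    return [
        [1 if i != j and similarity[i][j] >= tau else 0 for j in range(n)]
        for i in range(n)
    ]


# Similarities s_ij for genes 1, 2, 3 and n (values from Slides 29 and 31).
similarity = [
    [1.00, 0.98, 1.00, 0.06],
    [0.98, 1.00, 0.98, 0.06],
    [1.00, 0.98, 1.00, 0.06],
    [0.06, 0.06, 0.06, 1.00],
]

adjacency = unweighted_adjacency(similarity, tau=0.8)  # tau is an assumed value
for row in adjacency:
    print(row)
```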


S L I D E 31

WGCNA – Step 1 Network Construction - weighted


• Create adjacency matrix [a_ij] from all s_ij
  – Unweighted:
    a_ij = 1 if s_ij ≥ τ, 0 otherwise
  – Weighted:
    [a_ij] = [s_ij]
    OR
    a_ij = s_ij^β

Weighted adjacency matrix:

     1     2     3     n
1    0     0.98  1.00  0.06
2    0.98  0     0.98  0.06
3    1     0.98  0     0.06
n    0.06  0.06  0.06  0

Choose β as the lowest power for which the scale-free fit index is ≥ 0.90
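A sketch of the soft-thresholding rule a_ij = s_ij^β. With β = 1 it reproduces the matrix on this slide. β = 6 is used below purely as an illustrative value; the scale-free fit criterion for choosing β is not implemented here.

```python
def weighted_adjacency(similarity, beta=1):
    """a_ij = s_ij ** beta, with the diagonal set to 0 as on the slide."""
    n = len(similarity)
    return [
        [0.0 if i == j else similarity[i][j] ** beta for j in range(n)]
        for i in range(n)
    ]


# Similarities s_ij for genes 1, 2, 3 and n.
similarity = [
    [1.00, 0.98, 1.00, 0.06],
    [0.98, 1.00, 0.98, 0.06],
    [1.00, 0.98, 1.00, 0.06],
    [0.06, 0.06, 0.06, 1.00],
]

plain = weighted_adjacency(similarity, beta=1)  # [a_ij] = [s_ij]
soft = weighted_adjacency(similarity, beta=6)   # beta = 6 is an assumed example value

for row in soft:
    print([f"{value:.4f}" for value in row])
```

Raising to a power keeps strong similarities close to 1 while pushing weak ones toward 0, without discarding any edge outright as a hard threshold does.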

S L I D E 32

WGCNA – Step 2 Module Detection

• Define modules as clusters of densely connected genes
  – Determine network interconnectedness using the topological overlap measure (TOM)
    • A pair of nodes has high topological overlap if they are strongly connected to the same group of nodes
    • In gene networks, genes with high topological overlap are likely to be in the same biological pathway

[Figure: examples of high topological overlap and low topological overlap]
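The slide does not give a formula for TOM. The sketch below assumes the commonly used weighted definition, TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij = Σ_u a_iu a_uj over shared neighbors u and k_i is the connectivity of node i. It is applied to the weighted adjacency matrix from Slide 31, with weights assumed to lie in [0, 1].

```python
def topological_overlap(adjacency):
    """Topological overlap matrix for a symmetric adjacency matrix with weights in [0, 1]."""
    n = len(adjacency)
    # Connectivity k_i: sum of a node's edge weights, excluding the diagonal.
    k = [sum(adjacency[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                tom[i][j] = 1.0
                continue
            # l_ij: connection strength routed through shared neighbors u.
            shared = sum(
                adjacency[i][u] * adjacency[u][j]
                for u in range(n)
                if u != i and u != j
            )
            a_ij = adjacency[i][j]
            tom[i][j] = (shared + a_ij) / (min(k[i], k[j]) + 1 - a_ij)
    return tom


# Weighted adjacency matrix from Slide 31 (genes 1, 2, 3 and n).
adjacency = [
    [0.00, 0.98, 1.00, 0.06],
    [0.98, 0.00, 0.98, 0.06],
    [1.00, 0.98, 0.00, 0.06],
    [0.06, 0.06, 0.06, 0.00],
]

tom = topological_overlap(adjacency)
for row in tom:
    print([f"{value:.3f}" for value in row])
```

Genes 1, 2 and 3 share strong neighbors and so have high overlap with one another, while gene n has low overlap with all of them.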

S L I D E 33

WGCNA – Step 2 Module Detection

• Convert TOM to a dissimilarity measure (1 - TOM) and identify modules using unsupervised hierarchical clustering and a branch-cutting algorithm
  – Modules correspond to sets of rows of X that are highly correlated (low dissimilarity measure)

Weighted adjacency matrix (a module is marked on the slide):

     1     2     3     n
1    0     0.98  1.00  0.06
2    0.98  0     0.98  0.06
3    1     0.98  0     0.06
n    0.06  0.06  0.06  0
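A simplified sketch of module detection: average-linkage agglomerative clustering on a dissimilarity matrix, stopped at a fixed height. This is a stand-in for illustration only; the branch-cutting algorithm mentioned on the slide is more elaborate than a fixed-height cut. The dissimilarity values and the cut height are hypothetical, chosen to mimic 1 - TOM for three tightly connected genes and one unrelated gene.

```python
def average_linkage_clusters(dissimilarity, cut_height):
    """Merge the two closest clusters repeatedly until the closest pair exceeds cut_height."""
    clusters = [[i] for i in range(len(dissimilarity))]

    def distance(c1, c2):
        # Average pairwise dissimilarity between members of two clusters.
        return sum(dissimilarity[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, a, b = min(
            (distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cut_height:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return clusters


# Hypothetical 1 - TOM values for genes 0, 1, 2 (tightly connected) and gene 3.
dissimilarity = [
    [0.00, 0.04, 0.04, 0.84],
    [0.04, 0.00, 0.04, 0.84],
    [0.04, 0.04, 0.00, 0.84],
    [0.84, 0.84, 0.84, 0.00],
]

modules = average_linkage_clusters(dissimilarity, cut_height=0.5)  # assumed cut height
print(sorted(sorted(m) for m in modules))
```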

S L I D E 34

S L I D E 35

S L I D E 36

WGCNA – Step 3 Relate modules to external data and identify important genes

• Define sample trait T as a vector with m components, T = (T_1, …, T_m), that correspond to the columns (samples) of the matrix X
  – Trait-based node significance measure (GS_i) can be defined as
    • GS_i = |cor(x_i, T)|
  – We can prioritize genes by significance measure and modules by average gene significance measure
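A sketch of the gene significance calculation on the node profiles from Slide 29. The trait vector below is hypothetical (the slides do not give one), and the module membership (genes 1-3) is assumed for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (fails on constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Node profiles from Slide 29.
expression = {
    "Gene 1": [2.5, 5, 10, 15, 20],
    "Gene 2": [20, 15, 10, 5, 2.5],
    "Gene 3": [2.5, 5, 10, 15, 20],
    "Gene n": [1, 2, 1, 2, 1],
}
trait = [1, 2, 3, 4, 5]  # hypothetical sample trait T, one value per sample

# GS_i = |cor(x_i, T)| for every gene.
gene_significance = {gene: abs(pearson(profile, trait)) for gene, profile in expression.items()}

# Prioritize genes by GS, and score a module by its average GS.
ranked = sorted(gene_significance, key=gene_significance.get, reverse=True)
module = ["Gene 1", "Gene 2", "Gene 3"]  # assumed module membership
module_significance = sum(gene_significance[g] for g in module) / len(module)

print(ranked)
print(round(module_significance, 3))
```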


S L I D E 37

S L I D E 38

Gene significance and module membership are correlated

S L I D E 39

WGCNA – Step 3 Relate modules to external data and identify important genes

• Define sample trait T as a vector with m components, T = (T_1, …, T_m), that correspond to the columns (samples) of the matrix X
  – Trait-based node significance measure (GS_i) can be defined as
    • GS_i = |cor(x_i, T)|
  – We can prioritize genes by significance measure and modules by average gene significance measure
• Can also examine gene ontology enrichment, burden of disease loci (GWAS, known mutations, etc.)


S L I D E 40

WGCNA – Step 4 Study module relationships

• Define the module eigengene E as the first principal component of a given module
  – Considered representative of the gene expression profiles in a module
• Rationale is to understand how modules interact; it also reduces the data and the number of multiple comparisons
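A sketch of a module eigengene computed as the first principal component across samples, using power iteration on the standardized profiles of the module's genes. The module membership (genes 1-3 from Slide 29) and the choice to orient the sign along the average standardized profile are assumptions made for illustration.

```python
from math import sqrt


def standardize(profile):
    """Center a profile to mean 0 and scale to unit sample variance (fails if constant)."""
    n = len(profile)
    mean = sum(profile) / n
    sd = sqrt(sum((v - mean) ** 2 for v in profile) / (n - 1))
    return [(v - mean) / sd for v in profile]


def module_eigengene(module_rows, iterations=200):
    """First principal component (one value per sample) of a genes x samples module."""
    z = [standardize(row) for row in module_rows]
    m = len(z[0])
    # m x m matrix Z^T Z; its leading eigenvector is the eigengene direction.
    gram = [[sum(row[a] * row[b] for row in z) for b in range(m)] for a in range(m)]
    vec = z[0][:]  # start from one gene's profile
    for _ in range(iterations):
        nxt = [sum(gram[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    # The sign of a principal component is arbitrary; orient it along the mean profile.
    mean_profile = [sum(row[a] for row in z) / len(z) for a in range(m)]
    if sum(v * p for v, p in zip(vec, mean_profile)) < 0:
        vec = [-v for v in vec]
    return vec


# Assumed module: genes 1, 2 and 3 from Slide 29.
module = [
    [2.5, 5, 10, 15, 20],
    [20, 15, 10, 5, 2.5],
    [2.5, 5, 10, 15, 20],
]
eigengene = module_eigengene(module)
print([round(v, 3) for v in eigengene])
```

The eigengene is a single profile of length m, so module-to-module and module-to-trait comparisons need one test per module rather than one per gene.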


S L I D E 41

Clustering of eigengenes identifies meta-modules and trait associations

S L I D E 42

WGCNA – Step 5 Identify key drivers in interesting modules

• Output from Steps 1-4
  – Candidate modules
  – Candidate genes within these modules
• Need hypothesis-driven experimental validation
  – Additional clinical data or follow up in patients
  – Targeted sequencing of candidate genes
  – Perturbation of key genes (hubs) in human cell lines or model organisms
  – Build networks with alternative methods and data and examine convergence


S L I D E 43

WGCNA Example 1

Nature 478, 483–489 (2011).

S L I D E 44

The dataset is a comprehensive map of gene expression patterns in the developing human brain

• Whole transcriptome profiling across 1,340 tissue samples collected from 57 developing and adult post-mortem brains of clinically unremarkable donors (males & females of multiple ethnicities)
  – Samples from transient prenatal structures and immature and mature forms of 16 brain regions (11 neocortical, 5 non-neocortical) from each brain
• N=57 (39 with both hemispheres)
• Age: 5.7 weeks post-conception to 82 years
• Sex: 31 males and 26 females
• Post-mortem interval 12.11 ± 8.63 hours
• pH 6.45 ± 0.34
• Total RNA extracted from each sample (RIN 8.83 ± 0.93)
• Gene expression assessed with the Affymetrix GeneChip Human Exon 1.0 ST Array platform
  – Comprehensive coverage of the human genome, 1.4 million probe sets assaying expression across entire transcripts and individual exons

Kang, H. J. et al. Spatio-temporal transcriptome of the human brain. Nature 478, 483–489 (2011).

S L I D E 45

WGCNA performed on the multidimensional spatio-temporal dataset identified 29 modules

• General quality control
  – No large-scale structural abnormalities identified by genotyping
  – Hierarchical clustering
    • Remove outliers and ensure clustering by region and time, not by covariates
  – Averaged Spearman correlation coefficient of a given brain region / NCX area calculated for each period
    • Remove outliers
• WGCNA data cleaning steps:
  – Brain-expressed genes only: log2(intensity) > 6 in at least 1 sample
  – Coefficient of variation > 0.08
  – Total of 9,093 genes fit these criteria
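A sketch of the two gene-level filters listed above. The slide does not say whether the coefficient of variation was computed on log2 or linear intensities; the code assumes log2 values. The gene names and intensities are made up for illustration.

```python
from math import sqrt


def passes_filters(log2_intensities, min_log2=6.0, min_cv=0.08):
    """Keep a gene if it is expressed in at least one sample and is variable enough."""
    expressed = any(v > min_log2 for v in log2_intensities)
    n = len(log2_intensities)
    mean = sum(log2_intensities) / n
    if mean == 0:
        return False
    sd = sqrt(sum((v - mean) ** 2 for v in log2_intensities) / (n - 1))
    cv = sd / mean  # coefficient of variation, here on log2 values (an assumption)
    return expressed and cv > min_cv


# Hypothetical log2 intensities for three genes across three samples.
data = {
    "GENE_A": [5.0, 6.5, 8.0],  # expressed and variable: kept
    "GENE_B": [4.0, 4.5, 5.0],  # never above 6: dropped
    "GENE_C": [7.0, 7.1, 7.0],  # expressed but nearly constant: dropped
}

kept = [gene for gene, values in data.items() if passes_filters(values)]
print(kept)
```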

S L I D E 46

Module M8 may be important for development of neocortical and hippocampal projection neurons

24 genes
Gene ontology enrichment:
- Neuronal differentiation p* = 0.008
- Transcription factors p* = 0.005

*Bonferroni-adjusted

Hub genes include transcription factors TBR1, FEZF2, FOXG1, SATB2, NEUROD6 and EMX1, which are functionally implicated in the development of NCX and HIP projection neurons

FOXG1 variants have also been linked to Rett syndrome and intellectual disability

S L I D E 47

Module M15 may be important for neurotransmission

310 genes
Gene ontology enrichment:
- Ionic channels p* = 8.0 × 10^-8
- Neuroactive ligand-receptor interaction p* = 4.0 × 10^-14

*Bonferroni-adjusted

Sequence variants in hub genes are linked to major depression (GDA) and to schizophrenia and affective disorders (NRGN and RGS4)

S L I D E 48

Modules M20 and M2 have opposite trajectories and drastic changes near birth

Module M20
GO enrichment for
- zinc-finger proteins (P = 7.3 × 10^-48)
- transcription factors (P = 4.8 × 10^-50)

Module M2
GO enrichment for
- membrane proteins (P = 1.8 × 10^-21)
- calcium signalling (P = 8.1 × 10^-10)
- synaptic transmission (P = 1.6 × 10^-6)
- neuroactive ligand–receptor interaction (P = 4.1 × 10^-4)

S L I D E 49

Conclusions

• Module of genes related to development of neocortical and hippocampal projection neurons identified
  – Hub genes indicate important genes in this process
  – Module may be relevant to Rett syndrome and intellectual disability
• Module of genes related to neurotransmission also identified
  – Module may be relevant to other neuropsychiatric disorders like schizophrenia and major depression
• Genes in these modules (particularly hub genes) are candidates for causal association with disease

S L I D E 50

WGCNA Example 2

Nature 474, 380–384 (2011).

S L I D E 51

Analysis of gene expression in post-mortem brain tissue from autism cases and matched controls

• Whole transcriptome profiling of 19 cases and 17 controls in 3 brain regions
  – Superior temporal gyrus (STG), prefrontal cortex (pFC) and cerebellar vermis (CV)
  – Samples genotyped and screened for structural variation
  – Transcriptome assessed with Illumina microarrays
• Data quality control criteria
  – High inter-array correlation (Pearson correlation coefficients > 0.85)
  – Detection of outlier arrays based on mean inter-array correlation and hierarchical clustering
  – Probes considered robustly expressed if the detection P value < 0.05 for at least half the samples in the data set
  – 58 cortex samples (29 autism, 29 control) and 21 cerebellum samples (11 autism, 10 control) passed the QC steps and were used for WGCNA

S L I D E 52

Co-expression network was created using data from cases and controls

• WGCNA analysis grouped genes into modules and determined module eigengenes

• Eigengene correlation to disease status assessed (as well as other potential covariates and confounders)

• Two network modules with eigengenes highly correlated with disease status (and no confounding variables)
  – M12 module significance p = 3 × 10^-4
  – M16 module significance p = 4 × 10^-3

S L I D E 53

Module M12 highly enriched for neuronal markers

• Significant enrichment for a list of experimentally defined neuron-specific markers (p = 9.33 × 10^-37)

• Also GO enrichment for categories involved in synaptic function, vesicular transport and neuronal projection

• Module downregulated in cases

S L I D E 54

Module M16 enriched for markers of astrocytes and markers of activated microglia

• Astrocyte markers (p = 1.4 × 10^-37), activated microglia markers (p = 5 × 10^-3)
• Also GO enrichment for categories involved in immune & inflammatory gene function
• Module upregulated in cases

S L I D E 55

M12 appears to be causally involved in ASD pathogenesis

• M12 but not M16 significantly enriched for autism genetic association signals (p = 5 × 10^-4 vs. 0.95)
• M12 also has significant overrepresentation of known autism susceptibility genes (p = 6.1 × 10^-4)
• M12 downregulation likely causally associated with disease
• M16 upregulation in cases has no common genetic component
  – May be secondary to disease or caused by environmental factors

S L I D E 56

Conclusions

• Two modules, M12 and M16, are significantly correlated with disease status

• Only module M12 appears to be causally involved in pathogenesis– Hub genes are strongest candidates for follow up

• Co-expression analysis generates testable hypotheses!

S L I D E 57

Pitfalls of co-expression analysis

• Indirect links between genes
• Incidental correlations
• Resolution
• Need dimensionality to data
• Need large datasets
• Outliers may drive false correlations

S L I D E 58

Appendix – Network analysis tools and software

http://www.cs.rice.edu/~nakhleh/COMP572/NetworkResources.html
