Bioinformatics and the logic of life


Description: How can bioinformatics reveal the logic of life?


Page 1: Bioinformatics and the logic of life

Bioinformatics to reveal the logic of life

M. Gonzalo Claros Díaz

Dpto Biología Molecular y Bioquímica Plataforma Andaluza de Bioinformática


Centro de Bioinnovación

http://about.me/mgclaros/

@MGClaros

Page 2: Bioinformatics and the logic of life

http://www.scbi.uma.es

The meaning/logic of life


Page 3: Bioinformatics and the logic of life

There are many reflections about life

Genetics

Philosophy

Religion

Physics

And many more

Page 4: Bioinformatics and the logic of life

A living being for some scientists

The cell is a kind of black box

Page 5: Bioinformatics and the logic of life

Molecular biology provides some logic…

How can we select the few combinations that make sense?

Page 6: Bioinformatics and the logic of life

A hierarchical logic…

The way back cannot be predicted

Page 7: Bioinformatics and the logic of life

In fact, a complex logic full of interactions

Page 8: Bioinformatics and the logic of life

Metabolism offers another source of logic

Page 9: Bioinformatics and the logic of life

Other sciences were also interested in the logic of life

Page 10: Bioinformatics and the logic of life

Bioinformatics = integration

http://bioinformatics.biol.ntnu.edu.tw/sher/Teaching.html

Page 11: Bioinformatics and the logic of life

Bioinformatics receives and gives new data and insights

Biology

Computer science

Statistics

The living being is the result of all observations and cannot be inferred from biased observations

Page 12: Bioinformatics and the logic of life

A living being for some scientists

The cell is a kind of black box

Page 13: Bioinformatics and the logic of life

A living being for a bioinformatician

Life ontology

Page 14: Bioinformatics and the logic of life

So, we begin to understand

Bioinformatician

Biotechnologist

Other scientists

Page 15: Bioinformatics and the logic of life

Bioinformatics emerged with data accumulation

Page 16: Bioinformatics and the logic of life

Regarding data, informatics lags behind biology

Page 17: Bioinformatics and the logic of life

Therefore, biology and informatics are interdependent

http://www.genomicglossaries.com/presentation/SLAgenomics.asp

Page 18: Bioinformatics and the logic of life

Without mobility issues

Page 19: Bioinformatics and the logic of life

Some logic in living beings based on bioinformatics


Page 20: Bioinformatics and the logic of life

Bioinformatics integration in alcohol-induced disorders

http://pubs.niaaa.nih.gov/publications/arh311/5-11.htm

Through integration and modeling, these studies would allow us to better exploit the complexity of genomic and functional genomic data and to extract their biological and clinical significance

Page 21: Bioinformatics and the logic of life

Drug discovery was expensive

Classic approach

Experimental drugs were chemically synthesized and then tested in animals

Page 22: Bioinformatics and the logic of life

Drug discovery was expensive

Classic approach

Experimental drugs were chemically synthesized and then tested in animals

Bioinformatics approach

Only candidate drugs are synthesized. A cost-effective approach

Ligand database

Page 23: Bioinformatics and the logic of life

Nobel Prize in Chemistry 2013

For the development of computational models to understand and predict chemical processes

Theoretical chemist

Biophysicist

Biochemist

http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/

Biochemist

Page 24: Bioinformatics and the logic of life

Nobel Prize in Chemistry 2013

For the development of computational models to understand and predict chemical processes

Theoretical chemist

Biophysicist

Biochemist

http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/

Biochemist

This Nobel Prize is the first given to work in computational biology, indicating that the field has matured and is on a par with experimental biology

The blog of PLOS Computational Biology

Page 25: Bioinformatics and the logic of life

A cell was full of molecular cascades

Divergent cascades

Convergent cascades

Page 26: Bioinformatics and the logic of life

Then, a cell was a subway map

Subway map designed by Claudia Bentley. Web design by Nick Allin. Edited by Cath Brooksbank and Sandra Clark. © 2002 Nature Publishing Group.

http://www.nature.com/nrc/poster/subpathways/index.html

Page 27: Bioinformatics and the logic of life

Finally, a cell is a network

Cell network complexity increases with whole organism complexity. Key nodes revealed key functions.

Page 28: Bioinformatics and the logic of life

allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations.

Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition.

The number of TFs varies among organisms, although it appears to be linked to the organism’s complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the growth of certain transcription factor families [25–28]. Here, a domain can be defined as a protein substructure that can fold independently into a compact structure. Different domains of a protein are often associated with different functions [29,30].

When dealing with TF networks, several relevant questions arise. How are these factors distributed and related through the network structure? How important has the protein domain universe been in shaping the network? Analysis of global patterns of network organization is required to answer these questions.

To this end, we explored, for the first time, the human transcription factor network (HTFN) obtained from the protein–protein interaction information available in the TRANSFAC database [31], using novel tools of network analysis. We show that this approximation allows us to propose evolutionary considerations concerning the mechanisms shaping network architecture.

Results and Discussion

Topological analysis

Data compilation from the TRANSFAC transcription factor database provided 1370 human entries. After filtering according to criteria given in Experimental Procedures, a graph of N = 230 interacting human TFs was obtained (Fig. 1). This can be understood as the architecture of the regulatory backbone. It provides a topological view of the interaction patterns among the elements responsible for gene expression. This corresponds to the protein hardware that carries out genomic instructions. The remaining TFs contained in the database did not form subgraphs and appeared isolated. The relatively small size of the connected graph compared with all the entries in the database might be due, at least in part, to the current degree of knowledge of this transcriptional regulatory network, with only sparse data for many of its components. Although a number of possible sources of bias are present, it is worth noting that the topological pattern of organization reported from different sources of protein–protein interactions seems consistent [32].

Topological analysis of HTFN is summarized in Table 1 showing that HTFN is a sparse, small-world graph. The degree distribution (Fig. 2A) and clustering (Fig. 2B) show a heterogeneous, skewed shape reminding us of a power–law behaviour, indicating that most TFs are linked to only a few others, whereas a handful of them have many connections. The average betweenness centrality (b) shows well-defined power–law

Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NF-κB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.
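The two measures named in the topological analysis, degree distribution and clustering, can be illustrated with a minimal sketch. The graph below is a toy example with hypothetical nodes and edges (it is not TRANSFAC data), and the function names are assumptions made for illustration only. It shows the qualitative pattern the excerpt describes: most nodes have few links while one hub has many.

```python
from collections import Counter
from itertools import combinations

# Toy undirected interaction graph (hypothetical edges, NOT TRANSFAC data):
# one hub ("H") linked to many partners, plus a few peripheral links.
EDGES = [
    ("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("H", "e"),
    ("a", "b"), ("e", "f"), ("f", "g"),
]

def build_adjacency(edges):
    """Return a dict mapping each node to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """Count how many nodes have each degree k."""
    return Counter(len(neigh) for neigh in adj.values())

def clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)

adj = build_adjacency(EDGES)
dist = degree_distribution(adj)
for k in sorted(dist):
    print(f"degree {k}: {dist[k]} node(s)")
print("clustering of hub H:", clustering(adj, "H"))
```

In this toy graph the hub has a low clustering coefficient because its partners are mostly not linked to each other, which is the kind of heterogeneity a skewed degree distribution suggests.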

Human transcription factor network topology C. Rodriguez-Caso et al.

6424 FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors Journal compilation © 2005 FEBS

Transcription factor network explains some cancers


Topology, tinkering and evolution of the human transcription factor network

Carlos Rodriguez-Caso¹,², Miguel A. Medina² and Ricard V. Sole¹,³

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are defined by complex networks of interacting components. Such networks show high heterogeneity and are typically modular and hierarchical [2,3]. Genome-wide gene expression and protein analyses provide new, powerful tools for the study of such complex biological phenomena [4–6] and new, more integrative views are required to properly interpret them [7]. Such an integrative approach is obtained by mapping molecular interactions into a network, as is the case for metabolic and signalling pathways. In this context, biological databases provide a unique opportunity to characterize biological networks under a systems perspective.

Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae [9] and, more recently, in Drosophila melanogaster [13] and Caenorhabditis elegans [10]. The networks have a nontrivial organization that departs strongly from simple, random homogeneous metaphors [2]. The network structure involves a nested hierarchy of levels, from large-scale features to modules and motifs [1,14]. This is particularly true for protein interaction maps and gene regulatory nets, which different evolutionary forces from convergent evolution [15] to dynamical constraints [16,17] have helped shape. In this context, protein–protein interactions play an essential role in regulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein interaction; tinkering; transcription factor network

Correspondence

Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.

Abbreviations

ER, Erdos-Renyi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.


or via control of TF expression, less connected factors may also be relevant to cell survival.

Functional and structural patterns from topology

In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the ‘beta-scaffold factor with minor groove contact’ (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.

Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind to one another, as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Attending to this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].

From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
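The argument that shared self-interacting domains can produce a highly interconnected subgraph can be sketched under a strong simplifying assumption: every pair of proteins carrying the same self-interacting domain is treated as a potential interaction. The protein names and domain annotations below are hypothetical, and real dimerization is far more selective than this idealization.

```python
from itertools import combinations

# Hypothetical proteins annotated with one domain each (illustrative only).
PROTEINS = {
    "P1": "bHLH", "P2": "bHLH", "P3": "bHLH", "P4": "bHLH",
    "P5": "other", "P6": "other",
}
# Domains assumed, for this sketch, to be able to dimerize with themselves.
SELF_INTERACTING = {"bHLH", "bZip"}

def potential_edges(proteins, self_interacting):
    """Link every pair of proteins that share a self-interacting domain."""
    return {
        frozenset((p, q))
        for p, q in combinations(sorted(proteins), 2)
        if proteins[p] == proteins[q] and proteins[p] in self_interacting
    }

def density(nodes, edges):
    """Fraction of possible pairs among `nodes` that are linked."""
    nodes = list(nodes)
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0
    present = sum(
        1 for p, q in combinations(nodes, 2) if frozenset((p, q)) in edges
    )
    return present / possible

edges = potential_edges(PROTEINS, SELF_INTERACTING)
bhlh = [p for p, d in PROTEINS.items() if d == "bHLH"]
module_density = density(bhlh, edges)       # links inside the bHLH group
overall_density = density(PROTEINS, edges)  # links across the whole toy set
print(module_density, overall_density)
```

Under this assumption the domain family forms a fully connected module while the graph as a whole stays sparse, which is the intuition behind reading high clustering as a by-product of self-interacting domains.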

We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].

An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity

Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).

TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
p300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2
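As a small worked example, the hub values in Table 2 can be ranked by each of the two measures to see that degree and betweenness do not order the hubs identically. The numbers are copied from the table; the variable names and the ASCII spelling "RXR-alpha" are choices made for this sketch.

```python
# (name, degree k, betweenness b x 10^3), copied from Table 2.
HUBS = [
    ("TBP", 27, 17.3), ("p53", 23, 18.5), ("p300", 18, 20.2),
    ("RXR-alpha", 18, 8.0), ("pRB", 15, 27.1), ("RelA", 14, 6.6),
    ("c-jun", 14, 4.1), ("c-myc", 13, 10.5), ("c-fos", 12, 2.0),
]

# Rank hub names from highest to lowest degree, then by betweenness.
by_degree = [name for name, k, b in sorted(HUBS, key=lambda h: h[1], reverse=True)]
by_betweenness = [name for name, k, b in sorted(HUBS, key=lambda h: h[2], reverse=True)]

print("top by degree:     ", by_degree[:3])
print("top by betweenness:", by_betweenness[:3])
```

TBP has the most connections, whereas pRB, with fewer connections, has the highest betweenness in the table, so the two rankings highlight different proteins.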


Page 29: Bioinformatics and the logic of life


Transcription factor network explains some cancers

26

Topology, tinkering and evolution of the humantranscription factor networkCarlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are defined by complex networks of interacting components. Such networks show high heterogeneity and are typically modular and hierarchical [2,3]. Genome-wide gene expression and protein analyses provide new, powerful tools for the study of such complex biological phenomena [4–6] and new, more integrative views are required to properly interpret them [7]. Such an integrative approach is obtained by mapping molecular interactions into a network, as is the case for metabolic and signalling pathways. In this context, biological databases provide a unique opportunity to characterize biological networks under a systems perspective.

Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae [9] and, more recently, in Drosophila melanogaster [13] and Caenorhabditis elegans [10]. The networks have a nontrivial organization that departs strongly from simple, random homogeneous metaphors [2]. The network structure involves a nested hierarchy of levels, from large-scale features to modules and motifs [1,14]. This is particularly true for protein interaction maps and gene regulatory nets, which different evolutionary forces from convergent evolution [15] to dynamical constraints [16,17] have helped shape. In this context, protein–protein interactions play an essential role in regulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein interaction; tinkering; transcription factor network

Correspondence

Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.

Abbreviations

ER, Erdos-Renyi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.


or via control of TF expression, less connected factors may also be relevant to cell survival.

Functional and structural patterns from topology

In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the ‘beta-scaffold factor with minor groove contact’ (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.

Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind to one another, as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Following this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].

From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
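The link between shared dimerization domains and high clustering can be made concrete with the local clustering coefficient, the fraction of a node's neighbour pairs that are themselves connected. The sketch below is only an illustration under stated assumptions: the node names are hypothetical, and the two toy graphs (a fully connected group of mutually dimerizing factors and a hub with unconnected partners) are idealizations, not HTFN data.

```python
from itertools import combinations

def clustering(adj, node):
    """Local clustering coefficient: linked neighbour pairs / possible pairs."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

def clique(names):
    """Every factor can dimerize with every other (idealized shared domain)."""
    return {n: set(names) - {n} for n in names}

def star(center, leaves):
    """A hub whose partners do not interact with each other."""
    adj = {leaf: {center} for leaf in leaves}
    adj[center] = set(leaves)
    return adj

# Hypothetical factors sharing one self-interacting domain.
dimerizers = clique(["tf1", "tf2", "tf3", "tf4"])
# Hypothetical hub with four unrelated partners.
hub = star("hub", ["x1", "x2", "x3", "x4"])

c_clique = clustering(dimerizers, "tf1")  # every neighbour pair is linked
c_star = clustering(hub, "hub")           # no neighbour pair is linked
```

A group in which every member can bind every other gives the maximum coefficient, whereas a hub with mutually unconnected partners gives the minimum, which is the contrast the paragraph draws between domain-driven modules and hub-centred connections.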

We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].
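The text does not spell out the topological overlap measure it uses. As a hedged illustration, the sketch below assumes one commonly cited definition: the number of neighbours two nodes share, plus one if they are directly linked, divided by the smaller of their two degrees. The toy graph (two triangles joined by a bridge) and the function name `topological_overlap` are hypothetical, and the paper's exact variant may differ.

```python
def topological_overlap(adj, i, j):
    """Assumed definition: (shared neighbours + direct link) / min(degree_i, degree_j)."""
    k_min = min(len(adj[i]), len(adj[j]))
    if k_min == 0:
        return 0.0  # isolated nodes overlap with nothing
    shared = len(adj[i] & adj[j])
    direct = 1 if j in adj[i] else 0
    return (shared + direct) / k_min

# Hypothetical toy graph: triangle A-B-C and triangle D-E-F joined by edge C-D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}

within_module = topological_overlap(adj, "A", "B")  # same triangle
bridge = topological_overlap(adj, "C", "D")         # linked, but no shared neighbours
across = topological_overlap(adj, "A", "E")         # different triangles, not linked
```

Pairs inside the same triangle score highest and pairs in different triangles score lowest, so clustering nodes on this measure would recover the two modules; a dendrogram cut such as the dashed line mentioned above would then define the groups.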

An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity

Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).

TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
p300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2


At least 9 transcription factors can lead to cancer if their function is affected

Page 30: Bioinformatics and the logic of life

http://www.scbi.uma.es

allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations.

Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition.

The number of TFs varies among organisms, although it appears to be linked to the organism’s complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the growth of certain transcription factor families [25–28]. Here, a domain can be defined as a protein substructure that can fold independently into a compact structure. Different domains of a protein are often associated with different functions [29,30].

When dealing with TF networks, several relevant questions arise. How are these factors distributed and related through the network structure? How important has the protein domain universe been in shaping the network? Analysis of global patterns of network organization is required to answer these questions.

To this end, we explored, for the first time, the human transcription factor network (HTFN) obtained from the protein–protein interaction information available in the TRANSFAC database [31], using novel tools of network analysis. We show that this approximation allows us to propose evolutionary considerations concerning the mechanisms shaping network architecture.

Results and Discussion

Topological analysis

Data compilation from the TRANSFAC transcription factor database provided 1370 human entries. After filtering according to criteria given in Experimental Procedures, a graph of N = 230 interacting human TFs was obtained (Fig. 1). This can be understood as the architecture of the regulatory backbone. It provides a topological view of the interaction patterns among the elements responsible for gene expression. This corresponds to the protein hardware that carries out genomic instructions. The remaining TFs contained in the database did not form subgraphs and appeared isolated. The relatively small size of the connected graph compared with all the entries in the database might be due, at least in part, to the current degree of knowledge of this transcriptional regulatory network, with only sparse data for many of its components. Although a number of possible sources of bias are present, it is worth noting that the topological pattern of organization reported from different sources of protein–protein interactions seems consistent [32].

Topological analysis of HTFN is summarized in Table 1 showing that HTFN is a sparse, small-world graph. The degree distribution (Fig. 2A) and clustering (Fig. 2B) show a heterogeneous, skewed shape reminding us of a power–law behaviour, indicating that most TFs are linked to only a few others, whereas a handful of them have many connections. The average betweenness centrality (b) shows well-defined power–law

Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NFκB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.


Transcription factor network explains some cancers

26

Topology, tinkering and evolution of the humantranscription factor networkCarlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of differ-ent molecules interacting with each other to yield com-plex spatial and temporal patterns. Unfortunately, thisreality is seldom captured by traditional and molecularbiology approaches. A shift from molecular to modularbiology seems unavoidable [1] as biological systems aredefined by complex networks of interacting compo-nents. Such networks show high heterogeneity and aretypically modular and hierarchical [2,3]. Genome-widegene expression and protein analyses provide new,powerful tools for the study of such complex biologicalphenomena [4–6] and new, more integrative views arerequired to properly interpret them [7]. Such an inte-grative approach is obtained by mapping molecularinteractions into a network, as is the case for metabolicand signalling pathways. In this context, biologicaldatabases provide a unique opportunity to characterizebiological networks under a systems perspective.

Early topological studies of cellular networksrevealed that genomic, proteomic and metabolic mapsshare characteristic features with other real-worldnetworks [8–12]. Protein networks, also called inter-actomes, were studied thanks to a massive two-hybridsystem screening in unicellular Saccharomyces cerevisiae[9] and, more recently, in Drosophila melanogaster [13]and Caenorhabditis elegans [10]. The networks have anontrivial organization that departs strongly from sim-ple, random homogeneous metaphors [2]. The networkstructure involves a nested hierarchy of levels, fromlarge-scale features to modules and motifs [1,14]. Thisis particularly true for protein interaction maps andgene regulatory nets, which different evolutionary for-ces from convergent evolution [15] to dynamical con-straints [16,17] have helped shape. In this context,protein–protein interactions play an essential role inregulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein

interaction; tinkering; transcription factor

network

Correspondence

Ricard V. Sole, ICREA - Complex System

Laboratory, Universitat Pompeu Fabra,

Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25

October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogene-ous networks. Their architecture has been suggested to be of relevance inunderstanding the interactome and its functional organization, which per-vades cellular robustness. Transcription factors are particularly relevant inthis context, given their central role in gene regulation. Here we present thefirst topological study of the human protein–protein interacting transcrip-tion factor network built using the TRANSFAC database. We show thatthe network exhibits scale-free and small-world properties with a hierarchi-cal and modular structure, which is built around a small number of keyproteins. Most of these proteins are associated with proliferative diseasesand are typically not linked to each other, thus reducing the propagationof failures through compartmentalization. Network modularity is consistentwith common structural and functional features and the features are gener-ated by two distinct evolutionary strategies: amplification and shuffling ofinteracting domains through tinkering and acquisition of specific interact-ing regions. The function of the regulatory complexes may have played anactive role in choosing one of them.

Abbreviations

ER, Erdos-Renyi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.

FEBS Journal 272 (2005) 6423–6434 ª 2005 The Authors Journal compilation ª 2005 FEBS 6423

Topology, tinkering and evolution of the humantranscription factor networkCarlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of differ-ent molecules interacting with each other to yield com-plex spatial and temporal patterns. Unfortunately, thisreality is seldom captured by traditional and molecularbiology approaches. A shift from molecular to modularbiology seems unavoidable [1] as biological systems aredefined by complex networks of interacting compo-nents. Such networks show high heterogeneity and aretypically modular and hierarchical [2,3]. Genome-widegene expression and protein analyses provide new,powerful tools for the study of such complex biologicalphenomena [4–6] and new, more integrative views arerequired to properly interpret them [7]. Such an inte-grative approach is obtained by mapping molecularinteractions into a network, as is the case for metabolicand signalling pathways. In this context, biologicaldatabases provide a unique opportunity to characterizebiological networks under a systems perspective.

Early topological studies of cellular networksrevealed that genomic, proteomic and metabolic mapsshare characteristic features with other real-worldnetworks [8–12]. Protein networks, also called inter-actomes, were studied thanks to a massive two-hybridsystem screening in unicellular Saccharomyces cerevisiae[9] and, more recently, in Drosophila melanogaster [13]and Caenorhabditis elegans [10]. The networks have anontrivial organization that departs strongly from sim-ple, random homogeneous metaphors [2]. The networkstructure involves a nested hierarchy of levels, fromlarge-scale features to modules and motifs [1,14]. Thisis particularly true for protein interaction maps andgene regulatory nets, which different evolutionary for-ces from convergent evolution [15] to dynamical con-straints [16,17] have helped shape. In this context,protein–protein interactions play an essential role inregulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein

interaction; tinkering; transcription factor

network

Correspondence

Ricard V. Sole, ICREA - Complex System

Laboratory, Universitat Pompeu Fabra,

Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25

October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogene-ous networks. Their architecture has been suggested to be of relevance inunderstanding the interactome and its functional organization, which per-vades cellular robustness. Transcription factors are particularly relevant inthis context, given their central role in gene regulation. Here we present thefirst topological study of the human protein–protein interacting transcrip-tion factor network built using the TRANSFAC database. We show thatthe network exhibits scale-free and small-world properties with a hierarchi-cal and modular structure, which is built around a small number of keyproteins. Most of these proteins are associated with proliferative diseasesand are typically not linked to each other, thus reducing the propagationof failures through compartmentalization. Network modularity is consistentwith common structural and functional features and the features are gener-ated by two distinct evolutionary strategies: amplification and shuffling ofinteracting domains through tinkering and acquisition of specific interact-ing regions. The function of the regulatory complexes may have played anactive role in choosing one of them.

Abbreviations

ER, Erdos-Renyi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.

FEBS Journal 272 (2005) 6423–6434 ª 2005 The Authors Journal compilation ª 2005 FEBS 6423

Topology, tinkering and evolution of the humantranscription factor networkCarlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3

1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain

2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain

3 Santa Fe Institute, Santa Fe, New Mexico, USA

Living cells are composed of a large number of differ-ent molecules interacting with each other to yield com-plex spatial and temporal patterns. Unfortunately, thisreality is seldom captured by traditional and molecularbiology approaches. A shift from molecular to modularbiology seems unavoidable [1] as biological systems aredefined by complex networks of interacting compo-nents. Such networks show high heterogeneity and aretypically modular and hierarchical [2,3]. Genome-widegene expression and protein analyses provide new,powerful tools for the study of such complex biologicalphenomena [4–6] and new, more integrative views arerequired to properly interpret them [7]. Such an inte-grative approach is obtained by mapping molecularinteractions into a network, as is the case for metabolicand signalling pathways. In this context, biologicaldatabases provide a unique opportunity to characterizebiological networks under a systems perspective.

Early topological studies of cellular networksrevealed that genomic, proteomic and metabolic mapsshare characteristic features with other real-worldnetworks [8–12]. Protein networks, also called inter-actomes, were studied thanks to a massive two-hybridsystem screening in unicellular Saccharomyces cerevisiae[9] and, more recently, in Drosophila melanogaster [13]and Caenorhabditis elegans [10]. The networks have anontrivial organization that departs strongly from sim-ple, random homogeneous metaphors [2]. The networkstructure involves a nested hierarchy of levels, fromlarge-scale features to modules and motifs [1,14]. Thisis particularly true for protein interaction maps andgene regulatory nets, which different evolutionary for-ces from convergent evolution [15] to dynamical con-straints [16,17] have helped shape. In this context,protein–protein interactions play an essential role inregulation, signalling and gene expression because they

Keywords

human; molecular evolution; protein interaction; tinkering; transcription factor network

Correspondence

Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain

Fax: +34 93 221 3237

Tel: +34 93 542 2821

E-mail: [email protected]

(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)

doi:10.1111/j.1742-4658.2005.05041.x

Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.

Abbreviations

ER, Erdős–Rényi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.

FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors Journal compilation © 2005 FEBS 6423

or via control of TF expression, less connected factors may also be relevant to cell survival.

Functional and structural patterns from topology

In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the ‘beta-scaffold factor with minor groove contact’ (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.

Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind to each other, as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. According to this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].

From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
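The link between shared dimerization domains and high clustering can be illustrated with a minimal Python sketch. It is not taken from the paper: the protein names (`A`–`D`, `Z1`, `Z2`) and the edge list are hypothetical, and the clustering coefficient used is the standard local one (fraction of a node's neighbour pairs that are themselves linked), which is assumed here to match the measure the authors report.

```python
from itertools import combinations

def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are linked to each other."""
    neigh = adj[node] - {node}  # self-interactions do not count as neighbours
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

# Hypothetical toy network: four proteins sharing a dimerization domain can
# all bind each other; two other proteins bind only protein A.
dimerizing = ["A", "B", "C", "D"]
edges = list(combinations(dimerizing, 2)) + [("A", "Z1"), ("A", "Z2")]
adj = build_adj(edges)

for name in ("B", "A", "Z1"):
    print(name, clustering_coefficient(adj, name))
```

In this toy case the proteins that only interact through the shared domain sit in a fully connected subgraph, while the node that also binds unrelated partners has a much lower coefficient.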

We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].

An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity

Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).

TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
P300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2

C. Rodriguez-Caso et al. Human transcription factor network topology


At least 9 transcription factors lead to cancer when their function is altered

If I know the gene network of a process THEN

I can predict which genes are really essential
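As a rough illustration of how a known network can point to candidate essential genes, the sketch below ranks the nodes of a small hypothetical network by degree (k) and betweenness centrality (b), the two measures reported for the hubs in Table 2. The node names and edges are invented for the example, and betweenness is computed with Brandes' algorithm for unweighted graphs without normalisation, which is an assumption and may differ from the scaling used in the paper.

```python
from collections import deque

def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair of end points was visited twice.
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical network: one hub "H" with four partners, one of which ("d")
# is the only connection to a peripheral node "e".
adj = build_adj([("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("d", "e")])
degree = {v: len(n) for v, n in adj.items()}
bc = betweenness(adj)

for node in sorted(adj, key=lambda v: (-degree[v], -bc[v])):
    print(node, degree[node], bc[node])
```

Nodes at the top of such a ranking are the ones whose loss would disconnect or reroute the most paths, which is the intuition behind treating hubs as candidate essential genes; it is a heuristic, not a proof of essentiality.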

Page 31: Bioinformatics and the logic of life

http://www.scbi.uma.es

Biomarkers can be obtained from the observation of bioinformatics networks

27

Breast cancer

Page 32: Bioinformatics and the logic of life

http://www.scbi.uma.es

Gene signatures for cancer diagnosis

28

Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords

R.M. Luque-Baena a,⇑, D. Urda a,b, M. Gonzalo Claros c, L. Franco a,b, J.M. Jerez a,b

a Departmento de Lenguajes y Ciencias de la Computación, University of Málaga, Bulevar Louis Pasteur, 35, 29071 Málaga, Spain

b Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain

c Supercomputing and Bioinformatics Centre, University of Málaga, C/ Severo Ochoa, 34, 29590 Málaga, Spain

Article info

Article history: Received 24 July 2013; Accepted 16 January 2014; Available online 27 January 2014

Keywords: DNA analysis; Evolutionary algorithms; Biological enrichment; Feature selection

Abstract

Genetic algorithms are widely used in the estimation of expression profiles from microarrays data. However, these techniques are unable to produce stable and robust solutions suitable to use in clinical and biomedical studies. This paper presents a novel two-stage evolutionary strategy for gene feature selection combining the genetic algorithm with biological information extracted from the KEGG database. A comparative study is carried out over public data from three different types of cancer (leukemia, lung cancer and prostate cancer). Even though the analyses only use features having KEGG information, the results demonstrate that this two-stage evolutionary strategy increased the consistency, robustness and accuracy of a blind discrimination among relapsed and healthy individuals. Therefore, this approach could facilitate the definition of gene signatures for the clinical prognosis and diagnostic of cancer diseases in a near future. Additionally, it could also be used for biological knowledge discovery about the studied disease.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

The term cancer encompasses more than 100 potentially life-threatening diseases affecting nearly every part of the body. Cancer is a complex, multifactorial, genetic disease involving structural and expression abnormalities of both coding and non-coding genes. In this sense, gene expression profiling plays an important role in a wide range of areas in biological science for handling cancer diseases [1–4]. The analysis of DNA microarray data requires a selection of features (genes) due to the small number of samples available (mostly less than a hundred) and the large number of features (in the order of thousands). This problem is well-known in the literature as the “large-p-small-n” paradigm or the curse of dimensionality [5].

Evolutionary models have been proposed in several works [6–12] and constitute one of the most widely used techniques for feature selection and prognosis analysis in microarray datasets. Despite the variety of feature selection techniques proposed in the literature, feature selection remains a problem intrinsic to the domain of DNA microarrays. Genetic algorithms (GAs) [13–18], as a particular case of evolutionary models, use classification techniques within the algorithm to evaluate and evolve the population. Producing stable or robust solutions is a desired property of feature selection algorithms, in particular for clinical and biomedical studies. Nevertheless, robustness is a property that is difficult to analyze and is often overlooked. In [19–21] different approaches are proposed, addressing the main drawbacks related to overfitting and robustness, through a modified GA that includes an early-stopping criterion and establishing a feature ranking method that leads to more robust solutions. Although some proposals use biological information to analyze DNA microarray data [22], none of them includes it in the mechanisms that guide the search procedure in the GA. In our opinion, this strategy would, on one hand, produce more robust feature subset selections and, on the other hand, yield signatures more relevant for clinicians and biomedical researchers.
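The idea of a GA that uses a classifier inside the loop to score candidate gene subsets can be sketched as follows. This is only a generic wrapper-style GA on synthetic data, not the authors' two-stage method: it has no KEGG-based scoring and no bootstrap cross-validation, and the nearest-centroid classifier, the penalty per selected feature and all parameter values are assumptions chosen to keep the example short.

```python
import random

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    hits = 0
    for i in range(len(X)):
        best_label, best_dist = None, None
        for label in set(y):
            # Centroid of this class, computed without the held-out sample i.
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroid = [sum(row[j] for row in rows) / len(rows) for j in cols]
            d = sum((X[i][j] - c) ** 2 for j, c in zip(cols, centroid))
            if best_dist is None or d < best_dist:
                best_label, best_dist = label, d
        hits += best_label == y[i]
    return hits / len(X)

def fitness(X, y, mask, penalty=0.01):
    """Accuracy minus a small cost per selected gene, to favour short signatures."""
    return loo_accuracy(X, y, mask) - penalty * sum(mask)

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(X, y, m), reverse=True)
        elite = ranked[: pop_size // 2]          # truncation selection with elitism
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda m: fitness(X, y, m))

def toy_data(seed=1, n_per_class=10, n_features=8):
    """Synthetic 'expression' data: only features 0 and 1 carry class signal."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            row[0] += 4 * label
            row[1] += 4 * label
            X.append(row)
            y.append(label)
    return X, y

X, y = toy_data()
best = ga_select(X, y)
print([j for j, keep in enumerate(best) if keep], fitness(X, y, best))
```

Running the GA with different seeds and comparing the selected subsets is one simple way to see the stability problem the text describes: on real microarray data, with thousands of genes and few samples, different runs often return different subsets.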

In this approach, a two-stage procedure is proposed in order to obtain robust feature subset selections with good performance rates in test future data. Bootstrap Cross-Validation (BCV) is used since its good behavior related to misclassification error with small samples has been previously demonstrated [23,24], including DNA microarray datasets. A novel feature scoring method within the GA is also proposed, taking into account biological information related to the studied disorders. One widely used source of biological information is the Gene Ontology (GO) database [25] since it

http://dx.doi.org/10.1016/j.jbi.2014.01.006

1532-0464/© 2014 Elsevier Inc. All rights reserved.

⇑ Corresponding author. Address: Department of Computer Languages and Computer Science, University of Málaga, Bulevar Louis Pasteur, 35, 29071 Málaga, Spain. Fax: +34 952131397.

E-mail addresses: [email protected] (R.M. Luque-Baena), [email protected] (D. Urda), [email protected] (M. Gonzalo Claros), [email protected] (L. Franco), [email protected] (J.M. Jerez).

Journal of Biomedical Informatics 49 (2014) 32–44


relevant information, in order to obtain a robust feature subset selection with good performance rates. The approach incorporates a novel feature scoring method within the GA, taking into account biological information about proteins (mostly enzymes) involved in the pathways of the studied disorders. The most remarkable finding is that our proposal improves the standard GA strategy regardless of the classification model used (LDA or SVM) in the three analyzed data sets (Table 4, Accuracy column), leading to statistically significant results in two of them (Leukemia and Lung). Even more important from the biological and clinical point of view, the robustness, in terms of the most selected genes that can be used to define gene signatures, is also improved in all three analyzed databases (Table 4, Robustness column). The main consequence of both facts is that a KEGG-improved GA can provide more repeatable and consistent results that will facilitate the definition of gene signatures for further clinical diagnosis and prognosis. Moreover, the comparative analysis between the KEGG-improved GA (Table 5) and three alternative filter methods (Cons, IG and ReliefF) demonstrated a similar or higher performance of the KEGG-improved GA, with the additional benefit of the biological information about the disease dynamics provided by this new GA-based strategy.

Regarding the summarizing results of Fig. 3, it can be seen that the best placed pathways in Table 4 provide more accurate and robust results. This opens the possibility of a deeper study of which KEGG pathway(s) provide(s) the best results for any disease dataset. It should be noted that those feature subsets that include more genes of the analyzed pathways might indicate that this particular pathway has a greater biological impact on the disease.

The proposed KEGG-improved GA can be used not only for diagnosis and prognosis, but also for biological knowledge discovery about the disease. Consider the most remarkable genes of Tables 6–8 that, although not originally present in the selected pathways, form part of the final selection and thus play an important role in obtaining robust and accurate prediction results. For example, in Table 6 (Leukemia set), the gene ZYX (footnote 7) is repeatedly selected in all but one pathway; it codes for zyxin, an adhesion plaque protein that prompts the formation of actin-rich structures at which signal transduction complexes assemble. In the case of the lung database (Table 7), several adhesion pathways are involved in this cancer (cf., 04530, 04514) while the ZYX gene does not seem to be significant. The gene SEMA3C (footnote 8) corresponds to a semaphorin, a protein including an immunoglobulin domain. It

Fig. 3. Accuracy and robustness obtained for the selected pathways for each considered database (Leukemia, Lung and Prostate). The graphs include the results obtained when using a strategy based only on genetic algorithms (GA) and on genetic algorithms plus the filtering approach (Filter + GA) (see text for more details).

7 http://www.genecards.org/cgi-bin/carddisp.pl?gene=ZYX.

8 http://www.genecards.org/cgi-bin/carddisp.pl?gene=SEMA3C.


0.8, 1, 2, 3, 5} and coef0, r = {0, 1, 2}. It should be noted that not all the parameters are required for each kernel type. For further information please visit [49].

Table 4 shows a comparison of the results after applying different strategies. The first three columns show the classification method, the dataset and the strategy used. The fourth column represents the number of genes, on average, after executing the method with fifty resamplings and five repetitions for each resampling. The Robustness column in Table 2 indicates the average frequency of the most selected genes, which are those that appear more than 5% of the time in any of the solutions. The last column shows the result of prediction of the disease over a test set not used at any point in the process.
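One plausible reading of this robustness measure can be written down in a few lines of Python. The function below is an interpretation of the sentence above, not the paper's published formula: it counts how often each gene appears across the solutions returned by repeated runs, keeps the genes whose selection frequency exceeds a threshold, and averages those frequencies. The gene labels `g1`–`g3` and the four example solutions are hypothetical.

```python
from collections import Counter

def robustness(solutions, threshold=0.05):
    """Average selection frequency of the genes selected in more than `threshold` of the runs."""
    n_runs = len(solutions)
    counts = Counter(gene for sol in solutions for gene in set(sol))
    freq = {gene: c / n_runs for gene, c in counts.items()}
    frequent = {gene: f for gene, f in freq.items() if f > threshold}
    if not frequent:
        return 0.0, {}
    return sum(frequent.values()) / len(frequent), frequent

# Hypothetical output of four GA runs (each run returns one gene subset).
runs = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}, {"g1"}]

# A higher threshold than the paper's 5% is used here because there are only four runs.
score, frequent = robustness(runs, threshold=0.3)
print(score, sorted(frequent))
```

A score near 1 means that the same genes are picked in almost every run, which is the property needed before a subset can be proposed as a signature.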

The accuracy results for the LDA method are, in general, slightly better than those obtained by applying SVM, although LDA has lower complexity. This is not surprising since it has been shown before that simpler classification techniques can lead to competitive or even better results [50]. Therefore, the following analysis done to

Table 5. Performance comparison among the “Filter + GA + Pathway” combined strategy and three well-known filtering methods (Cons, IG and ReliefF). ACC and number of genes (mean ± std) are reported for LDA and SVM classifiers on the three analyzed datasets.

Strategy | LDA ACC | LDA #Genes | SVM ACC | SVM #Genes

Leukemia
Filter + GA + Pathway 05340 | 97.13 ± 1.16 | 31.83 ± 1.86 | 93.87 ± 2.02 | 30.82 ± 1.62
Filter + GA + Pathway 04640 | 96.38 ± 1.26 | 4.47 ± 0.71 | 94.86 ± 1.13 | 4.05 ± 0.80
Cons | 85.85 ± 8.55 | 1.84 ± 0.51 | 88.24 ± 5.95 | 1.84 ± 0.51
IG | 93.13 ± 4.40 | 9 ± 0 | 93.36 ± 4.33 | 9 ± 0
ReliefF | 93.31 ± 4.37 | 9 ± 0 | 90.48 ± 5.15 | 9 ± 0

Lung
Filter + GA + Pathway 04144 | 98.09 ± 0.68 | 4.29 ± 0.53 | 96.25 ± 0.97 | 4.15 ± 0.57
Filter + GA + Pathway 04530 | 98.26 ± 0.46 | 3.84 ± 0.46 | 97.05 ± 0.90 | 3.55 ± 0.64
Cons | 94.08 ± 3.36 | 1.84 ± 0.42 | 94.57 ± 2.55 | 1.84 ± 0.42
IG | 98.68 ± 1.51 | 22 ± 0 | 98.88 ± 1.39 | 22 ± 0
ReliefF | 97.89 ± 1.81 | 22 ± 0 | 98.47 ± 1.43 | 22 ± 0

Prostate
Filter + GA + Pathway 00980 | 91.37 ± 1.15 | 8.27 ± 0.83 | 87.96 ± 2.39 | 11.15 ± 2.10
Filter + GA + Pathway 00480 | 90.80 ± 1.36 | 14.30 ± 2.63 | 88.90 ± 2.29 | 26.24 ± 4.02
Cons | 81.51 ± 7.57 | 3.20 ± 0.67 | 82.49 ± 6.72 | 3.20 ± 0.67
IG | 91.66 ± 4.07 | 12 ± 0 | 85.86 ± 4.86 | 12 ± 0
ReliefF | 90.22 ± 4.53 | 12 ± 0 | 88.50 ± 5.17 | 12 ± 0

Fig. 2. Proportion of the final selected genes which belong to the analyzed pathway for the databases Leukemia, Lung and Prostate.


Page 33: Bioinformatics and the logic of life

http://www.scbi.uma.es

Gene signatures for cancer diagnosis

28


If I have determined a gene signature THEN

I can tell which disease it is

Page 34: Bioinformatics and the logic of life

http://www.scbi.uma.es

Cancer signatures to reveal prognosis

29

A microRNA Signature Associated with Early Recurrence in Breast Cancer

Luis G. Perez-Rivas 1 (equal contribution), Jose M. Jerez 2 (equal contribution), Rosario Carmona 3, Vanessa de Luque 1, Luis Vicioso 4, M. Gonzalo Claros 3,5, Enrique Viguera 6, Bella Pajares 1, Alfonso Sanchez 1, Nuria Ribelles 1, Emilio Alba 1, Jose Lozano 1,5*

1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain

2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain

3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, Spain

4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain

5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain

6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de Málaga, Málaga, Spain

Abstract

Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value < 0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.
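To make the AUC figure concrete, the sketch below shows how an area under the ROC curve can be computed for a signature-based risk score. It is an illustration under stated assumptions, not the authors' analysis: the expression values are invented, the risk score (minus the mean expression of the signature miRNAs, since the abstract reports them as down-regulated in early recurrence) is a hypothetical choice, and the AUC is computed with the standard pairwise (Mann-Whitney) definition.

```python
def risk_score(expression):
    """Hypothetical score: lower mean signature expression means higher risk."""
    return -sum(expression) / len(expression)

def auc(scores_pos, scores_neg):
    """Probability that a positive case scores higher than a negative one (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Invented expression values of a 5-miRNA signature (one row per tumour).
early_relapse = [[1, 2, 1, 2, 1], [2, 2, 2, 1, 2], [4, 5, 4, 4, 5]]
no_relapse = [[5, 4, 5, 5, 4], [6, 5, 6, 6, 5], [4, 4, 3, 4, 4], [7, 6, 7, 6, 7]]

pos = [risk_score(e) for e in early_relapse]
neg = [risk_score(e) for e in no_relapse]
print(auc(pos, neg))
```

An AUC of 1 would mean every early-relapsing tumour scores higher than every non-relapsing one; 0.5 corresponds to a score with no discriminating value.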

Citation: Perez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoSONE 9(3): e91884. doi:10.1371/journal.pone.0091884

Editor: Sonia Rocha, University of Dundee, United Kingdom

Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014

Copyright: © 2014 Perez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a grant from the Spanish Society of Medical Oncology (SEOM, to NR) and by grants from the Spanish Ministerio de Economía (SAF2010-20203 to JL and TIN2010-16556 to JJ) and from the Junta de Andalucía (TIN-4026, to JJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

. These authors contributed equally to this work.

Introduction

Breast cancer comprises a group of heterogeneous diseases that can be classified based on both clinical and molecular features [1–5]. Improvements in the early detection of primary tumors and the development of novel targeted therapies, together with the systematic use of adjuvant chemotherapy, have drastically reduced mortality rates and increased disease-free survival (DFS) in breast cancer. Still, about one third of patients undergoing breast tumor excision will develop metastases, the major life-threatening event, which is strongly associated with poor outcome [6,7].

The risk of relapse after tumor resection is not constant over time. A detailed examination of large series of long-term follow-up studies over the last two decades reveals a bimodal hazard function with two peaks of early and late recurrence occurring at 1.5 and 5 years, respectively, followed by a nearly flat plateau in which the risk of relapse tends to zero [8–10]. A causal link between tumor surgery and the bimodal pattern of recurrence has been proposed by some investigators (i.e. an iatrogenic effect) [11]. According to that model, surgical removal of the primary breast tumor would accelerate the growth of dormant metastatic foci by altering the balance between circulating pro- and anti-angiogenic factors [9,11–14]. This hypothesis is supported by the fact that the two peaks of relapse are observed regardless of factors other than surgery, such as the axillary nodal status, the type of surgery or the administration of adjuvant therapy. Although estrogen receptor (ER)-negative tumors are commonly associated with a higher risk of early relapse [15], the bimodal distribution pattern is observed independently of the hormone receptor status [16]. Other studies also suggest that the dynamics of tumor relapse may be a

PLOS ONE | www.plosone.org 1 March 2014 | Volume 9 | Issue 3 | e91884


with tumors from relapse-free patients (group A, Table 2). MiR-625 was excluded from any further studies since RT-qPCR data showed minimal variation between groups (FC < 2). Next, we re-clustered the 71 tumors based on the 5-miRNA signature. As shown in Figure 2, tumors from groups A and B were clearly segregated in two distinct clusters, which included most of the expected samples in each category: 78.8% group A in cluster 1b (low risk) and 70.4% group B in cluster 2b (high risk). Of note, the supervised analysis included most tumors from group C (72.8%) in cluster 1b, indicating that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence.

Figure 2. A 5-miRNA signature is associated with early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples based on expression of the 5-miRNA signature. Note that lower expression levels of the 5-miRNA signature define a distinct cluster 2b, which mainly includes tumors from "high risk" patients (group B). On the contrary, most patients with good prognosis (group A) had tumors with normal or higher-than-normal levels of the 5-miRNA signature, defining a different cluster 1b ("low risk"). doi:10.1371/journal.pone.0091884.g002

Figure 3. The 5-miRNA signature discriminates patients with different RFS. A) Kaplan-Meier graph for the whole patient cohort included in this study. B) Those patients whose tumors showed an overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) were classified as "high risk" (red line) and their cumulative RFS was calculated. RFS was also calculated for the remaining patients in the cohort ("low risk", black line). The Kaplan-Meier plot shows that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence. doi:10.1371/journal.pone.0091884.g003

The 5-miRNA signature

MiR-149 was the most significant miRNA downregulated in group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR that regulates the expression of genes associated with cell cycle, invasion or migration, and its downregulation has been observed in several tumor diseases, including gastric cancer and breast cancer [70,77–81]. Down-regulation of miR-149 can occur epigenetically, by hypermethylation of the neighbouring CpG island [80], or by impaired processing of the pri-miR-149 precursor in a polymorphic variant [79]. In a recent work, downregulation of miR-149 has been associated with elevated levels of the transcription factor SP1, increased invasiveness and lower 5-year survival in colorectal cancer [80]. The p53 repressor ZBTB2 is also a target of miR-149 [81], which could explain, at least partially, its function as a TS-miR.

MiR-30a-3p is a member of the miR-30 family, which is associated with mesenchymal and stemness features [82,83] and is downregulated in several types of cancer [84–86]. Recently, Rodriguez-Gonzalez et al. have linked low levels of this miRNA to tamoxifen resistance in ER+ breast tumors. They have also proposed several targets of miR-30a-3p involved in proliferation and apoptosis, such as BCL2, NFκB, MAP2K4, PDGFA, CDK5R1 and CHN1 [87].

Regarding miR-20b, this miRNA is part of the miR-106b-363 cluster, which is frequently deregulated in cancer [88–91]. The levels of miR-20b are associated with histological grade in breast cancer [92,93]. This miRNA has been implicated in regulating several key proteins such as ESR1, HIF-1α, VEGF or STAT3 [92,94,95]. In particular, because it targets both HIF-1α and VEGF, and HIF-1α negatively controls miR-20b levels, it has been defined as an anti-angiogenic miRNA [95].

Both oncogenic and tumor suppressor features have been reported for miR-10a [96]. Thus, reduced expression of miR-10a has been associated with MAP3K7- and βTRC-mediated activation of the proinflammatory NFκB pathway [97]. Also, miR-10a downregulation represses differentiation in part by deregulation of the histone deacetylase HDAC4 [98] and positively affects invasiveness by de-repressing several members of the homeobox family of transcription factors [99].

Regarding miR-342-5p, it appears significantly deregulated only when we compare B vs AC (Table 2). Together with its counterpart (miR-342-3p), it is deregulated in inflammatory breast cancer [74] and its low expression has been associated with lower post-recurrence survival [100], likely because it targets AKT1 mRNA [101].

In sum, the available bibliographic data suggest that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR-342-5p in primary breast tumors could confer on them enhanced proliferative, angiogenic and invasive potential.

Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors according to their 5-miRNA signature status in two different groups. One group included those tumors with all five miRNAs simultaneously downregulated (FC > 2 and p < 0.05) and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a "high risk" group of patients with a shorter RFS (Peto-Peto test with p-value = 0.02, when comparing the low vs high risk groups).
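The Kaplan-Meier estimate behind curves such as those in Figure 3 can be sketched in a few lines. This is a minimal illustration only: the function name `kaplan_meier` and the toy follow-up times are assumptions made for the example, not the study data and not the software used by the authors.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a relapse occurred.

    times  -- follow-up time of each patient (e.g. months)
    events -- 1 if relapse was observed at that time, 0 if censored
    """
    n_at_risk = len(times)
    relapses = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # patients leaving the risk set at t (relapse or censoring)
    survival, curve = 1.0, []
    for t in sorted(leaving):
        d = relapses.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving[t]
    return curve

# Hypothetical toy data (months, relapse flag); NOT the 71-tumor cohort.
high_risk = kaplan_meier([6, 12, 18, 24, 60, 60], [1, 1, 1, 0, 0, 0])
low_risk = kaplan_meier([20, 55, 60, 60, 60, 60], [1, 1, 0, 0, 0, 0])

for label, curve in (("high risk", high_risk), ("low risk", low_risk)):
    print(label, [(t, round(s, 3)) for t, s in curve])
```

In this toy example the "high risk" curve drops earlier and further than the "low risk" one, which is the qualitative pattern described for Figure 3B. Testing whether two such curves differ (the Peto-Peto test in the text) is a separate step not shown here.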

Using a Cox proportional hazard regression model, we also tested all possible combinations of different covariates (tumor subtype, patient age, tumor size, number of lymph nodes affected and the 5-miRNA signature) with early relapse (≤ 24 months) to identify the best prognostic factors. The best model according to the AIC criterion included the tumor size and expression of the 5-miRNA signature (data not shown). Only the 5-miRNA signature (all five miRNAs down-regulated) was statistically significant in the Cox model for the high risk group (p-value = 0.02 with HR = 2.73, 95% CI: 1.17–6.36). The 5-miRNA expression data were also used to develop a predictor model through bootstrapping over a Naive Bayes classifier (B = 200 with N = 71, see methods). The prognostic accuracy of the models was assessed by a receiver operating characteristic (ROC) test (Figure 4). Considered individually, miR-30a-3p and miR-10a showed a strikingly high Area Under the Curve (AUC) score (0.890 and 0.875, respectively). This result suggests that mRNA targets regulated by miR-30a-3p and miR-10a could potentially make a greater contribution to the final outcome of the disease. However, the 5-miRNA signature had the strongest predictive value to discriminate tumors from patients that will develop early relapse (group B) from those that will remain free of disease (group A), with an AUC = 0.993 (Figure 4). In summary, the 5-miRNA signature performs well as a risk predictor for early breast cancer recurrence.
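To make the AUC figures concrete, the area under a ROC curve equals the probability that a randomly chosen relapsing patient receives a higher risk score than a randomly chosen relapse-free patient. The sketch below computes it directly from that definition. The function name `auc` and the scores are hypothetical and are not taken from the study, whose classifier (bootstrapped Naive Bayes) is not reproduced here.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve from its rank interpretation.

    scores_pos -- risk scores of patients who relapsed early (positives)
    scores_neg -- risk scores of patients who stayed disease-free (negatives)
    Ties between a positive and a negative count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical risk scores (higher = predicted more likely to relapse early).
early_relapse = [0.9, 0.8, 0.4]
disease_free = [0.5, 0.3, 0.2, 0.1]

toy_auc = auc(early_relapse, disease_free)
print(round(toy_auc, 3))
```

An AUC of 0.5 corresponds to a predictor no better than chance and 1.0 to perfect separation, which is why a value such as 0.993 indicates that almost every group B tumor scores above almost every group A tumor.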

Figure 4. Receiver operating characteristic curve (ROC) for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between patients who will develop early recurrence and those who will remain free of disease. Note that, although miR-30a-3p and miR-10a individually have a high area under the curve (AUC) score, the 5-miRNA signature has the strongest predictive value (AUC = 0.993) to discriminate those patients likely to recur early (group B in our cohort). doi:10.1371/journal.pone.0091884.g004

Candidate targets for the 5-miRNA signature. To extend our set of five miRNAs with regulatory information, we next took advantage of the existing public databases curating predicted and validated miRNA-target interactions (MTIs). In particular, validated targets were obtained from the miRTarBase and miRecords repositories (see methods). First, we created a biological network in Cytoscape [66] containing all the individual miRNAs included in the 5-miRNA signature (miR-149, miR-20b, miR-10a, miR-30a-3p and miR-342-5p). Next, we extended the network by adding H. sapiens MTI data retrieved from the indicated repositories and, finally, extended regulatory interaction networks (RIN) were generated and visualized in Cytoscape. Each regulatory interaction in the network consists of two nodes, a regulatory component (miRNA) and a target biomolecule (mRNA), connected through one directed edge. Figure 5 shows the extended network when the RIN threshold was set to 1 (i.e. each predicted target appears in at least one RIN). Thus, at RIN = 1 the network included 14 validated targets assigned to miR-20b (VEGFA, BAMBI, EFNB2, MYLIP, CRIM1, ARID4B, HIF1A, HIPK3, CDKN1A, PPARG, STAT3, MUC17, EPHB4, and ESR1), 7 validated targets assigned to miR-10a (HOXA1, NCOR2, SRSF1, SRSF10/TRA2B, MAP3K7, USF2 and BTRC) and 9 validated targets assigned to miR-30a-3p (THBS1, VEZT, TUBA1A, CDK6, WDR82, TMEM2, KRT7, CYR61 and SLC7A6) (Figure 5). Taking these results into account and considering that i) the extended network was constructed with the 5-miRNA signature as the network nodes and ii) all MTIs depicted in Figure 5 have been experimentally verified, we suggest that at least some of the 30 mRNAs (Figure 5) could be regulated in vivo by the 5-miRNA signature in early-relapsing tumors.
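The regulatory interaction network described above is a directed graph with one edge per miRNA-target pair. A minimal sketch using plain Python structures follows. The target lists are copied from the text, while the variable names are illustrative and the code is not a substitute for the Cytoscape workflow the authors used.

```python
# Validated targets per miRNA, as listed in the text (RIN threshold = 1).
validated_targets = {
    "miR-20b": ["VEGFA", "BAMBI", "EFNB2", "MYLIP", "CRIM1", "ARID4B", "HIF1A",
                "HIPK3", "CDKN1A", "PPARG", "STAT3", "MUC17", "EPHB4", "ESR1"],
    "miR-10a": ["HOXA1", "NCOR2", "SRSF1", "SRSF10/TRA2B", "MAP3K7", "USF2", "BTRC"],
    "miR-30a-3p": ["THBS1", "VEZT", "TUBA1A", "CDK6", "WDR82", "TMEM2", "KRT7",
                   "CYR61", "SLC7A6"],
    "miR-149": [],      # no experimentally validated target in the databases used
    "miR-342-5p": [],   # idem
}

# One directed edge (miRNA -> mRNA) per regulatory interaction.
edges = [(mirna, target)
         for mirna, targets in validated_targets.items()
         for target in targets]

target_nodes = {target for _, target in edges}
orphan_mirnas = [m for m, targets in validated_targets.items() if not targets]

print(len(edges), "interactions,", len(target_nodes), "distinct mRNA targets")
print("miRNAs without validated targets:", orphan_mirnas)
```

Counting the edges reproduces the 14 + 7 + 9 = 30 validated targets mentioned in the text, and the two miRNAs with empty lists are the ones for which the databases returned no validated target.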

To gain further insight into the molecular basis of the 5-miRNA signature prognostic value, we investigated the biological pathways associated with the 30 experimentally verified targets from Figure 5. To that end, we searched for Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the 30 targets as a whole set. It should be noted, however, that our restrictive approach (including only experimentally validated miRNA targets) left miR-149 and miR-342-5p out of the GO analysis and therefore, additional biological pathways could be affected by downregulation of the 5-miRNA

Figure 5. Prediction of mRNA targets likely to be regulated by the 5-miRNA signature. Biological networks were created using the Cytoscape software. Each network includes two types of nodes: the five individual miRNAs included in the 5-miRNA signature and their predicted mRNA targets (yellow circles), obtained from two different public databases (miRTarBase and miRecords). The number of databases included in the analysis defines the regulatory interaction network (RIN) threshold. Thus, at RIN = 1 the network includes all mRNA targets that appear in at least one database. The databases included in the RIN are identified by the color of the connecting arrows: miRTarBase (blue) and miRecords (red). Although many mRNAs are potential targets for miR-149 and miR-342-5p, the miRTarBase and miRecords versions included in this study did not reveal any experimentally validated targets for the two miRNAs. doi:10.1371/journal.pone.0091884.g005


Page 35: Bioinformatics and the logic of life

http://www.scbi.uma.es

Cancer signatures to reveal prognosis

29


with tumors from relapse-free patients (group A, Table 2). MiR-625 was excluded from any further studies since RT-qPCR data showed minimal variation between groups (FC<2). Next, we re-clustered the 71 tumors based on the 5-miRNA signature. As shown in Figure 2, tumors from groups A and B were clearly segregated in two distinct clusters, which included most of the expected samples in each category: 78.8% group A in cluster 1b (low risk) and 70.4% group B in cluster 2b (high risk). Of note, the supervised analysis included most tumors from group C (72.8%) in cluster 1b, indicating that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence.

The 5-miRNA signature

MiR-149 was the most significant miRNA downregulated in group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR that regulates the expression of genes associated with cell cycle, invasion or migration, and its downregulation has been observed in several tumor diseases, including gastric cancer and breast cancer [70,77–81]. Down-regulation of miR-149 can occur epigenetically, by hypermethylation of the neighbouring CpG island [80] or by impaired processing of the pri-miR-149 precursor, in a polymorphic variant [79]. In a recent work, downregulation of miR-149 has been associated with elevated levels of the transcription factor SP1, increased invasiveness and lower 5-year survival in colorectal cancer [80]. The p53 repressor ZBTB2 is also a target of miR-149 [81], which could explain, at least partially, its function as a TS-miR.

Figure 2. A 5-miRNA signature is associated with early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples based on expression of the 5-miRNA signature. Note that lower expression levels of the 5-miRNA signature define a distinct cluster 2b which mainly includes tumors from “high risk” patients (group B). On the contrary, most patients with good prognosis (group A) had tumors with normal or higher-than-normal levels of the 5-miRNA signature, defining a different cluster 1b (“low risk”). doi:10.1371/journal.pone.0091884.g002

Figure 3. The 5-miRNA signature discriminates patients with different RFS. A) Kaplan-Meier graph for the whole patient cohort included in this study. B) Those patients whose tumors showed an overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) were classified as “high risk” (red line) and their cumulative RFS was calculated. RFS was also calculated for the remaining patients in the cohort (“low risk”, black line). The Kaplan-Meier plot shows that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence. doi:10.1371/journal.pone.0091884.g003

A miRNA Signature Predictive of Early Recurrence

MiR-30a-3p is a member of the miR-30 family, which is associated with mesenchymal and stemness features [82,83] and is downregulated in several types of cancer [84–86]. Recently, Rodriguez-Gonzalez et al. have linked low levels of this miRNA to tamoxifen resistance in ER+ breast tumors. They have also proposed several targets of miR-30a-3p involved in proliferation and apoptosis, such as BCL2, NFκB, MAP2K4, PDGFA, CDK5R1 and CHN1 [87].

Regarding miR-20b, this miRNA is part of the miR-106b-363 cluster, which is frequently deregulated in cancer [88–91]. The levels of miR-20b associate with histological grade in breast cancer [92,93]. This miRNA has been involved in regulating several key proteins such as ESR1, HIF-1α, VEGF or STAT3 [92,94,95]. In particular, because it targets both HIF-1α and VEGF and HIF-1α negatively controls miR-20b levels, it has been defined as an anti-angiogenic miRNA [95].

Both oncogenic and tumor suppressor features have been reported for miR-10a [96]. Thus, reduced expression of miR-10a has been associated with MAP3K7- and βTRC-mediated activation of the proinflammatory NFκB pathway [97]. Also, miR-10a downregulation represses differentiation in part by deregulation of the histone deacetylase HDAC4 [98] and positively affects invasiveness by de-repressing several members of the homeobox family of transcription factors [99].

Regarding miR-342-5p, it appears significantly deregulated only when we compare B vs AC (Table 2). Together with its counterpart (miR-342-3p), it is deregulated in inflammatory breast cancer [74] and its low expression has been associated with lower post-recurrence survival [100], likely because it targets AKT1 mRNA [101].

In sum, the available bibliographic data suggest that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR-342-5p in primary breast tumors could confer on them enhanced proliferative, angiogenic and invasive potential.

Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors according to their 5-miRNA signature status in two different groups. One group included those tumors with all five miRNAs simultaneously downregulated (FC>2 and p<0.05) and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a “high risk” group of patients with a shorter RFS (Peto-Peto test with p-value = 0.02, when comparing the low vs high risk groups).
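To make the survival analysis above more concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. It is not the authors' code: the function name `kaplan_meier` and the toy cohort are hypothetical, and the sketch assumes the usual convention that relapses occurring at a given time are counted before censored patients leave the risk set.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up time per patient (e.g. months)
    events -- 1 if relapse was observed at that time, 0 if censored
    """
    relapses = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)      # every patient leaving the risk set at t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = relapses.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy cohort (illustration only, not data from the study)
times = [6, 12, 12, 20, 30, 55, 60, 60]
events = [1, 1, 0, 1, 0, 1, 0, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t:>2} months  RFS={s:.3f}")
```

Running the estimator separately on the "high risk" and "low risk" groups would give the two curves that a test such as Peto-Peto then compares.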

Using a Cox proportional hazard regression model, we also tested all possible combinations of different covariates (tumor subtype, patient age, tumor size, number of lymph nodes affected and the 5-miRNA signature) with early relapse (≤24 months) to identify the best prognostic factors. The best model according to the AIC criterion included the tumor size and expression of the 5-miRNA signature (data not shown). Only the 5-miRNA signature (all five miRNAs down-regulated) was statistically significant in the Cox model for the high risk group (p-value = 0.02 with HR = 2.73, 95% CI: 1.17–6.36). The 5-miRNA expression data were also used to develop a predictor model through bootstrapping over a Naive Bayes classifier (B = 200 with N = 71, see methods). The prognostic accuracy of the models was assessed by a receiver operating characteristic (ROC) test (Figure 4). Considered individually, miR-30a-3p and miR-10a showed a strikingly high Area Under the Curve (AUC) score (0.890 and 0.875, respectively). This result suggests that mRNA targets regulated by miR-30a-3p and miR-10a could potentially add a greater contribution to the final outcome of the disease. However, the 5-miRNA signature had the strongest predictive value to discriminate tumors from patients that will develop early relapse (group B) from those that will remain free of disease (group A), with an AUC = 0.993 (Figure 4). In summary, the 5-miRNA signature has a good performance as a risk predictor for early breast cancer recurrence.
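As an illustration of what the AUC scores quoted above measure, here is a small sketch that computes the area under the ROC curve as the probability that a randomly chosen early-relapse sample receives a higher score than a randomly chosen relapse-free one. The function name `roc_auc` and the scores are hypothetical and are not taken from the paper; the classifier that would produce such scores (Naive Bayes with bootstrapping in the study) is not reproduced here.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical predictor scores: label 1 = early relapse, 0 = relapse-free
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUC = {roc_auc(scores, labels):.3f}")
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why a value close to 1 is read as strong discrimination between the two groups.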

Figure 4. Receiver operating characteristic curve (ROC) for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between patients who will develop early recurrence and those who will remain free of disease. Note that, although miR-30a-3p and miR-10a individually have a high area under the curve (AUC) score, the 5-miRNA signature has the strongest predictive value (AUC = 0.993) to discriminate those patients likely to recur early (group B in our cohort). doi:10.1371/journal.pone.0091884.g004

Candidate targets for the 5-miRNA signature. To extend our set of five miRNAs with regulatory information, we next took advantage of the existing public databases curating predicted and validated miRNA-target interactions (MTIs). In particular, validated targets were obtained from the miRTarBase and miRecords repositories (see methods). First, we created a biological network in Cytoscape [66] containing all the individual miRNAs included in the 5-miRNA signature (miR-149, miR-20b, miR-10a, miR-30a-3p and miR-342-5p). Next, we extended the network by adding H. sapiens MTI data retrieved from the indicated repositories and, finally, extended regulatory interaction networks (RIN) were generated and visualized in Cytoscape. Each regulatory interaction in the network consists of two nodes, a regulatory component (miRNA) and a target biomolecule (mRNA), connected through one directed edge. Figure 5 shows the extended network when the RIN threshold was set to 1 (i.e. each predicted target appears in, at least, one RIN). Thus, at RIN = 1 the network included 14 validated targets assigned to miR-20b (VEGFA, BAMBI, EFNB2, MYLIP, CRIM1, ARID4B, HIF1A, HIPK3, CDKN1A, PPARG, STAT3, MUC17, EPHB4, and ESR1), 7 validated targets assigned to miR-10a (HOXA1, NCOR2, SRSF1, SRSF10/TRA2B, MAP3K7, USF2 and BTRC) and 9 validated targets assigned to miR-30a-3p (THBS1, VEZT, TUBA1A, CDK6, WDR82, TMEM2, KRT7, CYR61 and SLC7A6) (Figure 5). Taking these results into account and considering that i) the extended network was constructed with the 5-miRNA signature as the network nodes and ii) all MTIs depicted in Figure 5 have been experimentally verified, we suggest that at least some of the 30 mRNAs (Figure 5) could be regulated in vivo by the 5-miRNA signature in early-relapsing tumors.
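The network construction described above (miRNA and mRNA nodes joined by directed edges, filtered by how many databases support each interaction) can be sketched as follows. This is only an illustration under stated assumptions: the helper `build_rin` is hypothetical, and although the miRNA-target pairs are among those listed in the text, the database attributed to each record is invented for the example and does not reflect the actual contents of miRTarBase or miRecords.

```python
from collections import defaultdict

def build_rin(interactions, threshold=1):
    """Keep directed miRNA -> mRNA edges reported by at least `threshold` databases.

    interactions -- iterable of (mirna, target, database) records
    Returns a dict mapping (mirna, target) to the sorted list of supporting databases.
    """
    support = defaultdict(set)
    for mirna, target, database in interactions:
        support[(mirna, target)].add(database)
    return {edge: sorted(dbs) for edge, dbs in support.items() if len(dbs) >= threshold}

# Hypothetical records: the database column is made up for illustration
records = [
    ("miR-20b", "VEGFA", "miRTarBase"),
    ("miR-20b", "VEGFA", "miRecords"),
    ("miR-20b", "HIF1A", "miRTarBase"),
    ("miR-10a", "HOXA1", "miRecords"),
    ("miR-30a-3p", "THBS1", "miRTarBase"),
]

rin1 = build_rin(records, threshold=1)   # every validated interaction
rin2 = build_rin(records, threshold=2)   # only interactions found in both sources
for (mirna, target), dbs in rin1.items():
    print(f"{mirna} -> {target}  ({', '.join(dbs)})")
```

Raising the threshold trades coverage for confidence: at RIN = 1 every target reported by any database is kept, while higher thresholds keep only interactions confirmed by several sources.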

To gain further insight into the molecular basis of the 5-miRNA signature prognostic value, we investigated the biological pathways associated with the 30 experimentally verified targets from Figure 5. To that end, we searched for Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the 30 targets as a whole set. It should be noted, however, that our restrictive approach (including only experimentally validated miRNA targets) left miR-149 and miR-342-5p out of the GO analysis and therefore, additional biological pathways could be affected by downregulation of the 5-miRNA

Figure 5. Prediction of mRNA targets likely to be regulated by the 5-miRNA signature. Biological networks were created using the Cytoscape software. Each network includes two types of nodes: the five individual miRNAs included in the 5-miRNA signature and their predicted mRNA targets (yellow circles), obtained from two different public databases (miRTarBase and miRecords). The number of databases included in the analysis defines the regulatory interaction network (RIN) threshold. Thus, at RIN = 1 the network includes all mRNA targets that appear in, at least, one database. The databases included in the RIN are identified by the color of the connecting arrows: miRTarBase (blue) and miRecords (red). Although many mRNAs are potential targets for miR-149 and miR-342-5p, the miRTarBase and miRecords versions included in this study did not reveal any experimentally validated targets for the two miRNAs. doi:10.1371/journal.pone.0091884.g005


If I know which genes ARE expressed THEN

I can know which output WILL be obtained

Page 36: Bioinformatics and the logic of life

http://www.scbi.uma.es

Characterization of complex variations in cancer

30

©2014 Nature America, Inc. All rights reserved.

NATURE BIOTECHNOLOGY ADVANCE ONLINE PUBLICATION 3

ANALYSIS

for structural variants of different sizes (Supplementary Table 2). For the present comparison, we ran them as described in their companies’ corresponding publication or website.

We first observed that the calling of somatic SNVs was nearly optimal and within the same range in Mutect and SMUFIN, with sensitivities of 97% and 92%, and specificities of 93% and 99%, respectively (Table 1 and Supplementary Table 3). On the other hand, the calling efficiency of somatic structural variants varied greatly between different methods, revealing clear differences when compared to SMUFIN. Some methods reached reasonable levels of sensitivity when the evaluation was restricted to the range of structural variants they were designed to detect (Pindel and Delly), but these dropped drastically when compared against the complete catalog of structural variations in the tumor (Supplementary Table 4). By contrast, SMUFIN was
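For readers less familiar with the two metrics quoted above, the following sketch shows how sensitivity and specificity are commonly derived from confusion-matrix counts. It assumes the textbook definitions; the excerpt does not state how the article defines specificity for variant calling, and the counts below are hypothetical, not taken from the article or its supplementary tables.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true variant and one true non-variant")
    sensitivity = tp / (tp + fn)   # fraction of real variants that were called
    specificity = tn / (tn + fp)   # fraction of non-variant sites left uncalled
    return sensitivity, specificity

# Hypothetical counts for a single variant caller
sens, spec = sensitivity_specificity(tp=92, fp=10, tn=990, fn=8)
print(f"sensitivity={sens:.0%}  specificity={spec:.0%}")
```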

[Figure, panels a–d: workflow diagram from the Nature Biotechnology article. Stage labels: Comparison of normal and tumor reads and identification of potential breakpoints (tumor and normal genome sequencing, FASTQ file, quality filters, quaternary sequence tree, reads in tumor-specific branches, tumor-specific reads with potential breakpoints); Construction of breakpoint blocks (single orientation breakpoint, double orientation breakpoint, undefined breakpoint blocks, overlapping and complementary reads from normal genome); Definition and classification of variants (definition of small variants, n < read size: SNVs, deletions, inversions, insertions; definition of breakpoint and variant sequence for large SVs, > read size; extension of the variant and normal sequences 100 nt on each side of the breakpoint); Assigning reference coordinates (mapping of normal sequences with BWA; independent mapping of normal sequences flanking the breakpoint with BWA).]

Page 37: Bioinformatics and the logic of life

http://www.scbi.uma.es

Characterization of complex variations in cancer

30


If I know the polymorphisms of a person THEN

I can predict which disease he WILL suffer

Page 38: Bioinformatics and the logic of life

http://www.scbi.uma.es

Personalised medicine

31

A needle in a haystack WAS FOUND

Page 39: Bioinformatics and the logic of life

http://www.scbi.uma.es

Linking unrelated diseases

32

Alzheimer patients tend to be free of cancer, and cancer patients tend to be free of mental diseases

Page 40: Bioinformatics and the logic of life

http://www.scbi.uma.es

Linking unrelated diseases

32


Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses

Kristina Ibanez1., Cesar Boullosa1., Rafael Tabares-Seisdedos2, Anaïs Baudot3*, Alfonso Valencia1*

1 Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain, 2 Department of Medicine, University of Valencia, CIBERSAM, INCLIVA, Valencia, Spain, 3 Aix-Marseille Université, CNRS, I2M, UMR 7373, Marseille, France

Abstract

There is epidemiological evidence that patients with certain Central Nervous System (CNS) disorders have a lower than expected probability of developing some types of Cancer. We tested here the hypothesis that this inverse comorbidity is driven by molecular processes common to CNS disorders and Cancers, and that are deregulated in opposite directions. We conducted transcriptomic meta-analyses of three CNS disorders (Alzheimer’s disease, Parkinson’s disease and Schizophrenia) and three Cancer types (Lung, Prostate, Colorectal) previously described with inverse comorbidities. A significant overlap was observed between the genes upregulated in CNS disorders and downregulated in Cancers, as well as between the genes downregulated in CNS disorders and upregulated in Cancers. We also observed expression deregulations in opposite directions at the level of pathways. Our analysis points to specific genes and pathways, the upregulation of which could increase the incidence of CNS disorders and simultaneously lower the risk of developing Cancer, while the downregulation of another set of genes and pathways could contribute to a decrease in the incidence of CNS disorders while increasing the Cancer risk. These results reinforce the previously proposed involvement of the PIN1 gene, Wnt and P53 pathways, and reveal potential new candidates, in particular related with protein degradation processes.

Citation: Ibanez K, Boullosa C, Tabares-Seisdedos R, Baudot A, Valencia A (2014) Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses. PLoS Genet 10(2): e1004173. doi:10.1371/journal.pgen.1004173

Editor: Marshall S. Horwitz, University of Washington, United States of America

Received September 16, 2013; Accepted December 30, 2013; Published February 20, 2014

Copyright: © 2014 Ibanez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by a Fellowship from Obra Social la Caixa grant to KI (http://obrasocial.lacaixa.es/laCaixaFoundation/home_en.html), FPI grant BES-2008-006332 to CB and grant BIO2012 to AV Group. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected] (AB); [email protected] (AV)

. These authors contributed equally to this work.

Introduction

Epidemiological evidence points to a lower-than-expected probability of developing some types of Cancer in certain CNS disorders, including Alzheimer’s disease (AD), Parkinson’s disease (PD) and Schizophrenia (SCZ) [1–6]. Our current understanding of such inverse comorbidities suggests that this phenomenon is influenced by environmental factors, drug treatments and other aspects related to disease diagnosis. Genetics can additionally contribute to the inverse comorbidity between complex diseases, together with these external factors (for review, see [3–7]). In particular, we propose the deregulation in opposite directions of a common set of genes and pathways as an underlying cause of inverse comorbidities.

To investigate the biological plausibility of this hypothesis, a basic initial step is to establish the existence of inverse gene expression deregulations (i.e., down- versus up-regulations) in CNS disorders and Cancers. Towards this objective, we have performed integrative meta-analyses of collections of gene expression data, publicly available for AD, PD and SCZ, and Lung (LC), Colorectal (CRC) and Prostate (PC) Cancers. Clinical and epidemiological data previously reported inverse comorbidities for these complex disorders, according to population studies assessing the Cancer risks among patients with CNS disorders [8–17].

Results and Discussion

For each CNS disorder and Cancer type independently, we undertook meta-analyses from a large collection of microarray gene expression datasets to identify the genes that are significantly up- and down-regulated in disease when compared with their corresponding healthy control samples (Differentially Expressed Genes – DEGs –, FDR corrected p-value (q-value) <0.05, see Methods and Table S1). Then, the DEGs of the CNS disorders and Cancer types were compared to each other. There were significant overlaps (Fisher’s exact test, corrected p-value (q-value) <0.05, see Methods) between the DEGs upregulated in CNS disorders and those downregulated in Cancers. Similarly, DEGs downregulated in CNS disorders overlapped significantly with DEGs upregulated in Cancers (Figure 1A). Significant overlaps between DEGs deregulated in opposite directions in CNS disorders and Cancers are still observed while setting more stringent cutoffs for the detection of DEGs (q-values lower than 0.005, 0.0005, 0.00005 and 0.000005, Figure S1). A significant overlap between DEGs deregulated in the same direction was only identified in the case of CRC and PD upregulated genes (Figure 1A).
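The overlap test described above can be illustrated with a short sketch: a one-sided Fisher's exact test (equivalently, the upper tail of the hypergeometric distribution) on the intersection of two gene sets, followed by a Bonferroni correction. This is a minimal sketch, not the authors' pipeline: the function names, the size of the gene universe, the gene labels and the number of comparisons are all hypothetical.

```python
from math import comb

def overlap_pvalue(n_universe, set_a, set_b):
    """P(overlap >= observed) when len(set_b) genes are drawn from n_universe,
    of which len(set_a) belong to set_a (one-sided Fisher's exact test)."""
    k = len(set_a & set_b)
    a, b = len(set_a), len(set_b)
    tail = sum(comb(a, i) * comb(n_universe - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / comb(n_universe, b)

def bonferroni(pvalue, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, pvalue * n_tests)

# Hypothetical toy example: 20 measured genes, two DEG lists sharing 3 genes
up_in_cns = {"G1", "G2", "G3", "G4", "G5"}
down_in_cancer = {"G3", "G4", "G5", "G6", "G7"}
p = overlap_pvalue(20, up_in_cns, down_in_cancer)
q = bonferroni(p, n_tests=9)   # assumed number of pairwise comparisons
print(f"p = {p:.4f}, corrected = {q:.4f}")
```

The same function would be applied to each pair of lists (for instance, upregulated in one CNS disorder versus downregulated in one Cancer type), and the corrected values compared against the 0.05 cutoff.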

A molecular interpretation of the inverse comorbidity between CNS disorders and Cancers could be that the downregulation of certain

PLOS Genetics | www.plosgenetics.org 1 February 2014 | Volume 10 | Issue 2 | e1004173


Figure 1. Comparisons of Differentially Expressed Genes (DEGs). (A) Comparisons of DEGs associated with Central Nervous System (CNS) disorders and Cancers. The DEGs identified as significantly up- and down-regulated (q-value<0.05) after gene expression meta-analysis in each CNS disorder (Alzheimer’s Disease, AD; Parkinson’s Disease, PD; and Schizophrenia, SCZ) and Cancer type (Lung Cancer, LC; Colorectal Cancer, CRC; and Prostate Cancer, PC) are compared to each other. (B) Comparisons of DEGs between CNS disorders, Cancers and Asthma, HIV, Malaria, Dystrophy, Sarcoidosis. The DEGs identified as significantly up- and down-regulated (q-value<0.05) after gene expression meta-analysis in each CNS disorder (Alzheimer’s Disease, AD; Parkinson’s Disease, PD; and Schizophrenia, SCZ), Cancer type (Lung Cancer, LC; Colorectal Cancer, CRC; and Prostate Cancer, PC), and in Asthma, HIV, Malaria, Dystrophia and Sarcoidosis, are compared to each other. Cells are coloured according to the significance of the overlaps (Fisher’s exact test, Bonferroni correction for multiple testing, see Methods). Grey cells correspond to non-significant overlaps (q-value>0.05). doi:10.1371/journal.pgen.1004173.g001

Table 1. DEGs significantly downregulated in the three CNS disorders and upregulated in the three Cancer types (q-value<0.05).

PPIAP11, IARS, GGCT, NME2, GAPDHP1, CDC123, PSMD8, MRPS33, FIBP, OAZ2, IARS2, SLC35B1, APOO, TMEM189-UBE2V1, VDAC1, TMED3, SMS, DNM1L, PRPS1, SRSF2, TMEM14D, TOMM70A, ATP6V1C1, NUP93, MRPL15, UBA5, PPIH, SMYD3, NIT2, SRD5A1, NUDT21, MRPL12, EEF1E1, MRPS7, TTPAL, BZW1P2, RP11-552M11.4, TSN, MECR, ZWINT, RPRD1A, UCHL5, NHP2P2, TFB2M, FEN1, CGREF1, IMPAD1, ARL1, ACLY, MRPL42, LSM4, KPNA1, TIMM23B, RP11-164O23.5, RP11-762H8.2, FARSA, MRPL4, API5, RP3-425P12.4, RFC3, RANBP9, TFCP2, GMDS, CCNB1, TMEM177, GUF1, HSPA13, NMD3, GCFC2, TUBGCP5, TBCE, YKT6, PHF14, BRCC3

doi:10.1371/journal.pgen.1004173.t001

Inverse Comorbidity among Cancer and CNS Disorders


Comparing differentially expressed genes

SCZ: schizophrenia; AD: Alzheimer disease; PD: Parkinson disease; CRC: colorectal cancer; PC: prostate cancer; LC: lung cancer

Page 41: Bioinformatics and the logic of life

http://www.scbi.uma.es

Mental and cancer diseases are really connected

33

AD and PD, and upregulated in CRC (Reactome database; Figure S2).

Aside from the Wnt and p53 pathways, our analysis reveals other pathways related to protein folding and protein degradation displaying patterns of downregulation in CNS disorders and upregulation in Cancers, and that may be relevant for inverse comorbidity. For instance, the Ubiquitin/Proteasome system is consistently downregulated in CNS disorders and upregulated in Cancers according to the three pathway databases analyzed (Figure 2, Figure S2, Table S3). The inverse relationship between the levels of expression deregulations of these pathways possibly suggests opposite roles in CNS disorders and Cancers.

A detailed examination of the KEGG pathways deregulated inopposite directions in CNS disorders and Cancers finallyrevealed that 89% of the KEGG pathways that wereupregulated in Cancers and downregulated in CNS disordersare related to Metabolism and Genetic Information Processing(Figure 2, Figure 3). By contrast, the pathways downregulatedin Cancers and upregulated in CNS disorders are related to thecell’s communication with its environment (EnvironmentalInformation Processing and Organismal System; Figure 2,Figure 3). Hence, global regulations of cellular activity mayaccount for a protective effect between inversely comorbiddiseases.

Table 2. DEGs significantly upregulated in the three CNS disorders and downregulated in the three Cancer types (q-value < 0.05).

MT2A, MT1X, NFKBIA, AC009469.1, DHRS3, CDKN1A, TNFRSF1A, CRYBG3, IL4R, MT1M, FAM107A, ITPKC, MID1, IL11RA, AHNAK, KAT2B, BCL2, PTH1R, NFASC

doi:10.1371/journal.pgen.1004173.t002

Figure 2. KEGG pathways significantly deregulated in Central Nervous System (CNS) disorders and Cancer types. KEGG pathways [24] significantly up- and downregulated in each disease were identified using the GSEA method [34] (q-value < 0.05). The significant pathways were compared between the 6 diseases and combined in a network representation. Node pie charts are coloured according to the pathway status as Cancer upregulated (yellow), Cancer downregulated (blue), CNS disorder upregulated (green) and CNS disorder downregulated (red). The green/blue and yellow/red associations thus correspond to pathways deregulated in opposite directions in CNS disorders and Cancers. Pathway labels are coloured according to their classifications provided by KEGG [24], as: Metabolism (green), Genetic Information Processing (yellow), Cellular Process (pink), Environmental Information Processing (red) and Organismal Systems (dark red). All networks are available at bioinfo.cnio.es/people/cboullosa/validation/cytoscape/Ibanezetal.zip, in cytoscape format (http://www.cytoscape.org/).

doi:10.1371/journal.pgen.1004173.g002

Typical cancer functions

Typical mental disease functions

Page 42: Bioinformatics and the logic of life

http://www.scbi.uma.es

Mental and cancer diseases are really connected

33

74 genes: ↑↑ cancer, ↓↓ mental disease

19 genes: ↓↓ cancer, ↑↑ mental disease

Since 93 genes are inversely expressed in cancer and CNS disorders

THEN I can explain the inverse correlation between both diseases
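The count of 93 genes is simply the sum of the two tables (74 + 19). The selection behind each table can be sketched with set intersections: keep the genes deregulated in one direction in all three CNS disorders and in the opposite direction in all three cancers. The per-disease lists below are toy data invented for illustration (only the gene symbols VDAC1, FEN1, CCNB1, BCL2 and NFKBIA come from the tables; the `TOY_*` names are placeholders), and the function names are hypothetical.

```python
def shared_by_all(gene_sets):
    """Return the genes present in every set of the collection."""
    sets = [set(s) for s in gene_sets]
    return set.intersection(*sets)


def inversely_expressed(up, down, cns, cancers):
    """Genes deregulated in opposite directions in all CNS disorders and all cancers."""
    down_cns_up_cancer = shared_by_all(down[d] for d in cns) & shared_by_all(
        up[d] for d in cancers
    )
    up_cns_down_cancer = shared_by_all(up[d] for d in cns) & shared_by_all(
        down[d] for d in cancers
    )
    return down_cns_up_cancer, up_cns_down_cancer


CNS = ["AD", "PD", "SCZ"]
CANCERS = ["LC", "CRC", "PC"]

# Toy DEG lists per disease (invented, not the published data).
UP = {
    "AD": {"BCL2", "NFKBIA", "TOY_A"},
    "PD": {"BCL2", "NFKBIA"},
    "SCZ": {"BCL2", "NFKBIA", "TOY_B"},
    "LC": {"VDAC1", "FEN1", "CCNB1"},
    "CRC": {"VDAC1", "FEN1", "TOY_C"},
    "PC": {"VDAC1", "FEN1"},
}
DOWN = {
    "AD": {"VDAC1", "FEN1", "CCNB1"},
    "PD": {"VDAC1", "FEN1"},
    "SCZ": {"VDAC1", "FEN1", "TOY_D"},
    "LC": {"BCL2", "NFKBIA"},
    "CRC": {"BCL2", "TOY_E"},
    "PC": {"BCL2", "NFKBIA"},
}

table1_like, table2_like = inversely_expressed(UP, DOWN, CNS, CANCERS)
print(sorted(table1_like), sorted(table2_like))
print("total inversely expressed:", len(table1_like) + len(table2_like))
```

Requiring a gene to appear in every disease of a group is a strict filter, which is one reason the final lists are short compared with the per-disease DEG lists.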

Page 43: Bioinformatics and the logic of life

http://www.scbi.uma.es

After basic research, translational research is easy

34

Page 44: Bioinformatics and the logic of life

http://www.scbi.uma.es

Higher vertebrates have conserved genomes

35

The bonobo genome compared with the chimpanzee and human genomes Kay Prüfer et al. Nature 486, 527–531 (28 June 2012)

The zebrafish reference genome sequence and its relationship to the human genome Kerstin Howe et al. Nature 496, 498–503 (25 April 2013)

70% of protein-coding human genes are related to genes found in the zebrafish

84% of genes known to be associated with human disease have a zebrafish counterpart

Chimpanzee

Page 45: Bioinformatics and the logic of life

http://www.scbi.uma.es

Genome plasticity in bacteria

36

Estimating the size of the bacterial pan-genome. Lapierre & Gogarten. Trends in Genetics 25(3), 2009, Pages 107–110

Pangenomics – an avenue to improved industrial starter cultures and probiotics. Garrigues et al. Current Opinion in Biotechnology 2013, 24:187–191

Page 46: Bioinformatics and the logic of life

http://www.scbi.uma.es

Minimum number of genes for a living organism

37

1354 genes

Giovannoni et al., (2005) Science 309: 1242-1245

500 genes

Page 47: Bioinformatics and the logic of life

http://www.scbi.uma.es

Minimum number of genes for a living organism

37

1354 genes

Giovannoni et al., (2005) Science 309: 1242-1245

500 genes

If I know the minimal gene number of an organism THEN I can design artificial organisms for biotechnological purposes

Page 48: Bioinformatics and the logic of life

http://www.scbi.uma.es

There aren’t new genes but duplicated genes

38

The number of gene families plateaus with genome size

Figure 3.15 Because many genes are duplicated, the number of different gene families is much less than the total number of genes. The histogram compares the total number of genes with the number of distinct gene families.

Figure 3.16 The proportion of genes that are present in multiple copies increases with genome size in higher eukaryotes.

their exons. (A family of related genes arises by duplication of an ancestral gene followed by accumulation of changes in sequence between the copies. Most often the members of a family are related but not identical.) The number of types of genes is calculated by adding the number of unique genes (where there is no other related gene at all) to the numbers of families that have 2 or more members.
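The counting rule in the paragraph above can be written down directly: the number of gene types is the number of unique genes plus the number of families with two or more members. The sketch below assumes each gene has already been assigned to a family by some upstream clustering step. The gene and family names are invented, and `count_gene_types` is a hypothetical helper.

```python
from collections import Counter


def count_gene_types(family_of):
    """Count unique genes, multi-member families and total gene types.

    family_of maps each gene to the identifier of its family; a gene with no
    relatives is the only member of its own family.
    """
    sizes = Counter(family_of.values())
    unique_genes = sum(1 for n in sizes.values() if n == 1)
    families = sum(1 for n in sizes.values() if n >= 2)
    return unique_genes, families, unique_genes + families


# Toy genome with six genes (invented assignment).
toy_genome = {
    "geneA": "fam1",
    "geneB": "fam1",
    "geneC": "fam2",
    "geneD": "fam3",
    "geneE": "fam3",
    "geneF": "fam3",
}

unique_genes, families, gene_types = count_gene_types(toy_genome)
print(f"{len(toy_genome)} genes -> {gene_types} gene types "
      f"({unique_genes} unique, {families} families)")
```

In this toy genome the six genes collapse into three types, which is the effect the histogram in Figure 3.15 shows at genome scale.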

Figure 3.15 compares the total number of genes with the number of distinct families in each of six genomes. In bacteria, most genes are unique, so the number of distinct families is close to the total gene number. The situation is different even in the lower eukaryote S. cerevisiae, where there is a significant proportion of repeated genes. The most striking effect is that the number of genes increases quite sharply in the higher eukaryotes, but the number of gene families does not change much.

Figure 3.16 shows that the proportion of unique genes drops sharply with genome size. When genes are present in families, the number of members in a family is small in bacteria and lower eukaryotes, but is large in higher eukaryotes. Much of the extra genome size of Arabidopsis is accounted for by families with >4 members.

If every gene is expressed, the total number of genes will account for the total number of proteins required to make the organism (the proteome). However, two effects mean that the proteome is different from the total gene number. Because genes are duplicated, some of them code for the same protein (although it may be expressed in a different time or place) and others may code for related proteins that again play the same role in different times or places. And because some genes can produce more than one protein by means of alternative splicing, the proteome can be larger than the number of genes.

What is the core proteome, the basic number of the different types of proteins in the organism? A minimum estimate is given by the number of gene families, ranging from 1400 in the bacterium, >4000 in the yeast, and a range of 11,000-14,000 for the fly and worm.

What is the distribution of the proteome among types of proteins? The 6000 proteins of the yeast proteome include 5000 soluble proteins and 1000 transmembrane proteins. About half of the proteins are cytoplasmic, a quarter are in the nucleolus, and the remainder are split between the mitochondrion and the ER/Golgi system.

How many genes are common to all organisms (or to groups such as bacteria or higher eukaryotes) and how many are specific for the individual type of organism? Figure 3.17 summarizes the comparison between yeast, worm, and fly. Genes that code for corresponding proteins in different organisms are called orthologs. Operationally, we usually reckon that two genes in different organisms can be considered to provide corresponding functions if their sequences are similar over >80% of the length. By this criterion, ~20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when fly and worm are compared, probably representing the addition of gene functions that are common to multicellular eukaryotes. This still leaves a major proportion of genes as coding for proteins that are required specifically by either flies or worms, respectively.
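The operational criterion in this paragraph can be illustrated with a short sketch. It assumes that "similar over >80% of the length" means the aligned region covers more than 80% of the longer of the two sequences, which is one possible reading of the text. The alignment lengths and the ortholog tables are invented, and both function names are hypothetical.

```python
def similar_over_length(aligned_len, len_a, len_b, min_fraction=0.8):
    """True when the aligned region covers more than min_fraction of the longer sequence."""
    return aligned_len / max(len_a, len_b) > min_fraction


def fraction_with_orthologs(genes, *ortholog_maps):
    """Fraction of genes that have an ortholog in every one of the given species maps."""
    hits = [g for g in genes if all(g in m for m in ortholog_maps)]
    return len(hits) / len(genes)


# Toy example: five fly genes and invented ortholog assignments.
fly_genes = ["fly1", "fly2", "fly3", "fly4", "fly5"]
fly_to_yeast = {"fly1": "yeastA", "fly2": "yeastB"}
fly_to_worm = {"fly1": "wormA", "fly3": "wormC"}

print(similar_over_length(aligned_len=850, len_a=1000, len_b=900))
print(fraction_with_orthologs(fly_genes, fly_to_yeast, fly_to_worm))
print(fraction_with_orthologs(fly_genes, fly_to_worm))
```

In practice the ortholog maps would come from pairwise sequence comparisons filtered with a test like `similar_over_length`; here they are typed in by hand to keep the example small.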

The proteome can be deduced from the number and structures of genes, and can also be directly measured by analyzing the total protein content of a cell or organism. By such approaches, some proteins have been identified that were not suspected on the basis of genome analysis and that have therefore led to the identification of new genes. Several methods are used for large scale analysis of proteins. Mass spectrometry can be used for separating and identifying proteins in a mixture obtained directly from cells or tissues. Hybrid proteins bearing tags can be obtained by expression of cDNAs made by linking the sequences of open reading frames to appropriate expression vectors that incorporate the sequences for affinity tags. This allows array analysis to be used to analyze

62 CHAPTER 3 The content of the genome

WHOLE-GENOME SEQUENCING HAS GIVEN UNIQUE INSIGHTS INTO GENOME STRUCTURE

GENOME SIZE HAS NOTHING TO DO WITH GENE NUMBER

VARIABILITY AMONG GENOMES ARISES FROM A NUMBER OF DIFFERENT SOURCES

HIGH-THROUGHPUT TECHNOLOGIES OVERVIEW

Page 49: Bioinformatics and the logic of life

http://www.scbi.uma.es

Given a genome sequence, we cannot predict which kind of organism it will produce

39

?

Page 50: Bioinformatics and the logic of life

http://www.scbi.uma.es

Given a genome sequence, we cannot predict which kind of organism it will produce

39

?

A living being is more than the sum of its components

Page 51: Bioinformatics and the logic of life

http://www.scbi.uma.es

We can now relate facial shapes with genes

40

Modeling 3D Facial Shape from DNA

Peter Claes1, Denise K. Liberton2, Katleen Daniels1, Kerri Matthes Rosana2, Ellen E. Quillen2, Laurel N. Pearson2, Brian McEvoy3, Marc Bauchet2, Arslan A. Zaidi2, Wei Yao2, Hua Tang4, Gregory S. Barsh4,5, Devin M. Absher5, David A. Puts2, Jorge Rocha6,7, Sandra Beleza4,8, Rinaldo W. Pereira9, Gareth Baynam10,11,12, Paul Suetens1, Dirk Vandermeulen1, Jennifer K. Wagner13, James S. Boster14, Mark D. Shriver2*

1 Medical Image Computing, ESAT/PSI, Department of Electrical Engineering, KU Leuven, Medical Imaging Research Center, KU Leuven & UZ Leuven, iMinds-KU Leuven Future Health Department, Leuven, Belgium
2 Department of Anthropology, Penn State University, University Park, Pennsylvania, United States of America
3 Smurfit Institute of Genetics, Dublin, Ireland
4 Department of Genetics, Stanford University, Palo Alto, California, United States of America
5 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America
6 CIBIO: Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Porto, Portugal
7 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Porto, Portugal
8 IPATIMUP: Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal
9 Programa de Pós-Graduação em Ciências Genômicas e Biotecnologia, Universidade Católica de Brasília, Brasília, Brasil
10 School of Paediatrics and Child Health, University of Western Australia, Perth, Australia
11 Institute for Immunology and Infectious Diseases, Murdoch University, Perth, Australia
12 Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, Australia
13 Center for the Integration of Genetic Healthcare Technologies, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
14 Department of Anthropology, University of Connecticut, Storrs, Connecticut, United States of America

Abstract

Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers.

Citation: Claes P, Liberton DK, Daniels K, Rosana KM, Quillen EE, et al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genet 10(3): e1004224. doi:10.1371/journal.pgen.1004224

Editor: Daniela Luquetti, Seattle Children’s Research Institute, United States of America

Received September 12, 2013; Accepted January 22, 2014; Published March 20, 2014

Copyright: © 2014 Claes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This investigation was supported by grants to MDS from Science Foundation of Ireland Walton Fellowship (04.W4/B643); to MDS and DAP from the National Institute of Justice (2008-DN-BX-K125); to JKW from the NIH/National Human Genome Research Institute (K99HG006446); to DKL from the National Science Foundation (BCS-0851815) and from the Wenner Gren Foundation (Fieldwork Grant 7967). PC is partly supported by the Flemish Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT Vlaanderen), the Research Program of the Fund for Scientific Research - Flanders (Belgium) (FWO), the Research Fund KU Leuven and SB was supported by the Portuguese Institution “Fundação para a Ciência e a Tecnologia” [FCT; PTDC/BIABDE/64044/2006 (project) and SFRH/BPD/21887/2005 (post-doc grant)] and by a Dean’s Postdoctoral Fellowship at Stanford University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Introduction

The craniofacial complex is initially modulated by precisely-timed embryonic gene expression and molecular interactions mediated through complex pathways [1]. As humans grow, hormones and biomechanical factors also affect many parts of the face [2,3]. The inability to systematically summarize facial variation has impeded the discovery of the determinants and correlates of face shape. In contrast to genomic technologies, systematic and comprehensive phenotyping has lagged. This is especially so in the context of multipartite traits such as the human face. In typical genome-wide association studies (GWAS) today phenotypes are summarized as univariate variables, which is inherently limiting for multivariate traits, which, by definition cannot be expressed with single variables. Current state-of-the-art genetic association studies for facial traits are limited in their description of facial morphology [4–7]. These analyses start from a sparse set of anatomical landmarks (these being defined as “a point of correspondence on an object that matches between and within populations”), which overlooks salient features of facial shape. Subsequently, either a set of conventional morphometric measurements such as distances and angles are extracted, which drastically oversimplify facial shape, or a set of principal components (PCs) are extracted using principal components analysis (PCA) on the shape-space obtained with superimposition techniques, where each PC is assumed to represent a distinct morphological trait. Here we describe a novel method that facilitates the compounding of all PCs into a single scalar variable customized to relevant independent variables including, sex, genomic ancestry, and genes. Our approach combines placing

PLOS Genetics | www.plosgenetics.org 1 March 2014 | Volume 10 | Issue 3 | e1004224

Figure 4. Relationships between the ancestry and sex RIP variables and their initial predictor variables. (A) RIP-A with genomic ancestry; genomic ancestry is calculated using the core panel of 68 AIMs and RIP-A is calculated using this ancestry estimate on the set of three populations combined (N = 592). Populations are indicated as shown in the legend with United States participants shown with black circles, Brazilians with red circles, and Cape Verdeans with blue circles. (B) Histograms of RIP-S by self-reported sex.

doi:10.1371/journal.pgen.1004224.g004

Page 52: Bioinformatics and the logic of life

http://www.scbi.uma.es

We have found the treasure coffer, but…

41

http://www.slideshare.net/MGonzaloClaros

Page 53: Bioinformatics and the logic of life

http://www.scbi.uma.es

We have found the treasure coffer, but…

41

http://www.slideshare.net/MGonzaloClaros