Bioinformatics and the logic of life
Bioinformatics to reveal the logic of life
M. Gonzalo Claros Díaz
Dpto. Biología Molecular y Bioquímica
Plataforma Andaluza de Bioinformática
Centro de Bioinnovación
http://about.me/mgclaros/
@MGClaros
http://www.scbi.uma.es
There are many reflections about life
3
Genetics
Philosophy
Religion
Physics
And many more
A living being for some scientists
4
The cell is a kind of black box
Molecular biology provides some logic…
5
How do we select the few combinations that make sense?
A hierarchical logic…
6
The way back cannot be predicted
Bioinformatics = integration
10
http://bioinformatics.biol.ntnu.edu.tw/sher/Teaching.html
Bioinformatics receives and gives new data and insights
11
Biology
Computer science
Statistics
A living being is the result of all observations and cannot be inferred from biased observations
A living being for some scientists
12
The cell is a kind of black box
A living being for a bioinformatician
13
Life ontology
So, we begin to understand
14
Bioinformatician
Biotechnologist
Other scientists
Regarding data, informatics lags behind biology
16
Therefore, biology and informatics are interdependent
17
http://www.genomicglossaries.com/presentation/SLAgenomics.asp
Some logic in living beings based on bioinformatics
19
Bioinformatics integration in alcohol-induced disorders
20
http://pubs.niaaa.nih.gov/publications/arh311/5-11.htm
Through integration and modeling, these studies would allow us to better exploit the complexity of genomic and functional genomic data and to extract their biological and clinical significance
Drug discovery was expensive
21
Classic approach: experimental drugs were chemically synthesized and then tested in animals
Bioinformatics approach: only candidate drugs are synthesized. A cost-effective approach
Ligand database
Nobel Prize in Chemistry 2013
22
For the development of computational models to understand and predict chemical processes
Theoretical chemist, biophysicist, biochemist
http://blogs.plos.org/biologue/2013/10/18/the-significance-of-the-2013-nobel-prize-in-chemistry-and-the-challenges-ahead/
"This Nobel Prize is the first given to work in computational biology, indicating that the field has matured and is on a par with experimental biology" (The blog of PLOS Computational Biology)
A cell was full of molecular cascades
23
Divergent cascades
Convergent cascades
Then, a cell was a subway map
24
Subway map designed by Claudia Bentley. Web design by Nick Allin. Edited by Cath Brooksbank and Sandra Clark. © 2002 Nature Publishing Group.
http://www.nature.com/nrc/poster/subpathways/index.html
Finally, a cell is a network
25
Cell network complexity increases with whole-organism complexity. Key nodes reveal key functions
allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations.
Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition.
The number of TFs varies among organisms, although it appears to be linked to the organism's complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the growth of certain transcription factor families [25–28]. Here, a domain can be defined as a protein substructure that can fold independently into a compact structure. Different domains of a protein are often associated with different functions [29,30].
When dealing with TF networks, several relevant questions arise. How are these factors distributed and related through the network structure? How important has the protein domain universe been in shaping the network? Analysis of global patterns of network organization is required to answer these questions.
To this end, we explored, for the first time, the human transcription factor network (HTFN) obtained from the protein–protein interaction information available in the TRANSFAC database [31], using novel tools of network analysis. We show that this approximation allows us to propose evolutionary considerations concerning the mechanisms shaping network architecture.
Results and Discussion
Topological analysis
Data compilation from the TRANSFAC transcription factor database provided 1370 human entries. After filtering according to criteria given in Experimental Procedures, a graph of N = 230 interacting human TFs was obtained (Fig. 1). This can be understood as the architecture of the regulatory backbone. It provides a topological view of the interaction patterns among the elements responsible for gene expression. This corresponds to the protein hardware that carries out genomic instructions. The remaining TFs contained in the database did not form subgraphs and appeared isolated. The relatively small size of the connected graph compared with all the entries in the database might be due, at least in part, to the current degree of knowledge of this transcriptional regulatory network, with only sparse data for many of its components. Although a number of possible sources of bias are present, it is worth noting that the topological pattern of organization reported from different sources of protein–protein interactions seems consistent [32].
Topological analysis of HTFN is summarized in Table 1 showing that HTFN is a sparse, small-world graph. The degree distribution (Fig. 2A) and clustering (Fig. 2B) show a heterogeneous, skewed shape reminding us of a power–law behaviour, indicating that most TFs are linked to only a few others, whereas a handful of them have many connections. The average betweenness centrality (b) shows well-defined power–law
Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NF-κB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.
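The two quantities named in the topological analysis, the degree distribution and the clustering, can be illustrated on a toy graph. The following is a minimal sketch, not the authors' pipeline: the edge list is hypothetical (it is not TRANSFAC data), and the function names are chosen here for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy edge list (NOT TRANSFAC data): one hub and a few sparse links.
EDGES = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"), ("d", "e"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_distribution(adj):
    """Map each degree k to the number of nodes that have that degree."""
    return Counter(len(neigh) for neigh in adj.values())

def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves linked."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)

adj = build_adjacency(EDGES)
dist = degree_distribution(adj)

for k in sorted(dist):
    print(f"degree {k}: {dist[k]} node(s)")
print("clustering of hub:", round(local_clustering(adj, "hub"), 3))
```

Even in this small example most nodes have one or two links while a single node has many, which is the skewed shape the text describes for the HTFN.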
Human transcription factor network topology, C. Rodriguez-Caso et al.
FEBS Journal 272 (2005) 6423–6434. © 2005 The Authors. Journal compilation © 2005 FEBS
Transcription factor network explains some cancers
26
Topology, tinkering and evolution of the human transcription factor network
Carlos Rodriguez-Caso (1,2), Miguel A. Medina (2) and Ricard V. Sole (1,3)
1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain
2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain
3 Santa Fe Institute, Santa Fe, New Mexico, USA
Living cells are composed of a large number of different molecules interacting with each other to yield complex spatial and temporal patterns. Unfortunately, this reality is seldom captured by traditional and molecular biology approaches. A shift from molecular to modular biology seems unavoidable [1] as biological systems are defined by complex networks of interacting components. Such networks show high heterogeneity and are typically modular and hierarchical [2,3]. Genome-wide gene expression and protein analyses provide new, powerful tools for the study of such complex biological phenomena [4–6] and new, more integrative views are required to properly interpret them [7]. Such an integrative approach is obtained by mapping molecular interactions into a network, as is the case for metabolic and signalling pathways. In this context, biological databases provide a unique opportunity to characterize biological networks under a systems perspective.
Early topological studies of cellular networks revealed that genomic, proteomic and metabolic maps share characteristic features with other real-world networks [8–12]. Protein networks, also called interactomes, were studied thanks to a massive two-hybrid system screening in unicellular Saccharomyces cerevisiae [9] and, more recently, in Drosophila melanogaster [13] and Caenorhabditis elegans [10]. The networks have a nontrivial organization that departs strongly from simple, random homogeneous metaphors [2]. The network structure involves a nested hierarchy of levels, from large-scale features to modules and motifs [1,14]. This is particularly true for protein interaction maps and gene regulatory nets, which different evolutionary forces from convergent evolution [15] to dynamical constraints [16,17] have helped shape. In this context, protein–protein interactions play an essential role in regulation, signalling and gene expression because they
Keywords: human; molecular evolution; protein interaction; tinkering; transcription factor network
Correspondence
Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain
Fax: +34 93 221 3237
Tel: +34 93 542 2821
E-mail: [email protected]
(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)
doi:10.1111/j.1742-4658.2005.05041.x
Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.
Abbreviations
ER, Erdős–Rényi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.
or via control of TF expression, less connected factors may also be relevant to cell survival.
Functional and structural patterns from topology
In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the 'beta-scaffold factor with minor groove contact' (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.
Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind between them as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Attending to this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].
From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
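The argument that a shared self-interacting domain yields a highly interconnected subgraph can be made concrete with a small sketch. Everything in it is hypothetical: the protein names, the domain assignments, and the simplifying assumption that any two carriers of a self-interacting domain can bind each other.

```python
from itertools import combinations

# Hypothetical proteins and their domains, for illustration only.
PROTEIN_DOMAIN = {
    "P1": "bHLH", "P2": "bHLH", "P3": "bHLH", "P4": "bHLH",
    "Q1": "other", "Q2": "other",
}
# Assumption: any two proteins carrying a domain in this set can dimerize.
SELF_INTERACTING = {"bHLH"}

def domain_edges(protein_domain, self_interacting):
    """Link every pair of proteins that share a self-interacting domain."""
    edges = set()
    for p, q in combinations(sorted(protein_domain), 2):
        domain = protein_domain[p]
        if domain == protein_domain[q] and domain in self_interacting:
            edges.add((p, q))
    return edges

def density(nodes, edges):
    """Fraction of possible links among `nodes` that are present in `edges`."""
    n = len(nodes)
    possible = n * (n - 1) // 2
    inside = sum(1 for u, v in edges if u in nodes and v in nodes)
    return inside / possible if possible else 0.0

edges = domain_edges(PROTEIN_DOMAIN, SELF_INTERACTING)
bhlh = {p for p, d in PROTEIN_DOMAIN.items() if d == "bHLH"}
module_density = density(bhlh, edges)
overall_density = density(set(PROTEIN_DOMAIN), edges)

print("links inside the bHLH group:", len(edges))
print("density inside the group:", module_density)
print("density over all proteins:", overall_density)
```

Under this rule the domain-sharing group is fully connected while the graph as a whole stays sparse, which is the sense in which the text calls such a group a module.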
We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed, such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].
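The excerpt names a topological overlap algorithm but does not give its formula. As a sketch only, one commonly used definition scores a pair of nodes by their shared neighbours (plus one if they are directly linked), divided by the smaller of their two degrees. The formula choice, the toy edge list and the function names below are assumptions, not taken from the paper.

```python
# Hypothetical toy graph: two triangles joined by a single bridge (c-d).
EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def topological_overlap(adj, i, j):
    """Assumed definition: (common neighbours + direct link) / min degree."""
    direct = 1 if j in adj[i] else 0
    common = len(adj[i] & adj[j])
    return (common + direct) / min(len(adj[i]), len(adj[j]))

adj = build_adjacency(EDGES)
same_module = topological_overlap(adj, "a", "b")
bridge = topological_overlap(adj, "c", "d")
far_apart = topological_overlap(adj, "a", "e")

print("a-b (same triangle):", same_module)
print("c-d (bridge):", round(bridge, 3))
print("a-e (different triangles):", far_apart)
```

Pairs inside the same triangle score high and pairs from different triangles score low, so grouping nodes by this score recovers the two modules of the toy graph.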
An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity
Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).
TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
p300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2
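The k and b columns of Table 2 do not rank the hubs in the same order. The short sketch below makes that visible by sorting the table rows on each column. The numbers are copied from Table 2; the variable and function names are illustrative, and "RXR-alpha" is just an ASCII spelling of RXR-α.

```python
# Rows copied from Table 2: (TF, degree k, betweenness b x 10^3).
HUBS = [
    ("TBP", 27, 17.3),
    ("p53", 23, 18.5),
    ("p300", 18, 20.2),
    ("RXR-alpha", 18, 8.0),
    ("pRB", 15, 27.1),
    ("RelA", 14, 6.6),
    ("c-jun", 14, 4.1),
    ("c-myc", 13, 10.5),
    ("c-fos", 12, 2.0),
]

def rank_by(rows, column):
    """Return TF names sorted from highest to lowest value in `column`."""
    ordered = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row[0] for row in ordered]

by_degree = rank_by(HUBS, 1)
by_betweenness = rank_by(HUBS, 2)

print("Top 3 by degree:     ", by_degree[:3])
print("Top 3 by betweenness:", by_betweenness[:3])
```

TBP has the most connections, yet pRB, with only 15, has the highest betweenness, so a hub's importance as a connector is not determined by its degree alone.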
C. Rodriguez-Caso et al. Human transcription factor network topology
FEBS Journal 272 (2005) 6423–6434 ª 2005 The Authors Journal compilation ª 2005 FEBS 6427
2 1
4 5
7 6 9
http://www.scbi.uma.es
allow the formation of supramolecular activator orinhibitory complexes, depending on their componentsand possible combinations.Transcription factors (TFs) are an essential subset of
interacting proteins responsible for the control of geneexpression. They interact with DNA regions and tendto form transcriptional regulatory complexes. Thus,the final effect of one of these complexes is determinedby its TF composition.The number of TFs varies among organisms,
although it appears to be linked to the organism’scomplexity. Around 200–300 TFs are predicted forEscherichia coli [18] and Saccharomyces [19,20]. Bycontrast, comparative analysis in multicellular organ-isms shows that the predicted number of TFs reaches600–820 in C. elegans and D. melanogaster [20,21], and1500–1800 in Arabidopsis (1200 cloned sequences)[20–22]. For humans, around 1500 TFs have beendocumented [21] and it is estimated that there are2000–3000 [21,23]. Such an increase in the number ofTFs is associated with higher control of gene regula-tion [24]. Interestingly, such an increase is based onthe use of the same structural types of proteins.Human transcription factors are predominantly Zn fin-gers, followed by homeobox and basic helix–loop–helix[21]. Phylogenetic studies have shown that the amplifi-cation and shuffling of protein domains determine thegrowth of certain transcription factor families [25–28].Here, a domain can be defined as a protein sub-structure that can fold independently into a compactstructure. Different domains of a protein are oftenassociated with different functions [29,30].When dealing with TF networks, several relevant
questions arise. How are these factors distributed andrelated through the network structure? How importanthas the protein domain universe been in shaping thenetwork? Analysis of global patterns of networkorganization is required to answer these questions.To this end, we explored, for the first time, the
human transcription factor network (HTFN) obtainedfrom the protein–protein interaction information avail-able in the TRANSFAC database [31], using noveltools of network analysis. We show that this approxi-mation allows us to propose evolutionary considera-tions concerning the mechanisms shaping networkarchitecture.
Results and Discussion
Topological analysis
Data compilation from the TRANSFAC transcriptionfactor database provided 1370 human entries. After
filtering according to criteria given in ExperimentalProcedures, a graph of N ¼ 230 interacting humanTFs was obtained (Fig. 1). This can be understood asthe architecture of the regulatory backbone. It pro-vides a topological view of the interaction patternsamong the elements responsible for gene expression.This corresponds to the protein hardware that carriesout genomic instructions. The remaining TFs con-tained in the database did not form subgraphs andappeared isolated. The relatively small size of the con-nected graph compared with all the entries in the data-base might be due, at least in part, to the currentdegree of knowledge of this transcriptional regulatorynetwork, with only sparse data for many of its compo-nents. Although a number of possible sources of biasare present, it is worth noting that the topological pat-tern of organization reported from different sources ofprotein–protein interactions seems consistent [32].
Topological analysis of HTFN is summarized inTable 1 showing that HTFN is a sparse, small-worldgraph. The degree distribution (Fig. 2A) and clustering(Fig. 2B) show a heterogeneous, skewed shape remind-ing us of a power–law behaviour, indicating that mostTFs are linked to only a few others, whereas a handfulof them have many connections. The average between-ness centrality (b) shows well-defined power–law
Fig. 1. Human transcription factor network built from data extracted
from the TRANSFAC 8.2 database. Numbered black filled nodes
are the highest connected transcription factors. 1, TATA-binding
protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor a (RXRa); 5,retinoblastoma protein (pRB); 6, nuclear factor NFjB p65 subunit
(RelA); 7, c-jun; 8, c-myc; 9, c-fos.
Human transcription factor network topology C. Rodriguez-Caso et al.
6424 FEBS Journal 272 (2005) 6423–6434 ª 2005 The Authors Journal compilation ª 2005 FEBS
Transcription factor network explains some cancers
26
Topology, tinkering and evolution of the humantranscription factor networkCarlos Rodriguez-Caso1,2, Miguel A. Medina2 and Ricard V. Sole1,3
1 ICREA-Complex Systems Laboratory, Universitat Pompeu Fabra, Barcelona, Spain
2 Department of Molecular Biology and Biochemistry, Faculty of Sciences, Universidad de Malaga, Spain
3 Santa Fe Institute, Santa Fe, New Mexico, USA
Living cells are composed of a large number of differ-ent molecules interacting with each other to yield com-plex spatial and temporal patterns. Unfortunately, thisreality is seldom captured by traditional and molecularbiology approaches. A shift from molecular to modularbiology seems unavoidable [1] as biological systems aredefined by complex networks of interacting compo-nents. Such networks show high heterogeneity and aretypically modular and hierarchical [2,3]. Genome-widegene expression and protein analyses provide new,powerful tools for the study of such complex biologicalphenomena [4–6] and new, more integrative views arerequired to properly interpret them [7]. Such an inte-grative approach is obtained by mapping molecularinteractions into a network, as is the case for metabolicand signalling pathways. In this context, biologicaldatabases provide a unique opportunity to characterizebiological networks under a systems perspective.
Early topological studies of cellular networksrevealed that genomic, proteomic and metabolic mapsshare characteristic features with other real-worldnetworks [8–12]. Protein networks, also called inter-actomes, were studied thanks to a massive two-hybridsystem screening in unicellular Saccharomyces cerevisiae[9] and, more recently, in Drosophila melanogaster [13]and Caenorhabditis elegans [10]. The networks have anontrivial organization that departs strongly from sim-ple, random homogeneous metaphors [2]. The networkstructure involves a nested hierarchy of levels, fromlarge-scale features to modules and motifs [1,14]. Thisis particularly true for protein interaction maps andgene regulatory nets, which different evolutionary for-ces from convergent evolution [15] to dynamical con-straints [16,17] have helped shape. In this context,protein–protein interactions play an essential role inregulation, signalling and gene expression because they
Keywords
human; molecular evolution; protein interaction; tinkering; transcription factor network
Correspondence
Ricard V. Sole, ICREA - Complex System Laboratory, Universitat Pompeu Fabra, Dr Aiguader 80, 08003 Barcelona, Spain
Fax: +34 93 221 3237
Tel: +34 93 542 2821
E-mail: [email protected]
(Received 5 August 2005, revised 25 October 2005, accepted 31 October 2005)
doi:10.1111/j.1742-4658.2005.05041.x
Patterns of protein interactions are organized around complex heterogeneous networks. Their architecture has been suggested to be of relevance in understanding the interactome and its functional organization, which pervades cellular robustness. Transcription factors are particularly relevant in this context, given their central role in gene regulation. Here we present the first topological study of the human protein–protein interacting transcription factor network built using the TRANSFAC database. We show that the network exhibits scale-free and small-world properties with a hierarchical and modular structure, which is built around a small number of key proteins. Most of these proteins are associated with proliferative diseases and are typically not linked to each other, thus reducing the propagation of failures through compartmentalization. Network modularity is consistent with common structural and functional features and the features are generated by two distinct evolutionary strategies: amplification and shuffling of interacting domains through tinkering and acquisition of specific interacting regions. The function of the regulatory complexes may have played an active role in choosing one of them.
Abbreviations
ER, Erdos-Renyi; HTFN, human transcription factor network; SF, scale free; SW, small world; TF, transcription factor.
FEBS Journal 272 (2005) 6423–6434 © 2005 The Authors Journal compilation © 2005 FEBS
or via control of TF expression, less connected factors may also be relevant to cell survival.
Functional and structural patterns from topology
In order to reveal the mechanisms that shape the structure of HTFN, we studied its topological modularity in relation to the function and structure of TFs from available information. From a structural point of view, the overabundance of self-interactions is associated with a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Zn fingers and 22.5% corresponding to a more heterogeneous group, the ‘beta-scaffold factor with minor groove contact’ (according to the TRANSFAC classification) superclass, which includes Rel homology regions, MADS factors and others.
Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind to one another, as is the case for bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Following this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case for the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].
From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].
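The module detection mentioned above rests on a topological overlap measure. As a rough illustration only, the sketch below assumes one common definition: the number of neighbours two nodes share, plus one if they are directly linked, divided by the smaller of their two degrees. The exact variant used for HTFN is not given in this section and may differ. The toy graph (two triangles joined by a single link) and its node names are hypothetical.

```python
# Hypothetical toy graph: two fully connected triangles joined by one bridge (a3-b1).
# Stored as an undirected adjacency map {node: set(neighbours)}.
toy = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2", "b1"},
    "b1": {"a3", "b2", "b3"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
}

def topological_overlap(graph, i, j):
    """Assumed definition: (shared neighbours + direct link) / min(degree_i, degree_j)."""
    if i == j:
        return 1.0
    k_min = min(len(graph[i]), len(graph[j]))
    if k_min == 0:
        return 0.0  # an isolated node overlaps with nothing
    shared = len(graph[i] & graph[j])
    direct = 1 if j in graph[i] else 0
    return (shared + direct) / k_min

# Pairs inside a triangle overlap fully; the bridge pair overlaps weakly;
# unrelated nodes from different triangles do not overlap at all.
within = topological_overlap(toy, "a1", "a2")
bridge = topological_overlap(toy, "a3", "b1")
across = topological_overlap(toy, "a1", "b2")
print(within, bridge, across)
```

Under this assumed definition, pairs with high overlap are candidates for the same module, while a hub that merely connects two modules scores low against its partners on the other side.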
An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity
Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).
TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
p300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2
At least 9 transcription factors can lead to cancer if their function is affected
http://www.scbi.uma.es
allow the formation of supramolecular activator or inhibitory complexes, depending on their components and possible combinations.
Transcription factors (TFs) are an essential subset of interacting proteins responsible for the control of gene expression. They interact with DNA regions and tend to form transcriptional regulatory complexes. Thus, the final effect of one of these complexes is determined by its TF composition.
The number of TFs varies among organisms, although it appears to be linked to the organism’s complexity. Around 200–300 TFs are predicted for Escherichia coli [18] and Saccharomyces [19,20]. By contrast, comparative analysis in multicellular organisms shows that the predicted number of TFs reaches 600–820 in C. elegans and D. melanogaster [20,21], and 1500–1800 in Arabidopsis (1200 cloned sequences) [20–22]. For humans, around 1500 TFs have been documented [21] and it is estimated that there are 2000–3000 [21,23]. Such an increase in the number of TFs is associated with higher control of gene regulation [24]. Interestingly, such an increase is based on the use of the same structural types of proteins. Human transcription factors are predominantly Zn fingers, followed by homeobox and basic helix–loop–helix [21]. Phylogenetic studies have shown that the amplification and shuffling of protein domains determine the growth of certain transcription factor families [25–28]. Here, a domain can be defined as a protein substructure that can fold independently into a compact structure. Different domains of a protein are often associated with different functions [29,30].
When dealing with TF networks, several relevant questions arise. How are these factors distributed and related through the network structure? How important has the protein domain universe been in shaping the network? Analysis of global patterns of network organization is required to answer these questions.
To this end, we explored, for the first time, the human transcription factor network (HTFN) obtained from the protein–protein interaction information available in the TRANSFAC database [31], using novel tools of network analysis. We show that this approach allows us to propose evolutionary considerations concerning the mechanisms shaping network architecture.
Results and Discussion
Topological analysis
Data compilation from the TRANSFAC transcription factor database provided 1370 human entries. After filtering according to criteria given in Experimental Procedures, a graph of N = 230 interacting human TFs was obtained (Fig. 1). This can be understood as the architecture of the regulatory backbone. It provides a topological view of the interaction patterns among the elements responsible for gene expression. This corresponds to the protein hardware that carries out genomic instructions. The remaining TFs contained in the database did not form subgraphs and appeared isolated. The relatively small size of the connected graph compared with all the entries in the database might be due, at least in part, to the current degree of knowledge of this transcriptional regulatory network, with only sparse data for many of its components. Although a number of possible sources of bias are present, it is worth noting that the topological pattern of organization reported from different sources of protein–protein interactions seems consistent [32].
Topological analysis of HTFN is summarized in Table 1 showing that HTFN is a sparse, small-world graph. The degree distribution (Fig. 2A) and clustering (Fig. 2B) show a heterogeneous, skewed shape reminding us of a power–law behaviour, indicating that most TFs are linked to only a few others, whereas a handful of them have many connections. The average betweenness centrality (b) shows well-defined power–law
Fig. 1. Human transcription factor network built from data extracted from the TRANSFAC 8.2 database. Numbered black filled nodes are the highest connected transcription factors. 1, TATA-binding protein (TBP); 2, p53; 3, p300; 4, retinoid X receptor α (RXRα); 5, retinoblastoma protein (pRB); 6, nuclear factor NFκB p65 subunit (RelA); 7, c-jun; 8, c-myc; 9, c-fos.
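The quantities used in the topological analysis (degree, clustering and path length) can be sketched on a small example. The code below is a minimal illustration under stated assumptions, not the authors' pipeline: the edge list is a hypothetical toy graph rather than TRANSFAC data, and the function names are invented for this sketch.

```python
from collections import deque

# Hypothetical toy interaction list; names are illustrative, not TRANSFAC entries.
EDGES = [
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
    ("a", "b"), ("d", "e"),
]

def build_graph(edges):
    """Return an undirected adjacency map {node: set(neighbours)}, ignoring self-loops."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if u != v:
            graph[u].add(v)
            graph[v].add(u)
    return graph

def clustering(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neigh = list(graph[node])
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neigh[j] in graph[neigh[i]]
    )
    return 2.0 * links / (k * (k - 1))

def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

graph = build_graph(EDGES)
degree = {node: len(neigh) for node, neigh in graph.items()}
print(degree)
print(clustering(graph, "hub"), clustering(graph, "a"))
print(average_path_length(graph))
```

In this toy graph one node has many links while the rest have one or two, which mirrors in miniature the skewed degree distribution described above; a short average path length combined with high clustering is what the small-world label refers to.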
Transcription factor network explains some cancers
26
or via control of TF expression, less connected factorsmay also be relevant to cell survival.
Functional and structural patterns from topology
In order to reveal the mechanisms that shape the struc-ture of HTFN, we studied its topological modularityin relation to the function and structure of TFs fromavailable information. From a structural point of view,the overabundance of self-interactions is associatedwith a majority group of 55% of basic helix–loop–helix (bHLH) and leucine zippers (bZip), 17.5% of Znfingers and 22.5% corresponding to a more hetero-geneous group, the ‘beta-scaffold factor with minorgroove contact’ (according to the TRANSFAC classifi-cation) superclass, which includes Rel homologyregions, MADS factors and others.
Such structures can be understood as protein domains, which can be found alone or combined to give rise to TFs. These domains are responsible for relevant properties, such as TF–DNA or TF–TF binding. In this context, self-interactions can be explained by the presence of domains with the ability to bind to each other, as is the case of bHLH and bZip. They follow a general mechanism to interact with DNA based on protein dimerization [42]. Zn finger domains are common in TFs, allowing them to bind DNA, but not to interact with other protein regions [42]. This group of self-interacting Zn finger proteins is a subset of the nuclear receptor superfamily (steroid, retinoid and thyroid, as well as some orphan receptors) [26,43]. They obey a general mechanism in which Zn finger TFs have to form dimers in order to recognize tandem sequences in DNA [42]. In fact, regulation at the level of formation of transcriptional regulatory complexes is linked to a homo/heterodimerization of TFs containing these self-interacting domains. Following this simple rule of domain self-interaction, relative levels of these proteins could determine the final composition of a complex, by varying their function and affinity to DNA. This is the case of the bHLH–bZip proto-oncogene c-myc [44], or the Zn finger retinoid X receptor RXR [45].
From a topological viewpoint, connections by self-interacting domains would imply high clustering and modularity, because all these proteins share the same rules and they have the potential to give a highly interconnected subgraph (i.e. a module). According to this, the high clustering of HTFN (see Fig. 1) could be explained as a by-product of the overabundance of self-interacting domains.
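The link between mutually binding domains and high clustering can be illustrated with a small sketch. It is not the authors' analysis: the node names (`d1`..`d4`, `x1`, `x2`) and the edge list are hypothetical, and the local clustering coefficient is used here as one common way to quantify how interconnected a node's neighbours are.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    neighbours = adj[node] - {node}  # self-interactions do not count as neighbours
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: four proteins sharing a dimerizing domain can all
# bind each other (a clique), plus a short chain of unrelated proteins.
dimerizing = ["d1", "d2", "d3", "d4"]
edges = list(combinations(dimerizing, 2)) + [("d1", "x1"), ("x1", "x2")]
adj = build_adjacency(edges)

for node in sorted(adj):
    print(node, clustering_coefficient(adj, node))
```

In this toy case the proteins that only interact within the clique reach the maximum coefficient, while the chain proteins score zero, which is the qualitative pattern the paragraph above describes.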
We wondered whether the HTFN modular architecture (Fig. 3C) might include both functionality and structural similarity. In order to simplify the study of modularity, we traced an arbitrary line identifying seven putative protein groups (dashed line in Fig. 3C). Nodes of each group were identified by different colours in the HTFN graph (Fig. 4A) where we visualize the modules defined by the topological overlap algorithm. We note that a consequence of the hierarchical component of HTFN is that not all factors in each group have the same level of relation. Unlike a simple modular network, the combination of hierarchy and modularity cannot give homogeneous groups. Figure 4B shows the HTFN core graph, highlighting its modularity, the under-representation of connections between hubs and the overabundance of highly connected nodes linked to poorly connected ones (both observed in the correlation profile). The central role of the hubs in topological groups defined in Fig. 3A should be stressed; such hubs are those described in Table 2, with the exception of E12 (with k = 11), which is involved in lymphocyte development [46].
An analysis of the topological modules of Fig. 3 (labelled A–G) shows that they include structural and/or functional features. Table 3 summarizes the main structural and functional features of these groups. In agreement with the structural homogeneity
Table 2. Description and functionality of transcription factor hubs. Transcription factor (TF), degree (k), betweenness centrality (b).
TF | Description | Associated disease | k | b × 10³
TBP | Basal transcription machinery initiator | Spinocerebellar ataxia [40] | 27 | 17.3
p53 | Tumor suppressor protein | Proliferative disease [68] | 23 | 18.5
P300 | Coactivator. Histone acetyltransferase | May play a role in epithelial cancer [69] | 18 | 20.2
RXR-α | Retinoid X-α receptor | Hepatocellular carcinoma [70] | 18 | 8
pRB | Retinoblastoma suppressor protein. Tumour suppressor protein | Proliferative disease. Bladder cancer. Osteosarcoma [71] | 15 | 27.1
RelA | NF-κB pathway | Hepatocyte apoptosis and foetal death [72] | 14 | 6.6
c-jun | AP-1 complex (activator). Proto-oncogene | Proliferative disease [73] | 14 | 4.1
c-myc | Activator. Proto-oncogene | Proliferative disease [74] | 13 | 10.5
c-fos | AP-1 complex (activator). Proto-oncogene | Proliferative disease [75] | 12 | 2
C. Rodriguez-Caso et al. Human transcription factor network topology
At least 9 transcription factors lead to cancer if their function is affected
If I know the gene network of a process THEN
I can predict which genes are really essential
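The reasoning on this slide can be sketched in code under a strong simplifying assumption: a gene is treated as more "essential" the more the network falls apart when it is removed. The edge list and node names below are hypothetical, and this fragmentation score is only one possible proxy, not the method of the cited paper.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over the remaining nodes
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        best = max(best, size)
    return best


def rank_by_damage(adj):
    """Rank nodes by how much the largest component shrinks when each is removed.

    The removed node itself is counted, so the minimum damage is 1.
    """
    full = largest_component_size(adj)
    damage = {n: full - largest_component_size(adj, {n}) for n in adj}
    return sorted(damage.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy network: one hub with three partners and a short tail.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
ranking = rank_by_damage(build_adjacency(edges))
print(ranking)
```

Here the hub comes first in the ranking because deleting it splits the network into small pieces, whereas deleting a peripheral node removes only that node.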
Biomarkers can be obtained from the observation of bioinformatics networks
27
Breast cancer
Gene signatures for cancer diagnosis
28
Robust gene signatures from microarray data using genetic algorithms enriched with biological pathway keywords
R.M. Luque-Baena a,⇑, D. Urda a,b, M. Gonzalo Claros c, L. Franco a,b, J.M. Jerez a,b
a Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Bulevar Louis Pasteur, 35, 29071 Málaga, Spain
b Instituto de Investigación Biomédica de Málaga (IBIMA), Málaga, Spain
c Supercomputing and Bioinformatics Centre, University of Málaga, C/ Severo Ochoa, 34, 29590 Málaga, Spain
Article info
Article history: Received 24 July 2013; Accepted 16 January 2014; Available online 27 January 2014
Keywords: DNA analysis; Evolutionary algorithms; Biological enrichment; Feature selection
Abstract
Genetic algorithms are widely used in the estimation of expression profiles from microarrays data. However, these techniques are unable to produce stable and robust solutions suitable to use in clinical and biomedical studies. This paper presents a novel two-stage evolutionary strategy for gene feature selection combining the genetic algorithm with biological information extracted from the KEGG database. A comparative study is carried out over public data from three different types of cancer (leukemia, lung cancer and prostate cancer). Even though the analyses only use features having KEGG information, the results demonstrate that this two-stage evolutionary strategy increased the consistency, robustness and accuracy of a blind discrimination among relapsed and healthy individuals. Therefore, this approach could facilitate the definition of gene signatures for the clinical prognosis and diagnostic of cancer diseases in a near future. Additionally, it could also be used for biological knowledge discovery about the studied disease.
© 2014 Elsevier Inc. All rights reserved.
1. Introduction
The term cancer encompasses more than 100 potentially life-threatening diseases affecting nearly every part of the body. Cancer is a complex, multifactorial, genetic disease involving structural and expression abnormalities of both coding and non-coding genes. In this sense, gene expression profiling plays an important role in a wide range of areas in biological science for handling cancer diseases [1–4]. The analysis of DNA microarray data requires a selection of features (genes) due to the small number of samples available (mostly less than a hundred) and the large number of features (in the order of thousands). This problem is well-known in the literature as the “large-p-small-n” paradigm or the curse of dimensionality [5].
Evolutionary models have been proposed in several works [6–12] and constitute one of the most widely used techniques for feature selection and prognosis analysis in microarray datasets. Despite the variety of feature selection techniques proposed in the literature, feature selection still remains a problem intrinsic to the domain of DNA microarrays. Genetic algorithms (GAs) [13–18], as a particular case of evolutionary models, use classification techniques within the algorithm to evaluate and evolve the population. Producing stable or robust solutions is a desired property of feature selection algorithms, in particular for clinical and biomedical studies. Nevertheless, robustness is a property that is difficult to analyze and is often overlooked. In [19–21] different approaches are proposed, addressing the main drawbacks related to overfitting and robustness, through a modified GA that includes an early-stopping criterion and establishing a feature ranking method that leads to more robust solutions. Although some proposals use biological information to analyze DNA microarray data [22], none of them includes it in the mechanisms that guide the searching procedure in the GA. In our opinion, this strategy would, on one hand, produce more robust feature subset selections and, on the other hand, yield signatures more relevant for clinicians and biomedical researchers.
In this approach, a two-stage procedure is proposed in order to obtain robust feature subset selections with good performance rates in test future data. Bootstrap Cross-Validation (BCV) is used since its good behavior related to misclassification error with small samples has been previously demonstrated [23,24], including DNA microarray datasets. A novel feature scoring method within the GA is also proposed, taking into account biological information related to the studied disorders. One widely used source of biological information is the Gene Ontology (GO) database [25] since it
http://dx.doi.org/10.1016/j.jbi.2014.01.006
1532-0464/© 2014 Elsevier Inc. All rights reserved.
⇑ Corresponding author. Address: Department of Computer Languages and Computer Science, University of Málaga, Bulevar Louis Pasteur, 35, 29071 Málaga, Spain. Fax: +34 952131397.
Journal of Biomedical Informatics 49 (2014) 32–44
relevant information, in order to obtain a robust feature subset selection with good performance rates. The approach incorporates a novel feature scoring method within the GA, taking into account biological information about proteins (mostly enzymes) involved in the pathways of the studied disorders. The most remarkable finding is that our proposal improves the standard GA strategy regardless of the classification model used (LDA or SVM) in the three analyzed data sets (Table 4, Accuracy column), leading to statistically significant results in two of them (Leukemia and Lung). Even more important from the biological and clinic point of view, the robustness, in terms of the most selected genes that can be used to define gene signatures, is also improved in all three analyzed databases (Table 4, Robustness column). The main consequence of both facts is that the results of a KEGG-improved GA can provide more repetitive and consistent results that will facilitate the definition of gene signatures for further clinical diagnostic and prognostic. Moreover, the comparative analysis done among the KEGG-improved GA (Table 5) and three alternative filter methods (Cons, IG and ReliefF) demonstrated a similar or higher performance of the KEGG-improved GA, with the additional benefit of the biological information about the disease dynamics provided by this new GA-based strategy.
Regarding the summarizing results of Fig. 3 it can be seen that the best placed pathways in Table 4 provide more accurate and robust results. This opens the possibility of a deeper study of which KEGG-pathway(s) provide(s) the better results for any disease dataset. It should be noted that those feature subsets that include more genes of the analyzed pathways might indicate that this particular pathway has a greater biological impact on the disease.
But the proposed KEGG-improved GA can be used not only for diagnostic and prognostic, but also for biological knowledge discovery about the disease. Some of the most remarkable genes of Tables 6–8, although not originally present in the selected pathways, form part of the final selection and thus play an important role in obtaining robust and accurate prediction results. For example, in Table 6 (Leukemia set), the gene ZYX⁷ is repeatedly selected in all but one pathway; it encodes zyxin, an adhesion plaque protein that prompts the formation of actin-rich structures at which signal transduction complexes assemble. In the case of the lung database (Table 7), several adhesion pathways are involved in this cancer (cf., 04530, 04514) while the ZYX gene does not seem to be significant. The gene SEMA3C⁸ corresponds to a semaphorin, a protein including an immunoglobulin domain. It
Fig. 3. Accuracy and robustness obtained for the selected pathways for each considered database (Leukemia, Lung and Prostate). The graphs include the results obtained when using a strategy based only on genetic algorithms (GA) and on genetic algorithms plus the filtering approach (Filter + GA) (see text for more details).
7 http://www.genecards.org/cgi-bin/carddisp.pl?gene=ZYX.
8 http://www.genecards.org/cgi-bin/carddisp.pl?gene=SEMA3C.
0.8, 1, 2, 3, 5} and coef0, r = {0, 1, 2}. It should be noted that not all the parameters are required for each kernel type. For further information please visit [49].
Table 4 shows a comparison of the results after applying different strategies. The first three columns show the classification method, the dataset and the strategy used. The fourth column represents the number of genes, on average, after executing the method with fifty resamplings and five repetitions for each resampling. The Robustness column in Table 4 indicates the average frequency of the most selected genes, which are those that appear more than 5% of the time in any of the solutions. The last column shows the result of prediction of the disease over a test set not used during the whole process.
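One way to read the robustness measure described above is as the mean selection frequency of the genes that pass a frequency threshold across all solutions. The sketch below follows that reading only; the function name, the gene labels and the toy solutions are hypothetical, and the exact computation used in the paper may differ.

```python
from collections import Counter


def robustness(solutions, min_frequency=0.05):
    """Mean selection frequency of genes chosen in more than `min_frequency` of solutions.

    `solutions` is a list of gene collections, one per GA run or resampling.
    Returns the mean frequency and the per-gene frequencies that passed the threshold.
    """
    total = len(solutions)
    # Count each gene at most once per solution.
    counts = Counter(gene for solution in solutions for gene in set(solution))
    frequent = {
        gene: count / total
        for gene, count in counts.items()
        if count / total > min_frequency
    }
    if not frequent:
        return 0.0, {}
    return sum(frequent.values()) / len(frequent), frequent


# Hypothetical feature subsets from four runs; a high threshold is used only
# because the toy example is so small.
runs = [{"g1", "g2"}, {"g1", "g3"}, {"g1", "g2"}, {"g1"}]
score, frequent_genes = robustness(runs, min_frequency=0.3)
print(score, sorted(frequent_genes))
```

A value close to 1 would mean that the same genes are picked in almost every run, which is the property a clinically usable signature needs.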
The accuracy results for the LDA method are, in general, slightly better than those obtained by applying SVM, although LDA has lower complexity. This is not surprising since it has been shown before that simpler classification techniques can lead to competitive or even better results [50]. Therefore, the following analysis done to
Table 5. Performance comparison among the “Filter + GA + Pathway” combined strategy and three well-known filtering methods (Cons, IG and ReliefF). ACC and number of genes (mean ± std) are reported for LDA and SVM classifiers on the three analyzed datasets.
Strategy | LDA ACC | LDA #Genes | SVM ACC | SVM #Genes
Leukemia
Filter + GA + Pathway 05340 | 97.13 ± 1.16 | 31.83 ± 1.86 | 93.87 ± 2.02 | 30.82 ± 1.62
Filter + GA + Pathway 04640 | 96.38 ± 1.26 | 4.47 ± 0.71 | 94.86 ± 1.13 | 4.05 ± 0.80
Cons | 85.85 ± 8.55 | 1.84 ± 0.51 | 88.24 ± 5.95 | 1.84 ± 0.51
IG | 93.13 ± 4.40 | 9 ± 0 | 93.36 ± 4.33 | 9 ± 0
ReliefF | 93.31 ± 4.37 | 9 ± 0 | 90.48 ± 5.15 | 9 ± 0
Lung
Filter + GA + Pathway 04144 | 98.09 ± 0.68 | 4.29 ± 0.53 | 96.25 ± 0.97 | 4.15 ± 0.57
Filter + GA + Pathway 04530 | 98.26 ± 0.46 | 3.84 ± 0.46 | 97.05 ± 0.90 | 3.55 ± 0.64
Cons | 94.08 ± 3.36 | 1.84 ± 0.42 | 94.57 ± 2.55 | 1.84 ± 0.42
IG | 98.68 ± 1.51 | 22 ± 0 | 98.88 ± 1.39 | 22 ± 0
ReliefF | 97.89 ± 1.81 | 22 ± 0 | 98.47 ± 1.43 | 22 ± 0
Prostate
Filter + GA + Pathway 00980 | 91.37 ± 1.15 | 8.27 ± 0.83 | 87.96 ± 2.39 | 11.15 ± 2.10
Filter + GA + Pathway 00480 | 90.80 ± 1.36 | 14.30 ± 2.63 | 88.90 ± 2.29 | 26.24 ± 4.02
Cons | 81.51 ± 7.57 | 3.20 ± 0.67 | 82.49 ± 6.72 | 3.20 ± 0.67
IG | 91.66 ± 4.07 | 12 ± 0 | 85.86 ± 4.86 | 12 ± 0
ReliefF | 90.22 ± 4.53 | 12 ± 0 | 88.50 ± 5.17 | 12 ± 0
Fig. 2. Proportion of the final selected genes which belong to the analyzed pathway for the databases Leukemia, Lung and Prostate.
If I have determined a gene signature THEN
I can know which disease it is
Cancer signatures to reveal prognosis
29
A microRNA Signature Associated with Early Recurrence in Breast Cancer
Luis G. Perez-Rivas 1., Jose M. Jerez 2., Rosario Carmona 3, Vanessa de Luque 1, Luis Vicioso 4, M. Gonzalo Claros 3,5, Enrique Viguera 6, Bella Pajares 1, Alfonso Sanchez 1, Nuria Ribelles 1, Emilio Alba 1, Jose Lozano 1,5*
1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, Spain
4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain
6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de Málaga, Málaga, Spain
Abstract
Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value < 0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.
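The AUC quoted in the abstract measures how well a score separates two patient groups. As a minimal sketch, and not the authors' pipeline, the code below assumes a hypothetical risk score (the negative mean expression of the signature miRNAs, since they are reported as down-regulated in early relapse) and computes the AUC as the fraction of early-relapsing versus non-relapsing pairs that the score orders correctly. All expression values are invented for illustration.

```python
def signature_score(expression):
    """Hypothetical risk score: lower mean expression gives a higher score."""
    return -sum(expression) / len(expression)


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented log-ratio expression values, two miRNAs per tumor for brevity.
early_relapse = [[-1.0, -2.0], [-0.5, -0.5]]
no_relapse = [[0.5, 0.5], [-1.0, -0.5], [1.0, 1.0]]

positives = [signature_score(sample) for sample in early_relapse]
negatives = [signature_score(sample) for sample in no_relapse]
print(auc(positives, negatives))
```

An AUC of 0.5 corresponds to a score that is no better than chance, and 1.0 to perfect separation of the two groups.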
Citation: Perez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoS ONE 9(3): e91884. doi:10.1371/journal.pone.0091884
Editor: Sonia Rocha, University of Dundee, United Kingdom
Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014
Copyright: © 2014 Perez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a grant from the Spanish Society of Medical Oncology (SEOM, to NR) and by grants from the Spanish Ministerio de Economía (SAF2010-20203 to J.L and TIN2010-16556 to J.J) and from the Junta de Andalucía (TIN-4026, to JJ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
. These authors contributed equally to this work.
Introduction
Breast cancer comprises a group of heterogeneous diseases that can be classified based on both clinical and molecular features [1–5]. Improvements in the early detection of primary tumors and the development of novel targeted therapies, together with the systematic use of adjuvant chemotherapy, have drastically reduced mortality rates and increased disease-free survival (DFS) in breast cancer. Still, about one third of patients undergoing breast tumor excision will develop metastases, the major life-threatening event which is strongly associated with poor outcome [6,7].
The risk of relapse after tumor resection is not constant over time. A detailed examination of large series of long-term follow-up studies over the last two decades reveals a bimodal hazard function with two peaks of early and late recurrence occurring at 1.5 and 5 years, respectively, followed by a nearly flat plateau in which the risk of relapse tends to zero [8–10]. A causal link between tumor surgery and the bimodal pattern of recurrence has been proposed by some investigators (i.e. an iatrogenic effect) [11]. According to that model, surgical removal of the primary breast tumor would accelerate the growth of dormant metastatic foci by altering the balance between circulating pro- and anti-angiogenic factors [9,11–14]. Such a hypothesis is supported by the fact that the two peaks of relapse are observed regardless of factors other than surgery, such as the axillary nodal status, the type of surgery or the administration of adjuvant therapy. Although estrogen receptor (ER)-negative tumors are commonly associated with a higher risk of early relapse [15], the bimodal distribution pattern is observed independently of the hormone receptor status [16]. Other studies also suggest that the dynamics of tumor relapse may be a
PLOS ONE | www.plosone.org 1 March 2014 | Volume 9 | Issue 3 | e91884
A microRNA Signature Associated with Early Recurrence in Breast Cancer

Luis G. Perez-Rivas¹☯, Jose M. Jerez²☯, Rosario Carmona³, Vanessa de Luque¹, Luis Vicioso⁴, M. Gonzalo Claros³,⁵, Enrique Viguera⁶, Bella Pajares¹, Alfonso Sanchez¹, Nuria Ribelles¹, Emilio Alba¹, Jose Lozano¹,⁵*

1 Laboratorio de Oncología Molecular, Servicio de Oncología Médica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
2 Departamento de Lenguajes y Ciencias de la Computación, Universidad de Málaga, Málaga, Spain
3 Plataforma Andaluza de Bioinformática, Universidad de Málaga, Málaga, Spain
4 Servicio de Anatomía Patológica, Instituto de Biomedicina de Málaga (IBIMA), Hospital Universitario Virgen de la Victoria, Málaga, Spain
5 Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain
6 Departamento de Biología Celular, Genética y Fisiología Animal, Universidad de Málaga, Málaga, Spain
Abstract
Recurrent breast cancer occurring after the initial treatment is associated with poor outcome. A bimodal relapse pattern after surgery for primary tumor has been described with peaks of early and late recurrence occurring at about 2 and 5 years, respectively. Although several clinical and pathological features have been used to discriminate between low- and high-risk patients, the identification of molecular biomarkers with prognostic value remains an unmet need in the current management of breast cancer. Using microarray-based technology, we have performed a microRNA expression analysis in 71 primary breast tumors from patients that either remained disease-free at 5 years post-surgery (group A) or developed early (group B) or late (group C) recurrence. Unsupervised hierarchical clustering of microRNA expression data segregated tumors in two groups, mainly corresponding to patients with early recurrence and those with no recurrence. Microarray data analysis and RT-qPCR validation led to the identification of a set of 5 microRNAs (the 5-miRNA signature) differentially expressed between these two groups: miR-149, miR-10a, miR-20b, miR-30a-3p and miR-342-5p. All five microRNAs were down-regulated in tumors from patients with early recurrence. We show here that the 5-miRNA signature defines a high-risk group of patients with shorter relapse-free survival and has predictive value to discriminate non-relapsing versus early-relapsing patients (AUC = 0.993, p-value<0.05). Network analysis based on miRNA-target interactions curated by public databases suggests that down-regulation of the 5-miRNA signature in the subset of early-relapsing tumors would result in an overall increased proliferative and angiogenic capacity. In summary, we have identified a set of recurrence-related microRNAs with potential prognostic value to identify patients who will likely develop metastasis early after primary breast surgery.
Citation: Perez-Rivas LG, Jerez JM, Carmona R, de Luque V, Vicioso L, et al. (2014) A microRNA Signature Associated with Early Recurrence in Breast Cancer. PLoS ONE 9(3): e91884. doi:10.1371/journal.pone.0091884
Editor: Sonia Rocha, University of Dundee, United Kingdom
Received November 11, 2013; Accepted February 14, 2014; Published March 14, 2014
Copyright: © 2014 Perez-Rivas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
with tumors from relapse-free patients (group A, Table 2). MiR-625 was excluded from any further studies since RT-qPCR data showed minimal variation between groups (FC<2). Next, we re-clustered the 71 tumors based on the 5-miRNA signature. As shown in Figure 2, tumors from groups A and B were clearly segregated in two distinct clusters, which included most of the expected samples in each category: 78.8% of group A in cluster 1b (low risk) and 70.4% of group B in cluster 2b (high risk). Of note, the supervised analysis included most tumors from group C (72.8%) in cluster 1b, indicating that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence.
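The re-clustering step can be illustrated with a minimal sketch of agglomerative hierarchical clustering. This excerpt does not state which distance metric or linkage rule the authors used, so Euclidean distance and average linkage are assumptions here, and the tumor names and log2 expression values are hypothetical rather than study data.

```python
from itertools import combinations
from math import dist


def average_linkage(profiles, n_clusters=2):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise Euclidean distance until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def gap(a, b):
        # Average-linkage distance between two clusters (assumed linkage rule).
        total = sum(dist(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]


# Hypothetical log2 expression of the five signature miRNAs per tumor
# (order: miR-149, miR-10a, miR-20b, miR-30a-3p, miR-342-5p).
profiles = {
    "A1": (0.1, 0.3, -0.2, 0.0, 0.2),
    "A2": (0.4, 0.1, 0.2, 0.3, -0.1),
    "A3": (-0.1, 0.2, 0.1, 0.4, 0.0),
    "B1": (-2.1, -1.8, -2.4, -1.9, -2.2),
    "B2": (-1.7, -2.2, -2.0, -2.5, -1.6),
    "B3": (-2.4, -1.9, -1.8, -2.1, -2.0),
}

clusters = average_linkage(profiles, n_clusters=2)
print(sorted(clusters))
```

In this toy input the tumors with uniformly low signature expression end up in one cluster and the remaining tumors in the other, which mirrors the low-risk and high-risk split described in the text.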
Figure 2. A 5-miRNA signature is associated with early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples based on expression of the 5-miRNA signature. Note that lower expression levels of the 5-miRNA signature define a distinct cluster 2b, which mainly includes tumors from "high risk" patients (group B). On the contrary, most patients with good prognosis (group A) had tumors with normal or higher-than-normal levels of the 5-miRNA signature, defining a different cluster 1b ("low risk"). doi:10.1371/journal.pone.0091884.g002

Figure 3. The 5-miRNA signature discriminates patients with different RFS. A) Kaplan-Meier graph for the whole patient cohort included in this study. B) Those patients whose tumors showed an overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) were classified as "high risk" and their cumulative RFS was calculated (red line). RFS was also calculated for the remaining patients in the cohort ("low risk", black line). The Kaplan-Meier plot shows that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of early recurrence. doi:10.1371/journal.pone.0091884.g003

The 5-miRNA signature

MiR-149 was the most significant miRNA downregulated in group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR that regulates the expression of genes associated with cell cycle, invasion or migration, and its downregulation has been observed in several tumor diseases, including gastric cancer and breast cancer [70,77–81]. Down-regulation of miR-149 can occur epigenetically, by hypermethylation of the neighbouring CpG island [80], or by impaired processing of the pri-miR-149 precursor in a polymorphic variant [79]. In a recent work, downregulation of miR-149 has been associated with elevated levels of the transcription factor SP1, increased invasiveness and lower 5-year survival in colorectal cancer [80]. The p53 repressor ZBTB2 is also a target of miR-149 [81], which could explain, at least partially, its function as a TS-miR.
MiR-30a-3p is a member of the miR-30 family, which is associated with mesenchymal and stemness features [82,83] and is downregulated in several types of cancer [84–86]. Recently, Rodriguez-Gonzalez et al. have linked low levels of this miRNA to tamoxifen resistance in ER+ breast tumors. They have also proposed several targets of miR-30a-3p involved in proliferation and apoptosis, such as BCL2, NFκB, MAP2K4, PDGFA, CDK5R1 and CHN1 [87].

Regarding miR-20b, this miRNA is part of the miR-106b-363 cluster, which is frequently deregulated in cancer [88–91]. The levels of miR-20b associate with histological grade in breast cancer [92,93]. This miRNA has been involved in regulating several key proteins such as ESR1, HIF-1α, VEGF or STAT3 [92,94,95]. In particular, because it targets both HIF-1α and VEGF, and HIF-1α negatively controls miR-20b levels, it has been defined as an anti-angiogenic miRNA [95].

Both oncogenic and tumor suppressor features have been reported for miR-10a [96]. Thus, reduced expression of miR-10a has been associated with MAP3K7- and βTRC-mediated activation of the proinflammatory NFκB pathway [97]. Also, miR-10a downregulation represses differentiation in part by deregulation of the histone deacetylase HDAC4 [98] and positively affects invasiveness by de-repressing several members of the homeobox family of transcription factors [99].

Regarding miR-342-5p, it appears significantly deregulated only when we compare B vs AC (Table 2). Together with its counterpart (miR-342-3p), it is deregulated in inflammatory breast cancer [74] and its low expression has been associated with lower post-recurrence survival [100], likely because it targets AKT1 mRNA [101].

In sum, the available bibliographic data suggest that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR-342-5p in primary breast tumors could confer on them enhanced proliferative, angiogenic and invasive potentials.
Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors according to their 5-miRNA signature status in two different groups. One group included those tumors with all five miRNAs simultaneously downregulated (FC>2 and p<0.05), and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a "high risk" group of patients with a shorter RFS (Peto-Peto test with p-value = 0.02, when comparing the low vs high risk groups).
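As an illustration of how a Kaplan-Meier curve of relapse-free survival is obtained for each risk group, the following minimal sketch implements the product-limit estimator. The function name and the follow-up data are hypothetical and are not taken from the study; the Peto-Peto comparison between the two curves is not implemented here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of relapse-free survival.

    times  : follow-up time in months for each patient
    events : 1 if relapse was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival) pairs, one per distinct relapse time.
    """
    relapses = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = relapses.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical patients: (months of follow-up, relapse flag). NOT study data.
high_risk = ([6, 10, 14, 20, 24, 60], [1, 1, 1, 1, 0, 0])
low_risk = ([18, 52, 60, 60, 60, 60], [1, 1, 0, 0, 0, 0])

km_high = kaplan_meier(*high_risk)
km_low = kaplan_meier(*low_risk)
print(km_high)
print(km_low)
```

With these made-up inputs the "high risk" curve drops faster and ends lower than the "low risk" curve, which is the qualitative pattern that Figure 3B is described as showing.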
Using a Cox proportional hazard regression model, we also tested all possible combinations of different covariates (tumor subtype, patient age, tumor size, number of lymph nodes affected and the 5-miRNA signature) with early relapse (≤24 months) to identify the best prognostic factors. The best model according to the AIC criterion included the tumor size and expression of the 5-miRNA signature (data not shown). Only the 5-miRNA signature (all five miRNAs down-regulated) was statistically significant in the Cox model for the high risk group (p-value = 0.02 with HR = 2.73, 95% CI: 1.17–6.36). The 5-miRNA expression data were also used to develop a predictor model through bootstrapping over a Naive Bayes classifier (B = 200 with N = 71, see methods). The prognostic accuracy of the models was assessed by a receiver operating characteristic (ROC) test (Figure 4). Considered individually, miR-30a-3p and miR-10a showed a strikingly high Area Under the Curve (AUC) score (0.890 and 0.875, respectively). This result suggests that mRNA targets regulated by miR-30a-3p and miR-10a could potentially add a greater contribution to the final outcome of the disease. However, the 5-miRNA signature had the strongest predictive value to discriminate tumors from patients that will develop early relapse (group B) from those that will remain free of disease (group A), with an AUC = 0.993 (Figure 4). In summary, the 5-miRNA signature has a good performance as a risk predictor for early breast cancer recurrence.
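To make the AUC values easier to interpret, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen early-relapsing tumor receives a higher predicted risk than a randomly chosen relapse-free tumor. The scores are hypothetical classifier outputs, not results from the study, and the function name is illustrative.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities of early relapse. NOT study data.
group_b_scores = [0.9, 0.8, 0.7, 0.4]  # patients who relapsed early
group_a_scores = [0.6, 0.3, 0.2, 0.1]  # patients who stayed disease-free

result = auc(group_b_scores, group_a_scores)
print(result)  # 15 of the 16 pairs are ranked correctly -> 0.9375
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a value such as the reported 0.993 means that almost every early-relapsing tumor is ranked above almost every relapse-free tumor.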
Figure 4. Receiver operating characteristic curve (ROC) for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between patients who will develop early recurrence and those who will remain free of disease. Note that, although miR-30a-3p and miR-10a individually have a high area under the curve (AUC) score, the 5-miRNA signature has the strongest predictive value (AUC = 0.993) to discriminate those patients likely to recur early (group B in our cohort). doi:10.1371/journal.pone.0091884.g004

Candidate targets for the 5-miRNA signature. To extend our set of five miRNAs with regulatory information, we next took advantage of the existing public databases curating predicted and validated miRNA-target interactions (MTIs). In particular, validated targets were obtained from the miRTarBase and miRecords repositories (see methods). First, we created a biological network in Cytoscape [66] containing all the individual miRNAs included in the 5-miRNA signature (miR-149, miR-20b, miR-10a, miR-30a-3p and miR-342-5p). Next, we extended the network by adding H. sapiens MTI data retrieved from the indicated repositories and, finally, extended regulatory interaction networks (RIN) were generated and visualized in Cytoscape. Each regulatory interaction in the network consists of two nodes, a regulatory component (miRNA) and a target biomolecule (mRNA), connected through one directed edge. Figure 5 shows the extended network when the RIN threshold was set to 1 (i.e. each predicted target appears in at least one RIN). Thus, at RIN = 1 the network included 14 validated targets assigned to miR-20b (VEGFA, BAMBI, EFNB2, MYLIP, CRIM1, ARID4B, HIF1A, HIPK3, CDKN1A, PPARG, STAT3, MUC17, EPHB4, and ESR1), 7 validated targets assigned to miR-10a (HOXA1, NCOR2, SRSF1, SRSF10/TRA2B, MAP3K7, USF2 and BTRC) and 9 validated targets assigned to miR-30a-3p (THBS1, VEZT, TUBA1A, CDK6, WDR82, TMEM2, KRT7, CYR61 and SLC7A6) (Figure 5). Taking these results into account and considering that i) the extended network was constructed with the 5-miRNA signature as the network nodes and ii) all MTIs depicted in Figure 5 have been experimentally verified, we suggest that at least some of the 30 mRNAs (Figure 5) could be regulated in vivo by the 5-miRNA signature in early-relapsing tumors.
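The structure of the network described above can be sketched as a plain directed edge list, with one edge from each miRNA to each of its validated targets. The target lists are copied from the text; the variable names are illustrative, and the per-edge database of origin (miRTarBase or miRecords) is omitted because it is only shown in Figure 5. This is not the Cytoscape workflow the authors used.

```python
# Validated targets per signature miRNA, as listed in the text.
validated_targets = {
    "miR-20b": [
        "VEGFA", "BAMBI", "EFNB2", "MYLIP", "CRIM1", "ARID4B", "HIF1A",
        "HIPK3", "CDKN1A", "PPARG", "STAT3", "MUC17", "EPHB4", "ESR1",
    ],
    "miR-10a": [
        "HOXA1", "NCOR2", "SRSF1", "SRSF10/TRA2B", "MAP3K7", "USF2", "BTRC",
    ],
    "miR-30a-3p": [
        "THBS1", "VEZT", "TUBA1A", "CDK6", "WDR82", "TMEM2", "KRT7",
        "CYR61", "SLC7A6",
    ],
    # No experimentally validated targets were retrieved for these two miRNAs.
    "miR-149": [],
    "miR-342-5p": [],
}

# Each regulatory interaction is one directed edge: miRNA -> target mRNA.
edges = [
    (mirna, target)
    for mirna, targets in validated_targets.items()
    for target in targets
]

out_degree = {mirna: len(targets) for mirna, targets in validated_targets.items()}
target_nodes = {target for _, target in edges}

print(out_degree)
print(len(edges), "edges,", len(target_nodes), "distinct target mRNAs")
```

Counting the edges reproduces the figure of 30 mRNAs quoted in the text (14 + 7 + 9), and shows that no target is shared between two miRNAs in this list.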
To gain further insight into the molecular basis of the 5-miRNA signature prognostic value, we investigated the biological pathways associated with the 30 experimentally verified targets from Figure 5. To that end, we searched for Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the 30 targets as a whole set. It should be noted, however, that our restrictive approach (including only experimentally validated miRNA targets) left miR-149 and miR-342-5p out of the GO analysis and therefore, additional biological pathways could be affected by downregulation of the 5-miRNA

Figure 5. Prediction of mRNA targets likely to be regulated by the 5-miRNA signature. Biological networks were created using the Cytoscape software. Each network includes two types of nodes: the five individual miRNAs included in the 5-miRNA signature and their predicted mRNA targets (yellow circles), obtained from two different public databases (miRTarBase and miRecords). The number of databases included in the analysis defines the regulatory interaction network (RIN) threshold. Thus, at RIN = 1 the network includes all mRNA targets that appear in at least one database. The databases included in the RIN are identified by the color of the connecting arrows: miRTarBase (blue) and miRecords (red). Although many mRNAs are potential targets for miR-149 and miR-342-5p, the miRTarBase and miRecords versions included in this study did not reveal any experimentally validated targets for the two miRNAs. doi:10.1371/journal.pone.0091884.g005
http://www.scbi.uma.es
Cancer signatures to reveal prognosis
29
with tumors from relapse-free patients (group A, Table 2). MiR-625 was excluded from any further studies since RT-qPCR datashowed minimal variation between groups (FC,2). Next, we re-clustered the 71 tumors based on the 5-miRNA signature. Asshown in Figure 2, tumors from groups A and B were clearlysegregated in two distinct clusters, which included most of theexpected samples in each category: 78.8% group A in cluster 1b(low risk) and 70.4% group B in cluster 2b (high risk). Of note, thesupervised analysis included most tumors from group C (72.8%),in cluster 1b, indicating that the 5-miRNA signature specifically
discriminates tumors with an overall higher risk of earlyrecurrence.
The 5-miRNA signatureMiR-149 was the most significant miRNA downregulated in
group B, as determined by microarray hybridization and by RT-qPCR. This miRNA has been described as a TS-miR thatregulates the expression of genes associated with cell cycle,invasion or migration and its downregulation has been observed inseveral tumor diseases, including gastric cancer and breast cancer[70,77–81]. Down-regulation of miR-149 can occur epigenetical-
Figure 2. A 5-miRNA signature is associated with early recurrence in breast cancer. Hierarchical clustering of the 71 tumor samples basedon expression of the 5-miRNA signature. Note that lower expression levels of the 5-miRNA signature defines a distinct cluster 2b wich mainly includestumors from ‘‘high risk’’ patients (group B). On the contrary, most patients with good prognosis (group A) had tumors with normal or higher-thannormal levels of the 5-miRNA signature, defining a different cluster 1b (‘‘low risk’’).doi:10.1371/journal.pone.0091884.g002
Figure 3. The 5-miRNA signature discriminates patients with diferent RFS. A) Kaplan-Meier graph for the whole patient cohort included inthis study. B) Those patients whose tumors showed an overall down-regulation of the 5-miRNA signature (i.e. those from cluster 2b in Fig. 2) wereclassified as ‘‘high risk’’ (red line) and their cumulative RFS was calculated (red line). RFS was also calculated for the remaining patients in the cohort(‘‘low risk’’, black line). The Kaplan-Meier plot shows that the 5-miRNA signature specifically discriminates tumors with an overall higher risk of earlyrecurrence.doi:10.1371/journal.pone.0091884.g003
A miRNA Signature Predictive of Early Recurrence
PLOS ONE | www.plosone.org 7 March 2014 | Volume 9 | Issue 3 | e91884
ly, by hypermethylation of the neighbouring CpG island [80] or byimpaired processing of the pri-miR-149 precursor, in a polymor-phic variant [79]. In a recent work, downregulation of miR-149has been associated with elevated levels of the transcription factorSP1, increase invasiveness and lower 5-year survival in colorectalcancer [80]. The p53 repressor ZBTB2 is also a target of miR-149[81], which could explain, at least partially, its function as a TS-miR.
MiR-30a-3p is a member of the miR-30 family, which is associated with mesenchymal and stemness features [82,83] and is downregulated in several types of cancer [84–86]. Recently, Rodriguez-Gonzalez et al. have linked low levels of this miRNA to tamoxifen resistance in ER+ breast tumors. They have also proposed several targets of miR-30a-3p involved in proliferation and apoptosis, such as BCL2, NF-κB, MAP2K4, PDGFA, CDK5R1 and CHN1 [87].
Regarding miR-20b, this miRNA is part of the miR-106b-363 cluster, which is frequently deregulated in cancer [88–91]. The levels of miR-20b associate with histological grade in breast cancer [92,93]. This miRNA has been involved in regulating several key proteins such as ESR1, HIF-1α, VEGF or STAT3 [92,94,95]. In particular, because it targets both HIF-1α and VEGF, and HIF-1α negatively controls miR-20b levels, it has been defined as an anti-angiogenic miRNA [95].
Both oncogenic and tumor suppressor features have been reported for miR-10a [96]. Thus, reduced expression of miR-10a has been associated with MAP3K7- and βTRC-mediated activation of the proinflammatory NF-κB pathway [97]. Also, miR-10a downregulation represses differentiation, in part by deregulation of the histone deacetylase HDAC4 [98], and positively affects invasiveness by de-repressing several members of the homeobox family of transcription factors [99].
Regarding miR-342-5p, it appears significantly deregulated only when we compare B vs AC (Table 2). Together with its counterpart (miR-342-3p), it is deregulated in inflammatory breast cancer [74] and its low expression has been associated with lower post-recurrence survival [100], likely because it targets AKT1 mRNA [101].
In sum, the available bibliographic data suggest that down-regulation of miR-149, miR-30a-3p, miR-20b, miR-10a and miR-342-5p in primary breast tumors could confer on them enhanced proliferative, angiogenic and invasive potential.
Prognostic value of the 5-miRNA signature. The relationship between expression of the 5-miRNA signature and RFS was examined by a survival analysis. Figure 3A shows a Kaplan-Meier graph for the whole series of patients included in the study. Due to the intrinsic characteristics of the cohort, decreases in the RFS are only observed in the intervals 0–24 and 50–60 months (corresponding to groups B and C, respectively). We next grouped the tumors into two groups according to their 5-miRNA signature status. One group included those tumors with all five miRNAs simultaneously downregulated (FC > 2 and p < 0.05), and a second group included those tumors not having all five miRNAs downregulated. A survival analysis was performed using clinical data from the corresponding patients. As shown in Figure 3B, the Kaplan-Meier graphs for the two groups demonstrate that the 5-miRNA signature defines a “high risk” group of patients with a shorter RFS (Peto-Peto test with p-value = 0.02, when comparing the low vs high risk groups).
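To make the survival analysis more concrete, the product-limit (Kaplan-Meier) estimate behind plots such as Figure 3 can be sketched in a few lines. This is only a minimal illustration: the function name `kaplan_meier` and the follow-up data are hypothetical and not taken from the paper, and the Peto-Peto comparison between groups is not implemented here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up in months for each patient
    events -- 1 if recurrence was observed at that time, 0 if censored
    """
    recurrences = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # every patient leaves the risk set at their time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = recurrences.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical follow-up data (illustration only, not the study cohort).
times = [6, 12, 12, 20, 24, 30, 48, 60]
events = [1, 1, 0, 1, 0, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"{t:>3} months: RFS = {s:.3f}")
```

Censored patients (event = 0) lower the number at risk without producing a step in the curve, which is why the estimate only changes at observed recurrences.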
Using a Cox proportional hazard regression model, we also tested all possible combinations of different covariates (tumor subtype, patient age, tumor size, number of lymph nodes affected and the 5-miRNA signature) with early relapse (≤ 24 months) to identify the best prognostic factors. The best model according to the AIC criterion included the tumor size and expression of the 5-miRNA signature (data not shown). Only the 5-miRNA signature (all five miRNAs down-regulated) was statistically significant in the Cox model for the high risk group (p-value = 0.02 with HR = 2.73, 95% CI: 1.17–6.36). The 5-miRNA expression data were also used to develop a predictor model through bootstrapping over a Naive Bayes classifier (B = 200 with N = 71, see methods). The prognostic accuracy of the models was assessed by a receiver operating characteristic (ROC) test (Figure 4). Considered individually, miR-30a-3p and miR-10a showed a strikingly high Area Under the Curve (AUC) score (0.890 and 0.875, respectively). This result suggests that mRNA targets regulated by miR-30a-3p and miR-10a could potentially make a greater contribution to the final outcome of the disease. However, the 5-miRNA signature had the strongest predictive value to discriminate tumors from patients that will develop early relapse (group B) from those that will remain free of disease (group A), with an AUC = 0.993 (Figure 4). In summary, the 5-miRNA signature performs well as a risk predictor for early breast cancer recurrence.
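As a quick plausibility check on the Cox result quoted above, the standard error of the log hazard ratio, the Wald statistic and the two-sided p-value can be recovered from the reported HR and confidence interval. The sketch assumes a Wald-type interval that is symmetric on the log scale, which is an assumption about how the interval was computed; the helper name `wald_from_ci` is hypothetical.

```python
import math


def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover SE(log HR), Wald z and two-sided p from a 95% CI.

    Assumes the CI was built as exp(log(HR) +/- z_crit * SE).
    """
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard normal tail
    return se, z, p


# Values quoted in the text: HR = 2.73, 95% CI 1.17-6.36.
se, z, p = wald_from_ci(2.73, 1.17, 6.36)
print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

Under that assumption the recovered p-value is close to 0.02, in line with the value reported for the signature.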
Figure 4. Receiver operating characteristic (ROC) curve for early breast cancer recurrence by the 5-miRNA signature status. ROC curves generated using the prognosis information and expression levels of the 5-miRNA signature can discriminate between patients who will develop early recurrence and those who will remain free of disease. Note that, although miR-30a-3p and miR-10a individually have a high area under the curve (AUC) score, the 5-miRNA signature has the strongest predictive value (AUC = 0.993) to discriminate those patients likely to recur early (group B in our cohort). doi:10.1371/journal.pone.0091884.g004
Candidate targets for the 5-miRNA signature. To extend our set of five miRNAs with regulatory information, we next took advantage of the existing public databases curating predicted and validated miRNA-target interactions (MTIs). In particular, validated targets were obtained from the miRTarBase and miRecords repositories (see methods). First, we created a biological network in Cytoscape [66] containing all the individual miRNAs included in the 5-miRNA signature (miR-149, miR-20b, miR-10a, miR-30a-3p and miR-342-5p). Next, we extended the network by adding H. sapiens MTI data retrieved from the indicated repositories and, finally, extended regulatory interaction networks (RIN) were generated and visualized in Cytoscape. Each regulatory interaction in the network consists of two nodes, a regulatory component (miRNA) and a target biomolecule (mRNA), connected through one directed edge. Figure 5 shows the extended network when the RIN threshold was set to 1 (i.e. each predicted target appears in at least one RIN). Thus, at RIN = 1 the network included 14 validated targets assigned to miR-20b (VEGFA, BAMBI, EFNB2, MYLIP, CRIM1, ARID4B, HIF1A, HIPK3, CDKN1A, PPARG, STAT3, MUC17, EPHB4 and ESR1), 7 validated targets assigned to miR-10a (HOXA1, NCOR2, SRSF1, SRSF10/TRA2B, MAP3K7, USF2 and BTRC) and 9 validated targets assigned to miR-30a-3p (THBS1, VEZT, TUBA1A, CDK6, WDR82, TMEM2, KRT7, CYR61 and SLC7A6) (Figure 5). Taking these results into account and considering that i) the extended network was constructed with the 5-miRNA signature as the network nodes and ii) all MTIs depicted in Figure 5 have been experimentally verified, we suggest that at least some of the 30 mRNAs (Figure 5) could be regulated in vivo by the 5-miRNA signature in early-relapsing tumors.
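The regulatory interaction network described above (miRNA nodes, mRNA nodes, one directed edge per validated interaction) can be sketched as a plain edge list. The gene names are the ones listed in the text; the third list is taken to belong to miR-30a-3p, the signature member, and "SRSF10/TRA2B" is kept as a single entry so that the miR-10a count matches the stated 7. The variable names are illustrative, and this is not the Cytoscape workflow itself.

```python
# Validated targets per miRNA, as listed in the text. miR-149 and miR-342-5p
# had no experimentally validated targets in the databases used.
validated_targets = {
    "miR-20b": ["VEGFA", "BAMBI", "EFNB2", "MYLIP", "CRIM1", "ARID4B", "HIF1A",
                "HIPK3", "CDKN1A", "PPARG", "STAT3", "MUC17", "EPHB4", "ESR1"],
    "miR-10a": ["HOXA1", "NCOR2", "SRSF1", "SRSF10/TRA2B", "MAP3K7", "USF2",
                "BTRC"],
    "miR-30a-3p": ["THBS1", "VEZT", "TUBA1A", "CDK6", "WDR82", "TMEM2", "KRT7",
                   "CYR61", "SLC7A6"],
    "miR-149": [],
    "miR-342-5p": [],
}

# One directed edge (miRNA -> mRNA) per regulatory interaction.
edges = [(mirna, mrna)
         for mirna, targets in validated_targets.items()
         for mrna in targets]

out_degree = {mirna: len(targets) for mirna, targets in validated_targets.items()}
distinct_targets = {mrna for _, mrna in edges}

print(out_degree)
print(len(edges), "edges,", len(distinct_targets), "distinct mRNA targets")
```

Counting the edges reproduces the 14 + 7 + 9 = 30 validated targets mentioned in the text, with no target shared between two miRNAs of the signature.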
To gain further insight into the molecular basis of the 5-miRNA signature prognostic value, we investigated the biological pathways associated with the 30 experimentally verified targets from Figure 5. To that end, we searched for Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways associated with the 30 targets as a whole set. It should be noted, however, that our restrictive approach, which includes only experimentally validated miRNA targets, left miR-149 and miR-342-5p out of the GO analysis and, therefore, additional biological pathways could be affected by downregulation of the 5-miRNA
Figure 5. Prediction of mRNA targets likely to be regulated by the 5-miRNA signature. Biological networks were created using the Cytoscape software. Each network includes two types of nodes: the five individual miRNAs included in the 5-miRNA signature and their predicted mRNA targets (yellow circles), obtained from two different public databases (miRTarBase and miRecords). The number of databases included in the analysis defines the regulatory interaction network (RIN) threshold. Thus, at RIN = 1 the network includes all mRNA targets that appear in at least one database. The databases included in the RIN are identified by the color of the connecting arrows: miRTarBase (blue) and miRecords (red). Although many mRNAs are potential targets for miR-149 and miR-342-5p, the miRTarBase and miRecords versions included in this study did not reveal any experimentally validated targets for these two miRNAs. doi:10.1371/journal.pone.0091884.g005
If I know which genes ARE expressed THEN
I can know which output WILL be obtained
http://www.scbi.uma.es
Characterization of complex variations in cancer
30
© 2014 Nature America, Inc. All rights reserved.
NATURE BIOTECHNOLOGY ADVANCE ONLINE PUBLICATION
ANALYSIS
for structural variants of different sizes (Supplementary Table 2). For the present comparison, we ran them as described in their companies’ corresponding publication or website.
We first observed that the calling of somatic SNVs was nearly optimal and within the same range in Mutect and SMUFIN, with sensitivities of 97% and 92%, and specificities of 93% and 99%, respectively (Table 1 and Supplementary Table 3). On the other hand, the calling efficiency of somatic structural variants varied greatly between different methods, revealing clear differences when compared to SMUFIN. Some methods reached reasonable levels of sensitivity when the evaluation was restricted to the range of structural variants they were designed to detect (Pindel and Delly), but these dropped drastically when compared against the complete catalog of structural variations in the tumor (Supplementary Table 4). By contrast, SMUFIN was
[Figure, panels a–d: workflow diagram. Recoverable labels: tumor and normal genome sequencing (FASTQ files, quality filters); comparison of normal and tumor reads and identification of potential breakpoints (quaternary sequence tree, reads in tumor-specific branches, tumor-specific reads with potential breakpoints); construction of breakpoint blocks (single orientation breakpoint, double orientation breakpoint, undefined breakpoint blocks, overlapping and complementary reads from the normal genome); definition and classification of variants (small variants with n < read size: SNVs, short insertions, deletions, inversions; breakpoint and variant sequence for large SVs > read size, with extension of the variant and normal sequences 100 nt on each side of the breakpoint); assigning reference coordinates (mapping of normal sequences with BWA, and independent mapping of the normal sequences flanking the breakpoint, against the reference genome).]
If I know the polymorphisms of a person THEN
I can predict which disease that person WILL suffer from
http://www.scbi.uma.es
Personalised medicine
31
A needle in a haystack WAS FOUND
http://www.scbi.uma.es
Linking unrelated diseases
32
Alzheimer patients tend to be free of cancer, and cancer patients tend to be free of mental diseases
Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses
Kristina Ibáñez (1), César Boullosa (1), Rafael Tabarés-Seisdedos (2), Anaïs Baudot (3,*), Alfonso Valencia (1,*)
(1) Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain; (2) Department of Medicine, University of Valencia, CIBERSAM, INCLIVA, Valencia, Spain; (3) Aix-Marseille Université, CNRS, I2M, UMR 7373, Marseille, France
Abstract
There is epidemiological evidence that patients with certain Central Nervous System (CNS) disorders have a lower than expected probability of developing some types of Cancer. We tested here the hypothesis that this inverse comorbidity is driven by molecular processes that are common to CNS disorders and Cancers and that are deregulated in opposite directions. We conducted transcriptomic meta-analyses of three CNS disorders (Alzheimer’s disease, Parkinson’s disease and Schizophrenia) and three Cancer types (Lung, Prostate, Colorectal) previously described with inverse comorbidities. A significant overlap was observed between the genes upregulated in CNS disorders and downregulated in Cancers, as well as between the genes downregulated in CNS disorders and upregulated in Cancers. We also observed expression deregulations in opposite directions at the level of pathways. Our analysis points to specific genes and pathways, the upregulation of which could increase the incidence of CNS disorders and simultaneously lower the risk of developing Cancer, while the downregulation of another set of genes and pathways could contribute to a decrease in the incidence of CNS disorders while increasing the Cancer risk. These results reinforce the previously proposed involvement of the PIN1 gene and the Wnt and P53 pathways, and reveal potential new candidates, in particular related to protein degradation processes.
Citation: Ibáñez K, Boullosa C, Tabarés-Seisdedos R, Baudot A, Valencia A (2014) Molecular Evidence for the Inverse Comorbidity between Central Nervous System Disorders and Cancers Detected by Transcriptomic Meta-analyses. PLoS Genet 10(2): e1004173. doi:10.1371/journal.pgen.1004173
Editor: Marshall S. Horwitz, University of Washington, United States of America
Received September 16, 2013; Accepted December 30, 2013; Published February 20, 2014
Copyright: © 2014 Ibáñez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a Fellowship from Obra Social la Caixa grant to KI (http://obrasocial.lacaixa.es/laCaixaFoundation/home_en.html), FPI grant BES-2008-006332 to CB and grant BIO2012 to AV Group. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (AB); [email protected] (AV)
Kristina Ibáñez and César Boullosa contributed equally to this work.
Introduction
Epidemiological evidence points to a lower-than-expected probability of developing some types of Cancer in certain CNS disorders, including Alzheimer’s disease (AD), Parkinson’s disease (PD) and Schizophrenia (SCZ) [1–6]. Our current understanding of such inverse comorbidities suggests that this phenomenon is influenced by environmental factors, drug treatments and other aspects related to disease diagnosis. Genetics can additionally contribute to the inverse comorbidity between complex diseases, together with these external factors (for review, see [3–7]). In particular, we propose the deregulation in opposite directions of a common set of genes and pathways as an underlying cause of inverse comorbidities.
To investigate the biological plausibility of this hypothesis, a basic initial step is to establish the existence of inverse gene expression deregulations (i.e., down- versus up-regulations) in CNS disorders and Cancers. Towards this objective, we have performed integrative meta-analyses of collections of gene expression data, publicly available for AD, PD and SCZ, and Lung (LC), Colorectal (CRC) and Prostate (PC) Cancers. Clinical and epidemiological data previously reported inverse comorbidities for these complex disorders, according to population studies assessing the Cancer risks among patients with CNS disorders [8–17].
Results and Discussion
For each CNS disorder and Cancer type independently, we undertook meta-analyses from a large collection of microarray gene expression datasets to identify the genes that are significantly up- and down-regulated in disease when compared with their corresponding healthy control samples (Differentially Expressed Genes, DEGs; FDR-corrected p-value (q-value) < 0.05, see Methods and Table S1). Then, the DEGs of the CNS disorders and Cancer types were compared to each other. There were significant overlaps (Fisher’s exact test, corrected p-value (q-value) < 0.05, see Methods) between the DEGs upregulated in CNS disorders and those downregulated in Cancers. Similarly, DEGs downregulated in CNS disorders overlapped significantly with DEGs upregulated in Cancers (Figure 1A). Significant overlaps between DEGs deregulated in opposite directions in CNS disorders and Cancers are still observed when setting more stringent cutoffs for the detection of DEGs (q-values lower than 0.005, 0.0005, 0.00005 and 0.000005, Figure S1). A significant overlap between DEGs deregulated in the same direction was only identified in the case of CRC and PD upregulated genes (Figure 1A).
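The overlap test mentioned above can be illustrated with a one-sided Fisher's exact test written directly from the hypergeometric distribution, followed by a Bonferroni correction (the correction named in the Figure 1 caption). This is a sketch under stated assumptions: the function names, the gene universe size, the set sizes and the number of comparisons are all hypothetical, and the paper's actual procedure is the one described in its Methods.

```python
from math import comb


def fisher_overlap_pvalue(universe, size_a, size_b, overlap):
    """One-sided Fisher's exact test for the overlap of two gene sets.

    Returns P(X >= overlap), where X is the overlap expected by chance
    between sets of size_a and size_b drawn from `universe` genes.
    """
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 20,000 genes measured, 500 DEGs up in a CNS disorder,
# 800 DEGs down in a cancer, 60 genes in common (about 20 expected by chance).
p = fisher_overlap_pvalue(20000, 500, 800, 60)
q = bonferroni(p, n_tests=36)  # number of pairwise comparisons is assumed
print(f"p = {p:.3g}, Bonferroni-corrected = {q:.3g}")
```

Using exact integer binomials keeps the tail probability accurate even when it is very small, at the cost of some speed for large gene universes.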
A molecular interpretation of the inverse comorbidity between CNS disorders and Cancers could be that the downregulation of certain
PLOS Genetics | www.plosgenetics.org | February 2014 | Volume 10 | Issue 2 | e1004173
Figure 1. Comparisons of Differentially Expressed Genes (DEGs). (A) Comparisons of DEGs associated with Central Nervous System (CNS) disorders and Cancers. The DEGs identified as significantly up- and down-regulated (q-value < 0.05) after gene expression meta-analysis in each CNS disorder (Alzheimer’s Disease, AD; Parkinson’s Disease, PD; and Schizophrenia, SCZ) and Cancer type (Lung Cancer, LC; Colorectal Cancer, CRC; and Prostate Cancer, PC) are compared to each other. (B) Comparisons of DEGs between CNS disorders, Cancers and Asthma, HIV, Malaria, Dystrophy and Sarcoidosis. The DEGs identified as significantly up- and down-regulated (q-value < 0.05) after gene expression meta-analysis in each CNS disorder (AD, PD and SCZ), Cancer type (LC, CRC and PC), and in Asthma, HIV, Malaria, Dystrophy and Sarcoidosis, are compared to each other. Cells are coloured according to the significance of the overlaps (Fisher’s exact test, Bonferroni correction for multiple testing, see Methods). Grey cells correspond to non-significant overlaps (q-value > 0.05). doi:10.1371/journal.pgen.1004173.g001
Table 1. DEGs significantly downregulated in the three CNS disorders and upregulated in the three Cancer types (q-value < 0.05).
PPIAP11, IARS, GGCT, NME2, GAPDHP1, CDC123, PSMD8, MRPS33, FIBP, OAZ2, IARS2, SLC35B1, APOO, TMEM189-UBE2V1, VDAC1, TMED3, SMS, DNM1L, PRPS1, SRSF2, TMEM14D, TOMM70A, ATP6V1C1, NUP93, MRPL15, UBA5, PPIH, SMYD3, NIT2, SRD5A1, NUDT21, MRPL12, EEF1E1, MRPS7, TTPAL, BZW1P2, RP11-552M11.4, TSN, MECR, ZWINT, RPRD1A, UCHL5, NHP2P2, TFB2M, FEN1, CGREF1, IMPAD1, ARL1, ACLY, MRPL42, LSM4, KPNA1, TIMM23B, RP11-164O23.5, RP11-762H8.2, FARSA, MRPL4, API5, RP3-425P12.4, RFC3, RANBP9, TFCP2, GMDS, CCNB1, TMEM177, GUF1, HSPA13, NMD3, GCFC2, TUBGCP5, TBCE, YKT6, PHF14, BRCC3
doi:10.1371/journal.pgen.1004173.t001
Inverse Comorbidity among Cancer and CNS Disorders
Comparing differentially expressed genes
SCZ: schizophrenia; AD: Alzheimer disease; PD: Parkinson disease; CRC: colorectal cancer; PC: prostate cancer; LC: lung cancer
http://www.scbi.uma.es
Mental and cancer diseases are really connected
33
AD and PD, and upregulated in CRC (Reactome database; Figure S2).
Aside from the Wnt and p53 pathways, our analysis reveals other pathways related to protein folding and protein degradation that display patterns of downregulation in CNS disorders and upregulation in Cancers, and that may be relevant for inverse comorbidity. For instance, the Ubiquitin/Proteasome system is consistently downregulated in CNS disorders and upregulated in Cancers according to the three pathway databases analyzed (Figure 2, Figure S2, Table S3). The inverse relationship between the levels of expression deregulation of these pathways possibly suggests opposite roles in CNS disorders and Cancers.
A detailed examination of the KEGG pathways deregulated in opposite directions in CNS disorders and Cancers finally revealed that 89% of the KEGG pathways that were upregulated in Cancers and downregulated in CNS disorders are related to Metabolism and Genetic Information Processing (Figure 2, Figure 3). By contrast, the pathways downregulated in Cancers and upregulated in CNS disorders are related to the cell’s communication with its environment (Environmental Information Processing and Organismal Systems; Figure 2, Figure 3). Hence, global regulations of cellular activity may account for a protective effect between inversely comorbid diseases.
Table 2. DEGs significantly upregulated in the three CNS disorders and downregulated in the three Cancer types (q-value < 0.05).
MT2A, MT1X, NFKBIA, AC009469.1, DHRS3, CDKN1A, TNFRSF1A, CRYBG3, IL4R, MT1M, FAM107A, ITPKC, MID1, IL11RA, AHNAK, KAT2B, BCL2, PTH1R, NFASC
doi:10.1371/journal.pgen.1004173.t002
Figure 2. KEGG pathways significantly deregulated in Central Nervous System (CNS) disorders and Cancer types. KEGG pathways [24] significantly up- and downregulated in each disease were identified using the GSEA method [34] (q-value < 0.05). The significant pathways were compared between the 6 diseases and combined in a network representation. Node pie charts are coloured according to the pathway status as Cancer upregulated (yellow), Cancer downregulated (blue), CNS disorder upregulated (green) and CNS disorder downregulated (red). The green/blue and yellow/red associations thus correspond to pathways deregulated in opposite directions in CNS disorders and Cancers. Pathway labels are coloured according to their classifications provided by KEGG [24], as: Metabolism (green), Genetic Information Processing (yellow), Cellular Process (pink), Environmental Information Processing (red) and Organismal Systems (dark red). All networks are available at bioinfo.cnio.es/people/cboullosa/validation/cytoscape/Ibanezetal.zip, in Cytoscape format (http://www.cytoscape.org/). doi:10.1371/journal.pgen.1004173.g002
Typical cancer functions
Typical mental disease functions
http://www.scbi.uma.es
Mental and cancer diseases are really connected
33
AD and PD, and upregulated in CRC (Reactome database;Figure S2).
Aside the Wnt and p53 pathways, our analysis reveals otherpathways related to protein folding and protein degradationdisplaying patterns of downregulation in CNS disorders andupregulation in Cancers, and that may be relevant for inversecomorbidity. For instance, the Ubiquitin/Proteasome system isconsistently downregulated in CNS disorders and upregulated inCancers according to the three pathway databases analyzed(Figure 2, Figure S2, Table S3). The inverse relationshipbetween the levels of expression deregulations of these pathwayspossibly suggests opposite roles in CNS disorders and Cancers.
A detailed examination of the KEGG pathways deregulated inopposite directions in CNS disorders and Cancers finallyrevealed that 89% of the KEGG pathways that wereupregulated in Cancers and downregulated in CNS disordersare related to Metabolism and Genetic Information Processing(Figure 2, Figure 3). By contrast, the pathways downregulatedin Cancers and upregulated in CNS disorders are related to thecell’s communication with its environment (EnvironmentalInformation Processing and Organismal System; Figure 2,Figure 3). Hence, global regulations of cellular activity mayaccount for a protective effect between inversely comorbiddiseases.
Table 2. DEGs significantly upregulated in the three CNS disorders and downregulated in the three Cancer types (q-value < 0.05).
MT2A, MT1X, NFKBIA, AC009469.1, DHRS3, CDKN1A, TNFRSF1A, CRYBG3, IL4R, MT1M, FAM107A, ITPKC, MID1, IL11RA, AHNAK, KAT2B, BCL2, PTH1R, NFASC
doi:10.1371/journal.pgen.1004173.t002
74 genes: ↑↑ cancer, ↓↓ mental disease
19 genes: ↓↓ cancer, ↑↑ mental disease
Since 93 genes (74 + 19) are inversely expressed in cancer and CNS disorders
THEN I can explain the inverse comorbidity between the two groups of diseases
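The counting behind this slide can be sketched with plain set operations. The snippet below is only an illustration, assuming each disease is summarised as a set of upregulated and a set of downregulated gene symbols. The helper name `inversely_expressed`, the label `cancer_2`, the `GENE_X`-style symbols and all up/down assignments are invented for the example and are not data from the PLOS Genetics study.

```python
def inversely_expressed(cns, cancer):
    """Return two sets of inversely expressed genes.

    cns and cancer map a disease name to {"up": set, "down": set}.
    First set: up in every CNS disorder and down in every cancer.
    Second set: down in every CNS disorder and up in every cancer.
    """
    cns_up = set.intersection(*(d["up"] for d in cns.values()))
    cns_down = set.intersection(*(d["down"] for d in cns.values()))
    cancer_up = set.intersection(*(d["up"] for d in cancer.values()))
    cancer_down = set.intersection(*(d["down"] for d in cancer.values()))
    return cns_up & cancer_down, cns_down & cancer_up


# Toy input (hypothetical): the per-disease assignments are made up.
cns = {
    "AD": {"up": {"MT2A", "BCL2", "IL4R"}, "down": {"GENE_X", "GENE_Y"}},
    "PD": {"up": {"MT2A", "BCL2"}, "down": {"GENE_X"}},
}
cancer = {
    "CRC": {"up": {"GENE_X", "GENE_Z"}, "down": {"MT2A", "BCL2", "IL4R"}},
    "cancer_2": {"up": {"GENE_X"}, "down": {"MT2A", "BCL2"}},
}

up_cns_down_cancer, down_cns_up_cancer = inversely_expressed(cns, cancer)
print(sorted(up_cns_down_cancer))  # genes up in CNS disorders, down in cancers
print(sorted(down_cns_up_cancer))  # genes down in CNS disorders, up in cancers
```

With the real gene lists, the sizes of the two returned sets would correspond to the 19 and 74 genes shown on the slide.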
After basic research, translational research is easy
34
Higher vertebrates have conserved genomes
35
The bonobo genome compared with the chimpanzee and human genomes. Kay Prüfer et al. Nature 486, 527–531 (28 June 2012)
The zebrafish reference genome sequence and its relationship to the human genome. Kerstin Howe et al. Nature 496, 498–503 (25 April 2013)
70% of protein-coding human genes are related to genes found in the zebrafish
84% of genes known to be associated with human disease have a zebrafish counterpart
Chimpanzee
Genome plasticity in bacteria
36
Estimating the size of the bacterial pan-genome. Lapierre & Gogarten. Trends in Genetics 25(3), 2009, Pages 107–110
Pangenomics – an avenue to improved industrial starter cultures and probiotics. Garrigues et al. Current Opinion in Biotechnology 2013, 24:187–191
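The pan-genome notion behind these two references can be illustrated with a few set operations. This is only a sketch under simple assumptions: each strain is reduced to a set of gene-family identifiers, and the strain names, gene labels and the helper `pan_and_core` are invented. Real pan-genome size estimates rely on ortholog clustering and statistical extrapolation, which are not shown here.

```python
def pan_and_core(genomes):
    """Split gene families into pan-genome, core and accessory sets.

    genomes maps a strain name to the set of gene families it carries.
    """
    gene_sets = list(genomes.values())
    pan = set.union(*gene_sets)           # present in at least one strain
    core = set.intersection(*gene_sets)   # present in every strain
    accessory = pan - core                # present in only some strains
    return pan, core, accessory


# Hypothetical strains and gene families, for illustration only.
genomes = {
    "strain_A": {"dnaA", "rpoB", "gyrA", "toxin1"},
    "strain_B": {"dnaA", "rpoB", "gyrA", "phage7"},
    "strain_C": {"dnaA", "rpoB", "gyrA", "toxin1", "plasmidX"},
}

pan, core, accessory = pan_and_core(genomes)
print(len(pan), len(core), len(accessory))
```

Adding a new strain can only enlarge the pan-genome and shrink (or keep) the core, which is one simple way to picture genome plasticity.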
Minimum number of genes for a living organism
37
1354 genes
Giovannoni et al., (2005) Science 309: 1242-1245
500 genes
If I know the minimal gene number of an organism THEN
I can design artificial organisms for biotechnological purposes
There aren’t new genes but duplicated genes
38
The number of gene families plateaus with genome size
Figure 3.15 Because many genes are duplicated, the number of different gene families is much less than the total number of genes. The histogram compares the total number of genes with the number of distinct gene families.
Figure 3.16 The proportion of genes that are present in multiple copies increases with genome size in higher eukaryotes.
their exons. (A family of related genes arises by duplication of an ancestral gene followed by accumulation of changes in sequence between the copies. Most often the members of a family are related but not identical.) The number of types of genes is calculated by adding the number of unique genes (where there is no other related gene at all) to the numbers of families that have 2 or more members.
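The counting rule in the paragraph above (unique genes plus families with two or more members) can be written out as a short sketch. The gene and family names below are hypothetical, and the mapping from gene to family is assumed to be already known, for example from a prior clustering step that is not shown.

```python
from collections import Counter


def count_gene_types(gene_to_family):
    """Count gene types as unique genes plus families with >= 2 members."""
    family_sizes = Counter(gene_to_family.values())
    unique_genes = sum(1 for size in family_sizes.values() if size == 1)
    multi_member_families = sum(1 for size in family_sizes.values() if size >= 2)
    return unique_genes + multi_member_families


# Hypothetical genome: 7 genes grouped into 4 families.
gene_to_family = {
    "geneA1": "famA", "geneA2": "famA", "geneA3": "famA",
    "geneB1": "famB", "geneB2": "famB",
    "geneC": "famC",   # unique gene, no related copy
    "geneD": "famD",   # unique gene, no related copy
}

total_genes = len(gene_to_family)
gene_types = count_gene_types(gene_to_family)
print(total_genes, gene_types)
```

In this toy genome duplication inflates the total gene count while the number of gene types stays lower, which is the effect the excerpt describes for higher eukaryotes.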
Figure 3.15 compares the total number of genes with the number of distinct families in each of six genomes. In bacteria, most genes are unique, so the number of distinct families is close to the total gene number. The situation is different even in the lower eukaryote S. cerevisiae, where there is a significant proportion of repeated genes. The most striking effect is that the number of genes increases quite sharply in the higher eukaryotes, but the number of gene families does not change much.
Figure 3.16 shows that the proportion of unique genes drops sharply with genome size. When genes are present in families, the number of members in a family is small in bacteria and lower eukaryotes, but is large in higher eukaryotes. Much of the extra genome size of Arabidopsis is accounted for by families with >4 members.
If every gene is expressed, the total number of genes will account for the total number of proteins required to make the organism (the proteome). However, two effects mean that the proteome is different from the total gene number. Because genes are duplicated, some of them code for the same protein (although it may be expressed in a different time or place) and others may code for related proteins that again play the same role in different times or places. And because some genes can produce more than one protein by means of alternative splicing, the proteome can be larger than the number of genes.
What is the core proteome (the basic number of the different types of proteins in the organism)? A minimum estimate is given by the number of gene families, ranging from 1400 in the bacterium, >4000 in the yeast, and a range of 11,000-14,000 for the fly and worm.
What is the distribution of the proteome among types of proteins? The 6000 proteins of the yeast proteome include 5000 soluble proteins and 1000 transmembrane proteins. About half of the proteins are cytoplasmic, a quarter are in the nucleolus, and the remainder are split between the mitochondrion and the ER/Golgi system.
How many genes are common to all organisms (or to groups such as bacteria or higher eukaryotes) and how many are specific for the individual type of organism? Figure 3.17 summarizes the comparison between yeast, worm, and fly. Genes that code for corresponding proteins in different organisms are called orthologs. Operationally, we usually reckon that two genes in different organisms can be considered to provide corresponding functions if their sequences are similar over >80% of the length. By this criterion, ~20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when fly and worm are compared, probably representing the addition of gene functions that are common to multicellular eukaryotes. This still leaves a major proportion of genes as coding for proteins that are required specifically by either flies or worms, respectively.
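The operational ortholog criterion above (similarity over more than 80% of the length) can be sketched as follows. This is an illustration under stated assumptions: similarity hits are taken as already computed by some alignment tool, each hit is a hypothetical tuple (gene, other species, aligned length, gene length, partner length), and the length of the longer sequence is used as the reference, which is one possible reading of "the length". All names and numbers are invented.

```python
def covers_enough(aligned_length, length_a, length_b, min_fraction=0.8):
    """True if the similar region spans more than min_fraction of the longer sequence."""
    return aligned_length / max(length_a, length_b) > min_fraction


def genes_with_orthologs_in(hits, species_required, min_fraction=0.8):
    """Return the genes that have a qualifying hit in every required species."""
    found = {}
    for gene, species, aligned, len_gene, len_other in hits:
        if covers_enough(aligned, len_gene, len_other, min_fraction):
            found.setdefault(gene, set()).add(species)
    return {gene for gene, seen in found.items() if species_required <= seen}


# Hypothetical hits: (fly gene, species, aligned length, fly length, partner length)
hits = [
    ("fly1", "yeast", 900, 1000, 1050),  # passes the 80% threshold
    ("fly1", "worm", 950, 1000, 1000),   # passes
    ("fly2", "worm", 850, 1000, 900),    # passes
    ("fly2", "yeast", 400, 1000, 800),   # fails: similar region too short
    ("fly3", "yeast", 300, 1000, 1200),  # fails
]
fly_genes = {"fly1", "fly2", "fly3", "fly4", "fly5"}

in_both = genes_with_orthologs_in(hits, {"yeast", "worm"})
fraction = len(in_both) / len(fly_genes)
print(sorted(in_both), fraction)
```

Changing `min_fraction` or the choice of reference length changes which genes count as orthologs, so percentages like those quoted in the excerpt depend on the exact criterion used.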
The proteome can be deduced from the number and structures of genes, and can also be directly measured by analyzing the total protein content of a cell or organism. By such approaches, some proteins have been identified that were not suspected on the basis of genome analysis and that have therefore led to the identification of new genes. Several methods are used for large scale analysis of proteins. Mass spectrometry can be used for separating and identifying proteins in a mixture obtained directly from cells or tissues. Hybrid proteins bearing tags can be obtained by expression of cDNAs made by linking the sequences of open reading frames to appropriate expression vectors that incorporate the sequences for affinity tags. This allows array analysis to be used to analyze
62 CHAPTER 3 The content of the genome
WHOLE-GENOME SEQUENCING HAS GIVEN UNIQUE INSIGHTS INTO GENOME STRUCTURE
GENOME SIZE HAS NOTHING TO DO WITH GENE NUMBER
VARIABILITY AMONG GENOMES ARISES FROM A NUMBER OF DIFFERENT SOURCES
HIGH-THROUGHPUT TECHNOLOGIES OVERVIEW
We cannot yet predict what kind of organism a given genome sequence will produce
39
?
A living being is more than the sum of its components
We can now relate facial shapes with genes
40
Modeling 3D Facial Shape from DNA
Peter Claes1, Denise K. Liberton2, Katleen Daniels1, Kerri Matthes Rosana2, Ellen E. Quillen2,
Laurel N. Pearson2, Brian McEvoy3, Marc Bauchet2, Arslan A. Zaidi2, Wei Yao2, Hua Tang4,
Gregory S. Barsh4,5, Devin M. Absher5, David A. Puts2, Jorge Rocha6,7, Sandra Beleza4,8,
Rinaldo W. Pereira9, Gareth Baynam10,11,12, Paul Suetens1, Dirk Vandermeulen1, Jennifer K. Wagner13,
James S. Boster14, Mark D. Shriver2*
1 Medical Image Computing, ESAT/PSI, Department of Electrical Engineering, KU Leuven, Medical Imaging Research Center, KU Leuven & UZ Leuven, iMinds-KU Leuven
Future Health Department, Leuven, Belgium, 2 Department of Anthropology, Penn State University, University Park, Pennsylvania, United States of America, 3 Smurfit
Institute of Genetics, Dublin, Ireland, 4 Department of Genetics, Stanford University, Palo Alto, California, United States of America, 5 HudsonAlpha Institute for
Biotechnology, Huntsville, Alabama, United States of America, 6 CIBIO: Centro de Investigacao em Biodiversidade e Recursos Geneticos, Universidade do Porto, Porto,
Portugal, 7 Departamento de Biologia, Faculdade de Ciencias, Universidade do Porto, Porto, Portugal, 8 IPATIMUP: Instituto de Patologia e Imunologia Molecular da
Universidade do Porto, Porto, Portugal, 9 Programa de Pos-Graduacao em Ciencias Genomicas e Biotecnologia, Universidade Catolica de Brasilia, Brasilia, Brasil, 10 School
of Paediatrics and Child Health, University of Western Australia, Perth, Australia, 11 Institute for Immunology and Infectious Diseases, Murdoch University, Perth, Australia,
12 Genetic Services of Western Australia, King Edward Memorial Hospital, Perth, Australia, 13 Center for the Integration of Genetic Healthcare Technologies, University of
Pennsylvania, Philadelphia, Pennsylvania, United States of America, 14 Department of Anthropology, University of Connecticut, Storrs, Connecticut, United States of
America
Abstract
Human facial diversity is substantial, complex, and largely scientifically unexplained. We used spatially dense quasi-landmarks to measure face shape in population samples with mixed West African and European ancestry from three locations (United States, Brazil, and Cape Verde). Using bootstrapped response-based imputation modeling (BRIM), we uncover the relationships between facial variation and the effects of sex, genomic ancestry, and a subset of craniofacial candidate genes. The facial effects of these variables are summarized as response-based imputed predictor (RIP) variables, which are validated using self-reported sex, genomic ancestry, and observer-based facial ratings (femininity and proportional ancestry) and judgments (sex and population group). By jointly modeling sex, genomic ancestry, and genotype, the independent effects of particular alleles on facial features can be uncovered. Results on a set of 20 genes showing significant effects on facial features provide support for this approach as a novel means to identify genes affecting normal-range facial features and for approximating the appearance of a face from genetic markers.
Citation: Claes P, Liberton DK, Daniels K, Rosana KM, Quillen EE, et al. (2014) Modeling 3D Facial Shape from DNA. PLoS Genet 10(3): e1004224. doi:10.1371/journal.pgen.1004224
Editor: Daniela Luquetti, Seattle Children’s Research Institute, United States of America
Received September 12, 2013; Accepted January 22, 2014; Published March 20, 2014
Copyright: © 2014 Claes et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This investigation was supported by grants to MDS from Science Foundation of Ireland Walton Fellowship (04.W4/B643); to MDS and DAP from the National Institute Justice (2008-DN-BX-K125); to JKW from the NIH/National Human Genome Research Institute (K99HG006446); to DKL from the National Science Foundation (BCS-0851815) and from the Wenner Gren Foundation (Fieldwork Grant 7967). PC is partly supported by the Flemish Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT Vlaanderen), the Research Program of the Fund for Scientific Research - Flanders (Belgium) (FWO), the Research Fund KU Leuven and SB was supported by the Portuguese Institution “Fundacao para a Ciencia e a Tecnologia” [FCT; PTDC/BIABDE/64044/2006 (project) and SFRH/BPD/21887/2005 (post-doc grant)] and by a Dean’s Postdoctoral Fellowship at Stanford University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Introduction
The craniofacial complex is initially modulated by precisely-timed embryonic gene expression and molecular interactions mediated through complex pathways [1]. As humans grow, hormones and biomechanical factors also affect many parts of the face [2,3]. The inability to systematically summarize facial variation has impeded the discovery of the determinants and correlates of face shape. In contrast to genomic technologies, systematic and comprehensive phenotyping has lagged. This is especially so in the context of multipartite traits such as the human face. In typical genome-wide association studies (GWAS) today phenotypes are summarized as univariate variables, which is inherently limiting for multivariate traits, which, by definition cannot be expressed with single variables. Current state-of-the-art genetic association studies for facial traits are limited in their description of facial morphology [4–7]. These analyses start from a sparse set of anatomical landmarks (these being defined as “a point of correspondence on an object that matches between and within populations”), which overlooks salient features of facial shape. Subsequently, either a set of conventional morphometric measurements such as distances and angles are extracted, which drastically oversimplify facial shape, or a set of principal components (PCs) are extracted using principal components analysis (PCA) on the shape-space obtained with superimposition techniques, where each PC is assumed to represent a distinct morphological trait. Here we describe a novel method that facilitates the compounding of all PCs into a single scalar variable customized to relevant independent variables including, sex, genomic ancestry, and genes. Our approach combines placing
PLOS Genetics | www.plosgenetics.org 1 March 2014 | Volume 10 | Issue 3 | e1004224
Figure 4. Relationships between the ancestry and sex RIP variables and their initial predictor variables. (A) RIP-A with genomic ancestry; genomic ancestry is calculated using the core panel of 68 AIMs and RIP-A is calculated using this ancestry estimate on the set of three populations combined (N = 592). Populations are indicated as shown in the legend with United States participants shown with black circles, Brazilians with red circles, and Cape Verdeans with blue circles. (B) Histograms of RIP-S by self-reported sex. doi:10.1371/journal.pgen.1004224.g004
We have found the treasure coffer, but…
41
http://www.slideshare.net/MGonzaloClaros