Learning gene regulations with only positive examples

On Learning Gene Regulatory Networks with Only Positive Examples

Luigi Cerulo, University of Sannio, ItalyMichele Ceccarelli, University of Sannio, Italy

Tuesday, October 12, 2010

Outline

• Supervised inference of gene regulatory networks

• The positive only problem

• Negative selection approaches

• Effect on prediction accuracy

• Conclusions and future directions


Gene Regulatory Network (GRN)

The network of transcription dependences among genes of an organism, known as transcription factors, and their binding sites.

Gene BGene A

TF

TF

protein

gene A gene B


Gene Regulatory Network (GRN)

• A gene regulatory network can be represented as a graph G = (Vertices, Edges)

• Vertices = Genes

• Edges = Interactions

G1

G2

G3

G4

G5

G6

G7

G8


Inference of Gene regulatory networks

G1

G2

G3

G4

G5

G6

G7

G8Gi = {e1, e2, e3, . . . , en}


GRNunsupervised inference• Correlation models (eg. Mutual

information)

• Bayesian Network

• Boolean networks

• ODE

• ...


• Part of the networkis known in advancefrom public databases(Eg. RegulonDB)

GRNsupervised Inference

G1

G2

G3

G4

G5

G6

G7

G8


GRNsupervised Inference

G1

G2

G3

G4

G5

G6

G7

G8

G1

G2

G3

G4

G5

G6

G7

G8

+

Binary classifier (SVM, Decision Tree, Neural Networks,...)

T = {(G1, G2), (G2, G3), (G6, G7), (G7, G8)}Gi = {e1, e2, e3, . . . , en}


Related work


• SIRENE approach

• trains an SVM classifier for each gene and predicts which genes are regulated by that gene

• combines all predicted regulations to obtain the full regulatory network

G1

G2

G3

G4

G5

G6

G7

G8

G1

G2

G3

G4

G5

G6

G7

G8

G1

G2

G3

G4

G5

G6

G7

G8

...


0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Ratio of false positives

Rat

io o

f tru

e po

sitiv

es

CLRSIRENESIRENE−Bias

Precision

Compared with unsupervised methods (Mordelet and Vert, 2008)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Prec

isio

n

CLRSIRENESIRENE−Bias

Method Recall at 60% of Precision Recall at 80% of Precision

SIRENE 44.5% 17.6%

CLR 7.5% 5.5%

Relevance networks 4.7% 3.3%

ARACNe 1% 0%

Bayesian network 1% 0%


0

15

30

45

60

0 100 200 300 400

True

Pos

itive

s

Top N

prediction of new c-Myc regulationsTrue positives are validated with IPA

(www.ingenuity.com)

supervised (SIRENE)

unsupervised (ARACNE)


http://www.ingenuity.com

http://www.ingenuity.com

Supervised learning

+

-

++ +

+++

+

+

+

--

----

-

-

++

-


Supervised learning

f(x)

+

-

++ +

+++

+

+

+

--

----

-

-

++

-


Supervised learning

f(x)

+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?


Supervised learning

f(x)

+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

-


Supervised learning

f(x)

+

-

++ +

+++

+

+

+

--

----

-

-

++

-

+

-


Supervised learning with unlabeled data

+

-

++ +

+++

+

+

+

--

----

-

-

++

-



+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?



+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?

?

?

?

?

?

?

?

? ?

?



+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?

?

?

?

?

?

?

?

? ?

?

f(x)PU-learning


Supervised learning of gene regulatory networks

G1

G2

G3

G4

G5

G6

G7

G8

+1

Is this a negativeexample?

Is this a negative example?

+1

+1

+1


Training set

P Q N

Labeled Unlabeled

Positive Negative

% of Known Positives|P |

|P ∪Q|


Effect of PU-learningE.coli dataset [J.J. Faith et al., 2007]

AU

ROC

% of known positives

0.5

0.6

0.7

0.8

0.9

1.0

10 20 30 40 50 60 70 80 90 100


Reliable negative selection

+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?

?

?

?

?

?

?

?

? ?

?

f(x)PU-learning



+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?

?

?

?

?

f(x)PU-learning



+

-

++ +

+++

+

+

+

--

----

-

-

++

-

?

?

?

?

?

?

?

?

f(x)PU-learning

f’(x)Tuesday, October 12, 2010

Reliable negative selection in text mining

• B. Liu et al. Building Text Classifiers Using Positive and Unlabeled Examples, in ICDM 2003

• Yu et al. PEBL: Positive Example Based Learning for Web Page Classification Using SVM, in KDD 2002

• Denis et al. Text classification from positive and unlabeled Examples, in IPMU 2002


Methods based on reliable negative selection

P Q N

Labeled Unlabeled

P RN

Negative selection heuristic

Original training set

New training set


Quality of RN

• RN could be contaminated with positives embedded in unlabeled data

• The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives |Q|

RN


Effect of positive contaminationE.coli dataset [J.J. Faith et al., 2007]

AU

ROC


0.5

0.6

0.7

0.8

0.9

1.0

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100

positive contamination = 1(PU-learning)

positive contamination = 0



F-M

easu

re




0.0

0.1

0.2

0.3

0.4

0.5

0.6

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100


Network topology based heuristics


Network motifs

Network motifs are small connected subnetworks a network exhibits in a significant higher or lower occurrences than would be expected just by chance

A B C

A

B C D E


B. Goemann, E. Wingender, and A. P. Potapov, “An approach to evaluate the topological significance of motifs and other patterns in regulatory networks.” BMC System Biology, vol. 3, no. 53, May 2009.

S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, May 2002.


Network Motifs Heuristic

• For each three genes sub networks T:

• If matches a network motifs M then considers all connections not present in M as negatives

A

B

C


MOTIF selection performanceE.coli dataset [J.J. Faith et al., 2007 and RegulonDB]

AU

ROC


0.5

0.6

0.7

0.8

0.9

1.0

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100




MOTIF selection performanceE.coli dataset [J.J. Faith et al., 2007 and RegulonDB]

AU

ROC


0.5

0.6

0.7

0.8

0.9

1.0

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100



MOTIF



F-M

easu

re




0.0

0.1

0.2

0.3

0.4

0.5

0.6

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100



F-M

easu

re




0.0

0.1

0.2

0.3

0.4

0.5

0.6

10 20 30 40 50 60 70 80 90 10010 20 30 40 50 60 70 80 90 100

MOTIF


Scale free networks

Albert-László Barabási and Zoltán N. OltvaiNetwork biology: Understanding the cell’s functional organizationNature Reviews Genetics 5, 101-113 (2004)


Hierarchical networks

Hong-Wu Ma, Jan Buer, and An-Ping ZengHierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approachBMC Bioinformatics 2004 5:199


Experimental data

• 445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E.coli [J.J. Faith et al., 2007]

• Data were standardized (i.e. zero mean unit standard deviation)

• Regulations extracted from RegulonDB (v. 5) between 154 Transcription Factors and 1211 genes


Summary and conclusions

• Learning gene regulations is affected by the problem of learning from positive only data

• At least for E.coli

• The study of positive contamination shows that there is room for new heuristics

• Topology based heuristics (eg. motifs) have shown promising results.

• Open issues arise on higher level organisms where gene interactions are more complex


Re weighting strategy (PosOnly)

Cerulo et al. BMC Bioinformatics 2010, 11:228http://www.biomedcentral.com/1471-2105/11/228

Open AccessR E S E A R C H A R T I C L E

© 2010 Cerulo et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction inany medium, provided the original work is properly cited.

Research articleLearning gene regulatory networks from only positive and unlabeled dataLuigi Cerulo*1,2, Charles Elkan3 and Michele Ceccarelli1,2

AbstractBackground: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.

Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement.

Conclusions: Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.

BackgroundInferring the topology of gene regulatory networks is fun-damental to understand the complexity of interdepen-dencies among gene up and down regulation.Characterizing experimentally the transcriptional cis-regulation at a genome scale is still an expensive chal-lenge, even for well-studied model organisms. In silicomethods represent a promising direction that, through areverse engineering approach, aim to extract gene regula-tory networks from prior biological knowledge and avail-able genomic and post-genomic data. Different modelarchitectures to reverse engineer gene regulatory net-works from gene expression data have been proposed inliterature [1]. Such models represent biological regula-tions as a network where nodes represent elements of

interactions, eg. genes, proteins, metabolites, while edgesrepresent the presence of interaction activities betweensuch network components. Four main network modelarchitectures can be distinguished: i) information theorymodels, ii) boolean network models, iii) differential anddifference equation models, iv) Bayesian models.

Information theory models correlate two genes bymeans of a correlation coefficient and a threshold. Twogenes are predicted to interact if the correlation coeffi-cient of their expression levels is above a threshold. Forexample, TD-ARACNE [2], ARACNE [3], and CLR [4]infer the network structure with a statistical score derivedfrom the mutual information and a set of pruning heuris-tics.

Boolean networks use a binary variable to represent thestate of a gene activity and a directed graph, where edgesare represented by boolean functions, to represent theinteractions between genes. REVEAL [5] is an algorithmthat infers a boolean network model from gene expres-

* Correspondence: [email protected] Department of Biological and Environmental Studies, University of Sannio, Benevento, ItalyFull list of author information is available at the end of the article


PosOnly: How it works

Let x be a random example

s=1 iff x is labeled, s=0 iff x is unlabeled

y=1 iff x is positive, y=0 iff x is negative

If s=1 then y=1 (the contrary is not always true!)

p(s=1|x,y=0) = 0

P Q N

Labeleds=1

Unlabeleds=0

Positivey=1

Negativey=0


• The goal is to learna classifier such that:f(x) = p(y=1|x)

• It is easy to see that (Elkan and Noto, 2008):f(x) = p(s=1|x)/p(s=1|y=1) = p(s=1 and y=1|x)/p(s=1|y=1)= p(s=1|y=1,x)p(y=1|x)/p(s=1|y=1)= p(y=1|x)


P Q N

Labeleds=1

Unlabeleds=0

Positivey=1

Negativey=0



binary classifier trained with labeled and unlabeled examples

unknown constant estimated empirically in a number of ways

f(x) =p(s = 1|x)

p(s = 1|y = 1)


Results: experimental data

% of Known Positives

Mean of F-M

easure


Learning gene regulations with only positive examples

Documents

Transcript of Learning gene regulations with only positive examples