Learning gene regulations with only positive examples

On Learning Gene Regulatory Networks with Only Positive Examples Luigi Cerulo , University of Sannio, Italy Michele Ceccarelli, University of Sannio, Italy Tuesday, October 12, 2010

Transcript of Learning gene regulations with only positive examples

Page 1: Learning gene regulations with only positive examples

On Learning Gene Regulatory Networks with Only Positive Examples

Luigi Cerulo, University of Sannio, Italy

Michele Ceccarelli, University of Sannio, Italy

Tuesday, October 12, 2010

Page 2: Learning gene regulations with only positive examples

Outline

• Supervised inference of gene regulatory networks

• The positive only problem

• Negative selection approaches

• Effect on prediction accuracy

• Conclusions and future directions


Page 3: Learning gene regulations with only positive examples

Gene Regulatory Network (GRN)

The network of transcriptional dependencies between the genes of an organism that encode regulators, known as transcription factors (TFs), and the genes that carry their binding sites.

[Figure: schematic with the labels Gene A, Gene B, TF, and protein]

Page 4: Learning gene regulations with only positive examples

Gene Regulatory Network (GRN)

• A gene regulatory network can be represented as a graph G = (Vertices, Edges)

• Vertices = Genes

• Edges = Interactions

[Figure: example graph with vertices G1 to G8]

Page 5: Learning gene regulations with only positive examples

Inference of gene regulatory networks

[Figure: genes G1 to G8]

Each gene Gi is described by its expression values: Gi = {e1, e2, e3, ..., en}

Page 6: Learning gene regulations with only positive examples

GRN unsupervised inference

• Correlation models (e.g. mutual information)

• Bayesian networks

• Boolean networks

• ODE

• ...
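As an illustration of the correlation-model family listed on this slide, here is a minimal sketch of a relevance network based on mutual information. It is not CLR or ARACNE: the equal-width binning, the number of bins, the threshold, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def discretize(profile, bins=3):
    """Map each expression value to an equal-width bin index (assumed binning)."""
    lo, hi = min(profile), max(profile)
    width = (hi - lo) / bins or 1.0  # constant profiles fall into a single bin
    return [min(int((v - lo) / width), bins - 1) for v in profile]


def mutual_information(x, y, bins=3):
    """Empirical mutual information, in bits, between two expression profiles."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    mi = 0.0
    for (a, b), count in pxy.items():
        mi += (count / n) * math.log2(count * n / (px[a] * py[b]))
    return mi


def relevance_network(profiles, threshold):
    """Predict an undirected edge for every gene pair whose MI reaches the threshold."""
    genes = sorted(profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if mutual_information(profiles[g], profiles[h]) >= threshold:
                edges.append((g, h))
    return edges
```

The method is unsupervised in the sense that no known regulation is used: only the expression profiles and a threshold decide which edges are predicted.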

Page 7: Learning gene regulations with only positive examples

GRN supervised inference

• Part of the network is known in advance from public databases (e.g. RegulonDB)

[Figure: network over genes G1 to G8]

Page 8: Learning gene regulations with only positive examples

GRN supervised inference

[Figure: genes G1 to G8 and the known part of the network, combined (+) as input to a binary classifier]

• Gene data: Gi = {e1, e2, e3, ..., en}

• Known regulations: T = {(G1, G2), (G2, G3), (G6, G7), (G7, G8)}

• Binary classifier (SVM, decision tree, neural networks, ...)
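To make the classification setting concrete, the sketch below builds one example per ordered gene pair. The encoding is an assumption for illustration (the slides do not state it): a pair is represented by concatenating the two profiles, pairs in T are labeled 1, and every other pair is left unlabeled (`None`). The function name `build_pair_examples` is hypothetical.

```python
from itertools import permutations


def build_pair_examples(profiles, known_regulations):
    """Return (pair, features, label) for every ordered pair of distinct genes.

    label is 1 for a known regulation and None for an unlabeled pair;
    no pair is labeled negative, because negatives are not known.
    """
    known = set(known_regulations)
    examples = []
    for regulator, target in permutations(sorted(profiles), 2):
        # Assumed encoding: concatenation of the two expression profiles.
        features = list(profiles[regulator]) + list(profiles[target])
        label = 1 if (regulator, target) in known else None
        examples.append(((regulator, target), features, label))
    return examples
```

Any binary classifier can then be trained on these examples once the unlabeled pairs are given a label, which is exactly the step the rest of the talk examines.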

Page 9: Learning gene regulations with only positive examples

Related work


Page 10: Learning gene regulations with only positive examples

• SIRENE approach

• trains an SVM classifier for each gene and predicts which genes are regulated by that gene

• combines all predicted regulations to obtain the full regulatory network

[Figure: repeated panels over genes G1 to G8, one per trained classifier]

Page 11: Learning gene regulations with only positive examples

Compared with unsupervised methods (Mordelet and Vert, 2008)

[Figure: ratio of true positives vs. ratio of false positives (both 0 to 1) for CLR, SIRENE, and SIRENE-Bias]

[Figure: precision vs. recall (both 0 to 1) for CLR, SIRENE, and SIRENE-Bias]

Method               Recall at 60% precision   Recall at 80% precision
SIRENE               44.5%                     17.6%
CLR                  7.5%                      5.5%
Relevance networks   4.7%                      3.3%
ARACNe               1%                        0%
Bayesian network     1%                        0%

Page 12: Learning gene regulations with only positive examples

Prediction of new c-Myc regulations

[Figure: true positives (0 to 60) among the top N predictions (0 to 400), supervised (SIRENE) vs. unsupervised (ARACNE)]

True positives are validated with IPA (www.ingenuity.com)

Page 13: Learning gene regulations with only positive examples

Supervised learning

[Figure: scatter plot of positive (+) and negative (-) training examples]

Page 14: Learning gene regulations with only positive examples

Supervised learning

[Figure: the same examples separated by a decision function f(x)]

Page 15: Learning gene regulations with only positive examples

Supervised learning

[Figure: the decision function f(x) with two new, unclassified examples (?)]

Page 16: Learning gene regulations with only positive examples

Supervised learning

[Figure: f(x) assigns one new example to the negative class (-); the other is still unclassified (?)]

Page 17: Learning gene regulations with only positive examples

Supervised learning

[Figure: f(x) assigns the two new examples to the positive (+) and negative (-) classes]

Page 18: Learning gene regulations with only positive examples

Supervised learning with unlabeled data

[Figure: scatter plot of positive (+) and negative (-) examples]

Page 19: Learning gene regulations with only positive examples

Supervised learning with unlabeled data

[Figure: positive (+), negative (-), and a few unlabeled (?) examples]

Page 20: Learning gene regulations with only positive examples

Supervised learning with unlabeled data

[Figure: positive (+), negative (-), and many unlabeled (?) examples]

Page 21: Learning gene regulations with only positive examples

Supervised learning with unlabeled data

[Figure: the same examples with the decision function f(x) obtained by PU-learning]

Page 22: Learning gene regulations with only positive examples

Supervised learning of gene regulatory networks

[Figure: network over genes G1 to G8; known regulations are labeled +1, and two absent connections are marked "Is this a negative example?"]

Page 23: Learning gene regulations with only positive examples

Training set

• P: labeled, positive

• Q: unlabeled, positive

• N: unlabeled, negative

% of known positives = |P| / |P ∪ Q|
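The partition into P, Q, and N can be simulated when the full set of positives is available, which is how a "% of known positives" axis can be swept in an experiment. The sketch below is an assumption about such a setup rather than the authors' protocol; `make_pu_training_set` and its parameters are hypothetical names.

```python
import random


def make_pu_training_set(positives, negatives, known_fraction, seed=0):
    """Hide part of the positives to simulate a positive and unlabeled setting.

    Returns (P, Q, unlabeled), where P are the known positives, Q the hidden
    positives, and unlabeled = Q + N is what a learner sees without labels.
    """
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    n_known = round(known_fraction * len(shuffled))
    P = shuffled[:n_known]
    Q = shuffled[n_known:]
    unlabeled = Q + list(negatives)
    return P, Q, unlabeled
```

Plain PU-learning treats every element of `unlabeled` as a negative example, so the hidden positives in Q end up with the wrong label.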

Page 24: Learning gene regulations with only positive examples

Effect of PU-learning

E. coli dataset [J.J. Faith et al., 2007]

[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10 to 100)]

Page 25: Learning gene regulations with only positive examples

Reliable negative selection

[Figure: positive (+), negative (-), and unlabeled (?) examples with the PU-learning decision function f(x)]

Page 26: Learning gene regulations with only positive examples

Reliable negative selection

[Figure: the same plot with only a subset of the unlabeled (?) examples kept]

Page 27: Learning gene regulations with only positive examples

Reliable negative selection

[Figure: the same plot with a second decision function f'(x) next to the PU-learning f(x)]

Page 28: Learning gene regulations with only positive examples

Reliable negative selection in text mining

• B. Liu et al. Building Text Classifiers Using Positive and Unlabeled Examples, in ICDM 2003

• Yu et al. PEBL: Positive Example Based Learning for Web Page Classification Using SVM, in KDD 2002

• Denis et al. Text classification from positive and unlabeled Examples, in IPMU 2002


Page 29: Learning gene regulations with only positive examples

Methods based on reliable negative selection

• Original training set: P (labeled), Q and N (unlabeled)

• A negative selection heuristic extracts a set of reliable negatives RN from the unlabeled data

• New training set: P and RN

Page 30: Learning gene regulations with only positive examples

Quality of RN

• RN could be contaminated with positives embedded in unlabeled data

• The fraction of positive contamination is the ratio between the number of positives in RN and the total number of unknown positives |Q|
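The definition above translates directly into a small function. This is a sketch with a hypothetical name, `positive_contamination`; it can only be evaluated in a controlled experiment where the hidden positives Q are known.

```python
def positive_contamination(reliable_negatives, unknown_positives):
    """Fraction of the unknown positives Q that ended up inside RN."""
    Q = set(unknown_positives)
    if not Q:
        raise ValueError("Q is empty, contamination is undefined")
    return len(Q & set(reliable_negatives)) / len(Q)
```

When RN is the whole unlabeled set, every element of Q is in RN and the contamination is 1, which is the plain PU-learning case; a perfect heuristic would reach 0.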


Page 31: Learning gene regulations with only positive examples

Effect of positive contamination

E. coli dataset [J.J. Faith et al., 2007]

[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning) and positive contamination = 0]

Page 32: Learning gene regulations with only positive examples

Effect of positive contamination

E. coli dataset [J.J. Faith et al., 2007]

[Figure: F-measure (0.0 to 0.6) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning) and positive contamination = 0]

Page 33: Learning gene regulations with only positive examples

Network topology based heuristics


Page 34: Learning gene regulations with only positive examples

Network motifs

Network motifs are small connected subnetworks that occur in a network significantly more or less often than would be expected by chance.

[Figure: two example motifs, one over nodes A, B, C and one over nodes A, B, C, D, E]

Page 35: Learning gene regulations with only positive examples

B. Goemann, E. Wingender, and A. P. Potapov, “An approach to evaluate the topological significance of motifs and other patterns in regulatory networks,” BMC Systems Biology, vol. 3, no. 53, May 2009.

S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, May 2002.

Page 36: Learning gene regulations with only positive examples

Network Motifs Heuristic

• For each three-gene subnetwork T:

• If T matches a network motif M, then consider all connections not present in M as negatives

[Figure: three-gene subnetwork over A, B, and C]
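The heuristic on this slide can be sketched as follows. The slides do not say which motifs are used, so the feed-forward loop (one of the E. coli motifs reported by Shen-Orr et al.) is assumed here as the only motif M; `motif_negatives` and `FEED_FORWARD_LOOP` are hypothetical names.

```python
from itertools import combinations, permutations

# Assumed motif M: feed-forward loop over positions 0, 1, 2 (0->1, 1->2, 0->2).
FEED_FORWARD_LOOP = frozenset({(0, 1), (1, 2), (0, 2)})


def motif_negatives(genes, known_edges, motif=FEED_FORWARD_LOOP):
    """Select as negatives the absent edges of every triple that matches the motif."""
    known = set(known_edges)
    negatives = set()
    for triple in combinations(sorted(genes), 3):
        all_pairs = set(permutations(triple, 2))  # the 6 possible directed edges
        induced = known & all_pairs               # known edges inside the triple
        for ordering in permutations(triple):
            mapped = {(ordering[a], ordering[b]) for a, b in motif}
            if induced == mapped:
                # Connections not present in the motif become negatives.
                negatives |= all_pairs - mapped
                break
    return negatives
```

Because the known network is incomplete, a triple may match the motif only by accident, so the selected negatives can still be contaminated with unknown positives.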


Page 40: Learning gene regulations with only positive examples

MOTIF selection performance

E. coli dataset [J.J. Faith et al., 2007 and RegulonDB]

[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning) and positive contamination = 0]

Page 41: Learning gene regulations with only positive examples

MOTIF selection performance

E. coli dataset [J.J. Faith et al., 2007 and RegulonDB]

[Figure: AUROC (0.5 to 1.0) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning), positive contamination = 0, and MOTIF]

Page 42: Learning gene regulations with only positive examples

Effect of positive contamination

E. coli dataset [J.J. Faith et al., 2007]

[Figure: F-measure (0.0 to 0.6) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning) and positive contamination = 0]

Page 43: Learning gene regulations with only positive examples

Effect of positive contamination

E. coli dataset [J.J. Faith et al., 2007]

[Figure: F-measure (0.0 to 0.6) vs. % of known positives (10 to 100), for positive contamination = 1 (PU-learning), positive contamination = 0, and MOTIF]

Page 44: Learning gene regulations with only positive examples

Scale free networks

Albert-László Barabási and Zoltán N. Oltvai, “Network biology: understanding the cell’s functional organization,” Nature Reviews Genetics 5, 101-113 (2004)

Page 45: Learning gene regulations with only positive examples

Hierarchical networks

Hong-Wu Ma, Jan Buer, and An-Ping Zeng, “Hierarchical structure and modules in the Escherichia coli transcriptional regulatory network revealed by a new top-down approach,” BMC Bioinformatics 2004, 5:199

Page 46: Learning gene regulations with only positive examples

Experimental data

• 445 Affymetrix Antisense2 microarray expression profiles for 4345 genes of E. coli [J.J. Faith et al., 2007]

• Data were standardized (i.e. zero mean and unit standard deviation)

• Regulations extracted from RegulonDB (v. 5) between 154 transcription factors and 1211 genes
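The standardization step mentioned on this slide can be sketched as below. Two details are assumptions, since the slide does not give them: the transformation is applied to one profile at a time, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def standardize(profile):
    """Rescale a profile to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(profile), pstdev(profile)
    if sigma == 0:
        # A constant profile carries no variation; map it to all zeros.
        return [0.0] * len(profile)
    return [(v - mu) / sigma for v in profile]
```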

Page 47: Learning gene regulations with only positive examples

Summary and conclusions

• Learning gene regulations is affected by the problem of learning from positive-only data

• At least for E. coli

• The study of positive contamination shows that there is room for new heuristics

• Topology-based heuristics (e.g. motifs) have shown promising results

• Open issues remain for higher-level organisms, where gene interactions are more complex

Page 48: Learning gene regulations with only positive examples

Reweighting strategy (PosOnly)

Cerulo et al., BMC Bioinformatics 2010, 11:228

http://www.biomedcentral.com/1471-2105/11/228

Research article (Open Access)

© 2010 Cerulo et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Learning gene regulatory networks from only positive and unlabeled data

Luigi Cerulo (*, 1, 2), Charles Elkan (3) and Michele Ceccarelli (1, 2)

Abstract

Background: Recently, supervised learning methods have been exploited to reconstruct gene regulatory networks from gene expression data. The reconstruction of a network is modeled as a binary classification problem for each pair of genes. A statistical classifier is trained to recognize the relationships between the activation profiles of gene pairs. This approach has been proven to outperform previous unsupervised methods. However, the supervised approach raises open questions. In particular, although known regulatory connections can safely be assumed to be positive training examples, obtaining negative examples is not straightforward, because definite knowledge is typically not available that a given pair of genes do not interact.

Results: A recent advance in research on data mining is a method capable of learning a classifier from only positive and unlabeled examples, that does not need labeled negative examples. Applied to the reconstruction of gene regulatory networks, we show that this method significantly outperforms the current state of the art of machine learning methods. We assess the new method using both simulated and experimental data, and obtain major performance improvement.

Conclusions: Compared to unsupervised methods for gene network inference, supervised methods are potentially more accurate, but for training they need a complete set of known regulatory connections. A supervised method that can be trained using only positive and unlabeled data, as presented in this paper, is especially beneficial for the task of inferring gene regulatory networks, because only an incomplete set of known regulatory connections is available in public databases such as RegulonDB, TRRD, KEGG, Transfac, and IPA.

Background

Inferring the topology of gene regulatory networks is fundamental to understand the complexity of interdependencies among gene up and down regulation. Characterizing experimentally the transcriptional cis-regulation at a genome scale is still an expensive challenge, even for well-studied model organisms. In silico methods represent a promising direction that, through a reverse engineering approach, aim to extract gene regulatory networks from prior biological knowledge and available genomic and post-genomic data. Different model architectures to reverse engineer gene regulatory networks from gene expression data have been proposed in literature [1]. Such models represent biological regulations as a network where nodes represent elements of interactions, e.g. genes, proteins, metabolites, while edges represent the presence of interaction activities between such network components. Four main network model architectures can be distinguished: i) information theory models, ii) boolean network models, iii) differential and difference equation models, iv) Bayesian models.

Information theory models correlate two genes by means of a correlation coefficient and a threshold. Two genes are predicted to interact if the correlation coefficient of their expression levels is above a threshold. For example, TD-ARACNE [2], ARACNE [3], and CLR [4] infer the network structure with a statistical score derived from the mutual information and a set of pruning heuristics.

Boolean networks use a binary variable to represent the state of a gene activity and a directed graph, where edges are represented by boolean functions, to represent the interactions between genes. REVEAL [5] is an algorithm that infers a boolean network model from gene expres-

* Correspondence: [email protected]

1 Department of Biological and Environmental Studies, University of Sannio, Benevento, Italy

Full list of author information is available at the end of the article

Page 49: Learning gene regulations with only positive examples

PosOnly: How it works

Let x be a random example

s=1 iff x is labeled, s=0 iff x is unlabeled

y=1 iff x is positive, y=0 iff x is negative

If s=1 then y=1 (the converse is not always true!)

p(s=1|x,y=0) = 0

[Figure: partition of the training set. P: labeled (s=1), positive (y=1). Q: unlabeled (s=0), positive (y=1). N: unlabeled (s=0), negative (y=0).]

Page 50: Learning gene regulations with only positive examples

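PosOnly: How it works

• The goal is to learn a classifier such that: f(x) = p(y=1|x)

• It is easy to see that (Elkan and Noto, 2008), since s=1 implies y=1 and assuming the labeled positives are selected completely at random, i.e. p(s=1|y=1,x) = p(s=1|y=1):

f(x) = p(s=1|x) / p(s=1|y=1)
     = p(s=1 and y=1|x) / p(s=1|y=1)
     = p(s=1|y=1,x) p(y=1|x) / p(s=1|y=1)
     = p(y=1|x)

[Figure: partition of the training set into P (labeled, s=1, positive, y=1), Q (unlabeled, s=0, positive, y=1), and N (unlabeled, s=0, negative, y=0)]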

Page 51: Learning gene regulations with only positive examples

PosOnly: How it works

f(x) = p(s=1|x) / p(s=1|y=1)

• p(s=1|x): binary classifier trained with labeled and unlabeled examples

• p(s=1|y=1): unknown constant estimated empirically in a number of ways
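A minimal sketch of the two ingredients of f(x) follows. The constant is estimated here with one of the estimators described by Elkan and Noto (2008), the mean classifier output over held-out labeled examples; the slide only says it can be estimated "in a number of ways", so this choice, the clipping to 1, and the function names are assumptions. The scores are expected to come from any probabilistic classifier trained to separate labeled from unlabeled examples.

```python
def estimate_label_frequency(scores_on_labeled_validation):
    """Estimate c = p(s=1|y=1) as the mean of g(x) = p(s=1|x) over held-out
    labeled (hence positive) examples."""
    scores = list(scores_on_labeled_validation)
    if not scores:
        raise ValueError("need at least one labeled validation example")
    return sum(scores) / len(scores)


def posonly_probability(g_x, c):
    """f(x) = p(s=1|x) / p(s=1|y=1), clipped so that it stays a probability."""
    return min(1.0, g_x / c)
```

For instance, with validation scores 0.4, 0.5, and 0.6 the estimate of c is 0.5, so an unlabeled example scored 0.25 by the classifier is assigned f(x) = 0.5.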

Page 52: Learning gene regulations with only positive examples

Results: experimental data

[Figure: mean of F-measure vs. % of known positives]