
Discussion

Rejoinder on “Imprecise probability models for learning multinomial distributions from data. Applications to learning credal networks”

Andrés R. Masegosa, Serafín Moral

Dpto. Ciencias de la Computación e Inteligencia Artificial, Universidad de Granada, 18071 Granada, Spain


Article history:
Received 26 February 2014
Received in revised form 25 April 2014
Accepted 27 April 2014
Available online xxxx

Keywords: Imprecise probability; Learning; Near-ignorance; Imprecise prior models; Credal networks; Credal classification

In this paper we answer the comments provided by Fabio Cozman, Marco Zaffalon, Giorgio Corani, and Didier Dubois on our paper ‘Imprecise Probability Models for Learning Multinomial Distributions from Data. Applications to Learning Credal Networks’. The main topics we have considered are: regularity, the learning principle, the trade-off between prior imprecision and learning, strong symmetry, and the properties of the ISSDM for learning graphical conditional independence models.

© 2014 Elsevier Inc. All rights reserved.

1. Fabio Cozman

In this section we consider the discussion by Fabio Cozman [1] on our paper [2]. We welcome the comments about regularity, because this concept underlies some of our results about learning and was not made explicit in the paper. This is an important issue. Under the betting interpretation of upper and lower probabilities, we believe that probabilities should be positive, at least for feasible events in the finite case. Intuitively, there is always an ε > 0 that we are ready to pay for any possible event A in order to gain 1 if A happens to be true. As Cozman points out, this poses important technical difficulties in the infinite case. This does not mean, however, that weaker regularity assumptions should be discarded. Here we concentrate on assuming regularity for measurable sets with Lebesgue measure greater than 0. But imprecise probabilities offer other alternatives, for example assuming that whenever the upper probability of A is positive the lower probability of A must also be positive, while allowing both to be equal to 0. In this way we avoid assuming regularity for every logically possible event. In our case, both the lower and the upper probabilities of A are positive for any measurable event with |A| > 0. If |A| = 0, the upper and lower probabilities of A are both equal to 0. With the IDM there are many measurable events with positive upper probability and zero lower probability. For example, among the intervals included in [0, 1], the lower probability of [a, b] is 0 for every interval except the full [0, 1] interval. However, there are other measurable sets with positive lower probability (otherwise we could not obtain meaningful inferences). For example, for any ε > 0 the lower probability of [0, ε] ∪ [1 − ε, 1] is always positive. This is somewhat shocking.



We cannot forget that upper and lower probabilities have a behavioral interpretation, and that we are ready to buy a bet on the event [0, ε] ∪ [1 − ε, 1] for a positive amount for any ε, while we are not ready to pay anything for the interval [ε, 1 − ε]. The IDM assumes that we are ready to bet some amount on the chances lying in [0, 0.0001] ∪ [0.9999, 1], but nothing on the chances lying in [0.0001, 0.9999]. In the ISSDM the latter event has a greater lower probability than the former one (except in some extreme cases in which the lower value of the equivalent sample size s1 is too small).
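To make this contrast concrete, the following small numerical sketch (our own illustration, not code from the paper) approximates the IDM prior lower probability of the two-sided tail event [0, ε] ∪ [1 − ε, 1] and of the interior interval [ε, 1 − ε], assuming the binomial IDM prior set {Beta(s·t, s·(1 − t)) : t ∈ (0, 1)} with s = 1 and a grid approximation of the infimum.

```python
# Minimal sketch (assumption: binomial IDM prior set {Beta(s*t, s*(1-t)) : t in (0,1)},
# s = 1; the infimum over t is approximated on a grid).
import numpy as np
from scipy.stats import beta

s, eps = 1.0, 1e-4
t_grid = np.linspace(1e-6, 1.0 - 1e-6, 2001)  # extreme t values approximate the open interval

def event_probs(a, b):
    """Probabilities of the tail event and of the interior interval under Beta(a, b)."""
    tails = beta.cdf(eps, a, b) + 1.0 - beta.cdf(1.0 - eps, a, b)
    interior = beta.cdf(1.0 - eps, a, b) - beta.cdf(eps, a, b)
    return tails, interior

probs = np.array([event_probs(s * t, s * (1.0 - t)) for t in t_grid])

# Lower probabilities = infimum over the prior set (approximated on the grid):
# the tail event keeps a strictly positive lower probability, while the interior
# interval's lower probability collapses towards 0 as t approaches 0 or 1.
print("lower P([0,eps] U [1-eps,1]) ~", probs[:, 0].min())
print("lower P([eps,1-eps])        ~", probs[:, 1].min())
```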

With respect to the learning principle, we believe that it is important in itself. It is a basic requirement, because situations such as the one in Example 4 of our paper [2] look counterintuitive and should be avoided. This principle is incompatible with representation invariance (RIP), but we find it more basic than RIP. Representation invariance looks appealing, but refining and coarsening the categories is a non-trivial operation. In imprecise probability we very often neglect this, and we even say that in an experiment the number of categories is unknown. But if we go to the bag of marbles example by Walley [3], the color of the balls is a continuum, and to define the problem we have to determine the set of categories and the measurement procedure, i.e. how the observed frequencies of colors are going to be mapped onto our set of categories. In any experimental setting there are prior assumptions and, in our case, the selection of categories is part of these assumptions. If, according to a gamble, we are going to be paid an amount when a ball is red, we have to fix in advance the procedure that determines when a ball is classified as red and not brown, for example. It is possible that some categories have not been observed yet, but we should know what the categories are.

We appreciate the comment about connecting our proposal with the existing literature on strict coherence. We believe that it is a good suggestion and, as was said earlier, imprecise probability could throw new light on this issue, as it allows weaker formalizations of this idea.

With respect to the comment on dilation, we want to point out that, while it is true that under the ISSDM the degree of imprecision can increase after receiving some information, it never happens that the interval of conditional probabilities strictly contains the interval of prior probabilities. For a value w1 we can have a precise prior probability. After conditioning on observations, the conditional probabilities can become an imprecise interval [a, b], but this interval never contains the precise prior value. So, we do not have dilation in a strict sense [4]. When observations are uniform over the different categories we go back to precision, and imprecision only increases when data are unbalanced and relative frequencies are extreme (close to 0 or 1). It is in this case that imprecision appears in the ISSDM. We find this behavior reasonable, as the extreme probabilities are the riskiest ones: if our upper probability of an event is very close to 0, then we can take risky bets against this event.
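As a concrete illustration of this behavior (our own sketch, not taken from the paper), consider the ISSDM with symmetric Dirichlet(s/k, ..., s/k) priors and an equivalent sample size s ranging over an interval such as [1, 10]. The posterior mean of θi is (ni + s/k)/(N + s), so the posterior predictive interval collapses to the precise prior value 1/k when the counts are uniform, and opens up only when the counts are unbalanced.

```python
# Sketch of ISSDM posterior intervals (assumption: symmetric Dirichlet(s/k, ..., s/k)
# priors with the equivalent sample size s ranging over [s1, s2] = [1, 10]).
import numpy as np

def posterior_interval(counts, i, s1=1.0, s2=10.0, grid=200):
    """Interval of posterior means E[theta_i | counts] as s ranges over [s1, s2]."""
    counts = np.asarray(counts, dtype=float)
    k, N = len(counts), counts.sum()
    s_values = np.linspace(s1, s2, grid)
    means = (counts[i] + s_values / k) / (N + s_values)
    return means.min(), means.max()

# Uniform counts: the interval degenerates to the single prior value 1/k (here 0.5).
print(posterior_interval([5, 5], i=0))
# Unbalanced counts: a genuine interval appears, but it does not contain the precise
# prior value 1/k = 0.5, so there is no dilation in the strict sense.
print(posterior_interval([9, 1], i=0))
```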

In relation to the generalized version of the ISSDM and the need to specify more parameters, we have shown that there are important problems in selecting the equivalent sample size in a Bayesian approach and in determining how this sample size should change in relation to the number of elements in W and the number of conditional distributions. There is no clear way of doing this, and we have shown examples in which several options make sense. The generalized ISSDM tries to minimize the number of prior assumptions by allowing different assignment functions to determine the equivalent sample sizes of the prior probabilities for the different conditional distributions. In fact, the generalized ISSDM allows more possibilities than the restricted case. It usually happens that being more imprecise requires more parameters. For example, we need two values to specify an interval probability against one value for a precise probability. But this is not necessarily a problem.

We consider the learning part as particularly important for the ISSDM. In some broad sense, we believe that this is a model for near-ignorance with respect to the dependence structure of a credal network. As was pointed out in the paper, with low s values we favor independence and with high values of s we favor dependence, so by allowing an interval for the s value we are trying to be ignorant about this fact. It is similar to what the IDM does for the parameters: αi can be very low, favoring low chances, or high, favoring high chances. Observed data overcome this ignorance, and we can make more precise decisions when the number of observations increases.

There is a part in which we have to make many compromises in order to obtain a full procedure for learning generalized credal networks. This was motivated by the idea of obtaining feasible computer implementations of the proposed methods and being able to show their behavior in some experiments. This part is not essential in itself. We have included it to show that this is not only a theory but something that can provide useful practical results. However, we agree with Fabio Cozman that moving from optimization to sampling for approximate computations is an idea that has some potential and deserves attention in future work.

We thank Fabio Cozman for his comments on generalized credal networks with ambiguous edges. We also believe that they are more intuitive when working with experts, and that they can be a basis for interactive procedures for learning credal networks in which we ask experts about their beliefs on unsure edges, in the line of [5].

Finally, we agree that evaluation of credal networks is not a simple issue, but here we have followed existing procedures in the literature, as this is only an auxiliary topic in our paper.

2. Marco Zaffalon and Giorgio Corani

In this section we consider the comments by Marco Zaffalon and Giorgio Corani [6] on our paper [2]. We agree with Marco Zaffalon and Giorgio Corani that near-ignorance models have an important role to play in learning with no prior information. The only point is that there can be alternative models. As they say in their comments, there is a trade-off between learning and prior assumptions. If the prior assumptions are weaker, then the learning capabilities are lower.


Models for learning only from data should be selected by looking for an equilibrium between these two desirable properties: learning capability and weak prior assumptions. In general, this equilibrium is not found at an isolated point, and it is impossible for a single model to satisfy all the requirements. Our view is that there can be different alternatives and that the selection for a particular case could depend on the type of problem we are facing. For example, the IDM is nice when learning frequencies under perfect observations, but it has strong limitations when we try to discover dependence relationships. In general, we aim to have general principles that determine our models in a unique way. But the existence of several alternatives is usually the result of opposed desiderata. We have to give up something in one direction in order to gain in the other, and there are several ways in which this can be done. The procedure in Piatti et al. [7] is one alternative for learning frequencies with imperfect observations. It works in practice. Its main drawback is that it is not coherent, but our feeling is that it avoids sure loss, and therefore could be refined in order to make it coherent. Also, near-ignorance and RIP are desirable properties, and the IDM and alternative models will remain important in the future.

We have to admit that at the beginning we were not very comfortable with the ISSDM and its property of being completely precise without observations. But then we started to realize that this precision also has some remarkable properties when applied to learning graphical models for classification. For example, imagine a naive Bayes problem and assume that, for an attribute X, we observe the same vector of frequencies for all the possible values of the class C. Then the ISSDM-NCC would automatically discard this variable (the predictions are the same as when this variable is not present). This is natural, as the data do not seem to be relevant for the class. However, if this variable is used in an IDM-NCC, it will increase the imprecision of the result.

The ISSDM avoids some of the problems of the IDM when Bayesian scores are generalized to decide about structural independence relationships. We will try to explain this with a simple example. Imagine that X and Y are two binary variables with values {x1, x2} and {y1, y2}, and that we have observed a long sequence of joint observations of the two variables in such a way that half of the cases are (x1, y1) and the other half are (x2, y2). Intuitively, the data support a high degree of dependence between the variables (in all the observed pairs (xi, yj) we have i = j). However, when applying the IDM to the dependence model, we include in our set of priors a Dirichlet density in which the alpha weights are concentrated on (x1, y2) and (x2, y1), and the data are then highly incompatible with this prior (all our observations fall on the values with very low prior weights). On the other hand, this degree of incompatibility cannot be achieved if we consider independence. In that case, we have to look at the marginal densities of the two variables, and as we have observed the same frequencies for all the possible values of the marginal variables, we cannot obtain the same degree of incompatibility between the model and the data. The consequence is that dependence will never dominate independence, simply because we can play with the parameters of the IDM. The same thing cannot be done with the ISSDM due to the symmetry of its densities. So, even recognizing that in the ISSDM imprecision comes from prior-data conflict, this model seems to have a role to play when learning about structural dependence relationships without prior information.
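The following rough numerical sketch (our own illustration, with an arbitrary equivalent sample size s = 2 and a sample of ten observations) shows the mechanism: under the IDM, the alpha weights of the dependence model can be pushed onto the unobserved cells (x1, y2) and (x2, y1), driving its Dirichlet-multinomial score below that of independence, whereas the symmetric densities of the ISSDM do not allow this.

```python
# Sketch: Dirichlet-multinomial log scores for dependence vs. independence.
# Assumptions: s = 2, five observations of (x1, y1) and five of (x2, y2).
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alphas):
    """Log marginal likelihood of multinomial counts under a Dirichlet(alphas) prior."""
    counts, alphas = np.asarray(counts, float), np.asarray(alphas, float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

s = 2.0
joint = np.array([5.0, 0.0, 0.0, 5.0])          # cells (x1,y1), (x1,y2), (x2,y1), (x2,y2)
marg_x, marg_y = np.array([5.0, 5.0]), np.array([5.0, 5.0])

# Independence: product of the two marginal scores (balanced counts, symmetric priors).
indep = log_marginal(marg_x, [s / 2, s / 2]) + log_marginal(marg_y, [s / 2, s / 2])

# Dependence with a symmetric Dirichlet prior (a single ISSDM density for this s).
dep_symmetric = log_marginal(joint, [s / 4] * 4)

# Dependence with an adversarial IDM-style allocation: almost all of the weight s
# is placed on the two unobserved cells.
dep_adversarial = log_marginal(joint, [0.001, 0.999, 0.999, 0.001])

print("independence           :", round(indep, 2))           # about -15.9
print("dependence, symmetric  :", round(dep_symmetric, 2))    # about -10.7: dependence wins
print("dependence, adversarial:", round(dep_adversarial, 2))  # about -25.0: below independence
```

By letting the weight on the observed cells tend to zero, this adversarial score can be made arbitrarily low, which is why dependence can never dominate independence under the IDM on such data.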

With respect to ‘strong symmetry’, we agree with the comments provided by Marco and Giorgio on an earlier version of the paper: our argument looks convincing at first sight but, after thinking about it more deeply, one notices that it is not so simple and that it is not clear that it should be a basic requirement for prior models. Here we would like to add some new relevant references. The first one is a paper by de Cooman and Miranda [8] about symmetry. In that paper, ‘strong symmetry’ is seen as a model of symmetry, applicable for example when we have a die and we believe that the die and the rolling mechanism are symmetrical. In that case it makes sense to assume strong symmetry and to consider a uniform prior distribution. In another paper, by de Cooman et al. [9], the property of category permutation invariance is introduced. Roughly speaking, this property says that our beliefs should not change after any permutation σ of the categories. According to these authors this is not a trivial property and “can only be justified when our subject has no reason to distinguish between the categories, which may for instance happen when he is in a state of prior ignorance”. The question is: if our beliefs do not change for any permutation σ, should they change under a random permutation of categories? We think that if the first assumption is taken for granted under ignorance, then invariance under a random permutation is not a big step further. Conceptually, we do not see a big difference between category permutation invariance and strong symmetry, although from a formal point of view they are not equivalent concepts. Perhaps the differences appear when we have the full picture of a problem at hand. We have to take into account that the final aim of uncertainty management is to make decisions. If we are able to express the consequences of our decisions in terms of the new, randomly permuted categories, then strong symmetry can make sense, because we can move the full problem to the new scenario with randomly permuted categories. However, if the consequences of our decisions are only known for the initial categories, then strong symmetry is more questionable. Assume that W has two values, Red and Blue, and that we want to predict the next color after having observed the colors in a sample of size n. Then, if we have to determine the true color (and not the randomized name), making a random permutation of the names in the sample does not make sense.

We recognize that we have not given a principled justification of the ISSDM as a model for learning without prior information. Even if strong symmetry is accepted, there are other possible models with symmetrical densities that could be used in practice. But we want to stress that, at least, this is a reasonable model for this task. Nowadays, most learning is done with Bayesian priors which use a single density. As has been shown in the paper, the selection of the sample size parameter poses important problems. The ISSDM can solve these problems by being imprecise and allowing an interval of parameters.

With respect to the naive credal classifiers, we will start with the global versus local approach issue.


With the IDM, one of the main reasons to use the global approach is that the local one is usually too imprecise to be useful, and this imprecision increases with the number of attributes. Here we show that with the ISSDM the imprecision is similar in both cases (slightly greater in the local approach). This is another nice property of the ISSDM. We prefer the local approach, and we will try to give a simple example to support this. Imagine that we have a class C and two attributes X1 and X2; the parameter s determines our beliefs about whether the prior probabilities of C and the conditional probabilities of Xi (i = 1, 2) are extreme or uniform. As s becomes lower, we believe in more extreme probabilities. In the global approach, we select the same s for all the prior densities, so all of them have the same shape: we are assuming that if the prior probabilities for C are extreme (close to 0 and 1), so are the conditional probabilities of X1 given C and of X2 given C. This seems an unnatural restriction. If the modeler is asked whether it is possible that the probabilities of the categories of C are balanced while the conditional probabilities are extreme, and the answer is ‘yes’, then the global approach does not make sense. The local approach gives up this restriction and is therefore more natural from our point of view. One drawback of the local approach is its higher computational complexity.
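A rough sketch of this difference (our own illustration, using plug-in posterior means and an illustrative s interval [1, 10] rather than the full credal dominance criterion of the paper): the global approach optimizes a single shared s, while the local approach optimizes one s per density, so the local score interval is always at least as wide.

```python
# Rough sketch of the global vs. local choice of the equivalent sample size s in a
# tiny naive Bayes model with class C and attributes X1, X2. Assumptions: posterior-mean
# plug-in estimates, s in [1, 10], and the unnormalized score P(c) P(x1|c) P(x2|c).
import itertools
import numpy as np

s_grid = np.linspace(1.0, 10.0, 40)

def post_mean(count, total, s, k):
    """Posterior mean (count + s/k) / (total + s) under a symmetric Dirichlet prior."""
    return (count + s / k) / (total + s)

# Illustrative counts: binary class, binary attributes, instance (x1a, x2a) to classify.
n_c = {"c1": 12, "c2": 8}
n_x1 = {"c1": 10, "c2": 2}   # counts of X1 = x1a within each class
n_x2 = {"c1": 6, "c2": 5}    # counts of X2 = x2a within each class
N = sum(n_c.values())

def score(c, s_c, s_1, s_2):
    """Unnormalized naive Bayes score of class c for the instance (x1a, x2a)."""
    return (post_mean(n_c[c], N, s_c, 2)
            * post_mean(n_x1[c], n_c[c], s_1, 2)
            * post_mean(n_x2[c], n_c[c], s_2, 2))

for c in ("c1", "c2"):
    global_scores = [score(c, s, s, s) for s in s_grid]                           # one shared s
    local_scores = [score(c, *ss) for ss in itertools.product(s_grid, repeat=3)]  # one s per density
    print(c, "global:", (min(global_scores), max(global_scores)))
    print(c, "local :", (min(local_scores), max(local_scores)))
```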

Zaffalon and Corani are right in asking for a procedure to remove irrelevant features in the global approach. It would have been possible, and it is desirable.

With respect to the comments about the results of the experiments with the different credal classifiers, Zaffalon and Corani point out some interesting issues. One point is average behavior versus worst-case behavior. This is part of the general discussion on comparing imprecise classifiers, an issue that is far from closed but that is outside the scope of this paper. We agree with their observation that the aggressive behavior of the ISSDM is the reason for its better average utility. The main question is whether this aggressive behavior is justified. In our opinion, if it obtains a better performance, then there is a reason for it. The problem with Bayesian classifiers is that they are always precise. Sometimes the decisions are only justified by this precision and not by the information provided by the data. Another point raised by Zaffalon and Corani is that the ISSDM-NCC is highly precise with small samples. This is mainly true for the global approach, but not really for the local one. The local ISSDM-NCC shows an increase in the determinacy index as the training size increases. It is not as marked as in the other credal classifiers, but it is present: we need a sample size of 30 to classify around 80% of the cases with precision. In the global approach, the increase in determinacy is less evident in the experiments, and we can have a high proportion of precise classifications with small samples (around 97% with a training size of 5). In any case, in the long run, as the sample size continues to increase, the determinacy will approach 100%.

We are not quite sure about the mentioned overconfident behavior in the case of sensible decisions with poorly informative learning sets. In that situation, if the scenario involves sensible decisions, this would translate into large differences in utilities, and even a small amount of imprecision could imply indeterminacy in the decisions.

Finally, Zaffalon and Corani say that the ISSDM-NCC might not be easily applicable to case studies in which experts can provide domain knowledge, for example by providing intervals of probabilities. We see several ways in which this can be done. The most obvious one is to take the prior densities of the ISSDM and condition them on the interval provided by the expert.
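A minimal sketch of this idea (our own illustration, for a binary variable): take a Beta(s/2, s/2) density of the ISSDM and condition it on the expert's interval by truncation and renormalization, repeating this for every s in [s1, s2].

```python
# Sketch (our own illustration): conditioning a Beta(s/2, s/2) ISSDM prior density on an
# expert-provided interval [l, u] for the chance theta, by truncation and renormalization.
from scipy.stats import beta

def truncated_beta_pdf(theta, s, l, u):
    """Density of theta given theta in [l, u], starting from a Beta(s/2, s/2) prior."""
    a = b = s / 2.0
    if not (l <= theta <= u):
        return 0.0
    mass = beta.cdf(u, a, b) - beta.cdf(l, a, b)   # normalizing constant
    return beta.pdf(theta, a, b) / mass

# The expert states that theta lies in [0.6, 0.9]; repeating this for every s in [s1, s2]
# yields the conditioned set of prior densities.
print(truncated_beta_pdf(0.75, s=2.0, l=0.6, u=0.9))
```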

3. Didier Dubois

In this section we consider the comments provided by Didier Dubois [10] on our paper [2]. Didier Dubois points out that we assume a uniformly distributed prior. This is true on the space of values W, but it is not true for the space of chances Θ, in which the density is not constant unless all the weights αi are equal to 1. It is also said that this is due to the use of Bayesian updating. We want to point out that we regard coherence as a basic requirement in our view of imprecise probability, and that generalized Bayesian updating is a consequence of coherence. We recognize that the fact that the prior probabilities on W are uniform is shocking, and one can feel uncomfortable about it, but in our opinion it is the set of prior densities that should be blamed for this uniformity, and not the updating rule. In this paper we have presented the ISSDM, which is perhaps an extreme possibility. There are other models that allow learning. For example, assuming a minimum positive value for the αi values in the IDM, as proposed in [11] (the bounded IDM), does not produce precise uniform probabilities on W and satisfies the learning principle. The combination of the bounded IDM and the ISSDM could be a good alternative for future research.
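To illustrate this contrast (our own sketch, under the assumption that the bounded IDM of [11] keeps the IDM's Dirichlet(s·t) priors but bounds each component ti below by a positive δ), the prior predictive probability of a category becomes a non-vacuous, non-degenerate interval, and the posterior predictive interval moves with the observed counts, in agreement with the learning principle.

```python
# Rough sketch (assumption: the bounded IDM keeps Dirichlet(s*t) priors but restricts
# every t_i to be at least delta > 0; binary case with s = 2 and delta = 0.1).
import numpy as np

def predictive_interval(counts, s=2.0, delta=0.1, grid=500):
    """Lower/upper posterior predictive probability of the first category."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    t1 = np.linspace(delta, 1.0 - delta, grid)   # t = (t1, 1 - t1), t1 in [delta, 1 - delta]
    pred = (counts[0] + s * t1) / (N + s)
    return pred.min(), pred.max()

# Without data the first category gets the interval [delta, 1 - delta]: imprecise,
# but neither vacuous nor a precise uniform value.
print(predictive_interval([0, 0]))
# After observing unbalanced data the interval moves with the counts (learning principle).
print(predictive_interval([9, 1]))
```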

Acknowledgements

We want to finish this rejoinder by thanking Fabio Cozman, Marco Zaffalon, Giorgio Corani, and Didier Dubois for their interesting and useful comments. They have contributed to improving the quality of our original work and have enriched it with this additional and very pertinent discussion.

References

[1] F.G. Cozman, Learning imprecise probability models: Conceptual and practical challenges, Int. J. Approx. Reason. (2014), http://dx.doi.org/10.1016/j.ijar.2014.04.016, in press.

[2] A.R. Masegosa, S. Moral, Imprecise probability models for learning multinomial distributions from data. Applications to learning credal networks, Int. J. Approx. Reason. (2014), http://dx.doi.org/10.1016/j.ijar.2013.09.019, in press.

[3] P. Walley, Inferences from multinomial data: learning about a bag of marbles (with discussion), J. R. Stat. Soc. B 58 (1996) 3–57.

[4] A. Pedersen, G. Wheeler, Demystifying dilation, Erkenntnis (2013) 1–38.


[5] A. Cano, A. Masegosa, S. Moral, A method for integrating expert knowledge when learning Bayesian networks from data, IEEE Trans. Syst. Man Cybern., Part B, Cybern. 41 (2011) 1382–1394, http://dx.doi.org/10.1109/TSMCB.2011.2148197.

[6] M. Zaffalon, G. Corani, Comments on “Imprecise probability models for learning multinomial distributions from data. Applications to learning credal networks” by Andrés Masegosa and Serafín Moral, Int. J. Approx. Reason. (2014), http://dx.doi.org/10.1016/j.ijar.2014.05.001, in press.

[7] A. Piatti, M. Zaffalon, F. Trojani, M. Hutter, Limits of learning about a categorical latent variable under prior near-ignorance, Int. J. Approx. Reason. 50 (2009) 597–611.

[8] G. de Cooman, E. Miranda, Symmetry of models versus models of symmetry, in: W. Harper, G. Wheeler (Eds.), Probability and Inference: Essays in Honour of Henry E. Kyburg, King’s College Publications, 2007, pp. 67–149.

[9] G. de Cooman, E. Miranda, E. Quaeghebeur, Representation insensitivity in immediate prediction under exchangeability, Int. J. Approx. Reason. 50 (2009) 204–216, http://dx.doi.org/10.1016/j.ijar.2008.03.010.

[10] D. Dubois, On various ways of tackling incomplete information in statistics, Int. J. Approx. Reason. (2014), http://dx.doi.org/10.1016/j.ijar.2014.04.002, in press.

[11] S. Moral, Imprecise probabilities for representing ignorance about a parameter, Int. J. Approx. Reason. 53 (2012) 347–362, http://dx.doi.org/10.1016/j.ijar.2010.12.001.