Multiple-Instance Learning
Paper 1: A Framework for Multiple-Instance Learning [Maron and Lozano-Perez, 1998]
Paper 2: EM-DD: An Improved Multiple-Instance Learning Technique [Zhang and Goldman, 2001]
Multiple-Instance Learning (MIL)
A variation on supervised learning.
Supervised learning: each training example is individually labeled.
MIL: each training example is a set (or bag) of instances, with a single label equal to the maximum label among all instances in the bag.
Goal: to learn to accurately predict the label of previously unseen bags.
MIL Setup
Training Data: D = {<B1, l1>, …, <Bm, lm>}, m bags where bag Bi has label li.
Boolean labels:
Positive bags: Bi+; Negative bags: Bi-.
If bag Bi+ = {Bi1+, …, Bij+, …, Bin+}, then Bij+ is the jth instance in Bi+, and Bijk+ is the value of the kth feature of instance Bij+.
Real-valued labels: li = max(li1, li2, …, lin)
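The bag layout above can be sketched in Python (a hypothetical representation; the variable names and feature values are illustrative, not from the papers):

```python
# Hypothetical MIL data layout: each bag is a list of instances
# (feature vectors), paired with one boolean bag-level label.
positive_bag = ([[1.0, 2.0, 0.5],    # instance B_i1
                 [5.0, 5.1, 3.3]],   # instance B_i2
                True)                # bag label l_i = 1

negative_bag = ([[9.0, 9.0, 9.0]], False)

instances, label = positive_bag
# B_ijk: feature k of instance j in bag i
print(instances[1][2], label)
```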
Diverse Density Algorithm [Maron and Lozano-Perez, 1998]
Main idea: find a point in feature space that has a high Diverse Density:
High density of positive instances ("close" to at least one instance from each positive bag).
Low density of negative instances ("far" from every instance in every negative bag).
Higher diverse density = higher probability of being the target concept.
A Motivating Example for DD
Goal: find an area with both a high density of positive points and a low density of negative points.
The difficulty with using regular density, which adds the contributions of positive bags and subtracts those of negative bags, is illustrated in panel (b).
Diverse Density
Assume the target concept is a single point t, and let x be some point in feature space. Then
Pr(x = t | B1+, …, Bn+, B1-, …, Bn-) ……(1)
represents the probability that x is the target concept given the training examples.
We can find t if we maximize the above probability over all points x.
Probabilistic Measure of Diverse Density
Using Bayes’ rule, maximizing (1) is equivalent to maximizing
Pr(B1+, …, Bn+, B1-, …, Bn- | x = t) ……(2)
Further assuming that the bags are conditionally independent given t, the best hypothesis is
argmaxx ∏i Pr(Bi+ | x = t) ∏i Pr(Bi- | x = t) ……(3)
General Definition of DD
Again using Bayes’ rule, (3) is equivalent to
argmaxx ∏i Pr(x = t | Bi+) ∏i Pr(x = t | Bi-) ……(4)
(assume a uniform prior over concept location)
x will have high Diverse Density if every positive bag has an instance close to x and no negative bags are close to x.
Noisy-or Model
The causal probability of instance j in bag Bi:
Pr(x = t | Bij) = exp(-||Bij – x||2)
A positive bag's contribution:
Pr(x = t | Bi+) = Pr(x = t | Bi1+, Bi2+, …) = 1 - ∏j (1 - Pr(x = t | Bij+))
A negative bag's contribution:
Pr(x = t | Bi-) = Pr(x = t | Bi1-, Bi2-, …) = ∏j (1 - Pr(x = t | Bij-))
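These formulas can be sketched directly in Python (a minimal, unweighted version; the function names are ours):

```python
import math

def instance_prob(x, inst):
    # Pr(x = t | B_ij) = exp(-||B_ij - x||^2)
    d2 = sum((a - b) ** 2 for a, b in zip(inst, x))
    return math.exp(-d2)

def diverse_density(x, pos_bags, neg_bags):
    # Product of the noisy-or contribution of each positive bag
    # and the contribution of each negative bag, as in equation (4).
    dd = 1.0
    for bag in pos_bags:
        dd *= 1.0 - math.prod(1.0 - instance_prob(x, inst) for inst in bag)
    for bag in neg_bags:
        dd *= math.prod(1.0 - instance_prob(x, inst) for inst in bag)
    return dd
```

A point near one instance of every positive bag and far from all negative instances scores near 1; a point on top of a negative instance scores exactly 0.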
Feature Relevance
"Closeness" depends on the features. Problem: some features may be irrelevant, and others may matter more than the rest.
Solution: "weight" the features by their relevance:
||Bij – x||2 = ∑k wk (Bijk – xk)2
Find the best weighting of the features by finding the weights that maximize Diverse Density.
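The weighted distance drops into the same causal probability (a sketch following the formula on this slide, where wk enters linearly; the function name is ours):

```python
import math

def weighted_instance_prob(x, inst, w):
    # exp(-sum_k w_k * (B_ijk - x_k)^2): a feature k with w_k near 0
    # stops affecting "closeness"; a large w_k makes it dominate.
    d2 = sum(wk * (a - b) ** 2 for wk, a, b in zip(w, inst, x))
    return math.exp(-d2)

# With the second feature weighted to zero, a large gap there is ignored:
p = weighted_instance_prob([0.0, 0.0], [0.0, 100.0], [1.0, 0.0])
```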
Label Prediction
Predict the label of unknown bag Bi for hypothesis t :
Label(Bi | t) = maxj{exp[-∑k (wk(Bijk – tk))2]}
where wk is a scale factor indicating the importance of dimension k.
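The prediction rule can be sketched as follows (function name ours; note that on this slide the weights enter squared, unlike the distance formula earlier):

```python
import math

def predict_label(bag, t, w):
    # Label(B_i | t) = max_j exp(-sum_k (w_k * (B_ijk - t_k))^2):
    # the bag's label is driven by its most concept-like instance.
    return max(
        math.exp(-sum((wk * (bk - tk)) ** 2 for wk, bk, tk in zip(w, inst, t)))
        for inst in bag
    )
```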
Finding the Maximum DD
Use gradient ascent with multiple starting points.
The maximum DD peak is made of contributions from some set of positive instances.
Start an ascent from every positive instance; one of them is likely to be closest to the maximum, contribute most to it, and allow a climb directly onto the peak.
While this heuristic is sensible when maximizing with respect to location, maximizing with respect to the feature-weight scaling may still get stuck in local maxima.
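The multiple-restart search can be sketched generically (a toy fixed-step ascent with a numerical gradient; the paper uses a proper line-search optimizer, and in DD the objective f would be the Diverse Density with the positive instances as start points):

```python
def ascend(start, f, step=0.05, iters=200, eps=1e-4):
    # Fixed-step gradient ascent on f; gradient estimated by
    # central differences, so f only needs to be evaluable.
    x = list(start)
    for _ in range(iters):
        grad = []
        for k in range(len(x)):
            up, down = x[:], x[:]
            up[k] += eps
            down[k] -= eps
            grad.append((f(up) - f(down)) / (2 * eps))
        x = [xk + step * gk for xk, gk in zip(x, grad)]
    return x

def multistart_max(starts, f):
    # Restart the ascent from every start point (every positive
    # instance, in DD) and keep the highest-scoring result.
    return max((ascend(s, f) for s in starts), key=f)
```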
Experiments
Figure 3(a) shows the regular density surface for the data set in Figure 2, and it is clear that finding the peak is difficult. Figure 3(b) plots the DD surface, where it is easy to pick out the global maximum, which is the desired concept.
Performance Evaluation
The table below lists the average accuracy of twenty runs, compared with the performance of the two principal algorithms reported in [Dietterich et al., 1997] (iterated-discrim APR and GFS elim-kde APR), as well as the MULTINST algorithm from [Auer, 1997].
EM-DD[Zhang and Goldman, 2001]
In the MIL setting, the label of a bag is determined by the "most positive" instance in the bag, i.e., the one with the highest probability of being positive among all the instances in that bag. The difficulty of MIL comes from the ambiguity of not knowing which instance is the most likely one.
In [Zhang and Goldman, 2001], the knowledge of which instance determines the label of the bag is modeled with a set of hidden variables, which are estimated using an Expectation-Maximization (EM) style approach. This results in an algorithm called EM-DD, which combines the EM-style approach with the DD algorithm.
EM-DD Algorithm
Expectation-Maximization algorithm [Dempster, Laird and Rubin, 1977].
Start with an initial guess h (which can be obtained using the original DD algorithm), set to some appropriate instance from a positive bag.
E-step: h is used to pick the one instance from each bag that is most likely (under the generative model) to be responsible for the bag's label.
M-step: run the two-step gradient-ascent search (quasi-Newton search) of the standard DD algorithm to find a new h' that maximizes DD(h').
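The E/M loop above can be sketched as follows (a simplification: `maximize_dd` stands in for the DD gradient-ascent step and is passed in; convergence is checked on the DD value; all names are ours):

```python
import math

def em_dd(pos_bags, neg_bags, h0, maximize_dd, tol=1e-6, max_iter=50):
    def closest(bag, h):
        # Instance most likely responsible for the bag's label under h.
        return min(bag, key=lambda inst: sum((a - b) ** 2
                                             for a, b in zip(inst, h)))

    h, prev_dd = h0, -math.inf
    for _ in range(max_iter):
        # E-step: replace each bag by its single most likely instance.
        pos_reps = [[closest(bag, h)] for bag in pos_bags]
        neg_reps = [[closest(bag, h)] for bag in neg_bags]
        # M-step: maximize DD over these singleton bags.
        h, dd = maximize_dd(pos_reps, neg_reps, h)
        if dd - prev_dd < tol:
            break
        prev_dd = dd
    return h
```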
Comparison of Performance
Thank you!