
  • Service Engineering and Management

    Data Mining and Data Warehousing

    Partially Supervised Learning Prof.dr.ing. Florin Radulescu

    Universitatea Politehnica din Bucureşti

  • 2

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 3

    In supervised learning the goal is to build a classifier starting from a set of labeled examples.

    Unsupervised learning starts with a set of unlabeled examples trying to discover the inner structure of this set, like in clustering.

    Partially supervised learning (or semi-supervised learning) learning includes a series of algorithms and techniques using a (small) set of labeled examples and a (possible large) set of unlabeled examples for performing classification or regression.

    Partially supervised learning

  • 4

    The need for such algorithms and techniques comes from the cost of obtaining labeled examples.

    Labeling is in many cases done manually by experts, so the volume of labeled examples is often small.

    When learning starts from a finite number of training examples in a high-dimensional space, and for each dimension the number of possible values is large, the amount of training data required to ensure that there are several samples for each combination of values is huge.

    Partially supervised learning
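    As a rough back-of-the-envelope illustration of this data requirement (not from the original slides), the short Python sketch below counts how many examples are needed so that every combination of attribute values is seen a few times; the function name and the sample numbers are illustrative assumptions.

```python
# Illustrative calculation: with d attributes, each taking v discrete values,
# roughly s * v**d training examples are needed to see every combination ~s times.

def required_examples(dims: int, values_per_dim: int, samples_per_cell: int = 3) -> int:
    """Examples needed to cover each value combination about samples_per_cell times."""
    return samples_per_cell * values_per_dim ** dims

for d in (2, 5, 10, 20):
    print(f"{d:2d} dimensions, 10 values each -> ~{required_examples(d, 10):,} examples")
```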

  • 5

    For a given number of training samples the predictive power decreases as the dimensionality increases.

    This phenomenon is called the Hughes effect (or Hughes phenomenon), after Gordon F. Hughes, who published in 1968 the paper "On the mean accuracy of statistical pattern recognizers".

    Adding extra information to a small number of labeled training examples will increase the accuracy (by delaying the occurrence of the effect described).

    Hughes effect

  • 6

    Consider a 2D space containing only two labeled examples: a positive example and a negative one.

    Effect of unlabeled examples

  • 7

    Based on these labeled examples a classifier may be built, represented by the dotted line.

    The points to the left of the line will be classified as positive and the points to the right of the line as negative.

    This classifier is not very accurate because the number of labeled examples is too small.

    Effect of unlabeled examples

  • 8

    Suppose now that several unlabeled examples are added, as in the next figure.

    Effect of unlabeled examples

  • 9

    In that case, the unlabeled examples that are near or linked in some way to the two labeled examples may be considered to have the same label, and the classifier changes.

    The unlabeled examples form two clusters, one containing the positive example and the other the negative one.

    Effect of unlabeled examples

  • 10

    Consequently, the border between the positive area and the negative area has an irregular shape (the dotted line).

    In terms of accuracy, these two classifiers are very different, the second being much more accurate than the first.

    Effect of unlabeled examples

  • 11

    Another illustration is the case of a training set containing only positive examples (or examples belonging to the same class).

    The next figure shows six points labeled as positive placed in a 2D space.

    Because there are no negative examples, there is no way to determine a separation between the positive area and the negative area.

    Positive and unlabeled examples

  • 12

    At this point, any of the lines in the figure may be a separation line.

    Positive and unlabeled examples

  • 13

    In the following figure, several unlabeled examples are added.

    Positive and unlabeled examples

  • 14

    Some unlabeled examples are placed near the positive examples and some other unlabeled examples are placed separately.

    The natural assumption is that the examples in the first category are also positive examples and the other examples are negative ones.

    Positive and unlabeled examples

  • 15

    The effect of using unlabeled examples depends on whether the labeled and unlabeled examples come from the same distribution or from different distributions.

    There are three cases, derived from the treatment of missing data (here, missing data = the label is missing).

    Distribution of unlabeled examples

  • 16

    Suppose you are modelling weight (Y) as a function of gender (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:

    1. There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing may have no relationship to X or Y. In this case our data is missing completely at random (MCAR).

    2. One gender may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random (MAR).

    3. Heavy (or light) people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random (MNAR).

    The formal definitions are given on the next slides.

    Example (from http://www.math.chalmers.se/Stat/Grundutb/GU/MSA650/S09/Lecture5.pdf)

  • 17

    MCAR = Missing Completely At Random.

    If x is the vector of attribute values and y the class, we have:

    P(labeled=1 | x, y) = P(labeled=1)

    So labeled and unlabeled examples come from the same distribution (the fact that an example is labeled or not is related neither to the attribute values nor to the class of the example).

    MCAR

  • 18

    MAR = Missing At Random.

    In this case:

    P(labeled=1 | x, y) = P(labeled=1 | x)

    The probability for an example to be labeled is not related to the class.

    We also have: P(y=1 | x, labeled=1) = P(y=1 | x, labeled=0) = P(y=1 | x)

    So, for a fixed x, the class distribution is the same among labeled and unlabeled examples.

    But in this case the conditional distribution of x given y is not the same in labeled and unlabeled data.

    MAR

  • 19

    MNAR = Missing Not At Random.

    In this case the probability of being labeled depends on the class y as well:

    P(labeled=1 | x, y) ≠ P(labeled=1 | x)

    Labeled and unlabeled examples are not from the same distribution.

    MNAR
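    As a small illustration of the three mechanisms, here is a simulation sketch in Python with invented numbers; only the weight/gender setting and the MCAR/MAR/MNAR definitions come from the slides.

```python
# Simulating the weight/gender example with the three missing-label mechanisms.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)                    # X: 0 or 1
weight = rng.normal(62 + 18 * gender, 10, n)      # Y depends on X

p_mcar = np.full(n, 0.3)                          # MCAR: constant missingness
p_mar  = np.where(gender == 1, 0.5, 0.1)          # MAR: depends only on X
p_mnar = 1 / (1 + np.exp(-(weight - 80) / 5))     # MNAR: depends on Y itself

for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]:
    missing = rng.random(n) < p
    print(f"{name}: mean of observed weights = {weight[~missing].mean():.1f}, "
          f"true mean = {weight.mean():.1f}")
```

    Under MCAR the observed mean stays close to the true mean; under MAR and MNAR it is biased, which is why the relation between the labeled and unlabeled distributions matters.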

  • 20

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 21

    The next slides present some techniques for using unlabeled data along with a training set containing labeled examples belonging to all classes, based on the study in [Chawla, Karakoulas 2005].

    The study evaluates four learning techniques:

    Co-training

    ASSEMBLE

    Re-weighting

    Expectation-Maximization

    Learning from labeled and unlabeled data

  • 22

    Co-training and ASSEMBLE assume an MCAR distribution and the other two techniques a MAR one.

    The study used Naïve Bayes as the underlying supervised learner and, for co-training (which requires two classifiers), the second classifier was C4.5.

    Learning from labeled and unlabeled data

  • 23

    Co-training was proposed by Blum and Mitchell in the paper “Combining labeled and unlabeled data with co-training” presented in 1998 at the Workshop on Computational Learning Theory:

    The attributes x describing examples can be split into two disjoint subsets that are independent, or, in other words, the instance space X can be written as a Cartesian product:

    X = X1 × X2

    where X1 and X2 correspond to two different views of an example.

    Alternate definition: each example x is given as a pair x = (x1, x2).

    Co-training (Blum and Mitchell)

  • 24

    The main assumption is that X1 and X2 are each sufficient for learning a classifier.

    The example presented in the original article is a set of web pages. Each page is described by x1 = {words on the web page} and also by x2 = {words on the links pointing to the web page}.

    Co-training (Blum and Mitchell)

  • 25

    1. Initially LA = LB = L, UA = UB = U
    2. Build two classifiers: A from LA and X1, and B from LB and X2
    3. Allow A to label the set UA, obtaining L1
    4. Allow B to label the set UB, obtaining L2
    5. Based on confidence, select C1 from L1 and C2 from L2 (subsets containing a number of the most confident examples for each class)
    6. Add C1 to LB and subtract it from UB
    7. Add C2 to LA and subtract it from UA
    8. Go to step 2 until the stopping criteria are met

    Co-training algorithm (v1)

  • 26

    The process ends when there are no more unlabeled examples or when C1 and C2 are empty.

    In the latter case there are still some unlabeled examples, but the confidence of their classification (for example, the probability of the assigned class) is below a given threshold.

    In the end, the final classifier is obtained by combining A and B (the two classifiers obtained at the last execution of step 2).

    The experiments described in the original article use a slightly different form of the algorithm, presented on the next slide.

    Co-training (Blum and Mitchell)

  • 27

    1. Given:

    • A set L of labeled examples

    • A set U of unlabeled examples

    2. Create a pool U’ of examples by choosing u examples at random from U.

    3. Loop for k iterations:
        3.1. Use L to train a classifier h1 that considers only the x1 portion of x
        3.2. Use L to train a classifier h2 that considers only the x2 portion of x

    3.3. Allow h1 to label p positive and n negative examples from U’

    3.4. Allow h2 to label p positive and n negative examples from U’

    3.5. Add these self-labeled examples to L

    3.6. Randomly choose 2p + 2n examples from U to replenish U’

    Co-training algorithm (v0)
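    A minimal sketch of this loop, assuming two precomputed non-negative feature views X1 and X2 (e.g. bags of words) as numpy arrays and scikit-learn's MultinomialNB for both classifiers; the function co_train and its argument handling are illustrative, not code from [Blum, Mitchell, 98].

```python
# Co-training (v0) sketch: L and U are lists of row indices into X1/X2;
# y holds labels in {1, -1} for indices in L (values for U are overwritten).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, L, U, p=1, n=3, k=30, u=75, seed=0):
    rng = np.random.default_rng(seed)
    L, U, y = list(L), list(U), np.array(y)
    U_prime = [U.pop(rng.integers(len(U))) for _ in range(min(u, len(U)))]
    for _ in range(k):
        h1 = MultinomialNB().fit(X1[L], y[L])              # view-1 classifier
        h2 = MultinomialNB().fit(X2[L], y[L])              # view-2 classifier
        for h, X in ((h1, X1), (h2, X2)):
            if not U_prime:
                break
            proba = h.predict_proba(X[U_prime])
            pos = list(h.classes_).index(1)
            order = np.argsort(proba[:, pos])
            picks = {int(i): -1 for i in order[:n]}        # n most confident negatives
            picks.update({int(i): 1 for i in order[-p:]})  # p most confident positives
            for idx in sorted(picks, reverse=True):        # pop highest indices first
                ex = U_prime.pop(idx)
                y[ex] = picks[idx]
                L.append(ex)
        while U and len(U_prime) < u:                      # replenish U' from U
            U_prime.append(U.pop(rng.integers(len(U))))
    return h1, h2                                          # to be combined as in the slides
```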

  • 28

    The experiments were made using Naive Bayes as h1 and h2, p = 1, n = 3, k = 30 and u = 75.

    Beginning with 12 labeled web pages and using 1000 additional unlabeled web pages, the results were:

    average error when learning only from labeled data: 11.1%;

    average error when using co-training: 5.0%

    Co-training algorithm (v0)

  • 29

    Results described in [Blum, Mitchell, 98] (classification error, %):

    Co-training results

                              Page-based classifier   Link-based classifier   Combined classifier
    Supervised training               12.9                    12.4                   11.1
    Co-training                        6.2                    11.6                    5.0

  • 30

    In [Goldman, Zhou, 2000] there is another approach to co-training:

    o Because it is not always possible to split the feature space X into two disjoint and independent subspaces X1 and X2, two different algorithms are used for building the two classifiers h1 and h2 on the same feature space.

    o Each algorithm labels some unlabeled examples, and these will be included by the other algorithm in its training data:

    o If the two classifiers are denoted A and B, A will produce LB and B will produce LA.

    Co-training (Goldman and Zhou)

  • 31

    1. Repeat until LA and LB do not change during an iteration. For each algorithm do:

    2. Train algorithm A on L ∪ LA to obtain the hypothesis HA (a hypothesis defines a partition of the instance space). Similarly for B.

    3. Each algorithm considers each of its equivalence classes and decides which ones to use to label data from U for the other algorithm, using two tests. For A the tests are (similarly for B):

    o The class k used by A to label data for B has accuracy at least as good as the accuracy of B.

    o The conservative estimate for class k is bigger than the conservative estimate of B. (The conservative estimate is a noise-adjusted estimate derived from the hypothesis error; it prevents the degradation of B's performance due to noise.)

    4. All examples in U passing these tests are placed in LB (similarly for B, whose selected examples are placed in LA).

    5. End Repeat

    6. At the end, combine HA and HB

    Algorithm

  • 32

    ASSEMBLE is an ensemble algorithm presented in [Bennett et al, 2002].

    It won the NIPS 2001 Unlabeled Data Competition.

    It alternates between assigning "pseudo-classes" to the instances from the unlabeled data set and constructing the next base classifier using the labeled examples but also the unlabeled examples.

    For the unlabeled examples, the previously assigned pseudo-class is used.

    ASSEMBLE

  • 33

    Any weight-sensitive classification algorithm can be boosted using labeled and unlabeled data.

    ASSEMBLE can exploit unlabeled data to reduce the number of classifiers needed in the ensemble, therefore speeding up learning.

    ASSEMBLE works well in practice.

    Computational results show the approach is effective on a number of test problems, producing more accurate ensembles than AdaBoost using the same number of base learners.

    ASSEMBLE - advantages
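    The slides describe ASSEMBLE only at this high level; the sketch below is a deliberately simplified illustration of the alternation they mention (pseudo-label U with the current ensemble, then fit the next weight-sensitive base learner on labeled plus pseudo-labeled data). It is not the exact algorithm of [Bennett et al, 2002]; the decision-tree base learner, the fixed step size and the function name are assumptions.

```python
# Simplified ASSEMBLE-style loop (illustration only, not the published algorithm).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_assemble(X_l, y_l, X_u, n_rounds=10, alpha=0.5):
    """y_l must use labels in {-1, +1}; returns a function that predicts labels."""
    X_all = np.vstack([X_l, X_u])
    ensemble = [(1.0, DecisionTreeClassifier(max_depth=3).fit(X_l, y_l))]

    def score(X):                                   # weighted vote of the ensemble
        return sum(w * clf.predict(X) for w, clf in ensemble)

    for _ in range(n_rounds):
        pseudo = np.where(score(X_u) >= 0, 1, -1)   # pseudo-classes for unlabeled data
        y_all = np.concatenate([y_l, pseudo])
        base = DecisionTreeClassifier(max_depth=3).fit(X_all, y_all)
        ensemble.append((alpha, base))              # fixed step size for simplicity
    return lambda X: np.where(score(X) >= 0, 1, -1)
```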

  • 34

    Re-weighting is a technique for reject-inferencing in credit scoring presented in [Crook, Banasik, 2002].

    The main idea is to extrapolate information on the examples from approved credit applications to the unlabeled data.

    Re-weighting may be used if the data is of the MAR type, so that the population model for all applicants is the same as that for the accepted applicants only:

    P(y=1 | x, labeled=1) = P(y=1 | x, labeled=0) = P(y=1 | x)

    So: for a given x, the distribution of the classes is the same in the labeled and unlabeled sets.

    Re-weighting

  • 35

    All credit institutions have an archive of approved applications and for each of these applications there is also a Good/Bad performance label.

    Based on the classification variables used to accept/reject an application, applications (both the previously labeled ones and the unlabeled ones) can be scored and partitioned into score groups.

    For every score group, the distribution of classes in the labeled examples is then extrapolated to the unlabeled examples of the same score group, picking examples at random from it.

    Re-weighting

  • 36

    Suppose we have a set of previously accepted applications (the Labeled column in the table below) and a set of unlabeled applications. Each application has a score and can be included in a score group (there are 5 score groups below).

    Re-weighting example

    Score group | Unlabeled | Labeled | Class0 / Bad | Class1 / Good | Group weight | Re-w. Class0 | Re-w. Class1
    0.0-0.2     |    10     |    10   |      6       |       4       |     2        |     12       |      8
    0.2-0.4     |    10     |    20   |     10       |      10       |     1.5      |     15       |     15
    0.4-0.6     |    20     |    60   |     20       |      40       |     1.33     |     27       |     53
    0.6-0.8     |    20     |   100   |     10       |      90       |     1.2      |     12       |    108
    0.8-1.0     |    20     |   200   |     10       |     190       |     1.1      |     11       |    209

  • 37

    The group weight is computed as (XL + XU) / XL, where XL and XU are the numbers of labeled and unlabeled examples in the score group.

    For every score group the weight is used to compute the number of examples of class0 and class1 from the whole score group examples (labeled and unlabeled).

    Example: for score group 0.8-1.0, the weight is 1.1 so re-weighting class0 and class1 we obtain 10*1.1 = 11 for class0 and 190*1.1 = 209 for class1.

    This means that we pick at random 11 - 10 = 1 example from the unlabeled set of that group and label it as class0, and the remaining 209 - 190 = 19 examples (the rest of them, 20 - 1) are labeled as class1.

    Re-weighting example
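    The same computation for all five score groups, as a short Python sketch (the numbers are the ones in the table above; the variable names are assumptions):

```python
# Re-weighting computation: (score group, unlabeled, labeled, class0, class1)
groups = [
    ("0.0-0.2", 10, 10, 6, 4),
    ("0.2-0.4", 10, 20, 10, 10),
    ("0.4-0.6", 20, 60, 20, 40),
    ("0.6-0.8", 20, 100, 10, 90),
    ("0.8-1.0", 20, 200, 10, 190),
]

for name, xu, xl, c0, c1 in groups:
    weight = (xl + xu) / xl                  # group weight = (XL + XU) / XL
    rw0, rw1 = round(c0 * weight), round(c1 * weight)
    extra0 = rw0 - c0                        # unlabeled examples to be labeled class0
    extra1 = xu - extra0                     # the remaining unlabeled ones get class1
    print(f"{name}: weight={weight:.2f}, re-w. class0={rw0}, re-w. class1={rw1} "
          f"-> label {extra0} unlabeled example(s) as class0 and {extra1} as class1")
```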

  • 38

    This procedure is run for every score group.

    At the end, all unlabeled examples have a class0/class1 label.

    Note that class0/class1 is not the same as rejected/accepted; the initial set of labeled examples contains only accepted applications!

    Using the whole set of examples (L + U), all of which now have a class0/class1 label, we can learn a new classifier that incorporates not only the data from the labeled examples but also information from the unlabeled ones.

    Re-weighting example

  • 39

    In [Liu 11] and [Nigam et al, 98] the process is described as follows:

    Initial: Train a classifier using only the set of labeled documents.

    Loop:

    Use this classifier to label (probabilistically) the unlabeled documents (E step)

    Use all the documents to train a new classifier (M step)

    Until convergence.

    Expectation-Maximization

  • 40

    For Naïve Bayes, the expectation step means computing for every class cj and every unlabeled document di the probability Pr(cj | di; Θ).

    Notations are:

    ci – class ci

    D – the set of documents

    di – a document di in D

    V – words vocabulary (set of significant words)

    wdi, k – the word in position k in document di

    Nti - the number of times that word wt occurs in document di

    Θ – the set of parameters of all mixture components, Θ = {α1, α2, …, αK, θ1, θ2, …, θK}. αj is the mixture weight (or mixture probability) of mixture component j; θj is the set of parameters of component j. K is the number of mixture components.

    Expectation-Maximization

  • 41

    Expectation step: compute class labels (probabilities) for every unlabeled document di:

    Pr(cj | di; Θ) = Pr(cj | Θ) Pr(di | cj; Θ) / Pr(di | Θ)
                   = Pr(cj | Θ) ∏_{k=1}^{|di|} Pr(w_{di,k} | cj; Θ)  /  Σ_{r=1}^{|C|} [ Pr(cr | Θ) ∏_{k=1}^{|di|} Pr(w_{di,k} | cr; Θ) ]

    Expectation step

  • 42

    Maximization step: re-compute the parameters:

    Pr(wt | cj; Θ) = ( λ + Σ_{i=1}^{|D|} Nti Pr(cj | di) )  /  ( λ|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} Nsi Pr(cj | di) )

    Pr(cj | Θ) = ( Σ_{i=1}^{|D|} Pr(cj | di) ) / |D|

    (here λ is a smoothing constant, e.g. λ = 1 for Laplace smoothing)

    Maximization step
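    A compact sketch of this EM loop, assuming scikit-learn's MultinomialNB on sparse document-term count matrices (an implementation choice, not prescribed by the slides); the soft labels Pr(cj | di) from the E step are passed back to the M step through sample weights.

```python
# Semi-supervised EM with Naive Bayes (sketch). X_l, X_u: scipy sparse count matrices.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_l, y_l, X_u, n_iter=10, alpha=1.0):
    clf = MultinomialNB(alpha=alpha).fit(X_l, y_l)   # initial classifier: labeled docs only
    classes = clf.classes_
    for _ in range(n_iter):                          # fixed iteration count instead of a convergence test
        resp = clf.predict_proba(X_u)                # E step: Pr(cj | di) for unlabeled docs
        # M step: refit on all documents; each unlabeled doc appears once per class,
        # weighted by its class probability (soft assignment)
        X_all = vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(X_u.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_l.shape[0])] +
                               [resp[:, j] for j in range(len(classes))])
        clf = MultinomialNB(alpha=alpha).fit(X_all, y_all, sample_weight=w_all)
    return clf
```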

  • 43

    The EM algorithm works well if the two mixture model assumptions are true for a particular data set:

    o The data (or the text documents) are generated by a mixture model,

    o There is one-to-one correspondence between mixture components and document classes.

    In many real-life situations these two assumptions are not met.

    For example, the class Sports may contain documents about different sub-classes such as Football, Tennis, and Handball.

    Expectation-Maximization

  • 44

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 45

    Sometimes all labeled examples are only from the positive class. Examples (see [Liu 11]):

    Given a collection of papers on semi-supervised learning, find all semi-supervised learning papers in a proceedings volume or in another collection of documents.

    Given the browser bookmarks of a person, find other documents that may be interesting for that person.

    Given the list of customers of a direct marketing company, identify other persons (from a person database) that may also be interested in those products.

    Given the approved and good (as performance) applications from a credit company, identify other persons that may be interested in getting a credit.

    Positive and unlabeled data

  • 46

    Suppose we have a classification function f and an input vector X labeled with class Y, where Y ∈ {1, -1}. We rewrite the probability of error:

    Pr[f(X) ≠ Y] = Pr[f(X) = 1 and Y = -1] + Pr[f(X) = -1 and Y = 1]   (1)

    Because:

    Pr[f(X) = 1 and Y = -1] = Pr[f(X) = 1] – Pr[f(X) = 1 and Y = 1] = Pr[f(X) = 1] – (Pr[Y = 1] – Pr[f(X) = -1 and Y = 1]).

    Replacing in (1) we obtain:

    Pr[f(X) ≠ Y] = Pr[f(X) = 1] – Pr[Y = 1] + 2 Pr[f(X) = -1 | Y = 1] Pr[Y = 1]   (2)

    Pr[Y = 1] is constant.

    If Pr[f(X) = -1 | Y = 1] is small, minimizing the error is approximately the same as minimizing Pr[f(X) = 1].

    Theoretical foundation

  • 47

    If the sets of positive examples P and unlabeled examples U are large, holding Pr[f(X) = -1 | Y = 1] small while minimizing Pr[f(X) = 1] is approximately the same as:

    o minimizing PrU[f(X) = 1]

    o while holding PrP[f(X) = 1] ≥ r (where r is the recall: Pr[f(X) = 1 | Y = 1]), which is the same as PrP[f(X) = -1] ≤ 1 – r

    In other words:

    o the algorithm tries to minimize the number of unlabeled examples labeled as positive,

    o subject to the constraint that the fraction of errors on the positive examples is no more than 1 – r.

    Theoretical foundation

  • 48

    For implementing the theory above there is a 2-step strategy (presented in [Liu 11]):

    Step 1: Identify in the unlabeled examples a subset called “reliable negatives” (RN).

    These examples will be used as negative labeled examples in the next step.

    We start with only positive examples but must build a negative labeled set in order to use a supervised learning algorithm for building the model (classifier)

    Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.

    In this step we can use Expectation Maximization or SVM for example.

    2-step strategy

  • 49

    Building the reliable negative set is really the key in this case.

    There are several methods ([Zhang, Zuo 2009]):

    Spy technique

    1-DNF algorithm

    Naïve Bayes

    Rocchio

    Obtaining reliable negatives (RN)

  • 50

    In this technique, we first randomly select a set S of positive documents from P and put them in U.

    These examples are the spies.

    They behave identically to the unknown positive documents in P.

    Then, using the I-EM algorithm with P − S as positive and U ∪ S as negative, a classifier is obtained.

    It uses the probabilities assigned to the documents in S to decide a probability threshold th to identify possible negative documents in U:

    all documents with a probability lower than that of any spy will be assigned to RN.

    Spy technique

  • 51

    Spy algorithm

    1. RN = {};

    2. S = Sample(P, s%);

    3. Us = U ∪ S;

    4. Ps = P-S;

    5. Assign each document in Ps the class label 1;

    6. Assign each document in Us the class label -1;

    7. I-EM(Us, Ps); // This produces a Naïve Bayes classifier.

    8. Classify each document in Us using the NB classifier;

    9. Determine a probability threshold th using S;

    10. For each document d ∈ Us

    11. If its probability Pr(1|d) < th

    12. Then RN = RN ∪ {d};

    13. End If

    14.End For
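    A sketch of the spy technique in Python, assuming sparse count matrices and using a single Naïve Bayes fit in place of the I-EM step (a simplification); the function name and the spy ratio are assumptions.

```python
# Spy technique sketch. P, U: scipy sparse count matrices (one row per document).
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(P, U, spy_ratio=0.15, seed=0):
    rng = np.random.default_rng(seed)
    n_p = P.shape[0]
    spy_idx = rng.choice(n_p, size=max(1, int(spy_ratio * n_p)), replace=False)
    rest_idx = np.setdiff1d(np.arange(n_p), spy_idx)

    Ps, S = P[rest_idx], P[spy_idx]            # Ps = P - S, S = the spies
    Us = vstack([U, S])                        # Us = U plus the spies, treated as negative
    X = vstack([Ps, Us])
    y = np.concatenate([np.ones(Ps.shape[0]), -np.ones(Us.shape[0])])

    clf = MultinomialNB().fit(X, y)            # stands in for I-EM here
    pos = list(clf.classes_).index(1)
    th = clf.predict_proba(S)[:, pos].min()    # threshold th: lowest probability among spies
    return clf.predict_proba(U)[:, pos] < th   # boolean mask of reliable negatives in U
```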

  • 52

    The algorithm builds a so-called positive feature set (PF) containing words that occur in the set of positive documents P more frequently than in the set of unlabeled documents U.

    Then using PF it tries to identify (for filtering out) possible positive documents from U.

    A document in U that does not have any positive feature in PF is regarded as a strong negative document.

    1-DNF algorithm

  • 53

    Algorithm

    1. PF = {}

    2. For i = 1 to n

    3. If (freq(wi, P) / |P| > freq(wi, U) / |U|)

    4. Then PF = PF ∪ {wi}

    5. End if

    6. End for

    7. RN = U;

    8. For each document d U

    9. If (∃ wi such that freq(wi, d) > 0 and wi ∈ PF)

    10. Then RN = RN - {d}

    11. End if

    12.End for
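    A direct transcription of these steps into Python, assuming each document is represented as a set of its words (the representation and function name are assumptions):

```python
# 1-DNF sketch: P and U are lists of documents, each document given as a set of words.
from collections import Counter

def one_dnf_reliable_negatives(P, U):
    df_p = Counter(w for doc in P for w in doc)    # document frequencies in P
    df_u = Counter(w for doc in U for w in doc)    # document frequencies in U
    vocab = set(df_p) | set(df_u)

    # positive features: words relatively more frequent in P than in U
    PF = {w for w in vocab if df_p[w] / len(P) > df_u[w] / len(U)}

    # reliable negatives: documents in U that contain no positive feature
    return [doc for doc in U if not (doc & PF)]
```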

  • 54

    In this case, a classifier is built considering all unlabeled examples as negative. Then the classifier is used to classify U, and the examples classified as negative form the reliable negative set.

    The algorithm is:

    1. Assign label 1 to each document in P;

    2. Assign label –1 to each document in U;

    3. Build a NB classifier using P and U;

    4. Use the classifier to classify U. Those documents in U that are classified as negative form the reliable negative set RN.

    Naïve Bayes

  • 55

    The algorithm of building RN is the same as for Naïve Bayes with the difference that at step 3 a Rocchio classifier is built instead of a Naïve Bayes one.

    Rocchio builds a prototype vector for each class (a vector describing all documents in the class) and then uses cosine similarity to find the class of test examples: the class of the prototype most similar to the given example.

    Rocchio
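    A sketch of this Rocchio variant on dense TF-IDF (or count) vectors; plain normalized centroids are used as prototypes, which is a simplification of the full Rocchio formula (the original also subtracts a fraction of the other class's centroid):

```python
# Rocchio-based reliable negatives: one prototype per class (positive = P, "negative" = U),
# then each document in U is assigned to the most cosine-similar prototype.
import numpy as np

def rocchio_reliable_negatives(P, U):
    def normalize(M):
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.maximum(norms, 1e-12)

    proto_pos = normalize(P).mean(axis=0)      # prototype of the positive class
    proto_neg = normalize(U).mean(axis=0)      # prototype built from all of U

    Un = normalize(U)
    sim_pos = Un @ proto_pos / max(np.linalg.norm(proto_pos), 1e-12)
    sim_neg = Un @ proto_neg / max(np.linalg.norm(proto_neg), 1e-12)
    return sim_neg > sim_pos                   # True -> document goes into RN
```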

  • 56

    This course presented:

    What is partially supervised learning, with illustrations of the impact of unlabeled data

    Learning from labeled and unlabeled data, where co-training, ASSEMBLE, re-weighting and EM were presented

    Learning with positive and unlabeled examples

    Next week: Information integration

    Summary

  • 57

    [Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data, Second Edition, Springer.

    [Chawla, Karakoulas 2005] Nitesh V. Chawla, Grigoris Karakoulas, Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains, Journal of Artificial Intelligence Research, volume 23, 2005, pages 331-366.

    [Nigam et al, 98] Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, Using EM to Classify Text from Labeled and Unlabeled Documents, Technical Report CMU-CS-98-120. Carnegie Mellon University. 1998

    [Blum, Mitchell, 98] Blum, A., Mitchell, T., Combining labeled and unlabeled data with co-training, Procs. of the Workshop on Computational Learning Theory, 1998.

    [Goldman, Zhou, 2000] Sally Goldman, Yan Zhou, Enhancing Supervised Learning with Unlabeled Data, Proceedings of the Seventeenth International Conference on Machine Learning (ICML), 2000, pages 327 – 334

    [Bennett et al, 2002] Bennett, K., Demiriz, A., Maclin, R., Exploiting unlabeled data in ensemble methods, Procs. of the Intl. Conf. on Knowledge Discovery and Data Mining (KDD), 2002, pages 289-296.

    [Crook, Banasik, 2002] Crook, J., Banasik, J., Sample selection bias in credit scoring models, Intl. Conf. on Credit Risk Modeling and Decisioning, 2002.

    [Zhang, Zuo 2009] Bangzuo Zhang, Wanli Zuo, Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples, Journal of Computers, vol. 4, no. 1, 2009.

    [Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org

    References