
  • Service Engineering and Management

    Data Mining and Data Warehousing

    Partially Supervised Learning Prof.dr.ing. Florin Radulescu

    Universitatea Politehnica din Bucureşti

  • 2

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 3

    In supervised learning the goal is to build a classifier starting from a set of labeled examples.

    Unsupervised learning starts with a set of unlabeled examples trying to discover the inner structure of this set, like in clustering.

    Partially supervised learning (or semi-supervised learning) learning includes a series of algorithms and techniques using a (small) set of labeled examples and a (possible large) set of unlabeled examples for performing classification or regression.

    Partially supervised learning

  • 4

    The need for such algorithms and techniques comes from the cost of obtaining labeled examples.

    Labeling is in many cases done manually by experts, so the volume of labeled examples is often small.

    When learning starts from a finite number of training examples in a high-dimensional space, and for each dimension the number of possible values is large, the amount of training data required to ensure that there are several samples for each combination of values is huge.

    Partially supervised learning
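    As a rough back-of-the-envelope illustration of this data requirement (not from the original slides), the short Python sketch below counts how many examples are needed so that every combination of attribute values is seen a few times; the function name and the sample numbers are illustrative assumptions.

```python
# Illustrative calculation: with d attributes, each taking v discrete values,
# roughly s * v**d training examples are needed to see every combination ~s times.

def required_examples(dims: int, values_per_dim: int, samples_per_cell: int = 3) -> int:
    """Examples needed to cover each value combination about samples_per_cell times."""
    return samples_per_cell * values_per_dim ** dims

for d in (2, 5, 10, 20):
    print(f"{d:2d} dimensions, 10 values each -> ~{required_examples(d, 10):,} examples")
```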

  • 5

    For a given number of training samples the predictive power decreases as the dimensionality increases.

    This phenomenon is called the Hughes effect (or Hughes phenomenon), after Gordon F. Hughes, who published in 1968 the paper "On the mean accuracy of statistical pattern recognizers".

    Adding extra information to a small number of labeled training examples will increase the accuracy (by delaying the occurrence of the effect described).

    Hughes effect

  • 6

    Consider a 2D space containing only two labeled examples: a positive example and a negative one.

    Effect of unlabeled examples

  • 7

    Based on these labeled examples a classifier may be built, represented by the dotted line.

    The points to the left of the line will be classified as positive and the points to the right of the line as negative.

    This classifier is not very accurate because the number of labeled examples is too small.

    Effect of unlabeled examples

  • 8

    Suppose now that several unlabeled examples are added, as in the next figure.

    Effect of unlabeled examples

  • 9

    In that case, the unlabeled examples that are near or linked in some way to the two labeled examples may be considered to have the same label, and the classifier changes.

    The unlabeled examples form two clusters, one containing the positive example and the other the negative one.

    Effect of unlabeled examples

  • 10

    Consequently, the border between the positive area and the negative area has an irregular shape (the dotted line).

    In terms of accuracy, these two classifiers are very different, the second being much more accurate than the first.

    Effect of unlabeled examples

  • 11

    Another illustration is the case of a training set containing only positive examples (or examples belonging to the same class).

    The next figure shows six points labeled as positive placed in a 2D space.

    Because there are no negative examples, there is no way to determine a separation between the positive area and the negative area.

    Positive and unlabeled examples

  • 12

    At this point, any of the lines in the figure may be a separation line.

    Positive and unlabeled examples

  • 13

    In the following figure, several unlabeled examples are added.

    Positive and unlabeled examples

  • 14

    Some unlabeled examples are placed near the positive examples and some other unlabeled examples are placed separately.

    The natural assumption is that the examples in the first category are also positive examples and the other examples are negative ones.

    Positive and unlabeled examples

  • 15

    The effect of using unlabeled examples depends on whether the labeled and unlabeled examples come from the same distribution or from different distributions.

    There are three cases, derived from the treatment of missing data (here, missing data = the label is missing).

    Distribution of unlabeled examples

  • 16

    Suppose you are modelling weight (Y) as a function of gender (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:

    1. There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing may have no relationship to X or Y. In this case our data is missing completely at random (MCAR).

    2. One gender may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random (MAR).

    3. Heavy (or light) people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random (MNAR).

    The formal definitions are given on the next slides.

    Example (from http://www.math.chalmers.se/Stat/Grundutb/GU/MSA650/S09/Lecture5.pdf)

  • 17

    MCAR = Missing Completely At Random.

    If x is the vector of attribute values and y the class, we have:

    P(labeled=1 | x, y) = P(labeled=1)

    So labeled and unlabeled examples come from the same distribution (the fact that an example is labeled or not is related neither to the attribute values nor to the class of the example).

    MCAR

  • 18

    MAR = Missing At Random.

    In this case:

    P(labeled=1 | x, y) = P(labeled=1 | x)

    The probability for an example to be labeled is not related to the class.

    We also have: P(y=1 | x, labeled=1) = P(y=1 | x, labeled=0) = P(y=1 | x)

    So, for a fixed x, the class distribution is the same among labeled and unlabeled examples.

    But in this case the conditional distribution of x given y is not the same in labeled and unlabeled data.

    MAR

  • 19

    MNAR = Missing Not At Random.

    In this case the probability of being labeled depends on the class y as well:

    P(labeled=1 | x, y) ≠ P(labeled=1 | x)

    Labeled and unlabeled examples are not from the same distribution.

    MNAR
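    As a small illustration of the three mechanisms, here is a simulation sketch in Python with invented numbers; only the weight/gender setting and the MCAR/MAR/MNAR definitions come from the slides.

```python
# Simulating the weight/gender example with the three missing-label mechanisms.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
gender = rng.integers(0, 2, n)                    # X: 0 or 1
weight = rng.normal(62 + 18 * gender, 10, n)      # Y depends on X

p_mcar = np.full(n, 0.3)                          # MCAR: constant missingness
p_mar  = np.where(gender == 1, 0.5, 0.1)          # MAR: depends only on X
p_mnar = 1 / (1 + np.exp(-(weight - 80) / 5))     # MNAR: depends on Y itself

for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]:
    missing = rng.random(n) < p
    print(f"{name}: mean of observed weights = {weight[~missing].mean():.1f}, "
          f"true mean = {weight.mean():.1f}")
```

    Under MCAR the observed mean stays close to the true mean; under MAR and MNAR it is biased, which is why the relation between the labeled and unlabeled distributions matters.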

  • 20

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 21

    The next slides present some techniques for using unlabeled data along with a training set containing labeled examples belonging to all classes, based on the study in [Chawla, Karakoulas 2005].

    The study evaluates four learning techniques:

    Co-training

    ASSEMBLE

    Re-weighting

    Expectation-Maximization

    Learning from labeled and unlabeled data

  • 22

    Co-training and ASSEMBLE assume an MCAR distribution and the other two techniques a MAR one.

    The study used Naïve Bayes as the underlying supervised learner and, for co-training (which requires two classifiers), the second classifier was C4.5.

    Learning from labeled and unlabeled data

  • 23

    Co-training was proposed by Blum and Mitchell in the paper “Combining labeled and unlabeled data with co-training” presented in 1998 at the Workshop on Computational Learning Theory:

    The attributes x describing examples can be split into two disjoint subsets that are independent, or, in other words, the instance space X can be written as a Cartesian product:

    X = X1 × X2

    where X1 and X2 correspond to two different views of an example.

    Alternate definition: each example x is given as a pair x = (x1, x2).

    Co-training (Blum and Mitchell)

  • 24

    The main assumption is that X1 and X2 are each sufficient for learning a classifier.

    The example presented in the original article is a set of web pages. Each page is described by x1 = {words on the web page} and also by x2 = {words on the links pointing to the web page}.

    Co-training (Blum and Mitchell)

  • 25

    1. Initially LA = LB = L, UA = UB = U
    2. Build two classifiers: A from LA and X1, and B from LB and X2
    3. Allow A to label the set UA, obtaining L1
    4. Allow B to label the set UB, obtaining L2
    5. Based on confidence, select C1 from L1 and C2 from L2 (subsets containing a number of the most confident examples for each class)
    6. Add C1 to LB and subtract it from UB
    7. Add C2 to LA and subtract it from UA
    8. Go to step 2 until the stopping criteria are met

    Co-training algorithm (v1)

  • 26

    The process ends when there are no more unlabeled examples or when C1 and C2 are empty.

    In the latter case there are still some unlabeled examples, but the confidence of their classification (for example, the probability of the assigned class) is below a given threshold.

    In the end, the final classifier is obtained by combining A and B (the two classifiers obtained at the last execution of step 2).

    The experiments described in the original article use a slightly different form of the algorithm, presented on the next slide.

    Co-training (Blum and Mitchell)

  • 27

    1. Given:

    • A set L of labeled examples

    • A set U of unlabeled examples

    2. Create a pool U’ of examples by choosing u examples at random from U.

    3. Loop for k iterations:
        3.1. Use L to train a classifier h1 that considers only the x1 portion of x
        3.2. Use L to train a classifier h2 that considers only the x2 portion of x

    3.3. Allow h1 to label p positive and n negative examples from U’

    3.4. Allow h2 to label p positive and n negative examples from U’

    3.5. Add these self-labeled examples to L

    3.6. Randomly choose 2p + 2n examples from U to replenish U’

    Co-training algorithm (v0)
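    A minimal sketch of this loop, assuming two precomputed non-negative feature views X1 and X2 (e.g. bags of words) as numpy arrays and scikit-learn's MultinomialNB for both classifiers; the function co_train and its argument handling are illustrative, not code from [Blum, Mitchell, 98].

```python
# Co-training (v0) sketch: L and U are lists of row indices into X1/X2;
# y holds labels in {1, -1} for indices in L (values for U are overwritten).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, L, U, p=1, n=3, k=30, u=75, seed=0):
    rng = np.random.default_rng(seed)
    L, U, y = list(L), list(U), np.array(y)
    U_prime = [U.pop(rng.integers(len(U))) for _ in range(min(u, len(U)))]
    for _ in range(k):
        h1 = MultinomialNB().fit(X1[L], y[L])              # view-1 classifier
        h2 = MultinomialNB().fit(X2[L], y[L])              # view-2 classifier
        for h, X in ((h1, X1), (h2, X2)):
            if not U_prime:
                break
            proba = h.predict_proba(X[U_prime])
            pos = list(h.classes_).index(1)
            order = np.argsort(proba[:, pos])
            picks = {int(i): -1 for i in order[:n]}        # n most confident negatives
            picks.update({int(i): 1 for i in order[-p:]})  # p most confident positives
            for idx in sorted(picks, reverse=True):        # pop highest indices first
                ex = U_prime.pop(idx)
                y[ex] = picks[idx]
                L.append(ex)
        while U and len(U_prime) < u:                      # replenish U' from U
            U_prime.append(U.pop(rng.integers(len(U))))
    return h1, h2                                          # to be combined as in the slides
```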

  • 28

    The experiments were made using Naive Bayes as h1 and h2, p = 1, n = 3, k = 30 and u = 75.

    Beginning with 12 labeled web pages and using 1000 additional unlabeled web pages, the results were:

    average error when learning only from labeled data: 11.1%;

    average error when using co-training: 5.0%

    Co-training algorithm (v0)

  • 29

    Results described in [Blum, Mitchell, 98] (classification error, %):

    Co-training results

                              Page-based classifier   Link-based classifier   Combined classifier
    Supervised training               12.9                    12.4                   11.1
    Co-training                        6.2                    11.6                    5.0

  • 30

    In [Goldman, Zhou, 2000] there is another approach to co-training:

    o Because it is not always possible to split the feature space X into two disjoint and independent subspaces X1 and X2, two different algorithms are used for building the two classifiers h1 and h2 on the same feature space.

    o Each algorithm labels some unlabeled examples, and these will be included by the other algorithm in its training data:

    o If the two classifiers are denoted A and B, A will produce LB and B will produce LA.

    Co-training (Goldman and Zhou)

  • 31

    1. Repeat until LA and LB do not change during an iteration. For each algorithm do:

    2. Train algorithm A on L ∪ LA to obtain the hypothesis HA (a hypothesis defines a partition of the instance space). Similarly for B.

    3. Each algorithm considers each of its equivalence classes and decides which ones to use to label data from U for the other algorithm, using two tests. For A the tests are (similarly for B):

    o The class k used by A to label data for B has accuracy at least as good as the accuracy of B.

    o The conservative estimate for class k is bigger than the conservative estimate of B. (The conservative estimate is a noise-adjusted estimate derived from the hypothesis error; it prevents the degradation of B's performance due to noise.)

    4. All examples in U passing these tests are placed in LB (similarly for B, whose selected examples are placed in LA).

    5. End Repeat

    6. At the end, combine HA and HB

    Algorithm

  • 32

    ASSEMBLE is an ensemble algorithm presented in [Bennett et al, 2002].

    It won the NIPS 2001 Unlabeled Data Competition.

    It alternates between assigning "pseudo-classes" to the instances from the unlabeled data set and constructing the next base classifier using the labeled examples but also the unlabeled examples.

    For the unlabeled examples, the previously assigned pseudo-class is used.

    ASSEMBLE

  • 33

    Any weight-sensitive classification algorithm can be boosted using labeled and unlabeled data.

    ASSEMBLE can exploit unlabeled data to reduce the number of classifiers needed in the ensemble, therefore speeding up learning.

    ASSEMBLE works well in practice.

    Computational results show the approach is effective on a number of test problems, producing more accurate ensembles than AdaBoost using the same number of base learners.

    ASSEMBLE - advantages
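    The slides describe ASSEMBLE only at this high level; the sketch below is a deliberately simplified illustration of the alternation they mention (pseudo-label U with the current ensemble, then fit the next weight-sensitive base learner on labeled plus pseudo-labeled data). It is not the exact algorithm of [Bennett et al, 2002]; the decision-tree base learner, the fixed step size and the function name are assumptions.

```python
# Simplified ASSEMBLE-style loop (illustration only, not the published algorithm).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_assemble(X_l, y_l, X_u, n_rounds=10, alpha=0.5):
    """y_l must use labels in {-1, +1}; returns a function that predicts labels."""
    X_all = np.vstack([X_l, X_u])
    ensemble = [(1.0, DecisionTreeClassifier(max_depth=3).fit(X_l, y_l))]

    def score(X):                                   # weighted vote of the ensemble
        return sum(w * clf.predict(X) for w, clf in ensemble)

    for _ in range(n_rounds):
        pseudo = np.where(score(X_u) >= 0, 1, -1)   # pseudo-classes for unlabeled data
        y_all = np.concatenate([y_l, pseudo])
        base = DecisionTreeClassifier(max_depth=3).fit(X_all, y_all)
        ensemble.append((alpha, base))              # fixed step size for simplicity
    return lambda X: np.where(score(X) >= 0, 1, -1)
```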

  • 34

    Re-weighting is a technique for reject-inferencing in credit scoring presented in [Crook, Banasik, 2002].

    The main idea is to extrapolate information on the examples from approved credit applications to the unlabeled data.

    Re-weighting may be used if the data is of the MAR type, so that the population model for all applicants is the same as that for the accepted applicants only:

    P(y=1 | x, labeled=1) = P(y=1 | x, labeled=0) = P(y=1 | x)

    So: for a given x, the distribution of the classes is the same in the labeled and unlabeled sets.

    Re-weighting

  • 35

    All credit institutions have an archive of approved applications and for each of these applications there is also a Good/Bad performance label.

    Based on the classification variables used to accept/reject an application, applications (both the previously labeled ones and the unlabeled ones) can be scored and partitioned into score groups.

    For every score group, the distribution of classes in the labeled examples is then extrapolated to the unlabeled examples of the same score group, picking examples at random from it.

    Re-weighting

  • 36

    Suppose we have a set of previously accepted applications (the Labeled column in the table below) and a set of unlabeled applications. Each application has a score and can be included in a score group (there are 5 score groups below).

    Re-weighting example

    Score group | Unlabeled | Labeled | Class0 / Bad | Class1 / Good | Group weight | Re-w. Class0 | Re-w. Class1
    0.0-0.2     |    10     |    10   |      6       |       4       |     2        |     12       |      8
    0.2-0.4     |    10     |    20   |     10       |      10       |     1.5      |     15       |     15
    0.4-0.6     |    20     |    60   |     20       |      40       |     1.33     |     27       |     53
    0.6-0.8     |    20     |   100   |     10       |      90       |     1.2      |     12       |    108
    0.8-1.0     |    20     |   200   |     10       |     190       |     1.1      |     11       |    209

  • 37

    The group weight is computed as (XL + XU) / XL, where XL and XU are the numbers of labeled and unlabeled examples in the score group.

    For every score group the weight is used to compute the number of examples of class0 and class1 from the whole score group examples (labeled and unlabeled).

    Example: for score group 0.8-1.0, the weight is 1.1 so re-weighting class0 and class1 we obtain 10*1.1 = 11 for class0 and 190*1.1 = 209 for class1.

    This means that we pick at random 11 - 10 = 1 example from the unlabeled set of that group and label it as class0, and the remaining 209 - 190 = 19 examples (the rest of them, 20 - 1) are labeled as class1.

    Re-weighting example
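    The same computation for all five score groups, as a short Python sketch (the numbers are the ones in the table above; the variable names are assumptions):

```python
# Re-weighting computation: (score group, unlabeled, labeled, class0, class1)
groups = [
    ("0.0-0.2", 10, 10, 6, 4),
    ("0.2-0.4", 10, 20, 10, 10),
    ("0.4-0.6", 20, 60, 20, 40),
    ("0.6-0.8", 20, 100, 10, 90),
    ("0.8-1.0", 20, 200, 10, 190),
]

for name, xu, xl, c0, c1 in groups:
    weight = (xl + xu) / xl                  # group weight = (XL + XU) / XL
    rw0, rw1 = round(c0 * weight), round(c1 * weight)
    extra0 = rw0 - c0                        # unlabeled examples to be labeled class0
    extra1 = xu - extra0                     # the remaining unlabeled ones get class1
    print(f"{name}: weight={weight:.2f}, re-w. class0={rw0}, re-w. class1={rw1} "
          f"-> label {extra0} unlabeled example(s) as class0 and {extra1} as class1")
```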

  • 38

    This procedure is run for every score group.

    At the end, all unlabeled examples have a class0/class1 label.

    Note that class0/class1 is not the same as rejected/accepted; the initial set of labeled examples contains only accepted applications!

    Using the whole set of examples (L + U), all of which now have a class0/class1 label, we can learn a new classifier that incorporates not only the data from the labeled examples but also information from the unlabeled ones.

    Re-weighting example

  • 39

    In [Liu 11] and [Nigam et al, 98] the process is described as follows:

    Initial: Train a classifier using only the set of labeled documents.

    Loop:

    Use this classifier to label (probabilistically) the unlabeled documents (E step)

    Use all the documents to train a new classifier (M step)

    Until convergence.

    Expectation-Maximization

  • 40

    For Naïve Bayes, the expectation step means computing for every class cj and every unlabeled document di the probability Pr(cj | di; Θ).

    Notations are:

    ci – class ci

    D – the set of documents

    di – a document di in D

    V – words vocabulary (set of significant words)

    wdi, k – the word in position k in document di

    Nti - the number of times that word wt occurs in document di

    Θ – the set of parameters of all mixture components, Θ = {α1, α2, …, αK, θ1, θ2, …, θK}. αj is the mixture weight (or mixture probability) of mixture component j; θj is the set of parameters of component j. K is the number of mixture components.

    Expectation-Maximization

  • 41

    Expectation step: compute class labels (probabilities) for every unlabeled document di:

    Pr(cj | di; Θ) = Pr(cj | Θ) Pr(di | cj; Θ) / Pr(di | Θ)
                   = Pr(cj | Θ) ∏_{k=1}^{|di|} Pr(w_{di,k} | cj; Θ)  /  Σ_{r=1}^{|C|} [ Pr(cr | Θ) ∏_{k=1}^{|di|} Pr(w_{di,k} | cr; Θ) ]

    Expectation step

  • 42

    Maximization step: re-compute the parameters:

    Pr(wt | cj; Θ) = ( λ + Σ_{i=1}^{|D|} Nti Pr(cj | di) )  /  ( λ|V| + Σ_{s=1}^{|V|} Σ_{i=1}^{|D|} Nsi Pr(cj | di) )

    Pr(cj | Θ) = ( Σ_{i=1}^{|D|} Pr(cj | di) ) / |D|

    (here λ is a smoothing constant, e.g. λ = 1 for Laplace smoothing)

    Maximization step
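    A compact sketch of this EM loop, assuming scikit-learn's MultinomialNB on sparse document-term count matrices (an implementation choice, not prescribed by the slides); the soft labels Pr(cj | di) from the E step are passed back to the M step through sample weights.

```python
# Semi-supervised EM with Naive Bayes (sketch). X_l, X_u: scipy sparse count matrices.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_l, y_l, X_u, n_iter=10, alpha=1.0):
    clf = MultinomialNB(alpha=alpha).fit(X_l, y_l)   # initial classifier: labeled docs only
    classes = clf.classes_
    for _ in range(n_iter):                          # fixed iteration count instead of a convergence test
        resp = clf.predict_proba(X_u)                # E step: Pr(cj | di) for unlabeled docs
        # M step: refit on all documents; each unlabeled doc appears once per class,
        # weighted by its class probability (soft assignment)
        X_all = vstack([X_l] + [X_u] * len(classes))
        y_all = np.concatenate([y_l] + [np.full(X_u.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_l.shape[0])] +
                               [resp[:, j] for j in range(len(classes))])
        clf = MultinomialNB(alpha=alpha).fit(X_all, y_all, sample_weight=w_all)
    return clf
```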

  • 43

    The EM algorithm works well if the two mixture model assumptions are true for a particular data set:

    o The data (or the text documents) are generated by a mixture model,

    o There is one-to-one correspondence between mixture components and document classes.

    In many real-life situations these two assumptions are not met.

    For example, the class Sports may contain documents about different sub-classes such as Football, Tennis, and Handball.

    Expectation-Maximization

  • 44

    What is partially supervised learning

    Learning from labeled and unlabeled data

    Learning with positive and unlabeled data

    Summary

    Road Map

  • 45

    Sometimes all labeled examples are only from the positive class. Examples (see [Liu 11]):

    Given a collection of papers on semi-supervised learning, find all semi-supervised learning papers in a proceedings volume or in another collection of documents.

    Given the browser bookmarks of a person, find other documents that may be interesting for that person.

    Given the list of customers of a direct marketing company, identify other persons (from a person database) that may also be interested in those products.

    Given the approved and good (as performance) applications from a credit company, identify other persons that may be interested in getting a credit.

    Positive and unlabeled data

  • 46

    Suppose we have a classification function f and an input vector X labeled with class Y, where Y ∈ {1, -1}. We rewrite the probability of error:

    Pr[f(X) ≠ Y] = Pr[f(X) = 1 and Y = -1] + Pr[f(X) = -1 and Y = 1]   (1)

    Because:

    Pr[f(X) = 1 and Y = -1] = Pr[f(X) = 1] – Pr[f(X) = 1 and Y = 1] = Pr[f(X) = 1] – (Pr[Y = 1] – Pr[f(X) = -1 and Y = 1]).

    Replacing in (1) we obtain:

    Pr[f(X) ≠ Y] = Pr[f(X) = 1] – Pr[Y = 1] + 2 Pr[f(X) = -1 | Y = 1] Pr[Y = 1]   (2)

    Pr[Y = 1] is constant.

    If Pr[f(X) = -1 | Y = 1] is small, minimizing the error is approximately the same as minimizing Pr[f(X) = 1].

    Theoretical foundation

  • 47

    If the sets of positive examples P and unlabeled examples U are large, holding Pr[f(X) = -1 | Y = 1] small while minimizing Pr[f(X) = 1] is approximately the same as:

    o minimizing PrU[f(X) = 1]

    o while holding PrP[f(X) = 1] ≥ r (where r is the recall: Pr[f(X) = 1 | Y = 1]), which is the same as PrP[f(X) = -1] ≤ 1 – r

    In other words:

    o the algorithm tries to minimize the number of unlabeled examples labeled as positive,

    o subject to the constraint that the fraction of errors on the positive examples is no more than 1 – r.

    Theoretical foundation

  • 48

    For implementing the theory above there is a 2-step strategy (presented in [Liu 11]):

    Step 1: Identify in the unlabeled examples a subset called “reliable negatives” (RN).

    These examples will be used as negative labeled examples in the next step.

    We start with only positive examples but must build a negative labeled set in order to use a supervised learning algorithm for building the model (classifier)

    Step 2: Build a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier.

    In this step we can use Expectation Maximization or SVM for example.

    2-step strategy

  • 49

    Building the reliable negative set is really the key in this case.

    There are several methods ([Zhang, Zuo 2009]):

    Spy technique

    1-DNF algorithm

    Naïve Bayes

    Rocchio

    Obtaining reliable negatives (RN)

  • 50

    In this technique, we first randomly select a set S of positive documents from P and put them in U.

    These examples are the spies.

    They behave identically to the unknown positive documents in P.

    Then, using the I-EM algorithm with P − S as positive and U ∪ S as negative, a classifier is obtained.

    It uses the probabilities assigned to the documents in S to decide a probability threshold th to identify possible negative documents in U:

    all documents with a probability lower than that of any spy will be assigned to RN.

    Spy technique

  • 51

    Spy algorithm

    1. RN = {};

    2. S = Sample(P, s%);

    3. Us = U ∪ S;

    4. Ps = P-S;

    5. Assign each document in Ps the class label 1;

    6. Assign each document in Us the class label -1;

    7. I-EM(Us, Ps); // This produces a Naïve Bayes classifier.

    8. Classify each document in Us using the NB classifier;

    9. Determine a probability threshold th using S;

    10. For each document d ∈ Us

    11. If its probability Pr(1|d) < th

    12. Then RN = RN ∪ {d};

    13. End If

    14.End For
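    A sketch of the spy technique in Python, assuming sparse count matrices and using a single Naïve Bayes fit in place of the I-EM step (a simplification); the function name and the spy ratio are assumptions.

```python
# Spy technique sketch. P, U: scipy sparse count matrices (one row per document).
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def spy_reliable_negatives(P, U, spy_ratio=0.15, seed=0):
    rng = np.random.default_rng(seed)
    n_p = P.shape[0]
    spy_idx = rng.choice(n_p, size=max(1, int(spy_ratio * n_p)), replace=False)
    rest_idx = np.setdiff1d(np.arange(n_p), spy_idx)

    Ps, S = P[rest_idx], P[spy_idx]            # Ps = P - S, S = the spies
    Us = vstack([U, S])                        # Us = U plus the spies, treated as negative
    X = vstack([Ps, Us])
    y = np.concatenate([np.ones(Ps.shape[0]), -np.ones(Us.shape[0])])

    clf = MultinomialNB().fit(X, y)            # stands in for I-EM here
    pos = list(clf.classes_).index(1)
    th = clf.predict_proba(S)[:, pos].min()    # threshold th: lowest probability among spies
    return clf.predict_proba(U)[:, pos] < th   # boolean mask of reliable negatives in U
```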

  • 52

    The algorithm builds a so-called positive feature set (PF) containing words that occur in the set of positive documents P more frequently than in the set of unlabeled documents U.

    Then using PF it tries to identify (for filtering out) possible positive documents from U.

    A document in U that does not have any positive feature in PF is regarded as a strong negative document.

    1-DNF algorithm

  • 53

    Algorithm

    1. PF = {}

    2. For i = 1 to n

    3. If (freq(wi, P) / |P| > freq(wi, U) / |U|)

    4. Then PF = PF ∪ {wi}

    5. End if

    6. End for

    7. RN = U;

    8. For each document d U

    9. If (∃ wi such that freq(wi, d) > 0 and wi ∈ PF)

    10. Then RN = RN - {d}

    11. End if

    12.End for
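    A direct transcription of these steps into Python, assuming each document is represented as a set of its words (the representation and function name are assumptions):

```python
# 1-DNF sketch: P and U are lists of documents, each document given as a set of words.
from collections import Counter

def one_dnf_reliable_negatives(P, U):
    df_p = Counter(w for doc in P for w in doc)    # document frequencies in P
    df_u = Counter(w for doc in U for w in doc)    # document frequencies in U
    vocab = set(df_p) | set(df_u)

    # positive features: words relatively more frequent in P than in U
    PF = {w for w in vocab if df_p[w] / len(P) > df_u[w] / len(U)}

    # reliable negatives: documents in U that contain no positive feature
    return [doc for doc in U if not (doc & PF)]
```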

  • 54

    In this case, a classifier is built considering all unlabeled examples as negative. Then the classifier is used to classify U, and the examples classified as negative form the reliable negative set.

    The algorithm is:

    1. Assign label 1 to each document in P;

    2. Assign label –1 to each document in U;

    3. Build a NB classifier using P and U;

    4. Use the classifier to classify U. Those documents in U that are classified as negative form the reliable negative set RN.

    Naïve Bayes

  • 55

    The algorithm of building RN is the same as for Naïve Bayes with the difference that at step 3 a Rocchio classifier is built instead of a Naïve Bayes one.

    Rocchio builds a prototype vector for each class (a vector describing all documents in the class) and then uses cosine similarity to find the class of test examples: the class of the prototype most similar to the given example.

    Rocchio
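    A sketch of this Rocchio variant on dense TF-IDF (or count) vectors; plain normalized centroids are used as prototypes, which is a simplification of the full Rocchio formula (the original also subtracts a fraction of the other class's centroid):

```python
# Rocchio-based reliable negatives: one prototype per class (positive = P, "negative" = U),
# then each document in U is assigned to the most cosine-similar prototype.
import numpy as np

def rocchio_reliable_negatives(P, U):
    def normalize(M):
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        return M / np.maximum(norms, 1e-12)

    proto_pos = normalize(P).mean(axis=0)      # prototype of the positive class
    proto_neg = normalize(U).mean(axis=0)      # prototype built from all of U

    Un = normalize(U)
    sim_pos = Un @ proto_pos / max(np.linalg.norm(proto_pos), 1e-12)
    sim_neg = Un @ proto_neg / max(np.linalg.norm(proto_neg), 1e-12)
    return sim_neg > sim_pos                   # True -> document goes into RN
```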

  • 56

    This course presented:

    What is partially supervised learning, with illustrations of the impact of unlabeled data

    Learning from labeled and unlabeled data, where co-training, ASSEMBLE, re-weighting and EM were presented

    Learning with positive and unlabeled examples

    Next week: Information integration

    Summary

  • 57

    [Liu 11] Bing Liu, 2011. Web Data Mining, Exploring Hyperlinks, Contents, and Usage Data, Second Edition, Springer.

    [Chawla, Karakoulas 2005] Nitesh V. Chawla, Grigoris Karakoulas, Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains, Journal of Artificial Intelligence Research, volume 23, 2005, pages 331-366.

    [Nigam et al, 98] Kamal Nigam, Andrew McCallum, Sebastian Thrun, Tom Mitchell, Using EM to Classify Text from Labeled and Unlabeled Documents, Technical Report CMU-CS-98-120. Carnegie Mellon University. 1998

    [Blum, Mitchell, 98] Blum, A., Mitchell, T., Combining labeled and unlabeled data with co-training, Procs. of the Workshop on Computational Learning Theory, 1998.

    [Goldman, Zhou, 2000] Sally Goldman, Yan Zhou, Enhancing Supervised Learning with Unlabeled Data, Proceedings of the Seventeenth International Conference on Machine Learning (ICML), 2000, pages 327 – 334

    [Bennett et al, 2002] Bennett, K., Demiriz, A., Maclin, R., Exploiting unlabeled data in ensemble methods, Procs. of the Intl. Conf. on Knowledge Discovery and Data Mining (KDD), 2002, pages 289-296.

    [Crook, Banasik, 2002] Crook, J., Banasik, J., Sample selection bias in credit scoring models, Intl. Conf. on Credit Risk Modeling and Decisioning, 2002.

    [Zhang, Zuo 2009] Bangzuo Zhang, Wanli Zuo, Reliable Negative Extracting Based on kNN for Learning from Positive and Unlabeled Examples, Journal of Computers, vol. 4, no. 1, 2009.

    [Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org

    References