Classification Of Web Documents


Page 1: Classification Of Web Documents

Classification (Chapter 5)

Data Mining the Web

Hussain Ahmad, M.S. (Semantic Web), University of Peshawar, Pakistan

Page 2: Classification Of Web Documents

..Classification..
• In clustering we use the document class labels for evaluation purposes.
• In classification they are an essential part of the input to the learning system.
• The objective of the system is to create a mapping (also called a model or hypothesis) between a set of documents and a set of class labels.
• This mapping is then used to determine automatically the class of new (unlabeled) documents.

Page 3: Classification Of Web Documents

..Classification..

• This mapping process is called classification.
• The general framework for classification includes the model creation phase along with several other steps.
• This general framework is usually called supervised learning (also, learning from examples, or concept learning) and includes the following steps:

Page 4: Classification Of Web Documents

..Classification..
• Step 1: Data collection and preprocessing.
– Documents are collected, cleaned, and properly organized; the terms (features) are identified, and a vector space representation is created.
– Documents are organized in classes (categories), based on their topic, user preference, or any other criterion.
– The data are divided into two subsets:
• Training set: this part of the data will be used to create the model.
• Test set: this part of the data is used for testing the model.

Page 5: Classification Of Web Documents

..Classification..
• Step 2: Building the model.
– This is the actual learning (also called training) step, which includes the use of the learning algorithm.
– It is usually an iterative and interactive process that may include other steps and may be repeated several times so that the best model is created:
• Feature selection
• Applying the learning algorithm
• Validating the model (using the validation subset to tune some parameters of the learning algorithm)

Page 6: Classification Of Web Documents

..Classification..

• Step 3: Testing and evaluating the model.
– At this step the model is applied to the documents from the test set, and their actual class labels are compared to the labels predicted by the model.
• Step 4: Using the model to classify new documents (with unknown class labels).
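Steps 1 and 3 of this framework can be sketched in a few lines of Python; the function and variable names below are illustrative, not from the chapter:

```python
import random

def train_test_split(docs, labels, test_fraction=0.3, seed=0):
    # Step 1: divide the labeled data into a training set and a test set.
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([docs[i] for i in train], [labels[i] for i in train],
            [docs[i] for i in test], [labels[i] for i in test])

def accuracy(predicted, actual):
    # Step 3: proportion of test documents whose predicted label
    # matches the actual class label.
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Any of the learning algorithms discussed below can fill the model-building step between these two helpers.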

Page 7: Classification Of Web Documents

Web documents exhibit some specific properties
• Web documents exhibit some specific properties, which may require adjustments or the use of appropriate learning algorithms.
• Here are the basic ones:
– Text and web documents include thousands of words.
– The document features inherit some of the properties of the natural language text from which they are derived.
– Documents are of different sizes.

Page 8: Classification Of Web Documents

Nearest-Neighbor Algorithm

• The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for the purposes of classification.

• It predicts the class of a new document using the class label of the closest document from the training set.

• Because it uses just one instance from the training set, this basic version of the algorithm is called one-nearest neighbor (1-NN).

Page 9: Classification Of Web Documents

..NN Algorithm..

• The closeness is measured by minimal distance or maximal similarity.

• The most common approach is to use the TFIDF (term frequency–inverse document frequency) framework to represent both the test and training documents and to compute the cosine similarity between the document vectors.
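As a sketch, cosine similarity between two TFIDF vectors can be computed as follows; the vectors here are made-up values, not those of the chapter's table:

```python
import math

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

# Two hypothetical six-attribute TFIDF vectors
d1 = [0.0, 0.53, 0.0, 0.85, 0.0, 0.0]
d2 = [0.0, 0.71, 0.0, 0.71, 0.0, 0.0]
print(cosine_similarity(d1, d2))
```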

Page 10: Classification Of Web Documents

..NN Algorithm..

• Let us consider a department document collection, represented as TFIDF vectors with six attributes, along with the class labels for each document, as shown in the table.

• Assume that the class of the Theatre document is unknown.

• To determine the class of this document, we compute the cosine similarity between the Theatre vector and all other vectors.

Page 11: Classification Of Web Documents

[Slides 11 and 12 presented the document table and the cosine similarities between the Theatre vector and the training documents; they are not captured in this transcript.]
Page 13: Classification Of Web Documents

..NN Algorithm..

• The 1-NN approach simply picks the most similar document (Criminal Justice) and uses its label B to predict the class of Theatre.
• However, if we look at the nearest neighbor of Theatre (Criminal Justice), we see that it has only one nonzero attribute, which alone produces the prediction.
• This makes the algorithm extremely sensitive to noise and irrelevant attributes.

Page 14: Classification Of Web Documents

..NN Algorithm..

• Therefore, using 1-NN, two assumptions are made:
– There is no noise, and
– All attributes are equally important for the classification.
• k-NN is a generalization of 1-NN.
– The parameter k is selected to be a small odd number (usually 3 or 5).
– For example, 3-NN classifies Theatre as class B, because this is the majority label among the top three documents (B, A, B).
– 5-NN will predict class A, because the set of labels of the top five documents is {B, A, B, A, A}.
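A minimal sketch of the k-NN vote. The first three similarity values are the ones quoted in the text for Theatre's nearest neighbors (Criminal Justice, Anthropology, Communication); the remaining values are hypothetical fillers:

```python
from collections import Counter

def knn_predict(similarities, labels, k):
    # Rank training documents by similarity to the test document and
    # take the majority class label among the top k.
    ranked = sorted(zip(similarities, labels), reverse=True)
    top_k = [label for _, label in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Similarities of Theatre to the training documents; the first three
# are from the text, the last three are made up for illustration.
sims   = [0.967075, 0.695979, 0.605667, 0.52, 0.47, 0.15]
labels = ["B",      "A",      "B",      "A",  "A",  "B"]
print(knn_predict(sims, labels, 1))  # B (1-NN: Criminal Justice)
print(knn_predict(sims, labels, 3))  # B (majority of B, A, B)
print(knn_predict(sims, labels, 5))  # A (majority of B, A, B, A, A)
```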

Page 15: Classification Of Web Documents

..NN Algorithm..

• Distance-weighted k-NN:
– For example, the distance-weighted 3-NN with the simplest weighting scheme [sim(X, Y)] will predict class B for the Theatre document.

• Because the weight for label B (documents Criminal Justice and Communication) is

– B = 0.967075 + 0.605667 = 1.572742,

• while the weight for Anthropology is

– A = 0.695979,

• And thus B > A.
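The weighted vote above can be reproduced in a few lines, using the similarity values from the slide:

```python
def weighted_vote(neighbors):
    # Each neighbor votes for its class label with weight sim(X, Y);
    # the label with the largest total weight wins.
    totals = {}
    for sim, label in neighbors:
        totals[label] = totals.get(label, 0.0) + sim
    return max(totals, key=totals.get), totals

# Theatre's three nearest neighbors: Criminal Justice (B),
# Anthropology (A), Communication (B), with the slide's similarities.
top3 = [(0.967075, "B"), (0.695979, "A"), (0.605667, "B")]
label, totals = weighted_vote(top3)
# B wins: 0.967075 + 0.605667 = 1.572742 > 0.695979
print(label)  # B
```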

Page 16: Classification Of Web Documents

FEATURE SELECTION
• The objective of feature selection is to find a subset of attributes that best describes a set of documents with respect to the classification task, i.e., the attributes with which the learning algorithm achieves maximal accuracy.
• A simple solution is to try all subsets and pick the one that maximizes accuracy.
• This solution is impractical due to the huge number of subsets that have to be investigated (2^n for n attributes).
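The 2^n growth is easy to verify by enumeration; the attribute names below are hypothetical stand-ins for the table's six terms:

```python
from itertools import combinations

def all_subsets(attrs):
    # Enumerate every subset of the attribute set (including the
    # empty set and the full set): 2**n subsets for n attributes.
    for r in range(len(attrs) + 1):
        yield from combinations(attrs, r)

attrs = ["history", "science", "research", "offers", "students", "hall"]
print(sum(1 for _ in all_subsets(attrs)))  # 64 = 2**6
```

Already at a few dozen attributes the count exceeds what any search could enumerate, which is why practical feature selection relies on heuristics rather than exhaustive search.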

Page 17: Classification Of Web Documents

Naive Bayes Algorithm

• Bayesian classification approaches:
– One based on the Boolean document representation, and
– Another based on document representation by term counts.
• Consider the set of Boolean document vectors shown in the table.

Page 18: Classification Of Web Documents

[Slide 18 presented the table of Boolean document vectors; it is not captured in this transcript.]
Page 19: Classification Of Web Documents

..Naive Bayes Algorithm..
• Classify the Theatre document given the rest of the documents with known class labels.
• The Bayesian approach determines the class of document x as the one that maximizes the conditional probability P(C | x). According to Bayes' rule,

P(C | x) = P(x | C) P(C) / P(x)

• Naive Bayes assumes that the attribute values are independent given the class. Given that x is a vector of n attribute values [i.e., x = (x1, x2, . . . , xn)], this assumption leads to:

P(C | x) ∝ P(C) P(x1 | C) P(x2 | C) · · · P(xn | C)

Page 20: Classification Of Web Documents

..Naive Bayes Algorithm..

• Now, to find the class of the Theatre document, we compute the conditional probability of class A and of class B given that this document has already occurred. For class A we have

P(A | Theatre) ∝ P(A) P(x1 | A) P(x2 | A) · · · P(xn | A)

Page 21: Classification Of Web Documents

..Naive Bayes Algorithm..
• To calculate each of the probabilities above, we take the proportion of the corresponding attribute value in class A.
• For example, in the science column we have 0's in four documents out of the 11 from class A. Thus, P(science = 0 | A) = 4/11.
• For class B we obtain the probabilities in the same way.

Page 22: Classification Of Web Documents

..Naive Bayes Algorithm..
• The probabilities of classes A and B are estimated as the proportion of documents in each:
– P(A) = 11/19 = 0.578947 and
– P(B) = 8/19 = 0.421053
• Putting all this into the Bayes formula yields the larger value for class A.
• At this point we can make the decision that Theatre belongs to class A.
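A sketch of the Boolean naive Bayes computation on a tiny made-up dataset (not the chapter's 19-document table):

```python
def nb_score(x, docs, labels, cls):
    # Score proportional to P(cls | x): the class prior P(cls) times
    # P(x_i | cls) for each attribute, where P(x_i | cls) is the
    # proportion of class-cls documents whose i-th attribute equals x_i.
    in_class = [d for d, l in zip(docs, labels) if l == cls]
    score = len(in_class) / len(docs)  # prior P(cls)
    for i, value in enumerate(x):
        score *= sum(1 for d in in_class if d[i] == value) / len(in_class)
    return score

# Four hypothetical Boolean document vectors with class labels
docs = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0)]
labels = ["A", "A", "B", "B"]
x = (1, 0, 1)  # the document to classify
pred = max(["A", "B"], key=lambda c: nb_score(x, docs, labels, c))
print(pred)  # A
```

Note that P(x) is omitted: it is the same for both classes, so comparing the scores is enough to pick the maximizing class.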

Page 23: Classification Of Web Documents

..Naive Bayes Algorithm..

• Although the Boolean naive Bayes algorithm uses all training documents, it ignores the term counts.
• A Bayesian model based on term counts can also classify our test document.
– Assume that there are m terms t1, t2, . . . , tm
– and n documents d1, d2, . . . , dn from class C.
– Let us denote by nij the number of times that term ti occurs in document dj.

Page 24: Classification Of Web Documents

..Naive Bayes Algorithm..

• Denote the probability with which term ti occurs in the documents from class C as P(ti | C).
• This can be estimated as the number of times that ti occurs in all documents from class C divided by the total number of terms in the documents from class C.

Page 26: Classification Of Web Documents

..Naive Bayes Algorithm..

• First we calculate the probabilities P(ti | C).
• A term may not occur at all in the documents of some class, which makes its estimated probability zero. For example, this happens with the term history and class A; that is,

P(history | A) = 0

• Consequently, the documents that have a nonzero count for history will have zero probability in class A.
– That is, P(History | A) = 0, P(Music | A) = 0, and P(Philosophy | A) = 0.

Page 27: Classification Of Web Documents

..Naive Bayes Algorithm..
• A common approach to avoiding this problem is to use the Laplace estimator.
• The idea is to add 1 to the frequency count in the numerator and 2 (or the number of classes, if more than two) to the denominator.
• The Laplace estimator thus ensures that no term probability estimate is ever zero.

Page 28: Classification Of Web Documents

..Naive Bayes Algorithm..
• Now we compute the probabilities of each term given each class using the Laplace estimator. For example,
– P(history | A) = (0 + 1)/(57 + 2) = 0.017 and
– P(history | B) = (9 + 1)/(29 + 2) = 0.323.
• Plugging all these probabilities into the formula results in
– P(A | Theatre) ≈ 0.0000354208 and
– P(B | Theatre) ≈ 0.00000476511,
• so the term-count model also assigns Theatre to class A.
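The slide's two Laplace estimates can be reproduced directly:

```python
def laplace(count, total, num_classes=2):
    # Laplace estimator: add 1 to the numerator and the number of
    # classes (here 2) to the denominator, so no estimate is zero.
    return (count + 1) / (total + num_classes)

# history occurs 0 times among the 57 term occurrences of class A
# and 9 times among the 29 of class B (counts from the slide).
print(round(laplace(0, 57), 3))  # 0.017
print(round(laplace(9, 29), 3))  # 0.323
```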

Page 29: Classification Of Web Documents

Numerical Approaches

• In the TFIDF vector space framework, we use cosine similarity as a measure of document similarity.

• However, the same vector representation allows documents to be considered as points in a metric space.

• That is, given a set of points, the objective is to find a surface that divides the space into two parts, so that the points that fall in each part belong to a single class.

• Linear regression is the most popular approach based on this idea.

Page 30: Classification Of Web Documents

..Numerical Approaches..

• Linear regression is a standard technique for numerical prediction.

• It works naturally with numerical attributes (including the class).

• The class value C predicted is computed as a linear combination of the attribute values xi as follows:

C = w0 + w1 x1 + w2 x2 + · · · + wn xn

Page 31: Classification Of Web Documents

..Numerical Approaches..

• The objective is to find the coefficients wi given a number of training instances xi with their class values C.

• There are several approaches to the use of linear regression for classification (predicting class labels).

• One simple approach to binary classification is to substitute class labels with the values −1 and 1.

• The predicted class is determined by the sign of the linear combination.

• For example, consider our six-attribute document vectors (Table 5.1). Let us use −1 for class A and 1 for class B.

Page 32: Classification Of Web Documents

..Numerical Approaches..
• Then the task is to find seven coefficients w0, w1, . . . , w6 that satisfy a system of 19 linear equations.
• The result for the Theatre vector is positive, and thus the class predicted for Theatre is B, which also agrees with the prediction of 1-NN.
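A sketch of this approach on a small hypothetical system (not the chapter's 19 equations), solved by least squares since such systems are generally overdetermined:

```python
import numpy as np

# Hypothetical training data: each row starts with 1 (for the
# intercept w0) followed by two attribute values; class A is -1,
# class B is +1.
X = np.array([[1, 0.2, 0.9],
              [1, 0.1, 0.8],
              [1, 0.9, 0.1],
              [1, 0.8, 0.2]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)

# Least-squares solution of the overdetermined linear system X w = y
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict: the sign of the linear combination gives the class.
x_new = np.array([1, 0.85, 0.15])
print("B" if x_new @ w > 0 else "A")  # B
```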

Page 33: Classification Of Web Documents

RELATIONAL LEARNING

• All classification methods that we have discussed so far are based solely on the document content and more specifically on the bag-of-words model.

• Many additional document features are ignored,
– such as the internal HTML structure, language structure, and interdocument link structure.
– All of these may be a valuable source of information for the classification task.
– The basic problem with incorporating this information into the classification algorithm is the need for a uniform representation.

Page 34: Classification Of Web Documents

..RELATIONAL LEARNING..
• Relational learning extends the content-based approach to a relational representation.
• It allows various types of information to be represented in a uniform way and used for web document classification.
• In our domain we have documents d and terms t, connected by the basic relation contains.
• That is, if term t occurs in document d, the relation contains(d, t) is true.
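As a sketch, the contains relation can be modelled as a set of (document, term) pairs; the terms listed here are illustrative, not from the chapter's table:

```python
# The relation contains(d, t): true iff term t occurs in document d.
contains = {
    ("Theatre", "drama"),
    ("Theatre", "students"),
    ("Criminal Justice", "justice"),
    ("Criminal Justice", "students"),
}

def holds(d, t):
    # Check whether contains(d, t) is true
    return (d, t) in contains

print(holds("Theatre", "drama"))    # True
print(holds("Theatre", "justice"))  # False
```

Other information, such as links between documents, can be added as further relations over the same objects, which is what makes the representation uniform.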