Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof....

Iterative similarity based adaptation technique for Cross Domain text

classification

Under: Prof. Amitabha MukherjeeBy: Narendra RoyRoll no: 11451Group: 6

Published by: Himanshu Bhatt,Deepali Semwal

Shourya Roy

Introduction

• Supervised machine learning classifications assume both training and test data are sampled from same domain or distribution (iids).

• Performance degrades for test data from different domain.

• Problems of difference in features across domains, and absence of labelled data in test domain

• So an Iterative similarity based adaptation algorithm proposed in the paper to address these issues.

• A classifier learnt on one domain with sufficient labelled training data is applied to a different test domain with no labelled data.

• Iterative algorithm starts with a shared feature representation of source and target domains.

• To adapt, it Iteratively learns domain specific features from the unlabelled target domain data, similarity between two domains is incorporated in similarity aware manner.

Different stages of proposed algorithm

Philosophy and Features of the algorithm

• Gradual transfer of knowledge from source to target domain while considering similarity between two domains.

• Ensemble of two classifier used.• Transfer occurs within the ensemble where a

classifier learned on shared representation transforms unlabeled test data into pseudo labeled data to learn domain specific classifier.

Salient Features:

1. Common Feature Space Representation• Want to find a good feature representation which

minimizes the divergence between the source and target domains and classification error.

• Structured Correspondence learning feature representation transfer approach used which derives a transformation matrix Q that gives a shared representation between the source and target domains.

• This approach(SCL) aims to learn the co-occurrence between features expressing similar meaning in different domains

Common Feature Space Representation Contd.

● Principal Predictor space is created using some top K eigenvectors of the matrix. Features are then projected form different domains onto this predictor space for shared feature space representation.

● Algorithm generalizes to different shared representations.

2. Iteratively building Target domain labelled data• Hypothesis: Certain target domain instances are more

similar to source domain instances than the rest.• Target: To create (Pseudo)labelled data in target

domains.• Idea: A classifier trained on suitably chosen source

domain instances will be able to categorize similar target domain instances confidently. Hence we get labelled data in target domain from there.

• Only a few can be confidently labelled, so the process is iterated and in next iteration, ensemble output is considered.

• Again this adds to pseudo labelled data and process continues till all instances are exhausted or ends on some criteria.

3. Domain similarity based aggregation

• Dissimilarity hinders the better shared space representation and performance of classification.

• If not similar enough then in this “transfer learning technique” the knowledge passed may result in “Negative learning”.

• So we measure Domain similarity using “Cosine similarity measure”

Notations

Algorithm

Algorithm continues

Algorithm continues.

Algorithm in a nutshell

• Confidence of Prediction: αi for the ith instance is measured as the distance from the decision boundary(Hsu et. al. , 2003) as follows:

• Here, R is the unnormalized output from the SVM classifier, V is the weight vector for the support vectors and |V| = VTV.

• Each iteration shifts the weights in ensemble from the classifier learnt on shared representation to target domain classifier.

Results and Datasets

• Efficacy of proposed algorithm was evaluated on different datasets for cross domain text classification (Blitz et. al, 2007)

• The performance of the experiment evaluated on two-class classification task and reported in classification accuracy terms.

Datasets

• Amazon Review data taken with four different domains, Books, DVDs, Kitchen appliances and Electronics. Each domain comprised of 1000 positives and 1000 negative reviews and in all experiments 1600 labelled and 1600 unlabelled reviews are taken in source and target domains, and performance is reported on the non overlapping 400 reviews from target domain.

• 20 Newsgroups dataset which was a collection of approximately 20,000 documents evenly partitioned across 20 newsgroups is used.

Datasets contd.

• Third dataset was a real world dataset comprising of tweets about products and services. Coll1 about gaming, Coll2 about Microsoft products and Coll3 about mobile support. Each collection has 218 positive and negative tweets and these tweets are collected based on user-defined keywords captured in a listening engine which then crawled the social media and fetched comments matching the keywords.

Results and Analysis

• After the datasets were preprocessed, the experiments were conducted with the SVM and radial basis function kernel as the constituent classifiers of the ensemble classifier.

• Performance mainly affected due to dissimilarity of domains resulting in negative transfer and feature divergence.

• In the experiments, the maximum number of iterations were set to 30. Target specific classifier were found to have more weight after the iterations.

• On an average the weights converged to ws = 0.22 and Wt=0.78 at the end.

Amazon Dataset results and Analysis:

• The above table shows comparison of performance of individual classifiers and ensemble for training on books domain and test across different domains. Cs and Ct

performed on different domains before iterating learning process.

• The ensemble has better accuracy than individual classifiers.

• This further validated that target specific features are more discriminative than the shared features in classifying target domain instances.

Effects of different components of algorithm on Amazon review dataset studied:• Effect of learning target specific features:

Results from the algorithm show that iteratively kerning target specific feature representation( slow opposed to one shot transfer) yields better performance across different domain classification.

Effects of different components of algorithm on Amazon review dataset studied:

• Besides looking at the learning of target specific features, Effect of similarity on performance, Effect of varying threshold and Effects of using different shared representation was studied and the Proposed algorithm was found to perform better.

Results on 20 Newsgroup dataset

• For classification, they divided the data into six different datasets and the top two categories in each was picked as the two classes. The data was further segregated based on sub-categories are combined, where each sub-category was considered as a different domain.

• 4/5th of source and target data was used for shared representation and results were reported on 1/5th of the test data.

• Since different domain are crafted out from the subcategories of the same dataset, domains were exceedingly similar and therefore, the baseline accuracy was relatively better than that on other two datasets

Results on 20 Newsgroup dataset

• The proposed algorithm still yielded an improvement of at least 10.8% over baseline accuracy.

• The proposed algorithm also out performed the other adaptation approaches.

Results on Real World Data

• The proposed algorithm iteratively learned discriminitive target specific features from the data and translated it to an improvement of atleast 6.4% and 3.5% over baseline and the SCL respectively as per their experiment conducted.

Conclusion…!

Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof....

Documents

Transcript of Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof....