Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof....

28
Iterative similarity based adaptation technique for Cross Domain text classification Under : Prof. Amitabha Mukherjee By : Narendra Roy Roll no : 11451 Group : 6 Published by : Himanshu Bhatt, Deepali Semwal Shourya Roy

Transcript of Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof....

Page 1: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Iterative similarity based adaptation technique for Cross Domain text

classification

Under: Prof. Amitabha MukherjeeBy: Narendra RoyRoll no: 11451Group: 6

Published by: Himanshu Bhatt,Deepali Semwal

Shourya Roy

Page 2: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Introduction

• Supervised machine learning classifications assume both training and test data are sampled from same domain or distribution (iids).

• Performance degrades for test data from different domain.

• Problems of difference in features across domains, and absence of labelled data in test domain

• So an Iterative similarity based adaptation algorithm proposed in the paper to address these issues.

• A classifier learnt on one domain with sufficient labelled training data is applied to a different test domain with no labelled data.

Page 3: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

• Iterative algorithm starts with a shared feature representation of source and target domains.

• To adapt, it Iteratively learns domain specific features from the unlabelled target domain data, similarity between two domains is incorporated in similarity aware manner.

Page 4: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Different stages of proposed algorithm

Page 5: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Philosophy and Features of the algorithm

• Gradual transfer of knowledge from source to target domain while considering similarity between two domains.

• Ensemble of two classifier used.• Transfer occurs within the ensemble where a

classifier learned on shared representation transforms unlabeled test data into pseudo labeled data to learn domain specific classifier.

Page 6: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Salient Features:

1. Common Feature Space Representation• Want to find a good feature representation which

minimizes the divergence between the source and target domains and classification error.

• Structured Correspondence learning feature representation transfer approach used which derives a transformation matrix Q that gives a shared representation between the source and target domains.

• This approach(SCL) aims to learn the co-occurrence between features expressing similar meaning in different domains

Page 7: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Common Feature Space Representation Contd.

● Principal Predictor space is created using some top K eigenvectors of the matrix. Features are then projected form different domains onto this predictor space for shared feature space representation.

● Algorithm generalizes to different shared representations.

Page 8: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

2. Iteratively building Target domain labelled data• Hypothesis: Certain target domain instances are more

similar to source domain instances than the rest.• Target: To create (Pseudo)labelled data in target

domains.• Idea: A classifier trained on suitably chosen source

domain instances will be able to categorize similar target domain instances confidently. Hence we get labelled data in target domain from there.

• Only a few can be confidently labelled, so the process is iterated and in next iteration, ensemble output is considered.

• Again this adds to pseudo labelled data and process continues till all instances are exhausted or ends on some criteria.

Page 9: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

3. Domain similarity based aggregation

• Dissimilarity hinders the better shared space representation and performance of classification.

• If not similar enough then in this “transfer learning technique” the knowledge passed may result in “Negative learning”.

• So we measure Domain similarity using “Cosine similarity measure”

Page 10: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Notations

Page 11: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Algorithm

Page 12: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Algorithm continues

Page 13: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Algorithm continues.

Page 14: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Algorithm in a nutshell

Page 15: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

• Confidence of Prediction: αi for the ith instance is measured as the distance from the decision boundary(Hsu et. al. , 2003) as follows:

• Here, R is the unnormalized output from the SVM classifier, V is the weight vector for the support vectors and |V| = VTV.

• Each iteration shifts the weights in ensemble from the classifier learnt on shared representation to target domain classifier.

Page 16: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:
Page 17: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Results and Datasets

• Efficacy of proposed algorithm was evaluated on different datasets for cross domain text classification (Blitz et. al, 2007)

• The performance of the experiment evaluated on two-class classification task and reported in classification accuracy terms.

Page 18: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Datasets

• Amazon Review data taken with four different domains, Books, DVDs, Kitchen appliances and Electronics. Each domain comprised of 1000 positives and 1000 negative reviews and in all experiments 1600 labelled and 1600 unlabelled reviews are taken in source and target domains, and performance is reported on the non overlapping 400 reviews from target domain.

• 20 Newsgroups dataset which was a collection of approximately 20,000 documents evenly partitioned across 20 newsgroups is used.

Page 19: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Datasets contd.

• Third dataset was a real world dataset comprising of tweets about products and services. Coll1 about gaming, Coll2 about Microsoft products and Coll3 about mobile support. Each collection has 218 positive and negative tweets and these tweets are collected based on user-defined keywords captured in a listening engine which then crawled the social media and fetched comments matching the keywords.

Page 20: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Results and Analysis

• After the datasets were preprocessed, the experiments were conducted with the SVM and radial basis function kernel as the constituent classifiers of the ensemble classifier.

• Performance mainly affected due to dissimilarity of domains resulting in negative transfer and feature divergence.

• In the experiments, the maximum number of iterations were set to 30. Target specific classifier were found to have more weight after the iterations.

• On an average the weights converged to ws = 0.22 and Wt=0.78 at the end.

Page 21: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Amazon Dataset results and Analysis:

• The above table shows comparison of performance of individual classifiers and ensemble for training on books domain and test across different domains. Cs and Ct

performed on different domains before iterating learning process.

• The ensemble has better accuracy than individual classifiers.

• This further validated that target specific features are more discriminative than the shared features in classifying target domain instances.

Page 22: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Effects of different components of algorithm on Amazon review dataset studied:• Effect of learning target specific features:

Results from the algorithm show that iteratively kerning target specific feature representation( slow opposed to one shot transfer) yields better performance across different domain classification.

Page 23: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Effects of different components of algorithm on Amazon review dataset studied:

• Besides looking at the learning of target specific features, Effect of similarity on performance, Effect of varying threshold and Effects of using different shared representation was studied and the Proposed algorithm was found to perform better.

Page 24: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Results on 20 Newsgroup dataset

• For classification, they divided the data into six different datasets and the top two categories in each was picked as the two classes. The data was further segregated based on sub-categories are combined, where each sub-category was considered as a different domain.

• 4/5th of source and target data was used for shared representation and results were reported on 1/5th of the test data.

• Since different domain are crafted out from the subcategories of the same dataset, domains were exceedingly similar and therefore, the baseline accuracy was relatively better than that on other two datasets

Page 25: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Results on 20 Newsgroup dataset

• The proposed algorithm still yielded an improvement of at least 10.8% over baseline accuracy.

• The proposed algorithm also out performed the other adaptation approaches.

Page 26: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Results on Real World Data

• The proposed algorithm iteratively learned discriminitive target specific features from the data and translated it to an improvement of atleast 6.4% and 3.5% over baseline and the SCL respectively as per their experiment conducted.

Page 27: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group:

Conclusion…!

Page 28: Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: 11451 Group: