Jiit 2013 14 project presentation aniket mishra

Page 1: Jiit 2013 14 project presentation aniket mishra

Major Project

Event Based News Clustering

Submitted By: Aniket Mishra

Page 2: Jiit 2013 14 project presentation aniket mishra

Problem Statement:

• To implement a clustering system that groups news items related to the same event into one cluster, so that one can see what is happening in each event. In short, the task is to implement an event-based news clustering system using a clustering algorithm.

Page 3: Jiit 2013 14 project presentation aniket mishra

Implementation Steps Followed:

• Crawled election campaign data using the Bing API in different time periods.
• Used the sub-categories AAP, BJP, and Congress.
• Applied k-means first, with 10 clusters.
• Then applied modified k-means to the data to improve its efficiency.
• Applied the algorithm using TF-IDF, centroid calculation, and cosine similarity.
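The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the project's actual code: the function names (`tfidf_vectors`, `cosine`, `centroid`, `kmeans`), the whitespace tokenisation, the smoothed IDF formula, and the optional `init` parameter are all assumptions made for the example.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed tokenisation
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; the slides do not give the formula).
        vec = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors that are already unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def centroid(vectors):
    """Sum of the member vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, k, init=None, iterations=20, seed=0):
    """Plain k-means with cosine similarity; returns one cluster label per vector."""
    if init is None:
        init = random.Random(seed).sample(range(len(vectors)), k)
    centroids = [vectors[i] for i in init]
    labels = None
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = centroid(members)
    return labels
```

Because every vector is scaled to unit length, the cosine similarity reduces to a sparse dot product, which keeps the assignment step cheap.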

Page 4: Jiit 2013 14 project presentation aniket mishra

Table 1 shows the results obtained by our system for the k-means and modified k-means algorithms.

Algorithm         RSS    Purity (%)  Rand Index
K-means           73.52  65.9        0.66
Modified K-means  73.70  71.5        0.649

Table 1: Comparison of clustering results
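For reference, purity and the Rand index can be computed from the cluster assignments and a set of reference (gold) class labels. The sketch below uses the standard definitions; the function names and the small labelled example are hypothetical and are not the project's data.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, g in zip(clusters, classes):
        by_cluster.setdefault(c, Counter())[g] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(clusters)

def rand_index(clusters, classes):
    """Fraction of document pairs on which clustering and gold labels agree."""
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum((clusters[i] == clusters[j]) == (classes[i] == classes[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical example: six documents, two clusters, three gold classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["AAP", "AAP", "BJP", "BJP", "BJP", "Congress"]
print(purity(clusters, classes))      # 4 of 6 documents match their cluster's majority class
print(rand_index(clusters, classes))  # 9 of 15 pairs agree
```

Purity rewards clusters dominated by one class, while the Rand index also penalises splitting one class across several clusters, so the two measures need not move in the same direction.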

Page 5: Jiit 2013 14 project presentation aniket mishra

When calculating the purity and Rand index of k-means and modified k-means, we found that modified k-means gives better results, and in particular better purity, when the clustering is repeated 10 times and the initial k points are taken from each of the k different clusters rather than from a random restart.
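One possible reading of this seeding strategy is sketched below: after a run, one document is picked from each of the k clusters and used as a starting point for the next run. The function name `seeds_from_clusters`, the random choice within each cluster, and the fallback for empty clusters are assumptions; the slides do not specify these details.

```python
import random

def seeds_from_clusters(labels, k, rng):
    """Pick one document index from each of the k clusters of a previous run."""
    members = {c: [] for c in range(k)}
    for idx, c in enumerate(labels):
        members[c].append(idx)
    # One seed per non-empty cluster, in cluster order.
    seeds = [rng.choice(members[c]) for c in range(k) if members[c]]
    # Assumed fallback: fill up with unused documents if a cluster was empty.
    unused = [i for i in range(len(labels)) if i not in seeds]
    while len(seeds) < k and unused:
        seeds.append(unused.pop(rng.randrange(len(unused))))
    return seeds

# Hypothetical usage: labels from a previous run with k = 3.
previous_labels = [0, 0, 1, 1, 2]
seeds = seeds_from_clusters(previous_labels, 3, random.Random(1))
print(seeds)  # one index per cluster; use these as the next run's initial points
```

Repeating this 10 times, as described above, would mean feeding the seeds of each run into the next one instead of drawing fresh random starting points.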

Page 6: Jiit 2013 14 project presentation aniket mishra

Results Demonstration

These are the results in cluster 9. The items grouped together are related news: all 4 news items are about Rahul Gandhi. I took the news from 29-05-14; these items were originally scattered, and k-means clustering grouped them together to give these results.

Page 7: Jiit 2013 14 project presentation aniket mishra

In this second example, the news is mostly related to the Punjab unit of Congress, which suggests that the news items were clustered correctly. We can also see that 2 news items are unrelated, so the cluster is not 100% pure.

Page 8: Jiit 2013 14 project presentation aniket mishra

Conclusion

• In this project I have designed and evaluated a clustering system. The system crawls incoming news reports from the Bing API and clusters them according to the event they describe. Clustering is performed by representing incoming news reports as a bag of words with TF-IDF weighting and using a variation of the k-means algorithm that works in a single pass without cluster re-organization. The number of clusters to produce is fixed at 29 for every query, and new events are detected automatically. The clustering process takes 1-2 minutes to fetch news from the website.

• The evaluation results show that our system is very effective when clustering documents into highly specific clusters but performs rather poorly when clustering documents into more general categories, and that it performs better with modified k-means.
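A single-pass variant of the kind described in the conclusion might look like the following sketch. It is an illustration under stated assumptions, not the project's implementation: the similarity `threshold` used to detect a new event is hypothetical, raw term counts stand in for TF-IDF weights to keep the example short, and `max_clusters` mirrors the fixed limit of 29 clusters.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold=0.3, max_clusters=29):
    """Assign each incoming document once; earlier assignments are never revised."""
    centroids = []  # running term-count sums, one Counter per cluster
    labels = []
    for doc in docs:
        vec = Counter(doc.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        is_new_event = best is None or sims[best] < threshold
        if best is None or (is_new_event and len(centroids) < max_clusters):
            # No cluster is similar enough: start a new event cluster.
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
        else:
            # Join the closest cluster and fold the document into its centroid.
            centroids[best].update(vec)
            labels.append(best)
    return labels

# Hypothetical headlines, for illustration only.
docs = ["rahul gandhi rally amethi", "rahul gandhi speech amethi",
        "punjab congress unit meeting", "punjab congress unit resigns"]
print(single_pass(docs))  # [0, 0, 1, 1]
```

Since each document is compared only with the current centroids, the cost grows with the number of clusters rather than with the number of documents already seen, at the price of never correcting an early wrong assignment.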

Page 9: Jiit 2013 14 project presentation aniket mishra

Future Work:

• In my opinion, our clustering can be applied in domains other than online news. For example, it could be applied to social media feeds to produce clusters according to the item being discussed by different people. In the future, a user interface can be created to make the system easier to use, and its scalability can also be improved.

Page 10: Jiit 2013 14 project presentation aniket mishra

Thank you!