JIIT 2013-14 Project Presentation: Aniket Mishra

Post on 12-Apr-2017

Major Project

Event Based News Clustering

Submitted By: Aniket Mishra

Problem Statement:

• To implement an event-based news clustering system that uses a clustering algorithm to group related news items into one cluster, so that one can see what is happening from one event to the next.

Implementation Steps Followed:

• Crawled election campaign data using the Bing API over different time periods.
• Used the subcategories AAP, BJP, and Congress.
• Applied k-means first, with 10 clusters.
• Then applied modified k-means to the data to improve its efficiency.
• Applied the algorithms using TF-IDF, centroid calculation, and cosine similarity.
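The steps above (TF-IDF weighting, centroid calculation, cosine similarity, k-means) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the particular TF-IDF variant (term frequency divided by document length, times log(N/df)), the explicit seed indices, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def kmeans(vectors, seeds, iterations=10):
    """k-means with cosine similarity; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Update step: recompute each centroid from its current members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels

# Toy data (hypothetical headlines, already tokenised).
docs = [
    ["aap", "kejriwal", "rally", "delhi"],
    ["aap", "kejriwal", "delhi", "campaign"],
    ["bjp", "modi", "rally", "gujarat"],
    ["bjp", "modi", "gujarat", "campaign"],
]
labels = kmeans(tfidf_vectors(docs), seeds=[0, 2])
```

In this sketch the seeds are passed in explicitly so the run is repeatable; a real run would typically pick them at random or by some seeding strategy.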

                   RSS     Purity   Rand Index

K-means            73.52   65.9     0.66

Modified K-means   73.70   71.5     0.649

Table 1 shows the results obtained by our system for the k-means and modified k-means algorithms.

Table 1: Comparison of clustering results

When calculating purity and Rand index for k-means and modified k-means, we found that repeating the clustering 10 times and taking the initial k points from each of the k different clusters, rather than using random restarts, gives modified k-means better results and higher purity.
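For reference, purity and Rand index can be computed from cluster assignments and gold labels along these lines. This is a sketch using the standard textbook definitions; the function names and the toy labels are assumptions, and the project may have computed or scaled the measures differently (for instance, reporting purity as a percentage).

```python
from collections import Counter
from itertools import combinations

def purity(labels, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for lab, cls in zip(labels, truth):
        clusters.setdefault(lab, Counter())[cls] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels)

def rand_index(labels, truth):
    """Fraction of document pairs on which the clustering and the gold labels agree.

    A pair agrees when it is either in the same cluster and the same class,
    or in different clusters and different classes.
    """
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (truth[i] == truth[j])
                for i, j in pairs)
    return agree / len(pairs)

# Toy example: six documents, two clusters, two gold classes (hypothetical).
labels = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
print(purity(labels, truth), rand_index(labels, truth))
```

Both functions return a value between 0 and 1, with 1 meaning the clustering matches the gold labels exactly.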

Results Demonstration

These are the results in cluster 9, grouped together as related news: all 4 news items relate to Rahul Gandhi. I took the news from 29-05-14; these items were originally scattered, and k-means clustering grouped them together.

In this second example, the news is mostly related to the Punjab unit of the Congress, which suggests that the news I took was clustered correctly. However, 2 of the news items are unrelated, so the cluster is not 100% pure.

Conclusion

• In this project I designed and evaluated a clustering system. The system crawls incoming news reports from the Bing API and clusters them according to the event they describe. Clustering is performed by representing incoming news reports as a bag of words with TF-IDF weighting and using a variation of the k-means algorithm that works in a single pass without cluster re-organization. The number of clusters to produce is fixed at 29 for every query, and new events are detected automatically. The clustering process takes 1-2 minutes to fetch news from the website.

• The evaluation results show that our system is very effective when clustering documents into highly specific clusters but performs rather poorly when clustering documents into more general categories, and that it performs better with modified k-means.
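The single-pass variant mentioned in the conclusion could look something like the sketch below. The slides do not say how a new event is detected, so the similarity threshold, its value of 0.3, the function names, and the toy vectors are all assumptions for illustration; only the cap of 29 clusters comes from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass_cluster(vectors, threshold=0.3, max_clusters=29):
    """Assign each vector once, in arrival order, never revisiting earlier choices.

    A document joins its most similar cluster if the similarity reaches
    `threshold` (an assumed parameter); otherwise it starts a new cluster,
    i.e. a new event, unless `max_clusters` clusters already exist.
    """
    sums = []    # running sum of member vectors per cluster
    labels = []
    for v in vectors:
        best, best_sim = None, -1.0
        for k, s in enumerate(sums):
            # The sum points in the same direction as the centroid,
            # so the cosine similarity is identical.
            sim = cosine(v, s)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is not None and (best_sim >= threshold or len(sums) >= max_clusters):
            target = best
        else:
            sums.append({})
            target = len(sums) - 1
        for t, w in v.items():
            sums[target][t] = sums[target].get(t, 0.0) + w
        labels.append(target)
    return labels

# Toy weighted vectors (hypothetical terms and weights).
stream = [
    {"rahul": 1.0, "gandhi": 1.0},
    {"rahul": 1.0, "gandhi": 1.0, "rally": 1.0},
    {"punjab": 1.0, "congress": 1.0},
    {"punjab": 1.0, "unit": 1.0},
]
labels = single_pass_cluster(stream)
```

Because earlier assignments are never revised, the outcome depends on the order in which reports arrive, which is the usual trade-off of a single-pass scheme against iterative k-means.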

Future Work:

• In my opinion, our clustering can be applied in domains other than online news. For example, it could be applied to social media feeds to produce clusters according to the item being discussed by different people. In future, a user interface could be created to make the project easier to use, and its scalability could also be improved.

Thank you!