Effective Information Access Over Public Email Archives
Progress Report
William Lee, Hui Fang, Yifan Li
For CS598CXZ Spring 2005
Introduction and Motivation
Information within a newsgroup or a mailing list has largely been underutilized.
For now, access to that data is restricted to traditional search and browsing.
Mail traffic also grows rapidly. For example, the Tomcat (the Java-based web application engine) mailing list received more than 37,000 messages from March 2003 to March 2004. That’s around 101 messages per day!
Can we access this information more effectively?
Existing Technologies
  Search
  Browse
Project Goals
Thread Detection
  Detect topic shifts within a thread
  Challenge: We could not find such cases in our collection, so we will not explore this in our project. It is still an interesting research question in the email domain.
Clustering
  Group similar threads together
  Challenges:
    How to define the similarity function between two threads?
    How to evaluate the clustering results?
Summarizing
  Generate a summary for each cluster
  Challenges:
    How to identify the important parts of each cluster?
    How to evaluate the summarization results?
Interface to view the clustering results
The Corpus
Newsgroup archive for 3 Computer Science classes (CS473, CS475, and CS225) at UIUC for Fall 2004.
Each newsgroup contains messages for a complete semester for the given class.
Unlike previous newsgroup clustering tasks:
  We use a thread instead of an individual message as the unit.
  We cluster based on subtopics within a newsgroup.
Progress So Far
Implemented clustering by using the CEES (Conversation Extraction and Evaluation Service) architecture
  CEES provides an architecture to:
    Gather messages and construct thread trees
    Parse, index, search, and cluster threads
    Integrate with Lucene and Weka
    Cluster threads using different fields
Created the judgment files for evaluating the clustering results manually
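As an illustration of the "construct thread trees" step, here is a minimal Python sketch. It is not the CEES implementation: the message format (dicts with hypothetical `id` and `parent` keys, standing in for Message-ID and In-Reply-To style headers) and both function names are assumptions made for this example.

```python
from collections import defaultdict

def build_thread_trees(messages):
    """Group messages into reply trees.

    Each message is a dict with an 'id' and a 'parent' (None for a root).
    Both keys are hypothetical stand-ins for real mail headers.
    Returns (roots, children), where children maps an id to its replies.
    """
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("parent")
        if parent in known:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is missing from the archive:
            # treat the message as the start of a new thread.
            roots.append(m["id"])
    return roots, dict(children)

def thread_members(root, children):
    """Return all message ids in the thread rooted at `root`, depth-first."""
    stack, seen = [root], []
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(reversed(children.get(node, [])))
    return seen

# Toy archive (made-up data).
messages = [
    {"id": "a", "parent": None},
    {"id": "b", "parent": "a"},
    {"id": "c", "parent": "b"},
    {"id": "d", "parent": None},
    {"id": "e", "parent": "zzz"},  # parent not present in the archive
]
roots, children = build_thread_trees(messages)
thread_a = thread_members("a", children)
```

Each root then identifies one thread, which is the unit that gets indexed and clustered.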
Clustering
  Use an agglomerative clustering algorithm
  Similarity = dot product of Okapi-weighted vectors of corresponding fields
  Computes the similarity of:
    Contents
    Subject
    Contents without quote
    First message
    Rest of thread
    Rest of thread without quote
    Participants in a thread (email addresses in the “From:”)
    Linear regression using all the above features
    Logistic regression using all the above features
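The similarity and clustering steps can be sketched in Python as follows. This is a rough illustration for a single field, not the project's code. The slides do not give the exact Okapi formula, so the BM25-style weighting, the parameters `k1` and `b`, the non-negative IDF variant, and the choice of average-link merging are all assumptions.

```python
import math
from collections import Counter

def okapi_vectors(docs, k1=1.2, b=0.75):
    """Turn tokenized documents into sparse BM25-style weight vectors.

    The weighting formula and default parameters are assumed.
    """
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avg_len)
        vectors.append({
            t: (f * (k1 + 1) / (f + norm))
               * math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            for t, f in tf.items()
        })
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def agglomerative(vectors, num_clusters):
    """Average-link agglomerative clustering on dot-product similarity."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        total = sum(dot(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        _, x, y = max(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merged = clusters.pop(y)  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

# Toy "subject" fields for four threads (made-up data).
threads = [
    "heap sort running time proof".split(),
    "heap sort proof question".split(),
    "midterm exam room location".split(),
    "midterm exam location change".split(),
]
clusters = agglomerative(okapi_vectors(threads), 2)
```

Running the same procedure on a different field (for example, the first message only, or contents with quoted text removed) changes only the token lists passed to `okapi_vectors`.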
Cluster Quality Measures (He2002)
Overall Entropy = 0.5 * Cluster Entropy + 0.5 * Class Entropy
[Figure: example item groupings (12, 34, 5 and 123, 45) labeled Result and Actual, illustrating Cluster Entropy and Class Entropy]
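A minimal Python sketch of how such measures might be computed is given below. The slides cite He2002 but do not spell out the formulas, so the definitions here are assumptions: cluster entropy is taken as the size-weighted entropy of the true-class mix inside each cluster, and class entropy as the size-weighted entropy of how each true class is spread over clusters. The toy groupings are loosely modeled on the figure, and which grouping is the result and which is the truth is also assumed.

```python
import math
from collections import Counter

def _avg_entropy(groups, label_of):
    """Size-weighted mean entropy of the label distribution in each group."""
    total = sum(len(g) for g in groups)
    result = 0.0
    for g in groups:
        counts = Counter(label_of[item] for item in g)
        h = -sum((c / len(g)) * math.log2(c / len(g)) for c in counts.values())
        result += (len(g) / total) * h
    return result

def overall_entropy(clusters, classes, beta=0.5):
    """Return (cluster entropy, class entropy, weighted combination).

    `beta` is the weight on cluster entropy; lower values are better
    for all three numbers, and 0 means a perfect match.
    """
    cluster_of = {item: i for i, g in enumerate(clusters) for item in g}
    class_of = {item: i for i, g in enumerate(classes) for item in g}
    cluster_entropy = _avg_entropy(clusters, class_of)
    class_entropy = _avg_entropy(classes, cluster_of)
    combined = beta * cluster_entropy + (1 - beta) * class_entropy
    return cluster_entropy, class_entropy, combined

# Assumed reading of the figure: clustering result vs. actual classes.
result = [[1, 2], [3, 4], [5]]
actual = [[1, 2, 3], [4, 5]]
cluster_e, class_e, overall = overall_entropy(result, actual, beta=0.5)
```

Under these definitions, splitting everything into singleton clusters drives cluster entropy to zero while class entropy grows, and putting everything in one cluster does the opposite, which is why the two are combined.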
Clustering Performance
[Charts: Cluster Entropy, Class Entropy]
Clustering Performance (2)
Overall Entropy = 0.53 * Cluster Entropy + 0.47 * Class Entropy
Remaining Work
Clustering
  Find a more reasonable cluster quality measure
  Study why the learned similarity function sometimes performs worse than the baseline
  Find a better way to learn the similarity function
Summarization
  Divide it into two subtasks:
    Summarization of announcement-driven discussions
    Summarization of question-driven discussions
Evaluation
  Create judgment files
  Evaluation measures