Public Email Conversations
Architecture
Clustering Results Conversation Map Conclusion
CEES: Intelligent Access to Public Email Conversations
William Lee, Hui Fang, Yifan Li, ChengXiang Zhai
University of Illinois at Urbana-Champaign
Clustering and Similarity Function
• Information within newsgroups and mailing lists has largely been underutilized.
• For now, access to this data is restricted to traditional searching and browsing.
• Mail traffic also grows exponentially.
• Can we access this information more effectively?
• CEES = Conversation Extraction and Evaluation Service
• Provides a general framework for email-related research
• Integrates with popular open source projects such as Lucene, Hibernate, Tapestry, Weka, and more…
• Features object-to-relational mapping of mail metadata, mail threading, flexible indexing, conversation clustering, and a web-based GUI
• Goal: Find commonly discussed topics from a set of conversations (threads)
• Use agglomerative clustering with complete link
• Learn similarity functions from different “perspectives” of threads:
– authors, date, subject, contents, contents without quote, first message, reply, reply without quote
– Use linear and logistic regression to learn the combined similarity function
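The clustering step above can be sketched as follows. This is only an illustration under stated assumptions: the per-perspective scores in PAIR_FEATURES, the WEIGHTS and BIAS values, and all function names are hypothetical, and the weights are assumed to have already been learned (for example by logistic regression) rather than being fitted here.

```python
import math
from itertools import combinations


def combined_similarity(features, weights, bias):
    """Logistic combination of per-perspective similarity scores."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def complete_link_clusters(n_items, sim, threshold):
    """Agglomerative clustering with complete link.

    The similarity of two clusters is the minimum similarity over all
    cross-cluster pairs. Merging stops once no pair reaches the threshold.
    """
    clusters = [[i] for i in range(n_items)]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for a, b in combinations(range(len(clusters)), 2):
            link = min(sim(i, j) for i in clusters[a] for j in clusters[b])
            if link >= best_sim:
                best_pair, best_sim = (a, b), link
        if best_pair is None:
            break
        a, b = best_pair  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical (subject, contents, authors) similarities for four threads.
PAIR_FEATURES = {
    (0, 1): (0.9, 0.8, 0.5),
    (0, 2): (0.1, 0.2, 0.0),
    (0, 3): (0.2, 0.1, 0.0),
    (1, 2): (0.1, 0.1, 0.5),
    (1, 3): (0.0, 0.2, 0.0),
    (2, 3): (0.8, 0.9, 1.0),
}
WEIGHTS = (2.0, 3.0, 1.0)  # assumed to be learned elsewhere
BIAS = -3.0


def thread_similarity(i, j):
    key = (min(i, j), max(i, j))
    return combined_similarity(PAIR_FEATURES[key], WEIGHTS, BIAS)


clusters = complete_link_clusters(4, thread_similarity, threshold=0.5)
print(clusters)
```

Lowering the threshold merges more threads into fewer, looser clusters; raising it keeps only the most similar threads together.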
• Clustering
– Learning the similarity function can be effective in combining different “perspectives”
• Conversation Map
– Can give an overview of a conversation group
– Effective use of 2-D space
• Future Work
– Derive better algorithms to learn the similarity function
– Faster clustering algorithms that work on mining patterns in conversations
– Summarization of conversation clusters
• Use 3 Computer Science class newsgroups from Univ. of Illinois at Urbana-Champaign as the corpus
• Three different human taggers group messages into subtopics
• 3-way cross validation, using one group’s judgment file as the training set and testing on the other two
• Use class and cluster entropy as the comparison metrics
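The slides do not define the two entropy metrics, so the sketch below assumes a common formulation: cluster entropy measures how mixed the human-assigned classes are inside each cluster, and class entropy measures how widely each class is scattered across clusters, both weighted by group size. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def weighted_entropy(groups):
    """Size-weighted average entropy (in bits) of the labels in each group."""
    total = sum(len(labels) for labels in groups.values())
    score = 0.0
    for labels in groups.values():
        counts = Counter(labels)
        h = -sum((c / len(labels)) * math.log2(c / len(labels))
                 for c in counts.values())
        score += (len(labels) / total) * h
    return score


def cluster_entropy(classes, clusters):
    """Assumed definition: entropy of true classes within each cluster."""
    by_cluster = defaultdict(list)
    for cls, clu in zip(classes, clusters):
        by_cluster[clu].append(cls)
    return weighted_entropy(by_cluster)


def class_entropy(classes, clusters):
    """Assumed definition: entropy of cluster ids within each true class."""
    by_class = defaultdict(list)
    for cls, clu in zip(classes, clusters):
        by_class[cls].append(clu)
    return weighted_entropy(by_class)


# Toy example: four threads, two human-assigned subtopics, three clusters.
human_classes = ["hw", "hw", "exam", "exam"]
system_clusters = [0, 0, 1, 2]
print(cluster_entropy(human_classes, system_clusters))
print(class_entropy(human_classes, system_clusters))
```

Under these definitions, lower is better for both: pure clusters drive cluster entropy toward zero, while splitting one subtopic across several clusters raises class entropy.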
Existing technologies: Search, Browse
• Conversation visualization derived from Treemap
• Clusters sorted by two extra time dimensions: Intra-Cluster Time and Inter-Cluster Time
• Allows the user to adjust the similarity threshold to “zoom” in on the more similar threads
Autonomic Data Integration Systems
Autonomic Computing
– Shift workload from administrators & developers onto system
• Self-tuning, self-maintaining, self-recovering from failures, self-improving
– Examples:
• Self-tuning databases and query optimizers
• Self-profiling software for bug and bottleneck detection
• Self-recovering distributed systems
Data Integration Systems
– Many complex components:
• Global schema
• Sources, wrappers, and source schemas
• Semantic mappings between source and global schemas
– Currently built (semi-)manually in an error-prone and laborious process
– Extremely difficult to maintain over changing sources
Build Autonomic Data Integration Systems!
The AIDA Project
• Improving Automatic Methods
– Schema & ontology matching [SIGMOD-01, WWW-02, SIGMOD-04]
– Entity matching & integration [IJCAI-03, IEEE Intelligent-03]
– Global interface construction [SIGMOD-04]
• Reducing Costs of System Construction
– Mass collaboration to build systems [WebDB-03, IJCAI-03]
• Monitoring Data Sources and Maintaining DI Systems
– Recognition of changes in source data
– Detection and repair of failures in DI system components
• Fast system deployment
• Minimal human effort
• Automatic adjustment to changes
• Continuous improvement
The Focus of This Talk
Automatic Integration of DAta
• Shift workload from developers onto user population
• Build system accurately with low individual effort
The MOBS Approach
Mass COllaboration to Build Systems
Figure: MOBS combines automatic techniques, developers, and the user population.
Figure: Deep Web tasks: Source Discovery, Form Recognition, Attribute Matching, System Initialization, Query Translation. Example query-interface attribute labels: Title, Cost, Writer, Pub, Author, Price, Year, Authors.
MOBS Applied to the Deep Web
MOBS for Query Interface Matching
1. Decompose task into binary statements
2. Initialize small functioning system
3. Solicit and merge user answers to expand the system
Statements for Matching “Writer”
• “Writer = Author” ?
• “Writer = Title” ?
• “Writer = Price” ?
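As a small illustration of the decomposition shown above, the matching task for one source attribute can be expanded into one yes/no statement per candidate attribute. The function name and the candidate list are assumptions made for this sketch; only the three statements themselves come from the slide.

```python
def binary_statements(source_attribute, candidate_attributes):
    """Turn one matching task into independent yes/no statements."""
    return [f"{source_attribute} = {candidate}"
            for candidate in candidate_attributes]


# Candidate attributes taken from the example statements on the slide.
statements = binary_statements("Writer", ["Author", "Title", "Price"])
for statement in statements:
    print(f"“{statement}” ?")
```

Each statement can then be shown to a different user, so no single user has to solve the whole matching task.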
How to Solicit User Answers
Incentive Models
• Leverage a monopoly or better-service system
• Piggy-back on a helper application
• Deploy in a volunteer or community environment
Figure: example question shown to a user for a Barnes & Noble form (fields: Author, Title, Pub, Price): “Is this form a Book Sales source?” with answer choices YES and NO.
How to Merge User Answers
Bayesian Learning
• Use a dynamic Bayesian network as a generative feedback model
• Estimate user behavioral parameters from evaluation answers
• Converge statements from teaching answers
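The slides describe a dynamic Bayesian network; the sketch below is a much simpler stand-in and not the authors' model. It assumes that each user's reliability is estimated from evaluation answers (statements whose truth is already known), that users answer independently, and that teaching answers are combined with a naive Bayes log-odds vote. All names and numbers are hypothetical.

```python
import math


def estimate_reliability(evaluation_answers):
    """Laplace-smoothed fraction of evaluation answers that were correct.

    evaluation_answers: list of (given_answer, known_truth) booleans.
    Smoothing keeps the estimate strictly between 0 and 1.
    """
    correct = sum(1 for given, truth in evaluation_answers if given == truth)
    return (correct + 1) / (len(evaluation_answers) + 2)


def posterior_true(votes, prior=0.5):
    """P(statement is true | votes), assuming independent users.

    votes: list of (answer, reliability) pairs from teaching answers.
    """
    log_odds = math.log(prior / (1 - prior))
    for answer, reliability in votes:
        ratio = math.log(reliability / (1 - reliability))
        log_odds += ratio if answer else -ratio
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical evaluation histories for three users.
alice = estimate_reliability([(True, True)] * 8)                        # 0.9
bob = estimate_reliability([(True, True)] * 4 + [(True, False)] * 4)    # 0.5
carol = estimate_reliability([(True, True)] * 6 + [(False, True)] * 2)  # 0.7

# Teaching answers for the statement "Writer = Author".
belief = posterior_true([(True, alice), (False, bob), (True, carol)])
print(round(belief, 3))
```

A statement could be treated as converged once the posterior passes a chosen confidence level; a user estimated at 0.5 reliability, like the hypothetical bob here, contributes nothing to the vote.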
Form Recognition: 24 forms, 17 bookstore forms
Interface Matching: 17 interfaces, 155 attributes (average 9 attributes per interface)
Hub Discovery: 30 department sites, 30 hubs
Data Extraction: 26 homepages, 155 slots (average 6 slots per homepage)
Mini-CiteSeer: 17 homepages, 22 publication lists (average 1.3 publication lists per homepage)
Applicability of the MOBS Approach
We Have Applied MOBS in Various Settings…
– Scale: from a small community intranet to a highly trafficked website
– Users: from cooperative expert volunteers to unpredictable novice users
… and to Several DI Tasks
– Deep Web: Form Recognition, Interface Matching
– Surface Web: Hub Discovery, Data Extraction, Mini-CiteSeer
Simulation and Real-World Results
Figure: Interface Matching, precision and recall (0.0 to 1.0) for ML and for user populations P1 to P10.
Figure: Interface Matching, answers per user per source and answers per trusted user per source (0 to 10) for user populations P1 to P10.
Name | Helper Application | Duration & Status | Current Progress | Precision | Recall | Avg User Workload
Form Recognition | DB course website, 132 undergrad students | 5 days, Completed | Completed 24/24 interfaces, found 17 bookstore interfaces | 1.0 (0.7 ML) | 0.89 (0.89 ML) | 7.4 answers
Interface Matching | DB course website, 132 undergrad students | 7 days, Stopped | Completed 10/17 interfaces, matched 65 total attributes | 0.97 (0.63 ML) | 0.97 (0.63 ML) | 12.5 answers
Hub Discovery | IR course website, 28 undergrad students | 21 days, Stopped | Completed 15/30 sites, found 15 hubs | 0.87 (0.27 ML) | 0.87 (0.27 ML) | 16.1 answers
Mini-CiteSeer | Google search engine, 21 researchers, friends, family | 19 days, Completed | Completed 17/17 pages (94 lists), found 19 pubs | 1.0 | 0.86 | 8.7 answers
P1 – Uniform [0,1]
P2 – Uniform [0.3,0.7]
P3 – Uniform [0.5,0.9]
P4 – Bell [0,1]
P5 – Bell [0.3,0.7]
P6 – Bell [0.5,0.9]
P7 – Bimodal {0.2,0.8}
P8 – 90% Uniform [0,0.4], 10% {0.8}
P9 – 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]
P10 – 10% {0.3}, 90% Uniform [0.7,1]
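The populations above could be sampled in a simulation roughly as follows. Several points are assumptions and not stated on the slides: that each value is a per-user reliability, that “Bell” means a normal distribution centered on the interval and clipped to it, and that the bimodal population P7 splits evenly between its two values. The helper names are hypothetical.

```python
import random


def uniform(lo, hi):
    return lambda rng: rng.uniform(lo, hi)


def bell(lo, hi):
    # Assumption: normal centered on the interval, clipped to [lo, hi].
    mean, std = (lo + hi) / 2, (hi - lo) / 6
    return lambda rng: min(hi, max(lo, rng.gauss(mean, std)))


def point(value):
    return lambda rng: value


def mixture(*weighted_samplers):
    weights = [w for w, _ in weighted_samplers]
    samplers = [s for _, s in weighted_samplers]
    return lambda rng: rng.choices(samplers, weights=weights)[0](rng)


POPULATIONS = {
    "P1": uniform(0.0, 1.0),
    "P2": uniform(0.3, 0.7),
    "P3": uniform(0.5, 0.9),
    "P4": bell(0.0, 1.0),
    "P5": bell(0.3, 0.7),
    "P6": bell(0.5, 0.9),
    "P7": mixture((0.5, point(0.2)), (0.5, point(0.8))),  # even split assumed
    "P8": mixture((0.9, uniform(0.0, 0.4)), (0.1, point(0.8))),
    "P9": mixture((0.1, point(0.1)), (0.5, uniform(0.5, 0.7)),
                  (0.4, uniform(0.8, 1.0))),
    "P10": mixture((0.1, point(0.3)), (0.9, uniform(0.7, 1.0))),
}


def sample_population(name, size, seed=0):
    """Draw `size` user values from the named population."""
    rng = random.Random(seed)
    return [POPULATIONS[name](rng) for _ in range(size)]


print(sample_population("P8", 5))
```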
Conclusion & Future Work
MOBS is an Effective Data Integration Tool
– Requires small start-up and administrative costs
– Solicits minimal effort per user
– Constructs system accurately
– Complements existing DI techniques
– Applies to various scenarios and DI domains
Future Work
– Leverage implicit feedback
– Intelligently maintain system
– More tightly integrate with existing DI techniques
– Deploy compelling real-world applications
http://anhai.cs.uiuc.edu/home/projects/aida.html