

Page 1:

CEES: Intelligent Access to Public Email Conversations
William Lee, Hui Fang, Yifan Li, ChengXiang Zhai
University of Illinois at Urbana-Champaign

Public Email Conversations

• Information within newsgroups and mailing lists has largely been underutilized.
• For now, access to these data is restricted to traditional searching and browsing.
• Mail traffic also grows exponentially.
• Can we access this information more effectively?

Architecture

• CEES = Conversation Extraction and Evaluation Service
• Provides a general framework for email-related research
• Integrates with popular open source projects such as Lucene, Hibernate, Tapestry, Weka, and more
• Features object-to-relational mapping of mail metadata, mail threading, flexible indexing, conversation clustering, and a web-based GUI

Clustering and Similarity Function

• Goal: find commonly discussed topics in a set of conversations (threads)
• Use agglomerative clustering with complete link
• Learn similarity functions from different “perspectives” of threads:
  – authors, date, subject, contents, contents without quote, first message, reply, reply without quote
  – Use linear and logistic regression to learn the combined similarity function
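The poster describes the method but includes no code. The following is a minimal sketch, assuming per-perspective similarity matrices are already computed and using scikit-learn and SciPy as stand-ins for whatever CEES actually builds on (the architecture mentions Weka): it learns a combined similarity with logistic regression on human-labeled thread pairs (the poster also lists linear regression), then clusters with complete link. All function and variable names here are hypothetical.

```python
# Minimal sketch (not the CEES implementation): combine per-"perspective"
# thread similarities with a learned logistic-regression model, then run
# complete-link agglomerative clustering on the combined similarity.
import numpy as np
from sklearn.linear_model import LogisticRegression
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

PERSPECTIVES = ["authors", "date", "subject", "contents", "contents_no_quote",
                "first_message", "reply", "reply_no_quote"]

def learn_combined_similarity(pair_features, same_topic_labels):
    """pair_features: (n_pairs, n_perspectives) similarities for thread pairs;
    same_topic_labels: 1 if a tagger put both threads in the same subtopic."""
    model = LogisticRegression()
    model.fit(pair_features, same_topic_labels)
    return model

def cluster_threads(model, sim_matrices, threshold=0.5):
    """sim_matrices: dict perspective -> (n_threads, n_threads) similarity matrix."""
    n = next(iter(sim_matrices.values())).shape[0]
    combined = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            feats = [[sim_matrices[p][i, j] for p in PERSPECTIVES]]
            combined[i, j] = combined[j, i] = model.predict_proba(feats)[0, 1]
    dist = 1.0 - combined                      # complete link works on distances
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```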

Conclusion

• Clustering
  – Learning the similarity function can be effective in combining different “perspectives”
• Conversation Map
  – Can give an overview of a conversation group
  – Effective use of 2-D space
• Future Work
  – Derive better algorithms to learn the similarity function
  – Faster clustering algorithms that work on mining patterns in conversations
  – Summarization of conversation clusters

Clustering Results

• Use 3 Computer Science class newsgroups from the Univ. of Illinois at Urbana-Champaign as the corpus
• Three different human taggers group messages into subtopics
• 3-way cross validation, using one group’s judgment file as the training set and testing on the other two
• Use class and cluster entropy as comparison metrics
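The poster does not define the two entropy metrics; the sketch below shows one standard formulation, in which class entropy measures how mixed each system cluster is with respect to a tagger's subtopic labels, and cluster entropy measures how scattered each subtopic is across system clusters. The size weighting and the function names are assumptions, not necessarily the poster's exact definitions.

```python
# Sketch of the evaluation metrics (standard definitions; the poster's exact
# weighting may differ). Lower is better for both.
import math
from collections import Counter

def _entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def class_entropy(cluster_assignments, subtopic_labels):
    """Size-weighted average entropy of subtopic labels inside each cluster."""
    clusters = {}
    for cid, label in zip(cluster_assignments, subtopic_labels):
        clusters.setdefault(cid, []).append(label)
    n = len(subtopic_labels)
    return sum(len(members) / n * _entropy(Counter(members).values())
               for members in clusters.values())

def cluster_entropy(cluster_assignments, subtopic_labels):
    """Same computation with the roles of clusters and subtopics swapped."""
    return class_entropy(subtopic_labels, cluster_assignments)
```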

[Diagram: existing technologies for accessing these archives, i.e. Search and Browse]

Conversation Map

• Conversation visualization derived from Treemap
• Clusters sorted by two extra time dimensions: Intra-Cluster Time and Inter-Cluster Time
• Allows the user to adjust the similarity threshold to “zoom” in on the more similar threads
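The poster does not define the two time dimensions. One plausible reading, used only for illustration here, is that Intra-Cluster Time is the time span of messages within a cluster and Inter-Cluster Time is the gap between consecutive clusters' activity; the sketch below orders cluster tiles for the map under that assumption, and every name in it is hypothetical.

```python
# Hypothetical reading of the two time dimensions (not necessarily the
# poster's definition), used to order cluster tiles in the conversation map.
def intra_cluster_time(message_times):
    """Span of message timestamps (datetime objects) inside one cluster."""
    return max(message_times) - min(message_times)

def inter_cluster_times(clusters):
    """Gap between one cluster's last message and the next cluster's first,
    after sorting clusters by their starting time."""
    ordered = sorted(clusters, key=min)
    return [min(nxt) - max(prev) for prev, nxt in zip(ordered, ordered[1:])]

def order_for_map(clusters):
    """Sort clusters by start time, then by how long they stay active, so that
    neighboring tiles in the 2-D layout are temporally related."""
    return sorted(clusters, key=lambda times: (min(times), intra_cluster_time(times)))
```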

Page 2:

Autonomic Data Integration Systems

Autonomic Computing
– Shift workload from administrators & developers onto the system
  • Self-tuning, self-maintaining, self-recovering from failures, self-improving
– Examples:
  • Self-tuning databases and query optimizers
  • Self-profiling software for bug and bottleneck detection
  • Self-recovering distributed systems

Data Integration Systems
– Many complex components:
  • Global schema
  • Sources, wrappers, and source schemas
  • Semantic mappings between source and global schemas
– Currently built (semi-)manually in an error-prone and laborious process
– Extremely difficult to maintain over changing sources

Build Autonomic Data Integration Systems!

Page 3:

The AIDA Project

• Improving Automatic Methods
  – Schema & ontology matching [SIGMOD-01, WWW-02, SIGMOD-04]
  – Entity matching & integration [IJCAI-03, IEEE Intelligent-03]
  – Global interface construction [SIGMOD-04]
• Reducing Costs of System Construction
  – Mass collaboration to build systems [WebDB-03, IJCAI-03]
• Monitoring Data Sources and Maintaining DI Systems
  – Recognition of changes in source data
  – Detection and repair of failures in DI system components

• Fast system deployment
• Minimal human effort
• Automatic adjustment to changes
• Continuous improvement

The Focus of This Talk

Automatic Integration of DAta

Page 4:

The MOBS Approach

Mass COllaboration to Build Systems

• Shift workload from developers onto user population
• Build system accurately with low individual effort

[Diagram: developers and automatic techniques on one side, the user population on the other, with MOBS in between; the pipeline components shown are Form Recognition, Attribute Matching, Source Discovery, System Initialization, and Query Translation]

Page 5:

MOBS Applied to the Deep Web

MOBS for Query Interface Matching
1. Decompose task into binary statements
2. Initialize small functioning system
3. Solicit and merge user answers to expand the system

[Figure: example bookstore query interfaces with differing attribute labels such as Title, Cost, Writer, Pub, Author, Authors, Price, and Year]

Statements for Matching “Writer”

• “Writer = Author” ?

• “Writer = Title” ?

• “Writer = Price” ?
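As an illustration of step 1 above (decomposing the matching task into binary statements), the small sketch below generates yes/no questions like the three shown for “Writer” and sets up a place to collect user answers. The data structures and question format are illustrative, not taken from MOBS.

```python
# Illustrative sketch: turn one source attribute and a set of global-schema
# attributes into binary YES/NO statements that individual users can answer.
def matching_statements(source_attribute, global_attributes):
    """matching_statements("Writer", ["Author", "Title", "Price"]) yields
    '"Writer = Author"?', '"Writer = Title"?', '"Writer = Price"?'."""
    return [f'"{source_attribute} = {g}"?' for g in global_attributes]

# Collected answers, statement -> list of (user_id, answered_yes) pairs,
# to be merged by a procedure like the one on Page 7.
answers = {s: [] for s in matching_statements("Writer", ["Author", "Title", "Price"])}
```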

Page 6:

How to Solicit User Answers

Incentive Models

• Leverage a monopoly or better-service system

• Piggy-back on a helper application

• Deploy in a volunteer or community environment

[Screenshot: a helper application (“HOOP”) showing a Barnes & Noble query form with the attributes Author, Title, Pub, Price and asking the user “Is this form a Book Sales source?” with YES/NO buttons]

Page 7:

How to Merge User Answers

Bayesian Learning

• Use a dynamic Bayesian network as a generative feedback model

• Estimate user behavioral parameters from evaluation answers

• Converge statements from teaching answers
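The poster attributes the merge to a dynamic Bayesian network over user feedback. As a simplified stand-in (explicitly not the authors' model), the sketch below estimates each user's reliability from evaluation answers whose ground truth is known and then combines teaching answers with a reliability-weighted vote; all names and the acceptance threshold are assumptions.

```python
# Simplified stand-in for the dynamic-Bayesian-network merge described above:
# estimate per-user reliability from "evaluation" questions with known answers,
# then take a reliability-weighted vote on "teaching" questions.
def user_reliability(evaluation_answers):
    """evaluation_answers: list of (user_answer, true_answer) booleans."""
    if not evaluation_answers:
        return 0.5                      # no evidence: treat the user as a coin flip
    correct = sum(a == t for a, t in evaluation_answers)
    return correct / len(evaluation_answers)

def merge_statement(votes, reliabilities, accept_threshold=0.8):
    """votes: list of (user_id, answered_yes); returns True/False/None (undecided)."""
    yes = sum(reliabilities[u] for u, v in votes if v)
    no = sum(reliabilities[u] for u, v in votes if not v)
    if yes + no == 0:
        return None
    confidence = yes / (yes + no)
    if confidence >= accept_threshold:
        return True
    if confidence <= 1 - accept_threshold:
        return False
    return None                         # keep soliciting answers
```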

[Figure: the example bookstore query interfaces from Page 5, repeated]

Page 8:

Applicability of the MOBS Approach

We Have Applied MOBS in Various Settings…
– Scale: from a small community intranet to a highly trafficked website
– Users: from cooperative expert volunteers to unpredictable novice users

… and to Several DI Tasks
– Deep Web: Form Recognition, Interface Matching
– Surface Web: Hub Discovery, Data Extraction, Mini-CiteSeer

Task datasets:
– Form Recognition: 24 forms, 17 bookstore forms
– Interface Matching: 17 interfaces, 155 attributes (average 9 attributes per interface)
– Hub Discovery: 30 department sites, 30 hubs
– Data Extraction: 26 homepages, 155 slots (average 6 slots per homepage)
– Mini-CiteSeer: 17 homepages, 22 publication lists (average 1.3 publication lists per homepage)

Page 9:

Simulation and Real-World Results

[Charts: simulated Interface Matching results. Left: precision and recall (0.0 to 1.0) for the ML-only baseline and simulated user populations P1 to P10. Right: answers per user per source and answers per trusted user per source (0 to 10) for P1 to P10.]

Real-world deployments:

Form Recognition
  Helper application: DB course website, 132 undergrad students
  Duration & status: 5 days, completed
  Current progress: completed 24/24 interfaces, found 17 bookstore interfaces
  Precision: 1.0 (0.7 ML)   Recall: 0.89 (0.89 ML)   Avg user workload: 7.4 answers

Interface Matching
  Helper application: DB course website, 132 undergrad students
  Duration & status: 7 days, stopped
  Current progress: completed 10/17 interfaces, matched 65 total attributes
  Precision: 0.97 (0.63 ML)   Recall: 0.97 (0.63 ML)   Avg user workload: 12.5 answers

Hub Discovery
  Helper application: IR course website, 28 undergrad students
  Duration & status: 21 days, stopped
  Current progress: completed 15/30 sites, found 15 hubs
  Precision: 0.87 (0.27 ML)   Recall: 0.87 (0.27 ML)   Avg user workload: 16.1 answers

Mini-CiteSeer
  Helper application: Google search engine, 21 researchers, friends, family
  Duration & status: 19 days, completed
  Current progress: completed 17/17 pages (94 lists), found 19 pubs
  Precision: 1.0   Recall: 0.86   Avg user workload: 8.7 answers

P1 – Uniform [0,1]

P2 – Uniform [0.3,0.7]

P3 – Uniform [0.5,0.9]

P4 – Bell [0,1]

P5 – Bell [0.3,0.7]

P6 – Bell [0.5,0.9]

P7 – Bimodal {0.2,0.8}

P8 – 90% Uniform [0,0.4], 10% {0.8}

P9 – 10% {0.1}, 50% Uniform [0.5,0.7], 40% Uniform [0.8,1]

P10 – 10% {0.3}, 90% Uniform [0.7,1]
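P1 to P10 above describe the distributions from which the simulated users' answer quality was drawn. As a hypothetical reproduction of that setup (not the authors' simulator), the sketch below samples a small population from a few of these distributions and checks what a plain majority vote concludes; in the real system the votes would instead go through a merge like the one on Page 7.

```python
# Hypothetical simulation in the spirit of the P1-P10 populations above:
# draw per-user reliabilities, let each user answer a statement correctly
# with that probability, and see whether a plain majority vote is right.
import random

def sample_population(dist, n_users):
    if dist == "P1":                 # Uniform [0, 1]
        return [random.uniform(0.0, 1.0) for _ in range(n_users)]
    if dist == "P2":                 # Uniform [0.3, 0.7]
        return [random.uniform(0.3, 0.7) for _ in range(n_users)]
    if dist == "P7":                 # Bimodal {0.2, 0.8}
        return [random.choice([0.2, 0.8]) for _ in range(n_users)]
    raise ValueError("only P1, P2, and P7 are sketched here")

def majority_vote_is_correct(reliabilities, truth=True):
    """Each simulated user answers correctly with probability = reliability."""
    votes = [truth if random.random() < r else (not truth) for r in reliabilities]
    return (sum(votes) > len(votes) / 2) == truth
```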

Page 10:

Conclusion & Future Work

MOBS is an Effective Data Integration Tool
– Requires small start-up and administrative costs
– Solicits minimal effort per user
– Constructs system accurately
– Complements existing DI techniques
– Applies to various scenarios and DI domains

Future Work

– Leverage implicit feedback

– Intelligently maintain system

– More tightly integrate with existing DI techniques

– Deploy compelling real-world applications

http://anhai.cs.uiuc.edu/home/projects/aida.html