Email Data Cleaning (KDD’05)
Jie Tang 1, Hang Li 2, Yunbo Cao 2, Zhaohui Tang 3
1 Tsinghua University
2 Microsoft Research Asia
3 Microsoft Corporation
Outline
• Motivation and Problem Description
• Related Work
• Our Approach
• Implementation
• Experimental Results
• Summary
Motivation
Email is one of the most common modes of communication
Text mining applications on emails
- Email classification
- Email summarization
- Term extraction from email
- …
Term Extraction
From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
  }
--
Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/
On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Extra line break
Missing space
Extra space
Missing period
Case errors
Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows:
Related Work -- Data Mining
• Email Cleaning
  - Several products have the feature of email cleaning by using rules
  - E.g. eClean (2000), WinPure ListCleaner Pro (2004)
• Information Extraction from Email
  - Extracting contact information, etc.
  - E.g. Kristjansson and Culotta (2004), Culotta, Bekkerman, and McCallum (2004), Viola (2005)
• Web Page Cleaning
  - Removing banner ads, decoration pictures
  - E.g. Yi and Liu (2003), Lin and Ho (2002)
• Tabular Data Cleaning
  - Detecting and removing duplicate information
  - E.g. Hernández and Stolfo (1998), Rahm and Do (2000), SQL Server 2005
Related Work -- Language Processing
• Sentence Boundary Detection
  - Palmer and Hearst (1997)
• Case Restoration
  - Lita and Ittycheriah (2003)
  - Mikheev (2002)
• Spelling Error Correction
  - Golding and Roth (1996)
Our Approach -- Cascaded Approach
Cleaning = non-text block filtering + text normalization
• Non-text block filtering
  - Quotation detection
  - Header detection
  - Signature detection
  - Program code detection
• Text normalization
  - Paragraph normalization
    * Extra line break detection
  - Sentence normalization
    * Missing period detection
  - Word normalization
    * Case restoration
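The cascade above can be sketched as a chain of small functions, each consuming the output of the previous one. This is only an illustration under loud assumptions: the function names are hypothetical, quoted lines are assumed to start with ">", only quotation filtering is shown for the block-filtering stage, and every stage uses a crude hand-written rule where the slides use trained models.

```python
import re

def drop_quotations(lines):
    """Non-text block filtering (quotation only): drop lines that start
    with '>' and the 'wrote:' line that introduces a quoted reply."""
    return [ln for ln in lines
            if not ln.lstrip().startswith(">")
            and not ln.rstrip().endswith("wrote:")]

def normalize_sentences(text):
    """Sentence normalization: add a missing space after '.', '?' or '!'
    and squeeze runs of spaces. Crude: it would also split URLs."""
    text = re.sub(r"([.?!])(?=[A-Za-z])", r"\1 ", text)
    return re.sub(r" {2,}", " ", text)

def restore_case(text):
    """Word normalization: capitalize the first letter of each sentence
    (a stand-in for the language-model-based case restoration)."""
    return re.sub(r"(^|[.?!] )([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

def clean_email(raw):
    """Run the stages in cascade order on one raw message."""
    lines = drop_quotations(raw.splitlines())
    # Paragraph normalization, crudely: treat every line break as extra.
    text = " ".join(ln.strip() for ln in lines if ln.strip())
    return restore_case(normalize_sentences(text))
```

Because each stage only sees what earlier stages kept, an application that wants to retain, say, quotations can simply drop that stage from the chain.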
Cascaded Approach
[Figure: processing flow]
Noisy Email Message
-> Non-text Block Filtering: Quotation Detection, Header Detection, Signature Detection, Program Code Detection
-> Paragraph Normalization: Extra Line Break Detection
-> Sentence Normalization: Missing Periods and Missing Spaces Detection, Extra Spaces Detection
-> Word Normalization: Case Restoration
-> Cleaned Email Message
From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??
Hi Ranger, Your design of Matrix class is not good.what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
  }
--
Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/
On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition..
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
Hi Ranger, Your design of Matrix class is not good. What are you doing with two matrices in a single class? Make class Matrix as follows.
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class? make class Matrix as follows.
In a particular text mining application, we can retain some of the blocks
[Figure labels: Quotation Detection, Header Detection, Signature Detection, Program Code Detection, Extra Line Break Detection, Missing Period and Missing Space Detection, Extra Space Detection, Case Restoration]
Technical Issues
• Non-text filtering
  - Quotation detection
  - Header detection
  - Signature detection
  - Program code detection
• Text normalization
  - Extra line break detection
  - Sentence normalization
  - Case restoration
Non-text Filtering Using SVMs
• Header detection
• Signature detection
• Program code detection
[Figure: training data and test data each go through feature extraction (position feature, positive word feature, ...), producing a start line feature set and an end line feature set. Two SVM models, a start line model and an end line model, are learned from the training data and applied to the test data to output the identified blocks.]
Features Used in Header Detection
• Position Feature: Is the first line?
• Positive Word Features
  - Begins with: “From:”, “Re:”, “In article”, etc.
  - Contains: “original message”, “Fwd:”, etc.
  - Ends with: “wrote:”, “said:”, etc.
• Negative Word Features: Contains: “Hi”, “dear”, “thank you”, “best regards”, etc.
• Number of Words Feature: Number of words in the current line
• Person Name Feature: Contains a person name?
• Ending Character Features: Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
• Special Pattern Features: Contains one type of special pattern: email, date, number, URL, percentage, etc.
• Number of Line Breaks Feature: Number of line breaks before the current line
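As a rough illustration of how the header features listed above could be computed for one line, here is a minimal sketch. The function name `header_features`, the feature keys, and the email and URL regular expressions are all assumptions made for this example; the person name feature is left out because it needs a name dictionary that the slides do not describe.

```python
import re

POSITIVE_PREFIXES = ("From:", "Re:", "In article")
POSITIVE_CONTAINS = ("original message", "fwd:")
POSITIVE_SUFFIXES = ("wrote:", "said:")
NEGATIVE_WORDS = ("hi", "dear", "thank you", "best regards")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")   # simplified email pattern
URL_RE = re.compile(r"https?://\S+")     # simplified URL pattern

def header_features(lines, i):
    """Return a dict of header-detection features for lines[i]."""
    line = lines[i].strip()
    lower = line.lower()
    # Count the blank lines immediately before the current line.
    blanks, j = 0, i - 1
    while j >= 0 and not lines[j].strip():
        blanks += 1
        j -= 1
    return {
        "is_first_line": i == 0,
        "pos_begins": line.startswith(POSITIVE_PREFIXES),
        "pos_contains": any(w in lower for w in POSITIVE_CONTAINS),
        "pos_ends": line.endswith(POSITIVE_SUFFIXES),
        "neg_word": any(re.search(r"\b" + re.escape(w) + r"\b", lower)
                        for w in NEGATIVE_WORDS),
        "num_words": len(line.split()),
        "ends_with_special": line.endswith((":", ";", '"', "?", "!")),
        "has_email": bool(EMAIL_RE.search(line)),
        "has_url": bool(URL_RE.search(line)),
        "line_breaks_before": blanks,
    }
```

Each dict would then be turned into a numeric vector before being passed to a classifier; that encoding step is omitted here.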
From: SY <[email protected]> - Find messages by this author
Date: Mon, 4 Apr 2005 11:29:28 +0530
Subject: Re: ..How to do addition??
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
import java.io.*;
class Matrix {
  public static int AnumberOfRows;
  public static int AnumberOfColumns;
  private int matrixA[][];
  public void inputArray() throws IOException {
    InputStreamReader input = new InputStreamReader(System.in);
    BufferedReader keyboardInput = new BufferedReader(input)
  }
--
Sandeep Yadav
Tel: 011-243600808
Homepage: http://www.it.com/~Sandeep/
On Apr 3, 2005 5:33 PM, ranger <[email protected]> wrote:
> Hi... I want to perform the addtion in my Matrix class. I got the program to
> enter 2 Matricx and diaplay them. Hear is the code of the Matrix class and
> TestMatrix class. I'm glad If anyone can let me know how to do the addition.....Tnx
Two SVM models are employed to respectively identify the start line and end line.
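The slides do not say how the outputs of the two models are combined into blocks. One plausible reading, shown here purely as an assumption, is to pair each line that the start line model accepts with the nearest following line that the end line model accepts. The function name `find_blocks` and the score threshold are hypothetical.

```python
def find_blocks(start_scores, end_scores, threshold=0.0):
    """Pair predicted start lines with predicted end lines.

    start_scores[i] and end_scores[i] are the decision values of the
    start line model and the end line model for line i; a value above
    `threshold` counts as a positive prediction. Returns a list of
    (start_index, end_index) pairs, both inclusive.
    """
    blocks = []
    i, n = 0, len(start_scores)
    while i < n:
        if start_scores[i] > threshold:
            # Scan forward (including line i itself) for the first end line.
            j = i
            while j < n and end_scores[j] <= threshold:
                j += 1
            if j < n:
                blocks.append((i, j))
                i = j + 1   # continue after the block just found
                continue
        i += 1
    return blocks
```

A start line with no following end line is ignored in this sketch, so an unmatched prediction never removes text.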
Header Detection
• Features for the line beginning “From:”: Position Feature; Positive Word Features (“From:”); Negative Word Features; Number of Words Feature; Person Name Feature; Ending Character Features; Special Pattern Features (“email”); Number of Line Breaks Feature
• Features for the line beginning “Subject:”: Position Feature; Positive Word Features (“Subject:”); Negative Word Features; Number of Words Feature; Person Name Feature; Ending Character Features (“??”); Special Pattern Features; Number of Line Breaks Feature
Automatic Feature Generation
- Feature definition is tedious.
- Can we automate the feature generation?
- Input: An annotated email dataset.
- Output: Discovered features.
- Algorithm:
Step 1: Preprocessing. This step first processes emails by using hard rules. It replaces several special patterns with tags. For example, an email address “[email protected]” is replaced by the tag <email>.
Step 2: Learning patterns. This step takes the header lines as positive samples and the other lines as negative samples. It employs a pattern learning tool to discover the patterns. An example of a discovered pattern is: “<begin> Date: <week> <date> <time> <end>”.
Step 3: Generating features. This step generates features from the learned patterns by using heuristic rules. For the above example, the corresponding feature can be: “^\s*Date: <week> <date> <time>\s*$”. The feature represents whether or not the current line contains the pattern.
Generated Features
From: <email>
Subject: (.*?) Re:
<<email>> wrote in message
Date: <week> <date>
Subject:
<week> <date> <time>
Date:
-----Original Message-----
To: <email>
….
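Steps 1 and 3 of the algorithm can be sketched in a few lines. Everything here is an assumption made for illustration: the tag rules are a small hand-written subset, the function names are hypothetical, and Step 2 (the pattern learning tool) is not implemented, so the learned pattern is simply given as input.

```python
import re

# Step 1 rules: (tag, regular expression). A hypothetical, partial rule set.
TAG_RULES = [
    ("<email>", re.compile(r"\S+@\S+\.\S+")),
    ("<week>", re.compile(r"\b(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\b,?")),
    ("<date>", re.compile(
        r"\b\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}\b")),
    ("<time>", re.compile(r"\b\d{2}:\d{2}:\d{2}(?: [+-]\d{4})?")),
]

def preprocess(line):
    """Step 1: replace special patterns in a line with their tags."""
    for tag, rx in TAG_RULES:
        line = rx.sub(tag, line)
    return line

def pattern_to_regex(pattern):
    """Step 3: turn a learned pattern such as
    '<begin> Date: <week> <date> <time> <end>' into a line-level regex."""
    body = pattern.replace("<begin>", "").replace("<end>", "").strip()
    return re.compile(r"^\s*" + re.escape(body) + r"\s*$")
```

The resulting feature is binary: it fires when the generated regex matches the preprocessed line.

```python
feature = pattern_to_regex("<begin> Date: <week> <date> <time> <end>")
fires = bool(feature.match(preprocess("Date: Mon, 4 Apr 2005 11:29:28 +0530")))
```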
Example Features Used in Signature Detection
• Position Feature: Is the first line or the last line?
• Positive Word Features: Contains: “Best Regards”, “Thanks”, “Sincerely”, “Good luck”, etc.
• Number of Words Feature: Number of words in the current line
• Person Name Feature: Contains a person name?
• Ending Character Features: Ends with: colon, semicolon, quotation mark, question mark, exclamation mark…
• Special Symbol Pattern Features: Contains consecutive special symbols such as: “--------”, “======”, “******”.
• Case Features: Whether the tokens are all in upper-case, all in lower-case, all capitalized, or only the first token is capitalized
• Number of Line Breaks Feature: Number of line breaks before the current line
Example Features Used in Program Code Detection
• Position Feature: Position of the current line
• Declaration Keyword Features: Starts with: “string”, “char”, “double”, “dim”, “typedef struct”, “#include”, “import”, “#define”, “#undef”, etc.
• Statement Keyword Features: There are four kinds of statement keyword features:
  - “i++”;
  - “if”, “else if”, “switch”, and “case”;
  - “while”, “do{”, “for”, and “foreach”;
  - “goto”, “continue;”, “next;”, “break;”
• Equation Pattern Features: There are four kinds of equation pattern features:
  - “=”, “<=” and “<<=”
  - “a=b+/*-c;”
  - “a=B(bb,cc);”
  - “a=b;”
• Function Pattern Feature: Contains function pattern? E.g., pattern covering “fread(pbBuffer,1, LOCK_SIZE, hSrcFile);”
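A few of the program code features above can be approximated with regular expressions, as in the sketch below. The regexes, the feature keys and the function name `code_features` are assumptions for illustration and cover only part of the listed feature types.

```python
import re

DECLARATION_KEYWORDS = ("string", "char", "double", "dim", "typedef struct",
                        "#include", "import", "#define", "#undef")
BRANCH_RE = re.compile(r"^\s*(?:if|else if|switch|case)\b")
LOOP_RE = re.compile(r"^\s*(?:while\b|do\s*\{|for\b|foreach\b)")
ASSIGNMENT_RE = re.compile(r"^\s*\w+\s*=\s*\w+\s*;\s*$")          # a=b;
CALL_ASSIGN_RE = re.compile(r"^\s*\w+\s*=\s*\w+\(.*\)\s*;\s*$")   # a=B(bb,cc);
FUNCTION_CALL_RE = re.compile(r"\b\w+\s*\([^()]*\)\s*;")          # f(x, y);

def code_features(line):
    """Return binary program-code features for a single line."""
    return {
        "declaration_keyword": line.strip().startswith(DECLARATION_KEYWORDS),
        "branch_keyword": bool(BRANCH_RE.match(line)),
        "loop_keyword": bool(LOOP_RE.match(line)),
        "simple_assignment": bool(ASSIGNMENT_RE.match(line)),
        "call_assignment": bool(CALL_ASSIGN_RE.match(line)),
        "function_call": bool(FUNCTION_CALL_RE.search(line)),
    }
```

Individually these tests are noisy (an English sentence starting with "for example" sets `loop_keyword`), which is one reason to feed them to a classifier as features instead of using them as hard rules.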
Extra Line Break Detection Using SVMs
[Figure: training data and test data each go through feature extraction (position feature, bullet feature, ...), producing a feature set. One SVM model, the extra line break model, is learned from the training data and applied to the test data to output the identified extra line breaks.]
Features Used in Extra Line Break Detection
• Position Feature: Is the first line or the last line?
• Greeting Word Features: Contains: “Hi”, “Dear”, etc.
• Ending Character Features: Ends with: colon, semicolon, quotation mark, question mark, exclamation mark, etc.
• Case Features: Whether the current line ends with a word in lower case letters and whether or not the next line starts with a word in lower case letters
• Bullet Features: Is the next line one kind of bullet of a list item like “1.” and “a)”?
• Number of Line Breaks Feature: Number of line breaks after the current line
Extra Line Break Detection
Hi Ranger, Your design of Matrix class is not good. what are you doing with two matrices in a single class?make class Matrix as follows
One SVM model is employed to identify whether a line break is an extra one or not.
Features: Position Feature, Greeting Word Features, Ending Character Features, Case Features, Bullet Features, Number of Line Breaks Feature
Highlighted feature: Case Features (whether the current line ends with a word in lower case letters and whether or not the next line starts with a word in lower case letters)
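The features for one line break could be extracted roughly as follows. This is a sketch under assumptions: the function name `line_break_features` and the feature keys are hypothetical, and the bullet pattern only covers the two bullet styles named in the slides.

```python
import re

GREETING_RE = re.compile(r"\b(?:hi|dear)\b")
BULLET_RE = re.compile(r"^(?:\d+\.|[A-Za-z]\))(?:\s|$)")   # "1." or "a)"

def line_break_features(lines, i):
    """Features describing the line break that follows lines[i]."""
    cur = lines[i].strip()
    # Skip blank lines to find the next text line; j - i is the number
    # of line breaks between the current line and that next line.
    j = i + 1
    while j < len(lines) and not lines[j].strip():
        j += 1
    nxt = lines[j].strip() if j < len(lines) else ""
    cur_words, nxt_words = cur.split(), nxt.split()
    return {
        "is_first_line": i == 0,
        "is_last_line": j >= len(lines),
        "greeting": bool(GREETING_RE.search(cur.lower())),
        "ends_with_special": cur.endswith((":", ";", '"', "?", "!")),
        "cur_ends_lower": bool(cur_words) and cur_words[-1].islower(),
        "next_starts_lower": bool(nxt_words) and nxt_words[0].islower(),
        "next_is_bullet": bool(BULLET_RE.match(nxt)),
        "line_breaks_after": j - i,
    }
```

A break where the current line ends in lower case, the next line starts in lower case, and only one line break follows is the typical candidate for an extra line break.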
Case restoration
tri-gram + sentence level decoding
Jack utilize outlook express to retrieve emails.
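[Figure: candidate case forms considered for each word during decoding]
Jack / jack / JACK
Utilize / utilize / UTILIZE
Outlook / outlook / OUTLOOK
Express / express / EXPRESS
To / to / TO
Receive / receive / RECEIVE
Emails / emails / EMAILS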
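\[
P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}
\]
Backoff scheme:
\[
P(w_i \mid w_{i-2} w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-2} w_{i-1} w_i)\, D(C(w_{i-2} w_{i-1} w_i))}{C(w_{i-2} w_{i-1})} & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\[2ex]
\alpha(w_{i-2} w_{i-1})\, P(w_i \mid w_{i-1}) & \text{otherwise}
\end{cases}
\]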
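To make "tri-gram + sentence level decoding" concrete, the sketch below picks the best case form for every word with a Viterbi search over pairs of adjacent forms. It deliberately simplifies the backoff scheme: instead of the discount D and the backoff weight in the formula, it multiplies by a fixed factor (0.4, an arbitrary assumption) each time it falls back to a shorter n-gram, so the scores are not normalized probabilities. The function names and the toy training sentences are hypothetical.

```python
import math
from collections import Counter

def train(sentences):
    """Count 1-, 2- and 3-grams (and their contexts) in cased text."""
    counts, contexts = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.split()
        for k in range(2, len(toks)):
            for n in (1, 2, 3):
                gram = tuple(toks[k - n + 1:k + 1])
                counts[gram] += 1
                contexts[gram[:-1]] += 1
    return counts, contexts

def log_score(model, history, word, backoff=0.4):
    """Trigram log-score with a fixed-penalty backoff (simplified)."""
    counts, contexts = model
    penalty = 0.0
    for n in (3, 2, 1):
        context = tuple(history[len(history) - (n - 1):])
        gram = context + (word,)
        if counts[gram] > 0:
            return penalty + math.log(counts[gram] / contexts[context])
        penalty += math.log(backoff)
    return penalty + math.log(1.0 / (contexts[()] + 1))   # unseen word

def restore_case(model, sentence):
    """Viterbi decoding over lower / Capitalized / UPPER forms."""
    # best maps the last two chosen forms to (score, chosen forms so far).
    best = {("<s>", "<s>"): (0.0, [])}
    for w in sentence.split():
        forms = list(dict.fromkeys([w.lower(), w.capitalize(), w.upper()]))
        nxt = {}
        for (a, b), (score, path) in best.items():
            for f in forms:
                cand = score + log_score(model, (a, b), f)
                if (b, f) not in nxt or cand > nxt[(b, f)][0]:
                    nxt[(b, f)] = (cand, path + [f])
        best = nxt
    return " ".join(max(best.values(), key=lambda v: v[0])[1])
```

With a toy model trained on three correctly cased sentences, the decoder recovers the capitalization of names it has seen:

```python
model = train(["Jack uses Outlook Express to retrieve emails .",
               "Outlook Express is an email client .",
               "Jack sends emails ."])
restored = restore_case(model, "jack uses outlook express to retrieve emails .")
```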
Datasets in Experiments
- 73.2% contain extra line breaks
- 85.4% need sentence normalization
- 47.1% contain case errors
- Only 1.6% are absolutely clean

Data Set    # of Email   Containing Header   Containing Signature   Containing Prog. Code   Text Only
DC          100          1.00                0.87                   0.15                    0.0
Ontology    100          1.00                0.77                   0.02                    0.0
NLP         60           1.00                0.883                  0.0                     0.0
ML          40           1.00                0.975                  0.05                    0.0
Jena        700          0.996               0.97                   0.38                    0.0
Weka        200          0.995               0.975                  0.17                    0.0005
Protégé     500          0.28                0.822                  0.032                   0.168
OWL         500          0.384               0.932                  0.042                   0.048
Mobility    400          0.44                0.745                  0.0                     0.183
WinServer   400          0.449               0.672                  0.0125                  0.221
Windows     1000         0.476               0.653                  0.007                   0.218
PSS         1000         0.492               0.668                  0.01                    0.208
BR          310          0.495               0.643                  0.0                     0.244
J2EE        255          1.00                0.561                  0.094                   0
Total       5565         3256 (0.585)        4229 (0.760)           401 (0.072)             768 (0.138)
Cleaning Results -- 5-fold Cross Validation

Cleaning Task (Method)          Precision   Recall   F1-Measure
Header (Our Method)             0.9695      0.9742   0.9719
Header (Baseline)               0.9981      0.6055   0.7537
Signature (Our Method)          0.9133      0.8838   0.8983
Signature (Baseline)            0.8854      0.2368   0.3736
Quotation                       0.9818      0.9201   0.9500
Program Code                    0.9297      0.7217   0.8126
Extra Line Break (Our Method)   0.8553      0.9765   0.9119
Extra Line Break (Baseline)     0.6355      0.9813   0.7715
Sentence                        0.9493      0.9391   0.9442
Baseline methods
• Header detection (eClean2000)
• Signature detection (rule based)
• Extra line break detection baseline (eClean2000)
For case restoration:
- Our method can reach 98.15% in terms of accuracy
- The accuracy of TrueCasing is about 97.7%
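For reference, the three reported measures follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. The sketch below uses hypothetical function names. Note one assumption: the slides do not say how the values were averaged across the five folds, so recomputing F1 from a reported precision and recall may differ from the reported F1 in the last digit.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, the header baseline row (precision 0.9981, recall 0.6055) gives `f1_from(0.9981, 0.6055)`, which is about 0.7537: very high precision cannot compensate for low recall.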
Automatic Features vs. Manual Features

Detection Task (Features)   Precision   Recall   F1-Measure
Header (Manual)             0.9695      0.9742   0.9719
Header (Automatic)          0.9932      0.9626   0.9777
Signature (Manual)          0.9133      0.8838   0.8983
Signature (Automatic)       0.7616      0.6671   0.7112
Term Extraction Using Email Cleaning
[Figure: two bar charts, one for the BR data set and one for the J2EE data set. Each shows Precision, Recall, and F1-Measure of term extraction (Percentage, %) for three series: Original Data, Baseline, Our Method.]
How Cleaning Processing Helps Term Extraction
[Figure: bar chart for the BR data set showing Precision, Recall, and F1-Measure (Percentage, %) for the series Original Data, +Header, +Signature, +Quotation, +Program, +Paragraph. Annotated gains: +74.2%, +6.4%, +41%.]
How Cleaning Processing Helps Term Extraction (cont.)
[Figure: bar chart for the J2EE data set showing Precision, Recall, and F1-Measure (Percentage, %) for the series Original Data, +Header, +Quotation, +Signature, +Program, +Paragraph. Annotated gains: +42.4%, +2.3%, +24.7%.]
Summary
• Formalized email data cleaning as non-text filtering and text normalization
• Conducted email cleaning in a ‘cascaded’ approach
• Used SVM models for header, signature, program code, and extra line break detection
• Our approach significantly outperforms baseline methods
• When applied to term extraction, a significant improvement in extraction accuracy can be obtained