Dr. William E. Underwood Principal Research Scientist Georgia Tech Research Institute


Dr. William E. Underwood

Principal Research Scientist

Georgia Tech Research Institute

Auto-Classification: Taking a Closer Look

ARMA NOVA Chapter Spring Seminar

March 6, 2012

How does Auto-Classification Work? The Science Behind It


Overview

• The problem: e-mail needs to be categorized by business activity and retained according to the agency's records schedule.

• Issues that arise in trying to solve this problem

• Rule-based text categorization

• Statistics-based text categorization

• An experiment in e-mail categorization

• Conclusions


BACKGROUND

• NARA Bulletin 2011-03, Dec 22, 2010

• Subject: Guidance Concerning the use of E-mail Archiving Applications to Store E-mail

• What are the requirements for managing e-mail messages as Federal records?

• Provide for the grouping of related records into classifications according to the nature of the business purposes the records serve;

• Permit easy and timely retrieval of both individual records and files or other groupings of related records;

• Retain the records in a usable format for their required retention period as specified by a NARA approved records schedule;

• …


Which e-mails are non-records?

• Many received intra-office e-mail messages should not be saved as records, because they are for information only. If the recipient keeps them, they are kept for reference. They are non-records. They do not need to be categorized as records, but may need to be categorized as non-records.

• Some received intra-office mail should be saved as records, for example, requests for action and responses to requests for information.

• E-mail messages that are related, e.g., a request for information and its response, should be linked.


TEXT CATEGORIZATION

Text (or Document) Categorization is the problem of assigning a selected document to one or more categories. There are two primary approaches to automated text categorization—rule-based and statistics-based machine learning. Rule-based filters are available in some e-mail clients for automatically classifying mail into email folders. Machine learning techniques are the basis of most spam filters for email.


Rule-based Text Categorization with Shallow Text Processing

• Human experts define template structures to be filled automatically by extracting information from the documents [Ciravegna et al 1999]. The partially filled templates are classified by hand-made rules.

• This approach yields very high recall/precision or accuracy values.

• It incurs high costs in analyzing and modeling the application domain, especially if one takes into account the problem of changing content in the categories.
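As a rough illustration of the rule-based approach, the sketch below applies hand-written rules in order and assigns the category of the first rule that matches. It is much simpler than the template-filling systems cited above and closer to the folder rules of an e-mail client. The field names, patterns, and the `Rule` and `categorize_by_rules` names are hypothetical, chosen only for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """A hand-written rule: if `pattern` matches `field`, assign `category`."""
    field: str      # hypothetical field name, e.g. "sender" or "subject"
    pattern: str    # regular expression, matched case-insensitively
    category: str


# Hypothetical rules; real ones would be written by a domain expert.
RULES = [
    Rule("subject", r"\bbrown bag\b", "Brown Bag Email"),
    Rule("sender", r"^helpdesk@", "Help Desk Computer System Maintenance Email"),
    Rule("subject", r"\bbenefits?\b", "Employee Benefits Email"),
]


def categorize_by_rules(email, rules=RULES, default="Uncategorized"):
    """Return the category of the first matching rule, else `default`."""
    for rule in rules:
        if re.search(rule.pattern, email.get(rule.field, ""), re.IGNORECASE):
            return rule.category
    return default
```

Every new kind of message needs a new rule, which is where the modeling and maintenance cost noted above comes from.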


Categorizing Email in MS Outlook Using Rules

[Three slides share this title; their images are not reproduced in this transcript.]

Supervised Machine Learning

• Supervised machine learning (SML) trains classifiers on a set of documents that have been labeled with the correct class.

• Given a sufficient sample for each category, machine learning models generally cost less to create than rule-based systems.

• SML is also easier to scale up to large volumes of email.

• SML promises low costs in both analyzing and modeling the application, at the expense of lower accuracy.

• It is independent of domain-specific knowledge.


Supervised Machine Learning Applied to E-mail

• The supervised machine learning methods that have been used for text categorization include: Maximum Entropy classification, Naïve Bayes, and Support Vector Machines (SVM).

• The most effective SML method for text categorization has been SVM [Joachims 1998; Sebastiani 2002]:

• SVMs scale up to high dimensionalities

• Work well without term selection

• Robust to over-fitting


Support Vector Machines

• SVMs distinguish positive and negative examples for each class. During the learning phase, they construct a hyperplane supported by vectors of positive and negative examples. For each class, a categorizer is built by computing such a hyperplane.

• During the categorization phase, each categorizer is applied to the new document vector, yielding the probabilities of the document belonging to a class. The probability increases with the distance of the vector from the hyperplane. A document is said to belong to the class with the highest probability.
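The categorization phase can be sketched as follows, assuming each category has a linear classifier stored as a weight vector and a bias. The sketch uses the raw signed score w·x + b, which grows with the distance from the hyperplane, in place of the probability mentioned above. Converting scores to probabilities needs a separate calibration step that is omitted here. The function names and the toy classifiers are illustrative only.

```python
def decision_value(weights, bias, x):
    """Signed score w.x + b; positive values lie on the positive side of the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def assign_category(classifiers, x):
    """Apply every per-category classifier to x and pick the highest score.

    classifiers: dict mapping category name -> (weights, bias).
    Returns (best_category, scores_by_category).
    """
    scores = {cat: decision_value(w, b, x) for cat, (w, b) in classifiers.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example with two categories and two features.
toy_classifiers = {
    "Category A": ([1.0, 0.0], -0.5),
    "Category B": ([0.0, 1.0], -0.5),
}
best, scores = assign_category(toy_classifiers, [0.2, 0.9])
```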


Feature Vector Representation of Email

• Each word token (or word stem) in a document corresponds to a feature (the bag-of-words representation).

• All words in a training set correspond to positions in a feature vector.

• The value of each feature is defined by tf-idf: the term frequency in the document (tf) scaled by the inverse document frequency (idf).
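A minimal sketch of this representation is shown below. The slides do not say which tf-idf variant was used; this example assumes raw term counts for tf and idf = log(N / df), one common choice. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build tf-idf vectors for a list of tokenized documents.

    docs: list of token lists.
    Returns (vocabulary, vectors), with one vector position per vocabulary term.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; absent terms count as 0
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [
    ["board", "meeting", "minutes"],
    ["board", "agenda"],
    ["server", "backup", "backup"],
]
vocab, vectors = tfidf_vectors(docs)
```

A term that appears in every document gets idf = log(1) = 0 under this variant, so it carries no weight in any vector.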


Learning Support Vector Classifiers

• Crosses & circles represent positive and negative training examples, resp.

• Lines represent decision surfaces

• Thicker line is best decision surface.

• Small boxes indicate support vectors.
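To make the idea of learning a decision surface concrete, here is a small sketch that fits a linear classifier by batch subgradient descent on the regularized hinge loss. This is one simple way to train a linear SVM and is not necessarily the solver used in the experiment, which the slides do not name. The hyperparameters (`lam`, `epochs`, `lr`) and the toy data are assumptions for illustration.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimise (lam/2)*||w||^2 + mean hinge loss by batch subgradient descent.

    X: list of feature vectors, y: list of labels in {+1, -1}.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data: three positive and three negative examples.
X = [[2, 2], [3, 3], [3, 2], [-2, -2], [-3, -3], [-2, -3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

In this toy set, the training points closest to the boundary, here (2, 2) and (-2, -2), play the role of the support vectors.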


Experiment: Automatic Categorization of GTRI Email using SVMs

1. Select samples of GTRI email that should be categorized and related to the University System of Georgia’s Records Retention Schedule.

2. Select 2/3 of samples in each of the six categories for training.

3. Preprocess each email sample in training set by converting it to text format and transforming it into a representation suitable to the learning method.

4. Train six binary SVM classifiers, one per category.

5. Evaluate performance of the classifiers using 1/3 of samples from each category that were not used in the training.
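Steps 2 and 4 can be sketched as a per-category split followed by one-vs-rest labeling. The slides do not describe how the split was drawn. This example assumes a seeded random shuffle within each category and rounds to the nearest sample, so its counts need not match the ones used in the experiment. All function names are hypothetical.

```python
import random


def stratified_split(samples_by_category, train_fraction=2 / 3, seed=0):
    """Split each category's samples into training and held-out test sets."""
    rng = random.Random(seed)
    train, test = {}, {}
    for category, samples in samples_by_category.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train[category] = shuffled[:cut]
        test[category] = shuffled[cut:]
    return train, test


def binary_labels(train, positive_category):
    """One-vs-rest labels: +1 for the target category, -1 for all others."""
    return [
        (sample, 1 if category == positive_category else -1)
        for category, samples in train.items()
        for sample in samples
    ]


# Toy data: integer ids stand in for e-mail messages.
samples = {"A": list(range(9)), "B": list(range(100, 106))}
train, test = stratified_split(samples)
labels_for_a = binary_labels(train, "A")
```

Calling `binary_labels` once per category gives the training data for each of the binary classifiers.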


RESULTS: Retention Categories

Category: Administration

(A4) Advisory Board Records
Explanation: This series documents the activities of boards and councils, which function in an advisory capacity. Boards and councils may have as their charge highly specific or broad areas of concern and include members from outside the institution. This series may include but is not limited to meeting minutes; agendas; reports; notes; working papers; audio recordings; transcriptions; and related documentation and correspondence.
Record Copy: Institutional Archives; Colleges & Units
Retention: Permanent for minutes, agendas, reports, and correspondence; 3 years for all other records
Citation or Reference:

(A13) Correspondence, Administrative
Explanation: Series documents communications received or sent which contain significant information about an institution's programs. Records include letters sent and received, memoranda, notes, enclosures, attachments and electronic messages.
Record Copy: Units
Retention: 5 years
Citation or Reference: O.C.G.A. 9-3-26


(A15) Correspondence, Transitory
Explanation: Series documents communications received or sent which do not contain significant information about an institution's programs (Correspondence, Administrative), fiscal status (Correspondence, Fiscal), or routine agency operations (Correspondence, General). Records include, but are not limited to, advertising circulars, drafts and worksheets, desk notes, memoranda, electronic messages, and other records of a preliminary or informational nature.
Record Copy: Units
Retention: Until read
Citation or Reference:

(A16) Correspondence, General (Routine)
Explanation: Series documents communications received or sent which do not contain significant information about an institution's programs. Records include: letters sent and received; memoranda; notes; transmittals; acknowledgments; community affair notices; charity fund drive records; routine requests for information or publications; enclosures, attachments and electronic messages.
Record Copy: Units
Retention: 5 years
Citation or Reference: O.C.G.A. 9-3-26


(A38) Special Event Records
Explanation: This series documents the efforts of a college or unit to provide informative sessions, short-courses, workshops, training programs, excursions, and celebratory events for members of the institution and the communities it serves. This series may include but is not limited to: materials on planning and arrangements; reports; promotional and publicity materials; press releases and news clippings; photographs; presentation materials and handouts; schedules of speakers and activities; registration and attendance lists; participant evaluations; and related documentation and correspondence.
Record Copy: Creating units
Retention: 7 years after end of event
Citation or Reference: O.C.G.A. 9-3-24

Category: Information Management & Planning

(D1) Computer System Maintenance Records
Explanation: This series documents the maintenance of the institution's computer systems and is used to insure compliance with any warranties or service contracts, schedule regular maintenance and diagnose system or component problems, and document system backups. Records may include: computer equipment inventories; hardware performance reports; component maintenance records (invoices, warranties, maintenance logs, correspondence, maintenance reports, and related records); system backup reports; and backup tape inventories.
Record Copy: Information Technology, Units
Retention: For life of system or component for records related to system or component repair or service; until superseded for records related to regular or vital records backups


Improved Description of Retention Category

Administrative Correspondence Email

Official communication by Institutional, Departmental, and Divisional Management pertaining to the formulation, planning, implementation, interpretation, or modification of an entity’s programs, services, or projects and the policies and regulations that govern them.

Examples:

E-mail from the President announcing the development of a new campus.

Email from the President announcing the development of a new Research Center.

Email from the CIO announcing a new Service Center and explaining the planned benefits.

Email from the Chair of Pathology announcing the development of a new course curriculum.

Email from the Purchasing Department implementing new procedures for the procurement of contract services.

Email from the Library extending service delivery hours.


General Correspondence Email

Email communication that documents an entity’s activities (institution, department, division, etc.) arising from the routine operations of policies, programs, services, or projects.

Examples:

Faculty email notifications to students documenting course assignments and due dates.

Email notification from Accounting that the monthly Ledgers have been closed.

Email transmission of a Department’s monthly report.

An email transmittal submitting an official report, where documentation is needed to prove the report was submitted timely.

Human Resources notifications regarding changes in employee benefits.

Email notification from Grants Management notifying faculty of grant filing deadlines.


Email Retention & Filing Categories

Univ. System of Georgia Retention Category  GTRI Filing Category                         Samples  Training Samples  Test Samples
(A4) Advisory Board Records                 Faculty Library Advisory Board Email         101      68                33
(A13) Correspondence, Administrative        GTRI Director Administrative Email           134      90                44
(A15) Correspondence, Transitory            ACL ListServ Email                           100      65                35
(A16) Correspondence, General               Employee Benefits                            67       48                19
(A38) Special Event Records                 Brown Bag Email                              77       45                32
(D1) Computer System Maintenance Records    Help Desk Computer System Maintenance Email  98       63                35
Total                                                                                    577      379               198

Sample Brown Bag Email

[Image not reproduced in this transcript.]

RESULTS – Text Categorization

• Converted the sample emails to text files, then processed each email in the training sample as follows:

• Used GATE to tokenize each sample.

• Preprocessed each sample to remove punctuation, digits, 524 stop words (pronouns, prepositions, determiners), and infrequent terms (i.e., terms that occur 4 times or fewer in the entire corpus).

• Used SVM with the six training samples to construct six classifiers. The features consisted of 3333 tokens (words) that were the union of all the tokens in the 379 emails in the training sample, less the words removed in preprocessing.

• Feature selection: The six classifiers have positive and negative weights associated with each feature. For each classifier, selected the 100 features with the highest positive weights and the 100 features with the most negative weights. The union of these features resulted in 686 features.

• Used SVM with the same six training samples to construct six classifiers based on the 686 features.

• Used the six classifiers to categorize the 198 emails not in the training sample.
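The preprocessing and feature-selection steps above can be sketched as follows. The experiment used GATE for tokenization and a 524-word stop list. This example substitutes a simple regular expression and a tiny stand-in stop list, and it lowercases tokens, which the slides do not mention. All names are hypothetical.

```python
import re
from collections import Counter

# Tiny stand-in for the 524-word stop list used in the experiment.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "it", "for"}


def tokenize(text):
    """Lowercase and keep alphabetic runs only, which drops punctuation and digits."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts, min_count=5, stop_words=STOP_WORDS):
    """Keep non-stop-word tokens that occur at least `min_count` times in the corpus.

    min_count=5 drops terms that occur 4 times or fewer.
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return sorted(t for t, c in counts.items() if c >= min_count and t not in stop_words)


def select_features(weights_by_classifier, k=100):
    """Union of the k highest-weight and k lowest-weight features of each classifier.

    weights_by_classifier: dict mapping classifier name -> {feature: weight}.
    """
    selected = set()
    for weights in weights_by_classifier.values():
        ranked = sorted(weights, key=weights.get)  # ascending by weight
        selected.update(ranked[:k])    # most negative weights
        selected.update(ranked[-k:])   # most positive weights
    return sorted(selected)
```

Because the same feature can rank highly for more than one classifier, the union is smaller than the number of classifiers times 2k, which is consistent with six classifiers yielding 686 features rather than 1200.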

First 30 of 686 Features in each classifier

[Image not reproduced in this transcript.]

Feature Weights in the Category 1 SVM Model

[Image not reproduced in this transcript.]

RESULTS OF AUTO-CATEGORIZATION

GTRI Filing Category                         Test Samples  Correct  Spurious  Missed  Precision   Recall      F1
Faculty Library Advisory Board Email         33            33       0         0       1.0         1.0         1.0
GTRI Director Administrative Email           44            43       1         0       0.97727275  1.0         0.9885058
ACL ListServ Email                           35            35       0         0       1.0         1.0         1.0
Employee Benefits Email                      19            18       0         1       1.0         0.94736844  0.972973
Brown Bag Email                              32            30       1         1       0.9677419   0.9677419   0.9677419
Help Desk Computer System Maintenance Email  35            35       0         0       1.0         1.0         1.0
Totals                                       198           194      2         2       0.9897959   0.9897959   0.9897959
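The precision, recall, and F1 columns can be recomputed from the Correct, Spurious, and Missed counts. The slides do not state the formulas. This sketch assumes precision = correct / (correct + spurious) and recall = correct / (correct + missed), which reproduce the reported values to the precision shown.

```python
def precision_recall_f1(correct, spurious, missed):
    """Compute precision, recall and F1 from per-category counts.

    correct:  items assigned to the category that belong to it
    spurious: items assigned to the category that do not belong to it
    missed:   items that belong to the category but were not assigned to it
    """
    precision = correct / (correct + spurious) if correct + spurious else 0.0
    recall = correct / (correct + missed) if correct + missed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Counts taken from the table rows above.
director = precision_recall_f1(correct=43, spurious=1, missed=0)
benefits = precision_recall_f1(correct=18, spurious=0, missed=1)
totals = precision_recall_f1(correct=194, spurious=2, missed=2)
```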


Analysis of Experimental Results

The auto-categorization errors involved just two out of 198 sample emails in the test set.

Email: 20100810-0903 SIGN UP LINK – GTRI VPDirector Search – Brown Bag with Search Firm
Human categorized: Brown Bag Email
Auto-categorized: Spuriously auto-categorized as GTRI Director Administrative Email. Missed as Brown Bag Email.

Email: 20061030-1102 News You Can Use – October 30th
Human categorized: Employee Benefits Email
Auto-categorized: Spuriously auto-categorized as Brown Bag Email. Missed as Employee Benefits Email.


RESULTS

• Creation of good, reliable category samples is essential to use of machine learning to create accurate classifiers.

• Support vector machines can be used to create highly accurate classifiers for email.

• Descriptions of records categories in the records schedule need to be enhanced with examples.

• Accuracy of automatic categorization is improved by training to filing categories that are subcategories of retention categories.


Auto-categorization is not the complete solution to the Email Categorization and Retention Problem. The following ideas need to be investigated.

• At the time of creation, tag copies of intra-organizational email with the filing category.

• Limit the use of classifiers to those email categories specific to an office.

• Determine how to associate specific filing categories with generic retention categories.

• If a person routinely creates records in a filing category, include the category ID in a template or in a pull-down menu.

• Use subject line tags to facilitate categorization.

• Email that is a response to a message that has already been categorized should be placed in the same category as the original and linked to that email.
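The last idea, giving a reply the category of the message it answers, can be sketched with Python's standard `email` package. This sketch assumes the reply carries an `In-Reply-To` header holding the original's `Message-ID`, and that already-categorized messages are held in a dictionary keyed by message ID. The function name and the sample message are hypothetical.

```python
from email import message_from_string


def inherit_category(raw_message, category_by_message_id):
    """Return (category, parent_id) if the message replies to a categorized email.

    Returns (None, None) when there is no In-Reply-To header or the parent
    message has not been categorized.
    """
    msg = message_from_string(raw_message)
    parent_id = msg.get("In-Reply-To")
    if parent_id is not None:
        parent_id = parent_id.strip()
        if parent_id in category_by_message_id:
            return category_by_message_id[parent_id], parent_id
    return None, None


# Hypothetical example: one categorized original and a reply to it.
known = {"<1@example.org>": "Brown Bag Email"}
reply = (
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "Subject: Re: Brown Bag\n"
    "\n"
    "I will attend.\n"
)
category, linked_to = inherit_category(reply, known)
```

The returned parent ID gives the link between the reply and the original, so both can be retained and retrieved together.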


References

F. Ciravegna et al. (1999). Facile: Classifying texts integrating pattern matching and information extraction. Proceedings of IJCAI'99, Stockholm, pp. 890-895.

T. Joachims (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, pp. 137-142.

F. Sebastiani (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47.

Records Retention Manual. Board of Regents, University System of Georgia, March 30, 2010. www.usg.edu/records_management/schedules/A/
