NSF grant proposal narrativegalloway/pkghome_web…  · Web viewgrant narrative as developed from...

RRC/UT-GSLIS/UT-CS proposal: Project description Page 1



Legitimating e-mail for official government business: Automatic adaptive classification

Railroad Commission of TexasUniversity of Texas Graduate School of Library and Information ScienceUniversity of Texas Department of Computer Sciences

I. Statement of the Problem

Lawful management of e-mail messages in the short and long terms

Market specialists have predicted that the number of e-mails sent on an average day would hit 10 billion worldwide in 2000; by 2005, the volume is expected to more than triple to 35 billion e-mails sent daily (Levitt 2000). This volume becomes a task in itself: another study found that e-mail users in 2000 spent an average of 90 minutes per day on mailbox management tasks and predicted that by 2002 this would increase to an average of 2.5 hours per day (Grey 2000). If e-mail is the emerging “primary vehicle for knowledge exchange” in private and governmental enterprises (Harris and Hayward 2000), its management is of primary importance. In addition, a robust and generalizable retrieval mechanism is required for finding information after multiple years of storage.

Information technology (IT) departments have typically taken a single approach to managing e-mail messages: the adoption of automatic deletion policies for all centrally stored e-mail, setting limits on message age and/or mailbox size. For IT, then, the central problem of e-mail is bulk reduction. Framing e-mail as anything but what records managers would term transitory (not appropriate for long-term retention) drastically increases the problem. Governmental agencies that are obliged by law to keep a large volume of e-mail messages for an indefinite period of time must also be able to respond to information requests, meet privacy and security requirements, provide ever-expanding storage space, and handle conversions when upgrades or new systems are introduced. So far, each available solution has fallen short on one or another of these requirements.

The Texas Records Management Interagency Coordinating Council (RMICC), composed of representatives of seven state agencies with direct authority over Texas state government's management of its records, is charged with proposing and implementing improvements to Texas state records management. In 1999 RMICC determined that the dramatic expansion of the volume of e-mail in Texas state government, together with the demand from businesses and citizens to transact business with state government via e-mail, was raising a serious records management problem that stood squarely in the way of the state’s efforts to move to e-government automation. Accordingly, RMICC sought an agency willing to serve as a testbed for investigating a suitable implementation of lawful e-mail management sufficient to support official electronic transactions. The volunteer they found was the highly visible Railroad Commission of Texas (RRC), which regulates the oil and gas industry. Co-PI Cisco, Records Management Officer for the RRC, is also an adjunct lecturer in the Graduate School of Library and Information Science (GSLIS) at the University of Texas. Her colleague and architect of the electronic records program at GSLIS, PI Galloway, brought archival and electronic records knowledge from twenty years’ experience at the Mississippi state archives to the project, as well as a significant research interest in the role of recordkeeping in bureaucratic structure. Galloway sought out Co-PI Harris, who could bring both her own work in Natural Language Processing and the acknowledged expertise of the UT Department of Computer Sciences in machine learning to bear on the project. Within the RRC the division of Permitting and Production Services, which includes employees who perform tasks that vary from highly routinized to not at all routinized, was chosen to serve as the pilot group for the initial training phases of research. 
Members of the IT and legal staffs of the RRC have also participated fully in the design of the project.

E-mail problems were not unknown at the RRC, where the Novell GroupWise e-mail system consumed 13 gigabytes of hard drive storage in October 1999 and over 15.3 gigabytes in April 2000, suggesting an average growth rate of 2.5% per month. Because the RRC lacks a formal policy on e-mail maintenance and usage as well as the tools and techniques to implement such a policy, its e-mail server periodically fills to capacity, system response slows, and employees are implored (via e-mail) to clean up their mailboxes. The most common approach employees currently use to meet this demand in compliance with the RRC’s retention schedule is to print out e-mail messages that must be retained and file them with other paper documents in local or central filing systems. This solution is, however, likely inconsistent to some degree, makes the records relatively inaccessible, and certainly compounds the greater problem of paper records bulk. What is needed is a more proactive and systematic approach to e-mail management.
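The growth figure quoted above can be checked with a short calculation. The sketch below is illustrative only: the two storage figures come from the text, but the number of months and the choice between a simple and a compound average are assumptions, since the text does not say how the 2.5% was derived.

```python
def simple_monthly_rate(start_gb, end_gb, months):
    """Total fractional growth divided evenly across the months."""
    return (end_gb - start_gb) / start_gb / months


def compound_monthly_rate(start_gb, end_gb, months):
    """Constant month-over-month rate that compounds from start to end."""
    return (end_gb / start_gb) ** (1 / months) - 1


START_GB, END_GB = 13.0, 15.3  # storage figures quoted in the text

# Assumption: 6 counts the month boundaries from October to April;
# 7 counts both endpoint months.
for months in (6, 7):
    simple = simple_monthly_rate(START_GB, END_GB, months)
    compound = compound_monthly_rate(START_GB, END_GB, months)
    print(f"{months} months: simple {simple:.2%}, compound {compound:.2%}")
```

A simple average over seven months gives roughly 2.5% per month, which agrees with the figure in the text; a compound rate over six intervals comes out nearer 2.75%. Either way, the mail store grows by more than a quarter each year.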

In a recent survey for ARMA (Association of Records Managers and Administrators) International, members of the Industry Specific Groups for Petroleum and Utilities chose “legal and litigation risk” as the most significant risk of not managing e-mail messages adequately (Cisco, White-Dollman, and Lloyd 2001). There are two legal aspects of records retention: legal requirements and legal protection. Legal requirements promulgated by regulatory agencies at the federal, state, and local levels create a web of retention requirements that organizations must apply to records in order to achieve compliance. For example, the Oil and Gas Division of the RRC mandates the following retention requirements for the oil and gas industry:

Five-year retention for monitoring records for disposal wells, fluid injection into productive reservoirs, and underground storage facilities.

Three-year retention for all other records, forms, and documents which are required to be filed with the Oil and Gas Division.

The RRC manages its own records lawfully using a records retention schedule approved biennially by the Texas State Library and Archives Commission (TSLAC) and the Office of the State Auditor. The schedule is organized into 450 records series titles and specifies the total retention period, security requirements, and vital records designation for all records. Twenty-two percent (22%) of the records series titles contain information about the earth (such as oil and gas wells, pipelines, water quality, and abandoned mines) and are permanently retained. Since assignment of retention periods to e-mail messages is based upon the content rather than the format of the information, e-mail messages with content having sufficient legal, business, and/or archival value to be retained must be assigned to a records series title. Automatic capture of e-mail records in a Document Management System/Records Management Application (DMS/RMA) with subsequent schedule-driven storage would be one solution, since it would manage and retain e-mail according to specific business rules incorporating approved records retention schedules. Unfortunately, the RRC has no such system in place. Even if it had a DMS/RMA in place, most require the time-consuming tasks of manually assigning a record to the appropriate records series title and retention period (for all employees), supervising employee compliance (for supervisors), and constantly modifying the underlying retention schedule (for the records manager).
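To make the idea of schedule-driven storage concrete, the following minimal sketch shows how a records series title might drive a disposition date. The series keys, titles, and retention periods are hypothetical placeholders, not entries from the RRC's approved schedule, and the code is not drawn from any DMS/RMA product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RecordsSeries:
    title: str
    retention_years: Optional[int]  # None means permanent retention
    vital: bool = False


# Hypothetical entries; the real schedule has about 450 series titles.
SCHEDULE = {
    "general-correspondence": RecordsSeries("General correspondence", 3),
    "well-data": RecordsSeries("Well location data", None, vital=True),
    "transitory": RecordsSeries("Transitory information", 0),
}


def disposition_date(series_key: str, closed_on: date) -> Optional[date]:
    """Return the date a record becomes eligible for disposal, or None if permanent."""
    series = SCHEDULE[series_key]
    if series.retention_years is None:
        return None
    target_year = closed_on.year + series.retention_years
    try:
        return closed_on.replace(year=target_year)
    except ValueError:
        # February 29 has no counterpart in a non-leap target year.
        return closed_on.replace(year=target_year, day=28)
```

Under these assumptions, `disposition_date("general-correspondence", date(2000, 4, 1))` yields April 1, 2003, while any record assigned to a permanent series never becomes eligible for disposal. The hard part, as the paragraph above notes, is assigning each message to a series in the first place.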

Legal protection is an issue with e-mail messages because they are subject to discovery in litigation, including hearings. When organizations do not provide employee guidance on the deletion and retention of e-mail, they are likely to gather irrelevant e-mail as part of a document production in response to litigation discovery, which can cause litigation costs to skyrocket. Lawyers pursue e-mail records in litigation for two reasons: 1) it is easier to review the equivalent of a filing cabinet of e-mail than a filing cabinet full of paper because it is structured and full-text searchable; and 2) employees often use e-mail like a telephone, and the messages tend to be written in a candid and informal manner containing corporate gossip or derogatory or indiscreet remarks. Messages containing “loose” language could easily be taken out of context by judges or regulators, leading to inappropriate and potentially damaging conclusions.

Since 22% of the RRC’s record series are permanently retained, long-term preservation of archival e-mail records must also be addressed in any solution. Long-term access to any electronic record is problematic because of the rapid changes in hardware, software, and storage media. Current state rules require that electronic records having archival value and scheduled to be preserved at the TSLAC must be printed out on alkaline paper or on microforms that meet the specifications in American National Standard for Imaging Media (Film)-Silver-Gelatin Type-Specifications for Stability (ANSI IT.1-1992). This requirement is a reflection of the limited resources available for the preservation of state archival records. The state archival program is not funded at an adequate level, in terms of a sufficient number of properly trained professional archivists, to identify and appraise the quantity of state electronic systems to determine which have long-term or archival value. Further, even if such systems are appraised and the electronic records are identified as appropriate for permanent preservation, the State Archives does not have the necessary computer hardware and software to permit transfer of and access to these automated information systems. This project will include the provision to the State Archives of results and budgetary figures to assist them in arguing either the need for additional funding (if retention of the RRC’s permanent records within the State Archives is found to be necessary) or the need for other control mechanisms (if the RRC’s permanent electronic records are better retained in the agency). For the present, the RRC must be prepared to preserve indefinitely the e-mail messages having permanent value through migration to future computer systems, preserving functionality, authenticity, and documentation.

Exploitation of the value of e-mail as information: Open records and agency efficiency

The Public Information Act (Texas Government Code, Chapter 552) gives the public the right to access government records, and an officer for public information and the officer’s agent may not ask why a requester wants them. All government information is presumed to be available to the public. Governmental agencies must promptly release requested information that has not been constitutionally, statutorily, or judicially determined to be confidential. Examples of information that is exempt from disclosure at the RRC include confidential personnel information, social security numbers, and credit card numbers.

Since e-mail messages are stored on the network, on local non-networked storage drives, and on backup tapes at the RRC, public requests for information involving e-mail can be time-consuming and costly to satisfy. In the project we propose to create a repository of e-mail messages, scaling down the Open Archival Information System (OAIS) implementation developed by the San Diego Supercomputer Center’s project for the National Archives and Records Administration (NARA), to which we will add a web-based interface. We can thus centralize the storage and retrieval of e-mail, reducing the time and cost of searches by employees and the public. Hadassah Schloss, Open Records Administrator for the State of Texas, provided four functional requirements that would have to be met by any system developed by the project so as to maximize the efficiency of open records requests:

1) It should have the capability to look for specific e-mail messages according to a name, a topic, or a specific mention.

2) It should not require programming to initiate a search for specific e-mail messages.

3) It should not make it more difficult to access and print e-mail messages that are subject to an open records request, resulting in higher charges to requestors.

4) It should retain all identifying information about e-mail messages, such as sender’s name, time e-mail was sent, etc.
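The first, second, and fourth requirements can be illustrated with a small sketch. Everything here is hypothetical: the `StoredMessage` fields, the `search` function, and the decision to treat "topic" as a match on the subject line and "specific mention" as a match on the body are assumptions made for illustration, not the design of the proposed repository.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class StoredMessage:
    # Identifying information is kept with the message (requirement 4).
    sender: str
    recipients: Tuple[str, ...]
    sent_at: datetime
    subject: str
    body: str


def search(messages, name=None, topic=None, mention=None):
    """Return messages matching every criterion supplied (case-insensitive)."""

    def has(text, needle):
        return needle.lower() in text.lower()

    results = []
    for m in messages:
        if name and not (has(m.sender, name) or any(has(r, name) for r in m.recipients)):
            continue
        if topic and not has(m.subject, topic):
            continue
        if mention and not has(m.body, mention):
            continue
        results.append(m)
    return results
```

A web form with three optional fields could call `search` directly, so that a requester or records officer never has to write a query by hand (requirement 2).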

Historically, the process of regulating the permitting, drilling, and completion of an oil or gas well or the construction of a pipeline has been linear and paper-based. Many paper forms and individual pieces of correspondence flow between the RRC and its customers, which must be reviewed in step-by-step processes. Because the necessary steps are varied and handled by several geographically dispersed organizational units, it can take three to five days or longer to process a simple form or letter manually. At certain times of the year, six to eight Commission employees sit at a table all afternoon to fold and stuff letters into envelopes in order to communicate important information to Commission stakeholders. Clearly these processes beg to be reengineered. The automation of forms processing is already being implemented at the RRC through the ECAP (Electronic Compliance and Approval Process) project. ECAP’s initial pilot step will convert the filing, review, and approval of a well’s drilling permit application to a completely electronic process. The infrastructure developed in the pilot step will be applied to more than two dozen other forms by 2005.

Standard non-form pieces of correspondence that flow between oil and gas companies and the grant proposal’s pilot group (Permitting and Production Services) are not addressed in the ECAP project, but in a study carried out in 1999, employees in the pilot group listed nearly a hundred such transactions that could be automated through the use of e-mail if a properly regulated system were in place to support their authentication. It is proposed that as part of this project, a prioritized list of such transactions be implemented in e-mail form, significantly speeding up the communication process and expediting workflow to and from oil and gas companies. Paper handling and storage would be reduced in both the pilot group and in Central Records, where the paper-based letters eventually have to be sorted, alphabetized, and filed into the centralized filing system. Postage costs, a significant expense each time identical letters and/or notices have to be mailed to 9,000 active Texas oil and gas operators, would also be reduced. Finally, the phased adoption of new transaction types in e-mail form would provide the project with a set of well-understood classification anomalies to test its classification solution.

It is not possible to use the existing messaging system, GroupWise, to meet the RRC’s requirements for e-mail management, nor is it possible at present to replace it. Although GroupWise does have very modest built-in capabilities for handling message retention and other records management issues, they are mostly limited to restricting the size of individual messages and mailboxes (which are maintained on the server) and the automatic deletion of e-mail based on message age. There is an “archive” function that can be used to save mail or telephone messages, appointments, reminder notes, or tasks to a designated database on a local or networked drive, but since only the owner can access “archived” items, they are not subject to centralized control. Neither approach provides assurance that unnecessary e-mails will be destroyed or that valuable record information will be classified, retained, and backed up according to the records retention schedule. This has had a chilling effect on the use of e-mail to replace standard paper-based communications that flow between the RRC and the industries it regulates, in spite of industry interest.

Handling constant change in government records management

Any effort to manage e-mail or any other electronic government record must take cognizance of the environment of constant change, political, social, and commerce-driven, in which any solution must be able to function. Change is first of all endemic to democratic and bureaucratic government. For most government agencies election cycles virtually dictate personnel changes, through appointments. The likelihood of changes in recordkeeping is always present as new executive officers make discretionary structural changes in agency management. Even more likely to affect recordkeeping practices is simply the ongoing work of legislatures, again to be anticipated in cycles of activity. Not every agency is affected by significant legislation in every session, but most agencies will be affected by something, and usually must make some alteration in recordkeeping as a result. Government recordkeeping is also being affected by larger patterns of societal and global change. As a group these changes are usually characterized as a shift to an “information economy,” but there are distinct components of this large-scale shift that affect government directly.

On a national scale, enabling easy interactions between and among the states is one of the core tasks of a federal government system, and for a long time efforts have been made to cause state legal codes to be uniform in areas where interaction is desirable or necessary. With the emergence of electronic interactions, regulatory and commercial, these efforts are being extended to the standardization of the records-generating activities that embody them. Although such efforts are underway, they are by no means complete: in fact, they have only just begun as the governmental infrastructures of regulation and legitimation are moved online. The emergence of new standards can be anticipated for some time to come.

As changes take place in the society at large, agency constituencies also change, and agencies must respond to them. Some agencies have a close relationship with an industry that they regulate: such is the case with the RRC and the oil and gas industry. All agencies must respond to the public whose lives they affect, and the requirements for such response have changed in the past (with the emergence of open records statutes in the 1970s and of privacy concerns in the 1990s) and are likely to continue to change in the future. All agencies must respond to the press acting on behalf of the people and making use of the public laws to pursue demands for accountability. As agencies’ constituencies make different demands of recordkeeping, it is certain that constituency demand will ultimately be a significant driver of electronic recordkeeping practice.

Although government tends to respond much more slowly than private industry to commercial hardware and software innovation, agencies eventually find themselves at the mercy of computer hardware and software vendors and their profit cycles as they struggle to keep abreast of changes in the face of threatened removal of support for outmoded systems. Further, changes in hardware and software compound the problems of long-term or archival retention of records.

Finally, the shift to e-government, promoted by industry as a panacea for cash-strapped governments eager to cut staffs while improving service, presents a fundamental challenge to recordkeeping practices still centered on paper models. Reorganization itself is not new: governments at all levels have been involved with specific reorganization efforts during the course of American history, and the shift to “e-government” is only the most recent of a long line that has included the 1912 Taft Commission on federal government efficiency, the efforts of two post-World War II Hoover Commissions directed specifically at government recordkeeping, and the widespread “umbrella” movements to group agencies for efficiency in the 1960s. All of these efforts took for granted the continuing practice of paper-based recordkeeping; no substantive changes beyond carbon paper and copiers would be made until the electronic records revolution began to make itself felt as an alternative to paper. E-government, however, means dramatic change in the way government does business. Judging from the experience of previous reorganizing efforts, it is safe to say that the move to e-government promises to be confused, lengthy, and uneven, all serious problems for the as-yet immature electronic recordkeeping practices that must document it. Government employees and recordkeeping professionals will have to devise some way to cope with uncertainty and change in any responses they offer to the problems raised by e-government efforts. A central requirement of the proposed project, therefore, is that its solution be designed to be adaptive, that it not require constant attention by employees, IT staffs, or records managers, and that it assist recordkeepers in responding to change rather than placing barriers in their way.

II. Review of Previous Work

Document management and records management systems

At the beginning of the digital revolution in business practices, a compromise with paper was made by the adoption of document management or workflow systems that provided for the management and routing of images of paper documents over networks, providing a repository of documents and a database of metadata to manage them. As documents generated by businesses were increasingly “born-digital,” a new demand for searchability without the necessity for abstracting by hand led to the adoption of versioning technology already being used for programming collaboration, enabling the archiving and retrieval of specific versions of business documents. The proliferation of such systems, together with the explosive increase of electronic records generation especially in government, created a further demand for the ability to manage and store all kinds of born-digital records electronically and in such a way as to satisfy statutory recordkeeping requirements. The Department of Defense undertook to establish the parameters for such a system, resulting in the 5015.2 standard, which combined document management and records management into a system definition that would provide for the lawful management of active born-digital records:

A document management application is defined as a system used for managing documents that allows users to store, retrieve, and share them with security and version control. A records management application is software used by an organization to manage its records. An RMA's primary management functions are categorizing and locating records and identifying records that are due for disposition. RMA software also stores, retrieves, and disposes of the electronic records that are stored in its repository (DoD 1997).

The Records Management Application (RMA) Certification Testing page produced by the Joint Interoperability Test Command of the Defense Information Systems Agency lists RMAs currently certified to be compliant with DoD 5015.2-STD. All the products have the capability of managing electronic records in some sort of repository as well as providing a means for managing records in paper and other formats through record profiles. Most products store records in their original format as binary large objects together with descriptive information in a relational database. All products require a “file plan” derived from the retention schedule (NARA 2000) to be built into the system before it can be implemented for records management. Most products depend on the end-user (creator or receiver of documents) to identify the record type of each document or message and to enter descriptive information into a profile before the filing process is complete.
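The storage pattern described above (original-format content held as a binary large object, descriptive profile data in relational columns, and a file plan that must exist before anything can be filed) can be sketched with Python's built-in `sqlite3` module. The table names, columns, and helper functions are assumptions made for illustration; they are not taken from DoD 5015.2-STD or from any certified product.

```python
import sqlite3


def open_repository():
    """Create an in-memory repository with a file plan and a records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the file-plan link
    conn.executescript(
        """
        CREATE TABLE file_plan (
            series_code     TEXT PRIMARY KEY,
            title           TEXT NOT NULL,
            retention_years INTEGER          -- NULL means permanent
        );
        CREATE TABLE records (
            id          INTEGER PRIMARY KEY,
            series_code TEXT NOT NULL REFERENCES file_plan(series_code),
            author      TEXT NOT NULL,
            filed_on    TEXT NOT NULL,       -- ISO 8601 date
            media_type  TEXT NOT NULL,
            content     BLOB NOT NULL        -- record in its original format
        );
        """
    )
    return conn


def file_record(conn, series_code, author, filed_on, media_type, content):
    """Store one record with its profile data and return the new record id."""
    cur = conn.execute(
        "INSERT INTO records (series_code, author, filed_on, media_type, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (series_code, author, filed_on, media_type, content),
    )
    return cur.lastrowid
```

Because `records.series_code` references `file_plan`, an attempt to file a record under a series that is not in the file plan is rejected, which mirrors the requirement that the file plan be built before the system is put to use. The `series_code` argument is exactly the piece of profile data that most products expect the end-user to supply.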

It should also be noted that the 5015.2-STD was never meant to serve as an archival system, but rather as a recordkeeping system that produced reliable and authentic records (Trace and Sannett 2000). The standard requires an organization to ensure that it has the ability to view, copy, print, and, if appropriate, process any record stored in an RMA for as long as that record must be retained within the organization. Several methods are outlined for achieving this, along with a requirement to pre-plan migration to ensure records reliability. This requirement places the responsibility to ensure the accessibility of records on the organization using the RMA rather than incorporating it as a feature designed into the RMA itself. Therefore, the test summary pages for the RMA Certified Product Register do not specifically cover long-term storage of records in the RMAs listed. The proposed project at the RRC, using open standards for e-mail records management and retention being developed for NARA, will make any necessary migration simpler and less expensive. The permanent records maintained by the RRC will likely need to be migrated to new systems multiple times.

Although most of the DoD-approved RMA systems exclude archival functions connected with permanent retention, they do meet requirements of bulk reduction and lawful records retention. They are less well-adapted to retrieval requirements for meeting open records requests and exploiting the knowledge contained in the records they manage, and they are very ill-adapted to coping with the endless political, social, and industry changes endemic to government recordkeeping systems. Further, with the exception of Tower Software's TRIM and TrueArc’s ForeMost, most of the DoD-certified RMA products were originally (and many remain primarily) document management systems whose vendors or integrators have licensed records management software module plug-ins to comply with the 5015.2-STD. This practice increases integration and long-term upgrade problems.

Implementing 5015.2-compliant DMS/RMAs is also an expensive undertaking. Comparative costing is difficult because vendors use various strategies to determine prices. For organizations with fewer than 50 employees, Hummingbird DOCS Enterprise Suite costs $499 per workstation and $7000 per server, provided an SQL database is present. IBM Content Manager pricing starts at $15,000 per server plus $2000 per concurrent user. Eastman Software Work Manager Suite Client License Packs, purchased on a named-account basis, range from $99 for a single pack to $69,135 for a 1000-user package. FileNET Integrated Document Management Software Solutions (IDM) bases price on configuration and components selected by the purchaser (Faulkner 2000).

In addition, functionality, technical architecture, vendor viability, and service and support costs must be taken into consideration. The selection process itself may involve 6 to 20 people for 6 to 12 months to make a decision that costs 20% to 35% of the acquisition cost (Logan and Chin 2001). In Texas, the RRC spent approximately $220,000 on a document management and imaging system for 23 concurrent users. Bids received by the Texas General Services Commission for a replacement document management system for in-house applications were between $250,000 and $500,000. The Lower Colorado River Authority has spent approximately $200,000 to implement and integrate its records management application, even though the software seat price for iRIMS was only $49. The project proposed for the RRC, which will make broad use of open-source software elements, will certainly provide a less expensive method to manage e-mail used for business purposes.

Filing e-mail messages in a compliant DMS/RMA generally requires the user to move the message explicitly to the document management environment. Some programs allow the user to select whether or not to file the message and any attachments as one object or to save them as separate items. In some programs, the user can file an outgoing message to the record repository when the message is sent; in others, the user must file both incoming and outgoing messages. All programs automatically capture transmission and receipt data. Managing records in these programs is heavily dependent on the expertise of records managers, who set them up initially, and on the end-users, who must enter explicit metadata—“profile” data—each time a record is stored. More recent trends toward automatic classification of e-mail in RMA systems show promise, but are proprietary and cannot be explicitly mapped onto statutory requirements. Further, only two DMS/RMAs claim to be able to manage Novell GroupWise Mail: TRIM and ForeMost Enterprise.

Knowledge management systems and text classification

The initial impetus for “knowledge engineering” was capturing the knowledge of experts before
those experts retired or left employment. With more and more companies transacting the largest portion of their communications over the Internet, the focus of “knowledge management” has shifted to collecting and using information about employees, customers, and trading partners. Companies such as Autonomy and Tacit Knowledge Systems have developed products that build searchable databases by categorizing e-mail, matching words and phrases in the content to a previously identified taxonomy. The primary use of these products has been to create a knowledge base of employee expertise profiles for responding to business needs. Industry has also focused on developing systems to automate responses to customer questions by analyzing e-mail content and linking it to an appropriate response. Content filtering software is also being used to search both incoming and outgoing e-mail for unauthorized disclosure of trade secrets and circulation of banned material, as well as for regulatory compliance and other reasons (Cain 2000). Because these efforts are focused on the needs of private business, none of them is significantly concerned with e-mail as a business record. Except where legal concerns arise, they devote little or no attention to archiving the e-mail itself or to documenting the emergence and evolution of knowledge structures, as good records management and archival practice does through the continuing documentation of recordkeeping practices. The proposed project intends to evaluate the application of existing records management principles to the scheduling and management of e-mail.

Some DMS/RMA vendors are beginning to address the problem of e-mail bulk and user non-compliance with strategies for automatic classification of records using the methods of pattern matching and contextual analysis originally developed for the purposes of knowledge management. TrueArc, formerly Provenance, has developed a new product called AutoRecords (Knowledge Server from Autonomy) to be used as a module with its ForeMost Enterprise electronic recordkeeping system. AutoRecords is "trained" by providing the software with 50-100 records, each at least 20K bytes in length, per records series. The system administrator sets up confidence ranges, designating the percentage by which a record must match the known category. Depending on the range into which a record falls, it can be filed automatically without user intervention, the user can select from five probable category choices, or the record can be set aside for human classification. Several sites, including the National Archives, have been beta-testing AutoRecords for about a year (NARA's project apparently includes Novell's GroupWise). Statistics on the success of the tests are just beginning to be compiled. According to Gartner Group, setting up and training an Autonomy system for 2000 to 5000 users requires a focused effort for about six months (Hayward and Linden 2000). Tower Software also claims that its product TRIM PA automatically captures corporate information and categorizes it to meet recordkeeping requirements through some type of folder matching process. Users file information in folders on their desktop. These folders are linked to folders within TRIM that are in turn linked with the retention schedule. At periodic intervals dictated by retention schedules, the folders are "swept" and the files or copies of the files are transferred to the recordkeeping system. The process depends on the end-user filing documents into the appropriate folders.
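
The confidence-range handling described above can be illustrated with a short sketch. It is not AutoRecords' actual logic, which is proprietary; the function name route_record, the two cut-off values, and the series names are assumptions chosen only to show the three possible outcomes (automatic filing, a short list of choices for the user, or referral to a human classifier).

```python
from typing import Dict, List, Tuple


def route_record(scores: Dict[str, float],
                 auto_file_at: float = 0.90,   # hypothetical cut-off
                 suggest_at: float = 0.50,     # hypothetical cut-off
                 max_suggestions: int = 5) -> Tuple[str, List[str]]:
    """Decide how to handle one record given per-series match scores in [0, 1]."""
    # Rank records series from best to worst match.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return ("human_review", [])
    best_series, best_score = ranked[0]
    if best_score >= auto_file_at:
        # High confidence: file without user intervention.
        return ("auto_file", [best_series])
    if best_score >= suggest_at:
        # Middle range: offer the user the most probable series.
        return ("user_choice", [name for name, _ in ranked[:max_suggestions]])
    # Low confidence: set the record aside for human classification.
    return ("human_review", [])
```

For example, a record scoring 0.95 against a hypothetical "permits" series would be filed automatically, while one whose best score is 0.10 would be set aside.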


In the case of both AutoRecords and TRIM PA, however, the proprietary classification technology upon which so much depends remains inaccessible to records managers. Since this computer code becomes de facto legal code when it implements records retention (and especially records destruction), there is some question whether such an arrangement is in fact lawful (cf. Lessig 1999). An important aspect of the proposed project at the RRC is the interest of Texas government in ensuring that automatic classification is done transparently, by means of non-proprietary technology that its users can understand both through system documentation and through automatically generated metadata documenting the classification process.

Automatic classification is important not only to remove the burden of classification from every user, but to aid records managers in tracking records series. In Texas government, approved agency retention schedules must be followed for two years, with limited provision for modification, until it is time to re-certify a new schedule. New records series are added and obsolete records series are deleted during the re-certification process. The Records Management Officer is dependent on departmental records management liaisons or end-users to provide information about changes in the records series, which means that in an electronic environment it is quite possible for records to be created without there being an established records series to place them into. The process of identifying such undocumented changes in records series requires that the ability to recognize anomalies be part of any automatic classification system. Those available so far apparently do this by presenting the user with the opportunity to override a classification decision or reporting anomalies directly to a records manager, but again it is not possible to learn what specifically is meant by “anomaly” in these systems, nor is there as yet enough experience to show just how full the records manager’s “anomaly” mailbox is likely to become as the endemic changes in government are compounded by legislation and reengineering. The proposed project will explicitly address methods for distinguishing changes in series by introducing new message sets presently covered by schedules for paper records but not yet represented in e-mail.

Studies of e-mail usage so far have done little to address recordkeeping concerns specifically. Recent topics of investigation have included gender differences (Cohen 2001) and employee status (Mackay 1988; Headlam 2001). Other studies have concentrated on the number of e-mails received in the corporate world, how much time is spent dealing with them, and what percentage of e-mails actually relate to business. E-mail policies promulgated by organizations focus on limiting personal use of e-mail and emphasizing security to protect the organization against unauthorized access. Missing are in-depth studies of the influence e-mail has had on business paradigms and culture, yet such evidence as exists plainly suggests that the conclusions of a study made in the 1980s can have very little relevance to e-mail use at the turn of the century (cf. Garton and Wellman 1995). The proposed project at the Texas RRC will also attempt to analyze changes in business practices due to the staff's ability to complete more informational and transactional interchanges through e-mail. Recognizing change, recording it, and proactively addressing it can support more efficient work processes.

III. Hypotheses and Methods

The present project seeks to understand the underlying patterns of e-mail use as recordkeeping practice, in order to bring records management and archival standards to bear on the long-term management and use of e-mail records. We believe that the “e-mail problem” appears intractable because people have so far attempted to apply to it standards that were designed for the handling of paper records. We agree with archival scholars that core elements of these standards apply universally to recordkeeping practices in general, but it is also clear that other elements of existing standards and practices are either irrelevant or impracticable for the management of electronic records, and simply instantiating those standards and practices in automated systems does nothing to address the need for revision. Accordingly, we propose to test the following hypotheses in the present project.

First, we think that it is necessary to verify that job functions and records series exhibit reliable correlations and that a set of records series descriptions can be formulated as an ontology for the government entity that produces them. This is an article of faith for the relatively modern practice of functional analysis in records management, whereby the activities of the entity being documented are analyzed first, so as to map the activities onto the specific records series that document them. The TSLAC recently completed a comprehensive analysis and appraisal of records at the RRC, describing in detail each record series, incorporating the combined knowledge of the appraisal archivist and the RRC’s staff (TSLAC 2001). It includes the purpose of the records and the agency program supported, providing a set of up-to-date descriptions for verification and conversion. We intend to field a team of archival science students from the University of Texas Graduate School of Library and Information Science (GSLIS) to interview RRC staff members from the pilot group in order to verify functional mapping onto records series and also to learn about user perceptions of their patterns of e-mail use. GSLIS students will also hand-classify sets of e-mail from the pilot group, based upon their professional understanding of the records series descriptions, to provide training sets for the classification behavior of a human records manager.

Secondly, we think that the task of classification of e-mail records can be effectively constrained by job function; in other words, that it is possible to constrain the classification activity to a limited number of classes (records series) on the basis of the job activity carried out by each sender (in the case of internally-generated e-mail) or recipient (in the case of externally-generated e-mail) of an e-mail message. We then intend to test whether we must match automatic classification behavior to the semantic sets established by the hand classification performance of the student records managers (a process that is itself poorly understood) or whether it is adequate to relate e-mail content directly to existing records series descriptions using semantic, statistical, or combination methods. We intend to use text classification methods drawn from machine learning (ML), natural language processing (NLP), information retrieval (IR), and statistics.
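
As a rough illustration of the job-function constraint, the sketch below narrows the candidate records series before any content-based choice is made. The mapping SERIES_BY_FUNCTION, the function names, and the series names are hypothetical; in the project the mapping would come from the functional analysis, and the scores from whichever classifier is under test.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from job function to the records series it can produce.
SERIES_BY_FUNCTION: Dict[str, Set[str]] = {
    "permitting": {"drilling permits", "correspondence"},
    "legal": {"hearing files", "correspondence"},
}


def candidate_series(origin: str, sender_function: str,
                     recipient_function: str) -> Set[str]:
    """Pick the series allowed for a message from the relevant job function."""
    if origin == "internal":
        function = sender_function      # internally generated: use the sender
    elif origin == "external":
        function = recipient_function   # externally generated: use the recipient
    else:
        raise ValueError("origin must be 'internal' or 'external'")
    return SERIES_BY_FUNCTION.get(function, set())


def constrained_best(scores: Dict[str, float],
                     candidates: Set[str]) -> Optional[str]:
    """Return the best-scoring series among the allowed candidates, if any."""
    allowed = {name: s for name, s in scores.items() if name in candidates}
    if not allowed:
        return None
    return max(allowed, key=allowed.get)
```

With this constraint, a message sent by a permitting employee would never be assigned to "hearing files", even if its text happened to score highest against that series.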

Selecting a sample of text (initially a month's worth of e-mail from the RRC) to serve as a training set allows sufficient learning about that type of text to generalize to the larger corpus of six months’ worth and then a year’s worth of e-mail. Several ML techniques, known as "bag of words" methods, can be used; for example, SVM (Support Vector Machine; Joachims 1998), Naive Bayes (Duda and Hart 1973), or Logistic Regression. Each of these linear techniques can produce effective classification results during training. Information Retrieval techniques such as Term Frequency/Inverse Document Frequency (TF/IDF; Sparck-Jones 1972) and Maximum Entropy (Berger et al. 1996) can also be used for classification. The literature suggests that these methods can be supplemented and improved by NLP, and we already have informal evidence that this is the case. In a preliminary investigation for this project, Co-PI Harris assigned a related project in her NLP class, CS 378, in the fall semester of 2000. The class was divided into three teams and given a dummy RFP asking for a proof-of-concept system for categorizing e-mail. PI Galloway acted as the agency submitting the RFP and supplied a batch of her own Mississippi state government e-mail for classification. The teams were not given predefined categories but had to determine those themselves based upon message bodies alone. One team took a strictly statistical approach to the problem, using software available on the web. The second team took a purely NLP approach, while the third team was allowed to build a hybrid system using both NLP and statistical methods. The hybrid team produced superior results to the other two. We therefore intend to use NLP as well as statistical methods, and exploit what we know about the context of origin of the records, the records series themselves, and human classification behavior within the context, all in a co-training framework (Blum and Mitchell 1998).
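
To make the "bag of words" idea concrete, the following is a minimal sketch of one of the techniques named above, a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python. It is a toy illustration rather than the system the project will build; the class name, the tokenizer, and the example series labels are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only (word order is ignored)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # number of messages per series
        self.word_counts = defaultdict(Counter)    # word frequencies per series
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)   # add-one smoothing
            logp = math.log(n_docs / total_docs)             # class prior
            for word in tokenize(doc):
                if word in self.vocab:                       # skip unseen words
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Trained on a handful of hand-classified messages, for instance two about well permits labeled "permits" and two about a staff picnic labeled "non-record", such a model assigns a new message to whichever series makes its words most probable.
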

We will employ a part-of-speech tagger to label word classes to be able to select content words (Brill 1992; Ratnaparkhi 1996). Named Entity tagging, making use of records series description and employee function information, will help with identifying proper names of people, institutions, and companies. In some circumstances further syntactic parsing may be necessary and can employ one of the several parsers now available. Semantic processing to discern meaning will be based on an ontology derived from the RRC records retention schedule. WordNet and SemCor, developed by George Miller's team at Princeton University, can support semantic interpretation needed for classification (Fellbaum 1998). WordNet's extensive semantic network and vocabulary of over 110K words support many NL projects by revealing synonyms, hypernyms and hyponyms, meronyms and holonyms. Together we think these tools can provide sufficient power to classify the e-mail reliably. Retrieval will be enabled by derived classification metatags added to each message.
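
One simple way to picture the derived classification metatags is as extra headers written onto each message, which a retrieval tool can then filter on. The sketch below uses Python's standard email package; the header names X-Records-Series and X-Classification-Score and both helper functions are hypothetical, and the proposal does not specify how the metatags will actually be stored.

```python
from email.message import EmailMessage


def tag_message(msg, series, score):
    """Record the assigned records series and its score as message headers."""
    msg["X-Records-Series"] = series                 # hypothetical header name
    msg["X-Classification-Score"] = f"{score:.2f}"   # hypothetical header name
    return msg


def find_by_series(messages, series):
    """Return the messages whose metatag names the requested records series."""
    return [m for m in messages if m.get("X-Records-Series") == series]


# Usage: tag one message and retrieve it by series.
example = EmailMessage()
example["Subject"] = "Permit application received"
example.set_content("The application has been logged.")
tag_message(example, "drilling permits", 0.93)
matches = find_by_series([example], "drilling permits")
```

Recording the score next to the series also leaves a trace of how each classification decision was reached.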

We know that over time there will be a need for revision of the target classes to reflect structural and other changes in the agency. Thirdly, therefore, we are suggesting that instead of always attempting to undertake these revisions from the top down, through notification of the records manager to analyze the recordkeeping situation and construct new records series, we should use the pattern-matching capacity of our automatic classification method to signal the need for such alterations when existing classes fail to match new records created or received by agency employees. We propose as a matter of routine to develop a threshold measure for goodness of fit with classes relevant to each employee, again using human classification behaviors and knowledge of the context, that can detect outlying messages, so as to be able to dispose lawfully of non-record material in the system (for example, informal memos about the social activities that make government agencies functional human institutions).

We think that such threshold measures can also assist in the detection of new activities that require formal documentation but that often escape management evaluation until long after electronic records pertaining to them would have been deleted under routine system management policies. Rather than adopt extreme retention policies in order to rescue such limit cases, we think we can use machine learning techniques to work interactively with users and records managers through a simple notification system to elicit human evaluation of outliers and thus permit the dynamic tuning of both the threshold measure and the notification process. Fortunately for us, the series of the new non-form regulatory communication types targeted for e-mail implementation by the chosen pilot group of users can be phased in methodically during the course of the project in order to help tune the classification techniques.
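
The interplay between a goodness-of-fit threshold and human evaluation of outliers might look roughly like the sketch below. The OutlierMonitor class, its fixed step size, and its bounds are assumptions made for illustration; the actual threshold measure and tuning procedure are among the things the project proposes to develop.

```python
from dataclasses import dataclass


@dataclass
class OutlierMonitor:
    threshold: float = 0.5   # best-fit scores below this are flagged (assumed start)
    step: float = 0.02       # size of each adjustment (assumed)
    floor: float = 0.05
    ceiling: float = 0.95

    def is_outlier(self, best_score: float) -> bool:
        """Flag a message whose best class match falls below the threshold."""
        return best_score < self.threshold

    def record_feedback(self, best_score: float, human_says_outlier: bool) -> None:
        """Nudge the threshold using a human judgment on one message."""
        flagged = self.is_outlier(best_score)
        if flagged and not human_says_outlier:
            # False alarm: lower the threshold so fewer messages are flagged.
            self.threshold = max(self.floor, self.threshold - self.step)
        elif not flagged and human_says_outlier:
            # Missed outlier: raise the threshold so more messages are flagged.
            self.threshold = min(self.ceiling, self.threshold + self.step)
```

Each notification answered by a user or records manager then becomes a small correction, so the number of messages referred for review can settle at a level the staff can sustain.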

Finally, we believe along with others that there is potential value in providing access to substantive e-mail messages, but we think that assumption should be verified by a field test. Accordingly, we propose to create a searchable repository of e-mail messages, modeled upon the SDSC’s implementation of the OAIS model. We will then train employees on its use for retrieving messages and observe how they use it in order to evaluate 1) what kinds of messages are most useful to employees, and 2) how long they are useful—i.e., is there a temporal “window” during which specific kinds of messages are particularly valuable for current uses. The answers to these questions should have implications for the still-mysterious task of assigning an appropriate administrative retention period to e-mail. As already discussed, we suspect that usage behaviors too will change as the agency itself and the records environment change, so here also we will seek a dynamic means of evaluating the “usefulness” of records classes.
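
One possible way to estimate the temporal "window" from observed repository use is to look at how old messages are when they are retrieved. The sketch below assumes a hypothetical access log of (records series, date sent, date retrieved) entries and reports, for each series, the age in days within which a chosen share of retrievals falls; the function name and the 90% default are illustrative only.

```python
import math
from collections import defaultdict
from datetime import date


def usefulness_window(access_log, coverage=0.9):
    """For each series, return the message age (in days) covering `coverage` of retrievals."""
    ages = defaultdict(list)
    for series, sent_on, accessed_on in access_log:
        ages[series].append((accessed_on - sent_on).days)
    windows = {}
    for series, days in ages.items():
        days.sort()
        # Position of the smallest age that covers the requested share of retrievals.
        k = max(0, math.ceil(coverage * len(days)) - 1)
        windows[series] = days[k]
    return windows


# Usage with a made-up log: four retrievals of messages sent on the same day.
sent = date(2001, 1, 1)
log = [
    ("permits", sent, date(2001, 1, 2)),
    ("permits", sent, date(2001, 1, 11)),
    ("permits", sent, date(2001, 1, 31)),
    ("permits", sent, date(2002, 2, 5)),
]
window = usefulness_window(log, coverage=0.75)
```

Recomputing such a figure periodically would give one dynamic indicator of how long each records class remains in active use.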

IV. Plan of Work and Deliverables

Year One

Prior to beginning of grant period

RRC: Interview potential technical project leader for RRC from among UT/CS graduate students. This team member will serve as system manager for repository server and user support for designated pilot group. Establish liaison contacts within pilot group for the project.

GSLIS: Form student team for project. Interview potential archival project leader from among GSLIS graduate students. This team member will organize functional analysis, interview, and hand classification tasks.

UT/CS: Form student team for project. Interview potential technical project leader for UT/CS from among UT/CS graduate students. This team member will serve as system manager for development server and technical liaison to RRC.

January

RRC: Assist with final formalization of technical/functional requirements for equipment and software. Contract with Novell consultant. Secure permission for GSLIS students to interview RRC employees in designated pilot group. Pilot group establishes priorities for automation of list of functions identified in previous study. Gather job description materials and system information.

GSLIS: Finalize formalization of technical/functional requirements and procurement of equipment and software. Assist with contracting with Novell consultant. Review previous work by PI’s E-records classes and relevant literature. Design interview instrument on existing e-mail practice at RRC. Review RRC-developed list of functions for potential e-mail implementation.

UT/CS: Assist with final formalization of technical/functional requirements for equipment and software. Meet with GSLIS project team. Review previous work by NLP class and relevant UT/CS work.

February

RRC: Install new server, hardware and software. Novell consultant works with grant-funded technical project leader to set up to capture e-mail from GroupWise system (this task includes making record of which captured e-mails are deleted by users). Pilot group meets with UT project team to discuss goals and priorities. Begin to carry out functional analysis with designated pilot group using GSLIS students. Develop training materials for pilot group e-mail practice, including e-mail implementation of initial top-priority category of transaction.

GSLIS: Provide students and supervision to carry out functional analysis correlating job descriptions and retention schedules. Note that this analysis will include all records, not just those with an e-mail or electronic component, since it is possible that any category may generate e-mail communications. GSLIS/CS team meets with RRC pilot group to discuss goals and priorities.

UT/CS: Install new server, hardware and software. Students and PI begin systems analysis of classification task and experimental design. CS/GSLIS team meets with RRC pilot group to discuss goals and priorities.

March

RRC: Train designated pilot group on simple e-mail best practices and relevant employees on implementation of top-priority transaction as e-mail. Continue functional analysis work. Meet with GSLIS/CS group to hear interim results of functional analysis.

GSLIS: Continue functional analysis work, including administering interviews. Meet with CS/RRC group to present interim results of functional analysis.

UT/CS: Continue experimental design. Meet with GSLIS/RRC group to hear interim results of functional analysis.

April

RRC: Begin e-mail capture this month. Pilot group begins to implement top-priority transaction category in e-mail.

GSLIS: Complete functional analysis work and interviews. Prepare report on functional analysis. Collaborate with CS students to develop possible constraints to assist in narrowing classification tasks.

UT/CS: Complete experimental design. Collaborate with GSLIS students to incorporate possible classification constraints from functional analysis.

May

RRC: Extract first training set from accumulated e-mail archive this month. Relevant members of pilot group evaluate use of e-mail for top-priority transaction category; if successful, choose second similar category for implementation, set up necessary template.

GSLIS: Classify (by hand) first training set. Analyze interviews.

UT/CS: Experiment with first training set: create automated classifications. Prepare report on results.

June

RRC: Evaluate preliminary results of classification task. Bring in feedback from Texas archival/records management/IT/legal community. Implement second transaction category in e-mail.

GSLIS: Analyze interviews. Evaluate preliminary results.

UT/CS: Evaluate preliminary results. UT/CS PI reports on project at professional meeting.

July

RRC: PI assists UT/CS team with several iterations of experimentation on initial training set. PI writes report to NSF.

GSLIS: GSLIS team assists UT/CS team with several iterations on initial training set. PI writes report to NSF.

UT/CS: CS team experiments with several iterations systematically altering parameters. PI writes report to NSF.

August

RRC: Draw complete six-month data set from designated pilot group.

GSLIS: Hand-classify one-month equivalent sample from six-month data set.

UT/CS: Apply developed algorithms/procedures to six-month data set.

September

RRC: Evaluate results.

GSLIS: Evaluate results. GSLIS PI reports on project at SAA meeting.

UT/CS: Evaluate results.

October

RRC: Back to drawing board for several iterations. RRC PI reports on project at ARMA meeting.

GSLIS: Back to drawing board for several iterations.

UT/CS: Back to drawing board for several iterations.

November

RRC: Final evaluation of results on six-month sample. PI begins work on year-end report.

GSLIS: Final evaluation of results on six-month sample. PI begins work on year-end report.

UT/CS: Final evaluation of results on six-month sample. PI begins work on year-end report.

December

RRC: Complete preliminary work with designated pilot group. Deliverables: Preliminary report on correlation between job description and applicable records retention schedules for designated pilot group. First-year report to NSF.

GSLIS: Complete preliminary work with designated pilot group. Deliverables: Data showing correlation between job description and applicable records retention schedules for designated pilot group. Data on actual e-mail practice in designated pilot group. First-year report to NSF.

UT/CS: Complete preliminary work with designated pilot group. Deliverables: Classification algorithm(s) and processes capable of matching performance of human classifier with 85% accuracy. First-year report to NSF.

Year Two

January

RRC: Draw complete sample to date of e-mails from designated pilot group. Identify second pilot group. Begin correlation study on schedules and job descriptions for second RRC pilot group.

GSLIS: Prepare to scale up by beginning correlation study on schedules and job descriptions for second RRC pilot group. Begin analysis of usage data.

UT/CS: Test developed methods on full-year sample from pilot group.

February

RRC: Evaluate results of full-year classification. Complete correlation study for second pilot group. Supply already-collected full-year data from second group for analysis.

GSLIS: Evaluate results of full-year classification. Complete correlation study for second pilot group. Continue analysis of usage data.

UT/CS: Evaluate results of full-year classification.

March

RRC: Compare usage patterns of first group (with training) to usage patterns of second group (without training).

GSLIS: Compare usage patterns of first group (with training) to usage patterns of second group (without training).

UT/CS: Apply classification method to full-year data from second group.

April

RRC: Evaluate results of application to second group. Carry out specific requirements analysis for proposed internal RRC use of e-mail knowledge base.

GSLIS: Evaluate results of application to second group. Assist with requirements analysis for proposed internal RRC use of e-mail knowledge base.

UT/CS: Evaluate results of application to second group.

May

RRC: Provide full-year sample from entire agency less specific agreed categories.

GSLIS: Carry out rule-guided mapping of job descriptions onto schedules for whole agency.

UT/CS: Apply classification method to full-year data from entire agency.

June

RRC: Evaluate rule-guided mapping of job descriptions onto schedules for whole agency.

GSLIS: Complete rule-guided mapping of job descriptions onto schedules for whole agency.

UT/CS: Continue analysis of entire agency data. UT/CS PI reports on project at professional meeting.

July

RRC: Evaluate results of full-year analysis.

GSLIS: Evaluate results of full-year analysis.

UT/CS: Evaluate results of full-year analysis.

August

RRC: Assist with design of e-mail repository prototype, especially specifications for metadata requirements and user interface, working with pilot group.

GSLIS: Assist with design of e-mail repository prototype, especially specifications for metadata requirements and user interface. Develop instrument(s) for analysis of usability.

UT/CS: Carry out systems analysis for e-mail repository (data warehouse) prototype capable of providing user access and secure storage; adapt SDSC design to RRC environment.

September

RRC: Continue work on repository prototype; assist with developing specifications for required hardware and software. Develop training materials for usage of repository prototype.

GSLIS: Continue work on repository prototype; develop specifications for and procure required software and hardware. GSLIS PI reports on project at SAA meeting.

UT/CS: Continue work on repository prototype; tackle indexing strategies, means of handling dynamic evolution of repository over time, and inclusion of usability monitoring.

October

RRC: Train pilot group on usage of repository prototype. Implement repository prototype. RRC PI reports on project at ARMA meeting.

GSLIS: Implement repository prototype.

UT/CS: Implement repository prototype.

November

RRC: Evaluate usability of repository prototype.

GSLIS: Evaluate usability of repository prototype.

UT/CS: Evaluate usability of repository prototype.

December

RRC: Present final results to agency. Deliverables for overall project: Final report to NSF. Tested method for mapping job descriptions onto schedules for automatic classification. Vastly improved e-mail recordkeeping practice capable of supporting online transactions for agency business. Proof of concept for agency use of e-mail as internal intelligence asset, to include cost-benefit analysis of application.

GSLIS: Deliverables for overall project: Final report to NSF. Establishment of best practices for management of e-mail of permanent value, to include cost-benefit analysis of agency vs archival custodianship of permanent e-mail records. Proof of concept for scaling of SDSC archiving methods to state agency scale. Study of e-mail use in state government setting and its impact on bureaucratic structure. Educational benefits: Active involvement by GSLIS students with real-world electronic records management and archiving project of potentially national significance.

UT/CS: Deliverables for overall project: Final report to NSF. Analysis/classification methods based upon domain knowledge combined with textual analysis. Methods developed may have potential commercial application to recordkeeping systems. Educational benefits: Active involvement by UT/CS students with solution of real-world knowledge-management project of potentially national significance.
