EDGAR-Analyzer: automating the analysis of corporate data
contained in the SEC’s EDGAR database
John Gerdes Jr.*
The A. Gary Anderson Graduate School of Management, University of California, Riverside, CA, 92521, USA
Abstract
Publicly owned companies, their officers and major investors are required to file regular disclosures with the Securities and
Exchange Commission (SEC). To improve accessibility to these public documents, the SEC developed the EDGAR
(Electronic Data Gathering, Analysis and Retrieval) electronic disclosure system. This system provides ready, free access to all
electronic filings made since 1994. The paper describes a tool that automates the analysis of SEC filings, emphasizing the
unstructured text sections of these documents. To illustrate the capabilities of the EDGAR-Analyzer program, results of a large-
scale case study of corporate Y2K disclosures in 18,595 10K filings made from 1997 to 1999 are presented.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: SEC; EDGAR; Tool; Financial Analysis; Functional decomposition model; Y2K
1. Introduction
The recent trend for both the public and private
sectors is to make information web-accessible. Putting
data on-line leverages the universality of the Internet,
improves user access, speeds the dissemination of
information, and reduces costs for both the provider
and user. The Securities and Exchange Commission
(SEC), through its EDGAR (Electronic Data Gather-
ing, Analysis and Retrieval) database initiative, was
an early innovator in this area. The importance of the
EDGAR database rests in the scope of the data it
contains—disclosures of financial and operational
performance of all publicly traded companies. It has
been argued that under the Freedom of Information
Act mandate, the Commission has an obligation to
both promote and provide ready access to these docu-
ments [25,40].
Since its inception in the mid-1930s, the primary
mission of the SEC has been to protect investors and
maintain the integrity of securities markets. As part of
this effort, domestic, publicly held companies are
required to disclose complete and accurate informa-
tion about their operations, as well as any event that
could materially impact them [36]. This required
information is extensive. The SEC receives 12 million
pages of documents annually [29]. Manual processing
of this much information is both expensive and time
consuming. Having to physically handle paper filings
also limits the timely access to this important, public
information.
To address these problems, the SEC began devel-
oping the electronic disclosure system in 1983. After
initial successful prototyping and testing, the Com-
mission mandated electronic filings in 1994 [33].
Even though these documents were being stored in
0167-9236/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-9236(02)00096-9
* Tel.: +1-909-787-4493.
E-mail address: [email protected] (J. Gerdes Jr.).
www.elsevier.com/locate/dsw
Decision Support Systems 35 (2003) 7–29
electronic form, their accessibility was still quite
limited. Data was made available through five nation-
wide SEC reading rooms, and a limited number of
private companies (primarily Mead Data Central)
which provided on-line, tape, CD-ROM or paper
versions of EDGAR Data [21]. A 1993 NSF research
project was initiated to investigate the feasibility of
disseminating EDGAR data through the Internet.
Dubbed EDGAR on the Internet or EOI, this project
demonstrated that it was feasible to provide access
through electronic mail, ftp, gopher and World Wide
Web. In late 1995, the base EDGAR system and
technology developed through this project were trans-
ferred back to the SEC, which used it as the basis for
its own web-based services. Since that time the
Commission has continuously improved and ex-
panded the EDGAR System. In May 1999, they
started accepting filings submitted in HTML and
PDF formats. The EDGAR database has grown to
include over 1.7 million documents representing 610
GB of data, ranking it the 25th largest web accessible
database [7]. For a more detailed history and develop-
ment of the EDGAR system, the reader is directed to
Refs. [5,20,21,33,35].
EDGAR has become a valuable resource for both
investors and the securities markets. Although access
has been greatly improved, the ability to automatically
analyze these filings is limited due to the semi-
structured nature of the documents. The SEC requires
firms to incorporate SGML tags to facilitate the
identification of specific data fields and certain docu-
ment sections. However, these tags provide direct
access to only a small portion of the data contained
in these documents. The typical filing consists of two
major sections—the SEC Header, which identifies the
form being filed along with basic corporate informa-
tion (i.e., company name and address, accounting
contact, etc.), followed by the Text section containing
the filing’s main descriptive content. Depending on
the type of form being filed, an additional Financial
Data Schedule (FDS) may be included at the end of
the filing [29]. This Schedule is submitted with each
10K and 10Q filing, as well as some special Schedules
filed by investment and public utility holding compa-
nies [34]. The FDS utilizes an attribute-value
scheme: a pairwise, simple-to-parse representation
of the standardized financial data contained in the
filing. In addition to the FDS, only the Header section
contains tags that identify individual data fields. Since
the content of the Text section is free-form text,
automated data extraction from this section is quite
difficult [26]. Even though the Text section does
include <TABLE> tags to identify embedded tables,
extracting data from these tables is still quite chal-
lenging because there is no imposed structure to the
table layout [27]. Note, as of version 7.0, the EDGAR
System no longer requires firms to file FDS docu-
ments [34].
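The FDS's tag-per-line layout makes it the one part of a filing that is straightforward to parse mechanically. A minimal sketch in Python follows; the sample tags and values are illustrative, not a complete or official FDS tag set:

```python
import re

def parse_fds(text):
    """Extract <TAG>value pairs from a Financial Data Schedule
    block. Each tag is an SGML-style open tag with its value on
    the same line; no closing tags are assumed."""
    pairs = {}
    for match in re.finditer(r"<([A-Z0-9-]+)>([^\n<]*)", text):
        tag, value = match.group(1), match.group(2).strip()
        if value:
            pairs[tag] = value
    return pairs

# Illustrative sample, not a complete FDS.
sample = """<PERIOD-TYPE>YEAR
<FISCAL-YEAR-END>DEC-31-1998
<TOTAL-ASSETS>1,234,567
<NET-INCOME>89,012"""

print(parse_fds(sample)["TOTAL-ASSETS"])  # 1,234,567
```

Because the scheme is pairwise and flat, a single pass with a regular expression suffices; nothing comparable works on the free-form Text section.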
The sheer amount of information available through
on-line databases such as EDGAR highlights the need
for automated data analysis tools. Although simple,
text-based search tools exist, they cannot handle com-
plex, multi-dimensional inquiries—more advanced
search tools are needed. In this paper, we present an
initial attempt at developing such a tool. EDGAR-
Analyzer is an advanced, multi-dimensional search
tool designed to facilitate computer-assisted analysis
of unstructured, text-based data. Developmental and
operational issues of this tool are discussed.
The next section briefly discusses the SEC’s
EDGAR database and the currently available tools
that provide access to this data. Section 3 focuses on
the development of the EDGAR-Analyzer tool. To
illustrate the tool’s capabilities, it was used in a large-
scale study of Y2K disclosures made in annual reports
filed from 1997–1999. To provide a basis for this
study, the issues surrounding the Y2K problem are
outlined in Section 4, followed by a discussion of the
exploratory study and the results obtained. Section 5
discusses the operational issues surrounding the use of
EDGAR-Analyzer. Finally, we summarize our find-
ings and give some direction for future research.
2. SEC’s EDGAR database
‘‘The laws and rules that govern the securities
industry in the United States derive from a simple
and straightforward concept: all investors, whether
large institutions or private individuals, should have
access to certain basic facts about an investment prior
to buying it. To achieve this, the SEC requires public
companies to disclose meaningful financial and other
information to the public, which provides a common
pool of knowledge for all investors to use to judge for
themselves if a company’s securities are a good
investment.’’ [36]. All public, domestic companies
with assets exceeding $10 million and at least 500
stockholders fall under the SEC’s reporting guide-
lines. In addition, certain individuals must also file
with the Commission. Insider trades reported on
Forms 3, 4, and 5 are an important part of EDGAR.
Table 1 identifies the common forms periodically filed
with the SEC.
To improve access to this information, the SEC
developed the EDGAR system, currently in its 8th
revision [37,42]. It has evolved to the point that it
automates ‘‘the collection, validation, indexing,
acceptance, and forwarding of submissions by com-
panies and others who are required by law to file
forms with the U.S. Securities and Exchange Com-
mission (SEC). Its primary purpose is to increase the
efficiency and fairness of the securities market for the
benefit of investors, corporations, and the economy by
accelerating the receipt, acceptance, dissemination,
and analysis of time-sensitive corporate information
filed with the agency’’ [31].
Beside the traditional SEC Reading Rooms, the
Commission provides four Internet-based avenues
through which the EDGAR data can be accessed, as
follows.
• Quick Forms Lookup—a web-based search utility
that allows the user to look up company-specific
filings. This tool has very limited search capabilities,
allowing the user to restrict the search based only on
filing date and form type, and has no full-text
search capability (see http://www.sec.gov/edgar/
searchedgar/webusers.htm).
• Search EDGAR Archives—a web-based search
utility that permits a full-text search of the tagged
headers in the EDGAR filings (the text search does
not extend to the filing body). Although the Boolean
search capability is quite flexible, the interface is
cumbersome. The user must be aware of which fields
exist in the headers to take full advantage of these
features. The only explicit option available to the user
is to restrict the search based on filing dates.
• FTP Access—this mode is used primarily for
bulk downloads of corporate filings for subsequent
remote processing. The SEC provides daily, quarterly
and annual indexes sorted by company name and form
type. These indexes provide the company name, form
type, CIK (Central Index Key, uniquely identifying
the submitting company), date filed, and URL (the
Internet location where the full text of the filing can
be obtained).
• Direct bulk feed of EDGAR data—the data
accessible through both the SEC Web and FTP sites
is time-delayed at least 24 hours [31]. As a premium
service, the SEC offers a subscription for 'real-time'
access to all EDGAR data through a direct bulk feed.
This option is used by commercial information
brokers who, in turn, provide real-time access to their
customers.
By law, corporate public disclosures are required to
be accurate and clearly represent the operations of the
firm [36]. This makes the data contained in the
EDGAR database quite valuable to investors, corpo-
rations and security markets. As a result, a number of
tools have been developed to facilitate data access
(Table 2 contrasts the features of the different tools).
The following section gives an overview of the data
contained in the EDGAR database. This is followed
by a brief discussion of the different tools currently
available to analyze this data.
2.1. Underlying data in SEC’s EDGAR database
The EDGAR database contains all filings that have
been electronically filed since January 1, 1994. (Note,
Lexis/Nexis, Disclosure, and Westlaw have informa-
tion dating as far back as 1968, but this information is
privately held and not contained in the SEC database.)
Because the regulation requiring electronic filings was
Table 1
Common SEC Forms accessible through EDGAR
• Annual Reports (10K, 10-KSB, 10-K405)
• Quarterly Reports (10Q, 10-QSB)
• Special Reports (8-K, 6-K)
• Proxy Filings (DEF 14A, PRE 14A)
• Insider Trading (144, 3, 4, 5)
• IPO Filings (S-1, SB-1, F-1, 424B, SB-2)
• Tender Offers (14D-1)
• Response to Tender Offers (14D-9)
• Mutual Fund Filings (N-1A, N-30D, 497)
• Mergers and Acquisitions (13D, 14D-1, 14D-9, S-4)
• Employee Benefit Plans (S-8)
• Secondary Stock Offerings (S-2, F-2, S-3, F-3)
• REITs (Real Estate Investment Trusts) (S-11)
• Small Caps (SB-1, 10-KSB, 10-QSB)
• Registration Statements (S-3, 424B)
• Going Private (13E3, 13E4)
Table 2
Comparison of features and capabilities of free and third-party tools for accessing EDGAR filings
Columns (left to right): SEC Edgar, SEC Info, 10K Wizard, EdgarScan, FreeEdgar, Yahoo! Financial, Search-SEC
Tool Focus
Individual company data a a a a a a a
Multiple company data a a a No a No No
Single form a a a a a a a
Multiple forms a a a a,b a a,b a,b
All SEC forms a a a No a No No
SEC Forms Supported
Annual Reports (10K, 10-K405) a a a a a a c a
Quarterly Reports (10Q, 10-QSB) a a a a a a,c a
Current Reports (8-K, 6-K) a a a a a a,c a
Proxy Filings (DEF 14A, PRE 14A) a a a a a No a
Mergers and Acquisitions (S-4) a a a a a No a
Insider Trading (144, 3, 4, 5) a a a a a No a
IPO Filings (S-1, 424B, SB-2) a a a a a No a
Prospectus (485) a a a a a No No
Mutual Funds (N-1A, N-30D, 497) a a a a a No No
Private Placement Offerings No No No No No No No
Mergers and Acquisitions (13D, 14D-1, 14D-9, S-4) a a a a a No No
No Action Letter No No No No No No
V33 Act Deals (F-1, F-10, F-1MEF, F-2,
F-3, F-3D, F-3MEF, F-7, F-8, F-9, F-10, N-2, S-1,
S-1MEF, S-11, S-11MEF, S-2, S-2MEF, S-20, S-3,
S-3D, S-3MEF, S-B, SB-1, SB-2, and SB-2MEF)
a a a a a No a
Data Reported
Full Filing a a a a a d a
Context of Text Search/Highlight search words No a No No a No No
Extracted Financial Data No a No a a No No
Balance Sheet No a a a a No No
Income Statement No a a a a No No
Cash Flow No a a a a No No
Financial Ratios No a a a a No No
Source of Extracted Financial Data
Financial Data Section (FDS) No a a No No No No
Financial Statements in Filing Body No a No a a No No
Available Constraints
Company name a a a a a No No
Stock Ticker No No a a a a a
CIK (SEC’s Central Index Key) No a No No No No No
Period Date No No No e No No No
Filing Date a a a e No a,c a,f
Today’s Filings a a a e a a a
Date Ranges a a a No No No a,f
Entire EDGAR Database (since 1/1/94) a a a e a No a
Header Fields
Company Name a a a a a No No
Address (i.e., City, State, Zip Code) a a No No a No No
SIC Code No a a No a No No
Industry No a a g No No h
Full Text Search i No a No a No No
phased-in over a 3-year period, some filings prior to
May 1996 were submitted on paper, and are therefore
not included in the EDGAR database. However, as of
May 1996, all public firms subject to the SEC’s filing
requirements must submit forms electronically [31].
Official filings must be either in a tagged-text or
HTML format. PDF versions are also accepted, but
only as a supplement to the official filing [31].
The format of documents submitted and stored in
the EDGAR database are based on broad guidelines
set forth by the Securities and Exchange Commission.
These guidelines identify which sections each form
should contain along with the type of accounting
information that should be reported [8]. Unfortu-
nately, there is a great deal of variety in how this
information is presented. The Commission requires
certain header tags such as the company’s name,
address, firm’s SIC code, and auditor’s name. How-
ever, the filing’s body consists primarily of unstruc-
tured, free-form text. Filing guidelines support the use
Table 2 (continued)
Columns (left to right): SEC Edgar, SEC Info, 10K Wizard, EdgarScan, FreeEdgar, Yahoo! Financial, Search-SEC
Available Constraints
Boolean Text Searches
Evidence Constraints
AND, OR, NOT Operators a a a No a No No
Stemmed words a a a No a No No
Thesaurus a a No No No No No
Proximity Constraints
Tagged Field Value a a No No No No No
Within n Characters /NEAR a No a No a No No
Other Operators
Case-Sensitive Search a No No No No No No
Relevance Scoring a No No No a No No
Report Output Formats
ASCII a a No a a No No
RTF No a a a a No No
CSV (Spreadsheet) No a l a a No No
HTML a a a a No a a
PDF j No No No No No No
XML No a No a,k No No No
Context for full text search results No No No No a No No
Other Services
Predefined Searches No a a a No No No
Watch list No a a l a No No
Custom Research Service No No No No No No No
Real time No a a No a a a
Related Information Available No a a a a a a
a Feature supported.
b Displays all filings, and lets user select from list.
c Limited to the past 3 weeks.
d Synopsis of the filing (http://help.yahoo.com/help/us/fin/research/research-01.html).
e By default all filings for the company are displayed, and the user can pick the desired filing based on period or filing date.
f Limited to Today's Filings, This Week's Filings, or All Filings.
g Links are provided to pull up an industry comparison, with all leading firms hyperlinked.
h Monthly Public Utility report, Monthly Real Estate Report and World Bank Report are available.
i Search is only of the document header.
j Although companies can file PDF data with EDGAR, these files are not available through their publicly available on-line service.
k Experimental.
l Available through premium service.
of some SGML tags in the filing body to facilitate
viewing and printing on the Internet, but these are not
required [5]. Unfortunately, the Commission’s filing
submission software does not validate document-for-
matting correctness. The improper structuring of tags
results in the poor identification of data objects, which
complicates the automated parsing of these documents
[21]. To a limited extent, this problem is being
addressed in the Commission’s modernization efforts.
As of version 8.0, EDGARlink automatically
checks and validates the formatting of the document
header, but still does not validate the structure of the
filing body [37,39].
Standard financial statements contained in a
filing's body (i.e., income statements, balance sheets,
cash flow statements, etc.) are more structured than
the remaining text. Unfortunately, extracting mean-
ingful information from even this data can be chal-
lenging. For example, terms are not used consistently
among all filers. Even within a given filing, errors and
inconsistencies will occur making it difficult to auto-
mate the analysis process [27]. Some values can be
found in the FDS’s tagged fields, but this data is not as
detailed as regular financial statements. For example,
the FDS only provides current period data and aggre-
gate values, omitting much of the supporting data
presented in financial reports. The FDS section also
does not report footnotes to financial reports, often a
critical source of important information about the
firm’s financial statements.
To appreciate the complexities involved in analyz-
ing these free-form documents, consider EdgarScan,
PricewaterhouseCooper’s innovative tool that extracts
financial tables from SEC filings. However, even with
extensive post processing, EdgarScan can only proc-
ess 90% of the filings automatically [27]. The steps
EdgarScan goes through to provide accurate and
consistent data include (adapted from Ref. [27]):
1. Finding the relevant financial tables in the filing.
2. Finding the boundaries (start and end) of each
table, in a manner that is resilient to page breaks.
3. Finding the column headers and column bounda-
ries for a table.
4. Finding the units (e.g., dollars in thousands)
usually expressed near the table heading.
5. Recognizing line item labels, compensating for
wrapped lines.
6. Compensating for long line labels that ‘‘push
over’’ data values in the first column.
7. Normalizing labels to a canonical form (e.g.,
‘‘Sales’’ and ‘‘Total Revenues’’ mean the same
thing).
8. Inferring the underlying mathematical structure of
the table (e.g., recognizing subtotals), and possi-
bly recognizing mathematical errors in the filing.
9. Extracting the numeric values based on the column
boundaries, while compensating for poorly for-
matted filings with wandering columns.
10. Validating the data by cross checking with other
tables.
11. Resolving the format of footnotes to financial
tables. A wide variety of numbering and layout
conventions are used to identify footnotes (includ-
ing not numbering them at all, and relying solely
on layout).
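Two of the simpler steps above, detecting the units near a table heading (step 4) and normalizing line-item labels (step 7), could be sketched as follows. The unit phrases and synonym table are illustrative assumptions, not EdgarScan's actual rules:

```python
import re

# Illustrative unit phrases (step 4); real headings vary widely.
UNIT_SCALES = {"in thousands": 1_000, "in millions": 1_000_000}

# Illustrative synonym table for canonical labels (step 7).
CANONICAL_LABELS = {
    "sales": "revenue",
    "total revenues": "revenue",
    "net revenues": "revenue",
}

def table_scale(heading):
    """Return the multiplier implied by a table heading (default 1)."""
    lowered = heading.lower()
    for phrase, scale in UNIT_SCALES.items():
        if phrase in lowered:
            return scale
    return 1

def normalize_label(label):
    """Collapse whitespace and map a line-item label to its
    canonical form when a synonym is known."""
    key = re.sub(r"\s+", " ", label.strip().lower())
    return CANONICAL_LABELS.get(key, key)

print(table_scale("(Dollars in thousands)"))  # 1000
print(normalize_label("Total  Revenues"))     # revenue
```

The remaining steps (column-boundary detection, subtotal inference, cross-table validation) are substantially harder, which is why even extensive post-processing leaves a residue of filings that must be handled manually.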
2.2. State of the art in EDGAR analysis tools
Various tools have been developed that provide
access to the SEC filings [5,15,28]. Three general
classes of tools have emerged—third party, free, and
commercial tools (see Table 3).
The third-party tools contract for their content from
the primary tool providers. These secondary sites
typically are portals or special interest sites that
aggregate content from multiple sources. The capa-
bilities of these tools vary considerably. For example,
the SEC filings section of Yahoo!Financial provides
free, real-time access to select SEC filings. However,
only ‘glimpses’ (summaries, not complete filings) of
10K, 10Q, and 8-K class filings are available, with the
user routed to Edgar-Online for more complete infor-
mation. Only 3 weeks of historical data are available,
and although filing summaries can be displayed, there
is no provision for initiating a text search. In contrast,
RagingBull, powered by 10K Wizard, is a full-fea-
tured site with functionality equivalent to the native
10K Wizard site.
From the researcher’s perspective, the other two
segments (free and commercial tools) are more impor-
tant. Since all of these tools utilize the SEC filings as
their primary data source, they tend to differentiate
themselves primarily through their value-added fea-
tures. All provide access to full text of the filings.
Some use extensive indexing to provide convenient,
direct access to individual document subsections, such
as the document header, management’s discussion and
the various financial statements. There is also varying
support for different output formats, including plain
text, RTF (rich text format, compatible with most
word processors), HTML, and CSV (a spreadsheet
format used for financial tables).
One of the most useful features of these tools is
their extensive search facilities. Again, search capa-
bilities vary considerably. The user can implement a
full text search with most of these tools. They allow
the user to optionally refine a search by specifying
explicit field constraints, such as the company’s name,
stock ticker symbol, form type, business/industry
sector (based on SIC code) and filing date. These
two features used in combination can search for a
specific term in a single filing; broaden the search to
include all filings made by that company; or even to
expand the search over the whole EDGAR database.
Other useful features include display of the context for
search results and relevancy ratings. The search con-
text is done by either showing a block of text
surrounding the search terms that are found, or by
highlighting the words in the document. Relevancy
ratings of search results are typically based on the
count of search words in each document.
Commercial tools (those for which there is a fee
for essential features) tend to have some additional
value-added features which differentiate them from
the free tools. Often this entails access to non-
EDGAR content. For example, Lexis/Nexis provides
access to a large array of business, industry, and gov-
ernment information. Some tools (i.e., Lexis/Nexis,
and Disclosure) have filings that predate the elec-
tronic filing regulations and thus are not found in the
SEC’s electronic system. Additional services include
specialized database content (i.e., No-Action letters,
private offering circulars, etc.), premium watch/alert
services (which automatically alert users when fil-
ings of interest are posted), ability to store com-
monly used queries, and the availability of customer
support.
Given that these tools all use the same underlying
data, they have had to differentiate themselves based
on other value-added features. Kambil and Ginsburg
suggest three strategic dimensions for information
Table 3
List of tools that provide access to EDGAR data
Research Tool Company URL
Third-Party tools
IPO Powered by 10K Wizard http://www.ipo.com/marketdata/edgarsearch.asp
Raging Bull Powered by 10K Wizard http://10kwizard.ragingbull.com/
Yahoo!Financial Powered by Edgar-Online http://biz.yahoo.com/reports/edgar.html
Free Tools
10K Wizard 10K Wizard http://www.10kwizard.com/
EDGAR SEC http://www.sec.gov/edgar.shtml
EdgarPro Edgar-Online http://www.edgarpro.com/Home.asp
EdgarScan PricewaterhouseCoopers http://216.139.201.54/recruit/edu.html
Freedgar Edgar-Online http://www.freeedgar.com/
Search-SEC Search-SEC http://www.search-sec.com/
SEC Info Finnegan O’Malley & Co. http://www.secinfo.com/
Commercial Tools
Disclosure, Edgar
Direct, Global Access
Thomson Financial/Primark http://www.primark.com/pfid/index.shtml
Edgar-Online Edgar-Online http://www.edgar-online.com/
Lexis/Nexis Lexis/Nexis http://web.lexis-nexis.com/universe/
form/academic/s�secfile.html
Livedgar Global Securities Information http://www.livedgar.com/
SECnet Washington Service Bureau http://www.wsb.com/online/secnet/index.pl
vendors operating in Web-enabled environments (see
Fig. 1): Value-Added Content, Process and Interac-
tion. Most vendors have already added value by
linking SEC content to non-EDGAR data such as
Ticker symbols. Most have also added value along the
Process Dimension by providing full text searches,
automatic data extraction, watch lists and alert serv-
ices. These technological innovations can typically be
easily copied, and thus do not represent a sustainable
advantage for any particular vendor. In contrast,
leveraging unique intellectual capabilities can provide
points of distinction. They may be based on propri-
etary methods of analyzing the public EDGAR data
alone, or in combination with proprietary data. The
third dimension deals with the amount of custom-
ization available to the user. The most basic is a
generic interface that does not provide for user cus-
tomization. The SEC’s EDGAR site would fall under
this category. Most EDGAR tool vendors provide
some means to personalize the user interface through
extensive search options and customized alert lists. To
date, tool vendors have not adopted a significant
community-based interface on their own sites.
Instead, they have typically acted as content providers
for special interest or portal sites that support com-
munity-based interaction. For example, Yahoo!Fi-
nance uses EDGAR Online to deliver their SEC
filings page.
3. Development of EDGAR-Analyzer, a text-based
analysis tool
EDGAR-Analyzer is designed to facilitate the
analysis of SEC Filings. Although the Commission
specifies the content and to some extent the layout of
the various filings, much of the information is con-
tained in unstructured text. EDGAR-Analyzer is a
general-purpose tool, capable of searching for and
recording evidence of user-specified subjects. Using
data contained in the filing header, the program
prescreens filings and analyzes only those forms that
correspond to the time period and filing types of
interest. It sequentially analyzes SEC filings, looking
for evidence of a particular subject, concept or issue,
and subsequently saves this evidence in a local data-
base. Objective information from the tagged data
fields is recorded for each filing, including those that
do not address the issue of interest. The information
captured includes generic, corporate information (i.e.,
company name, CIK number, SIC number, etc.), form
information (i.e., form type, filing date, period date,
Fig. 1. Web Information System-enabled information vendor strategies (from Ref. [21]).
etc.), and tagged financial data from the FDS when
available.
The underlying EDGAR filings are assumed to
conform to a Hierarchical, Functional Dependency
Model. Under this model, general higher-level objects
are recursively constructed into increasingly specific
objects (i.e., a filing consists of multiple sections,
with each section consisting of multiple paragraphs
made up of multiple sentences containing multiple
words). At all levels, each object has a given central
focus. The higher-level objects are necessarily
broader in scope. Objects can deal with multiple
subjects, but this is undesirable. Consider a long report
made up of a single paragraph. Breaking it up into
separate sections, each with multiple paragraphs allows
for compartmentalizing of central concepts, and makes
it easier to understand. It is further assumed that under-
lying each subject is a set of critical, or at least
important, factors. When a given subject is addressed,
a clearer picture of the issues emerges as more of these
critical factors are considered. This could result in
better analysis, and improve the reader’s confidence
that important matters were not overlooked. Similarly,
a factor’s relative importance to a given subject is
reflected by the frequency that this factor is discussed
within that subject’s context. Consequently, even when
documents are relatively unstructured (as is the case
with SEC filings), issues surrounding a particular
subject of interest are assumed to be in relatively close
proximity to each other. Conceptually, each SEC filing
is viewed as a composite of short discussions address-
ing different major topics. It is assumed that within
each discussion the company focuses on those factors it
feels are important. Due to their loose structure, there is
no presumption that these documents have sections that
can be cleanly divided into blocks, each dealing with a
single major topic. A specific critical factor can be
discussed in relation to many broad topics. For exam-
ple, lawsuits, patent issues, labor or employee impli-
cations, and international issues can each impact many
different aspects of the firm. Searching the whole filing
for these general concepts would tend to have a high hit
rate, but a hit found in this manner does not necessarily
imply a relationship to the specific issue being studied.
A consequence of this data model is that search
accuracy can be improved by implementing a tiered
strategy. At issue is the high number of false positives
obtained with a simple keyword search when the
context of the word usage is not considered. The
number of false positives can be reduced by first
searching for terms specific to the main subject of
interest, extracting the context where this subject is
discovered, and then doing the final search on this
smaller block to look for terms related to contribu-
ting elements. Note, because of the variability found
in the free-form text, this approach is still not fool-
proof, and manual inspection of the extracted text
blocks is still required. However, it can greatly
reduce the amount of information that must be
manually processed.
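The tiered strategy might be sketched as follows, with paragraphs as the unit of context; the subject and factor terms below are illustrative:

```python
def tiered_search(filing_text, subject_terms, factor_terms):
    """Two-pass search: (1) extract every paragraph that mentions
    a subject term, then (2) look for factor terms only within
    those paragraphs, cutting out-of-context false positives."""
    subjects = [t.lower() for t in subject_terms]
    contexts = [
        p for p in filing_text.split("\n\n")
        if any(t in p.lower() for t in subjects)
    ]
    combined = " ".join(contexts).lower()
    hits = {t: t.lower() in combined for t in factor_terms}
    return contexts, hits

filing = (
    "The company faces ongoing litigation unrelated to systems.\n\n"
    "Year 2000 remediation is underway; estimated cost is $2 "
    "million and contingency plans are being developed."
)
contexts, hits = tiered_search(
    filing, ["year 2000", "y2k"], ["cost", "contingency", "litigation"])
print(hits)  # {'cost': True, 'contingency': True, 'litigation': False}
```

Note that "litigation" does appear in the filing, but outside any subject paragraph, so it is correctly not counted as a Y2K-related factor.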
3.1. EDGAR-Analyzer
Using a GUI, the user specifies the
desired time period, forms and specific subjects or
terms of interest. The user can also specify which
tagged data fields to record, and any sub-concepts
that should be captured within the broader text
search. This search profile information is stored in
a file, which allows the distributed analysis of filings.
At this point, the program has enough information to
begin the search.
The program uses the index files stored on the SEC
FTP site to identify records of interest. These indexes
provide the form type, company name, file size,
submission date, and URLs of each filing, with the
URL identifying the Internet address of the filing’s
full text (see Table 4).
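Assuming a pipe-delimited index record (the layout used by EDGAR's master index files; other index files are fixed-width), one line could be parsed as below. The sample record is illustrative and its accession number is made up:

```python
def parse_index_line(line):
    """Split one pipe-delimited EDGAR index record into a dict.
    Assumed layout (master-index style):
    CIK|Company Name|Form Type|Date Filed|File Name."""
    cik, name, form, date, path = (f.strip() for f in line.split("|"))
    return {
        "cik": cik,
        "company": name,
        "form": form,
        "filed": date,
        "url": "ftp://ftp.sec.gov/" + path,
    }

# Illustrative record; the accession number is invented.
record = parse_index_line(
    "320193|APPLE COMPUTER INC|10-K|1998-12-23|"
    "edgar/data/320193/0000320193-98-000105.txt")
print(record["form"])  # 10-K
```

Prescreening on the parsed form type and date fields is what lets the program avoid downloading filings outside the study's scope.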
Having prescreened the filings, the full text of the
first targeted filing is downloaded from the SEC site.
The program searches the filing’s text section for
evidence of user-specified concepts and issues using
a keyword search. When a keyword is located, the
whole paragraph containing that keyword is extracted
and placed in a separate text block, thereby capturing
the usage context. Multiple context passages are often
extracted from a given filing. Once the filing text has
been completely processed, the system reanalyzes the
extracted text blocks for evidence of specific factors
of interest to the researcher. It sets Boolean fields in
the output database indicating if evidence of a specific
issue is found. For example, the extracted text block
could be searched for evidence that:
. management feels a certain issue would (or would
not) have a material impact,
J. Gerdes Jr. / Decision Support Systems 35 (2003) 7–29 15
. a similar project has been completed, thereby
improving the likelihood for success,
. cost figures are provided, or
. international issues appear to be important.
The analysis of the filing text uses a case-insensitive,
literal string equality operator for single and
multi-word terms. In the current version, there is no
support for Regular Expressions, which would
automate the search for common variants of the same
term (e.g., plurals and different tenses) [14]. Also not
supported in this version is automatic handling of
synonyms.
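The variant problem the current version leaves open can be illustrated with a small pattern builder. This is a rough stand-in for the missing Regular Expression support, not part of the tool; the suffix list is an assumption, and real morphological handling (irregular plurals, internal vowel changes) needs more than suffixes.

```python
import re

def variant_pattern(stem, suffixes=("", "s", "ed", "ing")):
    r"""Build one case-insensitive pattern covering a stem plus common
    suffixes.  Word boundaries (\b) keep the pattern from firing
    inside longer words such as 'reassessment'."""
    alts = "|".join(re.escape(stem + s) for s in suffixes)
    return re.compile(r"\b(?:" + alts + r")\b", re.IGNORECASE)
```

A single pattern built this way replaces several literal search strings, e.g. `variant_pattern("assess", ("", "ed", "ing", "ment"))` covers 'assess', 'assessed', 'assessing', and 'assessment' in one pass.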
The final stage in analyzing the SEC filing is a
manual review of data generated by EDGAR-Ana-
lyzer. Because of the variability of the documents, the
data collected has to be verified by looking at the raw
filings and double-checking the information col-
lected. Before pulling the documents up in a word
processor, the targeted keywords and phrases are
highlighted (i.e., bold-faced, increased font size, and
a color change) using rich text format tags. High-
lighting the targeted keywords facilitates the manual
review.
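The highlighting step might be sketched as follows. The RTF control words used (\b for bold, \fs28 for a larger font) are standard, but the exact styling the original tool applied is not documented, so the choice here is illustrative.

```python
import re

def highlight_rtf(text, keywords):
    r"""Wrap each keyword occurrence in RTF bold and enlarged-font
    control words ({\b\fs28 ...}) so the targeted terms stand out when
    the extracted text is opened in a word processor.  Matching is
    case-insensitive and the original casing is preserved."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords),
                         re.IGNORECASE)
    return pattern.sub(lambda m: r"{\b\fs28 " + m.group(0) + "}", text)
```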
3.2. Operational issues
The use of SEC data as the primary data source
introduces a number of important operational issues.
First, it is very difficult to cross-link SEC filings with
outside information. In each filing, the SEC requires
companies to include their CIK (Central Index Key)
number—a unique corporation identifier assigned by
the SEC. Unfortunately, other data sources do not
include this identifier, using instead the company’s
CUSIP (Committee for Uniform Security Identifica-
tion Procedures) number and/or its stock ticker sym-
bol. ‘‘The SEC does not, in general, use the ticker
symbols or CUSIP number in keeping track of com-
panies. The ticker symbol is the property of the
exchanges that issue them, and they are not required
to file the symbols with the SEC’’ [38]. As a result,
establishing a link between the SEC data and these
external data sources can be difficult. It may be
possible to use the company’s name, but this can
introduce potential errors in cases where the match is
not exact or where the company has changed names.
Many companies include their ticker symbol in their
SEC filings, thereby eliminating this ambiguity.
Unfortunately, ticker symbols and CUSIP numbers
are not a tagged field, which makes them difficult and
time consuming to extract even when they are pro-
vided in the filing.
The second operational issue is that it is difficult
to accurately parse and identify common subjects
across multiple filings. This impacts the ability to
automate the retrieval of information from these
filings. There are a number of causes for this, in-
cluding:
. Poor identification of data objects [21]
. Limited number of tagged items
. HTML formatting errors
. Content inconsistency and incompleteness within a
filing
. Inconsistent use of terminology across companies
Table 4
Excerpt from SEC quarterly index (1Q 1997)
Form type Company name CIK Date Filing URL
10-12B Bull & Bear Global Income Fund 1031235 19970123 edgar/data/1031235/0000950172-97-000052.txt
10-12B First National Entertainment 853832 19970218 edgar/data/853832/0000853832-97-000002.txt
10-12B Hartford Life 1032204 19970214 edgar/data/1032204/0000950123-97-001413.txt
10-12B New Morton International 1035972 19970324 edgar/data/1035972/0000912057-97-009794.txt
10-12B Synthetic Industries 901175 19970213 edgar/data/901175/0000901175-97-000001.txt
10-12B WMS Hotel 1034754 19970228 edgar/data/1034754/0000950117-97-000339.txt
10-12B/A Getty Petroleum Marketing 1025742 19970113 edgar/data/1025742/0000950124-97-000137.txt
10-12B/A Getty Petroleum Marketing 1025742 19970127 edgar/data/1025742/0000950124-97-000358.txt
10-12B/A Getty Petroleum Marketing 1025742 19970313 edgar/data/1025742/0000950124-97-001486.txt
10-12B/A Ralcorp Holdings/MO 1029506 19970203 edgar/data/1029506/0000950138-97-000017.txt
10-12B/A Tanisys Technology 929775 19970124 edgar/data/929775/0000912057-97-001668.txt
. Lack of precision (e.g., failure to include units in
the financial statements)
. Legalistic phrasing complicates automated
processing of text.
An HTML formatting error can cause incorrect
parsing of the documents. Although the SEC guide-
lines call for tables to be tagged with a <Table> </
Table> pair, occasionally one of these tags is entered
incorrectly (e.g., /Table without the < > delimiters,
typos such as misplaced slashes as in <Table/>, or
even no end tag at all). The SEC documentation
indicates that it is the responsibility of the filer to
format these documents so that they are readable.
EDGARlink, the SEC’s filing submission software,
does not check for HTML tagging errors [39].
These errors can cause large blocks of text to be
incorrectly interpreted as part of the table. Similarly,
inconsistent content (as in contradicting statements),
and variability in the terminology complicates the
automated extraction of data. Ultimately, these types
of errors make fully automated processing unreli-
able.
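A minimal consistency check for the table-tagging problem might look like the following sketch. Note that it only detects well-formed but unbalanced tags; by construction it cannot see the malformed variants described above (a bare /Table, a misplaced slash), which is exactly why such errors derail automated parsing.

```python
import re

def table_tag_errors(html):
    """Scan for <TABLE>/</TABLE> tags and return the nesting depth
    after the scan: 0 means balanced, a positive value means unclosed
    tables, a negative value means stray closing tags.  Malformed
    variants that drop the angle brackets or misplace the slash do
    not match the pattern at all and so go undetected."""
    depth = 0
    for tag in re.findall(r"</?table>", html, re.IGNORECASE):
        depth += 1 if not tag.startswith("</") else -1
    return depth
```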
In addition, sentence construction can be quite
cumbersome. Some sentences extend over 15 lines
of text while others contain compound negatives
(sometimes as many as four or five in a single
sentence). Consider the following two statements dealing
with the Year 2000 problem that were extracted from
10K reports. Both are relatively common, with sim-
ilar statements being made by more than 30 firms. In
the first case, if the reader focuses on the text in the
immediate proximity to the ‘material adverse’ clause,
or even that following ‘the year 2000 problem,’ he/
she could get the wrong impression about that com-
pany’s readiness. The second statement contains
multiple negative clauses that blur the meaning of
the message.
. The Company has not completed its analysis and is
unable to conclude at this time that the year 2000
problem as it relates to its previously sold products
and products purchased from key suppliers is not
reasonably likely to have a material adverse effect.
. Without a reasonably complete assessment of
systems that could be vulnerable to problems, the
Company does not have reasonable basis to
conclude that the Year 2000 compliance issue will
not likely have an operational impact on the
Company.
Lastly, the structure and content of SEC filings
keeps evolving, averaging nearly one major revision in
the filing specification per year. For example, the
header tagging structure was changed to an XFDL
scheme in EDGAR 7.0, and modified again in
EDGAR release 8.0 [37,42]. Another important
change is that as of release 8.0, filing of the FDS is
no longer required [37,42]. Extracting financial data,
such as the income statement or balance sheet, now
requires going into the body of the filing and extract-
ing the data from imbedded tables. Furthermore, these
tables are not required to have any special tagging to
facilitate processing [39]. An additional complication
is that filings can now be submitted as a multi-part,
hyper-linked document rather than a single, integrated
document.
Because of the issues involved in analyzing these
free-form documents, a number of trade-offs had to be
considered in the development of the EDGAR-Ana-
lyzer program. The first was the relative importance of
Type I (false negatives) and Type II (false positives)
errors in the analysis. An emphasis on Type I errors
puts a premium on identifying all the targeted records,
resulting in an increased number of records which do
not contain useful or interesting content. Under this
scenario, the assumption is that the cost of an over-
looked record of interest outweighs the added cost of
processing irrelevant records. The opposite is true
when focusing on Type II errors, which stresses the
elimination of these non-targeted records, even at the
expense of missing records of interest.
Since EDGAR-Analyzer uses a two-tiered search
strategy, we must consider which strategy is appro-
priate at each tier. At the first level, it searches for
records that deal with a targeted main issue (e.g., the
Year 2000 Problem). At this level the program empha-
sizes completeness (i.e., avoiding Type I errors).
Once an interesting record is found, the program
executes a secondary search for related factors. For
the Year 2000 issue, this search may focus on imbed-
ded chips, employee retention, and indirect impact of
third parties. We are interested in only those instances
where these factors are discussed in relation to the
main issue, and not related to any other issue. The
search for these terms is done on blocks of text
extracted from the full document. These text blocks
capture the context in which the targeted subject is
discussed. This secondary screening limits which
blocks of text are extracted from each filing in an
attempt to minimize Type II errors.
Since records are screened strictly on the presence
of user-specified keywords, the issue of focusing on
Type I errors reduces to the identification of this target
set of keywords. This tends to be an iterative process.
An initial set of terms is established and run on a
small sample data set. Results are checked for accuracy
and completeness before the process is tried on
the full data. An alternate approach would be to use
focus groups to generate these keyword lists. Reduc-
ing the false positives is also dependent on the proper
keyword selection. Using common terms like ‘sales,’
or ‘profits’ will yield a high hit ratio, but many hits
will not be relevant. Keywords should be as specific
as possible to the issues of interest.
Two sets of keywords (along with their synonyms)
are generated. The primary set of keywords, the ‘Issue
Defining’ (ID) terms, are closely related to the subject
under study. If any of these terms are located in the
document, the relevant section is deemed pertinent
and subsequently extracted. The secondary keywords,
the ‘Critical Factor’ (CF) terms, are associated with
factors related to the targeted subject rather than the
subject itself. For example, when dealing with the
year 2000 problem, ID terms might include ‘Year
2000,’ ‘Y2K,’ and ‘Millennium Bug,’ while the CF
terms might include ‘imbedded chips,’ ‘staff,’ and
‘cost.’ For this particular study, these terms were
initially generated based on the issues discussed in
the popular press, research reports and academic
articles, and subsequently refined during pilot testing
on sample SEC filings. Note that the presence of a CF
term does not imply a discussion of the targeted
subject, and thus does not automatically trigger the
extraction of text. However, it could indicate a poten-
tially relevant passage. As a result, a sliding relevancy
scale is used. The program first executes a keyword
search based on only the ID terms. When an ID term
is located, the paragraph containing that term is
marked for extraction. At this point, the relevancy
threshold is decreased to include the CF set of words in
the search. Contiguous paragraphs following the pre-
viously marked paragraph are then searched for any
term within either the ID or CF set. Paragraphs
containing a qualifying term from either set are
extracted. Each extracted text block is marked with
a delimiter to allow subsequent identification of the
separate contiguous blocks. The remaining text is then
searched for the next instance of an ID term, repeating
the extraction process until the whole document is
processed.
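The sliding-relevancy procedure just described can be sketched as follows. This is an illustrative reconstruction of the published steps, not the original code: a paragraph containing an Issue Defining term opens a block, contiguous paragraphs containing either an ID or a Critical Factor term extend it, and the first paragraph containing neither closes it.

```python
def extract_blocks(paragraphs, id_terms, cf_terms):
    """Two-tier extraction: ID terms trigger a new block; once a block
    is open, ID or CF terms extend it through contiguous paragraphs.
    A CF term alone never opens a block, mirroring the rule that CF
    terms do not by themselves imply discussion of the target issue.
    Returns a list of blocks, each a list of paragraphs."""
    def hit(terms, para):
        low = para.lower()
        return any(t.lower() in low for t in terms)

    blocks, current = [], None
    for para in paragraphs:
        if hit(id_terms, para) or (current is not None and hit(cf_terms, para)):
            if current is None:          # an ID term opens a new block
                current = []
                blocks.append(current)
            current.append(para)
        else:                            # neither term set: block closes
            current = None
    return blocks
```

Keeping each block as a separate list plays the role of the delimiter described above, so the contiguous context passages remain individually identifiable in later analysis.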
Another trade-off involved in the development of
this tool is the issue of preprocessing filings before
sending them to the search engine. The content variety
and inadvertent formatting errors can greatly impact
the processing of these files. For example, most files
are single spaced, with double-spacing between para-
graphs. However, some files are double-spaced
throughout (using two hard carriage returns) with an
indention indicating a paragraph break. In some
instances, there is no discernable paragraph break at
all (i.e., the company used hard carriage returns at the
end of each line with no indentions). The ability to
identify paragraph boundaries is critical to this appli-
cation since the program extracts the search context
information a paragraph at a time. Improperly identi-
fying paragraph boundaries would reduce the effec-
tiveness of the secondary search to identify con-
tributing factors. A similar issue exists with word
spacing. Since search phrases may contain multiple
words (e.g., Year 2000), the search is sensitive to
inter-word spacing. In both cases (paragraph and
inter-word spacing), the problem can be resolved
through a global search and replace process, but this
can significantly impact processing time.
Two different solutions are used to address these
problems. Because of the central role that paragraphs
played in the methodology, it is important to reinte-
grate text back into contiguous paragraphs. Files were
checked for double spacing and converted to single-
spacing where needed. Using the same approach
proved to be too computationally costly for the
inter-word spacing issue. This issue was handled by
specifying multiple search strings with different inter-
word spacing (i.e., ‘Year2000’, ‘Year 2000’ with one
space, and ‘Year  2000’ with two spaces). This is not an
optimal approach since it tends to increase Type I
errors. This occurs when a spacing combination
existing in the document is omitted from the search
string set (e.g., ‘Year   2000’ with three spaces). This can
be addressed by selectively searching for instances of
the first word in multi-word terms and replacing
instances where multiple spaces exist. An initial search
for ID terms is done to prevent the time-consuming
work of cleaning a file that does not contain anything
of interest.
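The spacing workaround described above amounts to enumerating the term under every inter-word spacing up to some bound, as in the following sketch. Any spacing wider than the bound is still missed, which is the Type I gap the text notes.

```python
def spacing_variants(term, max_spaces=2):
    """Generate a multi-word term with every inter-word spacing from
    zero up to max_spaces, mirroring the search-string workaround for
    inconsistent spacing (e.g. 'Year2000', 'Year 2000', 'Year  2000').
    The bound of two spaces matches the variants listed in Table 7."""
    words = term.split()
    return [(" " * n).join(words) for n in range(max_spaces + 1)]
```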
4. Sample study—Y2K
EDGAR-Analyzer was used to investigate corpo-
rate year 2000 remediation efforts as reported in their
annual reports. Although this issue has been known
since 1971 [6,44], it only emerged into the public and
corporate consciousness around 1995–1996, which
coincidentally is the same period of time that the
EDGAR database was established. Recall that the
year 2000 problem (Y2K) refers to the inability of
software and hardware systems to handle dates
beyond the year 1999. The problem stems from what
was a common system design practice of representing
dates by a six-digit field—MM-DD-YY, thereby cap-
turing the month, day and only the last two digits of
the year. As a result, the dates January 1, 1900 and
January 1, 2000 were both represented as ‘1/1/00’.
Unfortunately, most systems had no means to distin-
guish which of these dates is correct. Extensive
information concerning the Year 2000 problem is
available on the Internet. The interested reader is
directed to the National Y2K Clearinghouse site run
by the U.S. General Services Administration and
located at http://www.y2k.gov/.
Before the actual study is discussed, a brief over-
view of issues surrounding the Y2K problem is
presented. In practice, such a pre-analysis of the issues
is necessary, for it helps to develop the set of key-
words that EDGAR-Analyzer will use when parsing
the document. This is followed by a discussion of the
case study—the methods used and the results
obtained.
4.1. Review of the Y2K problem
The ‘‘Year 2000 problem’’ relates to what was a
common practice of computer programmers to use a
two-digit rather than four-digit number to represent
the year. This could cause systems or applications
using dates in calculations, comparisons, or sorting to
generate incorrect results when working with years
after 1999 [32]. On the surface, the Y2K problem
appeared to be trivial, with an obvious solution—
simply modify all date fields to include four digit
years. On closer examination, this problem is seen to
be much more complicated (see Table 5 for a list of
potential issues/problems).
Table 5
Potential year 2000 problems
Software
. Valid dates were often used to represent special conditions. For
example, ‘1/1/00’, ‘9/9/99’, and ‘12/31/99’ might represent ‘date
unknown’, ‘date not entered’, and ‘infinite date’. Thus the Y2K
problem was not limited to January 1, 2000
. Availability of well-documented source code may be limited,
greatly complicating the analysis and code conversion efforts
. Inconsistent date formats were commonly used (e.g.,
YYYYMMDD, MMDDYYYY, DDMMYYYY)
. Not all dates are based on variable values. Hard-coded dates,
calculated dates and dates imbedded in filenames are just three
examples
. Multiple, non-compatible approaches were used to address the
Y2K problem. These included field expansion, fixed window,
and sliding windows
. The program logic needs to change to account for this different
date representation. Changing date format may corrupt screen and
printed output. Archived data may also have to be changed to be
consistent with revised code so that it is still accessible.
. Leap year issues
Hardware
. Many modern devices have embedded microprocessors that could
be susceptible to the Y2K problem. In these devices, the logic is
‘burned’ into the chip and is therefore not modifiable
Personnel
. Shortage of qualified personnel needed to address problems. Due
to supply and demand pressures the cost to locate, hire and retain
qualified staff was high
Legal Issues
. Business interruption due to the failure of critical systems
. Directors and Officers liability for not addressing Y2K in a timely
manner
. Stockholders suing accounting firms for inadequate disclosure of
Y2K risks
. Collateral litigation – failure of one system preventing a company
from delivering on their commitments
. Breach of contract and failure to perform service
. Consumer fraud class action based on misrepresentation of system
performance
Environmental
. Cascade failure if suppliers or customers fail to become year 2000
compliant
. Impact of potential public utilities failures (electric, gas, water,
phone, etc.)
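The fixed- and sliding-window remediation approaches listed in Table 5 can be illustrated with a short function. The pivot value of 50 is an illustrative assumption; actual projects chose application-specific windows.

```python
def expand_year(yy, pivot=50):
    """Interpret a two-digit year through a fixed window: values
    below the pivot map to 20xx, the rest to 19xx.  A sliding window
    is the same idea with a pivot that advances with the current
    year.  Note why the approaches in Table 5 are non-compatible:
    data windowed here is misread by code expecting expanded fields."""
    return (2000 if yy < pivot else 1900) + yy
```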
This was a worldwide problem. The sheer magni-
tude of the required Y2K conversion effort would tend
to introduce new errors into existing applications, and
adequate testing is critical to ensure that the Y2K
problem has been corrected. Because of system inter-
dependence, this testing should involve both unit
testing and integrated system testing [18]. Also,
research has shown that proper testing of large proj-
ects typically accounts for 50% of the whole project
time [4]. Unfortunately, the required time to do
adequate testing is often underestimated and in this
case the time frame was unalterable (it had to be done
by December 31, 1999).
Of particular interest to this case study is the SEC’s
response to the Y2K problem since it is the control-
ling legal authority dealing with disclosure obligations
of public corporations in the United States. The SEC’s
bulletin of October of 1997 (subsequently revised on
January 12, 1998) specifically addressed the ‘‘disclo-
sure obligations relating to anticipated costs, problems
and uncertainties associated with the Year 2000 issue’’
[30]. It required companies to disclose details of Y2K
problems in their ‘Management’s Discussion and
Analysis’ section if:
. ‘‘the cost of addressing the Year 2000 issue is a
material event or uncertainty that would cause
reported financial information not to be necessarily
indicative of future operating results or financial
condition, or
. ‘‘the costs or the consequences of incomplete or
untimely resolution of their Year 2000 issue represent
a known material event or uncertainty that is reason-
ably expected to affect their future financial results, or
cause their reported financial information not to be
necessarily indicative of future operating results or
future financial condition‘‘ [30].
Also, ‘‘if Year 2000 issues materially affect a
company’s products, services, or competitive condi-
tions, companies may need to disclose this in their
‘‘Description of Business.’’ . . .[This] ‘‘disclosure
must be reasonably specific and meaningful, rather
than standard boilerplate’’ [30].
4.2. Case study
The focus of this study is to determine the status of
Y2K remediation efforts as reported in corporate 10K
documents filed with the SEC over the period 1997–
1999 (corresponding to FY 1996–1998). At issue is
the type of disclosures made, and to what extent
critical factors related to the Y2K problem are
acknowledged in these disclosures.
The case study looked at all 10K reports electroni-
cally submitted and stored in EDGAR during the
period January 1, 1997 to April 30, 1999, which
amount to 18,595 filings (see Table 6). The 10K filing
was targeted because it corresponds to the firm’s
annual report that is required to provide extensive
discussion of issues that impact, or even could poten-
tially impact, the firm’s operations. These files tend to
be detailed and can be of significant size. For this
study, the average file size was 291 KB, which
corresponds to approximately 100 pages. The largest
files were 5 MB. Some 10K files reach 23 MB,
although none that size were involved in this study.
The sheer volume of information contained in these
files makes finding topics of interest difficult and
highlights the need for automated support. Note, only
the 10K filings were analyzed, including all variants
(e.g., 10K/A, 10KSB, 10KT405, etc.). The keywords
and their synonyms used for the case study are listed in
Table 7.
Pilot testing indicated that two ID terms were
commonly used outside the context of the year 2000
problem. The first was ‘year 2000’ and its variants.
With the approaching century change, many compa-
nies discussed plans that would be implemented in the
year 2000, which created a false positive. The second
was ‘y2k’. Actually, the term y2k was nearly exclu-
sively used to refer to the year 2000 problem, but
financial tables tended to use ‘fy2k’ (or its equivalent
‘fiscal year 2000’) in a non-relevant context. A simple
keyword search for ‘y2k’ would cause a false hit on
‘fy2k’. These issues were found early in the analysis
and a special filter was added to address this problem.
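The ‘fy2k’ filter might be implemented as follows. The actual filter in EDGAR-Analyzer is not published, so this lookbehind-based regex is only an illustration of the idea: accept ‘y2k’ as a standalone token while rejecting it when it is the tail of a longer token such as ‘fy2k’.

```python
import re

# Match 'y2k' only when it is not preceded by a letter or digit,
# rejecting 'fy2k' ('fiscal year 2000'), the false-hit source noted
# in the text.  This regex is an illustrative reconstruction.
Y2K = re.compile(r"(?<![a-z0-9])y2k", re.IGNORECASE)

def mentions_y2k(text):
    """Return True if the text contains a standalone 'y2k' token."""
    return Y2K.search(text) is not None
```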
The program screened 10K filings for any indica-
tion of a year 2000 disclosure (based on the presence
Table 6
Breakdown of 10K filings processed
                  10Ks screened by      Manually validated
                  EDGAR-Analyzer
10K Filings       18,595                9,764
Non-disclosures   7,917 (42.6%)         7,917 (81.1%)
Disclosures       10,678 (57.4%)        1,847 (18.9%)
of the ID terms in Table 7) and extracted relevant text blocks.
These text blocks were then searched for critical
issues/elements dealing with Y2K. Table 8 gives a
list of the items tracked during the study. Certain
concepts could not be automatically extracted and
therefore required manual processing—for example,
cost figures and date information. EDGAR-Analyzer
determined that 42.6% of the filings did not contain
any ID term, and were logged as non-disclosing
filings. To eliminate false positives, the extracted text
blocks were manually reviewed. In the process, the
data extracted by EDGAR-Analyzer was validated.
Due to time constraints, only 1,847 filings containing
Y2K disclosures were manually reviewed.
4.3. Case study results
To illustrate the capabilities of this tool, five
aspects of the Y2K problem were investigated,
namely:
. How did the percentage of Y2K disclosures in
annual reports change over time?
. How did firms characterize the impact of the Y2K
problem?
. To what extent are the various factors associated
with Y2K discussed?
. How far along are companies in their remediation
effort?
Table 7
Keywords/phrases used to locate information in SEC 10Ks. Multiple spellings of words are included where appropriate
Issue defining terms Critical factor terms
Year2000 (No Spaces) Adverse Embedded Remediation
Year 2000 (One Space) Analysis Evaluated Reviewing
Year 2000 (Two Spaces) Assess Failure Significant
Y2K Completed HVAC Substantial
Millenium Bug Compliance Liability Supplier
Millennium Bug Conducted Material Third Parties
Millenium Problem Contingency Miscalculations Unknown Cost
Millennium Problem Conversion Not Pose Vendor
Customer Positive Effect Warrant
Customers Positive Impact 2000
Disrupt Preliminary
Table 8
Items tracked with EDGAR-Analyzer
Critical elements Informational elements
Imbedded Chips Expected to have a positive impact
Staffing/Programmer Retention In the business of Y2K remediation
Third Parties Not material in 1998
Euro Conversion Was any Y2K disclosure made
Leap Year
Liability and Warranty issues Status of Y2K Remediation
Risk of Disruption Not yet started
Impact on competitive position Not finished remediation plan
Contingency plans Not finished with analysis phase
Material/Not Material Finished with analysis phase
Not material without discussion Schedule date to complete assessment
Not material with some discussion Schedule date to finish changes
Not material in 1998 Schedule date to finish testing
Not expected to be material Schedule date to finish Y2K Project
Material Substantially done with Y2K Project
Mission critical systems are Y2K compliant
Data Inconsistency and incompleteness
Currently Y2K Compliant
. And finally, what disclosures are made regarding
the cost of their Y2K efforts?
For the analysis of Y2K disclosure frequency, all
18,595 annual reports, including those filings that
were not manually checked, were incorporated. This
was done to increase the sample population and get a
sense of firms’ Y2K awareness. Incorporating the
non-verified data will tend to increase the number of
false positives since an ID term may be used in a non-
Y2K context, and thus, the reported disclosure per-
centages may be inflated relative to the actual number of
disclosures. The remaining four topics focused on
company specific disclosures, so only manually veri-
fied data was included in this analysis. Each of these
issues is discussed below.
Fig. 2 illustrates the percentage of 10K filings that
contain some form of Y2K disclosure. The number of
filings peaks sharply every March, which corresponds
to the large number of companies with a December
fiscal year end (the SEC requires 10Ks to be filed
within 3 months of the close of the fiscal year,
explaining the peak in March). The bar chart shows
that the percentage of filings with Y2K disclosures
started to increase in November 1997. This corre-
sponds to the SEC’s issuance of Staff Legal Bulletin
No. 5 in October 1997 that outlined the specific
obligations each firm had with regard to their year
2000 disclosure (see discussion in prior section) [30].
The Commission requires firms to identify and
disclose factors that may have a material impact on
their operations. As mentioned above, both govern-
mental regulating bodies and professional bodies
issued opinions and guidelines requiring disclosure
of Y2K-related information (Ref. [30] and Refs.
[1,2,3,11], respectively). Consequently, this informa-
tion should be common in filings submitted after the
publication of these guidelines. Fig. 3 presents how
the firm’s rating of the severity of the Y2K issue
changed over time. The overall height of the bars
indicates the percentages of 10K filings that contained
some form of Y2K disclosure. The stacked bars break
out six categories—the two most significant being
‘Materiality not Mentioned’ and ‘Not Material with
Support.’ The first category is self-explanatory. The
second category captures the number of filings that
indicated that Y2K will not have a material impact
and presented additional factors related to the Y2K
problem to lend support to this statement. This is in
contrast to those falling into the ‘‘Just Not Material’’
category which did not include any such support. Few
filings fell into the remaining three categories. The
balance of the filings for each year did not mention the
materiality of the Y2K issue (note, the category
Fig. 2. Y2K disclosures in corporate 10K filings submitted from January 1997 to April 1999 (FY 1996–1998). The line graph shows the
number of filings per month. The bar chart shows the percentage of those filings that contained some form of Y2K disclosure.
Fig. 3. Breakdown of the self-purported impact of Year 2000 for fiscal year 1996–1998. Values represent percentage of manually checked
10Ks, with the aggregate representing percentage of 10Ks containing some form of Y2K disclosure.
Fig. 4. Frequency that various critical Y2K factors were discussed in 10K filings. The values are percentages of the manually checked
filings with some form of Y2K disclosure.
‘‘Unknown if Material’’ includes only those filings
which stated specifically that they did not know if the
impact was likely to be material or not).
A statement of non-materiality may not engender
much confidence without some accompanying evi-
dence that the issues involved have been adequately
addressed. For this reason, it is interesting to inves-
tigate the frequency these related issues are discussed.
Fig. 4 looks at the frequency that 11 critical factors are
discussed in the context of Y2K problems. It is
interesting to note the dramatic rise in the awareness
of ‘‘Imbedded Chips’’ in FY 1998 (these filings were
submitted in 1999 for the previous year’s operation).
Another factor that could impact the reader’s con-
fidence is a statement regarding the status of the
remediation process. Statements being made late in
the conversion process would tend to be more reliable
than one made earlier. Fig. 5 focuses on important
milestones in the Y2K effort. Finishing the assessment
phase is important, for it marks the point when the
firm has reviewed its exposure to Y2K problems and
is now ready to address these issues. Unfortunately, in
FY 1998 the number of firms that had not yet finished
the assessment phase almost equaled those who had
finished it. In addition, over 50 percent of the FY
1998 10K filings did not disclose if they completed
their assessment or not. Given this result, coupled
with the approaching December 31st deadline, the
status of the firm’s contingency planning became
important. Most firms had not finished their contin-
gency plan as of the filing of their FY 1998 annual
report. In addition, 187 filings, over 12% of filings
with disclosures, indicated that they would develop a
plan on an as-needed basis.
When addressing the materiality of the Y2K issue,
some firms reported the costs of their remediation
effort—both what had been spent to date and expected
future expenditures. These values varied considerably.
From an accounting perspective, it is important how
these expenses are accounted for in the firm’s finan-
cial statements. Capitalizing these expenses allows the
firm to spread the impact over multiple years, while
expensing the costs recognizes them in a single year.
The Emerging Issues Task Force (EITF) of the Finan-
cial Accounting Standards Board and International
Accounting Standards Committee (IASC) both issued
opinions stating that Y2K-related expenses should
normally be expensed as incurred [11,19].

Fig. 5. Identifies the Y2K remediation phase as disclosed in manually checked 10Ks for fiscal years 1996–1998.

J. Gerdes Jr. / Decision Support Systems 35 (2003) 7–29

Fig. 6 reports how firms planned to account for their Y2K
remediation expenses. Note that there is a significant
number of firms planning to capitalize these expenses
despite these authoritative opinions. Also note that
there is a sizable group of firms which indicated that
they did not track their internal costs.
5. Discussion and directions for further research
EDGAR-Analyzer addresses a shortcoming of
existing tools that extract data from the SEC’s
EDGAR database. Although existing tools provide
basic search capability, most focus primarily on the
financial data contained in the FDS. In contrast,
EDGAR-Analyzer focuses on the text section of these
filings. Extracting information from unstructured free-
form text is challenging [10,22,24]. The approach
adopted incorporated a tiered search strategy. Key-
words specific to the targeted search are used to
identify passages that deal with an issue of interest.
These passages are extracted and subsequently rean-
alyzed to determine if sub-issues are addressed within
the context of the more general targeted search. The
final phase of the analysis is a manual review and
validation of the automated analysis.
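As a sketch, the tiered strategy might look like the following Python fragment. The keyword lists, the blank-line passage convention, and the flag names are illustrative assumptions, not EDGAR-Analyzer's actual implementation.

```python
# Sketch of a tiered keyword search over a filing's text section.
# Tier 1 keywords locate passages on the issue of interest; tier 2
# keywords flag sub-issues within those passages as Boolean variables.
ISSUE_KEYWORDS = ["year 2000", "y2k"]
SUBISSUE_KEYWORDS = {
    "suppliers": ["supplier", "vendor"],
    "material_impact": ["material"],
    "cost": ["cost", "expense", "expenditure"],
}

def tiered_search(text):
    """Return issue-related passages, each tagged with Boolean
    sub-issue flags for later manual review."""
    passages = [p for p in text.split("\n\n") if p.strip()]
    hits = []
    for passage in passages:
        lower = passage.lower()
        if any(k in lower for k in ISSUE_KEYWORDS):
            flags = {name: any(k in lower for k in kws)
                     for name, kws in SUBISSUE_KEYWORDS.items()}
            hits.append({"passage": passage, "flags": flags})
    return hits

filing = ("The Company has assessed its Year 2000 exposure.\n\n"
          "Y2K remediation costs are not expected to be material.\n\n"
          "Revenues grew 8% over the prior year.")
results = tiered_search(filing)
# Two of the three passages are retained; only the second flags costs.
```

The Boolean flags correspond to the screening variables discussed below: a passage that never triggers a sub-issue keyword need not be re-examined for that sub-issue during manual review.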
The case study points out a number of changes that
would improve the usefulness of the EDGAR
database:
• Extend the EDGAR file specification to include
tagged values of either the stock ticker symbol (along
with the corresponding exchange) or the CUSIP
number.
• Preprocess the structure of EDGAR filings to
make them easier to analyze electronically. This
would involve making paragraphs contiguous by
removing imbedded hard line feeds, and extending
the tagging to include tables, sections, subsections,
and even paragraphs.
• Validate the tagging structure of filings. Improper
tagging greatly complicates the analysis of these
documents.
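The preprocessing suggested in the second point can be sketched in a few lines. Treating a blank line as the only true paragraph boundary is an assumption about the filing layout, not part of the SEC's specification.

```python
# Rejoin hard-wrapped lines into contiguous paragraphs so that
# keyword searches are not broken across line feeds.
# Assumption: a blank line marks a real paragraph boundary; any
# other line feed is hard-wrapping to be removed.
def unwrap_paragraphs(text):
    paragraphs = []
    for block in text.split("\n\n"):
        joined = " ".join(line.strip() for line in block.splitlines())
        if joined.strip():
            paragraphs.append(joined.strip())
    return paragraphs

raw = ("The Company has completed its\nassessment of Year 2000 issues.\n"
       "\n"
       "Remediation costs were\nexpensed as incurred.")
paras = unwrap_paragraphs(raw)
# paras now holds two contiguous, searchable paragraphs.
```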
EDGAR-Analyzer’s identification of sub-issues
related to the larger issue under study was very useful.
For example, in the case study Boolean variables
indicated if any evidence was found of a Y2K-related
discussion of suppliers and customers, whether they
felt Y2K would materially impact the company, or
costs related to the remediation effort.

Fig. 6. Percentage of manually reviewed 10K filings that disclose certain remediation cost information. Capitalize/Expense Costs categories indicate how they plan to recognize these expenses on their income statements.

Since the
screening process was biased toward minimizing false
negatives, these Boolean values reduced the number
of issues that had to be manually verified.
This method of data mining proved effective in a
case study involving Y2K disclosures in annual
reports. The study focused on finding all disclosures
that dealt with the year 2000 problem.
Consequently, a liberal screening strategy was used,
which tended to include text blocks that were not
pertinent. Even so, EDGAR-Analyzer eliminated
42.6% of the records analyzed as non-Y2K disclosing,
and extracted an average text block of 11.1 KB from
each filing (filings average 291 KB), a 96% reduction
in the amount of text that had to be manually
processed.
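These figures can be reproduced directly: extracting an average 11.1 KB text block from filings averaging 291 KB leaves roughly 4% of the text for manual review.

```python
# Reduction in manually processed text, from the case-study figures.
avg_filing_kb = 291.0    # average 10K filing size
avg_extract_kb = 11.1    # average extracted Y2K text block
reduction = 1 - avg_extract_kb / avg_filing_kb
print(f"{reduction:.0%} of the text no longer needs manual review")
```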
5.1. Future of EDGAR
As suggested earlier, the SEC is continually refin-
ing the guidelines dealing with the filing requirements.
There is an ongoing effort within the Commission to
improve the availability and usability of corporate
information within the EDGAR database. With each
new version, additional forms are added to the list of
required documents that must be filed electronically.
Additional formats have also been added. The earliest
documents were only available in ASCII text. On June
28, 1999, the SEC started accepting HTML and PDF
files (although PDF filings are considered unofficial
copies). In May 2000, the guidelines were again
modified to allow HTML filings to include graphic
images, and to allow multi-part, hyperlinked filings.
Furthermore, ‘‘the Commission has rescinded the
requirement for registrants to submit FDSs for filings
due and submitted after January 1, 2001’’ [34]. This
was dropped because it duplicated information
contained in the filing body, and thus
created a potential for data inconsistency. No alter-
native mechanism has been added to compensate for
the absence of the FDS schedule. Those who want
the financial data must now locate and extract
that information from the filing body [40].
As part of the modernization effort, the SEC is
migrating toward XML-based tagging of documents.
The first step has already taken place. The Commis-
sion has instituted a change from the SGML header
tags used since the initiation of EDGAR, to XFDL
tags (XFDL is an XML-based language designed to
handle forms). ‘‘Legacy filings (with SGML header
tags) will no longer be accepted by the Commission’s
EDGAR system after April 20, 2001’’ [41].
Based on an interview with the EDGAR project
manager, the SEC envisions steady progress in inte-
grating XML-type tagging into the Commission’s
filing regulations [39]. The tagging is still quite
limited, being required only for the filing header.
Future EDGAR revisions will begin to include tags
identifying content within the filing body, especially
the financial reports. Processing these financial reports
is challenging, and proper tagging would greatly
improve access to this information. The SEC is taking
a cautious approach, preferring to wait for the private
sector to develop generally accepted standards before
establishing filing policy. One particularly promising
initiative is the XBRL (Extensible Business Reporting
Language) [45], which ‘‘uses XML-based data tags to
describe financial statements for both public and
private companies’’ [46].
5.2. Areas for future research
This paper describes the development and testing
of a prototype tool designed to access, search and
extract information of interest from the SEC EDGAR
database. Having successfully demonstrated feasibil-
ity, the next phase of the project will focus on moving
this application into a more robust, multi-user, server-
based environment.
To improve handling of the multiple variants that
accompany most search terms, it is useful to incorpo-
rate regular expression processing and automatic
synonym support (see Refs. [12,14], respectively).
Regular expressions are especially important since
the analysis of SEC filings primarily involves text
processing. Regular expression support would facili-
tate handling of stemmed words, wild cards and word
variants such as plurals and tenses. Automated sup-
port of domain specific synonyms would also help to
reduce both Type I and Type II errors. One possibility
is to tap into the extensive work done on WordNet, an
on-line lexical reference system organized into syn-
onym sets [12].
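By way of illustration, a single regular expression can cover common variants of a search term. The patterns below are illustrative, not the program's actual term lists.

```python
import re

# One pattern per search term covers its common variants, e.g.
# "Y2K", "Year 2000", "year-2000", and the remediate/remediation
# word family. The patterns are illustrative assumptions.
Y2K = re.compile(r"\b(?:y2k|year[\s-]*2000)\b", re.IGNORECASE)
REMEDIATE = re.compile(r"\bremediat(?:e[sd]?|ing|ion)\b", re.IGNORECASE)

text = "The Company's Year-2000 remediation effort is complete."
assert Y2K.search(text) and REMEDIATE.search(text)
assert not Y2K.search("fiscal year 1999")
```

Domain-specific synonym lists (e.g. "millennium bug" for "Y2K") could then be expanded into further alternatives within the same pattern.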
The tiered search algorithm used in EDGAR Ana-
lyzer depends on the ability to accurately identify
paragraph boundaries within the documents. As pre-
viously mentioned, this was a challenging aspect in
the case study and represents a potential source of
error. To assist in this critical process, techniques
developed in the Computational Linguistics (CL),
Information Retrieval (IR) and Data Mining literature
need to be evaluated. Within CL, multiple paragraph
extraction techniques have been developed [23]. Of
particular interest is the TextTiling algorithm, which
partitions full-length documents into coherent, multi-
paragraph passages that reflect the subtopic structure
of the original text [16]. It is also important to validate
the efficacy and efficiency of the tiered search algo-
rithm against other search algorithms found in the IR
literature [9,13,43]. To adequately evaluate and com-
pare these techniques, standard IR methodology must
be used, including the reporting of traditional recall
and precision scores. Recent advances in text-based
data mining may also be of use in refining the search
algorithm [17].
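Recall and precision themselves are simple to compute once a sample has been manually labeled; the counts in the sketch below are invented for illustration.

```python
# Recall and precision for a screening run, computed from a
# manually labeled sample. The counts are made-up illustrative
# values, not results from the case study.
def recall_precision(true_pos, false_pos, false_neg):
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision

# A liberal screen biased against false negatives yields high
# recall at the cost of lower precision.
recall, precision = recall_precision(true_pos=95, false_pos=40,
                                     false_neg=5)
```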
EDGAR-Analyzer’s text-parsing routines must be
continuously adapted to address on-going changes
occurring in the SEC’s filing specifications. For
example, the Commission has recently stopped sup-
porting SGML tagging in favor of XFDL tagging,
and firms can now submit multi-part, hyperlinked
filings. To fully process these submissions, it must
be possible to recursively recover and analyze each
segment of the filing. Also, in extending EDGAR-
Analyzer, it is important to plan for the SEC's future
XML migration.
EDGAR-Analyzer was written for a single user
environment, but to reach its full potential it needs to
support a broader, multi-user environment. Since the
underlying engine is generic and useful in many
different contexts, efficiencies could be gained by
creating specialized, community-based interfaces.
Potential communities could include investors, ana-
lysts, lawyers, pre-IPO firms, regulators, insiders, etc.
As suggested in Fig. 1, value can be added by moving
to community-based environments. In this way, com-
mon variables of interest to a particular community
could be predefined. Generation of community-spe-
cific search terms and appropriate synonym lists could
also be facilitated through a group-based interface.
Both of these features would speed up a search and
hopefully improve the quality of the results returned.
To gain additional efficiencies and allow multi-user
access and support, this tool needs to move to a
server-based environment. Moving to an extensible,
open-source architecture would allow distributed,
multi-author development. The code needs to be re-
engineered in a language such as Python, which has
both strong, native support for regular expressions and
XML processing as well as providing cross platform
support. Finally, it may be possible to extend EDGAR
Analyzer’s data mining methodology to other text-
based archival systems, such as news articles, journal
articles, patent filings and government legislation (e.g.,
www.thomas.gov).
References
[1] AICPA, The Year 2000 Issue—Current Accounting and Audit-
ing Guidance, American Institute of Certified Public Account-
ants, 1997, http://www.aicpa.org/members/y2000/intro.htm.
[2] AICPA, Year 2000 Issue Disclosure Considerations: Public
and Nonpublic Entities, American Institute of Certified Public
Accountants, 1997, http://www.aicpa.org/members/y2000/
discon.htm.
[3] AICPA, AICPA’s Letter to the SEC on Year 2000 MD and A
Disclosure, American Institute of Certified Public Account-
ants, December 9, 1997, http://www.aicpa.org/belt/sec2000/
index.htm.
[4] H. Aronoff, S. Graham, A Testing-Centric Approach to Year
2000 Project Management, White Paper, Tech-Beamers, 1997,
http://www.idsi.net/techbmrs/y2kh.htm.
[5] M.E. Bates, Where’s EDGAR Today? Finding SEC Filings
Online, DATABASE, June 1996, http://www.onlineinc.com/
database/JuneDB/bates6.html.
[6] R.W. Bemer, What’s the date? Honeywell Computer Journal 5
(4) (1971) 205–208.
[7] M. Bergman, The Deep Web: Surfacing Hidden Value,
BrightPlanet.com, July 2000, http://128.121.227.57/download/
deepwebwhitepaper.pdf.
[8] Bowne & Co., Securities Act Handbook, 2001, http://www.
bowne.com/resources/secrules.asp.
[9] J. Callan, M. Connell, Query-based sampling of text databases,
ACM TOIS 19 (2) (April 2001) 97–130.
[10] T.E. Doszkocs, Natural language processing in information
retrieval, Journal of the American Society for Information
Science 37 (4) (1986) 191–196.
[11] Emerging Issues Task Force, EITF Issue No. 96-14, Account-
ing for the Costs Associated with Modifying Computer Soft-
ware for the Year 2000, 1997, http://www.ncua.gov/ref/
accounting_bulletins/bull971.pdf.
[12] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database,
Bradford Books, MIT Press, Cambridge, MA, May 1998.
[13] F. Feng, W.B. Croft, Probabilistic techniques for text extrac-
tion, Information Process Management 37 (2) (March 2001)
199–220, http://ciir.cs.umass.edu/pubfiles/ir-187.pdf.
[14] J.E. Friedl, Mastering Regular Expressions: Powerful Tech-
niques for Perl and Other Tools, O’Reilly, Cambridge; Sebas-
topol, 1997.
[15] B. Goodwin, P. Smith, R. Bunney, EDGAR and Family: SEC
data on the Web, Internet Prospector, Sept. 1998, http://
www.internet-prospector.org/edgar.htm.
[16] M.A. Hearst, TextTiling: segmenting text into multi-paragraph
subtopic passages, Computational Linguistics 23 (1) (1997)
33–64.
[17] M.A. Hearst, Untangling text data mining, Proceedings of the
ACL, University of Maryland, (June 20–26, 1999).
[18] L. Hyatt, L. Rosenberg, A Software Quality Model and Met-
rics for Identifying Project Risks and Assessing Software
Quality, 8th Annual Software Technology Conference, April
1996, http://satc.gsfc.gov/support/STC_APR96/quality/stc_qual.html.
[19] IASC, SIC Draft Interpretation D6: Cost of Modifying Exist-
ing Software, International Accounting Standards Committee,
October 1997, http://www.iasc.org.uk/frame/cen3doo6.htm.
[20] A. Kambil, Final Report: NSF Award 9319331: Internet Ac-
cess to Large Government Data Archives: The Direct EDGAR
Access System, NYU Stern School of Business, July 29, 1996.
[21] A. Kambil, M. Ginsburg, Public access web information sys-
tems: lessons from the Internet EDGAR project, Communica-
tions of the ACM 41 (7) (July 1998) 91–98, http://delivery.
acm.org/10.1145/280000/278493/p91-kambil.pdf.
[22] P. McKevitt, D. Partridge, Y. Wilks, Why machines should
analyze intention in natural language dialogue, International
Journal of Human Computer Studies 51 (5) (November 1999)
http://www.idealibrary.com/links/citation/1071-5819/51/947.
[23] M. Mitra, A. Singhal, C. Buckley, Automatic text summa-
rization by paragraph extraction, in: I. Mani, M. Maybury
(Eds.), Intelligent Scalable Text Summarization, Proceed-
ings of a Workshop, Association of Computational Lin-
guistics, vol. 104, (1997) 39–46, http://citeseer.nj.nec.com/
mitra97automatic.html.
[24] M.R. Muddamalle, Natural language versus controlled vo-
cabulary in information retrieval: a case study in soil mechan-
ics, Journal of the American Society for Information Science
49 (10) (1998) 881–887.
[25] R. Nader, Statement of Ralph Nader before FOIndiana: FREE-
DOM OF INFORMATION, September 21, 1996, http://
www.cptech.org/govinfo/foindiana.html.
[26] K.M. Nelson, A. Kogan, R.P. Srivastava, M.A. Vasarhelyi, H.
Lu, Virtual auditing agents: the EDGAR agent challenge, De-
cision Support Systems 28 (3) (2000) 241–253.
[27] PricewaterhouseCoopers, A Technical Overview of the Edgar-
Scan System, April 9, 2001, http://edgarscan.pwcglobal.com/
EdgarScan/edgarscan�arch.html.
[28] B.F. Schwartz, EDGAR Update: the proliferation of commer-
cial products, Legal Information Alert 15 (1) (January 1996)
1–5.
[29] Securities and Exchange Commission, Edgar Filer Manual,
Release 5.10, SEC, Washington, DC, Sept. 1996.
[30] Securities and Exchange Commission, Staff Legal Bulletin
No. 5, October 8, 1997, revised January 12, 1998, http://
www.sec.gov/interps/legal/slbcf5.htm.
[31] Securities and Exchange Commission, Important Information
about EDGAR, Sept. 28, 1999, http://www.sec.gov/edgar/
aboutedgar.htm.
[32] Securities and Exchange Commission, Third Report on the
Readiness of the United States Securities Industry and Public
Companies To Meet the Information Processing Challenges of
the Year 2000, July 1999, http://www.sec.gov/news/studies/
yr2000-3.htm.
[33] Securities and Exchange Commission (updated by R.A. Sand-
ers and S.K. Das), EDGAR Filer Information: Electronic Fil-
ing and the EDGAR System: A Regulatory Overview,
Washington, DC, SEC, Nov. 14, 2000, http://www.sec.gov/
info/edgar/overview1100.htm.
[34] Securities and Exchange Commission, EDGAR Filer Informa-
tion: Electronic Filing and the EDGAR System: A Regulatory
Overview, May 15, 2000, http://www.sec.gov/info/edgar/
ednews/edreg2ka.htm.
[35] Securities and Exchange Commission, Final Rule: Rulemak-
ing for EDGAR System: RIN 3235-AH79, Rulemaking for
EDGAR System, Nov. 6, 2000, http://www.sec.gov/rules/
final/33-7855.htm.
[36] Securities and Exchange Commission, The Investor’s Advo-
cate: How the SEC Protects Investors and Maintains Market
Integrity, Mar. 1, 2001, http://www.sec.gov/about/whatwedo.
shtml.
[37] Securities and Exchange Commission, EDGAR Filer Manual
v. 8.0, New Version: September 21, 2001, http://www.sec.gov/
info/edgar/filermanual.htm.
[38] Securities and Exchange Commission, Private communication
with the SEC’s Internet Support Staff, April 25, 2001.
[39] Securities and Exchange Commission, Private communication
with the SEC’s Edgar Program Manager, Nov. 30, 2001.
[40] Securities and Exchange Commission, SEC FOIA Program
The Freedom of Information Act: What It Is, What It Does,
October 9, 2001, http://www.sec.gov/foia.shtml.
[41] Securities and Exchange Commission, Termination of Legacy
EDGAR on April 20, 2001, April 2001, http://www.sec.gov/
info/edgar/ednews/endlegacy.htm.
[42] Securities and Exchange Commission, Edgar Filer Manual,
Release 8.0, SEC, Washington, DC, Sept. 2001, http://
www.sec.gov/info/edgar/filermanual.htm.
[43] F. Song, W.B. Croft, A general language model for infor-
mation retrieval, Proceedings of Eighth International Confer-
ence on Information and Knowledge Management, Kansas
City, MO, November 2–6, http://ciir.cs.umass.edu/pubfiles/
ir-171.pdf.
[44] C. Taylor, Millennium Madness: The History And The Hype,
Time.com, (1999), http://www.bobbemer.com/taylor.htm
(http://www.bobbemer.com/QUOTES.HTM).
[45] XBRL.org, Extensible Business Reporting Language Specifi-
cation, version 2.0, 2001, http://www.xbrl.org/tr/2001/xbrl-
2001-11-14-draft.doc.
[46] XBRL.org, Overview/Facts Sheet, 2001, http://www.xbrl.org/
Overview.htm.
John Gerdes, Jr. received his BS and
M.Eng. degrees in Mechanical Engineering
in 1976 and 1977, respectively, from
Cornell University; an MBA in 1981 from
Lehigh University; and an MS in Computer
Science and PhD in Information Systems in
1994 and 1996, respectively, from Van-
derbilt University. He was a Visiting
Assistant Professor in the Fisher College
of Business, Ohio State University, from
1996 to 1998. Since 1998, he has been an
Assistant Professor in Information Systems
at the A. Gary Anderson Graduate School of Management, Uni-
versity of California, Riverside. Research interests include Web
Data Mining, Distance Learning, Decision Support Systems and
Electronic Commerce.