[IEEE 2012 Annual SRII Global Conference (SRII) - San Jose, CA, USA (2012.07.24-2012.07.27)] 2012...


Document Quality Checking Tool for Global Software Development

Kohtaroh Miyamoto, Takashi Nerome, Taiga Nakamura IBM Japan

Tokyo, Japan {kmiya, nerome, taiga}@jp.ibm.com

Abstract—Software development projects often utilize global resources to reduce costs. Typically a large volume of unstructured office documents is involved. Unfortunately, in many cases the low quality of unstructured documents due to various location-related barriers (e.g. time zones, languages, and cultures) can cause negative effects on the outcomes of projects. Several approaches have been introduced for document quality checking, but they have not generalized well enough to handle various unstructured documents in a broad range of projects. Based on past experience, we have prepared guidelines, templates, rules, and document quality-checking tools for designing and developing global software development projects. In this paper we specifically focus on the effectiveness of our document quality checking tool. The challenges for such a checking tool are that it must be generally adaptive and also highly accurate to be practical for industrial use. Our approach is template-based and consists of an extraction process for the physical-syntactic structure, a transformation process for the logical-semantic structure, and an analysis process. Our experiments inspected 66 authentic customer documents, detecting 118 errors. The accuracy as measured by the true-positive ratio (accurately detected true errors) was 98.3% and the true-negative ratio (accurately detected non-errors) was 99.4%.

Keywords-services; document modeling; global development; document quality checking

I. INTRODUCTION

The world is rapidly globalizing, and it is well known that many hardware products are assembled from parts and materials from all over the world. This is also becoming true for software development [31]. The motivation for many companies in the software industry is to lower the total cost while maintaining the quality.

The rapid shift to such global software development has introduced many new challenges [32]. Just to name a few, there are time zones, cultures, and languages. Time zone differences make it difficult to ask real-time questions so each detail of the specification must be precise and clear. Cultural differences can also affect holidays and working hours. The differences in native languages create language barriers, and there are also dialects within many languages.

The errors in various documents can cause negative effects on the outcomes of projects or within the development phases [1][2]. The negative effect is amplified because many projects produce a huge number (often hundreds if not thousands) of documents. Therefore we initiated a project to standardize documentation and rules for global software development and then developed a document quality checking tool, which is the focus of this paper.

The checking tool faces a huge challenge since it must:
1. Support unstructured office documents
2. Handle relationships among documents
3. Support various rule types
4. Be independent from specific language sets
5. Have high accuracy
Several approaches have been introduced for document quality checking, but they were not generalized well enough to handle various unstructured documents for various global projects.

Our approach is an extension of our previous work on document quality checking for various project-specific formalisms [3]. In this work we extend our approach with a generalized template-based approach to cover various global software development projects.

This paper is structured as follows. In Section II we state the requirements and provide an overview of document checking. In Section III we describe our approach, and in Section IV we present the platform of our tool. The results of our experiment from the initial pilot program are discussed in Section V. Section VI compares our approach with other related approaches. In Section VII we summarize this paper.

II. OVERVIEW

To realize our effort of global standardization of software development documents, first we created a guideline which defines how the development must be conducted and how each document should be written. Based on that guideline, we defined the templates and a set of rules to be applied. With the templates and rules we then created a document checking tool that can check each document for violations of the rules. In this paper we focus on the checking tool of this effort.

A. Requirements for the checking tool

For a document checking tool many special requirements must be satisfied. Here are the major requirements:

1) Handle Various Document Formats

Office documents such as Microsoft Word and Excel are used in various projects. Though these documents are partially structured, the users basically have the flexibility to edit freely in any format. There is no restriction for a

2012 Service Research and Innovation Institute Global Conference

978-0-7695-4770-1/12 $26.00 © 2012 IEEE

DOI 10.1109/SRII.2012.37


project-specific document format, so we consider these office document formats as “unstructured documents”.

There is an estimate that 80% of all data is unstructured [5]. Although it is hard to say how this maps to the ratio for documents, it seems clear that we must assume a high ratio of unstructured documents in various projects.

There is a benefit in usability, since most people are quite accustomed to using office software in their daily work. Therefore for practicality there is a strong requirement to support such document formats. Many techniques have been developed for handling structured software information [18]. However, compared to handling structured documents or information with strictly defined fields for well-defined forms or models, there is obviously a huge challenge in supporting unstructured documents, since extracting the target elements precisely is difficult.

2) Check Relationships between Documents

There is typically a large number of documents in a software development project. The document checking tool must be able to handle relationships among many combinations of documents. For example, in certain cases it needs to check whether an ID string in a field in one format (such as Microsoft Excel) matches an ID written in a different format (such as Microsoft Word). Therefore our tool cannot reside inside the document editing software (such as a macro embedded in a Microsoft Word file), but rather must work outside as an independent system.

3) Flexibility to Define Various Rules

Other than relationships among documents, there are various types of rules which need to be checked: for example, whether or not the document matches the given template, whether defined naming conventions are followed correctly, or whether mandatory fields are left blank. The checking tool must be flexible enough to define such various rules.

4) Language Independence

In a customer-related project, the target language of the documents is decided by the client. For a multinational corporation, we must consider clients located all over the world using a wide range of languages. Though language-specific approaches are used in many areas, for our purposes the document checking tool must not depend on a specific set of languages.

5) High Accuracy

The performance requirements for a document checking tool can be generalized as speed and high accuracy. Generally, speed is now a minor issue since powerful computers have become so inexpensive. In general, the time required for the manual work of the authors and editors to prepare the documents is so large that the time required by the checking tool is relatively insignificant.

In contrast, the accuracy of the checking tool is crucial for acceptable performance. Depending on how a checking tool works, the accuracy can be quite different. Incorrect

detections from checking tools can be categorized into two cases[7]:

• False positive: The checking tool reports an error that is not actually an error. In this case, when a user reviews the results of the checking tool, the proposed change should be rejected.

• False negative: The checking tool does not report an error where one exists. This usually causes the error to be left unnoticed and the error is passed along to the next development phase. Sometimes a user may notice and repair the error.

In order to meet these requirements, we designed our tool so that it is independent of any specific document authoring system, so that it can handle unstructured documents and their relationships. For flexibility in defining various rule types, the tool supports various rule operators. It is also designed to analyze unstructured documents not via their language-dependent content but rather via their (language-independent) structure. Finally, we designed our overall architecture by combining robust primitives to achieve high accuracy.

B. Templates

There are many categories of tools to consider. For example, for defining and authoring the models there are many tools in use (e.g. Rational Software Architect: RSA [35]). For writing descriptions Microsoft Word is often used, and when a spreadsheet is required, Microsoft Excel is popular. For a universal template we need to define a standard format for the documents. Since we found that all of the existing tools can export their data to the Excel format, we decided to define such templates in Excel.

Table I shows the types of documents, their formats, and the numbers of templates. N/A in the templates column indicates that either (1) the templates for that kind of document are shared with other documents, or (2) the target document is created by a specified tool and no template is used.

TABLE I. NUMBER OF TEMPLATES PER DOCUMENTS

Type of Document | Format | Templates
1. Use case list | Excel | 1
2. Use case description | Excel | 1
3. UI/Forms specification | Excel | 16
4. Interface specification (Logical) | Excel | 2
5. Data item list | Excel | 1
6. Application design document 1 | Excel | 1
7. Design domain model details 1 | Excel | 1
8. Physical database design 1 | Excel | 4
9. Operational model | Excel | 1
10. Interface specification | Excel | N/A
11. Program module specification | Excel | 2
12. Application design document 2 | Excel | N/A
13. Design domain model details 2 | Excel | N/A


14. Data model | Excel | 1
15. Physical database design 2 | Excel | 7
16. Database transaction description | Excel | N/A
17. Settings parameter | Excel | 2
18. IT service decision | Excel | 4
19. Domain use case | Word | 1
20. System use case | Word | 1
21. Modeling 1 | Excel | N/A
22. Modeling 2 | Excel | N/A
23. Modeling 3 | Excel | N/A
Total | | 46

(To cover the various types of documents, 46 templates were prepared. The details of the templates are beyond the scope of this paper.)

C. Rules

Based on the guidelines and templates, 144 rules were formulated. Rules for various document errors are standardized in IEEE Std. 830-1998 [6], but for simplicity we generally group the rules into 4 types:

(1) Template mismatches

Templates are important for a standard approach so that the formats of the documents will be unified. Therefore, it is necessary to check each document for template mismatches. For example, the label tag fields in the template must also exist in the document. For unstructured documents, they should be formatted to match the template. If an editor reformats the document, many errors will be introduced, but if nothing is changed, then no errors should be reported. Therefore this kind of error checking tends to detect either many errors or none.
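As a minimal sketch, a label-field check of this kind reduces to a set comparison. This is a simplified illustration (assuming the template and the document are each reduced to the set of label strings they contain), not the tool's actual implementation:

```python
# Simplified sketch: a template-mismatch check as a set difference.
# Assumes both sides have been reduced to their label strings.
def template_mismatches(template_labels: set[str], document_labels: set[str]) -> set[str]:
    """Return the labels required by the template but missing from the document."""
    return template_labels - document_labels
```

Consistent with the behavior noted above, an unmodified document yields an empty set, while a reformatted one typically yields many missing labels at once.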

(2) Naming convention violations

Standard naming conventions are defined for each of the various types of IDs. For example, there is a rule that all IDs representing a Screen ID must start with "SCR" followed by a two-digit number, such as "SCR03". All of the IDs need to be checked for violations of the naming conventions. How often this type of error occurs may depend on the individual author.
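A naming-convention check of this kind reduces to pattern matching. The sketch below uses Python regular expressions; the convention table and the second pattern are hypothetical, with only the Screen ID rule ("SCR" plus a two-digit number) taken from the text:

```python
import re

# Hypothetical convention table; only the "screen" pattern comes from the
# guideline described in the text (SCR + two-digit number, e.g. SCR03).
NAMING_RULES = {
    "screen": re.compile(r"SCR\d{2}"),
    "business_rule": re.compile(r"BR-[A-Z]{2}-\d{4}"),  # assumed, e.g. BR-OM-0001
}

def naming_violations(kind: str, ids: list[str]) -> list[str]:
    """Return the IDs that do not fully match the convention for `kind`."""
    pattern = NAMING_RULES[kind]
    return [i for i in ids if not pattern.fullmatch(i)]
```

With this rule, "SCR03" passes while "SCR1" and "SCR01A" are reported as violations.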

(3) Mandatory field left blank or omitted

Often in many documents there are a large number of fields to fill in. Depending on the design of the template, it may be unclear as to which fields are mandatory. Also, some fields may have strong or complicated relationships with other fields. For example, the value of a subcategory field depends on the higher level category. Such factors may cause mandatory fields to be left blank. Each mandatory field should be checked for appropriateness.

(4) Inconsistencies

The consistency of IDs within a document and among multiple documents needs to be checked. For example, in many cases, for each ID listed in a document, there must be corresponding documentation that describes the actual behavior. This type of error checking is important and is often hard to perform by manual inspection, since multiple fields may be referenced and omissions can cause serious problems by breaking the links among the information.

For the example in Figure 1, the rule is to check the consistency of an ID in a Word document against a Screen ID listed in Excel; due to the mismatch between "SCR01A" in Word and "SCR01" in Excel, an error should be detected. In this case, the checking tool also needs to look into the structural information to locate the text within a chapter, so the tool must either analyze the name (from the text between the headings "Event Flow" and "Preconditions") or look into the chapter numbers (the text between the 2nd chapter and the 3rd chapter). Then the checking tool needs to extract the screen IDs in the text that are expressed according to the naming convention. Also, Word and Excel use heterogeneous formats, so the error checking must transform both formats into a syntactic and semantic structure that can match the IDs from the two different formats. The method we designed and implemented for these transformations is discussed in Section III.

Figure 1. Document Checking Tool
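The consistency check in this example can be sketched as follows; the regular expression and the function are illustrative assumptions, not the tool's actual code:

```python
import re

# Assumed pattern for screen IDs as they appear in prose (SCR + two digits,
# optionally followed by a letter, so that malformed IDs like SCR01A are found).
SCREEN_ID = re.compile(r"SCR\d{2}[A-Z]?")

def inconsistent_ids(word_text: str, excel_ids: set[str]) -> set[str]:
    """IDs mentioned in the Word text that have no matching entry in the Excel list."""
    return set(SCREEN_ID.findall(word_text)) - excel_ids
```

With the Figure 1 data, checking a text that mentions "SCR01A" against an Excel list containing only "SCR01" reports "SCR01A" as inconsistent.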

Table II shows the number of rules for each rule type. The "Template mismatch" type has many fields to check, so it has the largest number of rules.


TABLE II. NUMBER OF RULES PER TYPE

Rule Type | Rules
Template mismatch | 61
Naming convention violation | 12
Mandatory field left blank | 21
Inconsistency | 28
Total | 122

III. DOCUMENT CHECKING TOOL DESIGN

In this section we describe our approach to modeling and analyzing documents. Figure 2 is an overview of our method. To present these steps, we use the example shown in Figure 3 as an input document written in a spreadsheet file. Note that the sheet named "List of Business Rules" is a table of business rules. Figure 4 shows a description of each business rule, where each description is written in a separate numbered sheet. The ID and name of the rule are written in specific positions. The description itself is written in natural language.

Figure 2. Document Checking Tool

The inputs are the rules, the templates, the documents, the analysis rules based on the rules, and the transformation rules based on the templates; the output is an error report. The main execution has several processes: an extraction process, a transformation process, an encoding process, and an analysis process. Each process is discussed in detail in the following sub-sections.

Figure 3. Example of Business Rules

Figure 4. Example of a Business Rule Description

A. Extraction Process

From the documents, a content "physical model" is extracted. A physical model is a structure representing the characteristic physical and syntactic features of the input document. A structure is defined for each application file type, such as a word processing file, a spreadsheet, or a presentation. In our implementation, the tool reads a set of input files, accesses the content of each file by invoking the


APIs of the corresponding application, and dumps the output into XML files with the structure representing the input content. Figure 5 shows part of the XML structure extracted from the example input document.

<input>
  <sheet name="Change History"> … </sheet>
  …
  <sheet name="List of Business Rules">
    ...
    <row y="5">
      <cell border="15" x="1" y="5">ID</cell>
      <cell border="11" x="2" y="5">Category</cell>
      <cell border="15" x="3" y="5">Name</cell>
      <cell border="15" x="4" y="5">Description</cell>
      <cell border="15" x="5" y="5">Reference processes/use cases</cell>
      <cell border="15" x="6" y="5">Note</cell>
      <cell border="4" x="7" y="5"/>
    </row>
    <row y="6">
      <cell border="7" x="1" y="6">BR-OM-0001</cell>
      <cell border="3" x="2" y="6">Data</cell>
      <cell border="7" x="3" y="6">Constraint on estimation and order relationships</cell>
      <cell border="7" x="4" y="6">Constraints regarding estimation overview, estimation detail, order overview, order detail, and relationship between estimation and order</cell>
      <cell border="7" x="5" y="6">Logical data model SUC-OM-0010</cell>
      <cell border="7" x="6" y="6"/>
      <cell border="4" x="7" y="6"/>
    </row>
    ...
  </sheet>
  <sheet name="0001">
    ...
  </sheet>
</input>

Figure 5. Physical structure dumped from the document

Each sheet is dumped into a <sheet> element, with a name attribute corresponding to the name of the sheet. Each row in the sheet is dumped as a child of <sheet> into a <row> element, under which <cell> elements are created for each cell in the row. Each <cell> element has x, y, and border attributes, which store the column position, the row position, and a 4-bit value representing the status of the cell borders (with each bit set to 1 if a border exists on the corresponding edge). This is a straightforward matrix structure preserving the information on the cell positions and borders while discarding all of the other attributes. This particular representation for a spreadsheet file is an implementation choice, but the approach works with any structure definition.
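Decoding the 4-bit border value can be sketched as follows; the paper does not specify which bit maps to which edge, so the assignment here (top=1, right=2, bottom=4, left=8) is an assumption:

```python
# Assumed bit-to-edge assignment; the text only states that each of the
# four bits marks the presence of a border on one edge of the cell.
EDGES = {1: "top", 2: "right", 4: "bottom", 8: "left"}

def decode_border(border: int) -> set[str]:
    """Return the set of cell edges that have a border."""
    return {name for bit, name in EDGES.items() if border & bit}
```

Under this assumption, border="15" in the dump above marks all four edges, while border="4" marks a single edge.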

This extraction process is designed so that it does not rely on text analysis, and thus it does not depend on the availability or the accuracy of text analysis technology for the target language. Therefore this approach is language independent.
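A consumer of such a dump can index the cells by sheet and position using only standard XML parsing, which is one reason the approach remains language independent. The sketch below is an illustrative reader, not the tool's code:

```python
import xml.etree.ElementTree as ET

def index_cells(xml_text: str) -> dict[tuple[str, int, int], str]:
    """Map (sheet name, x, y) to cell text for a dumped physical model."""
    root = ET.fromstring(xml_text)
    cells = {}
    for sheet in root.iter("sheet"):
        for cell in sheet.iter("cell"):
            cells[(sheet.get("name"), int(cell.get("x")), int(cell.get("y")))] = cell.text or ""
    return cells
```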

B. Transformation Process

The physical model is transformed, using the templates, into a "logical model". The set of documents will change from project to project, but ideally the templates should remain consistent with minimal updates.

The logical model is a structure representing the logical and semantic characteristics of the template. In this step, a user defines a transformation rule for each template type. In our implementation, this step uses XML transformations. The tool provides an interface to visually define the transformation rules for better usability. Figure 6 shows the logical structure. To obtain this structure from the XML in Figure 5, each transformation rule is defined to match against the physical features of the input files dumped as corresponding elements and attributes in the physical model.

<model fileName="BusinessRules.xls" modelName="brList">
  <brList id="BR-OM-0001" name="Constraint on estimation and order relationships" description="Constraints regarding estimation overview, estimation detail, order overview, order detail, and relationships between estimation and order" reference="Logical data model SUC-OM-0010"/>
  ...
</model>
<model fileName="BusinessRules.xls" modelName="brDef">
  <brDef defBrId="BR-OM-0001" defBrName="Constraint on estimation and order relationships">
    <brParagraph text="Below diagram shows the relationship between estimates and orders"/>
    <brParagraph text="The following constraints exist among these values"/>
    <brParagraph text="(1)"/>
    ...
  </brDef>
  <brDef defBrId="BR-OM-0002" defBrName="Approval authorization level for estimation">
    ...
  </brDef>
</model>

Figure 6. Logical structures as output by the document modeling

While the elements in a physical model represent where and how the content is written, those in a logical model represent what the content is about. Based on external observations, the user defines transformation rules specifying the mappings between the physical locations and notations and the logical data structure. For efficient transformation capabilities, we use a transformation engine similar to XSLT [4], a generic transformation language for XML. Since the transformation rules often involve standard high-level operations, those operations are explicitly provided as functions in the transformation engine.
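A transformation rule of this kind can be sketched in plain code. The row and column positions below (header at y=5, IDs in column 1, names in column 3, descriptions in column 4) follow the Figure 5 example, and the function itself is an illustrative stand-in for the engine's declarative rules:

```python
# Illustrative physical-to-logical mapping for the business-rule list sheet.
# `cells` maps (x, y) to cell text, as extracted from the physical model.
def transform_br_list(cells: dict[tuple[int, int], str], first_row: int = 6) -> list[dict]:
    rules, y = [], first_row
    while (1, y) in cells:                      # stop when the ID column runs out
        rules.append({
            "id": cells[(1, y)],
            "name": cells.get((3, y), ""),
            "description": cells.get((4, y), ""),
        })
        y += 1
    return rules
```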


For example, transformation rules for spreadsheets often include operations that traverse the cells along a column or row until the traversal reaches a cell border. In the transformation rule, this corresponds to searching the <cell> elements with the same x or y attribute value and an appropriate border attribute value.
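The traversal operation can be sketched as below; the right-edge bit value is an assumption, since the paper does not fix the bit assignment:

```python
RIGHT = 2  # assumed bit for "border on the right edge"

def traverse_row(borders: dict[tuple[int, int], int], start_x: int, y: int) -> list[tuple[int, int]]:
    """Walk cells with the same y, increasing x, until a right border ends the run."""
    path, x = [], start_x
    while (x, y) in borders:
        path.append((x, y))
        if borders[(x, y)] & RIGHT:   # the border terminates the traversal
            break
        x += 1
    return path
```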

In a typical model-centric approach, the information stored in the model is considered to be the master data, and the documents are essentially regarded as user-friendly views that can be generated from the master data. In contrast, the transformation in this step is a reverse conversion from a view to a model. The conversion in this reverse direction is not a completely new concept. Many products for requirements management support import functionality to migrate the content in existing documents into a managed data model. Once the old data is successfully imported, the model becomes the new master data.

Another function supported by many tools is to allow the use of conventional office applications as a front-end to display and edit the data. When a user updates and saves the document file, the changes are synchronized with the data managed by the tool. Nevertheless, certain important differences exist between these functions and our transformation:

• The expressive requirements of our transformation are higher because the inputs are arbitrary documents with various complex data representations.

• The structure of the logical model from our transformation is not bound to a structure or vocabulary defined by a particular method, tool, or standard.

C. Encoding Process

The encoding process transforms the "Analysis Rules" defined by the user into the "Encoded Rules". A sample format is shown in Figure 7. Each rule comprises an ID, a type, and a message, along with the rule definition.

<rules>
  <model pathName="model-DomModel3.xml"/>
  <model pathName="model-cmpmdl.xml"/>
  <check errId="DM3_20" errType="Error" message="Class Name does not exist in model list;;$1" status="enabled">
    <term name="IncludedIn" type="op">
      <term item="cmpmdl.cmpspec.class" type="item"/>
      <term item="DomModel3.classAlt.ClsAlt" type="item"/>
    </term>
  </check>
  <check errId="DM3_12" errType="Warning" message="Description field is empty" status="enabled">
    <term name="NotEmpty" params="altLocation=cmpmdl.grMemo.chapName" type="op">
      <term item="cmpmdl.grMemo.memoText" type="item"/>
    </term>
  </check>
</rules>

Figure 7. Example of Encoding Process Output
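The IncludedIn check from Figure 7 can be paraphrased as a set-membership test with message substitution; this sketch assumes that "$1" in the message stands for the offending value:

```python
# Sketch of evaluating an IncludedIn rule: every value of the first item
# must appear among the values of the second; "$1" is assumed to be a
# placeholder for the offending value in the error message.
def included_in(values: list[str], allowed: list[str], err_id: str, message: str) -> list[str]:
    allowed_set = set(allowed)
    return [f"{err_id}: {message.replace('$1', v)}" for v in values if v not in allowed_set]
```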

D. Analysis Process

Analytics are performed on the logical structures. Once the logical structure has been obtained, various checks for the documents are implemented as constraints within and among logical document models, without regard to the physical structures of the original input documents.

Sample output of the analysis process is shown in Figure 8. The output includes such information as the error ID, error type, message ID, message string, and error location.

<analysisoutput analysisDate="2012/02/15 18:24:50" rules="UC_02A UC_02B UC_02C UC_02D UC_02E UC_02F UC_03A UC_03B UC_04A UC_04B UC_04C UC_04D UC_04E UC_14" toolVersion="2.1.0-20110715182358" user="ibmac" xmlns:sqi="http://www.research.ibm.com/sqale/ns/internal">
  <input attributes="48" characters="906" className="uc" digest="6978a121205637bab25fd090326229dd71339529" elements="28" pathName="C:\SQALE\sqale-work\input\UC04-01_checkcontract.doc" timestamp="2011/06/20 11:33:14">
    <sqi:error id="UC_02A" loc="" message1="Subject Area is empty" no="20110715-182447-0" type="Error"/>
  </input>
</analysisoutput>

Figure 8. Example of Analysis Output
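An error report like the one in Figure 8 can be consumed with standard namespace-aware XML parsing; the following is an illustrative reader, not part of the tool:

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample output in Figure 8.
SQI = {"sqi": "http://www.research.ibm.com/sqale/ns/internal"}

def collect_errors(xml_text: str) -> list[tuple[str, str, str]]:
    """Collect (file path, error id, message) triples from an analysis output."""
    root = ET.fromstring(xml_text)
    return [(inp.get("pathName"), err.get("id"), err.get("message1"))
            for inp in root.iter("input")
            for err in inp.findall("sqi:error", SQI)]
```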

IV. PLATFORM FOR DOCUMENT MODELING AND ANALYSIS

We developed a tool that provides a user with a platform to define the rules for document modeling and analysis and to perform the analysis. The tool is implemented in Java™ as a set of Eclipse plug-ins.

A. User Interface

The platform provides a visual editor to design rules for document modeling and analysis. Figure 9 shows a screen shot of the document modeling tool and Figure 10 shows the analysis tool. The user starts by creating a "designer" project in the Eclipse workspace.

For each document format, the user creates a definition for document modeling in which the transformation rule can be graphically defined by mapping the input items representing the physical elements in the document to the output items representing their logical structure. A graphical definition is converted internally into a transformation rule. This is analogous to a graphical editor for standard XSLT, although the interface of the tool allows the user to define the rule at a higher level of abstraction. For example, the user can select and add the "cell" input item from the toolbox with a simple drag-and-drop operation using the mouse, and set the cell coordinates and a pattern as parameters. This definition is then converted into a matching process to find


any <cell> elements and check if the x and y attribute values are within the coordinate range specified, as well as checking if the text value matches the specified pattern.

Figure 9. Editor User Interface for Transformation Rules

Figure 10. Editor User Interface for Analysis Rules

For example, to check if each pair of business rule ID and name in the rule definition shown in Figure 4 has a corresponding entry in the list of business rules shown in Figure 3, the user can define an analysis rule using two instances of the concatenate operator and one instance of the includedIn operator as shown in Figure 10. The first concatenate operator combines the value of ID and name in the business rule description, and the second one adds the values from the list. The includedIn operator then checks whether each value created by the first concatenate operator matches one of the values created by the second concatenate operator. If no match is found, an error message is output as defined by the analysis rule.
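The two-concatenate-plus-includedIn rule described above amounts to the following check (an illustrative paraphrase; simple ID/name pairs stand in for the operator graph of Figure 10):

```python
# Sketch: concatenate ID and name on both sides, then test inclusion.
def check_br_definitions(definitions: list[tuple[str, str]],
                         listed: list[tuple[str, str]]) -> list[str]:
    listed_keys = {rid + name for rid, name in listed}        # second concatenate
    return [f"No list entry for {rid}"                        # error per unmatched pair
            for rid, name in definitions
            if rid + name not in listed_keys]                 # first concatenate + includedIn
```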

The tool provides basic operators for the analysis rule definitions. Table III shows examples of the operators. There are operators for string manipulations and comparisons, set operations, and logical operations. "String Manipulations and Comparisons" supports basic operations on arbitrary strings; here, regular expressions (a generic method that provides a concise and flexible means for matching, specifying, and recognizing strings of text) are supported to define certain string rules. "Set Operations" support basic operations on sets of data. "Logical Operations" are the basic logical operators "and", "or", and "not".

TABLE III. EXAMPLE OF ANALYSIS OPERATORS

String Manipulations and Comparisons | concatenate, replace, getPatterns, count, equalsTo, notEmpty, matchesPattern
Set Operations | isUnique, getUnique, includedIn, getIncludedIn, containsOneOf, getContainingOneOf, lookup
Logical Operations | and, or, not

The set of predefined analysis operators is intended to provide functionality sufficient for describing all of the practically useful rules. When necessary, however, additional operators can be defined by adding a plug-in that uses the appropriate extension point, via the extension mechanism provided by Eclipse.

To perform an analysis, the user creates a runtime project from the designer project. The runtime project is generated by collecting the rule definitions in the document models in the designer project and “compiling” them into an executable package.

B. Execution of Runtime

The executable package can then run as a standalone tool independent of the platform. It can be executed manually at any time, or it can run automatically when documents are checked in or committed. Currently the platform supports Rational Team Concert (RTC) [30] as the document control system.

For good performance, the platform checks whether a document has been updated since the last error check. Only if there are changes will the platform run the transformation and create an updated model.

V. RESULTS

We initiated a pilot program for evaluation. This pilot was aimed at obtaining an initial analysis before applying the tool to large projects involving huge numbers of documents. There were many aspects being evaluated, but here we will focus on the evaluation of our document checking tool for three criteria:

• number of errors per execution
• accuracy of the error detection
• types of errors that were found

Here are the conditions of our experiment:

• We used actual data from a Japanese customer, and all of the documents were written in Japanese.
• The software requirements documentation and the project leadership were in Japan.
• Three employees in China were assigned to write the documentation and do the development.



• The native language of the workers was Chinese, but they were sufficiently fluent in Japanese to complete all of the assigned tasks.

• The basic training for understanding the framework was done in the same manner as any other project.

• The communications between China and Japan were handled with conventional email and phone calls.

• The Chinese employees wrote several documents, including the application design and domain use cases.

• The documents were checked in and committed and then checked with our system.

• A total of 66 documents were created to cover all of the templates shown in Table I.

• All of the checking rules in Table II were covered.

A. Error Detection Results

The tool was used 12 times when new documents were committed into the RTC, and errors were detected in 10 of those examinations. For the target documents, 122 rules were applied, and 118 errors were detected. Therefore, there were 1.79 errors per document.

Figure 11 shows the actual error detection results (solid line) at each tool execution compared to a "what-if" situation (dashed line) in which the document checking tool did not exist or was not used and, as a result, the errors were not found. We also assume that the errors would have continued to accumulate.

Figure 11. Number of errors per execution

The number of errors shows some variation. In the executions where many errors were found, either a relatively large number of documents had been updated (the 1st, 4th, and 7th executions) or human error had introduced a number of inconsistency errors (the 3rd and 9th executions).

We were able to detect most of the errors (118 out of 120) compared to the situation without the error checking tool, clearly showing that we were able to prevent erroneous documents from being distributed and used in the next phase or at another location. Also, only a limited number of documents was checked in this pilot program; many real projects involve hundreds (or in some cases more than a thousand) of documents, so the consequences without such checking would be far more severe.

B. Accuracy of the Error Detection

We also studied the accuracy of the error detection. First, we separated the detection results into actual errors and false positives. Then we carefully investigated each document for cases where no error was reported but an error actually existed. Finally, we checked all of the elements that were targets for error detection (in this case, all of the subject cells of the Excel files and all of the subject words in the Word files). For example, there were cases with 5 labels for "template mismatch" in a single document. The results are shown in Table IV.

TABLE IV. ACCURACY OF ERROR DETECTION

                              Actual State
                          Error          Not Error
Detection    Error        118 (98.3%)    2 (1.7%)
Result       Not Error    5 (0.6%)       832 (99.4%)

There were 118 cases (98.3%) where actual errors were correctly detected as errors (true positives [7]). There were 2 false-positive cases (1.7%) where spurious errors were reported; in both of these cases, the regular expression in one of the analysis rules was incorrect.

There were 832 true-negative elements (99.4%) where actual non-error elements were correctly ignored, and 5 false-negative results (0.6%) where actual errors were not detected. All five detection failures had the same cause: in one of the transformation rules, a regular expression for a file name did not match the actual configuration of the files.
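The ratios above follow directly from the confusion-matrix counts in Table IV, with each detection-result row normalized by its own total. A minimal sketch (counts taken from the table; the variable names are ours):

```python
# Confusion-matrix counts from Table IV of this experiment
tp, fp = 118, 2    # detected as error:  actual error / not an error
fn, tn = 5, 832    # detected as non-error: actual error / not an error

# Each row of the detection result is normalized by its own total,
# which is how the paper reports the percentages.
tp_ratio = tp / (tp + fp)   # true positives among reported errors
fp_ratio = fp / (tp + fp)   # spurious reports among reported errors
tn_ratio = tn / (tn + fn)   # correctly ignored non-errors
fn_ratio = fn / (tn + fn)   # missed errors among non-reports

print(f"TP {tp_ratio:.1%}, FP {fp_ratio:.1%}, "
      f"TN {tn_ratio:.1%}, FN {fn_ratio:.1%}")
# → TP 98.3%, FP 1.7%, TN 99.4%, FN 0.6%
```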

Our results show that we were able to achieve high accuracy. We did not identify any incorrect behavior in the extraction, transformation, encoding, or analysis processes. We attribute this high accuracy mainly to the fact that our approach is composed of relatively simple elements, and simple designs often result in high reliability [8].

C. Types of Errors

We studied the types of errors that were detected, based on the types of rules defined in Table II. The results are shown in Table V.

TABLE V. ERRORS PER TYPE

Rule Type                                  Errors
Template mismatch                               0
Naming convention violation                     0
Mandatory field left blank or omitted          41
Inconsistency                                  77
Total                                         118



Each document was written based on a template, and therefore there were no template mismatch errors. Similarly, the documents were carefully written to match all of the naming conventions, so no errors were detected in that category. Many errors were detected, however, for mandatory fields being left blank or omitted and for inconsistencies. These types of errors occur when the worker needs to consider multiple fields simultaneously, which coincides with the theory that human errors tend to occur frequently in stressful situations, such as when multiple factors must be handled at once [9]. As discussed earlier, these inconsistency errors are hard to detect manually; the fact that many of them were detected demonstrates the effectiveness of our tool.
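To illustrate the two rule types that dominated the results, the following sketch applies a mandatory-field check and a cross-document inconsistency check to extracted logical structures. The field names, the `documents` dictionary, and the rule shapes are hypothetical illustrations, not the tool's actual implementation:

```python
# Hypothetical logical structures extracted from two documents
# (field name -> value, as produced by a transformation step)
documents = {
    "design.xls":  {"module_id": "MOD-012", "author": "",   "status": "Draft"},
    "usecase.xls": {"module_id": "MOD-12",  "author": "Li", "status": "Draft"},
}

errors = []

# Rule type: mandatory field left blank or omitted
for name, fields in documents.items():
    for field in ("module_id", "author"):
        if not fields.get(field, "").strip():
            errors.append((name, f"mandatory field '{field}' is blank"))

# Rule type: inconsistency -- the same logical field must agree across documents
values = {name: fields["module_id"] for name, fields in documents.items()}
if len(set(values.values())) > 1:
    errors.append(("*", f"inconsistent 'module_id' values: {values}"))

for doc, msg in errors:
    print(f"{doc}: {msg}")
# Reports the blank 'author' in design.xls and the MOD-012/MOD-12 mismatch.
```

As the example suggests, the inconsistency check spans documents, which is exactly what a per-document manual review tends to miss.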

VI. RELATED WORK

In this section we discuss our method in relation to other work. There are many approaches for consistency checking, document model transformations, and handling unstructured documents.

A. Consistency Checking

According to a survey of consistency management [10], there are four general approaches to detecting and resolving inconsistencies among documents:

(1) logic-based approaches
(2) model-checking approaches
(3) specialized model analysis approaches
(4) human-centered collaborative approaches

Logic-based approaches such as [11][12][14][19][20] assume the models are expressed in some formal language. Model-checking approaches [21][22] require a state-oriented modeling expression. Neither technique can be applied in our situation, since we are checking unstructured documents.

In work on specialized model analysis approaches, inconsistencies are detected by monitoring software changes [9], by defining viewpoints [24], or by extracting to a specialized XML format [25]. While these techniques detect inconsistencies among multiple documents, they do not address other rule types, such as template mismatches, naming convention violations, or mandatory field problems. Our approach generalizes much more widely.

The human-centered approach (such as [23]) conflicts with our requirements: since our goal is to catch errors accidentally created by humans, a human-centered review cannot be assumed to achieve high accuracy. Language independence and support for various document formats are also very challenging for such approaches.

B. Document Model Transformation

We have introduced a method for transforming the physical structure of unstructured documents into logical structures. The problem of identifying logical-semantic structures from physical representations has been studied in prior research. The two most active applications are image recognition [26][27], which attempts to identify text and objects in scanned images, and Web table understanding [28][29], which attempts to recognize the logical structures of tables in web pages.

Due to the characteristics of the intended applications, their main focus is on fully automated recognition over large amounts of data, rather than on explicitly configuring transformation rules from physical to logical structures. Therefore they are not designed for language independence or high accuracy. Effective configurations for transforming unstructured documents from syntactic to semantic structures appear to remain relatively little studied.

C. Error Checking for Unstructured Documents

Many related studies (such as [13][15][16][17]) have checked various aspects of unstructured documents by relying on text analysis technologies. While this seems promising and rather straightforward, our tool has to work with various languages and dialects, and text analysis approaches are unfortunately limited to a certain set of major languages (for example, 65 languages [34]). Our approach of analyzing the structural information is therefore both robust and language independent.

One might also consider using Microsoft Office macros embedded in specific documents for checking, but such an approach can neither handle the variety of unstructured document types nor check inconsistencies among documents.

There is also much research on mining meaningful data from massive unstructured data [33], but high accuracy is not the target of such technology.

VII. SUMMARY

In this paper we discussed the effectiveness of our document checking tool for global software development. We faced difficult challenges, such as supporting unstructured document formats, language independence, and assuring high flexibility in the types of rules that can be defined. Templates and rules were defined based on guidelines; "analysis rules" were derived from the rules, and "transformation rules" were derived from the templates. Our checking tool takes a template-based model-transformation approach. Physical-syntactic structures are extracted from the documents and transformed into logical-semantic structures with the transformation rules. The encoding process produces the encoded rules, and errors are analyzed from the encoded rules together with the logical-semantic structures.
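The extract → transform → analyze flow summarized above can be sketched in a few lines. The function names, the cell-to-field rule format, and the regex-encoded analysis rules are illustrative assumptions rather than the actual tool:

```python
import re

def extract(document):
    """Extract the physical/syntactic structure (here: cell -> text)."""
    return dict(document)  # the real tool parses Office files at this step

def transform(physical, transformation_rules):
    """Map physical locations to logical field names via transformation rules."""
    return {field: physical[cell]
            for cell, field in transformation_rules.items()
            if cell in physical}

def analyze(logical, analysis_rules):
    """Apply encoded analysis rules (regex patterns) to the logical fields."""
    return [f"'{field}' violates {pattern!r}: {logical.get(field, '')!r}"
            for field, pattern in analysis_rules.items()
            if not re.fullmatch(pattern, logical.get(field, ""))]

# Illustrative document and rules (hypothetical names and formats)
physical = {"B2": "MOD-12", "B3": ""}
transformation_rules = {"B2": "module_id", "B3": "author"}
analysis_rules = {"module_id": r"MOD-\d{3}", "author": r".+"}

errors = analyze(transform(extract(physical), transformation_rules),
                 analysis_rules)
print(errors)  # both fields fail their rules in this example
```

Keeping the three stages separate is what lets templates (transformation rules) and checking rules (analysis rules) be configured independently per project.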

We tested our tool and were able to achieve high accuracy. The true-positive ratio was 98.3% and the true-negative ratio was 99.4%.

We recognize that our work is still in its initial stages. Our future work will apply our system to more projects and gather feedback to enhance the guidelines, templates, rules and the checking tool itself for higher applicability.



ACKNOWLEDGMENT

We would like to thank all the people who have given us valuable advice in formalizing this paper, and also the people who contributed to this project.

REFERENCES

[1] B. Curtis, H. Krasner, and N. Iscoe, "A field study of the software design process for large scale systems," Communications of the ACM, Vol. 31, No. 11, pp. 1268-1287, 1988.

[2] B. W. Boehm, "Software Engineering," IEEE Transactions on Computers, Vol. C-25, No. 12, pp. 1226-1241, Dec. 1976.

[3] T. Nakamura, H. Takeuchi, F. Iwama, and K. Mizuno, "Enabling Analysis and Measurement of Conventional Software Development Documents Using Project-specific Formalism," 2011 Joint Conference of the 21st International Workshop on Software Measurement and the 6th International Conference on Software Process and Product Measurement, pp. 48-54, 2011.

[4] XSL Transformations (XSLT) Version 2.0, http://www.w3.org/TR/xslt20/

[5] J. Gantz, "The Expanding Digital Universe," An IDC White Paper sponsored by EMC, 2007.

[6] IEEE Std 830-1998, "IEEE Recommended Practice for Software Requirements," Software Engineering Standards Committee of the IEEE Computer Society, Jun. 1998.

[7] J. Neyman and E. S. Pearson, "On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference, Part I," Joint Statistical Papers, Cambridge University Press, pp. 1-66, 1967.

[8] B. W. Boehm and P. N. Papaccio, "Understanding and Controlling Software Costs," IEEE Transactions on Software Engineering, Vol. 14, No. 10, pp. 1462-1475, Oct. 1988.

[9] D. A. Wiegmann and S. A. Shappell, "A Human Error Approach to Aviation Accident Analysis: The Human Factors Analysis and Classification System," Ashgate Pub Ltd, ISBN-10: 0754618730.

[10] G. Spanoudakis and A. Zisman, "Inconsistency Management in Software Engineering: Survey and Open Research Issues," Handbook of Software Engineering and Knowledge Engineering, World Scientific Publishing Co., ISBN 981-02-4973, pp. 329-380, 2001.

[11] B. Nuseibeh and A. M. Russo, "Using Abduction to Evolve Inconsistent Requirements Specifications," Australian Information Systems Journal, Vol. 7, pp. 118-130, 1999.

[12] T. Olsson and J. Grundy, "Supporting Traceability and Inconsistency Management Between Software Artifacts," Proceedings of the IASTED International Conference on Software Engineering and Applications, Boston, MA, Nov. 2002.

[13] A. Fantechi and E. Spinicci, "A Content Analysis Technique for Inconsistency Detection in Software Requirements Documents," Anais do WER05 - Workshop em Engenharia de Requisitos, Porto, Portugal, pp. 245-256, 2005.

[14] D. Alrajeh, J. Kramer, A. Russo, and S. Uchitel, "An Inductive Approach for Modal Transition Systems Refinement," Technical Communications of the 27th International Conference on Logic Programming, ICLP'11, pp.106-116.

[15] L. Kof, "Natural Language Processing for Requirements Engineering: Applicability to Large Requirements Documents," in Proc of the Workshops 19th International Conference on Automated Software Engineering, 2004.

[16] H. Yang, A. D. Roeck, V. Gervasi, W. Alistair, and B. Nuseibeh, "Analysing anaphoric ambiguity in natural language requirements," Requirements Engineering, Vol. 16, No. 3, pp. 163-189, 2011.

[17] A. Sinha, A. Paradkar, H. Takeuchi, and T. Nakamura, "Extending Automated Analysis of Natural Language Use Cases to Other Languages," 2010 18th IEEE International Requirements Engineering Conference, pp. 364-369.

[18] W. Visser, K. Havelund, G. Brat, S.J. Park, and F. Lerda, "Model Checking Programs," Automated Software Engineering, vol. 10, No. 2, DOI: 10.1023/A:1022920129859

[19] K. Lano, J. Biccarregui, and A. Evans, "Structured Axiomatic Semantics for UML," Proceedings of the 3rd Workshop on Rigorous Object Oriented Method, York, January 2000

[20] A. van Lamsweerde and E. Letier, "Handling Obstacles in Goal-Oriented Requirements Engineering," IEEE Transactions on Software Engineering, Special Issue on Exception Handling, 2000

[21] C. Heitmeyer, R. Jeffords, and D. Kiskis, "Automated Consistency Checking Requirements Specifications," ACM Transaction on Software Engineering and Methodology, Vol. 5, No. 3, pp.231-261, 1996

[22] J. Holzmann, "The model Checker SPIN," IEEE Transactions on Software Engineering, Vol. 23, No. 5, pp. 279-295, 1997

[23] G. Kotonya and I. Sommerville, "Requirements Engineering and Viewpoints," Software Engineering Journal, Vol. 11, No. 1, January, pp. 5-18, 1999

[24] H. Delugach, "Analyzing Multiple Views of Software Requirements," in Conceptual Structures: Current Research and Practice, Ellis Horwood, New York, pp. 391-410, 1992.

[25] A. Zisman, W. Emmerich, and A. Finkelstein, "Using XML to Specify Consistency Rules for Distributed Documents," in Proceedings of the 10th International Workshop on Software Specification and Design (IWSSD-10), Shelter Island, San Diego, California, November 2000.

[26] X. Lin and X. Lin, "Text-mining based journal splitting," in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 1075-1079.

[27] D. W. Embley, M. Hurst, D. Lopresti, and G. Nagy, "Table-processing paradigms: a research survey," International Journal of Document Analysis and Recognition (IJDAR), Vol. 8, No. 2-3, pp. 66-86, 2006.

[28] M. Hurst and T. Nasukawa, "Layout and language: integrating spatial and linguistic knowledge for layout understanding tasks," in Proceedings of the 18th Conference on Computational Linguistics (COLING '00), Vol. 1, Stroudsburg, PA, USA: Association for Computational Linguistics, 2000, pp. 334-338.

[29] R. Zanibbi, D. Blostein, and R. Cordy, "A survey of table recognition: Models, observations, transformations, and inferences," International Journal on Document Analysis and Recognition, Vol. 7, pp. 1-16, March 2004.

[30] Rational Team Concert (RTC), http://www-01.ibm.com/software/rational/products/rtc/

[31] D. Damian and D. Moitra, "Guest Editors' Introduction: Global Software Development: How Far Have We Come?," IEEE Software, Vol. 23, Issue 5, pp. 17-19, September 2006.

[32] J. Noll, S. Beecham, and I. Richardson, "Global software development and collaboration: barriers and solutions," ACM Inroads, Vol. 1, No. 3, pp. 66-78, 2010.

[33] R. Feldman, "Mining Unstructured Data," Tutorial notes of the fifth ACM SIGKDD, 1999

[34] Supporting Languages in Google Translate, http://translate.google.com/translate_tools

[35] Rational Software Architect, http://www-01.ibm.com/software/awdtools/swarchitect/
