UNIVERSITY OF WATERLOO… · commercial or open source Optical Character Recognition (OCR) tool...

UNIVERSITY OF WATERLOO Faculty of Engineering Optical Character Recognition Recommendation Prepared By: Eugene Chung I.D. # 20302048 2A Mechatronics Engineering August 31, 2010


UNIVERSITY OF WATERLOO

Faculty of Engineering

Optical Character Recognition Recommendation

Prepared By:

Eugene Chung

I.D. # 20302048

2A Mechatronics Engineering

August 31, 2010


149 Halterwood Circle

Markham, Ontario

L3P 7T2

August 31, 2010

Professor Sanjeev Bedi,

Director of Mechatronics Engineering

Department of Mechanical and Mechatronics Engineering

University of Waterloo

200 University Avenue West,

Waterloo, Ontario

N2L 3G1

Dear Professor Bedi:

I have written this report, “Optical Character Recognition Recommendation”, as

the second of four report submissions for graduation. This work term followed my

completion of 2A academic term in April 2009. The purpose of this report is to select the

best Optical Character Recognition (OCR) solution for the joint research program

between the Medical Imaging Informatics Research Center at McMaster, the National Research

Council and Agfa Healthcare.

The main goal of the joint research program is the completion of the Radiation

Exposure Monitoring project to enable a standardized repository for radiation exposure

research. The OCR component of the project is a major commitment for this work term.

I conducted the research, testing, data collection, design, implementation and

documentation with minimal assistance.

This report was written entirely by me and has not received any previous

academic credit at this or any other institution. I would like to thank Danny D’Amours

from the National Research Council for providing some leads on open source OCR

software. I received no further assistance.

Sincerely,

ID: 20302048

Eugene Yiu Chun Chung


Table of Contents

List of Tables
List of Figures
Summary
1.0 Introduction
1.1 Background
1.2 Scope
2.0 Evaluation Criteria
2.1 Performance Criterion
2.2 Development Environment
2.3 Cost Effectiveness
2.4 Licenses
2.5 Optional Criteria
3.0 Concerns and Considerations
3.1 Selecting Test Data
3.2 Ability to Anticipate Changes
3.3 Knowledge Requirements
4.0 OCR Evaluation Analysis
4.1 OCR Candidates
4.2 OCR Performance and Potential
5.0 Conclusions
6.0 Recommendations
Glossary
Appendix A
Works Cited


List of Tables

Table 1. Overview of Open-Source OCR Solutions
Table 2. Overview of Commercial OCR Solutions
Table 3. Information generated from Appendix A-1
Table 4. Information generated from Appendix A-2
Table 5. Information generated from Appendix A-3
Table 6. Information generated from Appendix A-4
Table 7. Information generated from Appendix A-5
Table 8. Information generated from Appendix A-5, Figure 3 and Figure 4
Table 9. A Comparison Between OCR Candidates


List of Figures

Figure 1. GOCR separating ‘E’ and ‘x’
Figure 2. Tesseract Scale Manipulation of Test
Figure 3. ViviData Test 2A after segmentation and other processing
Figure 4. ViviData Test 2B after segmentation and other processing


Summary

The objective of this report is to compare and evaluate the most suitable

commercial or open source Optical Character Recognition (OCR) tool for extracting data

from specific medical images under the Linux platform. The report is intended for

anyone interested in learning more about OCR tools and selection strategies.

The major challenges for this project are test data availability, predictions of

future trends, additional costs to increase performance and knowledge requirements to

complete the project. The three critical evaluation criteria of the most fitting OCR

solution are the performance of the OCR engine, cost of the solution and potential

improvements through additional development. Unfortunately, two strong OCR

candidates, GOCR and Abbyy, will not be selected as a solution, because GOCR is under

the GPL license and the working version of Abbyy only runs on Windows. ViviData is

the best selection for the Radiation Exposure Monitoring project.

The capabilities of other OCR candidates may fit better under a different set of

requirements. For instance, Abbyy is more dominant in the Windows environment and

valued for its accuracy and reliability. Tesseract has the greatest potential due to character

training capabilities and flexibility for research and development. GOCR has a strong

open source OCR engine that is reasonably accurate for complex documents. It is

important to find the OCR tool that best meets the requirements of the

project.

The recommendations include a larger sample size and additional processing

functions to improve the OCR reliability and accuracy.


1.0 Introduction

OCR is a field of interest for computer scientists because the general problem of

recognizing an array of characters in various languages and styles consistently with speed and

accuracy is a challenge [1]. In addition, it is a broad topic that covers exciting fields such as image

normalization, machine learning and artificial intelligence.

The selection of a suitable OCR solution will only deal with English, non-handwritten

fonts, which significantly reduces the complexity of the problem. The implementation is also

useful for retrieving medical data from images while reducing labour costs. The solution will lead

to improved efficiency when manipulating information.

This report serves as a reference for Agfa, NRC, McMaster’s Medical Imaging Informatics

Research Centre and any researcher wishing to inquire about OCR. Since the project is currently

ongoing, the report will not assess the effectiveness of the solution.


1.1 Background

Optical Character Recognition (OCR), a branch of machine vision systems, is the process

of translating an image to machine readable, editable text [2], [3]. OCR originates from rigid

matrix matching; it has evolved beyond feature analysis and now utilizes sophisticated machine

learning algorithms such as Support Vector Machines and Hidden Markov Models [4]. Currently,

there is still a clear gap between the reading abilities of humans and machines, because humans are

much more capable of reading degraded images of text [5].

Over the past decade, an increasing number of hospitals have been collecting and storing medical

records for many different purposes. The conversion of medical records into medical knowledge is

very valuable for finding trends in medical practices and for data mining, and it provides a foundation for

research [6].

The OCR component is a critical part of the Radiation Exposure Monitoring (REM) project

[7]. The purpose of this project is to create a centralized radiation repository where physicians and

scientists across Canada can produce evidence-based research material as a foundation of

knowledge as well as enhancing current research programs. The radiation data collected from

hospitals and medical centers will be the research basis for generating clinical evidence and

medical knowledge. The OCR solution will extract data and text information from the screen

captures of Modalities incapable of conducting the text conversion. These machines are often very

costly to replace and the manufacturers will most likely not update all of the firmware to enable

data transmission to another electronic device [7].


1.2 Scope

The scope of this report is the selection of an OCR approach that will best meet the

requirements and constraints of the REM project. This report will not disclose the details on

implementation or advanced OCR methodologies due to complexity. This document will present

the pros and cons of each OCR selection and some methods to improve the accuracy of the

selected OCR candidates. Due to the variety of open-source and commercial OCR solutions, not

every OCR proposal will have the same level of detail.

This report will also include other OCR solutions such that readers will benefit when

searching for an OCR solution that fits their own set of criteria.


2.0 Evaluation Criteria

One approach to evaluating an OCR tool is matching the capabilities of the OCR solution

with the list of critical and optional criteria. In addition, an analysis of the benefits and drawbacks

of each OCR approach is necessary.

2.1 Performance Criterion

The most important criterion of evaluation is the accuracy and consistency of the OCR

solution. It is crucial that the critical information extracted from the OCR tool have the highest

accuracy, because it contains medical data or patient information. It is the primary concern and

challenge of this project to evaluate precision [7].

The first standard to aid analysis is conducting OCR on the same set of sample images. An

accurate text version of the sample images is located in Appendix A. The second standard is

matching keywords of the OCR outputs to evaluate the accuracy of the OCR tool by comparing the

output text and the original text. This method will show key differences in the OCR tool’s

capabilities as well as accuracy on specific images. A keyword comparison needs to match all the

characters in the word exactly, so it is a much stricter evaluation process to separate the

capabilities of the OCR solutions. The third standard is a numerical verification process that

compares the important numerical data by character, while ignoring issues with character

separation and alignment. The most important element in the image to extract is the numbers, so a

numerical standard will ensure the reliability of the OCR solution.
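The keyword and numerical standards described above could be scored roughly as follows. This is a minimal sketch of one possible reading of the procedure, not the scripts used for this evaluation: the function names, the tokenization rule and the use of Python's difflib to align the digit sequences are all assumptions.

```python
import difflib
import re
from typing import List, Tuple


def keyword_score(keywords: List[str], ocr_text: str) -> float:
    """Fraction of expected keywords found as exact tokens in the OCR output."""
    # Assumption: tokens are whitespace-separated, with surrounding colons ignored.
    tokens = {token.strip(":") for token in ocr_text.split()}
    matched = sum(1 for word in keywords if word in tokens)
    return matched / len(keywords)


def number_score(expected_text: str, ocr_text: str) -> Tuple[int, int]:
    """Return (matched digits, expected digits), ignoring spacing and alignment."""
    # Drop every non-digit character so separation issues do not affect the count.
    expected_digits = re.sub(r"\D", "", expected_text)
    ocr_digits = re.sub(r"\D", "", ocr_text)
    matcher = difflib.SequenceMatcher(None, expected_digits, ocr_digits, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched, len(expected_digits)


# One line of ground truth and the corresponding OCR output.
expected = "Total mAs : 6197 Total Scan time: 31. 02"
ocr = "Total mfls 8197 Total Scan time 31.02"
print(keyword_score(["Total", "mAs", "Scan", "time"], ocr))
print(number_score(expected, ocr))
```

The keyword check is deliberately strict, so a single wrong character fails the whole word, while the number check gives partial credit per digit.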

The weighting factor of 60% for accuracy divides into 35% for numerical comparison and

25% for keyword comparison.


2.2 Development Environment

The second criterion is the ability for the OCR solution to work in a Linux environment,

because the servers and hardware run on Linux. If an OCR tool is non-Java based and does not

support Linux, it can still be considered if there is a potential solution to make the OCR engine run

on Linux, such as implementing a Java Native Interface for C/C++ OCR engines. All OCR

solutions that do not support Linux will be eliminated from candidate selection, so this criterion is not part of

the weighted final analysis.

2.3 Cost Effectiveness

The cost effectiveness of the solution will account for 25% weighting of the evaluation.

The potential to integrate the OCR piece with the source code of ongoing projects is part of cost

effectiveness. The requirement is not to obtain an external OCR solution that translates images to

text, but rather an integrated OCR component within a particular application. If the OCR solution

includes a black-box component, a wrapper or connection must be written to integrate it with the

rest of the project. The other factors that influence the cost of the OCR solution are expenses for

additional development and integration, cost of the tools required and implementation duration.


2.4 Licenses

The next criterion, software licenses for open source software, must be acceptable when

implementing an OCR solution. For instance, software under a full copyleft license such as the GNU

General Public License (GPL) must not be used, because the rest of the project is not to become open source [7].

Appendix A-1 lists acceptable and unacceptable software distribution licenses. These

considerations and issues with IP will be included when selecting potential OCR candidates.

2.5 Optional Criteria

The optional criterion is that the OCR solution is Java based. The advantage of

using Java is the automatic compatibility with the Linux operating system, which is one of the

main criteria. It can also save potential development work in terms of writing a wrapper for the

non-Java based solutions, which lowers the potential costs to the project.

The remaining weighting factor of 15% will be allocated for OCR recognition potential, the

ability of the solution to improve accuracy and other enhancements if additional development is

applied.
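The weighting scheme from Sections 2.1, 2.3 and 2.5 can be sketched as a simple weighted sum. The weights come from this report; how cost effectiveness and recognition potential are mapped onto a 0 to 1 scale is not specified here, so the example inputs below are purely illustrative assumptions.

```python
# Weights stated in Sections 2.1, 2.3 and 2.5 of this report.
WEIGHTS = {
    "numerical": 0.35,
    "keyword": 0.25,
    "cost": 0.25,
    "potential": 0.15,
}


def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (each in the range 0 to 1) into one value."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical candidate: the four inputs below are illustrative values only.
example = {"numerical": 0.9, "keyword": 0.6, "cost": 0.5, "potential": 0.5}
print(round(weighted_score(example), 3))
```

Linux support is not part of the sum because it acts as a pass/fail filter before any weighting is applied.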


3.0 Concerns and Considerations

The challenge of comparing competing OCR systems is the amount of resources needed to

evaluate the performance of OCR [8]. The current evaluation is more difficult because every candidate

has to work in the Linux environment.

3.1 Selecting Test Data

There is no set limit on the amount of test data or images for OCR evaluation. The

deciding factors that determine the size of the test data are the number of candidates and the time

constraint for the evaluation. Ultimately, the evaluation uses a limited number of samples and a small number

of OCR tools that meet the bare minimum criteria. Therefore, the process of sample selection is simple.

3.2 Ability to Anticipate Changes

Even with a set of constraints on the possible input image scenarios such as font size and

image layout, it is difficult to foresee recognition obstacles for a different set of images [8]. The

evaluation process relies heavily on the developer to notice changes and deal with upcoming

problems. It is important to account for future growth and dynamic changes to the input data.


3.3 Knowledge Requirements

When implementing an OCR solution, one may need extensive background knowledge in

image processing, machine learning and advanced programming concepts to complete the task.

Currently, an overhead cost is expected in terms of development, purchasing a commercial OCR

solution, integration or planning to deliver accurate results. In addition, product integration and

setup may not be a trivial task. For instance, one of the OCR candidates named “ABBYY”

recently released an older SDK version for Linux, but the experimental setup process is very

complex [9].


4.0 OCR Evaluation Analysis

The initial elimination process of the OCR candidates is simply removing all solutions that

do not support Linux. Since many OCR tools do not support Linux, the list narrows quickly. The

following Tables 1 and 2 list the potential open-source and commercial OCR solutions that

support Linux.

4.1 OCR Candidates

Table 1: Overview of Open-Source OCR Solutions

Names | Language | License | Support Linux | Description | Candidate?
Tesseract/OCRopus | C++ and Lua | Apache | Yes | A pluggable framework which can use Tesseract | Yes
GOCR | C | GPL* | Yes | One of the top open-source solutions | No, but it will be our standard of comparison due to its higher accuracy
CuneiForm/OpenOCR | C/C++ | BSD variant | Yes | A famous Russian software | No, because there is no support or sufficient documentation in English to begin development

* Indicates potential issues with criteria

According to Table 1, the two candidates representing the open source OCR solutions

selected for comparative analysis are GOCR and Tesseract. Although GOCR does not meet the

license requirement for the project and is thus eliminated from the list of potential solutions, it will

serve as a reference of performance for open source OCR tools.
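The two pass/fail checks applied to Table 1 (Linux support and license) can be sketched as a simple filter. The class, the function and the set of unacceptable licenses are assumptions for illustration; the authoritative license list is the one in Appendix A-1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    license: str
    supports_linux: bool


# Assumption: a stand-in for the list of unacceptable licenses in Appendix A-1.
UNACCEPTABLE_LICENSES = {"GPL"}


def screen(candidates: List[Candidate]) -> List[str]:
    """Keep the names of candidates that run on Linux under an acceptable license."""
    return [
        c.name
        for c in candidates
        if c.supports_linux and c.license not in UNACCEPTABLE_LICENSES
    ]


# Rows taken from Table 1.
open_source = [
    Candidate("Tesseract/OCRopus", "Apache", True),
    Candidate("GOCR", "GPL", True),
    Candidate("CuneiForm/OpenOCR", "BSD variant", True),
]
print(screen(open_source))
```

This filter only models the two hard criteria. CuneiForm/OpenOCR passes it but is still excluded in Table 1 for lack of English documentation, which is a judgment the sketch does not capture.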


The advantage of utilizing open source OCR tools is the potential cost effectiveness of the

solution, if the OCR software performs at a satisfactory level given a set of requirements for a

specific problem. The challenge with this approach is dealing with unknown setbacks such as

additional development costs, integration issues or simply the future performance. Even the most

sophisticated commercial OCR tools can fail to recognize text properly for a set of unique images.

Therefore, using open source OCR software adds an additional level of risk to the project.

Table 2: Overview of Commercial OCR Solutions

Names | SDK? | Cost (cdn) | Support Linux | Description | Candidate?
Abbyy | Yes | < $800 | Yes (only for Europe Command Line Interface) | One of the top commercial OCR solutions | Y, but unable to setup Linux version
AspriseOCR | Yes | $1000+ | Yes (Java) | Generally does not work well with the sample images | Y, but support does not respond
ViviData | Yes | $500-1500 | Yes | Command line based OCR solution with a variety of OCR options | Y, support is responsive, but site is always down

The OCR candidates from Table 2 are the only decent Linux OCR products

on the market. Although Abbyy is dominant on Windows and Mac OS, it is an entirely new

competitor in the Linux realm [9], [10]. The strong user friendliness of Abbyy is not one of the

requirements for consideration, so it will not be factored into the final analysis. In contrast, ViviData

targets the Linux OS and delivers a more programmable basis for executing OCR. From


experience, AspriseOCR offers the OCR libraries for development, but it is more suited for

barcode recognition.

Although the cost of AspriseOCR is noticeably higher, the fact that it is a Java based API

will lower the integration costs. The potential costs of Abbyy and ViviData are similar.

4.2 OCR Performance and Potential

Appendix A gives the initial OCR results for each open source and commercial

OCR tool. Using this data, the following subsections analyze each tool.

4.2.1 – GOCR

Table 3: Information generated from Appendix A-1

Test | Keyword comparison (%) | Number comparison
Test 1 | 60% | 24/26*
Test 2a | 66% | 24/35
Test 2b | 72.7% | 30/37
Test 3a | 43.9% | 41/43
Test 3b | 44.7% | 80/94
Average | 57.5% | -
Total | - | 199/235 = 84.7%

* Not a true number comparison but an exact character by character comparison
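The Average and Total rows of Table 3 can be reproduced from the per-test rows. The short sketch below shows the arithmetic, assuming the average is an unweighted mean of the keyword percentages and the total pools all matched and expected characters; the variable names are illustrative.

```python
# Per-test results for GOCR, copied from Table 3.
keyword_percent = [60.0, 66.0, 72.7, 43.9, 44.7]
number_results = [(24, 26), (24, 35), (30, 37), (41, 43), (80, 94)]

# Average row: unweighted mean of the per-test keyword percentages.
average_keyword = sum(keyword_percent) / len(keyword_percent)

# Total row: pooled ratio of matched characters to expected characters.
matched = sum(m for m, _ in number_results)
expected = sum(n for _, n in number_results)
total_number = 100.0 * matched / expected

print(f"Average keyword score: {average_keyword:.1f}%")
print(f"Total number score: {matched}/{expected} = {total_number:.1f}%")
```

Because the total is pooled, tests with more numeric characters (such as Test 3b) influence it more than short tests do. The same calculation applies to the result tables of the other candidates.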

Although GOCR is only a standard for evaluation, it is a promising tool for any researcher

conducting OCR tasks. According to Table 3, GOCR achieved a numerical accuracy of 84.7%.

However, medical data needs to be as accurate as possible. Thus, an accuracy in the mid-nineties (percent) is a

requirement for the numerical standard. From experience, one of the drawbacks of GOCR is the

process of converting the required image to a ppm or pbm format before calling the command line

interface. As an open source solution, the cost of development is generally greater than

commercial OCR solutions, but it has greater development potential than some commercial


solutions. However, there are no additional features or accuracy enhancements for the OCR

engine.

Figure 1: GOCR separating ‘E’ and ‘x’

GOCR output: Exam DescripTion: CT SPINE

Original: Exam Description: CT SPINE

In Figure 1, taken from the sample image for Test 1, the ‘E’ and ‘x’ characters overlap. The

integration program must interpret the error and conduct word segmentation after OCR

recognition. A common mistake of OCR tools is confusing similar characters, such as ‘I’ for

‘1’ and ‘O’ for ‘0’, and vice versa. If the developer accounts for these small changes, the

potential improvement of GOCR will be an increase in numerical comparison score to 90% and

keyword comparison score to 65%.
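A post-processing step of the kind described above might look like the following. This is only a sketch under stated assumptions: the substitution map and the function names are hypothetical, and the correction should only be applied to fields that are known to hold numbers, since it would corrupt ordinary words.

```python
# Assumption: a hand-picked map covering the two confusions named in the text.
LETTER_TO_DIGIT = str.maketrans({"I": "1", "O": "0"})


def fix_numeric_field(value: str) -> str:
    """Replace look-alike letters with digits in a field known to be numeric."""
    return value.translate(LETTER_TO_DIGIT)


def is_numeric(value: str) -> bool:
    """Check that a field holds only digits and at most one decimal point."""
    return value.count(".") <= 1 and value.replace(".", "", 1).isdigit()


raw = "1O18.2O"  # hypothetical OCR output for a dose value
fixed = fix_numeric_field(raw)
print(fixed, is_numeric(fixed))
```

Validating the corrected field afterwards gives the integration program a way to flag values that are still not clean numbers instead of silently storing them.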

4.2.2 - Tesseract

Table 4: Information generated from Appendix A-2

Test | Keyword comparison (%) | Number comparison
Test 1 | 0% | 0/26*
Test 2a | 22.7% | 9/35
Test 2b | 21.7% | 7/37
Test 3a | 11.1% | 7/43
Test 3b | 15% | 12/94
Average | 14.1% | -
Total | - | 35/235 = 14.9%

* Not a true number comparison but an exact character by character comparison

From Table 4, the OCR capabilities of Tesseract may seem poor, but it is potentially one of

the most adaptive and promising Open Source OCR engines [11]. It is one of the most active OCR

projects, attracting programmers and researchers. With a larger sample dataset, a custom learning

process can be manually implemented to increase accuracy of the tool. Among all the OCR

candidates, it has the greatest recognition potential.


Figure 2: Tesseract Scale Manipulation of Test 1

Test | Recognition Ratio | Result
Normal Image | 0/26 | -
At 60% size (scale) | 0/26 | -
At 70% size | 0/26 | -
At 80% size | 0/26 | -
At 90% size | 0/26 | -
At 110% size | 24/26 | Exam Descriprinn: CT SPINE
At 120% size | 23/26 | Exam Dascrnptinn: CT SPINE
At 130% size | 24/26 | Exam Dsscriptinn: CT SPINE
At 140% size | 25/26 | Exam Descriptiun: CT SFINE
At 150% size | 24/26 | Exam Dasnriptinn: CT SPINE
At 160% size | 25/26 | Exam Descriptinn: CT SPINE
At 170% size | 26/26 | Exam Description: CT SPINE
At 180% size | 25/26 | Exam Descriptiun: CT SPINE
At 190% size | 26/26 | Exam Description: CT SPINE
At 200% size | 26/26 | Exam Description: CT SPINE

Figure 2 shows that the accuracy of Tesseract can be greatly influenced simply by the scale of

the image. This is not the case for other open source solutions such as GOCR. For optimal

accuracy, the sample image needs to be resized to 190 – 200% of its original size.

Although the tool has great potential, Tesseract is limited in image formats and capabilities without

a development overhead. It is not an ideal tool for recognizing different image layouts of text.

Tesseract will perform better if the image is segmented properly.

After image manipulation and segmentation, the performance of Tesseract will be able to

achieve an estimated numerical comparison score of 60% and keyword comparison score of 40%.
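The scale experiment in Figure 2 amounts to a sweep over resize factors, scoring each OCR result against the known text. The sketch below shows that loop; the `recognize` callable is a hypothetical stand-in for "resize the image and run the OCR engine", and here it simply looks up a few of the outputs recorded in Figure 2.

```python
from typing import Callable, Iterable, Tuple


def char_matches(expected: str, actual: str) -> int:
    """Count positions at which the two strings hold the same character."""
    return sum(1 for e, a in zip(expected, actual) if e == a)


def best_scale(
    recognize: Callable[[int], str], expected: str, scales: Iterable[int]
) -> Tuple[int, int]:
    """Return (scale percent, matched characters) for the best-scoring scale."""
    best = (0, -1)
    for scale in scales:
        score = char_matches(expected, recognize(scale))
        if score > best[1]:  # ties keep the smaller, earlier scale
            best = (scale, score)
    return best


# Stand-in for resizing and running the OCR engine: a few of the outputs
# recorded in Figure 2, keyed by scale percentage (100 gave no usable text).
FIGURE_2_OUTPUTS = {
    100: "",
    110: "Exam Descriprinn: CT SPINE",
    160: "Exam Descriptinn: CT SPINE",
    170: "Exam Description: CT SPINE",
}

expected_text = "Exam Description: CT SPINE"
print(best_scale(FIGURE_2_OUTPUTS.__getitem__, expected_text, [100, 110, 160, 170]))
```

In a real integration the ground truth is not available at run time, so a sweep like this would be run once on labelled samples to fix a scale factor for each image layout.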


4.2.3 - AspriseOCR

Table 5: Information generated from Appendix A-3

Test | Keyword comparison (%) | Number comparison
Test 1 | 42.9% | 24/26*
Test 2a | 49% | 31/35
Test 2b | 57.4% | 34/37
Test 3a | 28.9% | 41/43
Test 3b | 28.8% | 88/94
Average | 41.4% | -
Total | - | 218/235 = 92.8%

* Not a true number comparison but an exact character by character comparison

The results for AspriseOCR shown in Table 5 are decent with a numerical comparison

score of 92.8% and keyword comparison score of 41.4%. The Java implementation of AspriseOCR

is straightforward and definitely less complicated than open source OCR tools, but it is the most

costly solution. The potential capability enhancement of the software beyond its current

recognition engine is lower than open source candidates.

4.2.4 – Abbyy FineReader 10 (Windows)

Table 6: Information generated from Appendix A-4

Test | Keyword comparison (%) | Number comparison
Test 1 | 100% | 26/26*
Test 2a | 77.3% | 29/35
Test 2b | 87% | 37/37
Test 3a | 77% | 43/43
Test 3b | 70% | 88/94
Average | 82.3% | -
Total | - | 223/235 = 94.9%

* Not a true number comparison but an exact character by character comparison

The general performance of Abbyy FineReader 10 for Windows is considerably accurate at

a numerical comparison score of 94.9% and keyword comparison score of 82.3% from Table 6.

Unfortunately, the Linux version of Abbyy is two versions behind FineReader 10 and the setup

process is the most challenging amongst all the candidates. Abbyy is also a new competitor in


the Linux market [9]. Therefore, it is difficult to compare the key performance difference of the

Windows and Linux version of Abbyy.

4.2.5 – ViviData

Table 7: Information generated from Appendix A-5

Test | Keyword comparison (%) | Number comparison
Test 1 | 100% | 26/26*
Test 2a | 11% | 11/35
Test 2b | 6% | 4/37
Test 3a | 74% | 43/43
Test 3b | 68% | 94/94
Average | 51.8% | -
Total | - | 178/235 = 75.7%

* Not a true number comparison but an exact character by character comparison

According to Table 7, the performance of ViviData is lower than that of most of the candidates, but the

outcome is influenced by the extremely poor results from the “Test 2a” and “Test 2b” images. Even

when a set of images shares the same font and parameters, consistent image features do not guarantee

improved accuracy for all OCR engines, because every OCR

engine behaves differently. The results for “Test 2a” and “Test 2b” show that the other OCR

candidates favour these sample images, but ViviData does not.

ViviData provides developers with the parameters for input and output options as well as

pre-processing and image recognition options [2]. After processing the images by the method

of line segmentation, the results for “Test 2a” and “Test 2b” improved significantly, as

shown in Figure 3 and Figure 4 respectively.
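This report does not detail how the line segmentation was carried out. As an illustration only, one common approach is a horizontal projection: treat every run of pixel rows that contain ink as one text line, then pass each line to the OCR engine separately. The sketch below assumes a binarized image stored as a list of rows; the function name and the sample bitmap are hypothetical.

```python
from typing import List, Tuple


def segment_lines(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows that contain ink.

    bitmap is a binarized image: 1 marks an ink pixel and 0 marks background.
    """
    lines = []
    start = None
    for row_index, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = row_index  # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, row_index - 1))  # the text line just ended
            start = None
    if start is not None:
        lines.append((start, len(bitmap) - 1))  # text runs to the bottom edge
    return lines


# Tiny hypothetical bitmap: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(page))
```

Noisy images may need a threshold on the number of ink pixels per row rather than a simple any-ink test, otherwise stray specks merge or split lines.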


Figure 3: ViviData Test 2A after segmentation and other processing (ViviData output first, followed by the original text)

ID Study ID

Birth Date

Age S

Sex:U Height(kg): Height(cm):

Patient Comments

Study Date : 2010.03.10 Body Part : CHEST

Requesting Department

Referring Physician

Operator Name

Total Image Number : 834

<< Dose Information>>

Total mfls 8197 Total Scan time 31.02

CTDIvoI(mGy) (Head) - (Body) 420.50

DLP(n,Gycn,) (Head) - (Body) 1018.20

EH. DLP(n,Gycn,) 1281.20

<< Contrast/Enhance Information>>

Contrast Enhance : CE

ID: Study ID:

Birth Date: Age:

Sex: Weight(kg): Height(cm):

Patient Comments:

Study Date: 2010. 03. 10 Body Part: CHEST

Requesting Department :

Referring Physician :

Reporting Physician :

Operator Name :

Total Image Number : 634

<< Dose Information >>

Total mAs : 6197 Total Scan time: 31. 02

CTDlvol (mGy) (Head) : - (Body) : 420.50

DLP (mGycm) (Head) : - (Body) : 1016. 20

Eff. DLP(mGycm) : 1281.20

<< Contrast/Enhance Information >>

Contrast Enhance : CE

Word comparison by key words: 65%

Number comparison by number character: 33/35

Figure 4: ViviData Test 2B after segmentation and other processing (ViviData output first, followed by the original text)

ID Study ID

Birth Date Age

Sex : H Height(kg) : Height(cm)

Patient Comments

Study Date : 2010.03.10 Body Part : CHEST

Request i ng Department

Referring Physician

Reporting Physician

Operator Name

Total Image Number 1888

<< Dose Information>>

Total niAs : 8088 Total Scan time : 28.77

CTDIvoI(mGy) (Head) - (Body) 459.70

DLP(n,Gycn,) (Head) - (Body) 1282.90

EH. DLP(n,Gycn,) 1580.50

<< Contrast/Enhance Information>>

Contrast Enhance CE

ID: Study ID:

Birth Date: Age:

Sex: Weight(kg): Height(cm):

Patient Comments:

Study Date: 2010. 03. 10 Body Part: CHEST

Requesting Department :

Referring Physician :

Reporting Physician :

Operator Name :

Total Image Number : 1888

<< Dose Information >>

Total mAs : 6086 Total Scan time: 28. 77

CTDlvol (mGy) (Head) : - (Body) : 459.70

DLP (mGycm) (Head) : - (Body) : 1262. 90

Eff. DLP(mGycm) : 1580.50

<< Contrast/Enhance Information >>

Contrast Enhance : CE

Word comparison by key words: 67%

Number comparison by number character: 35/37


As Figure 3 and Figure 4 show, some numerical values are still incorrect because of noise in

the sample images and the close proximity of the characters. This is a major concern for all OCR

solutions, because the behaviour of an OCR tool is not perfectly stable. The developers need to

set constraints on the images in order to improve the consistency of the results.

Table 8: Information generated from Appendix A-5, Figure 3 and Figure 4

Comments    Keyword comparison (%)    # comparison
Test 1      100%                      *26/26
Test 2a     65%                       33/35
Test 2b     67%                       35/37
Test 3a     74%                       43/43
Test 3b     68%                       94/94
Average     74.8%
Total                                 231/235 = 98.3%

* Not a true # comparison but an exact character-by-character comparison
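The summary rows of Table 8 can be reproduced with a few lines of arithmetic. The sketch below uses the per-test values from the table; the variable names are illustrative only.

```python
# Per-test results taken from Table 8.
keyword_pct = {
    "Test 1": 100, "Test 2a": 65, "Test 2b": 67, "Test 3a": 74, "Test 3b": 68,
}
# (correctly recognized number characters, number characters in the image)
number_hits = {
    "Test 1": (26, 26), "Test 2a": (33, 35), "Test 2b": (35, 37),
    "Test 3a": (43, 43), "Test 3b": (94, 94),
}

# "Average" row: plain mean of the keyword percentages.
average_keyword = sum(keyword_pct.values()) / len(keyword_pct)

# "Total" row: counts are pooled across all tests before dividing.
correct = sum(hit for hit, _ in number_hits.values())
total = sum(count for _, count in number_hits.values())
pooled_pct = 100 * correct / total

print(f"Average keyword comparison: {average_keyword:.1f}%")    # 74.8%
print(f"Total # comparison: {correct}/{total} = {pooled_pct:.1f}%")  # 231/235 = 98.3%
```

Because the number comparison pools the counts, tests with more number characters (such as Test 3b) carry more weight than they would in a mean of per-test ratios.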

When comparing the results from Table 7 and Table 8, the numerical comparison score and

keyword comparison score improved from 75.7% to 98.3% and from 51.8% to 74.8% respectively, due

to the additional processing and segmentation shown previously. Therefore, ViviData has the highest

recognition potential among commercial OCR solutions.


5.0 Conclusions

Table 9 summarizes the capabilities of each OCR solution based on the previously stated

criteria and the performance of each OCR candidate.

Table 9. A Comparison Between OCR Candidates

Candidates                                     Performance    Performance   Recognition   Cost
                                               # comparison   keyword       Potential
* Weighting Factor [F]                         3.5            2.5           1.5           2.5
Tesseract             ** Score [S]             ~6.0           ~4.0          9.0           8.0
                      *** Total [T=F x S]      21             10            13.5          20       64.5
[x] GOCR              ** Score [S]             8.4            6.5           6.0           7.5
                      *** Total [T=F x S]      29.4           16.25         9             18.75    73.4
[x] Abbyy (Windows)   ** Score [S]             9.5            8.2           2.0           6.0
                      *** Total [T=F x S]      33.25          20.5          3             15       71.75
AspriseOCR            ** Score [S]             9.2            4.1           3.5           4.0
                      *** Total [T=F x S]      32.2           10.25         5.25          10.0     57.7
ViviData              ** Score [S]             9.8            7.5           7.0           5.0
                      *** Total [T=F x S]      34.3           18.75         10.5          12.5     76.05

* The total value of the weighting factor is 10

** The total score for each criterion is out of 10

*** The total score has a maximum value of 100

~ Estimation is necessary due to variability of data

[x] Candidate not selected because it is missing important criteria (included for reference)
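The totals in Table 9 follow from multiplying each score [S] by its weighting factor [F] and summing across the four criteria. The sketch below recomputes them from the scores and weights listed in the table; the dictionary keys and function name are illustrative. Recomputing the ViviData row from its scores gives 34.3 + 18.75 + 10.5 + 12.5 = 76.05.

```python
# Weighting factors [F] from Table 9 (they sum to 10).
WEIGHTS = {"number": 3.5, "keyword": 2.5, "potential": 1.5, "cost": 2.5}

# Scores [S] out of 10 for each candidate, from Table 9.
SCORES = {
    "Tesseract":  {"number": 6.0, "keyword": 4.0, "potential": 9.0, "cost": 8.0},
    "GOCR":       {"number": 8.4, "keyword": 6.5, "potential": 6.0, "cost": 7.5},
    "Abbyy":      {"number": 9.5, "keyword": 8.2, "potential": 2.0, "cost": 6.0},
    "AspriseOCR": {"number": 9.2, "keyword": 4.1, "potential": 3.5, "cost": 4.0},
    "ViviData":   {"number": 9.8, "keyword": 7.5, "potential": 7.0, "cost": 5.0},
}

# Candidates marked [x] in the table: scored for reference, not selectable.
ELIMINATED = {"GOCR", "Abbyy"}


def weighted_total(scores, weights=WEIGHTS):
    """Sum of F x S over all criteria, rounded to hide float noise."""
    return round(sum(weights[c] * scores[c] for c in weights), 2)


totals = {name: weighted_total(s) for name, s in SCORES.items()}
eligible = {name: t for name, t in totals.items() if name not in ELIMINATED}
best = max(eligible, key=eligible.get)

for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    flag = " [x]" if name in ELIMINATED else ""
    print(f"{name:<12}{total:>7.2f}{flag}")
print("Selected:", best)
```

Keeping the eliminated candidates in the score table but filtering them out before taking the maximum mirrors how the report lists GOCR and Abbyy for reference only.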


ViviData has the highest recognition potential; Tesseract has the highest development

potential. AspriseOCR is the only Java-based OCR solution with an easy-to-use Application

Programming Interface. Abbyy is a strong OCR solution in the Windows and Macintosh

environments, while ViviData focuses on the Linux market.

According to Table 9, ViviData is the ideal candidate for the Linux platform to extract

medical data for the REM project. Compared with the other OCR solutions, it has the highest

numerical comparison score and a decent keyword performance score, which makes it ideal for

retrieving important medical data. GOCR and Abbyy do not meet certain criteria of the

project and are thus eliminated from selection.


6.0 Recommendations

A larger set of test images should be utilized to justify the selection of the most suitable

OCR solution with greater confidence. If time and resources are available, a greater emphasis on

testing and data collection methods or tools is also important for the success of any OCR related

project.

Along with the OCR engine, it is important to develop pre-processing, post-processing and

helper functions to enhance the accuracy of the output. For instance, manipulating the input image

to improve the accuracy of the OCR engine, or including fuzzy matching algorithms to better detect

keywords in the output and match them to their corresponding values for data extraction, can improve

clarity and consistency. This helps ensure the reliability of the results, especially when the

application deals with medical data.
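As an illustration of the fuzzy matching idea, the following minimal sketch maps a noisy OCR label to the closest expected keyword using Python's standard `difflib` module. The keyword list is an assumption modelled on the labels visible in Figure 3 and Figure 4, and the function name and the 0.6 cutoff are hypothetical choices, not part of the project code.

```python
import difflib

# Assumed list of labels the dose report is expected to contain.
EXPECTED_KEYWORDS = [
    "Study Date",
    "Body Part",
    "Total Image Number",
    "Total mAs",
    "Total Scan time",
    "CTDIvol (mGy)",
    "DLP (mGycm)",
    "Eff. DLP (mGycm)",
]


def match_keyword(ocr_label, keywords=EXPECTED_KEYWORDS, cutoff=0.6):
    """Return the expected keyword closest to `ocr_label`, or None.

    Comparison is case-insensitive; `cutoff` is the minimum similarity
    ratio (0 to 1) that difflib must report for a match to be accepted.
    """
    by_lower = {keyword.lower(): keyword for keyword in keywords}
    hits = difflib.get_close_matches(
        ocr_label.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[hits[0]] if hits else None


# Labels as they appear in the OCR output of Figure 3 and Figure 4.
print(match_keyword("Total niAs"))    # Total mAs
print(match_keyword("CTDIvoI(mGy)"))  # CTDIvol (mGy)
print(match_keyword("zzzz"))          # None
```

A higher cutoff rejects more garbled labels outright, which may be preferable for medical data, where a missed value is usually safer than a value attached to the wrong keyword.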


Glossary

Optical Character Recognition (OCR): The translation of an image of written text into printed text.

Copyleft license: Essentially the opposite of copyright, where if a party creates work based on GPL’d software and distributes the resulting work, then the party must distribute the resulting work under GPL.

GPL: A free, copyleft license for software and other works.

Segmentation: The method or instance of dividing something into parts.

Modality: Equipment or probes that acquire images of the body, such as ultrasound, radiography and magnetic resonance imaging.

API: Application Programming Interface; allows interaction between different software programs.


Works Cited

[1] Mumtaz, Shakeel. “Character Recognition and Google Tesseract.” 2007. meSmarty. [Online]

<http://www.mesmarty.com/2010/02/07/character-recognition-and-google-tesseract/.>

[Accessed: July 7, 2010]

[2] “OCR Shop XTR Users Guide.” 2004. ViviData. [Online]

<http://www.vividata.com/.>

[Accessed: July 25, 2010]

[3] Baird, Henry and Riopka, Terry. “ScatterType: a Reading CAPTCHA Resistant to

Segmentation Attack.” Comp. Sci. & Engineering Department of Lehigh University.

Document Recognition and Retrieval XII, vol. 5676, 2005. SPIE Digital Library.

<http://www.cse.lehigh.edu/~baird.>

[Accessed: July 7, 2010]

[4] Natarajan, Prem et al., “The BBN Byblos Hindi OCR System.pdf.” BBN Technologies.

Document Recognition and Retrieval XII, vol. 5676, 2005. SPIE Digital Library.

<http://spiedl.org/terms.>

[Accessed: July 7, 2010]

[5] Varga, Tamas and Bunke, Horst. “Perturbation Models for Generating Synthetic Training Data

in Handwriting Recognition.” 2008. University of Bern Institute of Computer Science and

Applied Mathematics.

<www.springerlink.com.>

[Accessed: July 8, 2010]


[6] Ng, Andrew. “Lecture 1 | Machine Learning”. July 2008. Stanford University. [Online]

<http://www.youtube.com/watch?v=UzxYlbK2c7E.>

[Accessed: July 10, 2010]

[7] D’Amours, Danny. Research Support Coordinator. Personal Interview. July 16, 2010.

[8] Nartker, Thomas et al., “Software Tools and Test Data for Research and Testing of Page-

Reading OCR Systems.” University of Nevada. Document Recognition and Retrieval XII, vol.

5676, 2005. SPIE Digital Library.

<http://spiedl.org/terms.>

[Accessed: July 7, 2010]

[9] “ABBYY Announces Its New OCR SDK for Linux Environment.” 2010. Abbyy.

<http://www.abbyy.com/Default.aspx?DN=bcfa7070-594f-475c-98e2-dba48097519c.>

[Accessed: July 15, 2010]

[10] “ABBYY Wins Best Professional Software at Macworld UK Awards.” 2010. Abbyy.

<http://www.abbyy.com/finereader_for_mac/.>

[Accessed: July 16, 2010]

[11] “tesseract-ocr” 2010. Google Project Hosting.

<http://code.google.com/p/tesseract-ocr/.>

[Accessed: July 5, 2010]