UNIVERSITY OF WATERLOO
Faculty of Engineering
Optical Character Recognition Recommendation
Prepared By:
Eugene Chung
I.D. # 20302048
2A Mechatronics Engineering
August 31, 2010
149 Halterwood Circle
Markham, Ontario
L3P 7T2
August 31, 2010
Professor Sanjeev Bedi,
Director of Mechatronics Engineering
Department of Mechanical and Mechatronics Engineering
University of Waterloo
200 University Avenue West,
Waterloo, Ontario
N2L 3G1
Dear Professor Bedi:
I have written this report, “Optical Character Recognition Recommendation”, as
the second of four report submissions for graduation. This work term followed my
completion of the 2A academic term in April 2009. The purpose of this report is to select the
best Optical Character Recognition (OCR) solution for the joint research program
between the Medical Imaging Informatics Research Centre at McMaster, the National Research
Council and Agfa Healthcare.
The main goal of the joint research program is the completion of the Radiation
Exposure Monitoring project to enable a standardized repository for radiation exposure
research. The OCR component of the project is a major commitment for this work term.
I conducted the research, testing, data collection, design, implementation and
documentation with minimal assistance.
This report was written entirely by me and has not received any previous
academic credit at this or any other institution. I would like to thank Danny D’Amours
from the National Research Council for providing some leads on open source OCR
software. I received no further assistance.
Sincerely,
Eugene Yiu Chun Chung
ID: 20302048
Table of Contents
List of Tables
List of Figures
Summary
1.0 Introduction
1.1 Background
1.2 Scope
2.0 Evaluation Criteria
2.1 Performance Criterion
2.2 Development Environment
2.3 Cost Effectiveness
2.4 Licenses
2.5 Optional Criteria
3.0 Concerns and Considerations
3.1 Selecting Test Data
3.2 Ability to Anticipate Changes
3.3 Knowledge Requirements
4.0 OCR Evaluation Analysis
4.1 OCR Candidates
4.2 OCR Performance and Potential
5.0 Conclusions
6.0 Recommendations
Glossary
Appendix A
Works Cited
List of Tables
Table 1. Overview of Open-Source OCR Solutions
Table 2. Overview of Commercial OCR Solutions
Table 3. Information generated from Appendix A-1
Table 4. Information generated from Appendix A-2
Table 5. Information generated from Appendix A-3
Table 6. Information generated from Appendix A-4
Table 7. Information generated from Appendix A-5
Table 8. Information generated from Appendix A-5, Figure 3 and Figure 4
Table 9. A Comparison Between OCR Candidates
List of Figures
Figure 1. GOCR separating ‘E’ and ‘x’
Figure 2. Tesseract Scale Manipulation of Test 1
Figure 3. ViviData Test 2A after segmentation and other processing
Figure 4. ViviData Test 2B after segmentation and other processing
Summary
The objective of this report is to compare commercial and open source Optical Character
Recognition (OCR) tools and to identify the most suitable one for extracting data
from specific medical images under the Linux platform. The report is intended for
anyone interested in learning more about OCR tools and selection strategies.
The major challenges for this project are test data availability, predictions of
future trends, additional costs to increase performance and the knowledge required to
complete the project. The three critical evaluation criteria for selecting the most fitting
OCR solution are the performance of the OCR engine, the cost of the solution and the potential
for improvement through additional development. Unfortunately, two strong OCR
candidates, GOCR and Abbyy, will not be selected as a solution, because GOCR is under
the GPL license and the working version of Abbyy only applies to Windows. ViviData is
the best selection for the Radiation Exposure Monitoring project.
The capabilities of other OCR candidates may fit better under a different set of
requirements. For instance, Abbyy is more dominant in the Windows environment and
valued for its accuracy and reliability. Tesseract has the greatest potential due to its character
training capabilities and flexibility for research and development. GOCR has a strong
open source OCR engine that is reasonably accurate for complex documents. It is
important to find the OCR tool that best meets the requirements of the
project.
The recommendations include a larger sample size and additional processing
functions to improve the OCR reliability and accuracy.
1.0 Introduction
OCR is a field of interest for computer scientists because the general problem of
recognizing an array of characters in various languages and styles consistently, with speed and
accuracy, is a challenge [1]. In addition, it is a broad topic that covers exciting fields such as image
normalization, machine learning and artificial intelligence.
The selection of a suitable OCR solution will only deal with English, non-handwritten
fonts, which significantly reduces the complexity of the problem. The implementation is also
useful for retrieving medical data from images while reducing labour costs. The solution will lead
to improved efficiency when manipulating information.
This report serves as a reference for Agfa, NRC, McMaster’s Medical Imaging Informatics
Research Centre and any researcher wishing to inquire about OCR. Since the project is currently
ongoing, the report does not assess the effectiveness of the implemented solution.
1.1 Background
Optical Character Recognition (OCR), a branch of machine vision systems, is the process
of translating an image to machine readable, editable text [2], [3]. OCR originated from rigid
matrix matching; it has since evolved beyond feature analysis and now utilizes sophisticated Machine
Learning algorithms such as Support Vector Machines and Hidden Markov Models [4]. Currently,
there is still a clear gap between the reading abilities of humans and machines, because humans are
much more capable of reading degraded images of text [5].
Over the past decade, an increasing number of hospitals have been collecting and storing medical
records for many different purposes. The conversion of medical records into medical knowledge is
very valuable for finding trends in medical practices, for data mining and as a foundation for
research [6].
The OCR component is a critical part of the Radiation Exposure Monitoring (REM) project
[7]. The purpose of this project is to create a centralized radiation repository where physicians and
scientists across Canada can produce evidence-based research material as a foundation of
knowledge and enhance current research programs. The radiation data collected from
hospitals and medical centers will be the research basis for generating clinical evidence and
medical knowledge. The OCR solution will extract data and text information from the screen
captures of modalities that are incapable of conducting the text conversion themselves. These machines
are often very costly to replace and the manufacturers will most likely not update all of the firmware
to enable data transmission to another electronic device [7].
1.2 Scope
The scope of this report is the selection of an OCR approach that will best meet the
requirements and constraints of the REM project. This report will not disclose the details of the
implementation or of advanced OCR methodologies due to their complexity. This document will present
the pros and cons of each OCR selection and some methods to improve the accuracy of the
selected OCR candidates. Due to the variety of open-source and commercial OCR solutions, not
every OCR proposal will have the same level of detail.
This report will also include other OCR solutions so that readers will benefit when
searching for an OCR solution that fits their own set of criteria.
2.0 Evaluation Criteria
One approach to evaluating an OCR tool is to match the capabilities of the OCR solution
against a list of critical and optional criteria. In addition, an analysis of the benefits and drawbacks
of each OCR approach is necessary.
2.1 Performance Criterion
The most important criterion of evaluation is the accuracy and consistency of the OCR
solution. It is crucial that the critical information extracted by the OCR tool be as accurate
as possible, because it contains medical data or patient information. Evaluating this accuracy is
the primary concern and challenge of this project [7].
The first standard to aid analysis is conducting OCR on the same set of sample images. An
accurate text version of the sample images is located in Appendix A. The second standard is
matching keywords of the OCR outputs to evaluate the accuracy of the OCR tool by comparing the
output text and the original text. This method will show key differences in the OCR tool’s
capabilities as well as accuracy on specific images. A keyword comparison needs to match all the
characters in the word exactly, so it is a much stricter evaluation process that helps separate the
capabilities of the OCR solutions. The third standard is a numerical verification process that
compares the important numerical data by character, while ignoring issues with character
separation and alignment. The numbers are the most important elements to extract from the image, so a
numerical standard will ensure the reliability of the OCR solution.
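The two comparison standards can be sketched in a few lines of Python. This is only an illustration under assumptions: the report does not give the exact matching rules, so the function names `keyword_score` and `numeric_score` are hypothetical, keywords are taken to be whitespace-separated words, and digit matching uses the standard library's `difflib.SequenceMatcher` so that extra spaces or shifted columns do not count as errors.

```python
import re
from difflib import SequenceMatcher


def keyword_score(expected, actual):
    """Fraction of expected words that appear, character for character, in the OCR output."""
    expected_words = expected.split()
    remaining = actual.split()
    hits = 0
    for word in expected_words:
        if word in remaining:
            remaining.remove(word)  # each output word may satisfy only one keyword
            hits += 1
    return hits / len(expected_words) if expected_words else 1.0


def numeric_score(expected, actual):
    """Return (matched, total) digit characters, ignoring spacing and alignment."""
    want = "".join(re.findall(r"\d", expected))
    got = "".join(re.findall(r"\d", actual))
    matcher = SequenceMatcher(None, want, got, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched, len(want)


# One line of the original text and the corresponding OCR output from Figure 3.
expected = "Total Image Number : 634"
actual = "Total Image Number : 834"
print(keyword_score(expected, actual))
print(numeric_score(expected, actual))
```

In this example a single misread digit costs one whole keyword but only one of three digit characters, which is why the keyword standard is the stricter of the two.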
Accuracy carries a weighting factor of 60%, divided into 35% for the numerical comparison and
25% for the keyword comparison.
2.2 Development Environment
The second criterion is the ability of the OCR solution to work in a Linux environment,
because the servers and hardware run on Linux. If an OCR tool is non-Java based and does not
support Linux, it can still be considered if there is a potential way to make the OCR engine run
on Linux, such as implementing a Java Native Interface for C/C++ OCR engines. All OCR
solutions that do not support Linux will be eliminated from candidate selection, so this criterion
carries no weight in the final analysis.
2.3 Cost Effectiveness
The cost effectiveness of the solution accounts for 25% of the evaluation weighting.
The potential to integrate the OCR piece with the source code of ongoing projects is part of cost
effectiveness. The requirement is not an external OCR solution that translates images to
text, but rather an OCR component integrated within a particular application. If the OCR solution
includes a black-box component, a wrapper or connector must be written to integrate it with the
rest of the project. The other factors that influence the cost of the OCR solution are the expenses for
additional development and integration, the cost of the required tools and the implementation duration.
2.4 Licenses
The next criterion is that the software license of any open source software must be acceptable when
implementing an OCR solution. For instance, software under the General Public License (GPL), a full
copyleft license, must not be used, because the rest of the project is not to become open source [7].
Appendix A-1 lists acceptable and unacceptable software distribution licenses. These
considerations and issues with IP will be included when selecting potential OCR candidates.
2.5 Optional Criteria
The optional criterion is that the OCR solution is Java based. The advantage of
using Java is automatic compatibility with the Linux operating system, which is one of the
main criteria. It can also save the development work of writing a wrapper around a
non-Java based solution, which lowers the potential costs to the project.
The remaining weighting factor of 15% is allocated to OCR recognition potential: the
ability of the solution to improve its accuracy and gain other enhancements if additional
development is applied.
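The weighting scheme from Sections 2.1, 2.3 and 2.5 can be written as a small scoring function. The sketch below is illustrative only: the report does not state how cost effectiveness or recognition potential are rated numerically, so the 0 to 1 scale, the function name `weighted_score` and the placeholder ratings of 0.5 are assumptions. The two accuracy figures are GOCR's totals from Table 3.

```python
# Weights stated in the report: accuracy 60% (35% numerical + 25% keyword),
# cost effectiveness 25%, recognition potential 15%.
WEIGHTS = {
    "numerical": 0.35,
    "keyword": 0.25,
    "cost": 0.25,
    "potential": 0.15,
}


def weighted_score(scores):
    """Combine per-criterion scores (each on a 0..1 scale) into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Numerical and keyword values are GOCR's results from Table 3.
# The cost and potential ratings are hypothetical placeholders.
gocr = {"numerical": 0.847, "keyword": 0.575, "cost": 0.5, "potential": 0.5}
print(round(weighted_score(gocr), 3))
```

Keeping the weights in one table makes it easy to check that they sum to 100% and to re-run the comparison if the project's priorities change.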
3.0 Concerns and Considerations
The challenge of comparing competing OCR systems is the amount of resources needed to
evaluate the performance of OCR [8]. The current evaluation is made more difficult because the
solution has to work in the Linux environment.
3.1 Selecting Test Data
There is no set limit on the amount of test data or images for OCR evaluation. The
deciding factors that determine the size of the test data are the number of candidates and the time
constraint for the evaluation. Ultimately, this evaluation uses a limited number of samples and a
small number of OCR tools that meet the bare minimum criteria. Therefore, the process of sample
selection is simple.
3.2 Ability to Anticipate Changes
Even with a set of constraints on the possible input image scenarios such as font size and
image layout, it is difficult to foresee recognition obstacles for a different set of images [8]. The
evaluation process relies heavily on the developer to notice changes and deal with upcoming
problems. It is important to account for future growth and dynamic changes to the input data.
3.3 Knowledge Requirements
When implementing an OCR solution, one may need extensive background knowledge in
image processing, machine learning and advanced programming concepts to complete the task.
Currently, an overhead cost is expected in terms of development, purchasing a commercial OCR
solution, integration or planning to deliver accurate results. In addition, product integration and
setup may not be a trivial task. For instance, one of the OCR candidates named “ABBYY”
recently released an older SDK version for Linux, but the experimental setup process is very
complex [9].
4.0 OCR Evaluation Analysis
The initial elimination process for the OCR candidates is simply removing all solutions that
do not support Linux. Since many OCR tools do not support Linux, the list narrows quickly.
Tables 1 and 2 below list the potential open-source and commercial OCR solutions that
support Linux.
4.1 OCR Candidates
Table 1: Overview of Open-Source OCR Solutions

| Names | Language | License | Supports Linux | Description | Candidate? |
|---|---|---|---|---|---|
| Tesseract/OCRopus | C++ and Lua | Apache | Yes | A pluggable framework which can use Tesseract | Yes |
| GOCR | C | GPL* | Yes | One of the top open-source solutions | No, but it will be our standard of comparison due to its higher accuracy |
| CuneiForm/OpenOCR | C/C++ | BSD variant | Yes | A famous Russian software | No, because there is no support or sufficient documentation in English to begin development |

*- indicates potential issues with criteria
According to Table 1, the two candidates representing the open source OCR solutions
selected for comparative analysis are GOCR and Tesseract. Although GOCR does not meet the
license requirement for the project and is thus eliminated from the list of potential solutions, it will
serve as a reference of performance for open source OCR tools.
The advantage of utilizing open source OCR tools is the potential cost effectiveness of the
solution, if the OCR software performs at a satisfactory level given a set of requirements for a
specific problem. The challenge with this approach is dealing with unknown setbacks such as
additional development costs, integration issues or simply the future performance. Even the most
sophisticated commercial OCR tools can fail to recognize text properly for a set of unique images.
Therefore, using open source OCR software adds an additional level of risk to the project.
Table 2: Overview of Commercial OCR Solutions

| Names | SDK? | Cost (CAD) | Supports Linux | Description | Candidate? |
|---|---|---|---|---|---|
| Abbyy | Yes | < $800 | Yes (only for Europe Command Line Interface) | One of the top commercial OCR solutions | Yes, but unable to set up the Linux version |
| AspriseOCR | Yes | $1000+ | Yes (Java) | Generally does not work well with the sample images | Yes, but support does not respond |
| ViviData | Yes | $500-1500 | Yes | Command line based OCR solution with a variety of OCR options | Yes, support is responsive, but site is always down |
The OCR candidates from Table 2 are the only decent Linux OCR distributions
on the market. Although Abbyy is dominant in Windows and Mac OS, it is an entirely new
competitor in the Linux realm [9], [10]. The strong user friendliness of Abbyy is not one of the
requirements for consideration, so it will not be factored into the final analysis. In contrast, ViviData
targets the Linux OS and delivers a more programmable basis for executing OCR. From
experience, AspriseOCR offers the OCR libraries for development, but it is more suited for
barcode recognition.
Although the cost of AspriseOCR is noticeably higher, the fact that it is a Java based API
will lower the integration costs. The potential costs of Abbyy and ViviData are similar.
4.2 OCR Performance and Potential
Appendix A gives the initial OCR results for each open source and commercial
OCR tool. The following subsections analyse each tool using this data.
4.2.1 – GOCR
Table 3: Information generated from Appendix A-1

| Test | Keyword comparison (%) | Numerical comparison |
|---|---|---|
| Test 1 | 60% | *24/26 |
| Test 2a | 66% | 24/35 |
| Test 2b | 72.7% | 30/37 |
| Test 3a | 43.9% | 41/43 |
| Test 3b | 44.7% | 80/94 |
| Average | 57.5% | |
| Total | | 199/235 = 84.7% |

*- not a true numerical comparison but an exact character by character comparison
Although GOCR is only a standard for evaluation, it is a promising tool for any researcher
conducting OCR tasks. According to Table 3, GOCR achieved a numerical accuracy of 84.7%.
However, medical data needs to be as accurate as possible, so an accuracy in the mid-nineties is a
requirement for the numerical standard. From experience, one of the drawbacks of GOCR is that the
image must be converted to a ppm or pbm format before calling the command line
interface. As an open source solution, its cost of development is generally greater than that of
commercial OCR solutions, but it has greater development potential than some commercial
solutions. However, the OCR engine offers no additional features or accuracy enhancements.
Figure 1: GOCR separating ‘E’ and ‘x’

| GOCR output | Original |
|---|---|
| Exam DescripTion: CT SPINE | Exam Description: CT SPINE |
In Figure 1, taken from the sample image for Test 1, the ‘E’ and ‘x’ characters overlap. The
integration program must interpret the error and conduct word segmentation after OCR
recognition. Common mistakes of OCR tools are mismatching similar characters such as ‘I’s for
‘1’s and ‘O’s for ‘0’s and vice versa. If the developer accounts for these small changes, the
potential improvement of GOCR will be an increase in numerical comparison score to 90% and
keyword comparison score to 65%.
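The character confusions described above can be corrected with a small post-processing step on the OCR output. The sketch below is one possible approach, not the method used in the project: the function name `fix_numeric_tokens`, the rule that a token must already contain at least one digit, and the sample input string are all assumptions made for illustration.

```python
import re

# Letters that are commonly misread in place of digits, as noted in the text.
LETTER_TO_DIGIT = str.maketrans({"I": "1", "O": "0"})

# Assumed rule: a token is numeric-like when it contains at least one digit
# and otherwise consists only of digits, dots and the confusable letters.
NUMERIC_LIKE = re.compile(r"^(?=.*\d)[\dIO.]+$")


def fix_numeric_tokens(line):
    """Replace confusable letters with digits, but only inside numeric-like tokens."""
    fixed = []
    for token in line.split(" "):
        if NUMERIC_LIKE.match(token):
            token = token.translate(LETTER_TO_DIGIT)
        fixed.append(token)
    return " ".join(fixed)


# Hypothetical OCR output in which 'I' and 'O' replaced '1' and '0'.
print(fix_numeric_tokens("Total Scan time 3I.O2"))
```

Restricting the substitution to tokens that already contain a digit leaves ordinary words such as "ID" untouched, at the cost of missing a value that was misread entirely as letters.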
4.2.2 - Tesseract
Table 4: Information generated from Appendix A-2

| Test | Keyword comparison (%) | Numerical comparison |
|---|---|---|
| Test 1 | 0% | *0/26 |
| Test 2a | 22.7% | 9/35 |
| Test 2b | 21.7% | 7/37 |
| Test 3a | 11.1% | 7/43 |
| Test 3b | 15% | 12/94 |
| Average | 14.1% | |
| Total | | 35/235 = 14.9% |

*- not a true numerical comparison but an exact character by character comparison
From Table 4, the OCR capabilities of Tesseract may seem poor, but it is potentially one of
the most adaptive and promising open source OCR engines [11]. It is one of the most active OCR
projects, attracting programmers and researchers. With a larger sample dataset, a custom learning
process can be manually implemented to increase the accuracy of the tool. Among all the OCR
candidates, it has the greatest recognition potential.
Figure 2: Tesseract Scale Manipulation of Test 1

| Image scale | Recognition ratio | Result |
|---|---|---|
| Normal image | 0/26 | |
| 60% | 0/26 | |
| 70% | 0/26 | |
| 80% | 0/26 | |
| 90% | 0/26 | |
| 110% | 24/26 | Exam Descriprinn: CT SPINE |
| 120% | 23/26 | Exam Dascrnptinn: CT SPINE |
| 130% | 24/26 | Exam Dsscriptinn: CT SPINE |
| 140% | 25/26 | Exam Descriptiun: CT SFINE |
| 150% | 24/26 | Exam Dasnriptinn: CT SPINE |
| 160% | 25/26 | Exam Descriptinn: CT SPINE |
| 170% | 26/26 | Exam Description: CT SPINE |
| 180% | 25/26 | Exam Descriptiun: CT SPINE |
| 190% | 26/26 | Exam Description: CT SPINE |
| 200% | 26/26 | Exam Description: CT SPINE |
Figure 2 shows that the accuracy of Tesseract can be greatly influenced simply by the scale of
the image. This is not the case for other open source solutions such as GOCR. The sample image
needs to be resized to 190 – 200% to achieve optimum accuracy.
Although the tool has great potential, Tesseract is limited in image formats and capabilities without
a development overhead. It is not an ideal tool for recognizing different image layouts of text.
Tesseract will perform better if the image is segmented properly.
After image manipulation and segmentation, Tesseract is estimated to be able to
achieve a numerical comparison score of 60% and a keyword comparison score of 40%.
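The rescaling step behind Figure 2 can be illustrated with a minimal nearest-neighbour resize. This is a sketch under assumptions: the report does not say which tool or interpolation method produced the scaled test images, the function name `scale_bitmap` is hypothetical, and the image is modelled as a plain list of pixel rows. In practice an imaging library with smoother interpolation would normally be used.

```python
def scale_bitmap(pixels, factor):
    """Resize a row-major bitmap by `factor` using nearest-neighbour sampling."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not pixels or not pixels[0]:
        return []
    src_h, src_w = len(pixels), len(pixels[0])
    dst_h = max(1, round(src_h * factor))
    dst_w = max(1, round(src_w * factor))
    return [
        [
            # Map each destination pixel back to its nearest source pixel.
            pixels[min(src_h - 1, int(y / factor))][min(src_w - 1, int(x / factor))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


# A tiny 2x2 test pattern scaled to 200%, the upper end of the range in Figure 2.
glyph = [
    [0, 1],
    [1, 0],
]
for row in scale_bitmap(glyph, 2.0):
    print(row)
```

A preprocessing wrapper could apply such a resize before every call to the OCR engine, so that the scale becomes a tunable parameter rather than a property of the input image.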
4.2.3 - AspriseOCR
Table 5: Information generated from Appendix A-3

| Test | Keyword comparison (%) | Numerical comparison |
|---|---|---|
| Test 1 | 42.9% | *24/26 |
| Test 2a | 49% | 31/35 |
| Test 2b | 57.4% | 34/37 |
| Test 3a | 28.9% | 41/43 |
| Test 3b | 28.8% | 88/94 |
| Average | 41.4% | |
| Total | | 218/235 = 92.8% |

*- not a true numerical comparison but an exact character by character comparison
The results for AspriseOCR shown in Table 5 are decent, with a numerical comparison
score of 92.8% and a keyword comparison score of 41.4%. The Java implementation of AspriseOCR
is straightforward and definitely less complicated than that of open source OCR tools, but it is the
most costly solution. The potential to enhance the software beyond its current
recognition engine is lower than for the open source candidates.
4.2.4 – Abbyy FineReader 10 (Windows)
Table 6: Information generated from Appendix A-4

| Test | Keyword comparison (%) | Numerical comparison |
|---|---|---|
| Test 1 | 100% | *26/26 |
| Test 2a | 77.3% | 29/35 |
| Test 2b | 87% | 37/37 |
| Test 3a | 77% | 43/43 |
| Test 3b | 70% | 88/94 |
| Average | 82.3% | |
| Total | | 223/235 = 94.9% |

*- not a true numerical comparison but an exact character by character comparison
From Table 6, the general performance of Abbyy FineReader 10 for Windows is considerably accurate,
with a numerical comparison score of 94.9% and a keyword comparison score of 82.3%.
Unfortunately, the Linux version of Abbyy is two versions behind FineReader 10 and the setup
process is the most challenging amongst all the candidates. Abbyy is also a new competitor in
the Linux market [9]. Therefore, it is difficult to compare the key performance differences between
the Windows and Linux versions of Abbyy.
4.2.5 – ViviData
Table 7: Information generated from Appendix A-5

| Test | Keyword comparison (%) | Numerical comparison |
|---|---|---|
| Test 1 | 100% | *26/26 |
| Test 2a | 11% | 11/35 |
| Test 2b | 6% | 4/37 |
| Test 3a | 74% | 43/43 |
| Test 3b | 68% | 94/94 |
| Average | 51.8% | |
| Total | | 178/235 = 75.7% |

*- not a true numerical comparison but an exact character by character comparison
From Table 7, the performance of ViviData is lower than that of most of the candidates, but the
outcome is influenced by the extremely poor results from the “Test 2a” and “Test 2b” images. Even
with a set of images that share the same font and parameters, images with reliable features do not
guarantee improved accuracy across all OCR engines, because every OCR
engine behaves differently. Comparing the results for “Test 2a” and “Test 2b”, the other OCR
candidates favour these sample images, but ViviData does not.
ViviData provides developers with parameters for input and output options as well as
pre-processing and image recognition options [2]. After the images were processed using line
segmentation, the results for “Test 2a” and “Test 2b” improved significantly, as
shown in Figure 3 and Figure 4 respectively.
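The report does not state how the line segmentation mentioned above was implemented. As a rough illustration only, the sketch below uses one common approach, a horizontal projection profile: rows of a binarized image that contain no ink are treated as gaps between text lines. The function name `segment_lines` and the tiny sample image are hypothetical and are not part of the ViviData API.

```python
from typing import List, Tuple


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines in a binary image.

    image: rows of pixels, 0 for background and 1 for ink.
    min_ink: minimum number of ink pixels for a row to count as text.
    The bottom index of each range is exclusive.
    """
    lines = []
    start = None  # first row of the text line currently being scanned
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                    # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # a blank row ends the line
            start = None
    if start is not None:                # text runs to the bottom edge
        lines.append((start, len(image)))
    return lines


# Hypothetical 5x4 image: one two-row line, a blank row, then a one-row line.
sample = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
print(segment_lines(sample))  # [(1, 3), (4, 5)]
```

Each returned band could then be cropped and passed to the OCR engine on its own, which keeps closely spaced lines from interfering with one another.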
Figure 3: ViviData Test 2A after segmentation and other processing
ID Study ID
Birth Date
Age S
Sex:U Height(kg): Height(cm):
Patient Comments
Study Date : 2010.03.10 Body Part : CHEST
Requesting Department
Referring Physician
Operator Name
Total Image Number : 834
<< Dose Information>>
Total mfls 8197 Total Scan time 31.02
CTDIvoI(mGy) (Head) - (Body) 420.50
DLP(n,Gycn,) (Head) - (Body) 1018.20
EH. DLP(n,Gycn,) 1281.20
<< Contrast/Enhance Information>>
Contrast Enhance : CE
ID: Study ID:
Birth Date: Age:
Sex: Weight(kg): Height(cm):
Patient Comments:
Study Date: 2010. 03. 10 Body Part: CHEST
Requesting Department :
Referring Physician :
Reporting Physician :
Operator Name :
Total Image Number : 634
<< Dose Information >>
Total mAs : 6197 Total Scan time: 31. 02
CTDlvol (mGy) (Head) : - (Body) : 420.50
DLP (mGycm) (Head) : - (Body) : 1016. 20
Eff. DLP(mGycm) : 1281.20
<< Contrast/Enhance Information >>
Contrast Enhance : CE
Word comparison by key words: 65%
Number comparison by number character: 33/35
Figure 4: ViviData Test 2B after segmentation and other processing
ID Study ID
Birth Date Age
Sex : H Height(kg) : Height(cm)
Patient Comments
Study Date : 2010.03.10 Body Part : CHEST
Request i ng Department
Referring Physician
Reporting Physician
Operator Name
Total Image Number 1888
<< Dose Information>>
Total niAs : 8088 Total Scan time : 28.77
CTDIvoI(mGy) (Head) - (Body) 459.70
DLP(n,Gycn,) (Head) - (Body) 1282.90
EH. DLP(n,Gycn,) 1580.50
<< Contrast/Enhance Information>>
Contrast Enhance CE
ID: Study ID:
Birth Date: Age:
Sex: Weight(kg): Height(cm):
Patient Comments:
Study Date: 2010. 03. 10 Body Part: CHEST
Requesting Department :
Referring Physician :
Reporting Physician :
Operator Name :
Total Image Number : 1888
<< Dose Information >>
Total mAs : 6086 Total Scan time: 28. 77
CTDlvol (mGy) (Head) : - (Body) : 459.70
DLP (mGycm) (Head) : - (Body) : 1262. 90
Eff. DLP(mGycm) : 1580.50
<< Contrast/Enhance Information >>
Contrast Enhance : CE
Word comparison by key words: 67%
Number comparison by number character: 35/37
From Figure 3 and Figure 4, some numerical values are still incorrect due to noise in the
sample images and the close proximity of the characters. This is a major concern for all OCR
solutions, because the behaviour of an OCR tool is not perfectly stable. Developers need to
set constraints on the input images in order to improve the consistency of results.
Table 8: Information generated from Appendix A-5, Figure 3 and Figure 4
Test       Keyword comparison (%)   # comparison
Test 1     100%                     *26/26
Test 2a    65%                      33/35
Test 2b    67%                      35/37
Test 3a    74%                      43/43
Test 3b    68%                      94/94
Average    74.8%
Total                               231/235 = 98.3%
* Not a true # comparison but an exact character-by-character comparison
Comparing the results from Table 7 and Table 8, the numerical comparison score improved from
75.7% to 98.3% and the keyword comparison score improved from 51.8% to 74.8%, due to the
additional processing and segmentation shown previously. Therefore, ViviData has the highest
recognition potential among commercial OCR solutions.
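The averages and totals in Table 7 and Table 8 can be rechecked with a few lines of arithmetic. The sketch below assumes that the keyword average is the plain mean of the five per-test percentages and that the numerical total is the sum of matched characters divided by the sum of expected characters. The function name `summarize` and the tuple layout are illustrative only.

```python
def summarize(results):
    """Aggregate per-test results.

    results: list of (keyword_percent, matched, expected), one per test image.
    Returns (keyword average %, matched total, expected total, numerical %).
    """
    keyword_avg = sum(k for k, _, _ in results) / len(results)
    matched = sum(m for _, m, _ in results)
    expected = sum(e for _, _, e in results)
    return round(keyword_avg, 1), matched, expected, round(100 * matched / expected, 1)


# Rows are Test 1, 2a, 2b, 3a, 3b as listed in Table 7 and Table 8.
table7 = [(100, 26, 26), (11, 11, 35), (6, 4, 37), (74, 43, 43), (68, 94, 94)]
table8 = [(100, 26, 26), (65, 33, 35), (67, 35, 37), (74, 43, 43), (68, 94, 94)]

print(summarize(table7))  # (51.8, 178, 235, 75.7)
print(summarize(table8))  # (74.8, 231, 235, 98.3)
```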
5.0 Conclusions
Table 9 illustrates the qualitative summary of the capabilities of each OCR solution based
on previously stated criteria and performance of each OCR candidate.
Table 9: A Comparison Between OCR Candidates
Candidate             Row                   Performance    Performance   Recognition   Cost     Sum
                                            # comparison   keyword       Potential
* Weighting Factor [F]                      3.5            2.5           1.5           2.5
Tesseract             ** Score [S]          ~6.0           ~4.0          9.0           8.0
                      *** Total [T=F x S]   21             10            13.5          20       64.5
[x] GOCR              ** Score [S]          8.4            6.5           6.0           7.5
                      *** Total [T=F x S]   29.4           16.25         9             18.75    73.4
[x] Abbyy (Windows)   ** Score [S]          9.5            8.2           2.0           6.0
                      *** Total [T=F x S]   33.25          20.5          3             15       71.75
AspriseOCR            ** Score [S]          9.2            4.1           3.5           4.0
                      *** Total [T=F x S]   32.2           10.25         5.25          10.0     57.7
ViviData              ** Score [S]          9.8            7.5           7.0           5.0
                      *** Total [T=F x S]   34.3           18.75         10.5          12.5     76.05
* The total value of the weighting factor is 10
** The total score for each criterion is out of 10
*** The total score has a maximum value of 100
~ Estimation is necessary due to variability of data
[x] Candidate not selected because it is missing important criteria (included for reference)
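The totals in Table 9 follow from T = F x S for each criterion, summed over the four criteria. As a sketch, the calculation can be reproduced as below. The dictionary layout and the function name `weighted_total` are illustrative choices; the weights and scores are copied from Table 9, with the estimated (~) Tesseract scores taken at face value.

```python
# Weighting factors [F] from Table 9; they sum to 10.
WEIGHTS = {"number": 3.5, "keyword": 2.5, "potential": 1.5, "cost": 2.5}

# Scores [S] out of 10 for each candidate, in the same criterion order.
SCORES = {
    "Tesseract":  {"number": 6.0, "keyword": 4.0, "potential": 9.0, "cost": 8.0},
    "GOCR":       {"number": 8.4, "keyword": 6.5, "potential": 6.0, "cost": 7.5},
    "Abbyy":      {"number": 9.5, "keyword": 8.2, "potential": 2.0, "cost": 6.0},
    "AspriseOCR": {"number": 9.2, "keyword": 4.1, "potential": 3.5, "cost": 4.0},
    "ViviData":   {"number": 9.8, "keyword": 7.5, "potential": 7.0, "cost": 5.0},
}


def weighted_total(scores, weights=WEIGHTS):
    """Sum of weight x score over all criteria (at most 100 when weights sum to 10)."""
    return sum(weights[c] * scores[c] for c in weights)


for name, scores in SCORES.items():
    print(f"{name:12s} {weighted_total(scores):6.2f}")
```

Keeping the weights in one place makes it easy to see how sensitive the ranking is to the weighting, for example by raising the cost weight and rerunning the loop.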
ViviData has the highest recognition potential; Tesseract has the highest development
potential. AspriseOCR is the only Java-based OCR solution with an easy-to-use Application
Programming Interface. Abbyy is a strong OCR solution in the Windows and Macintosh
environments, while ViviData focuses on the Linux market.
According to Table 9, ViviData is the ideal candidate for extracting medical data on the Linux
platform for the REM project. Compared with the other OCR solutions, it has the highest
numerical comparison score and a decent keyword performance score, which makes it ideal for
retrieving important medical data. GOCR and Abbyy do not meet certain criteria of the
project and are thus eliminated from selection.
6.0 Recommendations
A larger set of test images should be utilized to justify the selection of the most suitable
OCR solution with greater confidence. If time and resources are available, a greater emphasis on
testing and data collection methods or tools is also important for the success of any OCR related
project.
Along with the OCR engine, it is important to develop pre-processing, post-processing and
helper functions to enhance the accuracy of the output. For instance, manipulating the input
image can improve the accuracy of the OCR engine, and fuzzy matching algorithms can better
detect keywords in the output so that they are matched to the corresponding values for data
extraction; both can improve clarity and consistency. This helps ensure the reliability of the
results, especially when the application deals with medical data.
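As one possible illustration of the fuzzy keyword matching suggested above, the sketch below uses Python's standard `difflib` module to map a noisy OCR label to the closest expected keyword. The keyword list, the 0.6 cutoff and the function name `match_keyword` are assumptions chosen for this example and are not taken from the REM project.

```python
import difflib

# Expected field labels (an illustrative subset of the dose report fields).
KEYWORDS = [
    "Total mAs",
    "Total Scan time",
    "CTDIvol(mGy)",
    "DLP(mGycm)",
    "Eff. DLP(mGycm)",
]


def match_keyword(ocr_label, keywords=KEYWORDS, cutoff=0.6):
    """Return the expected keyword closest to ocr_label, or None if none is close enough."""
    matches = difflib.get_close_matches(ocr_label, keywords, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Noisy labels similar to those in the OCR output of Figure 4.
print(match_keyword("Total niAs"))    # Total mAs
print(match_keyword("CTDIvoI(mGy)"))  # CTDIvol(mGy)
print(match_keyword("Patient"))       # None
```

The cutoff trades recall against false matches: a lower value tolerates heavier OCR noise but raises the chance of pairing a value with the wrong field, which matters when the extracted data is medical.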
Glossary
Optical Character Recognition (OCR): The translation of an image of written text into printed text
Copyleft license: Essentially the opposite of copyright where if a party creates work based on GPL’d software and distributes the resulting work, then the party must distribute the resulting work under GPL
GPL: A free, copyleft license for software and other works
Segmentation: The method or instance of dividing something into parts
Modality: Equipment or probes that acquire images of the body such as ultrasound, radiography, magnetic resonance imaging
API: Application Programming Interface; allows interaction between different software programs
Works Cited
[1] Mumtaz, Shakeel. “Character Recognition and Google Tesseract.” 2007. meSmarty. [Online]
<http://www.mesmarty.com/2010/02/07/character-recognition-and-google-tesseract/.>
[Accessed: July 7, 2010]
[2] “OCR Shop XTR Users Guide.” 2004. ViviData. [Online]
<http://www.vividata.com/.>
[Accessed: July 25, 2010]
[3] Baird, Henry and Riopka, Terry. “ScatterType: a Reading CAPTCHA Resistant to
Segmentation Attack.” Comp. Sci. & Engineering Department of Lehigh University.
Document Recognition and Retrieval XII, vol. 5676, 2005. SPIE Digital Library.
<http://www.cse.lehigh.edu/~baird.>
[Accessed: July 7, 2010]
[4] Natarajan, Prem et al., “The BBN Byblos Hindi OCR System.pdf.” BBN Technologies.
Document Recognition and Retrieval XII, vol. 5676, 2005. SPIE Digital Library.
<http://spiedl.org/terms.>
[Accessed: July 7, 2010]
[5] Varga, Tamas and Bunke, Horst. “Perturbation Models for Generating Synthetic Training Data
in Handwriting Recognition.” 2008. University of Bern Institute of Computer Science and
Applied Mathematics.
<www.springerlink.com.>
[Accessed: July 8, 2010]
[6] Ng, Andrew. “Lecture 1 | Machine Learning”. July 2008. Stanford University. [Online]
<http://www.youtube.com/watch?v=UzxYlbK2c7E.>
[Accessed: July 10, 2010]
[7] D’Amours, Danny. Research Support Coordinator. Personal Interview. July 16, 2010.
[8] Nartker, Thomas et al., “Software Tools and Test Data for Research and Testing of Page-
Reading OCR Systems.” University of Nevada. Document Recognition and Retrieval XII, vol.
5676, 2005. SPIE Digital Library.
<http://spiedl.org/terms.>
[Accessed: July 7, 2010]
[9] “ABBYY Announces Its New OCR SDK for Linux Environment.” 2010. Abbyy.
<http://www.abbyy.com/Default.aspx?DN=bcfa7070-594f-475c-98e2-dba48097519c.>
[Accessed: July 15, 2010]
[10] “ABBYY Wins Best Professional Software at Macworld UK Awards.” 2010. Abbyy.
<http://www.abbyy.com/finereader_for_mac/.>
[Accessed: July 16, 2010]
[11] “tesseract-ocr” 2010. Google Project Hosting.
<http://code.google.com/p/tesseract-ocr/.>
[Accessed: July 5, 2010]