ocmr
-
Upload
vinay-singh -
Category
Documents
-
view
350 -
download
0
Transcript of ocmr
INTRODUCTION
Now-a-days the increase in popularity of portable computing devices such as
PDAs and handheld computers, non keyboard based methods for data entry are
receiving more attention in the research communities and commercial sector. The
most promising options are pen-based and voice-based inputs. Digitizing devices
like Smart Boards and computing platforms such as the IBM ThinkPad TransNote
and Tablet PCs, have a pen based user interface. Such devices, which generate
handwritten documents with online or dynamic (temporal) information, require
efficient algorithms for processing and retrieving handwritten data. Online
documents may be written in different languages and scripts. Most of the text
recognition algorithms are designed to work with a particular script and treat any
input text has being written only in the script under consideration. Therefore, an
online document analyzer must first identify the script before employing a
particular algorithm for text recognition. Fig.1.1 shows an example of a document
page containing six different scripts.
Note that a typical multilingual, online document contains two or three scripts. The
document page shown in figure 1 was created by us for illustrating the results.
Note that the writing style in English is mixed, with both cursive and handprint
words. For lexicon-based recognizers, it may be helpful to identify the specific
language of the text if the same script is used by multiple languages.
A manuscript is defined as a graphical form of a writing system. A specific script
like Roman may be used by multiple languages such as English, German and
French. The six scripts considered in this work, named Arabic, Cyrillic, Devnagari,
Han, Hebrew, and Roman, cover the languages used by a majority of the world
population (see Fig. 1). The general class of Han-based scripts includes Chinese,
Japanese, and Korean (we do not consider Kana or Han-Gul). Devnagari script is
used by many Indian languages, including Hindi, Sanskrit, Marathi, and
1 | P a g e
Rajasthani. Arabic script is used by Arabic, Farsi, Urdu, etc. Roman script is used
by many European languages like English, German, French, and Italian. We
attempt to solve the problem of script recognition, where either an entire document
or a part of a document (up to word level) is classified into one of the six scripts
mentioned above, with the aim of facilitating text recognition or retrieval. The
problem of identifying the actual language often involves recognizing the text and
identifying specific words or sequence of characters, which is beyond the scope of
this project.
Fig1.1: Online multilingual document containing Cyrillic, Hebrew, Roman, Arabic, Devnagari, and Han scripts.
Most of the published work on automatic script recognition deals with offline
documents, i.e., documents which are either handwritten or printed on a paper and
then scanned to obtain a two-dimensional digital representation.
2 | P a g e
LITERATURE REVIEW
Existing System
Most of the published work on automatic script recognition deals with offline
documents, i.e., documents which are either handwritten or printed on a paper and
then scanned to obtain a two-dimensional digital representation. It does not cope
up with the developing technologies such as Smart Boards, IBM ThinkPad
TransNote and Tablet PCs have a pen-based user interface. The Disadvantages
of the existing system are as follows:
The actual languages are identified using projection profiles of words
and character shapes.
The existing system is looking for the presence or absence of specific
shapes in different scripts.
Existing method deals with only spatial characteristics.
Error rate is higher.
Proposed System
The proposed method uses the features of connected components to classify six
different scripts (Arabic, Cyrillic, Devnagari, Han, Hebrew, Roman) and reported a
classification accuracy of 88 percent on document pages. There are few important
aspects of online documents that enable us to process them in a fundamentally
different way than offline documents. The most important characteristics of online
documents are that they capture the direction of strokes while writing the
document. This allows us to analyze the individual strokes and use the additional
direction for both script identification as well as text recognition. In the case of
online documents, segmentation of foreground from the background is a relatively
simple task as the captured data, i.e., the (x; y) coordinates of the locus of the
3 | P a g e
stylus, define the characters and any other point on the page belongs to the
background. We use stroke properties as well as the spatial and temporal
information of a collection of strokes to identify the script used in the document.
Unfortunately, the temporal information also introduces additional variability to the
handwritten characters, which creates large intra class variations of strokes in
each of the script classes. The advantages of the proposed system are as follows
This system especially deals with online documents.
The proposed system deals with both spatial and temporal
characteristics of the document.
Easy segmentation of foreground and background of the online
document.
Error rate is lesser.
4 | P a g e
SOFTWARE REQUIREMENT SPECIFICATION
Introduction
An Online Calligraphic Manuscript Recognition (OCMR) is part of the so-called Manuscript Recognition Systems, which are applications supporting the direct contact with the user.
Scope of the Project
We describe what features are in the scope of the software and what are not in the
scope of the software to be developed.
In scope
1) Allows the users to write a new script and then save it.
2) Users can write a script and then system can recognise it.
3) Users can view all the scripts saved in the database.
4) Users can delete any/all the saved scripts.
Out of scope
1) The user can recognise only the scripts which are saved in the database.
2) It can generate a error rate to some extent.
Development Methodologies
The following software and languages may be used
a) Java b) Java Plug-Inc) Java Web Startd) Java 2D Toolkitse) Java Swing
5 | P a g e
Model Used
Here we will be using the Incremental model; one of the most effective type of model which is most widely used. The incremental model is an intuitive approach to the waterfall model. Multiple development cycles take place here, making the life cycle a “multi-waterfall” cycle. Cycles are divided up into smaller, more easily managed iterations. Each iteration passes through the requirements, design, implementation and testing phases.
Fig 3.1: Incremental Model
Functional Requirements
1) Functional Summary:
The application will ensure that user will be able to write, save, view, recognise and delete scripts without going through any trouble. The scripts saved by a person are reflected in the database in real-time.
2) Performance:
The application on the client side will work as an effective GUI. A system will be needed to store the program files and the database and other files which are necessary for the smooth functioning of the software.
3) Design Constraints:
6 | P a g e
There are no design constraints at the time as all aforementioned criteria are satisfied with ease.
General Description
1) Product Perspective:
It is an intended to be areal time application. OCMR should be auserfriendly, quick to
learn and reliable application. It shouls run on both UNIX and windows based operating
systems.
2) Functional Summary:
The application will ensure that the system has a consistent database which can be accessed by the user to access the scripts saved in it. The GUI will work in any system. The software will allow the user to write a script,save and recognise it.
3) User Characteristics:
The application is designed for the script recognition in an online and offline environment . The application will be of a user friendly nature and can be used by any computer literate person. The simplified GUI is easy to use and handle.
Database Design
This section provides a brief description of all the tables in the database that are used by the system.
The database stores the script in six languages provided/specified by the user.
1. The six types of script: Arabic Cyrillic Devnagari Han Hebrew Roman
2. Users:
7 | P a g e
A user can store a sript in one of the six different languages and then try to recognise it.
SOFTWARE DESCRIPTION
8 | P a g e
The project entitled as ONLINE CALLIGRAPHIC MANUSCRIPT RECOGNITION
is a Real Time application which helps to recognize the different scripts in online
using a common platform. The front end is designed and executed with the
JDK1.5.0 handling the core java part with User interface Swing component.
Development Tools
The development tools provide everything you’ll need for compiling, running,
monitoring, debugging, and documenting your applications. As a new developer,
the main tools you’ll be using are the Java compiler (javac), the Java launcher
(java), and the Java documentation (javadoc).
Application programming Interface (API)
The API provides the core functionality of the Java programming language. It
offers a wide array of useful classes ready for use in your own applications. It
spans everything from basic objects, to networking and security.
Deployment Technologies
The JDK provides standard mechanisms such as Java Web Start and Java Plug-
In, for deploying your applications to end users.
User Interface Toolkits
The Swing and Java 2D toolkits make it possible to create sophisticated Graphical
User Interfaces (GUIs).
UML DIAGRAMS
9 | P a g e
1. UML Diagram
User
Fig. 5.1: Use Case Diagram
2. DFD Level 0
10 | P a g e
Access recognizable patterns
Calligraphic parameters
Update database
User commands Display
And data information
language
types
Sensor Storing
Status values
Fig. 5.2: LEVEL 0 Data Flow Diagram
3. DFD Level 1
11 | P a g e
Calligraphic
Software
Control
Panel
Sensors
Control Panel
Display
Languages
Database
display
information
user commands configure data
and data configure request
sensor info
start/stop language type
update info
sensor status
Fig. 5.3: LEVEL 1 Data Flow Diagram
4. DFD Level 2
12 | P a g e
Control panel
Configure system
Interact with user
sensors
Monitor sensors
Display messages
Control panel display
languages
database
Inception July '10
Planning August '10September '10
Analysis September '10
Construction October '10 - February '11
Testing March '11
Deployment April '11
Sensor information
data
data
sensor
configure request configure data information
data update info
changes sensor ID,
type
update info sensor ID
new data store changes
sensor
status
Fig. 5.4: LEVEL 2 Data Flow Diagram
TIMELINE CHART
13 | P a g e
Control panel
sensors
Control panel display
languages
database
Interact with user
Monitor changes
Monitor sensors
Configure system
Display messages
Read sensors
Generate signal
Format for display
Update changes
Start the process
Choose the language and write a new document
Fig 6.1 Time Line Chart
OVERALL PROCESS FLOW DIAGRAM
14 | P a g e
Fig. 7.1 Overall Flow Diagram
CONCLUSION AND FUTURE ENHANCEMENT
15 | P a g e
We have presented a script identification algorithm to recognize six major scripts in
an online document. The aim is to facilitate text recognition and to allow script-
based retrieval of online handwritten documents. The classification is done at the
word level, which allows us to detect individual words of a particular script present
within the text of another script. The classification accuracies reported here are
much higher than those reported in the case of script identification of offline
handwritten documents, although the reader should bear in mind that the
complexities of the two problems are different.
One of the main areas of improvement in the above algorithm is to develop a
method for accurately identifying text lines and words in a document. We are
currently working on developing statistical methods for robust segmentation of
online documents.The script classification algorithm can also be extended to do
page segmentation, when different regions of the handwritten text are in different
scripts.The results of our System and the efficiency of it to classify the words is
provided in the appendix .
REFERENCESPAPERS
16 | P a g e
1. Anoop M. Namboodiri, and Anil K. Jain, “Online Handwritten Script
Recognition” Ieee Transaction On Pattern Analysis And Machine
Intelligence, Vol. 26, no. 1, January 2004.
2. A.L. Spitz, “Determination of the Script and Language Content of Document
Images,” IEEE Trans. Pattern Analysis and Machine Intelligence,
WEBSITES
1. www.ieee.explore.com
2. www.pencomputing.com
3. www.ancientscripts.com
4. www.informit.com
5. www.sun.com
6. www.java.com
17 | P a g e