OCRFeeder - OCR made easy on GNOME (GUADEC 2012)
static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0, G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);
  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,
Joaquim Rocha
OCRFeeder
OCR Made Easy on GNOME
July 27 2012
Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012
What is it?
Document Analysis and Optical Character Recognition
for GNOME
Why?
Paper has a number of problems
No GNU/Linux application does a fair job of converting it
Paper problems: Security
CC Photo by: http://www.flickr.com/photos/badwsky/
Paper problems: Preservation
CC Photo by: http://www.flickr.com/photos/98469445@N00/
Paper problems: Data processing
CC Photo by: http://www.flickr.com/photos/hugovk/
Paper problems: Ecology
CC Photo by: http://www.flickr.com/photos/pranavsingh/
Paper problems: Accessibility
CC Photo by: http://www.flickr.com/photos/illustrator/
No fair conversion apps for GNU/Linux
apart from OCR engines, but...
OCR != Document Conversion
(it only deals with characters)
(it does not consider the layout)
(it does not distinguish between kinds of content)
What's needed is
Document Analysis and Recognition
(conversion of documents to an electronic format)
(first projects in the 80s)
How it works
So many layouts...
CC Photo by: http://www.flickr.com/photos/uber-tuber/
Layouts vary with the type of document
What works for detecting one won't work for the others
OCRFeeder focuses on contents, not on layouts!
Key concept:
If a document image can be divided into windows of 1 (content)
or 0 (not content), then it is possible to group all the
1s and outline the contents
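The key concept can be illustrated with a minimal Python sketch. It is not OCRFeeder's actual code: the slides do not say how a window is classified, so this version assumes a simple darkness threshold, and the names `content_windows`, `group_content`, `window` and `threshold` are made up for the illustration.

```python
from collections import deque


def content_windows(image, window=2, threshold=128):
    """Return a grid of 1 (content) / 0 (not content), one cell per window.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    Assumption: a window is content if any of its pixels is darker than
    `threshold`.
    """
    rows = (len(image) + window - 1) // window
    cols = (len(image[0]) + window - 1) // window
    grid = [[0] * cols for _ in range(rows)]
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel < threshold:
                grid[y // window][x // window] = 1
    return grid


def group_content(grid):
    """Group 4-connected 1-cells and return one bounding box per group.

    Each box is (left, top, right, bottom) in window coordinates, inclusive.
    """
    seen = set()
    boxes = []
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell != 1 or (r, c) in seen:
                continue
            # Breadth-first flood fill over neighbouring content windows.
            queue = deque([(r, c)])
            seen.add((r, c))
            left = right = c
            top = bottom = r
            while queue:
                cr, cc = queue.popleft()
                left, right = min(left, cc), max(right, cc)
                top, bottom = min(top, cr), max(bottom, cr)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                            and grid[nr][nc] == 1 and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box outlines one group of content windows. A real page would need a smarter per-window test and a larger window size, but grouping the 1s works the same way.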
Recognition:
System-wide OCR engines are used
Engines are configured from the GUI or XML files
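As a rough idea of what an XML-described engine could look like, here is a small Python sketch. The XML layout, the element names (`name`, `path`, `arguments`), the `$IMAGE` placeholder and the `ExampleOCR` engine are all hypothetical; they are not OCRFeeder's real configuration format, which the slides do not show.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical engine description; the element names and the $IMAGE
# placeholder are invented for this sketch.
ENGINE_XML = """
<engine>
    <name>ExampleOCR</name>
    <path>/usr/bin/example-ocr</path>
    <arguments>--input $IMAGE --lang en</arguments>
</engine>
"""


def load_engine(xml_text):
    """Parse one engine description into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("name", default="").strip(),
        "path": root.findtext("path", default="").strip(),
        "arguments": root.findtext("arguments", default="").strip(),
    }


def build_command(engine, image_path):
    """Build the argument list used to run the engine on one image.

    The placeholder is replaced after splitting, so image paths that
    contain spaces stay a single argument.
    """
    args = shlex.split(engine["arguments"])
    return [engine["path"]] + [image_path if a == "$IMAGE" else a for a in args]
```

The resulting list could be passed to `subprocess.run` to call the system-wide engine on an image.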
The best-known free OCR engines are detected and configured
automatically:
* Tesseract
* GOCR
* OCRAD
* Cuneiform
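One plausible way to detect the engines listed above is to look for their executables on the `PATH`. The sketch below is an assumption, not OCRFeeder's actual detection code: it assumes each executable is named after its engine in lowercase, and the `detect_engines` helper and its `which` parameter are made up for the illustration.

```python
import shutil

# Assumption: each engine's executable is its name in lowercase.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for engines found on the PATH.

    `which` can be swapped out, which makes the function easy to test
    without any engine installed.
    """
    found = {}
    for name, executable in KNOWN_ENGINES.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found
```

Calling `detect_engines()` with no arguments checks the real system. Engines that are not installed are simply left out of the result.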
Export formats:
* ODT
* HTML
* Plain text
* PDF
User interaction:
Users can edit everything and review the algorithm's results
So the UI can work in attended and unattended modes
The CLI only works in unattended mode
Demo time!
Other features:
* PDF import
* Unpaper preprocessor
* Font style editing
* Image deskewing
* OCR results cleaning
* Project saving/loading
Future:
* More export formats: hOCR, etc.
* Make OCR engines' management easier
Webpage: http://live.gnome.org/OCRFeeder
git: http://git.gnome.org/ocrfeeder
Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)
Thank you!