Integrating a simple OCR in...
![Page 1: Integrating a simple OCR in Alfrescobeecon.orderofthebee.net/2016/assets/data/files/20160125005/BeeC… · OCR for the Community • Open Source • No other server than Alfresco](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/1.jpg)
Integrating a simple OCR in Alfresco
Angel Borroy, developer @ keensoft
![Page 2](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/2.jpg)
OCR for the Enterprise
• Minimum license starting at 100,000 documents/year
• Dedicated server required
• Hard learning curve
  – Regular expressions
  – Templates and workflows
  – Proprietary integration
![Page 3](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/3.jpg)
OCR for the Community
• Open Source
• No other server than Alfresco
• No learning curve: just drop your documents into a folder and get searchable PDFs
• Every hosting OS is supported
![Page 4](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/4.jpg)
Building a simple OCR Action
1 REPO AMP
• Content model (simple)
• Action
• Transformer
![Page 5](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/5.jpg)
OCR Action: Key classes

    <bean id="ocr-extract"
          class="es.keensoft.alfresco.ocr.OCRExtractAction"
          parent="action-executer" init-method="init">
        <property name="ocrTransformWorker" ref="transformer.worker.OCR" />
    </bean>

    <bean id="transformer.worker.OCR"
          class="es.keensoft.alfresco.ocr.OCRTransformWorker">
        <property name="serverOS" value="${ocr.server.os}" />
        <property name="executerWindows"> <!-- definition truncated on the slide --> </property>
        <property name="executerLinux"> <!-- definition truncated on the slide --> </property>
    </bean>
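
The worker bean receives the server OS and holds one executer per platform. As a rough illustration of that dispatch, here is a minimal Python sketch. The addon itself is Java, so this is not its code: the function name `select_executer` and the returned labels are hypothetical, and only the property names `ocr.server.os`, `ocr.command` and `ocr.url` come from the slides.

```python
def select_executer(props):
    """Return a (kind, target) pair describing which OCR backend to call.

    Assumption: "linux" means a local command line program and "windows"
    means a local HTTP service, as in the two configuration slides.
    """
    server_os = props.get("ocr.server.os", "").strip().lower()
    if server_os == "linux":
        return ("command", props["ocr.command"])
    if server_os == "windows":
        return ("http", props["ocr.url"])
    # Fail loudly instead of silently skipping OCR on an unknown value.
    raise ValueError("Unsupported ocr.server.os: %r" % server_os)
```

Failing on an unknown value is a design choice of this sketch; the real worker may handle that case differently.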
![Page 6](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/6.jpg)
OCR Action: Configuration Linux
alfresco-global.properties

    # local ocr program
    ocr.command=/usr/local/bin/pdfsandwich
    ocr.output.verbose=true
    ocr.output.file.prefix.command=-o
    # rotating, cleaning, languages…
    ocr.extra.commands=-lang spa
    ocr.server.os=linux
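
To make the role of each property concrete, the following Python sketch parses the properties above and assembles a command line from them. It is only an illustration: the helper names `parse_properties` and `build_command`, the file names, and the argument order (source, output prefix, target, extra commands) are assumptions, not the addon's actual Java code.

```python
import shlex

EXAMPLE_CONFIG = """
# local ocr program
ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
ocr.extra.commands=-lang spa
ocr.server.os=linux
"""

def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

def build_command(props, source_pdf, target_pdf):
    """Assemble an argument list for the configured OCR program.

    Assumed order: program, source, output prefix, target, extra options.
    """
    argv = [props["ocr.command"], source_pdf]
    argv += [props["ocr.output.file.prefix.command"], target_pdf]
    # Extra options (rotation, cleaning, languages) are split like a shell would.
    argv += shlex.split(props.get("ocr.extra.commands", ""))
    return argv

command = build_command(parse_properties(EXAMPLE_CONFIG), "scan.pdf", "scan_ocr.pdf")
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine where pdfsandwich is installed.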
![Page 7](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/7.jpg)
OCR Action: Configuration Windows
alfresco-global.properties

    # local ocr service
    ocr.url=http://localhost:60064/api/OCR
    ocr.output.verbose=true
    # rotating, cleaning, languages…
    ocr.extra.commands=Spanish
    ocr.server.os=windows
![Page 8](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/8.jpg)
OCR Action: Rule configuration
Only apply for foreground
![Page 9](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/9.jpg)
OCR Action: Rule configuration
SYNCHRONOUS
ASYNCHRONOUS
![Page 10](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/10.jpg)
OCR Action: Results
![Page 11](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/11.jpg)
What else?
• Study different original documents
  – Existing (incorrect) text layer
  – Image resolution below 200 dpi
  – Landscape / portrait orientation
  – Paper size may change
• Plain OCR software is not enough
* Image coming from http://www.tobias-elze.de/pdfsandwich/
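
Two of the checks listed above, resolution and orientation, come down to simple arithmetic. The sketch below shows one way to compute them in Python. It assumes you already know the scanned image width in pixels and the page size in PDF points (1 point = 1/72 inch); the function names are hypothetical and the 200 dpi threshold is taken from the slide.

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points

def effective_dpi(image_pixels, page_points):
    """Resolution of an image stretched over a page dimension given in points."""
    return image_pixels / (page_points / POINTS_PER_INCH)

def orientation(width_pt, height_pt):
    """Classify a page as landscape or portrait from its size in points."""
    return "landscape" if width_pt > height_pt else "portrait"

def below_ocr_threshold(image_width_px, page_width_pt, min_dpi=200):
    """True when the scan resolution falls under the given dpi threshold."""
    return effective_dpi(image_width_px, page_width_pt) < min_dpi
```

For example, a US Letter page is 612 points wide (8.5 inches), so a 1275 pixel wide scan of it works out to 150 dpi and would be flagged.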
![Page 12](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/12.jpg)
OCR Software: Mac OS X
https://github.com/jbarlow83/OCRmyPDF
• Generates a searchable PDF/A file from a regular PDF
• Keeps the exact resolution of the original images
• Keeps file size about the same
• Deskews and/or cleans the image before performing OCR
• Uses Tesseract OCR engine
• Open Source and developed with Python 3
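
As an illustration of the features listed above, this Python sketch builds an OCRmyPDF command line with language selection, deskewing and cleaning. The helper name `ocrmypdf_command` and the file names are hypothetical; the options `-l`, `--deskew` and `--clean` are OCRmyPDF command line options, but check them against the version you have installed.

```python
def ocrmypdf_command(source_pdf, target_pdf, languages=("eng",),
                     deskew=True, clean=False):
    """Build an argument list for the ocrmypdf command line tool.

    Several languages are joined with "+", as in "eng+spa".
    """
    argv = ["ocrmypdf", "-l", "+".join(languages)]
    if deskew:
        argv.append("--deskew")  # straighten skewed pages before OCR
    if clean:
        argv.append("--clean")   # clean the image before OCR
    argv += [source_pdf, target_pdf]
    return argv

command = ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "spa"))
```

Running it would be a matter of `subprocess.run(command, check=True)`, assuming OCRmyPDF and the Tesseract language packs are installed.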
![Page 13](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/13.jpg)
OCR Software: Linux
http://www.tobias-elze.de/pdfsandwich/
• Generates "sandwich" OCR PDF files
• Recognizes page layout (even for multicolumn)
• Uses unpaper, convert, gs and tesseract
• Open Source and developed using OCaml
![Page 14](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/14.jpg)
OCR Software: Windows
https://github.com/Xandroid4Net/CommandLineOcr (non final)
• Windows.Media.Ocr
  – Microsoft API runnable in Windows 8 and Windows 2012
  – Native in Windows 10 and Windows 2016
![Page 15](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/15.jpg)
OCR Software: Hosted services
https://ocr.space/OCRAPI
http://www.ocrwebservice.com/api/restguide
http://www.bitocr.com/documentation.html
…
https://cloud.google.com/vision/
![Page 16](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/16.jpg)
Real world use case (1)
OS: Ubuntu 14.04 LTS
Version: Alfresco 5.0.d
OCR software: pdfsandwich
Languages: eng+spa+cat+fra
OCR
![Page 17](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/17.jpg)
Real world use case (2)
OS: Ubuntu 15.10
Version: Alfresco 5.0.d
OCR software: OCRmyPDF
Language: eng
OCR
![Page 18](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/18.jpg)
Real world use case (3)
OS: Windows Server 2012 R2
Version: Alfresco 5.1.e
OCR software: Windows.Media.Ocr
Language: Spanish
OCR
![Page 19](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/19.jpg)
Open Source OCR addon
https://github.com/keensoft/alfresco-simple-ocr
License: LGPL v3.0
State: Production
Languages (interface): English, Brazilian Portuguese, German and Spanish
Languages (OCR): 39/25
“No original Alfresco resources have been overwritten”
https://github.com/OrderOfTheBee/addons/wiki/Inclusion-criteria-overview
![Page 20](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/20.jpg)
OCR: Recap
• Automatically generates a searchable PDF from an image-only PDF
• Open Source addon for Alfresco available
• Minimal configuration required
• Different Open Source Linux programs available
• Microsoft also provides the Windows.Media.Ocr library
![Page 21](https://reader035.fdocuments.us/reader035/viewer/2022070819/5f1a7c2aed1bfa08613f8e7e/html5/thumbnails/21.jpg)
Resources
GitHub: http://github.com/keensoft/alfresco-simple-ocr
Twitter: @AngelBorroy
Blog: http://www.keensoft.es/en/category/blog-en/
      http://angelborroy.wordpress.com