SIFT Technical Design Document - avriosoft.io · This document explains how to use, install, and...

SIFT Technical Design Document Bhaarat Sharma


Table of Contents

SIFT Overview
Assumptions
Architecture Overview
  Web App Container
  Queue Container
  Database Container
  OCR Container
    Profiles
      sift0
      sift1
      text
    Python Command Line
    RESTful Interface
Installation
  Building and running containers separately
    Queue Container
    Database Container
    OCR Container
    Web App Container
  Building and Running Containers with One Command
  Building and Running Containers on EC2
    Install Docker and Docker-Compose
    Transfer Containers to EC2
    Fetch Base Images
    Run Docker-Compose
    Verify Running Containers
  Launching SIFT™ on an Amazon Machine Image (AMI)
    Changing Config Properties
    Run Docker-Compose
    Running Containers on Separate Instances
Use Cases
  Using SIFT OCR solution with Spark (PySpark)
    SIFT OCR Solution as a Docker Instance on each Spark node
    SIFT OCR Solution as a native program on each node
    Read data outside Spark
  Integrating SIFT with Other Systems
    Integrating SIFT with a RESTful API


This document explains how to use, install, and configure Aerstone Labs SIFT™. Additionally, it explains how to use SIFT™ with other third-party solutions.

For background on SIFT™, see the Aerstone Labs SIFT page. For a 3-minute video overview of SIFT™, see the SIFT Overview Video.

SIFT Overview

SIFT™, from Aerstone Labs, is a browser-based and agentless cloud data migration solution. It is designed to protect an organization from accidentally spilling information onto unauthorized network enclaves. SIFT™ can be used as a stand-alone file-publishing portal, or integrated seamlessly with existing document management systems. Once configured to search for the kind of data an organization considers sensitive, based on keywords or regular expressions, SIFT™ can be used to implement a data transfer approval workflow, and to optionally tag documents with any discovered keywords. SIFT™ natively supports both searchable documents (e.g., MS Office) and non-searchable assets (e.g., pictures, video, and scanned PDFs). SIFT™ can also be used by system administrators to scan specific network locations for spilled documents, and ships with an API that allows it to be implemented in line with document management solutions, including high assurance guards, content management systems like Adobe AEM or Microsoft SharePoint, or email gateways. In addition, detailed reporting provides valuable real-time audit data about document transfers.

Assumptions

The code snippets in this document use the following demo image:

Demo Image

The build and install instructions assume that the user has access to the SIFT™ installation files provided by Aerstone Labs. The file structure looks like this:


File Structure

├── aerstone-sift-mysql ①
├── aerstone-sift-ocr ②
├── aerstone-sift-rabbitmq ③
├── aerstone-sift-ubuntu ④
├── aerstone-sift-webapp ⑤
├── sift-compose.yml ⑥

① Database Container directory

② OCR Container directory

③ MQ Container directory

④ Ubuntu Image directory

⑤ Webapp Container directory

⑥ Compose YAML file

Architecture Overview

SIFT Architecture Diagram

SIFT™ uses a highly modular architecture that makes it easy to deploy into an organization's existing architecture. Additionally, this architecture allows an organization to pick and choose different parts of SIFT™, which can be deployed as a stand-alone RESTful microservice, and as an interactive browser-based solution.

SIFT™ consists of four core Docker containers:

1. Web App Container (Exposes 8080/tcp)

2. OCR Container (Exposes 5000/tcp)

3. Queue Container (Exposes 15672/tcp)

4. Database Container (Exposes 3306/tcp)

Each Docker container exposes a port, which allows external services to consume SIFT™ and also allows the containers to communicate with one another. This containerized isolation makes SIFT™ highly scalable. For example, the Web App can be load balanced, and the OCR, Queue, and Database containers can be clustered, based on application throughput requirements.
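As a quick way to confirm that the four exposed ports are reachable, a minimal Python sketch follows. Only the port numbers come from the list above; the helper names (port_is_open, check_sift_ports) and the host passed to them are illustrative assumptions, not part of SIFT™.

```python
import socket

# Ports exposed by the four SIFT containers, as listed above.
SIFT_PORTS = {
    "webapp": 8080,
    "ocr": 5000,
    "queue": 15672,
    "database": 3306,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_sift_ports(host, ports=SIFT_PORTS):
    """Map each container name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling check_sift_ports("<host>") returns a dictionary such as {"webapp": True, ...}, which makes it easy to see which container is not yet listening.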

Web App Container

The Web App container contains the SIFT™ web application. The web application is packaged as a WAR file and runs in the Tomcat Application Server. The SIFT™ web application can be accessed either via its API or via the exposed GUI. The web application is built on Grails and runs in any JVM environment. The Web App container communicates with the OCR Container to process unsearchable files (e.g., images and scanned PDFs); with the Database Container as a backend configuration store; and with the Queue Container to store and process high-volume requests in a FIFO ("first in, first out") fashion.

Queue Container

The Queue Container contains RabbitMQ, which is called by the Web App Container and used for request queuing.
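To illustrate the FIFO behavior only, the sketch below uses Python's in-process queue.Queue as a stand-in for RabbitMQ. The request names are made up, and a real deployment would publish to the RabbitMQ queues configured in SIFT™ rather than to an in-memory queue.

```python
import queue


def process_in_fifo_order(requests):
    """Enqueue requests, then drain them in arrival order."""
    pending = queue.Queue()  # queue.Queue is first in, first out
    for request in requests:
        pending.put(request)

    processed = []
    while not pending.empty():
        processed.append(pending.get())
    return processed


# Requests come out in the same order they went in.
order = process_in_fifo_order(["request-1", "request-2", "request-3"])
```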

Database Container

The Database Container includes a MySQL instance, which is called by the Web App container and used to store SIFT™ configuration settings.

OCR Container

The OCR Container receives asset processing requests from the Web App container. The OCR Container contains all the code required to extract text from unsearchable files, as described above.

Profiles

The OCR Container has a concept of "profiles." Each profile pre-processes images prior to sending them to the OCR engine. Profiles are custom built for specific asset types, to enhance text extraction accuracy. SIFT™ ships with three core profiles that together work well for most kinds of assets. Profiles can be combined to further increase processing accuracy. The preprocessing parameter of the REST API shows how to combine profiles.
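As a sketch of how profiles might be combined, the helper below joins profile names into the comma-separated value that the preprocessing parameter takes. The function name build_ocr_payload, the validation it performs, and the example.com URL are assumptions made for illustration.

```python
import json

KNOWN_PROFILES = ("sift0", "sift1", "text")  # the three profiles named above


def build_ocr_payload(url, uniquecode, profiles):
    """Build a request body whose preprocessing value is a CSV of profiles."""
    if not profiles:
        raise ValueError("at least one profile is required")
    unknown = [p for p in profiles if p not in KNOWN_PROFILES]
    if unknown:
        raise ValueError("unknown profile(s): " + ", ".join(unknown))
    return {
        "url": url,
        "uniquecode": uniquecode,
        "preprocessing": ",".join(profiles),
    }


payload = build_ocr_payload(
    "https://example.com/image.png",  # placeholder URL, not a SIFT asset
    "SAMPLE-101",
    ["sift0", "sift1"],
)
print(json.dumps(payload, indent=2))
```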

sift0

This profile is designed for images with annotated text. Below is an example of one such image:


(Sample 1) Image with annotated text

{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/digital-images/Demo_xmp_c.jpg",
  "uniquecode": "Sample 1",
  "preprocessing": "sift0"
}

sift1

This profile is designed for images with varying shades of text that are not well annotated:

(Sample 2) Image with text of varying colors not annotated

{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/digital-images/img_27.png",
  "uniquecode": "Sample 2",
  "preprocessing": "sift1"
}


(Sample 3) Image from a wearable or handheld camera device (language: Spanish)

{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/wearable/spa1.jpg",
  "uniquecode": "Sample 3",
  "preprocessing": "sift1",
  "lang": "spa",
  "rinse": false
}

(Sample 4) Real-world image from a handheld camera device (language: French)

{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/wearable/fra1.jpg",
  "uniquecode": "Sample 4",
  "preprocessing": "sift1",
  "lang": "fra",
  "rinse": false
}

text

This profile is designed for scanned text documents. It converts a PDF to an image, performs pre-processing, and then sends the scanned image to the OCR engine. Below is an example of a scanned PDF document declassified by the NSA:

(Sample 5) Scanned PDF document (multiple pages)


{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/text_scans/img6.pdf",
  "uniquecode": "Sample 5",
  "preprocessing": "text",
  "pdf": true
}

(Sample 6) Image of a document taken using a cell phone (language: Turkish)

{
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/DI2E/Laden/7.1.tur.lg.20160517_183245.jpg",
  "uniquecode": "Sample 6",
  "preprocessing": "text",
  "rotate": true
}

Note that the image in Sample 6 is not horizontally aligned. As part of the pre-processing, SIFT™ rotates the image by the required number of degrees before attempting text extraction.

The OCR solution can be run in two ways:

1. Via the Python command line

2. Via the exposed RESTful interface

Python Command Line

The SIFT™ Python command line interface takes various arguments. Most of the arguments have default values and don't need to be passed in by the user. The table below lists all arguments:

Argument | Description | Default | CMD | REST
processing (file or stream) | memory or I/O | stream | -p | processing
leplib | path to the leptonica library | /usr/local/lib/liblept.so.4.0.2 | -l | leplib
tesslib | path to tesseract library | /usr/local/lib/libtesseract.so.3.0.4 | -o | tesslib
tessdata | path to tesseract data | /usr/local/share/tessdata | -d | tessdata
storefolder | path where temporary files should be stored (only reqd. when value for processing arg is file) | none | -s | storefolder
url | URL for the input image | none | -u | url
filepath | filepath to the input image | none | -f | filepath
uniquecode | Unique code for the request | none | -r | uniquecode
debug (True or False) | debug flag | False | -e | debug

Below is an example of how the script might be executed:

command-line

$ python sift_ocr.py -f /path/to/image.png -r TESTING ①
:33: LOT 5 OF CAR 5 SENTINEL H MAG NOLIA ②

LOTS OF WHITE VAN S ...... I :5 H3

M I la Bal ColumbiaC. Silver Springi :1qu 0 Washihgton ISENTINEL H MAGNOLIA Declassify on: 12.31.48Telephone Company Suspicious Activity

① Run the command line program

② Result is sent to stdout

The command line script does all the processing in memory. There are no I/O operations and no subprocesses are called. Additionally, the Python script and its dependencies could be installed on any *nix server and are not limited to just this Docker container.
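For callers that want to drive the script from another Python program, a minimal sketch follows. The flags are those in the argument table above; the helper names (build_cli_command, run_ocr) are hypothetical, and the sketch assumes sift_ocr.py is in the current directory and python is on the PATH.

```python
import subprocess

# Command-line flags taken from the argument table above.
FLAGS = {
    "processing": "-p",
    "leplib": "-l",
    "tesslib": "-o",
    "tessdata": "-d",
    "storefolder": "-s",
    "url": "-u",
    "filepath": "-f",
    "uniquecode": "-r",
    "debug": "-e",
}


def build_cli_command(script="sift_ocr.py", **arguments):
    """Return an argv list; arguments that are not passed keep their defaults."""
    command = ["python", script]
    for name, value in arguments.items():
        if name not in FLAGS:
            raise ValueError("unknown argument: " + name)
        command += [FLAGS[name], str(value)]
    return command


def run_ocr(**arguments):
    """Run the script and return the text it writes to stdout."""
    result = subprocess.run(
        build_cli_command(**arguments),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_ocr(filepath="/path/to/image.png", uniquecode="TESTING") would issue the same command as the listing above and return the extracted text.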

RESTful Interface

The RESTful API service listens on 5000/tcp. The service parameters are listed below:

Parameter Name | Description | Default | Required?
pdf | Is file a PDF | False | No
url | URL of file | None | Yes - Either url, base64, or filepath are required
base64 | Base64 string of file | None | Yes - Either url, base64, or filepath are required
filepath | Absolute path of the file | None | Yes - Either url, base64, or filepath are required
lang | Language ID | eng | No
preprocessing | Type of pre-processing algorithm to use | None | Yes - CSV variation of "text,sift0,sift1"
uniquecode | Code for file request | None | Yes - Example: "SAMPLE-101"
border | border for cropped files | 4 | No - integer
zoom | zoom for cropped files | 6 | No - integer
blur | blur for cropped files | 7 | No - integer
debug | Runs in debug mode | No | No - boolean (True or False)
constant | Constant subtracted from mean | 5 | No - integer
block | Size of pixel neighborhood | 15 | No - odd integer
rotate | Rotate the image | False | No - boolean (True or False)
rinse | Clean up text | True | No - boolean (True or False)
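The descriptions of block and constant match the usual parameters of adaptive mean thresholding, in which each pixel is compared with the mean of its neighborhood minus a constant. Whether the SIFT™ profiles use exactly this method is an assumption. The pure-Python sketch below only illustrates what the two parameters would control in that technique, and it simplifies border handling by shrinking the window at the image edge instead of padding.

```python
def adaptive_mean_threshold(image, block=15, constant=5):
    """Binarize a grayscale image given as a list of rows of ints (0-255).

    block    -- side length of the square pixel neighborhood (odd integer)
    constant -- value subtracted from the neighborhood mean
    """
    if block < 1 or block % 2 == 0:
        raise ValueError("block must be a positive odd integer")
    height, width = len(image), len(image[0])
    half = block // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighborhood window, clipped at the image border.
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            values = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            threshold = sum(values) / len(values) - constant
            # Pixels brighter than the local threshold become white.
            row.append(255 if image[y][x] > threshold else 0)
        result.append(row)
    return result
```

A larger block averages over a wider area, and a larger constant makes it easier for a pixel to count as background (white).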

Below are a few examples of how the RESTful endpoint might be called:

RESTful-service

$ curl -X POST -H "Content-Type: application/json" -d '{ ①
  "filepath": "/path/to/image.png",
  "uniquecode": "TESTING"
  }' "https://demo.aerstone.com/sift/api/ocr"
{
  "uniquecode": "WHATEVER",
  "rotated": 0,
  "extracted_text": "<text extracted from image>" ④
}

$ curl -X POST -H "Content-Type: application/json" -d '{ ②
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/digital-images/Demo_xmp_c.jpg",
  "uniquecode": "WHATEVER",
  "preprocessing": "sift0,sift1"
  }' "https://demo.aerstone.com/sift/api/ocr"
{
  "uniquecode": "WHATEVER",
  "rotated": 0,
  "extracted_text": "L .. 17 January 2013 14:28:56 :3: LOT 5 or CAR 5 .SENTINEL I MAGNOLIA SENTINEL I MAGNOLIA Declassify on: 12.31.48 Columbia 0 SllSprung Cl C H Telephone Company Suspicious Activity a: 2. ..1.f.:.1 II oBalI BalSilvet Spnng O s. U LOTS OF WHITE VANS y Washington Silver Spring 9 .1 urn: i 9Washmgt WWW II. 1... I1 Li: 1: 51 29 9 y W i Q. Ac: .3 r 3. has .3... h .2 9.... .a o2... ..o .0 7r. IQ 9 3. IJRI. u : a... .f1. a LOTS OF CARS"
} ④

$ curl -X POST -H "Content-Type: application/json" -d '{ ③
  "string": "<binary_string_for_image>",
  "uniquecode": "TESTING",
  "preprocessing": "sift1"
  }' "https://demo.aerstone.com/sift/api/ocr"
{
  "uniquecode": "SOMECODE",
  "rotated": 0,
  "extracted_text": "1 January 2013 14:28:56 :33: LOT 5 OF CAR 5 SENTINEL H MAGNOLIA E...I . LOTS OF WHITE VAN S ...... I :5 H3 M I la Bal ColumbiaC.Silver Springi :1qu 0 Washihgton I SENTINEL H MAGNOLIA Declassify on: 12.31.48Telephone Company Suspicious Activity"
} ④

$ curl -X POST -H "Content-Type: application/json" -d '{ ⑤
  "url": "https://s3-us-west-2.amazonaws.com/aerstone-sift/text_scans/img6.pdf",
  "uniquecode": "WHATEVER",
  "preprocessing": "text",
  "pdf": true
  }' "https://demo.aerstone.com/sift/api/ocr"
{
  "uniquecode": "WHATEVER",
  "rotated": 0,
  "extracted_text": "DOCID:\n\n4046925\n\nUNCLASSIFIEDH-FOR\u2014GFFIe-Ikt-U'S\u2018E'O'N\u2018t\u2018f\u2014\n\n \n\nPreface: The Clew to the Labyrinth\n\n\n\nOne of the most famous stories about libraries tells of the tenth century GrandVizier\nof Persia, Abdul Kassem Ismael who, \u201cin order not to part with hiscollection of\n117,000 volumes when traveling, had them carried by a caravan of 400camels\ntrained to walk in alphabetical order.\"1 However charming this tale may be,the actual\nevent upon which it is based is subtly different. According to theoriginal manuscript,\nnow in the British Museum, the great scholar and literary...."
} ④

① Pass absolute path for the image

② Pass URL for the image

③ Pass binary string for the image


④ Return result

⑤ Scanned PDF

The snippet above demonstrates the flexibility of the solution. The images can be passed in different ways to the RESTful endpoint.
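The same calls can be made from Python using only the standard library. The sketch below is a minimal illustration, not an official client: the helper names are hypothetical, the base64 field name follows the parameter table, and the response is assumed to be JSON with an extracted_text key as in the examples above.

```python
import base64
import json
import urllib.request

SOURCE_FIELDS = ("url", "base64", "filepath")


def validate_payload(payload):
    """Apply the basic rules stated in the parameter table."""
    if not any(payload.get(field) for field in SOURCE_FIELDS):
        raise ValueError("one of url, base64, or filepath is required")
    if not payload.get("uniquecode"):
        raise ValueError("uniquecode is required")
    if "block" in payload and payload["block"] % 2 == 0:
        raise ValueError("block must be an odd integer")
    return payload


def encode_file(path):
    """Return the file's contents as a Base64 string."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


def build_request(endpoint, payload):
    """Build a JSON POST request without sending it."""
    body = json.dumps(validate_payload(payload)).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(endpoint, payload, timeout=60):
    """Send the request and return the extracted_text field of the reply."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["extracted_text"]
```

Keeping build_request separate from extract_text makes it possible to inspect or log the exact request body before anything is sent to the service.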

Installation

The only dependency for installing and running SIFT™ is Docker (1.9+). Using Docker, there are two ways that SIFT™ can be installed and run:

1. Building and running containers separately

2. Building and running containers with one command

Both approaches depend on a few base Docker images that all other SIFT™ Docker images require. These images can be fetched and built as follows:

Base Docker Images

docker pull ubuntu:trusty-20151218 ①
docker pull debian:jessie ②
docker build -t aerstone-sift/ubuntu aerstone-sift-ubuntu ③

① Pull the base Ubuntu image

② Pull the base Debian image

③ Build the Aerstone SIFT Ubuntu image (Java 7)

Building and running containers separately

Below are the instructions for building and running each container.

When building containers separately, they should be built in the order mentioned below.

Queue Container

Build

docker build -t docker_aerstone-sift-rabbitmq aerstone-sift-rabbitmq

docker build -t docker_aerstone-sift-rabbitmq-management aerstone-sift-rabbitmq/management

Run

docker run -d --hostname sift-rabbit --name docker_aerstone-sift-rabbitmq-management \
  -e RABBITMQ_DEFAULT_USER=<user> -e RABBITMQ_DEFAULT_PASSWORD=<password> \
  -p 15672:15672 docker_aerstone-sift-rabbitmq-management


Database Container

Build

$ docker build -t docker_aerstone-sift-mysql aerstone-sift-mysql

Run

Run DB Container

docker run -p 3306:3306 --name docker_aerstone-sift-mysql -d \
  -v /opt/mysql/data:/var/lib/mysql \
  -e 'DB_USER=<uname>' -e 'DB_PASS=<pwd>' -e 'DB_NAME=sift' \
  -e 'DB_REMOTE_ROOT_NAME=<uname>' -e 'DB_REMOTE_ROOT_PASS=<pwd>' -e 'DB_REMOTE_ROOT_HOST=%' \
  docker_aerstone-sift-mysql

OCR Container

Build

$ docker build -t docker_aerstone-sift-ocr aerstone-sift-ocr

Run

$ docker run -d --name docker_aerstone-sift-ocr -p 5000:5000 docker_aerstone-sift-ocr

Web App Container

Build

$ docker build -t docker_aerstone-sift-webapp aerstone-sift-webapp/7.0

Run

Run WebApp Container

$ docker run -d \
  --link docker_aerstone-sift-ocr \
  --link docker_aerstone-sift-mysql \
  --link docker_aerstone-sift-rabbitmq-management \
  -p 8080:8080 --name docker_aerstone-sift-webapp \
  -e TOMCAT_PASS=<password> \
  -e AWS_SECRET_KEY=<secret key> \
  -e AWS_ACCESS_KEY=<access key> \
  -e AWS_BUCKET_NAME=<bucket name> \
  -e RABBITMQ_PASS=<pwd> \
  -e RABBITMQ_USER=<user> \
  docker_aerstone-sift-webapp
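The build order used in this section (queue, database, OCR, web app) can be captured in a small script so that the images are always built in the same sequence. The sketch below assumes it is run from the directory that contains the SIFT™ installation files; the image tags and build contexts are the ones shown in the build commands above, while the helper names are hypothetical.

```python
import subprocess

# (image tag, build context) pairs in the order given in this section.
BUILD_STEPS = [
    ("docker_aerstone-sift-rabbitmq", "aerstone-sift-rabbitmq"),
    ("docker_aerstone-sift-rabbitmq-management", "aerstone-sift-rabbitmq/management"),
    ("docker_aerstone-sift-mysql", "aerstone-sift-mysql"),
    ("docker_aerstone-sift-ocr", "aerstone-sift-ocr"),
    ("docker_aerstone-sift-webapp", "aerstone-sift-webapp/7.0"),
]


def build_commands(steps=BUILD_STEPS):
    """Return the docker build commands as argv lists, in build order."""
    return [["docker", "build", "-t", tag, context] for tag, context in steps]


def build_all(steps=BUILD_STEPS):
    """Run each build in order and stop at the first failure."""
    for command in build_commands(steps):
        subprocess.run(command, check=True)
```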

The Web App container comes with a configuration file, which supports the configuration of various settings in SIFT™. By default, the configuration file is located in the container under /tomcat/sift.groovy. This path can be changed by setting the Tomcat environment variable SIFT_CONFIG, which is defined in /tomcat/bin/setenv.sh with the default value export SIFT_CONFIG="/tomcat/sift.groovy"

An example of the default sift.groovy configuration file is shown below. The values for each property can be changed as required for a specific environment:

SIFT configuration file

def env = System.getenv()
def isCompose = false ①
def OCR_ADDR = 'AERSTONE_SIFT_OCR_PORT_5000_TCP_ADDR' ②
def OCR_PORT = 'AERSTONE_SIFT_OCR_PORT_5000_TCP_PORT'

def DB_ADDR = 'AERSTONE_SIFT_MYSQL_PORT_3306_TCP_ADDR'
def DB_PORT = 'AERSTONE_SIFT_MYSQL_PORT_3306_TCP_PORT'
def DB_NAME = 'AERSTONE_SIFT_MYSQL_ENV_DB_NAME'
def DB_USER = 'AERSTONE_SIFT_MYSQL_ENV_DB_USER'
def DB_PASS = 'AERSTONE_SIFT_MYSQL_ENV_DB_PASS'

def Q_ADDR = 'AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_ADDR'
def Q_PORT = 'AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_PORT'
def Q_USER = 'RABBITMQ_USER'
def Q_PASS = 'RABBITMQ_PASS'
def LDAP_PASS = 'LDAP_PASS'

def AWS_ACCESS_KEY = 'AWS_ACCESS_KEY'
def AWS_SECRET_KEY = 'AWS_SECRET_KEY'
def AWS_BUCKET_NAME = 'AWS_BUCKET_NAME'

if (isCompose == false) {
  OCR_ADDR = 'DOCKER_'+OCR_ADDR
  OCR_PORT = 'DOCKER_'+OCR_PORT
  DB_ADDR = 'DOCKER_'+DB_ADDR
  DB_PORT = 'DOCKER_'+DB_PORT
  DB_NAME = 'DOCKER_'+DB_NAME
  DB_USER = 'DOCKER_'+DB_USER
  DB_PASS = 'DOCKER_'+DB_PASS
  Q_ADDR = 'DOCKER_'+Q_ADDR
  Q_PORT = 'DOCKER_'+Q_PORT
}

sift.role.admin = "SIFT_ADMINS" ③
sift.role.requestorsingle = "SIFT_REQUESTORSSINGLE"
sift.role.requestorbatch = "SIFT_REQUESTORSBATCH"
sift.role.reviewer = "SIFT_REVIEWERS"
sift.role.auditor = "SIFT_AUDITORS"
sift.storefiles.location = 's3' ④
// sift.storefiles.location = 'fileserver'
// sift.fileserver.path = '/Users/bhaarat/code/fileserver'
sift.ocrservice.address=env[OCR_ADDR]
sift.ocrservice.port=env[OCR_PORT]
sift.exif.path='/usr/bin/exiftool'
sift.identify.path='/usr/bin/identify'
sift.ffmpeg.path='/usr/local/bin/ffmpeg'
sift.batch.notallowed='/etc'
dataSource.dbCreate = "update"
dataSource.pooled = true
dataSource.driverClassName="com.mysql.jdbc.Driver"
dataSource.dialect="org.hibernate.dialect.MySQL5InnoDBDialect"
dataSource.url="jdbc:mysql://"+env[DB_ADDR]+":"+env[DB_PORT]+"/"+env[DB_NAME]+"?useUnicode=true&zeroDateTimeBehavior=convertToNull&characterEncoding=UTF-8"
dataSource.username=env[DB_USER]
dataSource.password=env[DB_PASS]
rabbitmq {
  connectionfactory {
    username = env[Q_USER]
    password = env[Q_PASS]
    hostname = env[Q_ADDR]
    port = env[Q_PORT]
  }
  queues = {
    exchange name: 'amq.direct', type: direct, durable: true, autoDelete: false, {
      siftrequestqueue durable: true //, concurrentConsumers:10
      siftbatchqueue durable: true
      siftpdfrequestqueue durable: true
    }
  }
}
grails { ⑤
  plugin {
    aws {
      credentials {
        accessKey = env[AWS_ACCESS_KEY]
        secretKey = env[AWS_SECRET_KEY]
      }
      s3 {
        bucket = env[AWS_BUCKET_NAME]
      }
    }
  }
}

grails.plugin.springsecurity.ldap.active = true ⑥
grails.plugin.springsecurity.ldap.context.managerDn = "[email protected]"
grails.plugin.springsecurity.ldap.context.managerPassword = env[LDAP_PASS]
grails.plugin.springsecurity.ldap.context.server = 'ldap://10.10.71.52:389'
grails.plugin.springsecurity.ldap.authorities.groupSearchBase='OU=Groups,OU=SIFT,OU=Projects,DC=LABS,DC=aerstonelabs,DC=com'
grails.plugin.springsecurity.ldap.authorities.retrieveGroupRoles = true
grails.plugin.springsecurity.ldap.authorities.groupSearchFilter ='(&(objectClass=group)(member={0}))'
//grails.plugin.springsecurity.ldap.authorities.groupSearchFilter="member={0}"
grails.plugin.springsecurity.ldap.authorities.retrieveDatabaseRoles = false
grails.plugin.springsecurity.ldap.mapper.userDetailsClass = 'person'
grails.plugin.springsecurity.ldap.search.filter="(sAMAccountName={0})" // for Active Directory you need this
grails.plugin.springsecurity.ldap.search.base = 'DC=LABS,DC=aerstonelabs,DC=com'
grails.plugin.springsecurity.ldap.authorities.ignorePartialResultException = true // typically needed for Active Directory
grails.plugin.springsecurity.ldap.search.searchSubtree = true
grails.plugin.springsecurity.ldap.auth.hideUserNotFoundExceptions = true

① Whether the containers were run using Docker Compose or not (true or false)

② Environment variables for OCR, DB, and Queue

③ Roles tied to LDAP

④ Whether to store files uploaded to SIFT in S3

⑤ S3 settings

⑥ Enable/Disable AD authentication
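The way the configuration file chooses environment variable names can be easier to follow outside Groovy. The sketch below mirrors the isCompose logic in Python: when isCompose is false, the nine link-related names receive a DOCKER_ prefix. The function names are hypothetical, and the JDBC URL omits the query string for brevity.

```python
# Base names of the variables that the configuration file prefixes.
BASE_NAMES = {
    "OCR_ADDR": "AERSTONE_SIFT_OCR_PORT_5000_TCP_ADDR",
    "OCR_PORT": "AERSTONE_SIFT_OCR_PORT_5000_TCP_PORT",
    "DB_ADDR": "AERSTONE_SIFT_MYSQL_PORT_3306_TCP_ADDR",
    "DB_PORT": "AERSTONE_SIFT_MYSQL_PORT_3306_TCP_PORT",
    "DB_NAME": "AERSTONE_SIFT_MYSQL_ENV_DB_NAME",
    "DB_USER": "AERSTONE_SIFT_MYSQL_ENV_DB_USER",
    "DB_PASS": "AERSTONE_SIFT_MYSQL_ENV_DB_PASS",
    "Q_ADDR": "AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_ADDR",
    "Q_PORT": "AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_PORT",
}


def resolve_names(is_compose):
    """Return the environment variable names the config would look up."""
    prefix = "" if is_compose else "DOCKER_"
    return {key: prefix + name for key, name in BASE_NAMES.items()}


def jdbc_url(env, is_compose=False):
    """Assemble the MySQL JDBC URL from an environment mapping."""
    names = resolve_names(is_compose)
    return "jdbc:mysql://{}:{}/{}".format(
        env[names["DB_ADDR"]], env[names["DB_PORT"]], env[names["DB_NAME"]]
    )
```

Passing os.environ as env reproduces the lookup against the real environment; passing a plain dictionary makes it easy to check which variables a given deployment must define.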

Building and Running Containers with One Command

Build and run all four containers:

docker-compose -p docker -f sift-compose.yml up -d

Building and Running Containers on EC2

Install Docker and Docker-Compose

Install Docker and Docker-Compose

$ sudo yum update -y && sudo yum install -y docker && sudo service docker start
$ sudo usermod -a -G docker ec2-user
$ sudo curl -L https://github.com/docker/compose/releases/download/1.4.2/docker-compose-`uname -s`-`uname -m` | sudo tee /usr/local/bin/docker-compose > /dev/null
$ sudo chmod +x /usr/local/bin/docker-compose

Transfer Containers to EC2


Transfer Containers to EC2

$ scp -i ~/<keyname>.pem -r \
  aerstone-sift-mysql \
  aerstone-sift-ocr \
  aerstone-sift-compose \
  aerstone-sift-rabbitmq \
  aerstone-sift-ubuntu \
  aerstone-sift-webapp \
  <user>@<server_name>:<location>

Fetch Base Images

Fetch and build base images

$ cp aerstone-sift-compose/sift-compose.yml .
$ docker pull ubuntu:trusty-20151218
$ docker pull debian:jessie
$ docker build -t aerstone-sift/ubuntu aerstone-sift-ubuntu

Run Docker-Compose

Run Docker-compose

$ docker-compose -p docker -f sift-compose.yml up -d

Verify Running Containers

The above command will run all four containers, so it will take a little while to finish. Upon successful completion, the Docker containers below will be running on the instance:

CONTAINER ID | IMAGE | COMMAND | CREATED | STATUS | PORTS | NAMES
57c3c2a8d1d1 | docker_aerstone-sift-webapp | "/run.sh" | 10 seconds ago | Up 9 seconds | 0.0.0.0:8080->8080/tcp | docker_aerstone-sift-webapp_1
c3deeafc4ed9 | docker_aerstone-sift-rabbitmq-management | "/docker-entrypoint.s" | 2 minutes ago | Up 2 minutes | 4369/tcp, 5671-5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp | docker_aerstone-sift-rabbitmq-management_1
8d14b8966ab6 | docker_aerstone-sift-rabbitmq | "/docker-entrypoint.s" | 2 minutes ago | Up 2 minutes | 4369/tcp, 5671-5672/tcp, 25672/tcp | aerstone-sift-rabbitmq
2ae03593b0cb | docker_aerstone-sift-ocr | "python sift.py" | 5 minutes ago | Up 5 minutes | 0.0.0.0:5000->5000/tcp | docker_aerstone-sift-ocr_1
14f1fd6a7d74 | docker_aerstone-sift-mysql | "/sbin/entrypoint.sh " | 40 minutes ago | Up 40 minutes | 0.0.0.0:3306->3306/tcp | docker_aerstone-sift-mysql_1
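This check can be scripted. The sketch below asks docker ps for container names only (using its --format option with the {{.Names}} template) and reports which of the expected names are missing. The expected names are taken from the NAMES column above; the helper names are hypothetical.

```python
import subprocess

# Container names from the NAMES column of the docker ps output above.
EXPECTED_CONTAINERS = [
    "docker_aerstone-sift-webapp_1",
    "docker_aerstone-sift-rabbitmq-management_1",
    "aerstone-sift-rabbitmq",
    "docker_aerstone-sift-ocr_1",
    "docker_aerstone-sift-mysql_1",
]


def running_container_names():
    """Return the names of running containers as reported by docker ps."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


def missing_containers(running, expected=EXPECTED_CONTAINERS):
    """Return the expected container names that are not running."""
    return [name for name in expected if name not in running]
```

An empty list from missing_containers(running_container_names()) means every expected container is up.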

Launching SIFT™ on an Amazon Machine Image (AMI)

SIFT™ is available on the AWS Marketplace, with all containers available on a single EC2 instance. The instructions below show how to launch all Docker containers using the docker-compose command, and have SIFT™ running at http://<public_dns>:8080/sift

Changing Config Properties

The following configuration settings should be modified prior to running the Web App container.

(these settings may be changed in aerstone-sift-webapp/7.0/sift.groovy)

S3 Settings to be changed

accessKey = "<aws access key>"
secretKey = "<aws secret key>"
bucket = "<aws bucket name>"

Run Docker-Compose

The following steps should be followed on a new EC2 SIFT™ instance:

1. Verify that the following tcp ports are open: 8080, 5000, 3306, 15672, and 22

2. SSH into the EC2 instance

3. Verify the contents of /home/ec2 directory with the ls command. The following should be listed:

• aerstone-sift-mysql

• aerstone-sift-ocr

• aerstone-sift-rabbitmq

• aerstone-sift-ubuntu

• aerstone-sift-webapp

• sift-compose.yml

• README.txt


4. Use the docker ps command to verify that no docker containers are running

5. Run all the containers by switching to the /home/ec2 directory and executing docker-compose -p docker -f sift-compose.yml up -d

6. Verify that all containers are running by executing the docker ps command. The output should match the listing below. If all the containers are running, the SIFT GUI should also be accessible at http://<public_dns>:8080/sift

CONTAINER ID   IMAGE   COMMAND   CREATED   STATUS   PORTS   NAMES
57c3c2a8d1d1   docker_aerstone-sift-webapp   "/run.sh"   10 seconds ago   Up 9 seconds   0.0.0.0:8080->8080/tcp   docker_aerstone-sift-webapp_1
c3deeafc4ed9   docker_aerstone-sift-rabbitmq-management   "/docker-entrypoint.s"   2 minutes ago   Up 2 minutes   4369/tcp, 5671-5672/tcp, 15671/tcp, 25672/tcp, 0.0.0.0:15672->15672/tcp   docker_aerstone-sift-rabbitmq-management_1
8d14b8966ab6   docker_aerstone-sift-rabbitmq   "/docker-entrypoint.s"   2 minutes ago   Up 2 minutes   4369/tcp, 5671-5672/tcp, 25672/tcp   aerstone-sift-rabbitmq
2ae03593b0cb   docker_aerstone-sift-ocr   "python sift.py"   5 minutes ago   Up 5 minutes   0.0.0.0:5000->5000/tcp   docker_aerstone-sift-ocr_1
14f1fd6a7d74   docker_aerstone-sift-mysql   "/sbin/entrypoint.sh "   40 minutes ago   Up 40 minutes   0.0.0.0:3306->3306/tcp   docker_aerstone-sift-mysql_1
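The check in step 6 can also be scripted. The following is a minimal sketch, assuming the container names match the listing above and that the names are collected with docker ps --format '{{.Names}}'; the helper names missing_containers and running_container_names are illustrative and not part of SIFT.

```python
import subprocess

# Container names taken from the docker ps listing in this document.
EXPECTED_CONTAINERS = {
    "docker_aerstone-sift-webapp_1",
    "docker_aerstone-sift-rabbitmq-management_1",
    "aerstone-sift-rabbitmq",
    "docker_aerstone-sift-ocr_1",
    "docker_aerstone-sift-mysql_1",
}


def missing_containers(names_output, expected=EXPECTED_CONTAINERS):
    """Return the expected container names that are absent from the
    newline-separated output of: docker ps --format '{{.Names}}'."""
    running = {line.strip() for line in names_output.splitlines() if line.strip()}
    return sorted(set(expected) - running)


def running_container_names():
    """Ask the local Docker daemon for the names of running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On the EC2 instance, missing_containers(running_container_names()) would return an empty list when every container from the listing is up.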

Running Containers on Separate Instances

In order to scale, each container may be run separately on different instances. Docker-compose can’t be used to accomplish this. In this scenario, each container should be built and run on instances by following the above sections in this document. Once all containers are running, the following environment variable paths should be created on the Web App container in order for it to be linked to other containers. The docker-compose command creates these environment variables automatically, but in this scenario they should be created manually prior to running the Web App container using the following commands:


Environment variables created after docker-compose

$ echo $AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_ADDR ①
172.17.0.5
$ echo $AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_PORT ②
15672
$ echo $AERSTONE_SIFT_MYSQL_PORT_3306_TCP_ADDR ③
172.17.0.2
$ echo $AERSTONE_SIFT_MYSQL_USERNAME
<mysql username>
$ echo $AERSTONE_SIFT_MYSQL_PASSWORD
<mysql pwd>
$ echo $AERSTONE_SIFT_MYSQL_DBNAME
sift
$ echo $AERSTONE_SIFT_OCR_PORT_5000_TCP_ADDR ④
172.17.0.3
$ echo $AERSTONE_SIFT_OCR_PORT_5000_TCP_PORT ⑤
5000

① IP Address to reach the RabbitMQ container

② Port to reach the RabbitMQ container

③ IP Address to reach the Database container

④ IP Address to reach the OCR container

⑤ Port to reach the OCR container
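Before starting the Web App container on its own instance, it may help to confirm that every variable listed above is set. The following is a minimal sketch of such a check; the function names are illustrative, and the http scheme used for the OCR address is an assumption, since the source only gives the address and port.

```python
import os

# Variable names as listed in this document.
REQUIRED_VARS = [
    "AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_ADDR",
    "AERSTONE_SIFT_RABBITMQ_MANAGEMENT_PORT_15672_TCP_PORT",
    "AERSTONE_SIFT_MYSQL_PORT_3306_TCP_ADDR",
    "AERSTONE_SIFT_MYSQL_USERNAME",
    "AERSTONE_SIFT_MYSQL_PASSWORD",
    "AERSTONE_SIFT_MYSQL_DBNAME",
    "AERSTONE_SIFT_OCR_PORT_5000_TCP_ADDR",
    "AERSTONE_SIFT_OCR_PORT_5000_TCP_PORT",
]


def missing_link_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def ocr_base_url(env=None):
    """Build the OCR container address from the two OCR variables.
    The http:// scheme is an assumption, not stated in the source."""
    env = os.environ if env is None else env
    addr = env["AERSTONE_SIFT_OCR_PORT_5000_TCP_ADDR"]
    port = env["AERSTONE_SIFT_OCR_PORT_5000_TCP_PORT"]
    return "http://{}:{}".format(addr, port)
```

Passing a dictionary instead of relying on os.environ makes the check easy to try against a candidate set of values before exporting them.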

Use Cases

This section describes different ways in which SIFT™ can be integrated with various third party solutions.

Using SIFT OCR solution with Spark (PySpark)

Spark is a general purpose distributed computing platform. SIFT™ can be part of a big data pipeline to process un-searchable files.

The sections below describe different ways in which SIFT™ can be integrated with Spark.

SIFT OCR Solution as a Docker Instance on each Spark node

SIFT-OCR docker container on each Spark Node

Place the SIFT OCR Docker container on each Spark node. PySpark can pass an absolute path to the image, an S3 URL to the image, or an image string to the REST endpoint and get the results back.
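The three input styles can be illustrated with a small request builder. This is only a sketch: the source does not document the REST contract, so the /ocr route and the JSON field names image_path, image_url, and image_data are hypothetical placeholders to be replaced with the real SIFT OCR API.

```python
import json
import urllib.request

# Hypothetical route and JSON field names: the REST contract is not given
# in this document, so adjust these to match the real SIFT OCR service.
OCR_ROUTE = "/ocr"
FIELD_BY_KIND = {
    "path": "image_path",          # absolute path to the image
    "s3_url": "image_url",         # S3 URL to the image
    "image_string": "image_data",  # image content passed as a string
}


def build_ocr_request(base_url, kind, value):
    """Build (but do not send) a POST request for one image reference."""
    if kind not in FIELD_BY_KIND:
        raise ValueError("unsupported input kind: %r" % (kind,))
    body = json.dumps({FIELD_BY_KIND[kind]: value}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + OCR_ROUTE,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Within a PySpark job, a function passed to map could build such a request for the container on the local node (port 5000 in the listings above) and send it with urllib.request.urlopen.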


SIFT OCR Solution as a native program on each node

SIFT-OCR docker container on each Spark Node

This approach requires installing all SIFT OCR dependencies on each Spark node. PySpark can then import the SIFT OCR command-line script and call its methods. The code below shows an example:

PySpark (Spark 1.6)

>>> import sift_ocr ①
>>> import numpy as np
>>> import cv2
>>> so = sift_ocr.SiftOcr() ②
>>> image = sc.binaryFiles("hdfs://localhost:9000/*.png").take(1) ③
>>> file_bytes = np.asarray(bytearray(image[0][1]), dtype=np.uint8) ④
>>> string = so.text_from_image(so.read_image_from_bytes(file_bytes)) ⑤
>>> string
'17 January 2013 14:28:56 SENTINEL H MAG NOLIA Declassify on: 12.31.48 MM OBaI LOTS OF WHITE VANS Washington Silver Spring 0 WEE 0 IT. IT to Telephone Company .3 1.24... P Suspicious Activity LOTS OF CARS SENTINEL H MAG NOLIA'
>>> images = sc.binaryFiles("/flask_server/*.jpg")
>>> # Decode each file's own bytes (rawdata), not the sample image taken above
>>> image_to_array = lambda rawdata: so.text_from_image(cv2.imdecode(np.asarray(bytearray(rawdata), dtype=np.uint8), 1)) ⑥
>>> img = images.values().map(image_to_array)
>>> img.take(1)
['17 January 2013 14:28:56 SENTINEL H MAG NOLIA Declassify on: 12.31.48 MM OBaI LOTS OF WHITE VANS Washington Silver Spring 0 WEE 0 IT. IT to Telephone Company .3 1.24.. . P Suspicious Activity LOTS OF CARS SENTINEL H MAG NOLIA'] ⑦

① Import sift_ocr module

② Create instance of the module

③ Build RDD and use one for sampling

④ Convert to bytearray then to np array

⑤ Call SIFT to get result from Image

⑥ Apply the lambda function to all images

⑦ Result back from SIFT


Read data outside Spark

SIFT-OCR docker container on each Spark Node

[Figure: Spark Use Case 3]

This approach would handle the processing of images outside of Spark. Spark would be used to ingest the extracted text for each file. An n-tier REST service with multiple SIFT OCR Docker containers would be created. A client would be written to read image files from disk and pass them to the REST service, or to pass image URLs to the service. The service would write the results for each file into a uniquely named text file. A Spark cluster would then read the text files and put the results in the DB, where an analyst could run queries on them.
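The "uniquely named text file" step could look like the following minimal sketch. Hashing the source reference to derive the file name is one possible naming scheme chosen here for illustration; the source does not say how names are generated, and the function names are hypothetical.

```python
import hashlib
import os


def result_path(output_dir, source_ref):
    """Derive a stable, unique file name from the image path or URL.
    SHA-256 naming is an illustrative choice, not the documented scheme."""
    digest = hashlib.sha256(source_ref.encode("utf-8")).hexdigest()
    return os.path.join(output_dir, digest + ".txt")


def write_result(output_dir, source_ref, text):
    """Write the extracted text for one image and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = result_path(output_dir, source_ref)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

On the Spark side, a directory of such files could then be loaded with sc.wholeTextFiles(output_dir), which yields (file path, content) pairs ready to be written to the database.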

Integrating SIFT with Other Systems

SIFT's RESTful API makes it easy to integrate with various third-party tools.

Integrating SIFT with a RESTful API

This section describes using SIFT with a content management solution like Adobe Experience Manager (AEM). AEM is a content management solution that makes it easy to manage content and assets. SIFT enhances AEM by making unsearchable assets searchable. This is accomplished by utilizing AEM’s flexible workflow process to call the SIFT™ RESTful API, using a custom AEM JavaScript workflow. For a video of SIFT working together with AEM, visit SIFT - AEM Video.

The custom SIFT-AEM workflow step is added to AEM’s Digital Asset Management (DAM) "Update Asset" workflow. Any time a user attempts to upload a new asset to AEM, the asset’s URI is provided to the SIFT RESTful API, which detects the file type, extracts text from the asset, and returns the extracted text. The custom step then tags the asset’s metadata with the extracted text and finalizes adding the asset to AEM. The end result is that an asset that would otherwise have been unsearchable in AEM is now searchable.

The image below shows the custom SIFT-AEM workflow step added to AEM’s DAM "Update Asset" workflow.

SIFT-AEM custom workflow step
