Unified Information Governance - Stratas · enterprise Information Governance, considerably...

*Gartner Innovation Insight: File Analysis Innovation Delivers an Understanding of Unstructured Dark Data, Analyst Alan Dayley, March 28, 2014

Unified Information Governance

Stratas Data Forge - Create intelligence.

Gartner found that 80% of all corporate data is unstructured and will grow by 800% in the next five years*, meaning the bulk of a company’s information backbone is not easily accessed, understood or utilised.

The majority of this data is deemed ROT (“Redundant, Obsolete or Trivial”) and the cost of managing and storing this information is considerable. This “dark data” also represents a source of risk to the business, for the contents of these documents remain largely unknown.

Remediating ROT will typically remove 70% of the unstructured data in the company, leaving the remainder of the data in a position where it can be catalogued, classified and mined, creating actionable business intelligence and supporting regulatory compliance.

The Stratas Data Forge is a revolutionary platform to discover, classify, search, store and control any document which may exist in your organisation – be it physical or electronic – with speed, accuracy and functionality never before achievable.

Optical Character Recognition as a tool for processing data can be slow and limiting - the more data, the slower the process. Stratas uses the Data Forge to break this convention by applying scientifically proven statistical tools that visually group and classify enterprise data, with content analytics and defensible remediation that deliver unified Information Governance.

Machine Learning, multi-purpose deep hierarchical architecture and statistics-based algorithms create scalable language-agnostic supervised and unsupervised learning methods for automated document identification, classification, remediation and retention.

The Data Forge platform performs a number of key functions:

1. Unstructured Data Analysis. Data Forge allows faceted search and cross-section analysis of unstructured enterprise data, clustering search results using extractive summaries and key phrases.

2. Structured clustering. Data Forge can automatically group large volumes of documentation based upon similarity, regardless of how the data has been stored. This step can be independent of any content recognition or classification of the underlying documents. Using scientific text and image tools, the documents can be intelligently classified for routing into workflow or inclusion into an Electronic Document Repository. By turning unstructured data into business intelligence, you become fully aware of all the documents contained in the organisation – knowing what you have and where it is enables informed decision-making and accurate remediation of enterprise-wide unstructured data.

3. Intelligent Document Classification. With Data Forge, data extraction is made simple using context-based methodologies for semantic recognition. Features include Logical Document Boundary Determination, scalable Near Duplicate identification, enriched metadata and genre-based clustering. The platform is unique, powerful and intuitive.

Data Forge is more than the revolution of Document Management. It is a unified Information Governance platform for unstructured data and content intelligence, which provides a systematic way to find, classify and manage compliance documents across organisations. Our approach to document classification, defensible remediation and retention goes beyond the remit of eDiscovery, fundamentally changing the economics of enterprise Information Governance and considerably improving business operations.

Stratas is the first and, to date, the only unified Information Governance solutions provider that can effectively address the need for a single, highly flexible and integrated data platform. Designed specifically around the customer, our holistic approach considers the people, process, technology and culture of every business. The resulting solutions are faster, more intuitive, more accurate and more feature-rich than any other technology or manual process. We deliver unparalleled process improvement, user adoption, compliance and cost benefit to companies of any size.

Proof of Value (“POV”) Information sheet

We run two types of POV with Data Forge. The simpler of the two is an “eDiscovery” approach, which does not look to arrange the data in accordance with a predetermined business process or need, but simply to identify and cluster the information.

The Crawler is “pointed” at a set of unstructured data and set to Discovery mode. The documents are segmented and the ROT (Redundant, Obsolete and Trivial data), which includes duplicates and system files, is identified. The remaining data is then clustered into groups based upon visual similarity.
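
The duplicate-identification part of Discovery mode can be illustrated with a content hash. The capability tables later in this sheet list SHA-1 "or equal" for duplicate detection; the sketch below assumes documents are available as in-memory byte strings, and the function name `group_duplicates` and the sample file names are hypothetical, not part of Data Forge.

```python
import hashlib
from collections import defaultdict


def group_duplicates(documents):
    """Group document names by the SHA-1 digest of their raw bytes.

    `documents` maps a name to its content as bytes (an assumed input shape).
    Only groups containing more than one name are returned.
    """
    by_digest = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha1(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: two byte-identical invoices and one contract.
sample = {
    "invoice_a.txt": b"Invoice 1001: 3 widgets",
    "invoice_copy.txt": b"Invoice 1001: 3 widgets",
    "contract.txt": b"Master services agreement",
}
duplicate_groups = group_duplicates(sample)
print(duplicate_groups)
```

Exact hashing only catches byte-identical copies; documents that differ by a single character fall to near-duplicate detection instead.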

In essence, this is a digital version of the manual process for sorting a pile of unidentified documents: spread them out on the table, get rid of the ones you cannot read or do anything with, then group the rest together into piles based upon what they look like, e.g. invoices, contracts, purchase orders, CVs, pictures, presentations etc.

Once the data is clustered, you are in a position to do something meaningful with that information.

Data Forge is a collection of specialised scientific tools which are compiled, as defined by a project scope, to address a specific need within a customer environment. As such, the upfront initialisation for the second type of POV, one aligned to a specific challenge or process, is more detailed than that of the “eDiscovery” approach. It requires a defined objective in order for the tools to be compiled correctly and the data interrogated with this set goal in mind. The process is more easily explained by the graphic below:

Firstly, we build the scope for the POV – this tells the system what we want to achieve. It comprises obtaining document exemplars from the data set, defining critical rules (e.g. document-specific content sought), defining the customer-specific taxonomies and stating any rules (e.g. retention policies). This creates the Controlled Vocabulary for the POV. The tools required for the project are assembled into the Data Forge platform and we are ready to initiate the process.

The data set is run and the ROT (Redundant, Obsolete and Trivial documents) is again excluded, together with System Files, duplicates and multiple versions.

The Classification Strategy is applied to the resulting data set and refined according to the required output. Filters are applied and then the Classification Decision is applied to the data via a number of techniques or tools, depending on the ultimate requirement. The output is vetted via Quality Control measures and then presented to a database or ECM.

Figure: Data Forge Workflow, Data Analysis and Classification

This targeted and scientific approach is more than the revolution of Document Management. It is the creation of a unified Information Governance platform for unstructured data and content intelligence, which provides a systematic way to find, classify and manage compliance across an organisation's unstructured data. The Stratas approach to document classification, defensible remediation and retention goes beyond the remit of eDiscovery and Enterprise Content Management. It fundamentally changes the economics of enterprise Information Governance, considerably improving business operations and creating far greater business value from your unstructured data.

System requirements

We can provide the POV either as a Cloud-based service or via an appliance behind your firewall. A typical data set for the POV is between 20 GB and 50 GB. The specification below provides the optimum host machine configuration for an appliance-based POV, showing how the platform is associated with the target repositories and limitations as to the target environment.

Hardware

3 servers: 2 x 6 core Xeon, 96 GB RAM; and

1 Gb Ethernet.

Data assumptions

About 30-40% of the data are non-records.

System configuration

index nodes = 30

DB nodes = 3

crawlers/pre-processors = 30

classifiers = 30

The platform scales linearly, so halving the processing time, for example from 120 hours to 60, would require 6 servers rather than 3.

Speed of operation

The platform can support large (>50 TB) data sets and processing speeds from 1 MB/sec to 25 MB/sec or faster, depending upon hardware and type of data.

Throughput scales linearly with hardware. For example, the hardware configuration below will process 10 TB in 120 hours:

3 servers: 2 x 6 core Xeon, 96 GB RAM

1 Gb Ethernet
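
The figures above can be checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 1,000,000 MB) and that throughput is exactly proportional to server count; the function names are illustrative only.

```python
def throughput_mb_per_sec(data_tb, hours):
    """Average throughput needed to process `data_tb` terabytes in `hours` hours."""
    megabytes = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB (assumption)
    seconds = hours * 3600
    return megabytes / seconds


def hours_with_servers(baseline_hours, baseline_servers, servers):
    """Processing time for the same data under the linear-scaling assumption."""
    return baseline_hours * baseline_servers / servers


baseline_rate = throughput_mb_per_sec(10, 120)  # 3 servers, 10 TB, 120 hours
doubled_hours = hours_with_servers(120, 3, 6)   # same data set on 6 servers

print(f"{baseline_rate:.1f} MB/sec across 3 servers")
print(f"{doubled_hours:.0f} hours on 6 servers")
```

The implied rate of roughly 23 MB/sec sits near the upper end of the speed range quoted above, and doubling the servers to 6 gives the 60-hour figure mentioned under System configuration.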

Or 1,000,000 pages could be processed using:

3 processing servers: 2 x 6 core Xeon, 64 GB RAM

1 network attached storage: 10 TB, 4 core, 24 GB RAM

Platform Capabilities

The tables below highlight some of the capabilities of the Data Forge platform and how they may be applied to varying business processes and challenges within any business.

Subject Matter Expertise

Document Coding

Capability: Description

Tagging/Coding Documents: Input defining content affected by preservation holds; use of Fuzzy Pattern Matching Framework for required data points extraction.

Predictive Coding, Custom Control Sets: Use of machine learning to code data by applying matter-specific control sets.

Custom Training Model Development: Build relevant custom document exemplar-based training models based on specific client requirements using in-house SME.

Turnkey Service Delivery: Provision of certified labour resources (engagement managers, project management and data analysts) required to deliver classification results to client-desired quality level.

Pre-Built Knowledge Models: Pre-built models to auto-classify data "out of the box", sorting the data based on business function, security, product development, audit and fraud categories.

Data Processing


File Types Identification and Text Extraction: 500 unstructured data file types using Oracle Software Development Kits (SDKs); Notes and Exchange email; SharePoint.

Culling: Trash file identification and de-NIST'ing using Oracle SDKs or equivalent.

Email Processing: Thread detection; each message in the thread is classified separately against the model; the median score of the thread's messages and the median score of all attachments are calculated, and the maximum median score is taken as the category of the thread.

Distributed Architecture: Grid architecture for processing large data volumes (100s of TB/PB); capacity is determined by hardware (see specifications).

OCR Engine: For processing TIFF and PDF images to create text files for classification/legal hold (coding). Fully integrated solution to process scanned images: image pre-processing, OCR, post-processing.

OCR Text QC Filter: Filter on the amount of text and the presence of garbage text, to separate poor OCR output from higher quality files.

Native File Viewing: Using Oracle SDKs or equivalent.

Clustering and classification of scanned images, and Logical Boundary Determination of scanned multi-document images (PDF, TIFF): Scanned multi-document images: clustering (visual, text based), classification (visual, text based); data points extraction.
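
The Email Processing entry above describes a median-based rule for assigning a category to a thread. The sketch below is one possible reading of that rule, not the Data Forge implementation: it assumes the classifier returns a score per category for each message and attachment, takes the per-category median over messages and over attachments, keeps the larger of the two, and picks the category with the highest kept value. The function name, input shape and sample scores are all hypothetical.

```python
from statistics import median


def thread_category(message_scores, attachment_scores):
    """Pick a category for an email thread from per-item classifier scores.

    Each argument is a list of dicts mapping category name -> score, one dict
    per message or attachment (an assumed shape). Returns (category, score),
    or (None, -inf) when there is nothing to score.
    """
    categories = set()
    for scores in message_scores + attachment_scores:
        categories.update(scores)

    best_category, best_score = None, float("-inf")
    for category in sorted(categories):  # sorted for deterministic tie-breaking
        medians = []
        for group in (message_scores, attachment_scores):
            values = [scores.get(category, 0.0) for scores in group]
            if values:
                medians.append(median(values))
        score = max(medians)  # the larger of the message and attachment medians
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


# Hypothetical scores: the messages look like invoices, the attachments like contracts.
messages = [
    {"contract": 0.2, "invoice": 0.7},
    {"contract": 0.3, "invoice": 0.6},
    {"contract": 0.1, "invoice": 0.8},
]
attachments = [
    {"contract": 0.9, "invoice": 0.1},
    {"contract": 0.8, "invoice": 0.2},
]
category, score = thread_category(messages, attachments)
print(category, round(score, 2))
```

Using the median rather than the mean keeps one unusual message from dominating the thread, and taking the maximum lets strongly categorised attachments decide the outcome when the message bodies are ambiguous.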

Analytics


Duplicate Detection: Using SHA-1 hash or equivalent.

Near-Duplicate Detection: Detecting document versions (image, text) and comparison of colour-coded text differences (similar to the Delta View process) between selected text documents.

Data Profiling: Ability to search, filter and facet results by file type, extension, domain, path, date, or full text search; clustering can also be applied to the search results.

Modelling of Data (what approach): Supervised learning/example-based training for auto-classification into deep multi-purpose category hierarchies.

Information Extraction: Fuzzy Pattern Matching Framework: context-based fuzzy pattern matching rules combined with a set of dictionaries (gazetteers) for:

- Named Entity extraction (Persons, Companies, Addresses/Locations)

- Context-based information extraction and tables support

- Identification of PII.

Clustering: Clustering of search results prior to getting into rules and queries. Sorting of the data into logical pools with semantic nearness; machine-generated labels for clusters; ability to facet clusters.

Email Thread Detection and Classification: Perform analytics on the email based on data point extractions; correlation and cross-reference (QC and reporting functions).

Search Engine: Search traditionally or by facets, with in-context query completion.

Saved Queries: Filter for specific key words/phrases which can be saved and used as an additional facet for data review, or used as rules for classification (selected via drop-down menu).

Extractive Summaries: Machine-generated list of the most important sentences and key phrases, requiring no user input.
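
Near-duplicate detection and the comparison of text differences between versions can be illustrated with Python's standard `difflib` module. This is a stand-in technique chosen for the example, not the method Data Forge uses; the 0.8 similarity threshold, the function names and the sample sentences are assumptions.

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Treat two texts as versions of one document if they are similar enough.

    The default threshold is an arbitrary illustrative value.
    """
    return similarity(text_a, text_b) >= threshold


def word_differences(text_a, text_b):
    """List (operation, old_words, new_words) for every changed region."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical pair of document versions differing in one term.
version_1 = "Payment is due within 30 days of the invoice date"
version_2 = "Payment is due within 45 days of the invoice date"
changes = word_differences(version_1, version_2)
print(is_near_duplicate(version_1, version_2), changes)
```

Each returned tuple marks a region that a review interface could colour-code as inserted, deleted or replaced text.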

System Training and Quality Assurance


Random Sample Generator: QA process for filtering data and retraining the system.

Iterative Learning Environment: Presentation of sampled data and drag/drop retraining; ability to filter samples with facets and saved queries or ad hoc searches.

Discriminative Measure: System feedback on the discriminative gain associated with a potential training candidate: "Is it worth training on this document?" (Green = add, this has value. Red = already duplicated, no value for the training set.)

Novelty Detection: Important feature in machine learning environments to detect novel documents and treat them as such, thus reducing the number of false positives during classification.

Faceted Data Review: Separation of data by classification category, file type, age, saved queries, ad hoc queries.

“More Like This”: Retrieval of data with similarity to source; a feature of Solr.

5-Fold Cross Validation: Automated performance validation using control sets, measuring precision and recall and calculating the F-Score.

Accuracy Level Attainment: Ability to provide client-defined accuracy levels, using system tools and statistically valid protocols; audit trail proof of attainment.

Multi-Value Tagging and Classification: Supports multi-value tagging and an indefinite number of classification models, including manual assignment to more than one category.

Workflow flexibility: Determine the landscape of the data at the outset of the case or classification process; spread of categories and time to ramp-up (training) generally depend on hardware, composition of data volume (email: longer, OCR: longer, native files: quicker), and how deep the file plan is.

Manual Assignment: As an outcome of a saved query, the search results can be manually assigned to certain business groups without training the system, or included for use as training exemplars.
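
The 5-Fold Cross Validation entry above names precision, recall and the F-Score. The sketch below shows how those measures are conventionally computed over five folds, using only plain Python. The keyword-based "model" is a deliberately trivial stand-in, and every function name and sample document is hypothetical rather than taken from Data Forge.

```python
def precision_recall_f1(actual, predicted, positive):
    """Precision, recall and F1 for one class, given parallel label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(items, labels, train_fn, positive, k=5):
    """Average precision, recall and F1 over k folds.

    train_fn(train_items, train_labels) must return a predict(item) callable.
    """
    totals = [0.0, 0.0, 0.0]
    for train, test in k_fold_indices(len(items), k):
        predict = train_fn([items[i] for i in train], [labels[i] for i in train])
        predicted = [predict(items[i]) for i in test]
        scores = precision_recall_f1([labels[i] for i in test], predicted, positive)
        totals = [t + s for t, s in zip(totals, scores)]
    return tuple(t / k for t in totals)


def train_keyword_rule(train_items, train_labels):
    """Stand-in model: remember words seen only in 'invoice' training documents."""
    invoice_words, other_words = set(), set()
    for text, label in zip(train_items, train_labels):
        target = invoice_words if label == "invoice" else other_words
        target.update(text.lower().split())
    markers = invoice_words - other_words
    return lambda text: "invoice" if markers & set(text.lower().split()) else "other"


# Hypothetical control set: five invoices interleaved with five meeting notes.
items, labels = [], []
for n in range(5):
    items.append(f"invoice {n} total due")
    labels.append("invoice")
    items.append(f"meeting {n} notes")
    labels.append("other")

averages = cross_validate(items, labels, train_keyword_rule, positive="invoice")
print(averages)
```

Each document is used for testing exactly once and for training in the other four folds, so the averaged scores are not inflated by testing on training data.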

Data Management


Preservation Filtering for Disposition Eligibility: Isolating content affected by one or more holds.

Disposition Eligibility: Calculation of eligibility based on the oldest of file creation date, file modified date or embedded document date, and a time-based retention rule; caveat that a hot document or extremely sensitive document may be of value for training regardless of disposition.

Duplicate and Near Duplicate Management: Identify opportunities to cleanse data of these duplicates: reporting of items; file path locations; version distance measurement and correlation; best performed within an extracted text environment.

Master Database Creation: Ability to aggregate and ingest content from multiple repositories with its associated metadata, and to perform cross-queries and correlation across the multiple repositories. Ability to support engineering associated with supporting or developing APIs into other databases.

Scaling: Ability to support large (>50 TB) data sets and processing speeds from 1 MB/sec to 25 MB/sec or faster, depending upon hardware and type of data.

Reporting Capabilities: Inclusive of file statistics details, duplicate and near-duplicate reports, classification, PII, custom coding, and other custom reports.

Security Protocols: Ability to function behind the firewall or in the cloud, meeting client requirements for dedicated hardware, access protocols and other security requirements.
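
The Disposition Eligibility calculation described above can be sketched as a date comparison. This is a minimal illustration under stated assumptions: the retention rule is expressed in whole years, a document becomes eligible on the day its retention period expires, and content under a preservation hold is never eligible. The function name, parameters and sample dates are hypothetical.

```python
from datetime import date


def disposition_eligible(created, modified, embedded, retention_years, today, on_hold=False):
    """Return True if the document has outlived its retention period.

    The reference date is the oldest of the available dates; `embedded` may be
    None when no date was found inside the document.
    """
    if on_hold:  # preservation holds override the retention rule (assumption)
        return False
    candidates = [d for d in (created, modified, embedded) if d is not None]
    reference = min(candidates)
    try:
        expiry = reference.replace(year=reference.year + retention_years)
    except ValueError:  # 29 February with a non-leap expiry year
        expiry = reference.replace(year=reference.year + retention_years, day=28)
    return today >= expiry


# Hypothetical document with a 7-year retention rule.
created = date(2010, 3, 1)
modified = date(2012, 6, 15)
embedded = date(2009, 11, 20)
today = date(2016, 12, 8)

eligible = disposition_eligible(created, modified, embedded, 7, today)
held = disposition_eligible(created, modified, embedded, 7, today, on_hold=True)
print(eligible, held)
```

Here the embedded document date is the oldest of the three, so the retention period is counted from it rather than from the file system dates.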

Custom Solutions


Poor Quality Documents: Poor quality documents with OCR text that is not searchable are resolved using a soft-dictionary approach to identify and extract titles, where document titles are used to classify and index the documents.

What's in the Box: Identifying relevant boxes, and folders within the boxes, for scanning and coding based on their short descriptions and the provided title taxonomy.

PDF Splitting: Reconstructing document collections using automated logical breaks and classification.

Copyright 2014 Stratas Business Solutions LLP Monday, 08 December 2014

Company Proprietary & Confidential 1 Non-Disclosure & Teaming Agreement
