Post on 29-Mar-2015
© 2007 IBM Corporation
The Challenges of Building Enterprise Content Taxonomies and the Role of Classification Technologies in Maintaining Their Effectiveness
Reginald J. Twigg, Ph.D. (rtwigg@us.ibm.com)
Capture, Classification and Taxonomy, IBM ECM
Information Management Software | Enterprise Content Management
Agenda
The Challenge of Unstructured Content
Key Concepts and Terms
Taxonomy, Classification and ECM Adoption
Classification Technologies for ECM
The Challenge of Managing Unstructured Content
80% of Enterprise Data is Unstructured
Databases
• Billing statements
• Claims images
• Customer correspondence
• Mortgage docs
• Contracts
• Signed BOLs
• Healthcare EOBs
• Marketing collateral
• Website content
• Voice authorizations
• Signature cards
• Credit enrollments
• Material Safety Data Sheets
• ISO 9000 docs
• Plant schematics
• Product images
• Spec sheets
• ...and much more!
What is Enterprise Content?
Where do I start?
Organizing the explosion of unstructured content becomes critical:
“We’ve got 600 GB of content from basic content services all over the enterprise. How can we get this content efficiently mapped into our ECM taxonomy?”
“We’ve been managing our content without classifying it for a few years now. How can our users navigate amongst this existing content in a way that’s intuitive for our business?”
“The lawyers have to review 400,000 electronic documents for their case. How can we make sure they don’t waste their time?”
Key Business Drivers
Business Value of Classification for ECM
In Process Classification: Increase worker productivity and automate content related decisions
– Ad Hoc Category Suggestion
– Content-Based Workflow Selection
– Content Based Decision Making
ECM Taxonomy and Classification: Increase accessibility of content under management
– Automated, High Scale Classification
– Classify at ingestion and/or re-classify over time
– Taxonomy Evolution Tools
– Enhanced Accessibility
– Taxonomy Proposer
Compliance, Records, Legal Discovery: Increase legal discovery review effectiveness while reducing risk
– Legal Discovery Prioritization and Workflow Assignment
– Records Classification and Exception Handling
– Storage and Retention Policy Assignment
Message Tagging, Classification and Monitoring: Reduce inquiry costs, automate message routing and increase customer satisfaction
– Email, Chat Routing
– Agent Response Suggestion
– Email Supervision and Monitoring
– Automatic Customer Response
Ability to Structure Content with Databases
[Figure: percent of corporate information value managed in traditional databases, shown against data creation and demand for structured and unstructured data. Application types range from OLTP and BI (narrow scope) to Compliance and Competitive Intelligence (wide scope). Source: Gartner]
Multiple Repositories Make Access Difficult
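[Chart: number of content repositories per organization. Response categories: 1 repository; 2-5 repositories; 6-10 repositories; 10-15 repositories; more than 15 repositories; don't know. Shares shown: 36%, 14%, 25%, 17%, 5%, 4%.]
Base: 81 North American decision-makers (multiple responses accepted)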
“The Future of Content in the Enterprise,” Connie Moore and Robert Markham
And Then There’s SharePoint, File Shares and . . .
Key Concepts and Terms
Key Concepts
Metadata: a means of describing, locating, cataloging, and activating content as objects in a software ecosystem (literally, data about data).
Enterprise Catalog: a centralized and normalized metadata model for unstructured content for the purposes of providing consistent services across all ECM applications.
Taxonomy: a hierarchical structure of information components, any part of which can be used to classify a content item in relation to other items in the structure.
Classification: a coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy.
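The relationship between the last two terms can be sketched in a few lines of Python. This is only an illustration of the definitions above: the node names ("Legal", "Contracts", and so on) and the document identifier are hypothetical, and a real ECM catalog would carry far more metadata per node.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One component of a hierarchical taxonomy."""
    name: str
    children: list = field(default_factory=list)

    def add(self, name):
        """Attach a child node and return it."""
        child = TaxonomyNode(name)
        self.children.append(child)
        return child

    def path_to(self, name, trail=()):
        """Return the path from this node down to `name`, or None."""
        trail = trail + (self.name,)
        if self.name == name:
            return trail
        for child in self.children:
            found = child.path_to(name, trail)
            if found:
                return found
        return None


# Hypothetical taxonomy: any node can be used to classify an item.
root = TaxonomyNode("Enterprise Content")
legal = root.add("Legal")
legal.add("Contracts")
finance = root.add("Finance")
finance.add("Billing statements")

# Classification: coding an item as a member of a group in the taxonomy.
classification = {"doc-001.pdf": "Contracts"}
path = root.path_to(classification["doc-001.pdf"])
```

Here `path` places the item relative to every other item in the structure, which is what lets a catalog or workflow act on it.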
Taxonomy Is . . .
Not turning animals into trophies
A system for organizing the corpus of business content
Taxonomy and Classification in ECM
Classification Examples:
– Document Classing
– Foldering
Taxonomy Examples:
– Enterprise Content Catalog
– Industry Standard Document Taxonomies (ISO, XMI)
Methods:
– Rules-Based: Applies pre-determined rules for ‘if, then’ classification of text and properties
– Analytics-Based: Applies algorithms to interpret classes in order to apply classification rules to them
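A minimal sketch of the rules-based method, assuming each rule is an 'if' condition over a document's text and properties paired with a 'then' class. The rules, property names and class names below are hypothetical and are not taken from any IBM product.

```python
# Each rule pairs an "if" condition with a "then" document class.
# Conditions see both the text and the properties (metadata) of an item.
RULES = [
    (lambda text, props: props.get("extension") == "msg", "Email"),
    (lambda text, props: "invoice" in text.lower(), "Billing statement"),
    (lambda text, props: "agreement" in text.lower()
        and "signature" in text.lower(), "Contract"),
]


def classify_by_rules(text, props, default="Unclassified"):
    """Return the class of the first rule whose condition holds."""
    for condition, doc_class in RULES:
        if condition(text, props):
            return doc_class
    return default


label = classify_by_rules("Invoice 42 is due on receipt", {"extension": "pdf"})
```

The rules are evaluated in order, so the first match wins. Anything no rule anticipates falls through to the default, which is the gap an analytics-based method is meant to cover.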
ECM Taxonomy Illustrated
Taxonomy, Classification and ECM Adoption
Drive New Business Value from Content
Content Classification Solutions:
– Improve Content Access
– Organize Unstructured Content
– Derive Business Insight
Business Drivers for ECM Taxonomy Management
Proliferating departmental solutions
– Content Management
– Collaboration (SP, Quickr, Team Rooms, Wikis)
User-based classification and high workforce turnover
– Productivity declines as knowledge disappears
– Legal discovery is a secondary concern
Mergers and Acquisitions – need to reconcile disparate content management practices, repositories and processes
Classification is Hard Work
Key Business Challenges: ECM Taxonomy and Classification
Most organizations face content taxonomy pain – especially as they standardize around ECM
– Mapping content to taxonomy during ingestion
– Reclassifying content under management
– Evolving taxonomies as new types of content emerge
– Integrating folksonomies (SharePoint) into a master taxonomy
Increase accessibility of content under management
Automated, High Scale Classification
Classify at ingestion and/or re-classify over time
Taxonomy Evolution Tools
Enhanced Accessibility
Taxonomy Proposer
Organization is the Root Cause
Most organizations face content taxonomy barriers – especially as they standardize around ECM
– Assigning categories en masse
– Reclassifying existing content as taxonomies evolve
– Merging taxonomies
– Integrating the wisdom of folksonomies
Challenges and Impacts of Merging Taxonomies
Misclassification – change is constant, and master taxonomies must manage multiple custom taxonomies for each content source
“Folksonomies” from departmental collaboration solutions are created by users and unmanaged by ECM standards
Impact:
– Unreliable Metadata: inconsistencies lose or mislabel content
– Process Misfires: poor metadata triggers incorrect events and workflows
Scale is the Challenge – Automation is Essential
Lessons Learned From ERP Adoption
Getting Classification Right: ‘Garbage in = garbage out’ is often used in metadata management projects to describe the problem of building a metadata model on inconsistent sources.
Driving Process on Taxonomies: ERP systems depend on three master taxonomies (material, vendor and customer). These taxonomies drive events, workflow definition and the development of transaction-centric business process applications.
Mastering Metadata: The ability to deploy new enterprise applications depends upon the re-usability, scalability and integrity of the metadata model
System of Record is Required for Standardization:
– Establishes an enterprise standard that can be audited
– Forms the foundation for building demonstrable best practices
– Enforces consistency of data capture and output
Customer Lessons for Mastering ECM Taxonomies
‘Master’ taxonomy of record required for
– Compliance
– Business process applications
Merged master taxonomies become large and unwieldy
– Multiple taxonomies require integration and translation
– Centralized, decentralized, or hybrid?
Intelligent Classification increasingly is used to manage:
– Taxonomy merging from multiple use cases
– Taxonomy/folksonomy translation from distributed content sources
A Look at ECM Classification Technologies
State of Classification Management Technologies
ECM Classification/Taxonomy is an emerging discipline
– Industry standard taxonomies:
• Focus on business function or transaction types
• Have not reached the enterprise level
– Classification best practices:
• Content ingestion
• Application development reclassification
Classification software focuses on content ingestion:
– Electronic content (email, Office documents, free-form text)
– Paper content (document images) requires OCR
Search is not enough – must drive value in the business process
Criteria For ECM Classification Management Solutions
Integrate with and support the ECM metadata model
Interpret a highly-federated content ecosystem
Go beyond search to catalog and manage content
Build on advanced analytic technologies – rules alone are not enough
– Interpret content to extract meaningful (meta)data
– Employ multiple methods (engines) for classification
– Integrate teaching/learning
Common Platform for Electronic Content Classification
Classification Platform:
– Email Queue Classification and Monitoring
– In Process Classification
– ECM Taxonomy and Classification
– Compliance, Records, Legal Discovery
IBM Classification Module for Electronic Content
Organize your ECM content
Automated classification and filtering
Combines text analytics understanding with rules
Acquires domain specificity from your own content
Unique learning technology for adaptive classification
Suggests new categories or even seeds an entirely new taxonomy
Rectifies conflicting taxonomies
Market proven, scalable platform
Understanding Content with Text Analytics
[Figure: a Classification Engine is trained (Teach) on a categorized corpus, with Audit and Feedback. Matching returns a categories list and relevancies (scores).]
Example categorized corpus:
– A: “IP is essential”
– A: “Legal is currently requiring full approval”
– B: “Engineering requires clear requirements”
– C: “The core market for this new product has been defined as such by IBM”
– C: “Strategy is important to the marketing team”
Example input: “The strategic value of this market is paramount to IBM”
Example result: C: 97%, B: 54%, A: 12%
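The train-then-match loop in this figure can be illustrated with a deliberately simple stand-in: a bag-of-words profile per category, scored by cosine similarity. This is not the engine's actual algorithm, which the slides do not describe, so the scores it produces will not reproduce the percentages on the slide. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(corpus):
    """Build one word-count profile per category from (category, text) pairs."""
    profiles = {}
    for category, text in corpus:
        profiles.setdefault(category, Counter()).update(tokens(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(profiles, text):
    """Return (category, relevancy) pairs, most relevant first."""
    query = Counter(tokens(text))
    scores = [(cat, cosine(query, prof)) for cat, prof in profiles.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Categorized corpus taken from the slide's example.
corpus = [
    ("A", "IP is essential"),
    ("A", "Legal is currently requiring full approval"),
    ("B", "Engineering requires clear requirements"),
    ("C", "The core market for this new product has been defined as such by IBM"),
    ("C", "Strategy is important to the marketing team"),
]
profiles = train(corpus)
ranked = classify(profiles, "The strategic value of this market is paramount to IBM")
```

Feedback in this sketch would amount to appending corrected (category, text) pairs to the corpus and calling `train` again, which is how the profiles would acquire domain specificity from an organization's own content.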
Classification Workflow: Accelerating Content Organization
[Figure: content sources shown are File System, Basic Content Services and Existing Unclassified Managed Content. Components shown are the Classifier and the Classification Review Tool. Steps shown: filter out documents; automatically categorize majority of content; send to taxonomy proposer.]
Reference: Integration Components
– Classifier (Runtime Application)
– Classification Review (UI)
– Taxonomy Proposer (UI)
– Content Extractor (training based on P8)
Components of the Solution for Text Classification
Classifier
– Automatically classifies and filters out documents
– Moves some documents for manual review
Classification Review Tool
– Allows user to manually review documents
Content Extractor
– Extracts content from the ECM system for training
Taxonomy Proposer
– User workflow to identify and name new categories or apply existing taxonomy from P8
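One way to picture how the Classifier splits its work between automatic classification, filtering and manual review is a pair of score thresholds. The slides do not say how the product makes this decision, so the thresholds, the function name and the outcome labels below are all assumptions made for illustration.

```python
def route(scores, accept=0.80, reject=0.20):
    """Decide what happens to a document given its category relevancies.

    scores: dict mapping category name to a relevancy between 0 and 1.
    accept, reject: hypothetical thresholds, not product defaults.
    Returns (outcome, category) where outcome is "auto", "review" or "filtered".
    """
    if not scores:
        return ("review", None)
    category, best = max(scores.items(), key=lambda kv: kv[1])
    if best >= accept:
        return ("auto", category)      # classified without human involvement
    if best < reject:
        return ("filtered", None)      # matches nothing in the taxonomy
    return ("review", category)        # suggestion passed to the review tool


decision = route({"Contracts": 0.93, "Claims": 0.10})
```

Documents that keep landing in "review" or "filtered" are natural candidates for a taxonomy proposer, since they suggest categories the current taxonomy lacks.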
Classification for Paper Documents
Classification of paper documents occurs in capture process
Use cases for paper document classification
– Recognition using OCR/ICR
– Classification to associate to folders or doc class
– Separation to reduce costs and improve process
Three Primary Types of Images – The Document Recognition Problem
[Figure: Structured, Semi-Structured and Un-Structured images, placed on a scale from Less Advanced to More Advanced]
The Document Separation Problem in Image Capture
Separation of documents is a significant expense for a high-volume capture system
– Typical ‘structured’ recognition technologies are not applicable
– Manual insertion of separator sheets is the primary workaround today
– 50% of document preparation labor is spent sorting documents and inserting separator pages (source: TAWPI)
Where does one document stop and the next begin?
Here? Here? Here? Here?
Classification Methods for Paper Content (Images)
Image Classification
– based on the overall layout and structure of a document
– includes lines, boxes, logos and placement of text
Text Classification
– based on detailed analysis of the text content of a page
Rules-Based Classification
– performed by searching for specific data or keywords
– independent of layout
Templated Classification
– determined by the presence of one or more marks, barcodes or items of text in pre-defined locations
Waterfall Approach to Classification and Separation
Two-pass system:
1st pass: Classification
– optimizes performance by using fastest classification techniques first
– Advanced Text Classification final “catch-all”
[Figure: pages 1 to 8 are run through Image Classification, Rules Based, Text Classification and Barcode Recognition. Technique timings shown are 1 ms, 20 ms, 200 ms and 1000 ms. Pages are labeled First, Middle or Last of Form X, Form Y or Form Z.]
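The first pass can be sketched as a cascade that tries the cheapest technique first and stops at the first answer. The pairing of techniques with the 1, 20, 200 and 1000 ms timings is an assumption, since the figure does not survive well enough to confirm it, and the stage functions are hypothetical placeholders for real recognition engines.

```python
def by_barcode(page):
    """Hypothetical barcode stage: the page either carries a form code or not."""
    return page.get("barcode")


def by_layout(page):
    """Hypothetical image stage: map a known layout to a form type."""
    return {"grid": "Form X"}.get(page.get("layout"))


def by_rules(page):
    """Hypothetical rules stage: look for a keyword, independent of layout."""
    return "Form Y" if "claim number" in page.get("text", "").lower() else None


def by_text(page):
    """Hypothetical text stage: the catch-all for any page with text."""
    return "Form Z" if page.get("text") else None


# (name, assumed cost in ms, classifier); listed unsorted on purpose.
STAGES = [
    ("text", 1000, by_text),
    ("barcode", 1, by_barcode),
    ("image", 20, by_layout),
    ("rules", 200, by_rules),
]


def waterfall(page, stages=STAGES):
    """Run stages from fastest to slowest; stop at the first label found."""
    elapsed = 0
    for name, cost_ms, classify in sorted(stages, key=lambda s: s[1]):
        elapsed += cost_ms
        label = classify(page)
        if label is not None:
            return label, name, elapsed
    return None, None, elapsed


fast = waterfall({"barcode": "Form X"})
slow = waterfall({"text": "Dear Sir or Madam, ..."})
```

A page with a barcode costs 1 ms in this model, while a free-form letter pays for every stage before the text classifier catches it. That difference is the reason to order stages by speed.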
Why Invest in Automated Classification?
Accelerate the time to value in your investment in ECM
Free up your subject matter experts
Ensure more accurate content catalogs
Make your content easier to find and leverage
Summary
1. Accelerate ECM Standardization
Poor content classification undermines ECM value – maximize your ECM potential and time-to-value with automated classification
2. Automating Classification Always Pays
Typical employees spend 10 hours/week searching for information – slash that time and increase productivity
3. Classification Technologies Automate Classification to Drive Development of Best Practices
IBM Classification Module for IBM FileNet P8
Automatically organizing your content by understanding it
Contact Reggie Twigg (rtwigg@us.ibm.com) for more information or to arrange a demonstration