Semantic Infrastructure Workshop Development
Tom Reamy, Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
Text Analytics – Foundation
– Features and Capabilities
Evaluation of Text Analytics
– Start with Self-Knowledge
– Features and Capabilities
– Filter, Proof of Concept / Pilot
Text Analytics Development
– Progressive Refinement
– Categorization, Extraction, Sentiment
– Case Studies
– Best Practices
Semantic Infrastructure – Foundation: Text Analytics Features
Noun Phrase Extraction
– Catalogs with variants, rule-based dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
Summarization
– Customizable rules, map to different content
Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
Sentiment Analysis
– Rules – objects and phrases – positive and negative
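The rule-based sentiment described above (objects and phrases scored positive or negative) can be sketched minimally as polarity terms matched near an object phrase. The term lists, window size, and function here are illustrative assumptions, not any vendor's implementation.

```python
# Minimal sketch of rule-based sentiment: polarity terms scored against
# the object phrase they appear near. All term lists are illustrative.
import re

POSITIVE = {"reliable", "fast", "helpful", "great"}
NEGATIVE = {"dropped", "slow", "rude", "broken"}

def sentiment_for_object(text: str, obj: str, window: int = 5) -> int:
    """Score +1/-1 for each polarity term within `window` words of `obj`."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, w in enumerate(words):
        if w == obj.lower():
            nearby = words[max(0, i - window): i + window + 1]
            score += sum(1 for n in nearby if n in POSITIVE)
            score -= sum(1 for n in nearby if n in NEGATIVE)
    return score

print(sentiment_for_object("The service was slow and calls dropped.", "service"))  # -2
```

Tying sentiment to an object rather than the whole document is what distinguishes the rule-based approach on this slide from simple document-level polarity.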
Semantic Infrastructure – Foundation: Text Analytics Features
Auto-categorization
– Training sets – Bayesian, vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (title, body, URL)
– Semantic network – predefined relationships, sets of rules
– Boolean – full search syntax – AND, OR, NOT
– Advanced – NEAR (#), PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a taxonomy
Combine with extraction
– If any of a list of entities and other words
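A hedged sketch of the Boolean rule style listed above (AND/OR/NOT plus a proximity operator like NEAR). The rule, the category, and all term lists are invented for illustration; real products have much richer syntax.

```python
# Sketch of a Boolean categorization rule with a NEAR-style distance test.
# Rule and vocabulary are illustrative, not from any vendor product.
import re

def words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, a, b, dist):
    """True if terms a and b occur within `dist` words of each other."""
    ws = words(text)
    pos_a = [i for i, w in enumerate(ws) if w == a]
    pos_b = [i for i, w in enumerate(ws) if w == b]
    return any(abs(i - j) <= dist for i in pos_a for j in pos_b)

def matches_marine_toxins(text):
    """(toxin NEAR/4 marine) AND (microbiology OR algae) NOT freshwater"""
    ws = set(words(text))
    return bool(near(text, "toxin", "marine", 4)
                and ({"microbiology", "algae"} & ws)
                and "freshwater" not in ws)

print(matches_marine_toxins("The marine toxin produced by algae blooms is potent."))  # True
```

The NOT clause and the distance constraint are what push precision up; the OR list of related terms is what protects recall — the same trade-off the POC scoring below measures.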
Evaluating Text Analytics Software: Start with Self-Knowledge
Strategic and Business Context
– Info problems – what, how severe
– Strategic questions – why, what value from the taxonomy/text analytics, how are you going to use it
Formal process – KA audit – content, users, technology, business and information behaviors, applications
– Or informal for smaller organizations
Text Analytics Strategy/Model – forms, technology, people
– Existing taxonomic resources, software
Need this foundation to evaluate and to develop
Evaluating Text Analytics Software: Start with Self-Knowledge
Do you need it – and what blend if so?
Taxonomy Management – stand-alone?
– Multiple taxonomies, languages, authors-editors
Technology Environment – ECM, enterprise search – where is it embedded
Publishing Process – where and how is metadata being added – now and projected future
– Can it utilize auto-categorization, entity extraction, summarization?
Is the current search adequate – can it utilize text analytics?
Applications – text mining, BI, CI, alerts?
Evaluating Text Analytics Software: Team – Interdisciplinary
IT – large software purchase, needs assessment
• Text analytics is different – semantics
• Like a construction company designing your house
Business – understands the business needs
• Doesn't understand information
• Like a restaurant owner doing the cooking
Library – knows information, search
• Doesn't understand the business; non-information experts
• Like an accountant doing financial strategy
Team – combination of consulting and internal
Semantic Infrastructure – Foundation: Design of the Text Analytics Selection Team
Interdisciplinary team, led by information professionals
– IT – software experience, budget, support tests
– Business – understand business and requirements
– Library – understand information structure, search semantics and functionality
Much more likely to make a good decision
– This is not a traditional IT software evaluation – semantics
Create the foundation for implementation
Evaluating Text Analytics Software: Evaluation Process & Methodology – Two Phases
Phase I – Traditional software evaluation
– Filter One – ask experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
– Filter Two – feature scorecard – minimum, must-have; filter to top 3
– Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Four – in-depth demo – 3-6 vendors
Phase II – Deep POC (2 vendors) – advanced, integration, semantics
Evaluating Text Analytics Software: Phase II – Proof of Concept (POC)
4-6 week POC – bake-off, or short pilot
Measurable quality of results is the essential factor
Real-life scenarios, categorization with your content
2-3 rounds of development, test, refine / not out-of-the-box
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor-unique capabilities – have to determine at POC time
Taxonomy developers – expert consultants plus internal taxonomists
Evaluating Text Analytics Software: Phase II – POC: Range of Evaluations
Basic question – can this stuff work at all?
Auto-categorization to existing taxonomy – variety of content
– Essential issue is complexity of language
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – design which ones based on projected needs – example: privacy info (SS#, phone, etc.)
Entity examples – people, organizations, methods, etc.
– Essential issue is scale and disambiguation
Evaluate usability in action by taxonomists
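The privacy-info catalog example above (SS#, phone) can be sketched as a small regex-based extractor. The patterns and names here are simplified illustrations, not a product's catalog format; production catalogs also handle variants and context.

```python
# Illustrative entity "catalog" for privacy info, using regular expressions.
# Patterns are deliberately simplified (US-style formats only).
import re

CATALOG = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return every catalog match found in the text, keyed by entity type."""
    return {etype: pat.findall(text) for etype, pat in CATALOG.items()}

doc = "Call 555-867-5309; SSN on file is 123-45-6789."
print(extract_entities(doc))
```

The disambiguation issue the slide names shows up even here: phone and SSN formats differ only in digit grouping, which is why scale and ambiguity dominate entity-extraction evaluation.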
Text Analytics Evaluation: Case Study – Self-Knowledge
Platform – range of capabilities
– Categorization, sentiment analysis, etc.
Technical
– APIs, Java-based, Linux run time
– Scalability – millions of documents a day
– Import-export – XML, RDF
Total cost of ownership
Vendor relationship – OEM
Usability, multiple-language support
Team – 3 KAPS (information), 5-8 Amdocs (SME – business, technical)
Text Analytics Evaluation: Case Study – Phase I
– Attensity
– SAP – Inxight
– Clarabridge
– ClearForest
– Concept Searching
– Data Harmony / Access Innovations
– Expert Systems
– GATE (open source)
– IBM
– Lexalytics
– Multi-Tes
– Nstein
– SAS
– SchemaLogic
– Smart Logic
– Content Management – Enterprise Search
– Sentiment Analysis Specialty
– Ontology Platforms
Text Analytics Evaluation: Case Study – Telecom Service
Criteria:
– Company history, reputation
– Full platform – categorization, extraction, sentiment
– Integration – Java, API-SDK, Linux
– Multiple languages
– Scale – millions of docs a day
– Total cost of ownership
– Ease of development – new
– Vendor relationship – OEM, etc.
Short list: Expert Systems, IBM, SAS – Teragram, Smart Logic
Option – multiple vendors – sentiment & platform
IBM and SAS – finalists
Text Analytics Evaluation: Case Study – POC Design Discussion: Evaluation Criteria
Basic test design – categorize a test set
– Score – by file name, human testers
Categorization – call motivation
– Accuracy level – 80-90%
– Effort level per accuracy level
Sentiment analysis
– Accuracy level – 80-90%
– Effort level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how to score?
– Combination of scores and report
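The scoring step above ("score by file name" against human testers) amounts to computing per-category precision and recall: each test file carries a human-assigned category that is compared with the engine's assignment. This is a generic sketch with invented data, not the actual POC harness.

```python
# Per-category precision/recall over a labeled test set.
# `gold` holds human labels keyed by file name; `predicted` holds the
# engine's output. Data below is invented for illustration.

def precision_recall(gold: dict, predicted: dict, category: str):
    """Precision and recall for one category over a set of test documents."""
    relevant  = {d for d, c in gold.items() if c == category}
    retrieved = {d for d, c in predicted.items() if c == category}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall    = hits / len(relevant) if relevant else 0.0
    return precision, recall

gold      = {"call01": "billing", "call02": "billing", "call03": "outage"}
predicted = {"call01": "billing", "call02": "outage",  "call03": "outage"}
p, r = precision_recall(gold, predicted, "billing")
print(p, r)  # 1.0 0.5
```

Scoring each category separately is what lets a POC report rows like "Recall – Motivation" and "Precision – Actions" rather than a single accuracy number.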
Text Analytics Evaluation: Case Study – Phase II – POC: Risks
CIO/CTO problem – this is not a regular software process
Language is messy, not just complex
– 30% accuracy isn't 30% done – it could be 90%
Variability of human categorization / expression
– Even professional writers – journalist examples
Categorization is iterative, not "the program works"
– Need a realistic budget and a flexible project plan
"Anyone can do categorization"
– Librarians often overdo it, SMEs often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language
Text Analytics POC Outcomes: Categorization Results

                         SAS     IBM
Recall – Motivation      92.6    90.7
Recall – Actions         93.8    88.3
Precision – Motivation   84.3
Precision – Actions      100
Uncategorized            87.5
Raw Precision            73      46
Text Analytics POC Outcomes: Vendor Comparisons
Categorization results – both good, edge to SAS on precision
– Use of relevancy to set thresholds
Development environment
– IBM as toolkit provides more flexibility, but it also increases development effort
Methodology – IBM enforces a good method, but takes more time
– SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START
Text Analytics POC Outcomes: Vendor Comparisons – Functionality
Sentiment analysis
– SAS has a workbench; IBM would require more development
– SAS also has statistical modeling capabilities
Entity and fact extraction – seems basically the same
– SAS can use operators for improved disambiguation
Summarization – SAS has it built in
– IBM could develop it using categorization rules – but it is not clear that would be as effective without operators
Conclusion: both can do the job, edge to SAS
Now the fun begins – development
Text Analytics Development: Foundation
Articulated information management strategy (K map)
– Content, structures, and metadata
– Search, ECM, applications – and how they are used in the enterprise
– Community information needs and the text analytics team
POC establishes the preliminary foundation
– Need to expand and deepen
– Content – full range, basis for rules/training
– Additional SMEs – content selection, refinement
Taxonomy – starting point for categorization / is it suitable?
Databases – starting point for entity catalogs
Text Analytics Development: Enterprise Environment – Case Studies
A Tale of Two Taxonomies – it was the best of times, it was the worst of times
Basic approach
– Initial meetings – project planning
– High-level K map – content, people, technology
– Contextual and information interviews
– Content analysis
– Draft taxonomy – validation interviews, refine
– Integration and governance plans
Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins
Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Facilities > Division > Location > Building X
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
– Content Type > Knowledge Asset > Proposals
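One way the facet paths above might be held as data: each facet value is a ">"-delimited path from facet root to leaf, and a document carries one or more values per facet. The structure is an illustrative assumption, not the project's actual schema.

```python
# Illustrative representation of faceted taxonomy tags on one document.
# Paths are taken from the slide above; the dict layout is an assumption.

facets = {
    "Organization": ["Organization > Division > Group"],
    "Clients": ["Clients > Federal > EPA"],
    "Instruments": ["Instruments > Environmental Testing > Ocean Analysis > Vehicle"],
}

def leaf(path: str) -> str:
    """Return the most specific term of a facet path."""
    return path.split(">")[-1].strip()

doc_tags = {facet: [leaf(p) for p in paths] for facet, paths in facets.items()}
print(doc_tags["Clients"])  # ['EPA']
```

Keeping the full path (not just the leaf) is what lets search broaden from "EPA" to all federal clients without re-tagging content.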
Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
Project owner – KM department – included RM, business process
Involvement of library – critical
Realistic budget, flexible project plan
Successful interviews – build on context
– Overall information strategy – where the taxonomy fits
Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets
Final plans and hand-off to client
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Taxonomy of Subjects / Disciplines:
– Geology > Petrology
Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Environment issues
– Value of taxonomy understood, but not the complexity and scope
– Under budget, understaffed
– Location – not KM – tied to RM and software
• A solution looking for the right problem
– Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Project issues
– Project mindset – not infrastructure
– Wrong kind of project management
• Special needs of a taxonomy project
• Importance of integration – with team, company
– Project plan more important than results
• Rushing to meet deadlines doesn't work with semantics as well as it does with software
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Research issues
– Not enough research – and the wrong people
– Interference of non-taxonomy concerns – communication
– Misunderstanding of research – wanted tinker-toy connections
• Interview 1 implies conclusion A
Design issues
– Not enough facets
– Wrong set of facets – business, not information
– Ill-defined facets – too complex an internal structure
Text Analytics Development: Conclusion – Risk Factors
Political-cultural-semantic environment
– Not simple resistance – more subtle
• Re-interpretation of specific conclusions and the sequence of conclusions / relative importance of specific recommendations
Understanding project scope
Access to content and people
– Enthusiastic access
Importance of a unified project team
– Working communication as well as weekly meetings
Text Analytics Development: Case Study 2 – POC – Telecom Client
Demo of SAS Enterprise Content Categorization
Text Analytics Development: Best Practices – Principles
Importance of ongoing maintenance and refinement
Need a dedicated taxonomy team working with SMEs
Work with application developers to incorporate text analytics into new applications
Importance of metrics and feedback
– Software and social
Questions:
– What are the important subjects (and changes)?
– What information do they need?
– How is their information related to other silos?
Text Analytics Development: Best Practices – Principles
Process
– Realistic budget – not a nice-to-have add-on
– Flexible project plan – semantics are complex and messy
• Time estimates are difficult; objective success measures are too
– Transition from development to maintenance is fluid
Resources
– Interdisciplinary team is essential
– Importance of communication – languages
– Merging internal and external expertise
Text Analytics Development: Best Practices – Principles
Categorization taxonomy structure
– Trade-off of depth and complexity of rules
– Multiple avenues – facets, terms, rules, etc.
• No single right balance
– Recall-precision balance is application specific
– Training sets as starting points; rules rule
– Need for custom development
Technology
– Basic integration – XML
– Advanced – combine unstructured and structured in new ways
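The "basic integration – XML" point above can be illustrated by serializing categorization results as XML metadata for downstream systems (search, ECM). The element names here are assumptions for illustration, not a standard schema.

```python
# Sketch: categorization output serialized as XML metadata.
# Element and attribute names are illustrative, not a standard schema.
import xml.etree.ElementTree as ET

def to_xml(doc_id: str, categories: list[str]) -> str:
    """Wrap a document's assigned categories in a small XML record."""
    root = ET.Element("document", id=doc_id)
    cats = ET.SubElement(root, "categories")
    for c in categories:
        ET.SubElement(cats, "category").text = c
    return ET.tostring(root, encoding="unicode")

print(to_xml("call01", ["Billing", "Outage"]))
```

A search engine or ECM repository can ingest records like this as metadata fields, which is the "basic" integration; the "advanced" path combines these fields with structured database records.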
Text Analytics Development: Best Practices – Risk Factors
Value understood, but not the complexity and scope
Project mindset – software project and then done
Not enough research on user information needs and behaviors
– Talking to the right people and asking the right questions
– Getting beyond "All of the Above" surveys
Not enough resources, or the wrong resources
Enthusiastic access to content and people
Bad design – starting with the wrong type of taxonomy
Categorization is not library science
– More like cognitive anthropology
Semantic Infrastructure Development: Conclusion
Text analytics is the foundation for semantic infrastructure
Evaluation of text analytics – different than IT software
– POC – essential, the foundation of development
– Difference of taxonomy and categorization
• Concepts vs. text in documents
Enterprise context – strategic, self-knowledge
– Infrastructure resource, not a project
– Interdisciplinary team and applications
Integration with other initiatives and technologies
– Text mining, data mining, sentiment & beyond, everything!
Questions?
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com