Semantic Infrastructure Workshop Development
Tom Reamy, Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
Text Analytics – Foundation
– Features and Capabilities
Evaluation of Text Analytics
– Start with Self-Knowledge
– Features and Capabilities
– Filter, Proof of Concept / Pilot
Text Analytics Development
– Progressive Refinement
– Categorization, Extraction, Sentiment
– Case Studies
– Best Practices
Semantic Infrastructure – Foundation: Text Analytics Features
Noun Phrase Extraction
– Catalogs with variants, rule-based dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
Summarization
– Customizable rules, map to different content
Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
Sentiment Analysis
– Rules – objects and phrases – positive and negative
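The rule-based sentiment described above (objects and phrases scored positive or negative) can be sketched minimally as polarity terms matched near an object phrase. The term lists, window size, and function here are illustrative assumptions, not any vendor's implementation.

```python
# Minimal sketch of rule-based sentiment: polarity terms scored against
# the object phrase they appear near. All term lists are illustrative.
import re

POSITIVE = {"reliable", "fast", "helpful", "great"}
NEGATIVE = {"dropped", "slow", "rude", "broken"}

def sentiment_for_object(text: str, obj: str, window: int = 5) -> int:
    """Score +1/-1 for each polarity term within `window` words of `obj`."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, w in enumerate(words):
        if w == obj.lower():
            nearby = words[max(0, i - window): i + window + 1]
            score += sum(1 for n in nearby if n in POSITIVE)
            score -= sum(1 for n in nearby if n in NEGATIVE)
    return score

print(sentiment_for_object("The service was slow and calls dropped.", "service"))  # -2
```

Tying sentiment to an object rather than the whole document is what distinguishes the rule-based approach on this slide from simple document-level polarity.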
Semantic Infrastructure – Foundation: Text Analytics Features
Auto-categorization
– Training sets – Bayesian, vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (title, body, URL)
– Semantic network – predefined relationships, sets of rules
– Boolean – full search syntax – AND, OR, NOT
– Advanced – NEAR (#), PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a taxonomy
Combine with extraction
– If any of a list of entities and other words
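A hedged sketch of the Boolean rule style listed above (AND/OR/NOT plus a proximity operator like NEAR). The rule, the category, and all term lists are invented for illustration; real products have much richer syntax.

```python
# Sketch of a Boolean categorization rule with a NEAR-style distance test.
# Rule and vocabulary are illustrative, not from any vendor product.
import re

def words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, a, b, dist):
    """True if terms a and b occur within `dist` words of each other."""
    ws = words(text)
    pos_a = [i for i, w in enumerate(ws) if w == a]
    pos_b = [i for i, w in enumerate(ws) if w == b]
    return any(abs(i - j) <= dist for i in pos_a for j in pos_b)

def matches_marine_toxins(text):
    """(toxin NEAR/4 marine) AND (microbiology OR algae) NOT freshwater"""
    ws = set(words(text))
    return bool(near(text, "toxin", "marine", 4)
                and ({"microbiology", "algae"} & ws)
                and "freshwater" not in ws)

print(matches_marine_toxins("The marine toxin produced by algae blooms is potent."))  # True
```

The NOT clause and the distance constraint are what push precision up; the OR list of related terms is what protects recall — the same trade-off the POC scoring below measures.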
Evaluating Text Analytics Software: Start with Self-Knowledge
Strategic and Business Context
– Info problems – what, how severe
– Strategic questions – why, what value from the taxonomy/text analytics, how are you going to use it
Formal process – KA audit – content, users, technology, business and information behaviors, applications
– Or informal for smaller organizations
Text Analytics Strategy/Model – forms, technology, people
– Existing taxonomic resources, software
Need this foundation to evaluate and to develop
Evaluating Text Analytics Software: Start with Self-Knowledge
Do you need it – and what blend if so?
Taxonomy Management – stand-alone?
– Multiple taxonomies, languages, authors-editors
Technology Environment – ECM, enterprise search – where is it embedded
Publishing Process – where and how is metadata being added – now and projected future
– Can it utilize auto-categorization, entity extraction, summarization?
Is the current search adequate – can it utilize text analytics?
Applications – text mining, BI, CI, alerts?
Evaluating Text Analytics Software: Team – Interdisciplinary
IT – large software purchase, needs assessment
• Text analytics is different – semantics
• Like a construction company designing your house
Business – understands the business needs
• Doesn't understand information
• Like a restaurant owner doing the cooking
Library – knows information, search
• Doesn't understand the business; non-information experts
• Like an accountant doing financial strategy
Team – combination of consulting and internal
Semantic Infrastructure – Foundation: Design of the Text Analytics Selection Team
Interdisciplinary team, led by information professionals
– IT – software experience, budget, support tests
– Business – understand business and requirements
– Library – understand information structure, search semantics and functionality
Much more likely to make a good decision
– This is not a traditional IT software evaluation – semantics
Create the foundation for implementation
Evaluating Text Analytics Software: Evaluation Process & Methodology – Two Phases
Phase I – Traditional software evaluation
– Filter One – ask experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
– Filter Two – feature scorecard – minimum, must-have; filter to top 3
– Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Four – in-depth demo – 3-6 vendors
Phase II – Deep POC (2 vendors) – advanced, integration, semantics
Evaluating Text Analytics Software: Phase II – Proof of Concept (POC)
4-6 week POC – bake-off, or short pilot
Measurable quality of results is the essential factor
Real-life scenarios, categorization with your content
2-3 rounds of development, test, refine / not out-of-the-box
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor-unique capabilities – have to determine at POC time
Taxonomy developers – expert consultants plus internal taxonomists
Evaluating Text Analytics Software: Phase II – POC: Range of Evaluations
Basic question – can this stuff work at all?
Auto-categorization to existing taxonomy – variety of content
– Essential issue is complexity of language
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – design which ones based on projected needs – example: privacy info (SS#, phone, etc.)
Entity examples – people, organizations, methods, etc.
– Essential issue is scale and disambiguation
Evaluate usability in action by taxonomists
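The privacy-info catalog example above (SS#, phone) can be sketched as a small regex-based extractor. The patterns and names here are simplified illustrations, not a product's catalog format; production catalogs also handle variants and context.

```python
# Illustrative entity "catalog" for privacy info, using regular expressions.
# Patterns are deliberately simplified (US-style formats only).
import re

CATALOG = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def extract_entities(text: str) -> dict[str, list[str]]:
    """Return every catalog match found in the text, keyed by entity type."""
    return {etype: pat.findall(text) for etype, pat in CATALOG.items()}

doc = "Call 555-867-5309; SSN on file is 123-45-6789."
print(extract_entities(doc))
```

The disambiguation issue the slide names shows up even here: phone and SSN formats differ only in digit grouping, which is why scale and ambiguity dominate entity-extraction evaluation.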
Text Analytics Evaluation: Case Study – Self-Knowledge
Platform – range of capabilities
– Categorization, sentiment analysis, etc.
Technical
– APIs, Java-based, Linux run time
– Scalability – millions of documents a day
– Import-export – XML, RDF
Total cost of ownership
Vendor relationship – OEM
Usability, multiple-language support
Team – 3 KAPS (information), 5-8 Amdocs (SME – business, technical)
Text Analytics Evaluation: Case Study – Phase I
– Attensity
– SAP – Inxight
– Clarabridge
– ClearForest
– Concept Searching
– Data Harmony / Access Innovations
– Expert Systems
– GATE (open source)
– IBM
– Lexalytics
– Multi-Tes
– Nstein
– SAS
– SchemaLogic
– Smart Logic
– Content Management – Enterprise Search
– Sentiment Analysis Specialty
– Ontology Platforms
Text Analytics Evaluation: Case Study – Telecom Service
Criteria:
– Company history, reputation
– Full platform – categorization, extraction, sentiment
– Integration – Java, API-SDK, Linux
– Multiple languages
– Scale – millions of docs a day
– Total cost of ownership
– Ease of development – new
– Vendor relationship – OEM, etc.
Short list: Expert Systems, IBM, SAS – Teragram, Smart Logic
Option – multiple vendors – sentiment & platform
IBM and SAS – finalists
Text Analytics Evaluation: Case Study – POC Design Discussion: Evaluation Criteria
Basic test design – categorize a test set
– Score – by file name, human testers
Categorization – call motivation
– Accuracy level – 80-90%
– Effort level per accuracy level
Sentiment analysis
– Accuracy level – 80-90%
– Effort level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how to score?
– Combination of scores and report
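The scoring step above ("score by file name" against human testers) amounts to computing per-category precision and recall: each test file carries a human-assigned category that is compared with the engine's assignment. This is a generic sketch with invented data, not the actual POC harness.

```python
# Per-category precision/recall over a labeled test set.
# `gold` holds human labels keyed by file name; `predicted` holds the
# engine's output. Data below is invented for illustration.

def precision_recall(gold: dict, predicted: dict, category: str):
    """Precision and recall for one category over a set of test documents."""
    relevant  = {d for d, c in gold.items() if c == category}
    retrieved = {d for d, c in predicted.items() if c == category}
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall    = hits / len(relevant) if relevant else 0.0
    return precision, recall

gold      = {"call01": "billing", "call02": "billing", "call03": "outage"}
predicted = {"call01": "billing", "call02": "outage",  "call03": "outage"}
p, r = precision_recall(gold, predicted, "billing")
print(p, r)  # 1.0 0.5
```

Scoring each category separately is what lets a POC report rows like "Recall – Motivation" and "Precision – Actions" rather than a single accuracy number.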
Text Analytics Evaluation: Case Study – Phase II – POC: Risks
CIO/CTO problem – this is not a regular software process
Language is messy, not just complex
– 30% accuracy isn't 30% done – it could be 90%
Variability of human categorization / expression
– Even professional writers – journalist examples
Categorization is iterative, not "the program works"
– Need a realistic budget and a flexible project plan
"Anyone can do categorization"
– Librarians often overdo it, SMEs often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language
Text Analytics POC Outcomes: Categorization Results

                         SAS     IBM
Recall – Motivation      92.6    90.7
Recall – Actions         93.8    88.3
Precision – Motivation   84.3
Precision – Actions      100
Uncategorized            87.5
Raw Precision            73      46
Text Analytics POC Outcomes: Vendor Comparisons
Categorization results – both good, edge to SAS on precision
– Use of relevancy to set thresholds
Development environment
– IBM as toolkit provides more flexibility, but it also increases development effort
Methodology – IBM enforces a good method, but takes more time
– SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START
Text Analytics POC Outcomes: Vendor Comparisons – Functionality
Sentiment analysis
– SAS has a workbench; IBM would require more development
– SAS also has statistical modeling capabilities
Entity and fact extraction – seems basically the same
– SAS can use operators for improved disambiguation
Summarization – SAS has it built in
– IBM could develop it using categorization rules – but it is not clear that would be as effective without operators
Conclusion: both can do the job, edge to SAS
Now the fun begins – development
Text Analytics Development: Foundation
Articulated information management strategy (K map)
– Content, structures, and metadata
– Search, ECM, applications – and how they are used in the enterprise
– Community information needs and the text analytics team
POC establishes the preliminary foundation
– Need to expand and deepen
– Content – full range, basis for rules/training
– Additional SMEs – content selection, refinement
Taxonomy – starting point for categorization / is it suitable?
Databases – starting point for entity catalogs
Text Analytics Development: Enterprise Environment – Case Studies
A Tale of Two Taxonomies – it was the best of times, it was the worst of times
Basic approach
– Initial meetings – project planning
– High-level K map – content, people, technology
– Contextual and information interviews
– Content analysis
– Draft taxonomy – validation interviews, refine
– Integration and governance plans
Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins
Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Facilities > Division > Location > Building X
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
– Content Type > Knowledge Asset > Proposals
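One way the facet paths above might be held as data: each facet value is a ">"-delimited path from facet root to leaf, and a document carries one or more values per facet. The structure is an illustrative assumption, not the project's actual schema.

```python
# Illustrative representation of faceted taxonomy tags on one document.
# Paths are taken from the slide above; the dict layout is an assumption.

facets = {
    "Organization": ["Organization > Division > Group"],
    "Clients": ["Clients > Federal > EPA"],
    "Instruments": ["Instruments > Environmental Testing > Ocean Analysis > Vehicle"],
}

def leaf(path: str) -> str:
    """Return the most specific term of a facet path."""
    return path.split(">")[-1].strip()

doc_tags = {facet: [leaf(p) for p in paths] for facet, paths in facets.items()}
print(doc_tags["Clients"])  # ['EPA']
```

Keeping the full path (not just the leaf) is what lets search broaden from "EPA" to all federal clients without re-tagging content.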
Text Analytics Development: Enterprise Environment – Case One – Taxonomy, 7 Facets
Project owner – KM department – included RM, business process
Involvement of library – critical
Realistic budget, flexible project plan
Successful interviews – build on context
– Overall information strategy – where the taxonomy fits
Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets
Final plans and hand-off to client
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Taxonomy of Subjects / Disciplines:
– Geology > Petrology
Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Environment issues
– Value of taxonomy understood, but not the complexity and scope
– Under budget, understaffed
– Location – not KM – tied to RM and software
• A solution looking for the right problem
– Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Project issues
– Project mindset – not infrastructure
– Wrong kind of project management
• Special needs of a taxonomy project
• Importance of integration – with team, company
– Project plan more important than results
• Rushing to meet deadlines doesn't work with semantics as well as it does with software
Text Analytics Development: Enterprise Environment – Case Two – Taxonomy, 4 Facets
Research issues
– Not enough research – and the wrong people
– Interference of non-taxonomy concerns – communication
– Misunderstanding of research – wanted tinker-toy connections
• Interview 1 implies conclusion A
Design issues
– Not enough facets
– Wrong set of facets – business, not information
– Ill-defined facets – too complex an internal structure
Text Analytics Development: Conclusion – Risk Factors
Political-cultural-semantic environment
– Not simple resistance – more subtle
• Re-interpretation of specific conclusions and the sequence of conclusions / relative importance of specific recommendations
Understanding project scope
Access to content and people
– Enthusiastic access
Importance of a unified project team
– Working communication as well as weekly meetings
Text Analytics Development: Case Study 2 – POC – Telecom Client
Demo of SAS Enterprise Content Categorization
Text Analytics Development: Best Practices – Principles
Importance of ongoing maintenance and refinement
Need a dedicated taxonomy team working with SMEs
Work with application developers to incorporate text analytics into new applications
Importance of metrics and feedback
– Software and social
Questions:
– What are the important subjects (and changes)?
– What information do they need?
– How is their information related to other silos?
Text Analytics Development: Best Practices – Principles
Process
– Realistic budget – not a nice-to-have add-on
– Flexible project plan – semantics are complex and messy
• Time estimates are difficult; objective success measures are too
– Transition from development to maintenance is fluid
Resources
– Interdisciplinary team is essential
– Importance of communication – languages
– Merging internal and external expertise
Text Analytics Development: Best Practices – Principles
Categorization taxonomy structure
– Trade-off of depth and complexity of rules
– Multiple avenues – facets, terms, rules, etc.
• No single right balance
– Recall-precision balance is application specific
– Training sets as starting points; rules rule
– Need for custom development
Technology
– Basic integration – XML
– Advanced – combine unstructured and structured in new ways
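The "basic integration – XML" point above can be illustrated by serializing categorization results as XML metadata for downstream systems (search, ECM). The element names here are assumptions for illustration, not a standard schema.

```python
# Sketch: categorization output serialized as XML metadata.
# Element and attribute names are illustrative, not a standard schema.
import xml.etree.ElementTree as ET

def to_xml(doc_id: str, categories: list[str]) -> str:
    """Wrap a document's assigned categories in a small XML record."""
    root = ET.Element("document", id=doc_id)
    cats = ET.SubElement(root, "categories")
    for c in categories:
        ET.SubElement(cats, "category").text = c
    return ET.tostring(root, encoding="unicode")

print(to_xml("call01", ["Billing", "Outage"]))
```

A search engine or ECM repository can ingest records like this as metadata fields, which is the "basic" integration; the "advanced" path combines these fields with structured database records.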
Text Analytics Development: Best Practices – Risk Factors
Value understood, but not the complexity and scope
Project mindset – software project and then done
Not enough research on user information needs and behaviors
– Talking to the right people and asking the right questions
– Getting beyond "All of the Above" surveys
Not enough resources, or the wrong resources
Enthusiastic access to content and people
Bad design – starting with the wrong type of taxonomy
Categorization is not library science
– More like cognitive anthropology
Semantic Infrastructure Development: Conclusion
Text analytics is the foundation for semantic infrastructure
Evaluation of text analytics – different than IT software
– POC – essential, the foundation of development
– Difference of taxonomy and categorization
• Concepts vs. text in documents
Enterprise context – strategic, self-knowledge
– Infrastructure resource, not a project
– Interdisciplinary team and applications
Integration with other initiatives and technologies
– Text mining, data mining, sentiment & beyond, everything!
Questions?
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com