Text Analytics World · 2014-03-24 · Search is still #1 = 30-50% of applications ! New Standard...

Text Analytics World
Current Applications and Future Directions of Text Analytics
Tom Reamy
Chief Knowledge Architect, KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com




Agenda

§  Introduction:
    –  Current State of Text Analytics
    –  Survey / Discussion Themes
§  Enterprise Text Analytics - Search – still fundamental
    –  Shift from information to business
§  Social Media – Next Generation
    –  Text Analytics and CRM
§  Integration – Text and Data, Enterprise and Social
§  Future of Text Analytics
    –  Roadblocks, Deep Vision
§  Questions


Introduction: KAPS Group

§  Knowledge Architecture Professional Services – Network of Consultants
§  Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies
§  Services:
    –  Strategy – IM & KM - Text Analytics, Social Media, Integration
    –  Taxonomy/Text Analytics development, consulting, customization
    –  Text Analytics Quick Start – Audit, Evaluation, Pilot
    –  Social Media: Text based applications – design & development
§  Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
§  Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A.
§  Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc.
§  Presentations, Articles, White Papers – www.kapsgroup.com


Text Analytics World
Current State of Text Analytics

§  History – academic research, focus on NLP
§  Inxight – out of Xerox PARC
    –  Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data
§  Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends
    –  Half from 2008 are gone - Lucky ones got bought
§  Early applications – News aggregation and Enterprise Search
§  Second Wave = shift to sentiment analysis
§  Enterprise search – 30-50% of market ($1Bil)
§  Text Analytics is growing 20% a year, 10% of analytics
§  Fragmented market – no clear leader


Text Analytics World
Current State of Text Analytics: Vendor Space

§  Taxonomy Management – SchemaLogic, Pool Party
§  From Taxonomy to Text Analytics
    –  Data Harmony, Multi-Tes
§  Extraction and Analytics
    –  Linguamatics (Pharma), Temis, whole range of companies
§  Business Intelligence – Clear Forest, Inxight
§  Sentiment Analysis – Attensity, Lexalytics, Clarabridge
§  Open Source – GATE
§  Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching
§  Embedded in Content Management, Search
    –  Autonomy, FAST, Endeca, Exalead, etc.


Interviews with Leading Vendors, Analysts: Current Trends

§  From Mundane to Advanced – reducing manual labor to “Cognitive Computing”
§  Enterprise – Shift from Information to Business – cost cutting rather than productivity gains
§  Integration – data and text, text analytics and analytics
    –  Social Media – explosion of wild text, combine with data – customer browsing behavior, web analytics
§  Big Data – more focus on extraction (where it began) but categorization adds depth and sophistication
§  Shift away from IT – compliance, legal, advertising, CRM
§  US market different than Europe/Asia – project oriented


Enterprise Text Analytics

§  Search is still #1 = 30-50% of applications
§  New Standard Search – facets (more and more metadata), auto-categorization built on taxonomies, clustering
    –  Issue – consistent metadata, multiple content sources
§  Trend = Text Analytics/Search as Semantic Infrastructure
    –  Platform for Info Apps (Search-based applications)
§  SharePoint – Major focus of TA companies – fix problems with taxonomy/folksonomy
    –  Hybrid workflow – Publish document -> TA analysis -> suggestions for categorization, entities, metadata -> present to author
§  External information = more automation, extraction – precision more important
§  Use of predictive facets, enhanced relevance (FAST)


Enterprise Text Analytics
Adding Structure to Unstructured Content

§  Beyond Documents – categorization by corpus, by page, sections or even sentence or phrase
§  Documents are not unstructured – variety of structures
    –  Sections – Specific - “Abstract” to Function “Evidence”
    –  Corpus – document types/purpose
    –  Textual complexity, level of generality
§  Need to develop flexible categorization and taxonomy – tweets to 200 page PDF
§  Applications require sophisticated rules, not just categorization by similarity


Enterprise Text Analytics
Document Type Rules

§  (START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]"), (OR, _/article:"clinical trial*", _/article:"humans",
§  (NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
§  If the article has sections like Abstract or Methods
§  AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score
§  Primary issue – major mentions, not every mention
    –  Combination of noun phrase extraction and categorization
    –  Results – virtually 100%
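
The document type rule can be approximated in a few lines of Python. This is a rough sketch of the logic described in the bullets, not the vendor rule engine: the function name `score_clinical_trial`, the tokenizer, and the reading of START_2000 as "only the first 2000 words" are all assumptions made for illustration.

```python
import re

# Word lists taken from the rule on the slide.
SECTION_MARKERS = ("abstract", "methods")
BLOCKERS = {"approved", "safe", "use", "animals"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_clinical_trial(text, window=2000, dist=5):
    """Count 'clinical trial' / 'humans' mentions with no blocker word nearby.

    Assumption: START_2000 limits the rule to the first 2000 tokens and
    DIST_5 means 'within 5 tokens on either side'.
    """
    tokens = tokenize(text)[:window]
    # The article must contain a section marker such as Abstract or Methods.
    if not any(marker in tokens for marker in SECTION_MARKERS):
        return 0
    score = 0
    for i, tok in enumerate(tokens):
        is_trial = tok == "humans" or (
            tok == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")
        )
        if not is_trial:
            continue
        nearby = tokens[max(0, i - dist): i + dist + 1]
        # Skip the mention if a blocker such as "animals" is within range.
        if BLOCKERS.isdisjoint(nearby):
            score += 1
    return score


if __name__ == "__main__":
    print(score_clinical_trial(
        "Abstract We report a randomized clinical trial in adult patients."))
    print(score_clinical_trial("Methods The clinical trial used animals only."))
```

A real rule would also weight major mentions over passing ones, which this sketch does not attempt.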


Enterprise Text Analytics
Building on the Foundation: Applications

§  Focus on business value, cost cutting
§  Enhancing information access is a means, not an end
    –  Governance, Records Management, Doc duplication, Compliance
    –  Applications – Business Intelligence, CI, Behavior Prediction
    –  eDiscovery, litigation support
    –  Risk Management
    –  Productivity / Portals – spider and categorize, extract – KM communities & knowledge bases
        •  New sources – field notes into expertise, knowledge base – capture real time, own language-concepts


Enterprise Text Analytics: Applications
Pronoun Analysis: Fraud Detection; Enron Emails

§  Patterns of “Function” words reveal wide range of insights
§  Function words = pronouns, articles, prepositions, conjunctions, etc.
    –  Used at a high rate, short and hard to detect, very social, processed in the brain differently than content words
§  Areas: sex, age, power-status, personality – individuals and groups
§  Lying / Fraud detection: Documents with lies have:
    –  Fewer, shorter words, fewer conjunctions, more positive emotion words
    –  More use of “if, any, those, he, she, they, you”, less “I”
§  Current research – 76% accuracy in some contexts
    –  Italian – stylometry – linguistic hedges
§  Text Analytics can improve accuracy and utilize new sources
§  Data analytics (standard AML) can improve accuracy
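
A function-word profile of the kind listed above is straightforward to compute. The sketch below only produces per-word usage rates for the words named on the slide; it is not a lie detector, and the function name `function_word_rates` and the tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

# Only the words quoted on the slide; a real profile would use a fuller list.
FUNCTION_WORDS = {"if", "any", "those", "he", "she", "they", "you", "i"}


def function_word_rates(text):
    """Return each tracked function word's share of all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return {
        word: (counts[word] / total if total else 0.0)
        for word in sorted(FUNCTION_WORDS)
    }


if __name__ == "__main__":
    rates = function_word_rates("I think they said you can go if they agree")
    for word, rate in rates.items():
        print(f"{word:6s} {rate:.2f}")
```

Rates like these would be one input among several (word length, conjunction counts, emotion words) to whatever model makes the actual judgment.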


Social Media: Next Generation
Beyond Simple Sentiment

§  Beyond Good and Evil (positive and negative)
    –  Degrees of intensity, complexity of emotions and documents
§  Importance of Context – around positive and negative words
    –  Rhetorical reversals – “I was expecting to love it”
    –  Issues of sarcasm (“Really Great Product”), slanguage
§  Essential – need full categorization and concept extraction
§  New Taxonomies – Appraisal Groups – “not very good”
    –  Supports more subtle distinctions than positive or negative
§  Emotion taxonomies - Joy, Sadness, Fear, Anger, Surprise, Disgust
    –  New Complex – pride, shame, confusion, skepticism
§  New conceptual models, models of users, communities
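
To make the appraisal-group idea concrete, here is a toy sketch that scores a head adjective together with the modifiers in front of it, so that “not very good” is treated as one unit rather than three separate words. The lexicons, the weights, and the function name `appraisal_groups` are invented for illustration and are not taken from any published appraisal taxonomy.

```python
# Hypothetical mini-lexicons; all weights are arbitrary illustration values.
HEAD_POLARITY = {"good": 1.0, "great": 1.5, "bad": -1.0, "terrible": -1.5}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "slightly": 0.5}
NEGATORS = {"not", "never"}
NEGATION_WEIGHT = -0.5  # negation flips and dampens rather than mirroring


def appraisal_groups(text):
    """Return (phrase, score) pairs for each head adjective and its modifiers."""
    tokens = [t.strip(".,!?\"'") for t in text.lower().split()]
    groups = []
    for i, word in enumerate(tokens):
        if word not in HEAD_POLARITY:
            continue
        score = HEAD_POLARITY[word]
        phrase = [word]
        j = i - 1
        # Absorb intensifiers directly to the left of the head word.
        while j >= 0 and tokens[j] in INTENSIFIERS:
            score *= INTENSIFIERS[tokens[j]]
            phrase.insert(0, tokens[j])
            j -= 1
        # A negator in front of the group flips and weakens the score.
        if j >= 0 and tokens[j] in NEGATORS:
            score *= NEGATION_WEIGHT
            phrase.insert(0, tokens[j])
        groups.append((" ".join(phrase), score))
    return groups


if __name__ == "__main__":
    print(appraisal_groups("The battery is not very good."))
    print(appraisal_groups("The screen is very good."))
```

Sarcasm and rhetorical reversals such as “I was expecting to love it” need context beyond the phrase and are out of reach for a lookup like this.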


Social Media: Next Generation
Behavior Prediction – Telecom Customer Service

§  Problem – distinguish customers likely to cancel from mere threats
§  Basic Rule
    –  (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"),
    –  (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
§  Examples:
    –  customer called to say he will cancell his account if the does not stop receiving a call from the ad agency.
    –  cci and is upset that he has the asl charge and wants it off or her is going to cancel his act
§  More sophisticated analysis of text and context in text
§  Combine text analytics with Predictive Analytics and traditional behavior monitoring for new applications
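
The basic rule can be mimicked in Python to show how the distance and negation operators interact. This is a sketch under stated assumptions: the bracketed concepts such as "[cancel]" are replaced by small made-up word lists, START_20 is read as "first 20 words" and DIST_n as "within n words", and the name `likely_to_cancel` is hypothetical.

```python
import re

# Stand-ins for the concept lists behind "[cancel]", "[cancel-what-cust]",
# and the excluded "[one-line]" / "[restore]" / "[if]" concepts (assumed).
CANCEL = {"cancel", "cancell", "cancels", "canceling", "cancelling"}
CANCEL_WHAT = {"account", "acct", "act", "service", "contract"}
EXCLUDE = {"line", "restore", "reinstate", "if", "unless"}


def positions(tokens, vocab):
    """Indexes of the tokens that belong to the given word list."""
    return [i for i, tok in enumerate(tokens) if tok in vocab]


def within(a_positions, b_positions, dist):
    """True if any pair of positions is at most `dist` tokens apart."""
    return any(abs(a - b) <= dist for a in a_positions for b in b_positions)


def likely_to_cancel(note, start=20):
    """Apply the AND / NOT structure of the rule to one call-center note."""
    tokens = re.findall(r"[a-z]+", note.lower())[:start]
    cancel = positions(tokens, CANCEL)
    has_intent = within(cancel, positions(tokens, CANCEL_WHAT), 7)
    is_conditional = within(cancel, positions(tokens, EXCLUDE), 10)
    return has_intent and not is_conditional


if __name__ == "__main__":
    print(likely_to_cancel("Customer wants to cancel the account today."))
    print(likely_to_cancel(
        "customer called to say he will cancell his account if the does "
        "not stop receiving a call from the ad agency."))
```

Under these assumptions the first slide example is filtered out because "if" appears close to the cancel word, which is the "mere threat" case the rule is meant to exclude.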


Social Media: Next Generation
Variety of New Applications

§  Crowd Sourcing Technical Support
    –  User Forums – find problem area, nearby text for solution
    –  Automatic or Human mediated
§  Legal Review
    –  Significant trend – computer-assisted review (manual = too many)
    –  TA – categorize and filter to smaller, more relevant set
    –  Payoff is big – One firm with 1.6 M docs – saved $2M
§  Financial Services
    –  Trend – using text analytics with predictive analytics – risk and fraud
    –  Combine unstructured text (why) and transaction data (what)
    –  Customer Relationship Management, Fraud Detection
    –  Stock Market Prediction – Twitter, impact articles


Text Analytics: New Directions
Integration

§  Text and Data, Internal and External, Enterprise and Social
§  Focus - multiple approaches are needed and multiple ways to combine
    –  Death to the Dichotomies – All of the Above
§  Massive parallelism or deeply integrated solution
    –  Example of Watson - fast filtering to get to best 100 answers, then deep analysis of 100
§  Role of automatic / human
§  CRM – struggle to connect to enterprise
    –  Have to learn to speak “enterprise”
§  Imply – Sentiment analysis focus for companies not enough
§  Enterprise and Social Media (Delve)
    –  Social Media analysis and news aggregation
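
The "fast filtering, then deep analysis" pattern mentioned in the Watson example can be sketched as a two-stage ranker. Nothing here reflects how Watson itself works: both scoring functions are simple placeholders, and the names `cheap_score`, `deep_score`, and `two_stage_rank` are invented for illustration.

```python
import heapq


def cheap_score(query, doc):
    """Fast filter: number of query words that also occur in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def deep_score(query, doc):
    """Slower stand-in for deep analysis: query word pairs found in order."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams)


def two_stage_rank(query, docs, shortlist=100):
    """Keep the best `shortlist` docs by the cheap score, then pick the
    best of those by the expensive score."""
    candidates = heapq.nlargest(
        shortlist, docs, key=lambda doc: cheap_score(query, doc))
    return max(candidates, key=lambda doc: deep_score(query, doc))


if __name__ == "__main__":
    docs = [
        "analytics text search",
        "text analytics for search",
        "cooking recipes",
    ]
    print(two_stage_rank("text analytics", docs, shortlist=2))
```

The point of the design is cost: the expensive scorer runs only on the shortlist, so its price is bounded no matter how large the full collection is.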


Delve for the Web: The Front Page of Knowledge Management

Users follow topics, people, and companies selected from Delve taxonomies.

Social media data from Twitter powers recommendation algorithms.


Text Analytics: New Directions - Integration
Thinking Fast and Slow – Daniel Kahneman

§  System 1 – fast and automatic – little conscious control
§  Represents categories as prototypes – stereotypes
    –  Norms for immediate detection of anomalies – distinguish the surprising from the normal
    –  Fast detection of simple differences, detect hostility in a voice, find best chess move (if a master)
    –  Priming / Anchoring – susceptible to systematic errors
    –  Biased to believe and confirm
    –  Focuses on existing evidence (ignores missing – WYSIATI)


Text Analytics: New Directions - Integration
Thinking Fast and Slow

§  System 2 – Complex, effortful judgments and calculations
    –  System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices
    –  Understand complex sentences, validity of logical argument
    –  Focus attention – can make people blind to all else – Invisible Gorilla
§  Similar to traditional dichotomies – Tacit – Explicit, etc.
§  Basic Design – System 1 is basic to most experiences, and System 2 takes over when things get difficult – conscious control
§  Text Analysis and Text Mining / Auto-Cat and TA Cat


Text Analytics: New Directions - Integration
System 1 & 2 – and Text Analytics Approaches

§  “Automatic Categorization” – System 1 prototypes
    –  Limited value – only works in simple environments
    –  Shallow categories with large differences
    –  Not open to conscious control
§  System 2 – categories – complex, minute differences, deep categories
§  Together:
    –  Choose one or other for some contexts
    –  Combine both – need to develop new kinds of categories and/or new ways to combine?


Text Analytics: New Directions - Integration
Text Mining and Text Analytics

§  Text Analytics and Big Data enrich each other
    –  Data tells you what people did, TA tells you why
§  Text Analytics – pre-processing for TM
    –  Discover additional structure in unstructured text
    –  New variables for Predictive Analytics, Social Media Analytics
    –  New dimensions – 90% of information, 50% using Twitter analysis
§  Text Mining for TA – Semi-automated taxonomy development
    –  Apply data methods, predictive analytics to unstructured text
    –  New Models – Watson ensemble methods, reasoning apps
§  Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention


Text Analytics: New Directions - Integration
Integration – Text Analytics and CRM

§  Overall – growing demand for natural language processing, TA
    –  Identify when a customer is angry or at risk of closing an account
    –  Growth of regulatory compliance requirements is driving demand
    –  Used to understand why people call and whether they were satisfied with the quality of the experience, diagnose issues and address them
    –  Combine with Web analytics – need an integrated system
§  Contact Center Search – searching and analyzing customer data across multiple channels – Integration – Salesforce, Coveo, eGain, InQuira
§  Enterprise Feedback Management – want to track satisfaction and loyalty – issue of unstructured content: social media, multimedia channels
§  Contact Center Infrastructure – Importance of Cloud based
    –  Services and Infrastructure – Need Semantic Infrastructure
    –  Cisco – Packaged Contact Center Enterprise
§  Web Support – virtual agents – deliver one answer to a customer’s question, not search results list
    –  Missing – integrated knowledge management system


Future of Text Analytics
Obstacles - Survey Results

§  What factors are holding back adoption of TA?
    –  Lack of clarity about TA and business value - 47%
    –  Lack of senior management buy-in - 8.5%
§  Need articulated strategic vision and immediate practical win
§  Issue – TA is strategic, US wants short term projects
    –  Sneak Project in, then build infrastructure – difficulty of speaking enterprise
§  Integration Issue – who owns infrastructure? IT, Library, ?
    –  IT understands infrastructure, but not text
    –  Need interdisciplinary collaboration – Stanford is offering English-Computer Science Degree – close, but really need a library-computer science degree


Future of Text Analytics
Primary Obstacle: Complexity

§  Usability of software is one element
§  More important is difficulty of conceptual-document models
    –  Language is easy to learn, hard to understand and model
§  Need to add more intelligence (semantic networks) and ways for the system to learn – social feedback
§  Customization – Text Analytics – heavily context dependent
    –  Content, Questions, Taxonomy-Ontology
    –  Level of specificity – Telecommunications
    –  Specialized vocabularies, acronyms


New Directions in Text Analytics
Conclusions

§  Text Analytics is growing out (20%) and up – more mature applications and techniques
§  Find the right balance of infrastructure and application focus
§  Essential theme – integration – text and data, enterprise and social
§  Big obstacles remain
    –  Strategic Vision of text analytics in the enterprise
    –  Concrete and quick application to drive acceptance
§  Future – Women, Fire, and Dangerous Things
    –  Text Analytics and Cognitive Science = Metaphor Analysis, deep language understanding, common sense?


Questions?

Tom Reamy
[email protected]
KAPS Group
http://www.kapsgroup.com

Upcoming:
Text Analytics World SF - 2015
Workshop on Text Analytics: Enterprise Search Summit – New York, May 12-14
Taxonomy Boot Camp, ESS, KMWorld - DC, Nov 4-7
Fall Announcement!