Date posted: 19-Dec-2015

Automatic Facets: Faceted Navigation and Entity Extraction

Tom Reamy
Chief Knowledge Architect

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com

2

Agenda

Introduction: Elements – Facets, Taxonomies, Software, People

3 Environments – E-Commerce, Enterprise, Internet

Design Issues – Facets and Entities

Conclusion – Integrated Solution

3

KAPS Group: General

Knowledge Architecture Professional Services

Virtual Company: Network of consultants – 12-15

Partners – Inxight, FAST, etc.

Consulting, Strategy, Knowledge architecture audit

Taxonomies: Enterprise, Marketing, Insurance, etc.

Services:
– Taxonomy development, consulting, customization
– Technology Consulting – Search, CMS, Portals, etc.
– Metadata standards and implementation
– Knowledge Management: Collaboration, Expertise, e-learning
– Applied Theory – Faceted taxonomies, complexity theory, natural categories

4

Elements

Facet – orthogonal dimension of metadata

Entity / Noun Phrase – metadata value of a facet

Entity extraction – feeds facets, signature, ontologies

Taxonomy and categorization rules

Auto-categorization – aboutness, subject facets

People – tagging, evaluating tags, fine tune rules and taxonomy

5

Essentials of Facets

Facets are not categories
– Categories are what a document is about – limited number
– Entities are contained within a document – any number

Facets are orthogonal – mutually exclusive – dimensions
– An event is not a person is not a document is not a place.

Facets – variety – of units, of structure
– Numerical range (price), Location – big to small
– Alphabetical, Hierarchical – taxonomic

Facets are designed to be used in combination
• Wine where color = red, price = excessive, location = California,
• And sentiment = snotty
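The wine example can be read as a filter over independent facet values. The Python below is a minimal sketch of that idea, not something taken from the talk: the `WINES` records, the facet names, and the `filter_by_facets` helper are all hypothetical.

```python
# Hypothetical catalog: each record carries one value per facet.
WINES = [
    {"name": "Wine A", "color": "red", "price": "excessive", "location": "California"},
    {"name": "Wine B", "color": "red", "price": "modest", "location": "California"},
    {"name": "Wine C", "color": "white", "price": "excessive", "location": "Oregon"},
]

def filter_by_facets(records, **selected):
    """Return the records whose facet values match every selected facet."""
    return [r for r in records if all(r.get(f) == v for f, v in selected.items())]

# Combine three facets at once, as in the wine example.
matches = filter_by_facets(WINES, color="red", price="excessive", location="California")
print([w["name"] for w in matches])
```

Because each facet is an independent dimension, selections can be added or dropped in any order without changing how the other facets behave.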

6

Advantages of Faceted Navigation

More intuitive – easy to guess what is behind each door
• Simplicity of internal organization
• 20 questions – we know and use

Dynamic selection of categories
• Allow multiple perspectives
• Ability to Handle Compound Subjects

Systematic Advantages – fewer elements
– 4 facets of 10 nodes = 10,000 node taxonomy
– Ability to Handle Compound Subjects

Flexible – can be combined with other navigation elements
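The arithmetic behind "4 facets of 10 nodes = 10,000 node taxonomy" can be checked directly. Four facets of ten values each mean 4 × 10 = 40 values to maintain, while a single taxonomy that enumerated every combination would need 10^4 = 10,000 leaf nodes. A small sketch, with placeholder facet names that are not from the talk:

```python
from itertools import product

# Four hypothetical facets with ten placeholder values each.
facets = {name: [f"{name}-{i}" for i in range(10)] for name in ("A", "B", "C", "D")}

# Values an editor has to maintain across all facets.
values_to_maintain = sum(len(values) for values in facets.values())

# Distinct combinations a single enumerated taxonomy would need as leaves.
combinations = sum(1 for _ in product(*facets.values()))

print(values_to_maintain)  # 40
print(combinations)        # 10000
```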

7

Essentials of Taxonomies
Internal Organization

Formal Taxonomy – parent – child relationship
– Is-A-Kind-Of ---- Animal – Mammal – Zebra
– Partonomy – Is-A-Part-Of ---- US-California-Oakland

Browse Classification – cluster of related concepts
– Food and Dining – Catering – Restaurants

Taxonomies deal with complex, not compound
– Conceptual relationships – category membership
– Contextual relationships – Computers & Software

Taxonomies deal with semantics & documents
– Multiple meanings and purposes
– Essential attributes of documents are not single value
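The parent-child relationship in a formal taxonomy can be sketched as a table of child-to-parent links. The example below reuses the two chains named on the slide. The `PARENT` table and both helper functions are illustrative assumptions, and for brevity one table holds both the kind-of chain and the part-of chain.

```python
# Child -> parent links. Animal > Mammal > Zebra is a kind-of chain;
# US > California > Oakland is a part-of chain.
PARENT = {
    "Zebra": "Mammal",
    "Mammal": "Animal",
    "Oakland": "California",
    "California": "US",
}

def path_to_root(node):
    """Walk parent links upward and return the path from root to node."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

def is_descendant(node, ancestor):
    """True when ancestor appears strictly above node in the hierarchy."""
    return ancestor in path_to_root(node)[:-1]

print(" > ".join(path_to_root("Zebra")))
print(is_descendant("Oakland", "US"))
```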

8

Developing Facets: Tools and Techniques
Software Tools

Text Analytics – Taxonomy management, entity extraction, categorization, sentiment

Search – Integrated features, at index, Internet sources

CM – Enterprise environment, taggers and policy

Programmable Rules
– Business and Subject matter expertise
– Auto-populate variety of metadata – author, title, date, etc.
– Relevance – best bets to weights and classes of documents

People – refine, monitor – it’s not automatic

9

Developing Facets: Tools and Techniques
Software Tools – Auto-categorization

Auto-categorization
– Training sets – Bayesian, Vector Machine
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (Title, body, url)
– Advanced – saved search queries (full search syntax)
– NEAR, SENTENCE, PARAGRAPH
– Boolean – X NEAR Y and Not-Z

Advanced Features
– Facts / ontologies / Semantic Web – RDF +
– Sentiment Analysis – positive, negative, neutral
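A rule of the form "X NEAR Y and Not-Z" can be sketched in a few lines. The exact meaning of NEAR differs between products. This sketch assumes it means "within `window` word tokens", and the function names and the sample sentence are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(tokens, x, y, window=5):
    """True when x and y occur within `window` tokens of each other."""
    xs = [i for i, t in enumerate(tokens) if t == x]
    ys = [i for i, t in enumerate(tokens) if t == y]
    return any(abs(i - j) <= window for i in xs for j in ys)

def matches_rule(text, x, y, not_z, window=5):
    """Evaluate the rule: x NEAR y AND NOT not_z."""
    tokens = tokenize(text)
    return near(tokens, x, y, window) and not_z not in tokens

doc = "Ford announced a new truck at the Detroit auto show."
print(matches_rule(doc, "ford", "truck", "theatre"))
```

A production categorizer would typically add stemming and sentence or paragraph scopes, which this sketch leaves out.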

10

Developing Facets: Tools and Techniques
Software Tools – Entity Extraction

Dictionaries – variety of entities, coverage, specialty
– Cost of update – service or in-house
– Inxight – 50+ predefined entity types
– Nstein – 800,000 people, 700,000 locations, 400,000 organizations

Rules
– Capitalization, text – Mr., Inc.
– Advanced – proximity and frequency of actions, associations
– Need people to continually refine the rules

Entities and Categorization
– Total number and pattern of entities = a type of aboutness of the document
– Bar Code, Fingerprint
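The simple rules above (capitalization plus cue text such as "Mr." or "Inc.") can be approximated with regular expressions, and the resulting entity counts give a rough per-document fingerprint. This is a sketch under stated assumptions, not how Inxight or Nstein work: the two patterns, the facet labels, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# A title followed by capitalized words marks a person.
PERSON = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")
# Capitalized words followed by a corporate suffix mark an organization.
ORG = re.compile(r"\b((?:[A-Z][A-Za-z]+\s)+(?:Inc|Corp|Ltd)\.)")

def extract_entities(text):
    """Return (facet, value) pairs found by the two cue rules."""
    people = [("person", m.group(1)) for m in PERSON.finditer(text)]
    orgs = [("organization", m.group(1)) for m in ORG.finditer(text)]
    return people + orgs

def signature(text):
    """Count each extracted entity; the counts act as a rough fingerprint."""
    return Counter(extract_entities(text))

text = ("Mr. Smith joined Acme Widgets Inc. after Dr. Jane Doe "
        "left Acme Widgets Inc. last year.")
sig = signature(text)
for (facet, value), count in sig.items():
    print(facet, value, count)
```

Rules this simple miss many cases and produce false matches, which is why the slide stresses that people must keep refining them.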

11

Elements: People

Programmers, Librarians, Taxonomists, Metadata specialists
– Integrate, design, develop rules, monitor activity & quality

Authors, Subject Matter Experts
– Input into design (important facets), rules, activity meaning

Users – Web 2.0
– Feedback – quality and usability
– Suggestions – missing terms, bad categorization & entity
– Tag Clouds & folksonomy – for social networking features, not for information retrieval

12

Three Environments

E-Commerce
– Catalogs, small uniform collections of entities
– Uniform behavior – buy this

Enterprise
– More content, more types of content
– Enterprise Tools – Search, ECM
– Publishing Process – tagging, metadata standards

Internet
– Wildly different amount and type of content, no taggers
– General Purpose – Flickr, Yahoo
– Vertical Portal – selected content, no taggers

13

Three Environments: E-Commerce

15

Enterprise Environment – When and how to add metadata

Enterprise Content – different world than eCommerce
– More Content, more kinds, more unstructured
– Not a catalog to start – less metadata and structured content
– Complexity – not just content but variety of users and activities

Combination of human and automatic metadata – ECM
– Software aided – suggestions, entities, ontologies

Enterprise – Question of Balance / strategy
– More facets = more findability (up to a point)
– Fewer facets = lower cost to tag documents

Issues
– Not enough facets
– Wrong set of facets – business not information
– Ill-defined facets – too complex internal structure

16

Facets and Taxonomies
Enterprise Environment – Case One – Taxonomy, 7 facets

Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins

Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Facilities > Division > Location > Building X
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
– Content Type – Knowledge Asset > Proposals
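Hierarchical facet values such as "Clients > Federal > EPA" can be stored as paths, so that selecting a higher node also returns everything beneath it. The sketch below is an assumption about how such values might be handled, not a description of the case study's system. The document IDs and the "Clients > State" value are hypothetical.

```python
def parse_path(path):
    """Split 'A > B > C' into the tuple ('A', 'B', 'C')."""
    return tuple(part.strip() for part in path.split(">"))

def under(path, prefix):
    """True when path equals prefix or sits below it in the hierarchy."""
    p, q = parse_path(path), parse_path(prefix)
    return p[:len(q)] == q

# Hypothetical documents, each tagged with one value from the Clients facet.
tagged = {
    "doc-1": "Clients > Federal > EPA",
    "doc-2": "Clients > State",
}

federal_docs = [doc for doc, value in tagged.items() if under(value, "Clients > Federal")]
print(federal_docs)
```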

17

External Environment – Text Mining, Vertical Portals

Internet Content
– Scale – impacts design and technology – speed of indexing
– Limited control – Association of publishers to selection of content to none
– Major subtypes – different rules – metadata and results

Complex queries and alerts
– Terrorism taxonomy + geography + people + organizations

Text Mining
– General or specific content and facets and categories
– Dedicated tools or component of Portal – internal or external

Vertical Portal
– Relatively homogeneous content and users
– General range of questions

18

Internet Design

Subject Matter taxonomy – Business Topics
– Finance > Currency > Exchange Rates

Facets
– Location > Western World > United States
– People – Alphabetical and/or Topical – Organization
– Organization > Corporation > Car Manufacturing > Ford
– Date – Absolute or range (1-1-01 to 1-1-08, last 30 days)
– Publisher – Alphabetical and/or Topical – Organization
– Content Type – list – newspapers, financial reports, etc.
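The Date facet mixes two kinds of selection: an absolute range and a rolling window such as "last 30 days". A minimal sketch of both follows. It assumes "1-1-01 to 1-1-08" means 1 January 2001 through 1 January 2008, and the document names and dates are invented.

```python
from datetime import date, timedelta

def in_absolute_range(published, start, end):
    """True when the date falls inside a fixed, inclusive range."""
    return start <= published <= end

def in_last_n_days(published, n, today):
    """True when the date falls within the n days up to and including today."""
    return today - timedelta(days=n) <= published <= today

# Hypothetical documents with publication dates.
docs = {
    "report-a": date(2005, 6, 15),
    "report-b": date(2008, 3, 1),
}

start, end = date(2001, 1, 1), date(2008, 1, 1)
in_range = [name for name, d in docs.items() if in_absolute_range(d, start, end)]
print(in_range)
```

Passing `today` explicitly keeps the rolling window testable. A live system would supply the current date.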

22

Integrated Facet Application
Design Issues – General

What is the right combination of elements?
– Faceted navigation, metadata, browse, search, categorized search results, file plan

What is the right balance of elements?
– Dominant dimension or equal facets
– Browse topics and filter by facet

When to combine search, topics, and facets?
– Search first and then filter by topics / facet
– Browse/facet front end with a search box

23

Integrated Facet Application
Design Issues – General

Homogeneity of Audience and Content

Model of the Domain – broad
– How many facets do you need?
– More facets and let users decide
– Allow for customization – can’t define a single set

User Analysis – tasks, labeling, communities
• Issue – labels that people use to describe their business and labels that they use to find information

Match the structure to domain and task
– Users can understand different structures

24

Automatic Facets – Special Issues

Scale requires more automated solutions
– More sophisticated rules

Rules to find and populate existing metadata
– Variety of types of existing metadata – Publisher, title, date
– Multiple implementation Standards – Last Name, First / First Name, Last

Issue of disambiguation:
– Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
– Same word, different entity – Ford and Ford

Number of entities and thresholds per results set / document
– Usability, audience needs

Relevance Ranking – number of entities, rank of facets
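Two of the issues above lend themselves to a short sketch: reconciling "Last, First" with "First Last", and mapping name variants to one entity. The alias table below is a hypothetical, hand-built example; a real system would need rules or curation to populate it.

```python
def normalize_name(raw):
    """Convert 'Last, First' to 'First Last'; tidy whitespace otherwise."""
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        return f"{first} {last}"
    return " ".join(raw.split())

# Hypothetical alias table mapping name variants to one canonical entity.
ALIASES = {
    "Henry Ford": "Henry Ford",
    "Mr. Ford": "Henry Ford",
    "Henry X. Ford": "Henry Ford",
}

def resolve(raw):
    """Normalize the name form, then map known variants to the canonical entity."""
    name = normalize_name(raw)
    return ALIASES.get(name, name)

for variant in ("Ford, Henry", "Mr. Ford", "Henry X. Ford"):
    print(variant, "->", resolve(variant))
```

A lookup table handles "same person, different name" only. The reverse case, "Ford and Ford", needs surrounding context (for example, nearby terms or other entities in the document) to decide which entity is meant.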

25

Putting it all together – Infrastructure Solution

Facets, Taxonomies, Software, People

Combine formal power with ability to support multiple user perspectives

Facet System – interdependent, map of domain

Entity extraction – feeds facets, signatures, ontologies

Taxonomy & Auto-categorization – aboutness, subject

People – tagging, evaluating tags, fine tune rules and taxonomy

The future is the combination of simple facets with rich taxonomies with complex semantics / ontologies

Questions?

Tom Reamy
[email protected]

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com