
© Concept Searching 2017
The Nuts and Bolts of Metadata Tagging
and Taxonomies Made Easy
Michael Paye
Chief Technology Officer
Concept Searching
www.conceptsearching.com
Twitter @conceptsearch

Michael Paye – Chief Technology Officer at Concept Searching
has been the driving force behind many of the company's recent
innovations, including the SharePoint Add-in and hybrid search
products. He has a wealth of experience across the Microsoft
platform and related technologies, and oversees all product
development.

Agenda
• Who we are and what we do
• What’s the problem?
• What does it impact?
• How do you measure performance?
• Metadata generation
• Auto-classification – What does it do?
• Taxonomies – What kinds are there?
• SharePoint Term Store
• Calculating return on investment

• Company founded in 2002
• Product launched in 2003
• Focus on management of structured and unstructured information
• Profitable, debt free
• Technology Platform
• Delivered as a web service
• Automatic concept identification, content tagging, auto-classification,
taxonomy management
• Only statistical vendor that can extract conceptual metadata
• 8 years KMWorld ‘100 Companies that Matter in Knowledge Management’
and 8 years KMWorld ‘Trend Setting Product’
• Authority to Operate enterprise wide US Air Force, NETCON US Army,
and Canadian SLSA
• Client base: Fortune 500/1000 organizations in Healthcare,
Financial Services, Manufacturing, Energy, Professional Services,
Pharmaceutical, Public sector and DoD
• Microsoft Gold Certification in Application Development
• Member of SharePoint PAC and TAP programs
• Deployed as a full trust Add-in for all versions of SharePoint on-premises
and SharePoint Online, including the latest vNext dedicated platform and the
government cloud
The Global Leader in
Managed Metadata Solutions

What Do We Do?
Concept Searching’s technology platforms deliver
semantic metadata generation, auto-classification and
taxonomy/Term Store management, and are fully
integrated with all versions of SharePoint on-premises,
Microsoft Online/Office 365, and OneDrive for Business.
These infrastructure platforms integrate not only with
SharePoint but also other content repositories, search
engines and file shares, enabling our clients to add
structure and manage their enterprise content,
regardless of environment.
The resulting classification metadata is used by clients
to deliver ‘intelligent metadata solutions’ in areas such
as enhanced search, migration, data privacy, records
management, policy enforcement, compliance, text
analytics, and business and social collaboration.

What’s the Problem?
“Over 80% of business decisions are made using unstructured data.” IDC

What’s the Problem?
Information Chaos
• 91% use manual metadata tagging
• Free-for-all mode
• Drop down lists
• 15% maintain a home-grown manual
taxonomy
• 77% have no rhyme or reason for
managing content
• Unstructured data is growing at the rate of 62% per year (IDG)
• By 2022, 93% of all data in the digital universe will be unstructured (IDG)
• Data volume is set to grow 800% over the next five years, and 80% of it
will reside as unstructured data (Gartner)

What Does It Impact?
It’s not just about search

How do you measure performance?

Precision Versus Recall
• Usually used by academics
• Precision
• Positive predictive value
• Fraction of retrieved instances that are
relevant
• Recall
• Sensitivity
• Correct number of documents that are
relevant
• Fraction of relevant instances that are
retrieved
• In a perfect world, they should be balanced
• Commercial evaluation criteria also take into
account
• Order of the returned results
• Overall ability of a user to find an answer
rather than relying on a search being
submitted only once
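The two definitions above can be made concrete with a small calculation. This is a minimal sketch: the function name `precision_recall` and the document IDs are invented for illustration and are not from the webinar.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents came back, 5 documents were actually relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc5", "doc6", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved documents are relevant, and two of the five relevant documents were found, so neither measure alone describes the result.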

Precision Versus Recall
• Automated metadata generation is
difficult to achieve consistently with
high precision and recall
• Many products on the market today
require complex rules to be generated,
often involving search syntax and
complicated Boolean expressions
• Some require a document training set
for every term to be processed
• Some of these products employ
linguistic techniques that will not
perform consistently across different
vertical markets
The result is a very high initial cost in terms of
time and level of qualified staff

Metadata
“The quality of your metadata will impact the quality of auto-classification
and ultimately negate your outcomes – increasing organizational risk
and noncompliance.”

Metadata
Definition
• Metadata describes other data. It
provides information about a certain
item's content
• For example, an image may include
metadata that describes how large
the picture is, the color depth, the
image resolution, when the image
was created, and other data
• A text document's metadata may
contain information about how long
the document is, who the author is,
when the document was written, and
a short summary of the document
TechTerms.com

Types of Classification Metadata
Intrinsic
• Information that can be extracted directly
from an object (file name, size)
Administrative/Management
• Information used to manage the
document (author, date created,
date to be reviewed)
Descriptive
• Information that describes the object
(title, subject, audience)
Semantic
• Ability to extract concepts from within
content and generate the metadata
(intelligent metadata)
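As a rough illustration of the four types, the sketch below holds them in one record. Only the intrinsic values are read from the file itself; the other values are typed in by hand. The class name `DocumentMetadata`, the field values, and the file name are all assumptions made for this example.

```python
import os
import tempfile
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DocumentMetadata:
    intrinsic: dict = field(default_factory=dict)       # extracted directly from the object
    administrative: dict = field(default_factory=dict)  # used to manage the document
    descriptive: dict = field(default_factory=dict)     # describes the object
    semantic: list = field(default_factory=list)        # concepts found within the content


def intrinsic_metadata(path):
    """Read the values that need no human input: file name and size."""
    return {"file_name": os.path.basename(path), "size_bytes": os.path.getsize(path)}


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "policy.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Retention policy")

    record = DocumentMetadata(
        intrinsic=intrinsic_metadata(path),
        administrative={"author": "J. Smith", "review_date": date(2018, 1, 1)},
        descriptive={"title": "Retention policy", "audience": "All staff"},
        # Entered by hand here; semantic metadata is the part a concept
        # extraction engine would be expected to generate.
        semantic=["records retention"],
    )

print(record.intrinsic)
```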

Why is Metadata So Hard to Get Right?
A manual metadata approach will fail 95% of the time

Metadata
Advantages
• Ability to develop a single repository of organizationally relevant
metadata to be made available to any application that requires the use
of metadata
• Elimination of costs and errors associated with end user tagging
• Normalization of content across functional and geographic boundaries
to remove ambiguity in vocabulary
• Metadata managed and changed in one place
• Ability to apply policy consistently across diverse repositories and
applications
• Provide flexibility to rapidly make changes to the repository for
regulatory compliance where changes are immediately available for
use by applications

Automatic generation of compound term metadata

Auto-classification
“By itself the search function has limited value. The real value of search
and information access technologies is in the ongoing efforts needed to
establish effective taxonomies, to index and classify content of all kinds, in
order to provide meaningful results.” Tom Eid, Research Vice President
Gartner Group

What is Auto-classification?
• A feature found in some content management
systems or records management applications
that will scan the contents of a document and
automatically assign metadata, categories,
and keywords based on the document
contents
• Content-based assignment of one or more
pre-defined categories to documents
(records), usually machine learning, statistical
pattern recognition, or neural network
approaches that are used to construct
classifiers automatically

Auto-classification Systems – What Do They Do?
Document Preparation
• Split into language blocks (paragraphs, headings), formatting, layout
Parsing
• Entity extraction
• NLP: parts of speech, phrases
• Terms, variants
Weighting
• Frequency
• Location in text, phrase
• Proximity
• Combination
• Format of text
Classification
• If threshold reached
• Can influence search results
This is where rules vs statistics come into play…
Not all classification solutions are created equal
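The four stages above can be sketched in a few lines. This is a toy illustration, not any vendor's method: it weights only by frequency and by location (heading versus body), and leaves out proximity, combination, and text format. The weights, the threshold, the "#" heading marker, and the clue words are all assumed values.

```python
import re

HEADING_WEIGHT = 3.0  # assumed: a clue in a heading counts more than one in body text
BODY_WEIGHT = 1.0
THRESHOLD = 4.0       # assumed cut-off for assigning a term


def prepare(text):
    """Document preparation: split into (kind, block) pairs.

    Lines starting with '#' are treated as headings (an assumption of this sketch).
    """
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind = "heading" if line.startswith("#") else "paragraph"
        blocks.append((kind, line.lstrip("# ").lower()))
    return blocks


def score(blocks, clue):
    """Parsing and weighting: count whole-word matches, weighted by location."""
    total = 0.0
    for kind, block in blocks:
        count = len(re.findall(r"\b" + re.escape(clue) + r"\b", block))
        total += count * (HEADING_WEIGHT if kind == "heading" else BODY_WEIGHT)
    return total


def classify(text, clues_by_term):
    """Classification: keep only the terms whose score reaches the threshold."""
    blocks = prepare(text)
    result = {}
    for term, clues in clues_by_term.items():
        term_score = sum(score(blocks, clue) for clue in clues)
        if term_score >= THRESHOLD:
            result[term] = term_score
    return result


text = (
    "# Records retention\n"
    "The retention schedule covers email.\n"
    "Email is kept for seven years."
)
clues_by_term = {
    "Records Management": ["retention", "records"],
    "Travel": ["flight", "hotel"],
}
print(classify(text, clues_by_term))
```

The heading match is what pushes the document over the threshold here, which is the kind of behavior the "Location in text" weighting bullet describes.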

Auto-classification Systems
Keyword
• Boolean operators add a degree of sophistication,
but also tend to improve precision at the expense
of recall, because any document that does not
match the Boolean expression is ignored
• The majority of search users are unable to
formulate even basic Boolean expressions
Linguistic
• No commitment to a taxonomic tree
• Related to parts of speech, syntactic parses,
or semantic interpretations
• Typically not scalable
• Usually delivered as pre-configured for an
industry, hard to integrate your unique
organizational vocabulary
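The precision-versus-recall trade-off of Boolean keyword rules can be seen in a small example. The helper `matches` and the three sample documents are invented for illustration.

```python
def matches(text, all_of=(), none_of=()):
    """Evaluate a simple Boolean rule: every word in all_of, no word in none_of."""
    words = set(text.lower().split())
    return all(w in words for w in all_of) and not any(w in words for w in none_of)


docs = {
    "a": "jaguar car dealership opens new showroom",
    "b": "jaguar spotted in the rainforest",
    "c": "luxury automobile maker jaguar reports record sales",
}

# Rule: jaguar AND car
hits = [name for name, text in docs.items() if matches(text, all_of=("jaguar", "car"))]
print(hits)
```

The rule correctly rejects document "b" (precision), but it also ignores document "c", which is about the car maker and simply uses the word "automobile" (lost recall).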

Auto-classification Systems
Semantic Networks
• Refers to a set of relationships between
concepts and words, including parts of
speech and real-world relationships
• These can include rules of various types,
not just Boolean
Machine Learning
• Subfield of computer science (CS)
and artificial intelligence (AI) that deals with
the construction and study of systems that
can learn from data, rather than follow only
explicitly programmed instructions

Training Sets
• Specify a set of documents that should be
classified against each term, this becomes the
training set
• If errors, provide more pre-classified documents
to the training set
• Repeat as necessary
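The train, check, and retrain loop can be sketched with a deliberately crude classifier based on word counts. Real products use far more capable models; the functions `train` and `classify`, the two terms, and the sample sentences are assumptions made for this example.

```python
from collections import Counter


def train(training_set):
    """Build one word-count profile per term from its pre-classified documents."""
    return {
        term: Counter(word for doc in docs for word in doc.lower().split())
        for term, docs in training_set.items()
    }


def classify(text, profiles):
    """Return the term whose profile shares the most word occurrences with the text."""
    words = text.lower().split()
    scores = {term: sum(profile[w] for w in words) for term, profile in profiles.items()}
    return max(scores, key=scores.get)


training_set = {
    "Contracts": ["this agreement is between the parties"],
    "Invoices": ["invoice total amount due"],
}
new_document = "remit payment for the balance"

# First attempt: the small training set leads to the wrong term.
first_guess = classify(new_document, train(training_set))

# Provide another pre-classified document for the term that was missed, then retrain.
training_set["Invoices"].append("please remit payment for the outstanding balance")
second_guess = classify(new_document, train(training_set))

print(first_guess, "->", second_guess)
```

The example shows why this approach can be costly: every term needs enough example documents before its results become reliable.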
Rule-based
• Rule-based classifiers allow the criteria that
cause classifications to be explicitly defined
• Two types
• Exact matching based on keywords, phrases,
Boolean
• Deliver a binary result – the document
either matches the term or it does not
• Fuzzy matching that accumulates evidence
that a document matches each term – sort of
Auto-classification Systems
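The difference between the two rule types can be sketched as follows. Exact matching returns a yes or no answer, while fuzzy matching adds up evidence and compares it with a threshold. The clue weights, the threshold, and the sample text are assumed values, not taken from any product.

```python
def exact_match(text, phrase):
    """Binary result: the phrase is either present or it is not."""
    return phrase.lower() in text.lower()


def fuzzy_score(text, weighted_clues):
    """Accumulate evidence: add the weight of every clue found in the text."""
    lowered = text.lower()
    return sum(weight for clue, weight in weighted_clues.items() if clue in lowered)


text = "Invoice for consulting services, payment due in 30 days"

# Exact rule for the term "Accounts Payable": requires one specific phrase.
is_exact = exact_match(text, "purchase order")

# Fuzzy rule for the same term: several clues, each contributing some evidence.
clues = {"invoice": 0.5, "payment due": 0.25, "purchase order": 0.5}
evidence = fuzzy_score(text, clues)
is_fuzzy = evidence >= 0.7  # assumed threshold

print(is_exact, evidence, is_fuzzy)
```

The exact rule misses the document because one phrase is absent; the fuzzy rule still assigns the term because the remaining clues supply enough evidence.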

Auto-classification in action

Taxonomies
“The metadata infrastructure provides the critical glue that binds the
information infrastructure to the underlying IT infrastructure.
Sound information governance practices would take advantage of the
metadata infrastructure, to ensure that content and data are managed
consistently and adhere to written policies, across on-premises and
cloud-based environments.” IDC

Taxonomies
Taxonomy
• A taxonomy is an organized set of
concepts or definitions, usually labeled
keywords
• For search engines, a taxonomy can
also be a set of organized searches
• Taxonomies are typically nested in a
hierarchical manner, often called a ‘tree’
• Subject-based taxonomy – created by
domain experts
• Content-based taxonomy – organizing
the data you already have
• Behavior-based taxonomy – driven by
search analytics, user tagging, or
vocabulary analysis

Types of Taxonomies
List, Picklist, Controlled Vocabulary, Authority Files
List of lead or preferred terms, selected by the end
user, may or may not have relationships among the
terms, can include a synonym ring
Synonym Lists
The use of synonyms allows one concept to be
instantiated as the same as the other, but still
allows a term to be preferred over another
Hierarchical
Each content item resides in only one category,
referred to as a ‘tree’
• Piano
• Musical instrument

Types of Taxonomies
Polyhierarchical, Faceted, Thesauri
Content items can exist in more than one category,
more structured controlled vocabulary, provides
information about each term and its relationship to
other terms, features of a hierarchical taxonomy
plus associative relationships
• Piano
• Musical instrument
• Stringed instrument
• Percussion instrument
Ontology
Multiple taxonomies with additional relationships
added to specify concepts within a domain
Marlene Rockmore – The Taxonomy Blog
Heather Hedden – The Accidental Taxonomist
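The Piano example can be expressed as a small data structure. This sketch reads the slide as placing Piano under both Stringed instrument and Percussion instrument, which is what makes the structure polyhierarchical. The synonym "Pianoforte" and the function names are assumptions added for illustration.

```python
# Each term maps to its parent terms. More than one parent means polyhierarchy.
parents = {
    "Musical instrument": [],
    "Stringed instrument": ["Musical instrument"],
    "Percussion instrument": ["Musical instrument"],
    "Piano": ["Stringed instrument", "Percussion instrument"],
}

# Synonym list: a non-preferred label points at the preferred term.
synonyms = {"Pianoforte": "Piano"}


def preferred(term):
    """Resolve a synonym to its preferred term."""
    return synonyms.get(term, term)


def ancestors(term):
    """Collect every broader term reachable from the given term."""
    found = set()
    stack = [preferred(term)]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def is_strict_hierarchy(parent_map):
    """A strict hierarchy ('tree') allows at most one parent per term."""
    return all(len(p) <= 1 for p in parent_map.values())


print(sorted(ancestors("Pianoforte")))
print(is_strict_hierarchy(parents))
```

Removing one of Piano's two parents would turn the same structure back into a simple hierarchical tree.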

Set up a taxonomy node, suggest clues for class, document feedback

SharePoint Term Store
• Introduced in 2010
• Provides infrastructure for
taxonomy management
• Managed metadata properties
designed for hierarchical
metadata
• Integrated with search via the
refinement panel
SharePoint has no automatic generation of metadata
SharePoint has no auto-classification capability
SharePoint has no facility to generate concepts

SharePoint Term Store
Globally Unique Identifiers (GUIDs)
• SharePoint uses GUIDs to identify
taxonomies and terms
• GUIDs must be preserved when
updating term sets
• GUIDs need to be synchronized between
farms
• Concept Searching preserves GUIDs
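Why preserving GUIDs matters can be shown without any SharePoint code. The sketch below models a term set as a plain dictionary of label to GUID; it does not call the SharePoint API, and `update_term_set` is a hypothetical helper. The point is only that existing terms keep their identifier across an update, so content already tagged with that identifier stays linked.

```python
import uuid


def update_term_set(existing, new_labels):
    """Return the updated term set.

    Terms that already exist keep their GUID; only new terms get a fresh one.
    """
    return {
        label: existing[label] if label in existing else uuid.uuid4()
        for label in new_labels
    }


# Hypothetical term set as it stands before the update.
existing = {"Contracts": uuid.uuid4(), "Invoices": uuid.uuid4()}

# The taxonomy is revised: one term is added, the others are unchanged.
updated = update_term_set(existing, ["Contracts", "Invoices", "Leases"])

for label, guid in updated.items():
    status = "preserved" if existing.get(label) == guid else "new"
    print(f"{label}: {guid} ({status})")
```

If the update had instead created new GUIDs for "Contracts" and "Invoices", documents tagged with the old identifiers would no longer resolve to those terms.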

Automatic, real-time update of the SharePoint Term Store

Return On Investment

Return On Investment – Real World Savings
Pique Solutions
The Business Solutions
• Search
• Records Management
• Intelligent Migration
• Data Security/Confidentiality
• eDiscovery/Litigation
Support, FOIA
• Information Governance
• Text Analytics
• Business Social Networking
• Collaboration
• Content Lifecycle
Management
• Metadata Management
• Research
• Knowledge Management

Next Expert Webinar
Groundbreaking and Game-changing Enterprise Search
Wednesday, March 8th 2017
Register
Concept Searching and strategic partner C/D/H discuss what intelligent
enterprise search should be.
This webinar demonstrates a solution unique in the marketplace, which
overcomes the limitations of other enterprise search engines.
Read more and register in the Upcoming Webinars area of our website.

Thank You
Michael Paye
Chief Technology Officer
Concept Searching
www.conceptsearching.com
Twitter @conceptsearch