Dynamic Classification Workshop Claude Vogel Roadmap & Quality Metrics.

Post on 26-Dec-2015


Dynamic Classification Workshop

Claude Vogel

Roadmap & Quality Metrics

Outline = Roadmap

• Definitions

• Step by step
  • Phase 1
    • Taxonomy design [QA]
    • Implementation & Tests
      Lexicon extraction [QA]
      Meta data generation [QA]
  • Phase 2
    • Classification design [QA]
    • Implementation & Tests
      Portal generation [QA]

• Conclusion

Your Problem

• Hit lists are inefficient

• Information is unstructured

• Information structure is irrelevant

Define “Find”

• I’m looking for an “APARTMENT in CARLSBAD”

• I end up with a STUDIO in OCEANSIDE

• “Find” is a result, not a starting point

• Find is not: Search + Retrieval system

• Find is a dynamic process

Apartment

Carlsbad

Oceanside

Oceanside

Studio

Relate available information to OUR decision-making processes

Dynamic Classification

Rationale: Associate a semantic signature with structured and unstructured sources, then use this semantic representation to slice and dice the sources.

• Example 1: Endeca
  • Meta-data index
  • Parametric classification

• Example 2: Convera
  • Taxonomic index
  • Topical classification

Reduce Complexity

Domestic Sales and Marketing?

Jobs and Marketing?

Categorize…

Bonus

Domestic Sales

Marketing

Jobs

…And Classify!

Bonus

Domestic Sales

Marketing

Jobs

Domestic Sales and Marketing?

…And Classify Again!

Jobs and Marketing?

Bonus

Domestic Sales

Marketing

Jobs

Leverage K-Assets

TAGS

Categories = Essential Knowledge

Africa

Somalia

“A reasonably stable definition of the basic components of the world”

Genus to species

Munitions

Bombs

Classification = Accidental Knowledge

Africa

Missiles

Missiles

Africa

“A relevant answer to a practical problem”

Whatever

A Twofold Process

1. Taxonomy-driven categorization
   • Steady
   • Accurate
   • Scalable

2. Classification-driven user interface
   • Flexible
   • Relevant
   • Focused

Glossary

• Paradigmatic models
  • Ontology, Taxonomy

• Practical models
  • Inventory, Catalog, Classification

The Semiotic Triangle

Concept

Reference

Mammals Carnivora Canidae Canids Boxer

… It stands about 56 to 61 cm (about 22 to 24 in) high and weighs about 30 kg (about 66 lb) Source: Microsoft Encarta.

Word

“Boxer”

Boxer

Lexicon, Taxonomy, Catalog

Catalog

Taxonomy

Lexicon

Ontology

• An ontology is a foundation of categories representing a view of the world. An ontology reflects the commonly used and trusted breakdown of categories. For example, the breakdown of news items into categories of ‘World’, ‘Sports’, ‘Politics’, etc. is ontological.

Taxonomy

• A taxonomy is a hierarchical system describing genera and species. Species derive from a common genus and are hierarchically represented according to their essential characteristics and differences. For example, animals are categorized with the "Taxonomy of Life", which separates mammals from birds and spiders from insects, based on proper features and relative differences. This genus-to-species nomenclature is reflected in the terminology, which moves from generic terms to binomial terms through lexical derivation and compounding.

• A taxonomy doesn’t deal with things, but with the essence of things: a taxonomy is based on an ontology.
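As a minimal sketch of the genus-to-species idea, assume a taxonomy is stored as a child-to-parent table. The `PARENT` table and the `lineage` and `is_a` helpers are illustrative names, not part of the workshop material; the entries reuse the "Boxer" example from the semiotic triangle slide.

```python
# Hypothetical child -> parent table built from the "Boxer" example.
PARENT = {
    "Boxer": "Canids",
    "Canids": "Canidae",
    "Canidae": "Carnivora",
    "Carnivora": "Mammals",
}


def lineage(term):
    """Return the path from the most generic category down to `term`."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def is_a(term, genus):
    """True when `genus` is `term` itself or one of its ancestors."""
    return genus in lineage(term)


print(lineage("Boxer"))
print(is_a("Boxer", "Carnivora"))
```

Storing only the parent of each term keeps every species attached to exactly one genus, which is the property that separates a taxonomy from a free-form classification.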

Inventory, Catalog

• Inventory
  • List of things which stand for themselves, as they are, where they are.

• Catalog
  • Consolidated inventory, introducing for that purpose some kind of elementary classification.

• In both cases, the things listed have a unique and non-ambiguous name: e.g. URL, serial number, etc.

Classification

1. Arrangement of things according to some of their properties

2. Arrangement of types of things according to some of their properties

Multiple classification systems might combine multiple ontologies in multiple ways.

Things might have multiple locations in any given classification.
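The two statements above can be illustrated with a small sketch, assuming each item carries multi-valued properties. The item names, property names, and the `classify` helper are made up for illustration. The same items are arranged once as Africa > Missiles and once as Missiles > Africa, and an item with two regions lands in two locations.

```python
from collections import defaultdict

# Hypothetical items; each property may hold several values.
ITEMS = {
    "report-1": {"region": ["Africa"], "topic": ["Missiles"]},
    "report-2": {"region": ["Africa", "Asia"], "topic": ["Bombs"]},
}


def classify(items, first, second):
    """Arrange items in a two-level folder tree keyed by two properties."""
    tree = defaultdict(lambda: defaultdict(list))
    for name, props in items.items():
        for a in props[first]:
            for b in props[second]:
                tree[a][b].append(name)
    return tree


by_region = classify(ITEMS, "region", "topic")  # e.g. Africa > Missiles
by_topic = classify(ITEMS, "topic", "region")   # e.g. Missiles > Africa

print(dict(by_region["Africa"]))
print(dict(by_topic["Bombs"]))
```

Swapping the order of the two properties changes the view without touching the underlying items.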

Thesaurus Nomenclature

Glossary

• ANSI/NISO Z39.19-1993

• A thesaurus is a controlled vocabulary arranged in a known order and structured so that equivalence, homographic, hierarchical, and associative relationships among terms are displayed clearly and identified by standardized relationship indicators that are employed reciprocally.

• The primary purposes of a thesaurus are (a) to facilitate retrieval of documents and (b) to achieve consistency in the indexing of written or otherwise recorded documents and other items, mainly for postcoordinate information storage and retrieval systems.
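A minimal sketch of reciprocal relationship indicators follows. It uses the conventional BT/NT (hierarchical), RT (associative), and USE/UF (equivalence) indicators; the `Thesaurus` class is a hypothetical illustration, homograph qualifiers are left out, and the "Explosive devices" and "Missiles" entries are made up.

```python
from collections import defaultdict

# Each indicator and the indicator recorded on the other term.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


class Thesaurus:
    def __init__(self):
        # term -> indicator -> set of related terms
        self.relations = defaultdict(lambda: defaultdict(set))

    def relate(self, term, indicator, other):
        """Record a relationship and its reciprocal in one step."""
        self.relations[term][indicator].add(other)
        self.relations[other][RECIPROCAL[indicator]].add(term)

    def preferred(self, term):
        """Follow a USE reference to the preferred term, if any."""
        targets = self.relations[term]["USE"]
        return next(iter(targets)) if targets else term


t = Thesaurus()
t.relate("Bombs", "BT", "Munitions")           # hierarchical
t.relate("Explosive devices", "USE", "Bombs")  # equivalence (made-up entry)
t.relate("Bombs", "RT", "Missiles")            # associative (made-up entry)

print(t.preferred("Explosive devices"))
print(sorted(t.relations["Munitions"]["NT"]))
```

Writing both directions in `relate` is what keeps the indicators "employed reciprocally": a broader term always lists its narrower terms, and a non-preferred term always points back to the entry that replaces it.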

Outline = Roadmap

• Definitions

• Step by step
  • Phase 1
    • Taxonomy design [QA]
    • Implementation & Tests
      Lexicon extraction [QA]
      Meta data generation [QA]
  • Phase 2
    • Classification design [QA]
    • Implementation & Tests
      Portal generation [QA]

• Conclusion

Terrorism

Vertical Cartridges

Weapons

Geography

Plug and Play

Example 1: Geography

Africa
  Algeria
  Angola

Asia
  Afghanistan
  Armenia

Europe
  Albania
  Andorra

Middle East
  Bahrain
  Iran

North and Central America
  Antigua and Barbuda
  Bahamas

Pacific
  Australia
  Fiji

South America
  Argentina
  Bolivia

U.S.
  Alabama
  Alaska

Example 2: Defense

Defense Communications

Satellite Communications

Tactical Communications

Defense Systems

Air Defense

Antiaircraft Defense Systems

Gun Air Defense Systems

Antimissile Defense Systems

Forward Area Air Defense Systems

Terminal Defense

Aircraft Defense Systems

Antisubmarine Defense Systems

Antiswimmer Defense Systems

Countermeasures

Acoustic Countermeasures

Ordnance

Fire Control Systems

Sights

Gun Sights

Radar Gun Sights

Unique Beginner

Life Form

Generic

Specific

Varietal

Taxonomy Design Canon

Example: Breads

Ontology Proliferation

Mass Nouns

• Linnaeus: higher taxa are artefacts. “An order is a subdivision of classes needed to avoid placing together more genera than the mind can follow.” (Philosophia Botanica)

• Some life-form categories are created to group objects together. Terms associated with these are often mass nouns (as opposed to count nouns) like “furniture”: “a kind of things of different kinds made by people to etc.”

Synonyms

Person

Unwelcome person

Unpleasant person

Selfish person

Opportunist

Backscratcher

(WordNet)

Cycles

• Life-form
  Genus
  Species

• Life-form (mass noun)
  Genus (having derived forms)
  Species (derived from genus)

Ontology Vacuum

Acceptance

Product Acceptance

Accountability

Social Responsibility

Social Investing

Accountants

Public Accountants

CPAs

Attorney CPAs

Accounting Firms

Big Five Accounting Firms

Big Six Accounting Firms

Unbalanced derivation

Acceptance

Product Acceptance

Accidents

Accident Prevention

Aircraft Accidents and Safety

Air Traffic Control

Hijacking

Boating Accidents and Safety

Construction Accidents and Safety

Electrocutions

Falls

Firearm Accidents and Safety

Household Accidents and Safety

Nuclear Accidents and Safety

Occupational Accidents

Industrial Accidents

Occupational Safety

Indoor Air Quality

Railroad Accidents and Safety

Ship Accidents and Safety

Lighthouses

Swimming Accidents and Safety

Drownings

Traffic Accidents and Safety

Hit and Run Accidents

Duplicated Paths = Classification schema

Tax

Individuals Corporations

Assets Liability Assets Liability

Tax

Individuals Corporations

Assets Liability

Individuals Corporations

Split Paradigms in Multiple Taxonomies

Loans Debts

Liabilities Assets

Tax items Tax payers

Organizations

Individuals

Assoc.

Corporations

Taxonomy 101

1. Identify the main paradigms

2. Look for thesauri

3. Focus on taxonomy first

4. Split partonomies

5. Clean up ontology

6. Check levels, overlaps, etc.

7. Review all synsets

Outline

• Introduction

• Step by step
  • Phase 1
    • Taxonomy design [QA]
    • Implementation & Tests
      Lexicon extraction [QA]
      Meta data generation [QA]
  • Phase 2
    • Classification design [QA]
    • Implementation & Tests
      Portal generation [QA]

• Conclusion

Sources

• Dispersion (Multiplicity, Size, Homogeneity)

• Refresh

• Access

Features                   Internet, News, E-Mail   Reports, Patents   E-Trade, Logs
Informative content        -                        +                  +
Number of topics covered   +                        +                  -
Structured information     -                        +                  +
Size of records            -                        +                  -
Number of records          +                        -                  +

Taxonomy Activation

Geography

Nairobi

Africa
  Algeria
  Angola
  Kenya
    Nairobi
  Tanzania
    Dar es Salaam

Asia
  Afghanistan
  Armenia

Nairobi

Dar es Salaam

Dar es Salaam
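One way to picture taxonomy activation is as keyword latching: a term found in a document activates its node and every ancestor above it. The sketch below rests on that assumption and uses naive substring matching; the `PARENT` table and the `activate` function are illustrative and are not the product's actual mechanism.

```python
# Hypothetical child -> parent table for the Geography example.
PARENT = {
    "Nairobi": "Kenya",
    "Kenya": "Africa",
    "Dar es Salaam": "Tanzania",
    "Tanzania": "Africa",
    "Africa": "Geography",
}


def activate(text):
    """Return each matched term with the chain of categories it activates."""
    lowered = text.lower()
    activated = {}
    for term in PARENT:
        if term.lower() in lowered:  # naive substring latch
            chain, node = [], term
            while node in PARENT:
                node = PARENT[node]
                chain.append(node)
            activated[term] = chain
    return activated


hits = activate("Reports from Nairobi and Dar es Salaam were filed today.")
for term, chain in hits.items():
    print(term, "->", " > ".join(chain))
```

A document that only mentions a city is thereby filed under its country and continent as well, even though neither name appears in the text.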

Smart Latching

Rifles

Gun sight

Weapons

Disambiguation

Rifles

Weapons

Control the ambiguity generated by the keyword-based latching mode used by taxonomy expansion.

Sight

Fire Control Systems

Gun sight
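A minimal sketch of one possible disambiguation rule: an ambiguous keyword such as "sight" latches onto "Gun sight" only when a supporting context term appears in the same text. The `AMBIGUOUS` table, its context words, and the `latch` function are assumptions made for illustration; the transcript does not describe the actual rule.

```python
# Hypothetical table: ambiguous keyword -> (category, supporting context terms)
AMBIGUOUS = {
    "sight": ("Gun sight", {"rifle", "rifles", "weapon", "weapons", "gun"}),
}


def latch(text):
    """Latch an ambiguous keyword only when a context term co-occurs."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    categories = []
    for keyword, (category, context) in AMBIGUOUS.items():
        if keyword in words and words & context:
            categories.append(category)
    return categories


print(latch("The rifle was fitted with a new sight."))
print(latch("The coast was a welcome sight."))
```

The second sentence contains the keyword but none of the context terms, so no category is assigned.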

Ranking Formula

Example:

• 2 occurrences of “chemical laser”
• 1 occurrence of “gun sight”

Defense

Lasers

Ordnance

Fire Control Systems

Sights

[Figure: the taxonomy nodes above annotated with values between 0 and 3. Labels: Specificity, Concentration, Distance, Amplifiers.]
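The ranking formula itself is not recoverable from this transcript. The sketch below is therefore only an illustrative assumption: each occurrence count is credited to the node where the term latches and then propagated to ancestors with a discount of 1 / (1 + distance). The latching points ("chemical laser" on Lasers, "gun sight" on Sights) and the discount are hypothetical choices, not the author's formula.

```python
# Taxonomy path from the example slide (child -> parent).
PARENT = {
    "Lasers": "Defense",
    "Ordnance": "Defense",
    "Fire Control Systems": "Ordnance",
    "Sights": "Fire Control Systems",
}

# Occurrence counts from the example, attached to assumed latching nodes.
HITS = {"Lasers": 2, "Sights": 1}


def scores(hits, parent):
    """Credit each hit to its node and, discounted by distance, to ancestors."""
    result = {}
    for node, count in hits.items():
        distance = 0
        while node is not None:
            result[node] = result.get(node, 0.0) + count / (1 + distance)
            node = parent.get(node)
            distance += 1
    return result


ranking = scores(HITS, PARENT)
for name, value in sorted(ranking.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

Under these assumptions the most specific nodes keep the highest scores, while the shared root accumulates smaller contributions from both branches.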

XML Output

Tables

Charts

Outline

• Introduction

• Step by step
  • Phase 1
    • Taxonomy design [QA]
    • Implementation & Tests
      Lexicon extraction [QA]
      Meta data generation [QA]
  • Phase 2
    • Classification design [QA]
    • Implementation & Tests
      Portal generation [QA]

• Conclusion

Classification = Matrix

Permutable Trees

Typical Structures

• Geography / Topic
  • Terrorism in Philippines
  • Criminal Law in Texas
  • Domestic Sales
  • Security in Building C

• Horizontal / Vertical
  • Petroleum Business
  • AML Regulations

• Vertical / Vertical
  • Chemical Compounds for Alzheimer

XML Representation

Population Mechanism

Population Control

Spread


Mutual Information

All “Bomb truck” documents are also “Kenya” documents

Some “Kenya” documents are “Bomb truck” documents

Example

High MI And Low Spread

Low MI And High Spread
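The asymmetry in the example (every "Bomb truck" document is a "Kenya" document, but not the reverse) can be made concrete with a small sketch. It assumes categories are sets of document ids and uses pointwise mutual information as one common reading of "MI". The document ids are made up, the helper names are illustrative, and "Spread" is not computed because the transcript does not define it.

```python
import math


def pmi(docs_a, docs_b, total):
    """Pointwise mutual information (in bits) of two categories."""
    joint = len(docs_a & docs_b) / total
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((len(docs_a) / total) * (len(docs_b) / total)))


def coverage(docs_a, docs_b):
    """Share of the documents in A that are also in B."""
    return len(docs_a & docs_b) / len(docs_a)


# Made-up document ids, for illustration only.
kenya = {1, 2, 3, 4}
bomb_truck = {1, 2}
TOTAL = 8

print(coverage(bomb_truck, kenya))   # share of "Bomb truck" filed under "Kenya"
print(coverage(kenya, bomb_truck))   # share of "Kenya" filed under "Bomb truck"
print(pmi(kenya, bomb_truck, TOTAL))
```

The mutual information score is symmetric, so the direction of the relationship only shows up in the two coverage values.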

Typical Patterns: Over/Under Populated

Typical Patterns: Bottleneck

Typical Patterns: Interrupted Bell Curve

Typical Patterns: Multiple Cycles

10 Tests To Qualify Your Classification

1. Average size

2. Top folders size

3. Depth

4. Balance

5. Cycles

6. Interrupted cycles

7. “Strings”

8. Buried documents

9. The needle test

10. The false discovery test
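The first few tests can be approximated with simple counts over the folder tree. The sketch below is a hedged illustration: the tree, the document counts, and the exact metric definitions (average size as mean documents per folder, empty folders as a rough balance check) are assumptions, since the transcript only lists the test names.

```python
# Hypothetical folder tree: name -> (document count, child folders)
TREE = {
    "Africa": (3, {"Kenya": (5, {"Nairobi": (2, {})}), "Algeria": (0, {})}),
    "Asia": (4, {}),
}


def walk(folders, depth=1):
    """Yield (name, document count, depth) for every folder."""
    for name, (count, children) in folders.items():
        yield name, count, depth
        yield from walk(children, depth + 1)


def metrics(folders):
    """Compute a few population figures for a classification tree."""
    rows = list(walk(folders))
    sizes = [count for _, count, _ in rows]
    return {
        "average_size": sum(sizes) / len(sizes),
        "top_folder_sizes": {n: c for n, c, d in rows if d == 1},
        "depth": max(d for _, _, d in rows),
        "empty_folders": [n for n, c, _ in rows if c == 0],
    }


report = metrics(TREE)
for key, value in report.items():
    print(key, value)
```

Figures like these can be tracked between classification runs to spot over- or under-populated folders early.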

Conclusion

How Is It Useful?

Quickly point to the relevant information

Put in perspective extremely large amounts of information

Maintain multiple views on a consistent repository

Roadmap For Success

1. Build a FOUNDATION

2. QUALIFY results

3. Attain MATURITY

Project Metrics

• Typical Planning
  • 2-3 weeks

• Typical Team
  • 1 KE + Experts + Users Panel

• Typical Cost
  • Categorization Software
  • Internal Support

• Typical ROI
  • $2M / Year for 5,000 Users

Dynamic Classification Workshop

Claude Vogel

cvogel@convera.com

http://www.convera.com

Roadmap & Quality Metrics