Dynamic Classification Workshop Claude Vogel Roadmap & Quality Metrics.
Dynamic Classification Workshop
Claude Vogel
Roadmap & Quality Metrics
Outline = Roadmap
• Definitions
• Step by step
  • Phase 1
    • Taxonomy design [QA]
    • Implementation & Tests
      • Lexicon extraction [QA]
      • Meta data generation [QA]
  • Phase 2
    • Classification design [QA]
    • Implementation & Tests
      • Portal generation [QA]
• Conclusion
Your Problem
• Hit lists are inefficient
• Information is unstructured
• Information structure is irrelevant
Define “Find”
• I’m looking for an “APARTMENT in CARLSBAD”
• I end up with a STUDIO in OCEANSIDE
• “Find” is a result, not a starting point
• “Find” is not a search + retrieval system
• “Find” is a dynamic process
[Diagram labels: Apartment, Studio; Carlsbad, Oceanside]
Relate available information to OUR decision-making processes
Dynamic Classification
Rationale: Associate a semantic signature with structured and unstructured sources, then use this semantic representation to slice and dice the sources.
• Example 1: Endeca
  • Meta-data index
  • Parametric classification
• Example 2: Convera
  • Taxonomic index
  • Topical classification
Reduce Complexity
Domestic Sales and Marketing?
Jobs and Marketing?
Categorize…
Bonus
Domestic Sales
Marketing
Jobs
…And Classify!
Bonus
Domestic Sales
Marketing
Jobs
Domestic Sales and Marketing?
…And Classify Again!
Jobs and Marketing?
Bonus
Domestic Sales
Marketing
Jobs
Leverage K-Assets
TAGS
Categories = Essential Knowledge
Africa
Somalia
“A reasonably stable definition of the basic components of the world”
Genus to species
Munitions
Bombs
Classification = Accidental Knowledge
Africa
Missiles
Missiles
Africa
“A relevant answer to a practical problem”
Whatever
A Twofold Process
1. Taxonomy-driven categorization
  • Steady
  • Accurate
  • Scalable
2. Classification-driven user interface
  • Flexible
  • Relevant
  • Focused
Glossary
• Paradigmatic models
  • Ontology, Taxonomy
• Practical models
  • Inventory, Catalog, Classification
The Semiotic Triangle
Concept
Reference
Mammals Carnivora Canidae Canids Boxer
… It stands about 56 to 61 cm (about 22 to 24 in) high and weighs about 30 kg (about 66 lb) Source: Microsoft Encarta.
Word
“Boxer”
Boxer
Lexicon, Taxonomy, Catalog
Catalog
Taxonomy
Lexicon
Ontology
• An ontology is a foundation of categories representing a view of the world. An ontology reflects the commonly used and trusted breakdown of categories. For example, the breakdown of news items into categories of ‘World’, ‘Sports’, ‘Politics’, etc. is ontological.
Taxonomy
• A taxonomy is a hierarchical system describing genera and species. Species derive from a common genus and are hierarchically represented according to their essential characteristics and differences. For example, animals are categorized with the "Taxonomy of Life" which separates mammals from birds and spiders from insects, based on proper features and relative differences. This genus to species nomenclature is highlighted by terminology which moves from generic terms to binomial terms through lexical derivation and compounding.
• A taxonomy doesn’t deal with things, but with the essence of things: a taxonomy is based on an ontology.
Inventory, Catalog
• Inventory
  • List of things which stand for themselves, as they are, where they are.
• Catalog
  • Consolidated inventory, introducing for that purpose some kind of elementary classification.
• In both cases, the things listed have a unique and non-ambiguous name: e.g. URL, serial number, etc.
Classification
1. Arrangement of things according to some of their properties
2. Arrangement of types of things according to some of their properties
Multiple classification systems might combine multiple ontologies in multiple ways.
Things might have multiple locations in any given classification.
Thesaurus Nomenclature
Glossary
• ANSI/NISO Z39.19-1993
• A thesaurus is a controlled vocabulary arranged in a known order and structured so that equivalence, homographic, hierarchical, and associative relationships among terms are displayed clearly and identified by standardized relationship indicators that are employed reciprocally.
• The primary purposes of a thesaurus are (a) to facilitate retrieval of documents and (b) to achieve consistency in the indexing of written or otherwise recorded documents and other items, mainly for postcoordinate information storage and retrieval systems.
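The reciprocal relationship indicators in the definition above can be illustrated with a small data structure. This is a minimal sketch, not part of the workshop material: the `Thesaurus` class and the sample terms are hypothetical, while BT/NT (broader/narrower term), RT (related term), and USE/UF (use/used for) are the indicators customarily used with Z39.19-style thesauri.

```python
from collections import defaultdict

# Reciprocal indicator for each relationship: BT/NT (hierarchical),
# RT/RT (associative), USE/UF (equivalence).
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


class Thesaurus:
    """Hypothetical container: term -> indicator -> set of related terms."""

    def __init__(self):
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, term, indicator, other):
        """Record a relationship and its reciprocal in one step."""
        self.relations[term][indicator].add(other)
        self.relations[other][RECIPROCAL[indicator]].add(term)


t = Thesaurus()
t.add("Gun Sights", "BT", "Sights")      # Sights gets NT Gun Sights
t.add("Gunsights", "USE", "Gun Sights")  # Gun Sights gets UF Gunsights
```

Storing both directions at insertion time keeps the vocabulary consistent: a lookup from either term finds the other without a separate reverse index.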
Terrorism
Vertical Cartridges
Weapons
Geography
Plug and Play
Example 1: Geography
Africa
  Algeria
  Angola
Asia
  Afghanistan
  Armenia
Europe
  Albania
  Andorra
Middle East
  Bahrain
  Iran
North and Central America
  Antigua and Barbuda
  Bahamas
Pacific
  Australia
  Fiji
South America
  Argentina
  Bolivia
U.S.
  Alabama
  Alaska
Example 2: Defense
Defense Communications
Satellite Communications
Tactical Communications
Defense Systems
Air Defense
Antiaircraft Defense Systems
Gun Air Defense Systems
Antimissile Defense Systems
Forward Area Air Defense Systems
Terminal Defense
Aircraft Defense Systems
Antisubmarine Defense Systems
Antiswimmer Defense Systems
Countermeasures
Acoustic Countermeasures
Ordnance
Fire Control Systems
Sights
Gun Sights
Radar Gun Sights
Unique Beginner
Life Form
Generic
Specific
Varietal
Taxonomy Design Canon
Example: Breads
Ontology Proliferation
Mass Nouns
• Linnaeus: Higher taxa are artefacts: “An order is a subdivision of classes needed to avoid placing together more genera than the mind can follow.” Philosophia Botanica
• Some life-form categories are created only to group objects together. The terms associated with them are often mass nouns (versus count nouns) like “furniture”: “a kind of things of different kinds made by people to etc.”
Synonyms
Person
Unwelcome person
Unpleasant person
Selfish person
Opportunist
Backscratcher
(WordNet)
Cycles
• Life-form
  Genus
  Species
• Life-form (mass noun)
  Genus (having derived forms)
  Species (derived from genus)
Ontology Vacuum
Acceptance
Product Acceptance
Accountability
Social Responsibility
Social Investing
Accountants
Public Accountants
Cpas
Attorney Cpas
Accounting Firms
Big Five Accounting Firms
Big Six Accounting Firms
Unbalanced derivation
Acceptance
Product Acceptance
Accidents
Accident Prevention
Aircraft Accidents and Safety
Air Traffic Control
Hijacking
Boating Accidents and Safety
Construction Accidents and Safety
Electrocutions
Falls
Firearm Accidents and Safety
Household Accidents and Safety
Nuclear Accidents and Safety
Occupational Accidents
Industrial Accidents
Occupational Safety
Indoor Air Quality
Railroad Accidents and Safety
Ship Accidents and Safety
Lighthouses
Swimming Accidents and Safety
Drownings
Traffic Accidents and Safety
Hit and Run Accidents
Duplicated Paths = Classification schema
Tax
Individuals Corporations
Assets Liability Assets Liability
Tax
Individuals Corporations
Assets Liability
Individuals Corporations
Split Paradigms in Multiple Taxonomies
Loans Debts
Liabilities Assets
Tax items Tax payers
Organizations
Individuals
Assoc.
Corporations
Taxonomy 101
1. Identify the main paradigms
2. Look for thesauri
3. Focus on taxonomy first
4. Split partonomies
5. Clean up ontology
6. Check levels, overlaps, etc.
7. Review all synsets
Sources
• Dispersion (Multiplicity, Size, Homogeneity)
• Refresh
• Access

Features                   Internet, News, E-Mail   Reports, Patents   E-Trade, Logs
Informative content        -                        +                  +
Number of topics covered   +                        +                  -
Structured information     -                        +                  +
Size of records            -                        +                  -
Number of records          +                        -                  +
Taxonomy Activation
Geography
  Africa
    Algeria
    Angola
    Kenya
      Nairobi
    Tanzania
      Dar es Salaam
  Asia
    Afghanistan
    Armenia
[Diagram labels: Nairobi, Dar es Salaam]
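One way to read "taxonomy activation" is that a term found in a source switches on its node and every ancestor above it. The sketch below works under that assumption only; the `PARENT` table, the function names, and the sample sentence are hypothetical, and the plain substring match is deliberately naive.

```python
# Hypothetical mini-taxonomy based on the Geography branch; child -> parent.
PARENT = {
    "Africa": "Geography",
    "Asia": "Geography",
    "Kenya": "Africa",
    "Tanzania": "Africa",
    "Nairobi": "Kenya",
    "Dar es Salaam": "Tanzania",
}


def path_to_root(node):
    """Return the node followed by all of its ancestors, most specific first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def activate(text):
    """Map each taxonomy term found in the text to its activated path."""
    lowered = text.lower()
    terms = set(PARENT) | set(PARENT.values())
    return {t: path_to_root(t) for t in sorted(terms) if t.lower() in lowered}


hits = activate("Embassy attacks in Nairobi and Dar es Salaam")
```

Here `hits` holds one path per matched city, each ending at Geography. Substring matching would also fire on "Asia" inside "Malaysia", which is the kind of ambiguity that keyword-based latching has to control.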
Smart Latching
Rifles
Gun sight
Weapons
Disambiguation
Rifles
Weapons
Control the ambiguity generated by the keyword based latching mode used by taxonomy expansion.
Sight
Fire Control Systems
Gun sight
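A simple way to control this ambiguity is to let a keyword latch onto a node only when supporting context appears in the same text. The following is a minimal sketch under that assumption; the `CONTEXT` and `LATCH` tables are hypothetical and do not describe how any particular product disambiguates.

```python
# Hypothetical context vocabulary for each taxonomy node (not from the slides).
CONTEXT = {
    "Sights": {"gun", "rifle", "rifles", "weapon", "weapons"},
}
# Keyword -> taxonomy node it would latch onto without any check.
LATCH = {"sight": "Sights", "sights": "Sights"}


def latch(text):
    """Return nodes whose keyword appears together with supporting context."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    nodes = set()
    for keyword, node in LATCH.items():
        # Latch only if at least one context word for the node is present.
        if keyword in words and words & CONTEXT[node]:
            nodes.add(node)
    return nodes


weapon_hit = latch("The rifle was fitted with a new sight.")
scenery_hit = latch("The canyon was a beautiful sight.")
```

The first sentence latches onto Sights because "rifle" supports it; the second contains the keyword but no supporting context, so nothing is activated.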
Ranking Formula
Example:
2 occurrences of “chemical laser”
1 occurrence of “gun sight”
Defense
Lasers
Ordnance
Fire Control Systems
Sights
Specificity Concentration
Distance
Amplifiers
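The transcript does not preserve the ranking formula itself, so the following is only an illustrative scoring scheme, not the one presented in the workshop. It assumes that “chemical laser” latches onto Lasers and “gun sight” onto Sights, that Lasers and Ordnance sit directly under Defense, and that each occurrence contributes to its own node and, with a weight that decays with distance, to every ancestor.

```python
# Assumed tree shape for the example; child -> parent.
PARENT = {
    "Lasers": "Defense",
    "Ordnance": "Defense",
    "Fire Control Systems": "Ordnance",
    "Sights": "Fire Control Systems",
}
# Occurrence counts from the example, attached to the node each term
# is assumed to latch onto.
MATCHES = {"Lasers": 2, "Sights": 1}


def rank(matches):
    """Hypothetical score: count / (1 + distance), summed along each path."""
    scores = {}
    for node, count in matches.items():
        distance = 0
        current = node
        while current is not None:
            scores[current] = scores.get(current, 0.0) + count / (1 + distance)
            current = PARENT.get(current)
            distance += 1
    return scores


scores = rank(MATCHES)
```

Under these assumptions the most specific matched nodes score highest, while Defense collects a smaller contribution from both branches.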
XML Output
Tables
Charts
Classification = Matrix
Permutable Trees
Typical Structures
• Geography / Topic
  • Terrorism in Philippines
  • Criminal Law in Texas
  • Domestic Sales
  • Security in Building C
• Horizontal / Vertical
  • Petroleum Business
  • AML Regulations
• Vertical / Vertical
  • Chemical Compounds for Alzheimer
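The "matrix" and "permutable trees" idea can be sketched as crossing two taxonomy axes and pivoting them at will. This is a minimal illustration, assuming each document already carries one category per taxonomy; the documents, ids, and the `classify` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents, each tagged once per taxonomy.
DOCS = [
    {"id": "d1", "Geography": "Philippines", "Topic": "Terrorism"},
    {"id": "d2", "Geography": "Texas", "Topic": "Criminal Law"},
    {"id": "d3", "Geography": "Philippines", "Topic": "Criminal Law"},
]


def classify(docs, row_axis, col_axis):
    """Build a row -> column -> [doc ids] matrix from two taxonomy axes."""
    matrix = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        matrix[doc[row_axis]][doc[col_axis]].append(doc["id"])
    return matrix


# The same categorized repository, viewed through two permuted trees.
by_place = classify(DOCS, "Geography", "Topic")
by_topic = classify(DOCS, "Topic", "Geography")
```

The categorization is done once; only the order of the axes changes between the "Terrorism in Philippines" view and the "Philippines under Terrorism" view.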
XML Representation
Population Mechanism
Population Control
Spread
Mutual Information
All “Bomb truck” documents are “Kenya” documents
Some “Kenya” documents are “Bomb truck” documents
Example
High MI And Low Spread
Low MI And High Spread
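The asymmetry between the two folders can be made concrete with a small calculation. The sketch below assumes pointwise mutual information as the MI measure, which the slides do not specify, and uses an invented corpus of 100 documents; "spread" is not defined in the transcript and is not modeled here.

```python
import math


def pmi(docs_a, docs_b, total):
    """Pointwise mutual information (in bits) between two folder memberships."""
    joint = len(docs_a & docs_b) / total
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((len(docs_a) / total) * (len(docs_b) / total)))


def overlap(docs_a, docs_b):
    """Fraction of docs_a that also belongs to docs_b."""
    return len(docs_a & docs_b) / len(docs_a)


# Hypothetical corpus of 100 documents, identified by integers.
TOTAL = 100
kenya = set(range(0, 20))      # 20 documents filed under Kenya
bomb_truck = set(range(0, 5))  # 5 documents filed under Bomb truck

all_bomb_truck_in_kenya = overlap(bomb_truck, kenya)
some_kenya_in_bomb_truck = overlap(kenya, bomb_truck)
association = pmi(kenya, bomb_truck, TOTAL)
```

With these numbers every Bomb truck document is also a Kenya document, only a quarter of the Kenya documents are Bomb truck documents, and the positive PMI indicates that the two folders co-occur far more often than chance would predict.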
Typical Patterns: Over/Under Populated
Typical Patterns: Bottleneck
Typical Patterns: Interrupted Bell Curve
Typical Patterns: Multiple Cycles
10 Tests To Qualify Your Classification
1. Average size
2. Top folders size
3. Depth
4. Balance
5. Cycles
6. Interrupted cycles
7. “Strings”
8. Buried documents
9. The needle test
10. The false discovery test
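Several of these tests reduce to simple measurements on the folder tree. The sketch below covers tests 1, 3, and 4 only, with definitions chosen for illustration: the transcript does not say how the workshop measured them, and the `TREE` layout and document counts are hypothetical.

```python
# Hypothetical folder tree: name -> (child folder names, documents filed directly).
TREE = {
    "root": (["Africa", "Asia"], 0),
    "Africa": (["Kenya", "Tanzania"], 2),
    "Asia": ([], 40),
    "Kenya": ([], 10),
    "Tanzania": ([], 8),
}


def average_size(tree):
    """Test 1: mean number of documents filed directly in a folder."""
    return sum(count for _, count in tree.values()) / len(tree)


def depth(tree, node="root"):
    """Test 3: number of levels from this node down to its deepest leaf."""
    children, _ = tree[node]
    return 1 + max((depth(tree, c) for c in children), default=0)


def subtree_size(tree, node):
    """Documents filed in a folder and in everything below it."""
    children, count = tree[node]
    return count + sum(subtree_size(tree, c) for c in children)


def balance(tree, node="root"):
    """Test 4: smallest child subtree over largest (1.0 = perfectly balanced)."""
    sizes = [subtree_size(tree, c) for c in tree[node][0]]
    return min(sizes) / max(sizes) if sizes and max(sizes) else 1.0


report = (average_size(TREE), depth(TREE), balance(TREE))
```

Running the same measurements after each population pass gives a repeatable way to spot over-populated top folders or lopsided branches before users do.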
Conclusion
How Is It Useful?
Quickly point to the relevant information
Put extremely large amounts of information in perspective
Maintain multiple views on a consistent repository
Roadmap For Success
1. Build a FOUNDATION
2. QUALIFY results
3. Attain MATURITY
Project Metrics
• Typical Planning
  • 2-3 weeks
• Typical Team
  • 1 KE + Experts + Users Panel
• Typical Cost
  • Categorization Software
  • Internal Support
• Typical ROI
  • $2M / Year for 5,000 Users
Dynamic Classification Workshop
Claude Vogel
http://www.convera.com
Roadmap & Quality Metrics