
The Future of Metadata

Denise Bedford
World Bank

Presentation to Fall Metadata Forum
November 2, 2005

Department of Homeland Security

Meta-Future

Most of our information use and access today is based on an anonymous access model

It is increasingly clear that anonymous access to information and the packaging of information for single use contexts is neither sufficient for users nor an efficient use of development/engineering resources

We need to think in terms of contextualization and sensitization of information so that it can be used in any context where it pertains

In the future, information will flow – information, not the systems in which it lives or was created, will be our focus

Information needs to be agile and mobile – it needs to be sensitized to the contexts in which it might be used, to the interests of those who might use it, and to the applications that might consume it

Meta-Future

Envision a future like that described in the Netcentric Information Models formulated by the Dept. of Defense

Information is created, tagged, posted and shared

Any applications or users can – according to security privileges – use any information they can find, in any application they need to use to do their work

Technology becomes increasingly invisible but more logic based

More and different kinds of information such as reference sources need to be managed and maintained

This meta-future is heavily dependent upon the existence of rich, conceptual, sensitized, meaningful metadata

This future is now – it is simply a practical view of the Semantic Web

The problem with metadata

This future sounds wonderful and the contextualization vision is exciting but there’s just one problem… metadata

Metadata…
– Is expensive and time consuming to create
– Is sometimes subjective and not granular enough
– Doesn’t always address the ways that users and systems think about the information it describes
– May not tell us enough about the information to trust it
– May address only one context – the context for which it is created
– May live only in the source application where it was created
– May not be as accessible as the information asset

If a Meta-Future depends on metadata, we have to solve these problems

The problem with technologies

Many of the tools are so tightly integrated, you might generate rich metadata, but it will not make your information agile or mobile

Statistical clustering engines do not get us to persistent meaning or contextualization. Clustering engines are great for thresholding or pattern tracings, but they will not generate the kind of metadata we need to realize this future

We need semantic engines at the base of all our metadata efforts, and these engines need to be available in multiple languages -- semantics vary by language

Magic black box approaches are neither meaningful nor sustainable -- you need to have access to the programs through a user-friendly interface so you can adapt them to your environment without having to have programming knowledge

You need to have several different kinds of technologies to do what I’m going to describe today – not just one tool

Understanding the Dimensions of Contextualization

[Figure: a three-axis diagram (Content Dimension, User Dimension, Context Dimension) that positions capabilities across three access modes: Anonymous Access (Context Free), Information Gathering & Transformation (Context Sensitive – Person), and Information Diffusion (Context Sensitive – Group). Content-side labels include Topic Scheme, Business Activity Scheme, Country Scheme, Region Scheme, Topic Thesaurus, Content Metadata, Content Elements & Structure (XML), Centralized Collections and Metadata Management. User-side labels include Institutional Roles, Institutional Profiles, Communities of Practice, Social Group Profiles, Individual Profiles, Client Profiles, Partner Profiles, Authorization Rules and Authentication Rules. Capability labels include Browsing, Searching, Parametric Searching, Text Classification, Concept Extraction, Programmatic Metadata Capture, Results Clustering, Personal SDI, Social Group SDI, Task Oriented SDI, Concept Filtering, Collaborative Filtering, Recommender Engines, Syndication Engines, Content Aggregation, Content Repurposing, Q&A Systems, Translation Systems, Workflow Management and Online Training.]

Vision of Contextualization

We need to address metadata challenges not in a traditional way but in the future context – with the idea that metadata is contextualizable and sensitized – to support information agility and mobility

In order to achieve contextualization you need to have ‘extreme metadata’
– Metadata about the information
– Metadata about the user
– Metadata about the context
– Rich metadata designed to meet many functional requirements
– Metadata in multiple languages

Metadata needs to be ‘interpretable’ for and in a context
– Reference sources not only for traditional metadata but for all of the relationships and logic that are present in an ontology (simply different kinds of taxonomy representations)
– Metadata must reflect any context or interest that a user might express
– Still need to have some control over metadata in order to make it understandable in different contexts

New View of Ontology

[Figure: an entity-relationship diagram linking Content Entity, Content Parts, Content Elements, Metadata, User, Profile, Context, Contextual Matrix & Sensing, Contextual Logic, Business Rule and Rule Logic through relationships labelled Has, Has values, Contains, uses, Has relationship to, and Has Meaning in. Reference sources shown: Topic Class Scheme, Business Process Scheme, Thesaurus, Country Names, Region Names, Skill Sets/Competencies, Standard Statistical Variables, People Referenced, Orgs Referenced. Taxonomy types shown: Hierarchy, Flat Taxonomy, Network Taxonomy, Faceted Taxonomy, Ring Taxonomy.]

Getting to Rich Metadata

Given the future demand for rich, contextualizable metadata, and all of the traditional drawbacks… how will we achieve this future?

We need to look for a different model for creating and sustaining metadata and reference sources

We need to teach technologies how to capture the metadata we need and how to maintain our reference sources

I’d like to show you an example of how we might achieve that future

Please keep in mind that I’m showing you an example of what is possible – Enterprise Search, Authority Control/Entity Discovery

Fueling Semantic Search With Metadata

Or, … if Metadata is Dead, Semantic Web and Semantic Search Are Dead

[Figure: search screens annotated with the taxonomy type behind each element: flat taxonomy, hierarchical taxonomy, ring taxonomy, network taxonomy, and fielded search as a faceted taxonomy. Metadata is labelled as a more explicit view of faceted taxonomy.]
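The five taxonomy shapes named on these screens can be sketched as plain data structures. This is only an illustration under stated assumptions: the sample values, variable names and the two helper functions are hypothetical and do not come from the World Bank systems; only the five taxonomy types are taken from the slides.

```python
# Hypothetical sample values; only the five taxonomy shapes come from the slides.

# Flat taxonomy: a controlled list with no internal structure.
flat = ["Albania", "Brazil", "Chad"]

# Hierarchical taxonomy: each term maps to its narrower terms.
hierarchical = {
    "Education": ["Primary Education", "Tertiary Education"],
    "Primary Education": [],
    "Tertiary Education": [],
}

# Ring taxonomy: sets of terms treated as equivalent at search time.
ring = [{"poverty reduction", "poverty alleviation"}]

# Network taxonomy: typed relationships between concepts, as in a thesaurus.
network = [
    ("Primary Education", "related_to", "Literacy"),
    ("Tertiary Education", "broader", "Education"),
]

# Faceted taxonomy: independent fields, each drawing on its own value list.
faceted = {"topic": "Primary Education", "country": "Chad", "language": "fr"}


def expand_query(term: str) -> set[str]:
    """Return the term plus every equivalent term from any ring containing it."""
    expanded = {term}
    for synonyms in ring:
        if term in synonyms:
            expanded |= synonyms
    return expanded


def narrower_terms(term: str) -> list[str]:
    """Walk the hierarchy and collect every descendant of a term."""
    found = []
    for child in hierarchical.get(term, []):
        found.append(child)
        found.extend(narrower_terms(child))
    return found
```

A search front end could call expand_query on the user's term before matching and narrower_terms when the user browses down from a topic; fielded search then amounts to filtering on the keys of the faceted record.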

Building and Maintaining Taxonomies

Moving towards automated metadata generation means that catalogers shift their effort to reviewing the metadata generated and to more fully developing and maintaining subject headings/thesauri and classification schemes as part of a suite of categorization tools

Level of effort shifts to training and developing the tools and away from original cataloging and metadata capture

Continue to work closely with subject experts to define the controlled vocabularies and classification schemes

It means that you have to have a metadata infrastructure that looks something like that ontology we just reviewed

There is no silver bullet ontology tool out there that will do this work for you – your knowledge and skills are critical

Metadata Capture Methods

[Table: metadata attributes grouped by purpose]

Identification/Distinction: Agent, Title, Date, Format, Publisher, Language, Version, Series & Series #, Content Type

Search & Browse: Country, Region, Abstract/Summary, Keywords, Subject (Sector, Theme, Topic), Business Function, Aggregation Level, Relation

Use Management: Authorized By, Rights Management, Access Rights, Location, Use History

Compliant Document Management: Record Identifier, Disposal Status, Disposal Review Date, Management History, Retention Schedule/Mandate, Preservation History

Capture methods: Human Capture, Programmatic Capture, Inherit from System Context, Extrapolate from Business Rules

Smart Use of Technologies

Sample structure – Bank Topics Classification Scheme (hierarchical taxonomy)

– Oracle data classes used to represent the Topic Classification scheme
   hierarchical taxonomy as reference source for the attribute – Topic
   used for Browse, Search, Content Syndication, Personalization

– 1st challenge is to architect the hierarchy correctly
   3 distinct data classes, not a tree structure with inheritance
   Allows you to use the three data classes for distinct functions across systems but still enforce relationships across the classes

[Figure: 3 Oracle data classes (Topic data class, Subtopic data class, Subsubtopic data class) with the relationships across the data classes marked.]
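The idea of three distinct data classes with enforced relationships, rather than one self-referencing tree, can be sketched as follows. This is a minimal illustration and not the Bank's Oracle design: it uses SQLite from the Python standard library, and every table name, column name and sample value is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Three separate tables (hypothetical names), one per level of the scheme.
# Each level can be used on its own, while foreign keys keep the levels linked.
conn.executescript("""
    CREATE TABLE topic (
        topic_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE
    );
    CREATE TABLE subtopic (
        subtopic_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE,
        topic_id    INTEGER NOT NULL REFERENCES topic(topic_id)
    );
    CREATE TABLE subsubtopic (
        subsubtopic_id INTEGER PRIMARY KEY,
        name           TEXT NOT NULL UNIQUE,
        subtopic_id    INTEGER NOT NULL REFERENCES subtopic(subtopic_id)
    );
""")

# Hypothetical sample values.
conn.execute("INSERT INTO topic VALUES (1, 'Education')")
conn.execute("INSERT INTO subtopic VALUES (10, 'Primary Education', 1)")
conn.execute("INSERT INTO subsubtopic VALUES (100, 'Girls Education', 10)")


def path_for(subsubtopic_name):
    """Return (topic, subtopic, subsubtopic) by joining across the three classes."""
    return conn.execute(
        """
        SELECT t.name, s.name, ss.name
        FROM subsubtopic AS ss
        JOIN subtopic AS s ON s.subtopic_id = ss.subtopic_id
        JOIN topic AS t ON t.topic_id = s.topic_id
        WHERE ss.name = ?
        """,
        (subsubtopic_name,),
    ).fetchone()


def orphan_is_rejected():
    """Try to add a subtopic whose topic does not exist; the database should refuse."""
    try:
        conn.execute("INSERT INTO subtopic VALUES (11, 'Orphan Subtopic', 999)")
    except sqlite3.IntegrityError:
        return True
    return False
```

A browse interface might read only the topic table, while a categorization profile works at the subtopic level; the foreign keys are what keep the two views consistent.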

Categorizing and Indexing Content

Let’s look at how we’re categorizing our content to this structure automatically

Topic classification, geographical region assignment, keywording examples

Can apply this approach to any kind of content

Enables us to build a robust metadata repository model, with strong metadata quality, to move towards SI at the functional level

Also note that we can do this across many languages

Semantic Analysis
Using The Technologies to Best Advantage

Semantic analysis tools which support concept extraction, categorization, summarization and pattern matching rules engines

Teragram works in 23 languages

Use categorization to capture Topics, Business Activities, Regions, Sectors, Themes, etc.

Use Concept Extraction to capture keywords

Use Rules Engine to capture Loan #, Credit #, Project ID, Trust Fund #, etc.

Use Summarization to generate a ‘gist’ of the content
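The rule-based capture of identifiers such as Loan #, Credit #, Project ID and Trust Fund # can be approximated with pattern matching. The sketch below uses Python regular expressions rather than the Teragram rules engine, and the identifier formats, field names and sample sentence are assumptions made for illustration only.

```python
import re

# Hypothetical identifier formats, chosen for illustration; real formats may differ.
RULES = {
    "loan_number": re.compile(r"\bLoan\s+(?:No\.|#)\s*(\d{4,5})\b"),
    "credit_number": re.compile(r"\bCredit\s+(?:No\.|#)\s*(\d{4,5})\b"),
    "project_id": re.compile(r"\b(P\d{6})\b"),
    "trust_fund_number": re.compile(r"\b(TF\d{6})\b"),
}


def capture(text):
    """Apply every rule to the text and return the unique values found per field."""
    found = {}
    for field, pattern in RULES.items():
        values = sorted(set(pattern.findall(text)))
        if values:
            found[field] = values
    return found


sample = (
    "Project P012345 is financed by Loan No. 4567 and Credit # 3210, "
    "with support from TF054321."
)
captured = capture(sample)
```

Keeping each rule as a named pattern means a new identifier type can be added without touching the capture loop, which is the property a rules engine is meant to provide.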

How does semantic analysis work?

Semantic Analysis Basics

Once you have made some sense of the sentence (decompose), reconstruct entities for information extraction (compose)

– Identify names and other fixed form expressions – people, organizations, actions, relationships, places

– Identify basic noun groups, verb groups, formatting elements, logic statements

– Construct complex noun groups and verb groups

– Identify event structures

– Identify common elements and associate

Leveraging the Topic Structure

Each subtopic is a knowledge domain (hierarchical taxonomy)

Each subtopic has an extensive concept level definition (1,000 – 5,000+ concepts)

Concepts are controlled vocabularies in their raw form (flat taxonomy)

Concepts with relationships (extensive per new Z39.19 standard) comprise semantic network (network taxonomy)

Categorization tools work with topic structure & concept definitions to categorize and index content

The following screen illustrates how that same structure is embedded into Teragram profile to support categorization
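How a categorization tool might use subtopics and their concept definitions can be sketched in a few lines. This is a deliberately simple stand-in, not how Teragram works: it counts exact concept matches, the two subtopics and their four-concept vocabularies are hypothetical (real definitions run to thousands of concepts), and the threshold value is an assumption.

```python
import re
from collections import Counter

# Hypothetical concept definitions, one small controlled vocabulary per subtopic.
SUBTOPIC_CONCEPTS = {
    "Primary Education": {"primary school", "enrollment", "literacy", "teacher training"},
    "Water Supply": {"drinking water", "sanitation", "water utility", "pipeline"},
}


def categorize(text, threshold=2):
    """Score each subtopic by concept hits and keep those at or above the threshold."""
    normalized = re.sub(r"\s+", " ", text.lower())
    scores = Counter()
    for subtopic, concepts in SUBTOPIC_CONCEPTS.items():
        for concept in concepts:
            pattern = r"\b" + re.escape(concept) + r"\b"
            hits = len(re.findall(pattern, normalized))
            if hits:
                scores[subtopic] += hits
    return [(name, count) for name, count in scores.most_common() if count >= threshold]


document = (
    "The project raised primary school enrollment and funded teacher training. "
    "One school also gained drinking water."
)
assigned = categorize(document)
```

With the default threshold the sample document is assigned only to Primary Education; the single mention of drinking water is treated as noise, which is the kind of precision control the concept definitions are meant to give.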

[Screens: Teragram profile, with the following callouts]

Subtopics

Domain concepts or controlled vocabulary

Extensive operators allow us to write grammatical rules to manage typical semantic problems

Concept based rules engine allows us to define patterns to capture other kinds of data

Example of use of Authority Control to capture country names but extract ‘authorized’ version of country name

Example of use of a gazetteer + concept extraction + rules engine to support semantic interoperability

Use of concept extraction + rules engine to capture Loan #, Credit #, Project ID#
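The authority control callout, capturing any variant of a country name but storing the authorized form, can be illustrated with a small lookup. The variant lists and the choice of which form counts as authorized are assumptions for this sketch, not the Bank's actual reference source.

```python
import re

# Hypothetical authority file: authorized form -> variant forms seen in text.
AUTHORITY = {
    "Côte d'Ivoire": ["Ivory Coast", "Cote d'Ivoire"],
    "Russian Federation": ["Russia"],
    "Lao People's Democratic Republic": ["Laos", "Lao PDR"],
}

# Map every known form, including the authorized one, to the authorized form.
_LOOKUP = {}
for authorized, variants in AUTHORITY.items():
    for name in [authorized, *variants]:
        _LOOKUP[name.lower()] = authorized

# Longest names first, so "Russian Federation" is tried before "Russia".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(n) for n in sorted(_LOOKUP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def authorized_countries(text):
    """Return authorized country names in order of first appearance, without repeats."""
    seen = []
    for match in _PATTERN.finditer(text):
        name = _LOOKUP[match.group(1).lower()]
        if name not in seen:
            seen.append(name)
    return seen


countries = authorized_countries(
    "Russia and the Ivory Coast signed first; Lao PDR and the Russian Federation followed."
)
```

Because every variant resolves to one stored value, a search on the authorized name retrieves documents regardless of which form the author used.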

Overview of Process & Tools

[Table: Activity / Approach / Tools]

Activity: Create new facet
Approach: Human review & consultation, data structures, governance
Tools: Oracle DBMS, in future Metadata Repository tools (ISO 11179); Oracle representation of data classes

Activity: Create new class
Approach: Human review & harmonization of existing information structures; tool based discovery of new structures through clustering & extraction
Tools: Teragram dynamic concept extraction using grammars, categorization, clustering; Oracle representation of data classes

Activity: Create new concept
Approach: Create training sets working with experts, identify & integrate existing vocabularies
Tools: Teragram concept extraction, Oracle representation of values

Activity: Create new relationship
Approach: Human relationship creation, augmented by technological discovery
Tools: Teragram clustering engine, MultiTes Thesaurus Management System, Oracle copy of thesaurus relationships

Activity: Create new metadata
Approach: Enterprise Profile Development with human review in some cases, no review in others; Metadata in the language of the document/content
Tools: Teragram enterprise profile leveraging concept extraction, categorization, and summarization

Enterprise Profile Creation and Maintenance

[Figure: Enterprise Profile Development & Maintenance]

Enterprise Metadata Profile
– Concept Extraction Technology: Country, Organization Name, People Name, Series Name/Collection Title, Author/Creator, Title, Publisher, Standard Statistical Variable, Version/Edition
– Categorization Technology: Topic Categorization, Business Function Categorization, Region Categorization, Sector Categorization, Theme Categorization
– Rule-Based Capture: Project ID, Trust Fund #, Loan #, Credit #, Series #, Publication Date, Language
– Summarization

e-CDS Reference Sources for Country, Region, Topics, Business Function, Keywords, Project ID, People, Organization

Data Governance Process for Topics, Business Function, Country, Region, Keywords, People, Organizations, Project ID

Also shown: Teragram Team; TK240 Client, ISP, IRIS, ImageBank, Factiva, JOLIS, E-Journals

Enterprise Metadata Capture Strategy

[Figure: Enterprise Metadata Capture – Functional Reference Model. Components shown: a dedicated server running the Teragram semantic engine (concept extraction, categorization, clustering, rule based engine, language detection); Enterprise Profile Development & Maintenance; e-CDS Reference Sources; content capture with ISP integration, IRIS integration, ImageBank integration, the Factiva metadata database and the TK240 client; APIs & technical integration; XML wrapped metadata and XML output. Roles shown: Content Owners, Business Analyst, IDU Indexers, SITRC Librarians, IRIS Functional Team. Inputs shown: UCM Service Requests, Update & Change Requests.]
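The XML wrapped metadata that the reference model passes between systems could look something like the output of the sketch below. The element names, the record attribute and the sample values are all hypothetical; the slides do not give the actual schema.

```python
import xml.etree.ElementTree as ET


def wrap_metadata(record_id, fields):
    """Wrap captured metadata values in a simple XML envelope (hypothetical schema)."""
    root = ET.Element("metadata", attrib={"record": record_id})
    for name, values in fields.items():
        for value in values:
            # One child element per captured value, named after the metadata field.
            ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


wrapped = wrap_metadata(
    "doc-001",
    {"topic": ["Primary Education"], "country": ["Chad"], "project_id": ["P012345"]},
)
```

Because the envelope travels with the content rather than living in one source application, any consuming system that can parse XML can read the same metadata.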

Impacts & Outcomes

Information Access impacts
– Increased precision of search
– Better control over recall
– Searching like we talk
– Exact match searching – known item searching will work better
– Metadata based searching now begins to resemble full-text searching but with all the advantages of structure & context, and a significant reduction in the amount of noise

Productivity Improvements
– Can now assign deep metadata to all kinds of content
– Remove the human review aspect from the metadata capture
– Reduce unit times where human review is still used

Information Quality impacts
– All metadata carries the information architecture with it
– Apply quality metrics at the metadata level to eliminate need to build ‘fuzzy search architectures’ – these rarely scale or improve in performance
– Use the technologies to identify and fix problems with our data

In Progress Impacts

Same methodology can be leveraged to develop a structure of lines of business, entities prominent in particular domains, relationships among entities in a domain, standard statistical variables, etc.

The richer the metadata and the more fully elaborated the reference structures, the closer we come to understanding at a system level what is happening in a particular domain at any point in time

It is this overall structure which can then be leveraged in other contexts, perhaps even a counter-terrorism context, to threshold events

Without metadata, though, no information asset can be secured while its importance is still known

Without metadata, no information is agile or mobile

Thank You.

Questions & Discussions