Advantage Through Technology

Information Ontologies for the Intelligence Communities

A Survey of DCGS-A Ontology Work

Ron Rudnicki
November 12, 2013

Topics

• The DCGS-A ontology suite
• Standard operating procedures and ontology quality assurance
• Annotation vs. Explication
• How the DCGS-A ontologies are being used for the explication of data models

The DCGS-A Ontology Suite

Motives for Ontology Development

• Multiple formats including free text, semi-structured and structured

• Some “surprise” data sets are made available a short time prior to system testing

• Data sets will change along with domain of interest

• Data cannot be collected into a single store
• Provide cross-source searching and analytics
• Need to maintain the provenance of data

Part of a Big Data solution

Contribution of the Ontologies

• Common Upper Level Ontology – The ontologies extend from a common upper level ontology

• Delineated Content - Each ontology has a clearly specified and delineated content that does not overlap with any other ontology

• Composable Content – Classes in the ontologies represent entities at a level of granularity that can be composed in various ways to map to terms in sources

Design choices affect the outcome


Integration Through a Common Upper Level Ontology

• Provides common patterns within the target ontology for mappings from the sources
  – Easier to include new sources of data
• Enables more uniformity between queries
  – Easier to transition to domains of interest

Encourages uniform representations of domains

[Diagram: Entity subsumes Organization, Object, and Quality; Physical Artifact falls under Object; Organization and Physical Artifact each stand in has_quality (inverse: bearer_of) relations to a Quality of Organization and a Quality of Physical Artifact.]

CUBRC - Proprietary


Integration Through Delineated Content

• Facilitates locating a class within the target ontologies
• Provides better recall in queries
  – Less likely to overlook relevant data

Each class in the target ontologies is defined in one place

[Diagram: Entity subsumes Organization and Object; Physical Artifact falls under Object; Organization and Physical Artifact are each located_at a Spatial Location.]

Integration Through Composition of Classes

Data Source 1: Car, with Make and Model

Data Source 2: Car, classified as Full Size, Mid Size, or Compact

[Diagram: in the target ontology, Car has quality Length of Wheelbase, which is nominally measured by Compact, Mid Size, or Full Size; a Manufacturer manufactures the Car, and a Model prescribes it.]

• Granular classes better accommodate mappings from various perspectives on the same domain without loss of information
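The composition idea can be sketched in a few lines of Python (the class and property names here are illustrative, not the actual DCGS-A vocabulary): two source schemas about cars map onto the same set of granular target triples, so neither perspective loses information.

```python
# Sketch: granular target classes let two different source schemas about
# cars land in one graph. All names are hypothetical illustrations.

def map_source1(rec):
    """Source 1 row {car_id, make, model} -> target triples."""
    car = f"car/{rec['car_id']}"
    return {
        (car, "rdf:type", "Car"),
        (f"manufacturer/{rec['make']}", "manufactures", car),
        (f"model/{rec['model']}", "prescribes", car),
    }

def map_source2(rec):
    """Source 2 row {car_id, size_class} -> target triples, routed
    through the granular quality 'Length of Wheelbase'."""
    car = f"car/{rec['car_id']}"
    wb = f"{car}/wheelbase"
    return {
        (car, "rdf:type", "Car"),
        (car, "has_quality", wb),
        (wb, "rdf:type", "LengthOfWheelbase"),
        (wb, "is_nominally_measured_by", rec["size_class"]),
    }

# Both sources describe the same car; the union keeps both views intact.
graph = map_source1({"car_id": 1, "make": "Ford", "model": "Focus"}) \
      | map_source2({"car_id": 1, "size_class": "Compact"})
```

Because the target classes are finer-grained than either source's terms, each source composes them differently without forcing a lossy choice between "Make/Model" and "size class" views.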

High Level Depiction of Domain

Provides Coverage of Domain of Human Activity

[Diagram: People & Organizations are distinguished by Attributes and use Artifacts to perform Actions that take place in Natural & Artificial Environments and Time.]

Developed Using a Top-Down Bottom-Up Strategy

• Treasury Office of Foreign Assets Control – Specially Designated Nationals and Blocked Persons
• NCTC – Worldwide Incidents Tracking System
• UMD – Global Terrorism Database
• RAND – Database of Worldwide Terrorism Incidents
• LDM version .60 (TED)
• VMF PLI
• DCGS-A Global Graph
• DCGS-A Event Reporting
• BFT Report (CCRi test data)
• Cidne Sigact (CCRi test data)
• Long War Journal
• Harmony Documents from CTC at West Point
• Threats Open Source Intelligence Gateway

Partial List of Data Sources Used

Based Upon Standards

• DOD Dictionary of Military and Associated Terms (JP 1-02)
• JC3IEDM
• Counterinsurgency (FM 3-24)
• Operations (FM 3-0)
• Multinational Operations (JP 3-16)
• International Standard Industrial Classification of All Economic Activities, Rev. 4 (ISIC4)
• Universal Joint Task List (CJCSM 3500.04C)
• Weapon Technical Intelligence (WTI) Improvised Explosive Device (IED) Lexicon
• Information Artifact Ontology (IAO)
• Phenotype and Trait Ontology (PATO)
• Foundational Model of Anatomy (FMA)
• Region Connection Calculus (RCC-8)
• Allen Time Calculus
• Wikipedia

Partial List of Doctrine and Standards Used

Current DCGS-A Ontology Architecture

• Basic Formal Ontology 1.1
• Extended Relation Ontology
• Mid-Level Ontology
• Domain ontologies: Agent Ontology, Artifact Ontology, Event Ontology, Emotion Ontology, Geospatial Ontology, Information Entity Ontology, Quality Ontology, Time Ontology

Ontology Metrics

Ontology Name               | Classes | Relations | Equivalent Class Axioms | SubclassOf Axioms
Agent Ontology              |     986 |        71 |                     378 |              1004
AIRS Emotion Ontology       |      73 |         – |                       – |                88
AIRS Mid-Level Ontology     |     516 |         8 |                     221 |               641
Artifact Ontology           |     298 |         – |                       3 |               310
Event Ontology              |     409 |         – |                       2 |               423
Extended Relation Ontology  |       – |        45 |                       – |                 –
Geospatial Ontology         |     297 |        14 |                      13 |               316
Information Entity Ontology |      83 |        29 |                      21 |                83
Quality Ontology            |     681 |         – |                       2 |               681
Relation Ontology           |       – |        20 |                       – |                 –
Time Ontology               |      16 |        22 |                       – |                30
Totals                      |    3359 |       209 |              640 (~19%) |      3576 (~106%)

Standard Operating Procedures and Ontology Quality Assurance

Semantic Conformance Testing

• An importing ontology reuses a term from another and adds to its content in some way
  – adds an axiom to some upper-level term
  – the imported class inherits content from parent classes of the importing ontology
• Corrective action
  – request that the curators of the ontology that is the source of the class add the content
  – if not possible, then plan for revision of the import architecture: the importing ontology should introduce a subtype of the term to which the content could then be added

Semantic Smuggling

Semantic Conformance Testing

• Defining a class to be a subtype of more than one superclass
• Corrective action
  – remove any subclass assertions that are false (e.g. Bank subClassOf Organization, Bank subClassOf Facility)
  – refactor superclasses into disjoint classes
  – write axioms so that the multiple inheritance exists in the inferred hierarchy rather than the asserted hierarchy

Multiple Inheritance
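A check for this conformance rule is straightforward to automate. The sketch below (hypothetical taxonomy data, not the DCGS-A ontologies themselves) flags any class with more than one asserted superclass:

```python
# Sketch: a conformance check that flags asserted multiple inheritance.
# The taxonomy is a hypothetical dict: class name -> asserted superclasses.

def find_multiple_inheritance(taxonomy):
    """Return the classes asserted under more than one superclass."""
    return {cls: supers for cls, supers in taxonomy.items() if len(supers) > 1}

taxonomy = {
    "Bank": ["Organization", "Facility"],  # violation: needs refactoring
    "Organization": ["Entity"],
    "Facility": ["Entity"],
}
violations = find_multiple_inheritance(taxonomy)
```

Each flagged class then gets one of the corrective actions above: drop a false assertion, refactor into disjoint superclasses, or move the second parent into the inferred hierarchy via an axiom.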

Semantic Conformance Testing

• Extending an ontology by introducing terms as child terms of a higher-level ontology using another relation (e.g. part of, is narrower in meaning than)
• Corrective action
  – place the terms into their appropriate place in the taxonomy

Taxonomy Overloading

Semantic Conformance Testing

• A term from a lower-level ontology is not a subclass of any class of the ontologies it imports
• Containment requires that the domain covered by a lower-level ontology be circumscribed by the domain covered by the higher-level ontology from which it extends
• Corrective action
  – add the class (or an appropriate superclass) to the appropriate higher-level ontology
  – import a higher-level ontology that does provide a superclass

Containment
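The containment rule also lends itself to an automated check. This sketch (illustrative class names, not the actual suite) walks asserted superclasses to verify that each lower-level class reaches some class of the imported upper ontology:

```python
# Sketch: containment check. Every class in a lower-level ontology must be
# subsumed, directly or transitively, by a class of the ontology it imports.

def is_contained(cls, parents, upper_classes):
    """Walk asserted superclasses; True if we reach an imported upper class."""
    seen = set()
    stack = [cls]
    while stack:
        c = stack.pop()
        if c in upper_classes:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return False

# Hypothetical data: "Orphan" has no path into the upper ontology.
parents = {"CarBomb": ["Weapon"], "Weapon": ["Artifact"], "Orphan": []}
upper = {"Artifact", "Event"}
```

Classes for which the check fails are candidates for the corrective actions above: add a superclass to the higher-level ontology, or import one that provides it.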

Semantic Conformance Testing

• An ontology includes information-model assertions that are not true of the domain
  – e.g. carrying over a NOT NULL constraint as "every person must have an email address"
• Corrective action
  – modify the axiom (generally the source of such violations) so that it conforms to the domain
    • e.g. "every person that has purchased from amazon.com must have an email address"

Conflation

Semantic Conformance Testing

• A class is a set-theoretic combination of other classes
• Corrective action
  – add the class as a new type (College or University => Higher Education Organization)

Logic of Terms

Calculating Value of Ontology Terms

• The content of ontologies used in an enterprise will be the subject of debate and, possibly, disagreement
• Having one or more metrics that are proven measures of value would help resolve such disagreements
• Current methods are often applied to ontologies in their entirety (e.g. Swoogle); fewer are designed to evaluate the value of individual ontology classes and properties

Provide some basis for class inclusion/exclusion

Calculating Value of Ontology Terms

• A purely statistical method applied to an ontology as a graph will undervalue isolated terms that are of importance in a domain
• Importance is at least a function of amount of use and criticality
• Usage is tractable to definition; criticality less so

Statistical Methods Supplemented by Weightings
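One simple way to realize "statistics supplemented by weightings" is a blended score. The sketch below is a minimal illustration, not a proposed metric: usage is normalized from counts, and a manually assigned criticality weight rescues isolated but important terms.

```python
# Sketch: score ontology terms by blending a usage statistic with a manual
# criticality weight, so isolated-but-critical terms are not undervalued.
# Term names, counts, and weights are all hypothetical.

def term_value(usage_counts, criticality, alpha=0.5):
    """value = alpha * normalized usage + (1 - alpha) * criticality."""
    max_use = max(usage_counts.values()) or 1
    return {
        t: alpha * (usage_counts[t] / max_use) + (1 - alpha) * criticality.get(t, 0.0)
        for t in usage_counts
    }

usage = {"Person": 900, "Organization": 400, "RadiologicalDispersalDevice": 3}
crit = {"RadiologicalDispersalDevice": 1.0}  # rarely used, but critical
scores = term_value(usage, crit)
```

With a pure usage statistic the rarely instantiated term would rank last; the criticality weight moves it ahead of moderately used terms, which is exactly the behavior the slide argues for.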

Annotation vs. Explication

Mappings

• Many of the purposes for which ontologies are built will be realized only to the degree to which they are linked to data

• One component of mapping is an act of translation and should be assessed on the degree of equivalence between source and target

• Another component of mapping is an implementation and should be assessed on performance criteria such as costs and scalability

• Techniques and technologies vary*

Value and Assessment

*An introductory overview can be found at: http://www.w3.org/2005/Incubator/rdb2rdf/RDB2RDF_SurveyReport_01082009.pdf

Mappings

• Hashtags – the subjective assignment of uncurated keywords to a source

• Annotations – rule based assignment of curated terms to a source

• Machine maps – automated, structure-based translation of source into target vocabulary

• Definitions – rule based expansion of source terms into types and differentiating attributes

• Explications – rule based translation of all semantic content (including that which is implicit) of a source using terms and relations of the ontology

Subtypes: Term Mappings, Assertion Mappings


Mappings

• Term mappings
  – Can be automated
  – Enable faceted queries (Select "JFK" as type Airport)
  – Can result in significant loss of information
  – Not reusable
• Assertion mappings
  – Manual process that does not scale
  – Requires extensive knowledge of the target ontology
  – Enables navigational queries
  – Improves integration of data sources
  – Can result in significant carry-over of source information
  – Not reusable

Pros and Cons
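The contrast between the two styles can be made concrete with a toy example (all vocabulary here is illustrative): a term mapping is a type lookup that discards the row's structure, while an assertion mapping carries the source's assertions into the graph.

```python
# Sketch: term mapping vs. assertion mapping on one hypothetical source row.

row = {"code": "JFK", "kind": "Airport", "city": "New York"}

# Term mapping: a cheap, automatable type lookup. The row's structure
# (the link to its city) is lost in translation.
term_map = {"Airport": "aero:Airport"}
term_result = (row["code"], "rdf:type", term_map[row["kind"]])

# Assertion mapping: hand-built triples carry over the source's assertions,
# enabling navigational queries, at the cost of manual effort.
assertion_result = {
    ("airport/JFK", "rdf:type", "aero:Airport"),
    ("airport/JFK", "located_in", "city/New_York"),
    ("city/New_York", "rdf:type", "geo:City"),
}
```

The term mapping supports the faceted query "select JFK as type Airport" but nothing more; the assertion mapping also answers "which airports are located in New York", which is the navigational payoff the slide describes.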

Assessing Current Mapping Methods

[Chart: the five methods plotted by Time/Money (low to high) against Translation (lossy to lossless): Hashtags, Annotations, Machine Maps, Definitions, Explications. No ideal instances: fidelity of translation rises with cost.]

Examples of Mappings

CityId | Name        | State         | IncorporationDate | Area          | Coordinates
1      | Tampa       | Florida       | July 15, 1887     | 170.6 sq. mi. | 27°56’50” N, 82°27’31” W
2      | Boston      | Massachusetts | March 4, 1822     | 89.63 sq. mi. | 42°21’29” N, 71°03’49” W
3      | Dallas      | Texas         | February 2, 1856  | 385.8 sq. mi. | 32°46’58” N, 96°48’14” W
4      | Los Angeles | California    | April 4, 1850     | 503 sq. mi.   | 34°03’ N, 118°15’ W

A Source of Data About Cities

Explication of the Source as an End Point


[Diagram: the explicated source as a graph linking City, Area, City Name, State, State Name, Coordinates, City Government, Act of Incorporation, and Incorporation Date via has_quality, designated_by, part_of, participates_in, delimits, and occurs_on.]

Explication Implementation Example

map:PersonBirth rdf:type d2rq:ClassMap ;
    rdfs:label "Person Birth" ;
    d2rq:class event:Birth ;
    d2rq:classDefinitionLabel "Treasury OFAC Person Birth" ;
    d2rq:dataStorage map:KDD-02-B-Treasury-SDN ;
    d2rq:uriPattern "treasurydata_PersonBirth/@@TreasuryPerson.id|urlify@@" .

map:PersonBirthTemporalInterval rdf:type d2rq:ClassMap ;
    rdfs:label "Person Birth Temporal Interval" ;
    d2rq:class span:TemporalRegion ;
    d2rq:classDefinitionLabel "Treasury OFAC Person Birth Temporal Interval" ;
    d2rq:dataStorage map:KDD-02-B-Treasury-SDN ;
    d2rq:uriPattern "treasurydata_PersonBirthTemporalIdentifier/@@TreasuryPerson.id|urlify@@_@@TreasuryPerson.dateofbirthlist_uid|urlify@@" .

map:PersonBirthTemporalIntervalIdentifier rdf:type d2rq:ClassMap ;
    rdfs:label "Person Birth Temporal Interval Identifier" ;
    d2rq:class airs:TemporalRegionIdentifier ;
    d2rq:classDefinitionLabel "Treasury OFAC Person Birth Temporal Interval Identifier" ;
    d2rq:dataStorage map:KDD-02-B-Treasury-SDN ;
    d2rq:uriPattern "treasurydata_PersonBirthTemporalIdentifier/@@TreasuryPerson.id|urlify@@_@@TreasuryPerson.dateofbirthlist_uid|urlify@@" .

map:PersonBirthTemporalIntervalIdentifierBearer rdf:type d2rq:ClassMap ;
    rdfs:label "Person Birth Temporal Interval Identifier Bearer" ;
    d2rq:class airs:TemporalRegionIdentifierBearer ;
    d2rq:classDefinitionLabel "Treasury OFAC Person Birth Temporal Interval Identifier Bearer" ;
    d2rq:dataStorage map:KDD-02-B-Treasury-SDN ;
    d2rq:uriPattern "treasurydata_PersonBirthTemporalIdentifierBearer/@@TreasuryPerson.id|urlify@@_@@TreasuryPerson.dateofbirthlist_uid|urlify@@" .

map:PersonBirthGeospatialLocation rdf:type d2rq:ClassMap ;
    rdfs:label "Person Birth Geospatial Location" ;
    d2rq:class geo:GeospatialLocation ;
    d2rq:classDefinitionLabel "Treasury OFAC Person Birth Geospatial Location" ;
    d2rq:dataStorage map:KDD-02-B-Treasury-SDN ;
    d2rq:uriPattern "treasurydata_PersonBirthGeospatialLocation/@@TreasuryPerson.id|urlify@@_@@TreasuryPerson.placeofbirthlist_uid|urlify@@" .

A Portion of a D2RQ File Mapping Birth Place and Date

Explication Current Method

• The full mapping of birth place and date consists of 16 such blocks
• The full mapping of the entire table consists of 150 such blocks
• If the ontologies change, so must the mappings
• Common patterns in the ontologies make some re-use possible by adding placeholders to portions of maps and replacing them with specific values for the source at hand
• Applications exist or are under development to auto-generate initial mappings that a human can then edit
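The placeholder approach can be sketched as simple template expansion. This is an illustration of the idea, not the production tooling; the template fields and example values are assumptions modeled on the D2RQ blocks shown above.

```python
# Sketch: auto-generate an initial D2RQ ClassMap block from a template with
# placeholders, for a human to review and edit. Field names are illustrative.

TEMPLATE = """map:{name} rdf:type d2rq:ClassMap ;
    rdfs:label "{label}" ;
    d2rq:class {target_class} ;
    d2rq:dataStorage map:{storage} ;
    d2rq:uriPattern "{pattern}" ."""

def make_class_map(name, label, target_class, storage, pattern):
    """Fill the placeholders with values specific to the source at hand."""
    return TEMPLATE.format(name=name, label=label, target_class=target_class,
                           storage=storage, pattern=pattern)

block = make_class_map("PersonBirth", "Person Birth", "event:Birth",
                       "Treasury-SDN",
                       "treasurydata_PersonBirth/@@TreasuryPerson.id|urlify@@")
```

Because the ontologies' common upper-level patterns recur across sources, one template can seed many of the 150 blocks, leaving only source-specific values for the human editor.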

Explication Current Method

• The improvements are source- and implementation-specific
  – What works for structured sources mapped in D2RQ can't be reused for structured sources mapped in other languages (R2RML, EDOAL)
  – Separate mappings would be needed for sources expressed in XML, HTML, or free text
• Another solution is needed

How the DCGS-A Ontologies are Being Used for the Explication of Data Models

Start with Machine Made Assertion Mappings

• Type to Type mapping (e.g. table column to class)

• Relationships between types expressed using a default generic object property

• Meta-data about the source entity (e.g. table name, column name, element name) is mapped to annotation properties (rdfs:label)

Machine Made Assertion Mapping as a Starting Point
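The machine-made starting point described above can be sketched directly: each column becomes a class, each value becomes a node linked to the row's container by a default generic property, and source metadata is carried as labels. All names below are illustrative.

```python
# Sketch: a machine-made assertion mapping as a starting point.
# Hypothetical table/column names stand in for real source metadata.

def auto_map_row(table, row_id, row):
    """Map one table row to triples using a default generic property."""
    container = f"{table}/{row_id}"
    triples = {(container, "rdf:type", table)}       # type-to-type mapping
    for col, val in row.items():
        node = f"{container}/{col}"
        triples |= {
            (container, "related_to", node),          # default generic property
            (node, "rdf:type", col),                  # column mapped to a class
            (node, "rdfs:label", str(val)),           # source metadata as annotation
        }
    return triples

triples = auto_map_row("City", 1, {"Name": "Tampa", "State": "Florida"})
```

The output is deliberately crude: every relationship is the generic `related_to`, which is exactly what the rule library described later refines into ontology-specific properties and types.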


[Diagram: City linked to Name, State, Incorporation Date, Area, and Coordinates by the generic properties has_name, has_state, has_incorporation_date, has_area, and has_coordinates.]

Class mappings created by associating the container with the components with a generic property

Current Content of Ontologies is not Well Used

• Ontologists are trained to associate subclass and equivalence axioms to classes

• OWL reasoners don’t expand the graph by creating instances based upon these axioms

• OWL reasoners are resource expensive and often result in unimpressive output

• Not much control can be exerted upon which inferences an OWL reasoner performs

Create a Library of Rules

CONSTRUCT {
  ?city ex:designated_by ?cityname .
  ?cityname rdf:type ex:CityName .
}
WHERE {
  ?city rdf:type ex:City .
  ?cityname rdf:type ex:Name .
  ?city ?related_to ?cityname .
  FILTER NOT EXISTS { ?city ex:designated_by ?cityname . }
}

Change the relationship and type of the name of a city

Create a Library of Rules

DELETE {
  ?city ?related_to ?name .
  ?name rdf:type ex:Name .
}
WHERE {
  ?city ?related_to ?name .
  ?name rdf:type ex:Name .
  ?city ex:designated_by ?name .
  ?name rdf:type ex:CityName .
}

Delete the original relationship and type
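The effect of the two rules can be replayed against an in-memory triple set without a SPARQL engine. This is a sketch of the same logic in Python, with one assumption made explicit in the comments: the delete step guards against removing the newly added `designated_by` link.

```python
# Sketch: replay the CONSTRUCT/DELETE rule pair over a plain set of triples.
# Node names are illustrative.

g = {
    ("city_1", "rdf:type", "City"),
    ("name_1", "rdf:type", "Name"),
    ("city_1", "related_to", "name_1"),   # generic machine-made link
}

# Rule 1 (CONSTRUCT): retype the generically linked Name as a CityName and
# attach it with designated_by, skipping cities already designated.
for s, p, o in list(g):
    if ((s, "rdf:type", "City") in g and (o, "rdf:type", "Name") in g
            and (s, "designated_by", o) not in g):
        g |= {(s, "designated_by", o), (o, "rdf:type", "CityName")}

# Rule 2 (DELETE): drop the original generic link and the generic Name type.
# The predicate guard is an assumption added here so this sketch removes only
# the generic link, not the triples Rule 1 just created.
for s, p, o in list(g):
    if ((o, "rdf:type", "Name") in g and (s, "designated_by", o) in g
            and (o, "rdf:type", "CityName") in g
            and p not in ("rdf:type", "designated_by")):
        g -= {(s, p, o), (o, "rdf:type", "Name")}
```

After both passes, the graph holds `city_1 designated_by name_1` and `name_1 rdf:type CityName`, with the generic `related_to` link and the generic `Name` typing gone, which is the before/after transformation the next slide depicts.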

The Effect of Such Rules on Translated Data

[Diagram: the Tampa row after rule application, linking city_1, city_name_1 ("Tampa"), state_1, state_name_1 ("Florida"), area_1 ("170.6 sq. mi."), coordinates ("27°56’50” N 82°27’31” W"), city_government_1, and acts of incorporation (occurring on July 15, 1887) via designated_by, part_of, has_quality, has_value, has_text_value, has_incorporation_date, delimits, is_output_of, and occurs_on.]

Benefits of Rule Library

• No need to write different rules for different source formats
• Changes to the ontology affect a single rule rather than some (possibly large) number of mappings
• Allows mappings from source to target to be simple and possibly fully automated
• Writing of rules can be performed by SMEs
• Fine-grained control of which rules are executed
  – by user group
  – above a stated level of priority (weighting)