
© 2011 Noblis, Inc. Noblis proprietary and confidential.

Applications of Semantic Technology

Victor J. Pollara

29 November 2012


Overview

• The development of “Semantic Technology” represents the confluence of several fields:

  • The Internet/Web
  • Knowledge modeling (most notably the field of ontology)
  • Mathematical logic and computational logic
  • Database technology
  • The general advancement of complex applications with rich GUIs

• Because of these diverse origins, there is a variety of ways in which the technology is commonly used:

  • Adding machine-readable information to web components to support interoperation and autonomous action by software agents
  • Augmenting an existing data set with a model (e.g. ontology, taxonomy)
  • Integrating multiple data sets on common data elements with well-defined meanings
  • Extracting and structuring information from text
  • Implementing knowledgebases (reservoirs of knowledge that support simple reasoning)
  • Analysis of “graph-based” problems:
    • Social network analysis (benign, e.g. Facebook; malign, e.g. terrorist networks)
    • Cybersecurity analysis
    • Fraud detection and surveillance


Representing Data in Graph Form

[Figure: tabular data with columns Name, Addr, and SSN and one row each for Joe, Sam, Moe, Pam, and Zoe; cell values are elided on the slide]

• Social networks
• “Link analysis”
• “Degrees of separation”

A “graph” has “nodes” and “edges”. Edges may have weights representing strength or certainty.

[Figure: plain graph over the nodes Joe, Zoe, Moe, Sam, and Pam]

A semantic graph has named relations with direction. This permits much more sophisticated queries and supports reasoning. <Moe> <hasDad> <Joe> is called a “triple” in the semantic world.

[Figure: semantic graph over the same nodes, with edges labeled hasMom, hasDad, hasSpouse, hasGrandparent, and hasArmsSupplier]

Enhanced semantic graphs with weighted edges

[Figure: the same semantic graph with edge weights: hasMom P=1.0, hasDad P=0.67, hasGrandparent P=0.67, hasSpouse P=1.0, hasArmsSupplier P=0.8]
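To make the idea of a weighted semantic graph concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in the talk: only the triple <Moe> <hasDad> <Joe> is spelled out on the slide, so the endpoints of the other edges are invented, and multiplying certainties along a chain is just one simple way to weight an inferred edge.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    p: float = 1.0  # edge weight: strength or certainty


# Only <Moe> <hasDad> <Joe> is spelled out on the slide; the endpoints of the
# other two edges are assumed for illustration.
TRIPLES = [
    Triple("Moe", "hasDad", "Joe", 0.67),
    Triple("Joe", "hasMom", "Zoe", 1.0),
    Triple("Sam", "hasSpouse", "Pam", 1.0),
]

PARENT_PREDICATES = {"hasMom", "hasDad"}


def match(triples, subject: Optional[str] = None,
          predicate: Optional[str] = None, obj: Optional[str] = None):
    """Return the triples that agree with every field that is not None."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def infer_grandparents(triples):
    """Chain two parent edges into one hasGrandparent edge.

    Assumption: the certainty of the inferred edge is the product of the two
    certainties it was derived from.
    """
    inferred = []
    for first in triples:
        if first.predicate not in PARENT_PREDICATES:
            continue
        for second in match(triples, subject=first.obj):
            if second.predicate in PARENT_PREDICATES:
                inferred.append(
                    Triple(first.subject, "hasGrandparent", second.obj,
                           round(first.p * second.p, 2))
                )
    return inferred
```

Calling `match(TRIPLES, predicate="hasDad")` is the kind of query that named, directed relations make possible, and `infer_grandparents(TRIPLES)` is a small example of reasoning: it derives an edge that was never stored.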


Cybersecurity: Security Event Analysis

Organizations commonly deploy “event logging software” (e.g. ArcSight) to record the events that occur in their networks.

The most basic data collected for each event are the source and destination IP addresses.

• This can be naturally represented as a network graph of nodes (IP addresses) and edges (events that link two addresses).

• The number of events generated for a mid-sized company is in the billions, so the graph to be analyzed is large.

• The kinds of queries needed to identify problems range over the entire graph, so subdividing the graph (e.g. with Hadoop MapReduce) can make some queries infeasible.
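The points above can be sketched in a few lines of Python. The events and addresses are hypothetical (taken from the IP ranges reserved for documentation), and breadth-first search stands in for the class of queries that range over the whole graph: a path may cross any partition boundary, which is why subdividing the graph makes such queries awkward.

```python
from collections import defaultdict, deque

# Hypothetical events as (source IP, destination IP); the addresses come from
# the ranges reserved for documentation.
EVENTS = [
    ("192.0.2.10", "198.51.100.5"),
    ("192.0.2.10", "198.51.100.5"),
    ("198.51.100.5", "203.0.113.7"),
    ("192.0.2.44", "203.0.113.9"),
]


def build_graph(events):
    """Map each source to its destinations, counting events per edge."""
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst in events:
        graph[src][dst] += 1
    return graph


def reachable(graph, start):
    """Breadth-first search: every address reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # .get avoids creating empty entries for addresses that never act
        # as a source.
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen
```

The per-edge counts produced by `build_graph` are one natural choice of edge weight; `reachable` follows edges in the source-to-destination direction only.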


Graph Model of Events

Transform security event data into a semantic graph and examine all relationships to identify unknown cyber threats
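One possible way to do this transformation is to make each event a node of its own, so that source, destination, and any other attributes become named relations on that node. The sketch below assumes a dictionary-shaped event record; the field names (`src`, `dst`, `type`) and the predicate names (`hasSource`, `hasDestination`) are hypothetical and are not taken from ArcSight or from the talk.

```python
def event_to_triples(event_id, event):
    """Turn one event record into (subject, predicate, object) triples.

    The event itself becomes a node, so extra attributes can hang off it.
    """
    subject = f"event:{event_id}"
    triples = [
        (subject, "hasSource", f"ip:{event['src']}"),
        (subject, "hasDestination", f"ip:{event['dst']}"),
    ]
    # Any remaining fields become literal-valued properties of the event.
    for key in sorted(event):
        if key not in ("src", "dst"):
            triples.append((subject, key, str(event[key])))
    return triples


# Hypothetical record for illustration only.
SAMPLE_EVENT = {"src": "192.0.2.10", "dst": "198.51.100.5", "type": "login_failure"}
```

Modeling the event as a node rather than as a bare edge keeps room for richer questions later, such as "which sources produced failed logins against this destination".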


Knowledgebase: Semantic Medline

• Example using the XMT2: Semantic Medline (Rindflesch, Shin, et al.)

• 60M+ high-confidence ‘facts’ extracted from 22M biomedical (PubMed) citations

• Augment it with biomedical knowledge models (e.g. UMLS Metathesaurus, NCBI Taxonomy)

• Integrate with other resources (e.g. GeoNames)

The Computing Environment:

• 4TB of shared memory

• 128 cores, each capable of running 128 independent threads (16,384 threads in total)

• Maximum recommended size: 20 billion triples (these occupy 2TB; uRiKA uses the remaining 2TB as scratch space)

• uRiKA provides a SPARQL endpoint as well as a web client that a user can interact with directly

• ‘Service nodes’ are Linux machines separate from the ‘compute nodes’, and the communication latency between them must be managed
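The figures quoted for the computing environment can be checked with a little arithmetic. The sketch below only restates the slide's numbers; the one assumption is that "TB" means 10^12 bytes, which gives a rough storage budget per triple.

```python
CORES = 128
THREADS_PER_CORE = 128
SHARED_MEMORY_TB = 4
MAX_TRIPLES = 20_000_000_000
TRIPLE_STORAGE_TB = 2

BYTES_PER_TB = 10**12  # assumption: decimal terabytes

total_threads = CORES * THREADS_PER_CORE                              # 16384
bytes_per_triple = TRIPLE_STORAGE_TB * BYTES_PER_TB / MAX_TRIPLES     # 100.0
scratch_tb = SHARED_MEMORY_TB - TRIPLE_STORAGE_TB                     # 2
```

Under that assumption, the recommended maximum works out to roughly 100 bytes of shared memory per stored triple.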


The XMT2

The architecture of the XMT2 is suited for data that is not easily subdivided.

Efficiency of computation requires the entire set to be held in shared memory.

Data with little semantic content is not the best candidate (e.g. triplifying huge tabular arrays of numerical data is not appropriate).

Since you are going to create a copy of the data for the XMT2, the best approach is to remodel it to contain as rich a semantic structure as possible.

Any ontology that adds semantic richness can support new queries that might be valuable

Since you are doing ETL, a scripting language is appropriate.

The XMT2 is not intended to serve as a persistence layer for transactional applications. It is best suited to non-subdividable graph analytic problems that have billions of nodes and edges.
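As an illustration of ETL in a scripting language, here is a minimal Python sketch that turns tabular rows into N-Triples lines. The namespace `http://example.org/`, the column names, and the sample values are all hypothetical; a real remodeling step would map columns to terms from an actual ontology rather than reuse the column names as predicates.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace

# Hypothetical sample rows for illustration only.
SAMPLE_CSV = """name,addr
Joe,1 Main St
Zoe,2 Oak Ave
"""


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(csv_text, key_column):
    """Emit one N-Triples line per non-empty cell, keyed on key_column."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the key so the subject is a valid IRI.
        subject = f"<{BASE}person/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue
            lines.append(
                f'{subject} <{BASE}{column}> "{escape_literal(value)}" .'
            )
    return lines
```

`rows_to_ntriples(SAMPLE_CSV, "name")` yields lines that a triplestore loader could ingest; the richer the predicates chosen at this stage, the more queries the resulting graph can answer.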


Text Extraction and Triples

The first task in text extraction is to identify entities (e.g. people, places, things, events).

• Good for document characterization, document matching, categorization.

Natural language processing can go much further by:

• Tagging each term with its part of speech

• Using the part-of-speech tags to extract ‘subject-verb-object’ triples

• These triples mirror the triple structure of semantic data

• Use controlled vocabularies and ontologies to manage entities and relations

• Example: “Tamoxifen has been shown in vitro to inhibit protein kinase C through estrogen receptor-independent antineoplastic effects.”

Extracted triple:

Subject:    tamoxifen          urn:nlm.nih.gov:UMLS/CUI/C0039286
Predicate:  inhibits           urn:nlm.nih.gov:semmed/relation/inhibits
Object:     protein kinase C   urn:nlm.nih.gov:UMLS/CUI/C0033634
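The last step, managing entities and relations with a controlled vocabulary, can be sketched as a lookup from extracted strings to identifiers. The three URIs are the ones shown on the slide; the `VERB_FORMS` table and the function name are hypothetical, and a real pipeline would rely on an NLP toolkit and the full UMLS vocabulary rather than hand-written dictionaries.

```python
# URIs as shown on the slide.
ENTITY_URIS = {
    "tamoxifen": "urn:nlm.nih.gov:UMLS/CUI/C0039286",
    "protein kinase c": "urn:nlm.nih.gov:UMLS/CUI/C0033634",
}
RELATION_URIS = {
    "inhibits": "urn:nlm.nih.gov:semmed/relation/inhibits",
}

# Hypothetical table mapping verb forms found in text to a relation name.
VERB_FORMS = {"inhibit": "inhibits", "inhibits": "inhibits", "inhibited": "inhibits"}


def to_semantic_triple(subject, verb, obj):
    """Return a triple of URIs, or None if any term is outside the vocabulary."""
    relation = VERB_FORMS.get(verb.strip().lower())
    s = ENTITY_URIS.get(subject.strip().lower())
    p = RELATION_URIS.get(relation)
    o = ENTITY_URIS.get(obj.strip().lower())
    if None in (s, p, o):
        return None
    return (s, p, o)
```

For the example sentence, `to_semantic_triple("Tamoxifen", "inhibit", "protein kinase C")` returns the three URIs from the slide. Returning `None` for unknown terms keeps unvetted strings out of the graph.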


Semantic Medline

The National Library of Medicine hosts a website that contains over 22M citations from the biomedical literature (PubMed).

• Even though they are only titles and abstracts, there is a lot of knowledge in them

• But the site only provides access to the citations by ‘search’

NLM scientists (Rindflesch, Shin, et al.) built a web app, Semantic Medline, for exploring high-confidence ‘facts’ extracted from PubMed citations.

• The ‘facts’ are represented most naturally as a graph

• Without a high-performance triplestore server, they currently use a relational database (MySQL) to store the facts

We are testing the Cray XMT to see if it has potential to support a graph database as a replacement for MySQL.

• We proposed to port Semantic Medline to the Noblis XMT2

• Cray has provided a beta version of a triplestore server named uRiKA

• It provides a SPARQL endpoint (analogous to a SQL connector for MySQL)
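To show what talking to a SPARQL endpoint looks like, here is a minimal Python sketch that follows the standard SPARQL protocol (a `query` parameter over HTTP, results in the SPARQL JSON format). Nothing in it is specific to uRiKA: the endpoint URL is hypothetical, the query assumes the store uses the `urn:nlm.nih.gov:` identifiers from the text extraction example, and the canned response exists only so the parsing can be tried offline.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8080/sparql"  # hypothetical URL

# Which subjects are recorded as inhibiting protein kinase C?
QUERY = """
SELECT ?drug WHERE {
  ?drug <urn:nlm.nih.gov:semmed/relation/inhibits> <urn:nlm.nih.gov:UMLS/CUI/C0033634> .
}
LIMIT 10
"""


def build_request(endpoint, query):
    """Build an HTTP GET request per the SPARQL protocol (not sent here)."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def parse_results(body):
    """Turn a SPARQL JSON results document into a list of value tuples."""
    doc = json.loads(body)
    names = doc["head"]["vars"]
    return [
        # A variable can be unbound in a row, hence the defensive .get calls.
        tuple(binding.get(name, {}).get("value") for name in names)
        for binding in doc["results"]["bindings"]
    ]


# Canned response in the standard results layout, for offline illustration.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["drug"]},
    "results": {"bindings": [
        {"drug": {"type": "uri", "value": "urn:nlm.nih.gov:UMLS/CUI/C0039286"}}
    ]},
})
```

Against a live endpoint, the request would be sent with `urllib.request.urlopen(build_request(ENDPOINT, QUERY))` and the response body passed to `parse_results`. This is the sense in which the endpoint plays the role a SQL connector plays for MySQL.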

First let’s look at Semantic Medline’s functionality…


A Network Presentation of Biomedical Facts


Comments

Many problems can (at least in part) be represented as graphs.

Semantic technology can be used in a variety of ways to solve a wide range of problems.

Some application areas (knowledgebases, cybersecurity, fraud detection) give rise to very large graphs that are not easily subdividable.

The XMT2 is showing potential as a platform for providing analytical services on large semantic data sets.

Over the next 12 months we will build a variety of services and test their utility and responsiveness.

Analysis of semantic graphs requires support for both the logical queries of SPARQL and analytical methods that calculate other graph properties.


Backup Slide: Augmentation

For example, Medicare collects vast amounts of claims data.

Researchers can use it to evaluate the effectiveness of procedures or drugs.

But the format makes it difficult to explore the data in medically meaningful ways.

UMLS Knowledge Model

Antibacterials
    Penicillins
        Amoxicillin
        Ampicillin
    Cephalosporin

Tabular Claims Data

Patient ID   Age   Drug Code
125454       65    229
224377       77    634
986904       82    229
774826       66    551
223556       71    634
394857       70    551
675849       65    551

Why add more data to an already large set?
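One answer can be sketched in Python: once the claims are linked to a knowledge model, a question phrased at the level of a drug class can be answered even though the claims only carry drug codes. The claims rows and the hierarchy are the ones on the slide. The mapping from drug code to drug name is an assumption made purely for illustration, since the slide does not give it, and the function names are hypothetical.

```python
# Claims rows from the slide: (patient ID, age, drug code).
CLAIMS = [
    (125454, 65, 229),
    (224377, 77, 634),
    (986904, 82, 229),
    (774826, 66, 551),
    (223556, 71, 634),
    (394857, 70, 551),
    (675849, 65, 551),
]

# Assumption: the slide does not say which code denotes which drug.
DRUG_CODE_TO_NAME = {229: "Amoxicillin", 634: "Ampicillin", 551: "Cephalosporin"}

# Knowledge model as child -> parent ("is a kind of") links.
BROADER = {
    "Amoxicillin": "Penicillins",
    "Ampicillin": "Penicillins",
    "Penicillins": "Antibacterials",
    "Cephalosporin": "Antibacterials",
}


def ancestors(term):
    """Walk up the hierarchy and return every broader term."""
    result = []
    while term in BROADER:
        term = BROADER[term]
        result.append(term)
    return result


def patients_on(drug_class):
    """Patient IDs whose drug is drug_class or any drug beneath it."""
    ids = []
    for patient_id, _age, code in CLAIMS:
        drug = DRUG_CODE_TO_NAME[code]
        if drug == drug_class or drug_class in ancestors(drug):
            ids.append(patient_id)
    return ids
```

With this in place, `patients_on("Penicillins")` selects the patients on either amoxicillin or ampicillin, a grouping that the raw table cannot express on its own.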
