© 2011 Noblis, Inc. Noblis proprietary and confidential.
Applications of Semantic Technology
Victor J. Pollara
29 November 2012
Overview
• The development of “Semantic Technology” represents the confluence of several fields:
  • The Internet/Web
  • Knowledge modeling (most notably the field of ontology)
  • Mathematical logic and computational logic
  • Database technology
  • The general advancement of complex applications with rich GUIs
• Because of these diverse origins, the technology is commonly used in a variety of ways:
  • Adding machine-readable information to web components to support interoperation and autonomous action by software agents
  • Augmenting an existing data set with a model (e.g. ontology, taxonomy)
  • Integrating multiple data sets on common data elements with well-defined meanings
  • Extracting and structuring information from text
  • Implementing knowledgebases (reservoirs of knowledge that support simple reasoning)
  • Analysis of “graph-based” problems:
    • Social network analysis (benign, e.g. Facebook; malign, e.g. terrorist networks)
    • Cybersecurity analysis
    • Fraud detection and surveillance
Representing Data in Graph Form
• Tabular Data: rows with columns Name, Addr, SSN (names shown: Joe, Sam, Moe, Pam, Zoe)
• Graph: a “graph” has “nodes” and “edges”
  • Social networks
  • “Link analysis”
  • “Degrees of separation”
  • Edges may have weights representing strength or certainty
• Semantic graph: named relations with direction (hasMom, hasDad, hasSpouse, hasGrandparent, hasArmsSupplier)
  • Permits much more sophisticated queries
  • Supports reasoning
  • <Moe> <hasDad> <Joe> is called a “triple” in the semantic world
• Enhanced semantic graph: weighted edges (hasMom P=1.0, hasDad P=0.67, hasGrandparent P=0.67, hasSpouse P=1.0, hasArmsSupplier P=0.8)
[Figure: Joe, Zoe, Moe, Sam, and Pam shown as a table, a plain graph, a semantic graph, and a weighted semantic graph]
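A minimal sketch of the triple idea, in plain Python rather than a real triplestore. Only <Moe> <hasDad> <Joe> comes from the slide; the other edges, the placement of the weights, and the rule that multiplies certainties are assumptions made for illustration.

```python
# Each fact is a (subject, predicate, object) triple mapped to a certainty weight.
# Only ("Moe", "hasDad", "Joe") is from the slide; the rest is hypothetical.
triples = {
    ("Moe", "hasDad", "Joe"): 0.67,     # weight placement is an assumption
    ("Moe", "hasMom", "Zoe"): 1.0,      # hypothetical edge
    ("Joe", "hasSpouse", "Zoe"): 1.0,   # hypothetical edge
    ("Sam", "hasDad", "Moe"): 1.0,      # hypothetical edge
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

def infer_grandparents():
    """Rule: X has parent Y and Y has parent Z implies X hasGrandparent Z.

    The inferred weight is the product of the two certainties (an assumption).
    """
    parent = ("hasDad", "hasMom")
    inferred = {}
    for (x, p1, y), w1 in triples.items():
        for (y2, p2, z), w2 in triples.items():
            if p1 in parent and p2 in parent and y == y2:
                inferred[(x, "hasGrandparent", z)] = w1 * w2
    return inferred

print(match(s="Moe"))          # every fact whose subject is Moe
print(infer_grandparents())    # facts derived by the rule, not stored directly
```

The pattern match stands in for a query over named, directed relations, and the rule is a small instance of the reasoning that a semantic graph supports.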
Cybersecurity: Security Event Analysis
Organizations commonly deploy “event logging software” to record the events that occur in their networks (e.g. ArcSight)
The most basic data collected for each event is the source and destination IP address.
• This can be naturally represented as a network graph of nodes (IP addresses) and edges (events that link two addresses).
• A mid-sized company generates billions of events, so the graph to be analyzed is large.
• The kinds of queries needed to identify problems range over the entire graph, so subdividing the graph (e.g. Hadoop MapReduce) can make some queries infeasible.
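As a rough sketch of the node-and-edge representation described above: the events, the addresses (taken from private and documentation ranges), and the function names are hypothetical, and real event logs carry many more fields.

```python
from collections import Counter, defaultdict

# Hypothetical events as (source IP, destination IP) pairs.
events = [
    ("10.0.0.5", "10.0.0.9"),
    ("10.0.0.5", "10.0.0.9"),
    ("10.0.0.5", "192.0.2.44"),
    ("10.0.0.7", "192.0.2.44"),
]

def build_graph(events):
    """Collapse events into a weighted directed graph: edge -> event count."""
    edge_counts = Counter(events)
    neighbors = defaultdict(set)
    for src, dst in edge_counts:
        neighbors[src].add(dst)
    return edge_counts, neighbors

def fan_in(edge_counts):
    """Count the distinct sources that contacted each destination."""
    sources = defaultdict(set)
    for src, dst in edge_counts:
        sources[dst].add(src)
    return {dst: len(srcs) for dst, srcs in sources.items()}

edge_counts, neighbors = build_graph(events)
print(edge_counts[("10.0.0.5", "10.0.0.9")])  # events on one edge
print(fan_in(edge_counts))                     # distinct sources per destination
```

A fan-in count needs every edge that points at a destination, which is one small illustration of why a query can become awkward once the graph is split across partitions.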
Graph Model of Events
Transform security event data into a semantic graph and examine all relationships to identify unknown cyber threats
Knowledgebase: Semantic Medline
• Example using the XMT2: Semantic Medline (Rindflesch, Shin, et al.)
• 60M+ High-confidence ‘facts’ extracted from 22M biomedical (PubMed) citations
• Augment it with biomedical knowledge models (e.g. UMLS Metathesaurus, NCBI Taxonomy)
• Integrate with other resources (e.g. Geonames)
The Computing Environment:
• 4TB of shared memory
• 128 cores, each capable of running 128 independent threads (16,384 threads in total)
• Maximum recommended size: 20 billion triples (occupies 2TB, but uRiKA uses the remaining 2TB as scratch space)
• uRiKA provides a SPARQL endpoint as well as a web client a user can interact with directly.
• ‘Service nodes’ are Linux machines separate from the ‘compute nodes’ and there is a communication latency between them that must be managed
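The figures above imply a rough per-triple memory budget. This back-of-envelope check assumes decimal terabytes and assumes the 2TB figure covers the triple data and nothing else; neither assumption is stated in the slides.

```python
cores = 128
threads_per_core = 128
total_threads = cores * threads_per_core   # matches the 16384 quoted on the slide

max_triples = 20_000_000_000               # 20 billion triples (from the slide)
data_bytes = 2 * 10**12                    # 2 TB, assuming decimal terabytes
bytes_per_triple = data_bytes / max_triples

print(total_threads)       # 16384
print(bytes_per_triple)    # 100.0
```

Under those assumptions, each triple has a budget of about 100 bytes, including whatever indexing the store keeps alongside it.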
The XMT2
The architecture of the XMT2 is suited for data that is not easily subdivided.
Efficiency of computation requires the entire set to be held in shared memory.
Data with little semantic content is not the best candidate (e.g. triplifying huge tabular arrays of numerical data is not appropriate).
Since you are going to create a copy of the data for the XMT2, the best approach is to remodel it to contain as rich a semantic structure as possible.
Any ontology that adds semantic richness can support new queries that might be valuable
Since you are doing ETL, a scripting language is appropriate.
The XMT2 is not intended to serve as a persistence layer for transactional applications. It is best suited to non-subdividable graph analytic problems that have billions of nodes and edges.
Text Extraction and Triples
The first task in text extraction is to identify entities (e.g. people, places, things, events)
• Good for document characterization, document matching, categorization.
Natural language processing can go much further by:
• Tagging each term with its part of speech
• Using the part-of-speech tags to extract ‘subject-verb-object’ triples
• These triples mirror the triple structure of semantic data
• Use controlled vocabularies and ontologies to manage entities and relations
• Example: “Tamoxifen has been shown in vitro to inhibit protein kinase C through estrogen receptor-independent antineoplastic effects.”
  Subject: tamoxifen = urn:nlm.nih.gov:UMLS/CUI/C0039286
  Relation: inhibits = urn:nlm.nih.gov:semmed/relation/inhibits
  Object: protein kinase C = urn:nlm.nih.gov:UMLS/CUI/C0033634
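The last step, mapping an extracted subject-verb-object triple onto controlled identifiers, might look like the sketch below. The three URNs are the ones shown on the slide. The dictionary lookup, the function name `normalize`, and the treatment of “inhibit” as a form of “inhibits” are simplifying assumptions; a real pipeline would use a proper concept-mapping tool.

```python
# Controlled vocabulary: surface term -> identifier (URNs taken from the slide).
CONCEPTS = {
    "tamoxifen": "urn:nlm.nih.gov:UMLS/CUI/C0039286",
    "protein kinase c": "urn:nlm.nih.gov:UMLS/CUI/C0033634",
}
RELATIONS = {
    # Mapping both verb forms to one relation is an assumption.
    "inhibit": "urn:nlm.nih.gov:semmed/relation/inhibits",
    "inhibits": "urn:nlm.nih.gov:semmed/relation/inhibits",
}

def normalize(subject, verb, obj):
    """Map an extracted triple onto vocabulary identifiers.

    Returns None when any term is outside the vocabulary.
    """
    try:
        return (
            CONCEPTS[subject.strip().lower()],
            RELATIONS[verb.strip().lower()],
            CONCEPTS[obj.strip().lower()],
        )
    except KeyError:
        return None

triple = normalize("Tamoxifen", "inhibit", "protein kinase C")
print(triple)
```

Returning None for out-of-vocabulary terms keeps unmanaged entities out of the graph, which is the point of using controlled vocabularies.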
Semantic Medline
The National Library of Medicine hosts a website that contains over 22M citations from the biomedical literature (PubMed).
• Even though they are only titles and abstracts, there is a lot of knowledge in them
• But the site only provides access to the citations by ‘search’
NLM scientists (Rindflesch, Shin, et al.) built a web app for exploring high-confidence ‘facts’ extracted from PubMed citations (Semantic Medline).
• The ‘facts’ are represented most naturally as a graph
• Lacking a high-performance triplestore server, they currently use a relational database (MySQL) to store the facts
We are testing the Cray XMT to see if it has potential to support a graph database as a replacement for MySQL.
• We proposed to port Semantic Medline to the Noblis XMT2
• Cray has provided a beta-version triplestore server named uRiKA
• It provides a SPARQL endpoint (analogous to a SQL connector for MySQL)
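To make the SQL-connector analogy concrete, here is a sketch of how a client could prepare a request for a SPARQL endpoint using the standard protocol's GET form. The endpoint URL is hypothetical (the slides do not give the uRiKA address), the query assumes the facts are stored with the URNs from the tamoxifen example, and the request is only built, not sent.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Find everything reported to inhibit protein kinase C.
QUERY = """
SELECT ?agent WHERE {
  ?agent <urn:nlm.nih.gov:semmed/relation/inhibits> <urn:nlm.nih.gov:UMLS/CUI/C0033634> .
}
""".strip()

def build_request_url(endpoint, query):
    """SPARQL Protocol GET form: the query text travels in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})

url = build_request_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL recovers the original query text.
assert parse_qs(urlparse(url).query)["query"] == [QUERY]
print(url)
```

Where a SQL client sends a SELECT over rows, this client sends a graph pattern with a variable (`?agent`) and receives the bindings that match it.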
First let’s look at Semantic Medline’s functionality…
A Network Presentation of Biomedical Facts
Comments
Many problems can (at least in part) be represented as graphs.
Semantic technology can be used in a variety of ways to solve a wide range of problems.
Some application areas (knowledgebases, cybersecurity, fraud detection) give rise to very large graphs that are not easily subdividable.
The XMT2 is showing potential as a platform for providing analytical services on large semantic data sets.
Over the next 12 months we will build a variety of services and test their utility and responsiveness.
Analysis of semantic graphs requires support for logical queries of SPARQL and analytical methods that calculate other graph properties.
Backup Slide: Augmentation
Why add more data to an already large set?
For example, Medicare collects vast amounts of claims data.
Researchers can use it to evaluate the effectiveness of procedures or drugs.
But the format makes it difficult to explore the data in medically meaningful ways.
UMLS Knowledge Model:
Antibacterials
  Penicillins
    Amoxicillin
    Ampicillin
  Cephalosporin
Tabular Claims Data:
Patient ID  Age  Drug Code
125454      65   229
224377      77   634
986904      82   229
774826      66   551
223556      71   634
394857      70   551
675849      65   551
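One way to see what augmentation buys is to attach a drug hierarchy to the claims rows, so that a query can ask about a whole drug class. The claims rows are from the slide. The code-to-drug mapping is hypothetical, since the slide does not say what codes 229, 634, and 551 mean, and the parent links reflect one reading of the hierarchy shown.

```python
# Claims rows from the slide: (patient ID, age, drug code).
claims = [
    (125454, 65, 229), (224377, 77, 634), (986904, 82, 229), (774826, 66, 551),
    (223556, 71, 634), (394857, 70, 551), (675849, 65, 551),
]

# Hypothetical mapping from drug code to drug name.
DRUG_BY_CODE = {229: "Amoxicillin", 634: "Ampicillin", 551: "Cephalosporin"}

# Child -> parent links in the knowledge model.
PARENT = {
    "Amoxicillin": "Penicillins",
    "Ampicillin": "Penicillins",
    "Penicillins": "Antibacterials",
    "Cephalosporin": "Antibacterials",
}

def ancestors(term):
    """Return the term followed by all of its ancestors in the hierarchy."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def patients_taking(drug_class):
    """Patients whose drug falls anywhere under the given class."""
    return sorted(
        pid for pid, _age, code in claims
        if drug_class in ancestors(DRUG_BY_CODE[code])
    )

print(patients_taking("Penicillins"))
print(len(patients_taking("Antibacterials")))
```

The raw table can only be filtered by drug code. With the model attached, the same rows answer class-level questions such as “who received any penicillin?”.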