Realizing Semantic Web: Lightweight Semantics and Beyond
Krishnaprasad Thirunarayan (T. K. Prasad)
Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing
Wright State University, Dayton, OH-45435
Outline
• Domain Goals and Challenges
• Cyberinfrastructure Investments in Science
• Utility and Continuum of Machine-Processable Semantics: An Architecture
• What?: Nature of Data and Granularity of Semantics
• Why?: Lightweight semantics and its benefits
• How?: Community-ratified Ontologies + Semantic Annotations of Data and Documents + Linked Open Materials Data
• Research: Processing Tabular Data
Domain Goals and Challenges
• Sharing, discovery, and application of materials science and engineering data and information are possible only if domain scientists are able and willing to share.
• Technological challenges
– Computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data
• Cultural challenges
– Proper protection, control, and credit for sharing data
Category of Geoscience Data: Characteristics, Strategy for Reuse, CI Strategy
• Short tail science: data created by large organizations and projects
– Characteristics: Few, large (TB+), structured, spatially rich (e.g., remote sensing), largely homogeneous, highly visible, curated
– Strategy for Reuse: Planned integration strategies; could use formal ontologies / domain models and vocabularies, visualization tools and APIs
– CI Strategy: Data centers / grids generally using relational databases and files, maintained by people with significant IT skills
• Long tail science: data created by individual scientists and small groups
– Characteristics: Many, small (GB+), heterogeneous, invisible (except via publications), poorly curated
– Strategy for Reuse: Multi-domain and broad vocabularies (including community established ones); create semantic metadata (annotations) and optionally publish; search and download legacy data, or use an open data initiative
– CI Strategy: Web-based, easy to learn and use semantic tools for annotation, publication, search and download that can be used by individual scientists without significant IT skills
Our Thesis
Associating machine-processable semantics with materials science and engineering data and documents can help overcome challenges associated with data discovery, integration and interoperability caused by data heterogeneity.
What?: Nature of Data and Documents
• Structured Data (e.g., relational)
• Semi-structured, Heterogeneous Documents (e.g., publications and technical specs usually include text, numerics, units of measure, images and equations)
• Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)
Fragment of Materials and Process spec for Ti Alloy Bars, Wire, Forgings, and Rings.
What?: Granularity of Semantics and Applications: Examples
• Synonyms
– Chemistry, Chemical Composition, Chemical Analysis, ...
– Bend Test, Bending, ...
– Delivery Condition, Process/Surface Finish, Temper, "as received by purchaser", ...
• Coreference vs broadening/narrowing
– Tubing vs welded tubing vs flash-welded part
• Capturing characteristic-value pairs
– Recognize and Normalize: “0.1 inch and under in nominal thickness” is translated to “Thickness <= 0.1 in”.
– Glean elided characteristic: controlled term “solution heat treated” implies the characteristic “heat treat type”.
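The recognize-and-normalize step above can be sketched in a few lines of Python. This is only an illustration, not the extraction tool's actual logic: the regular expression, the function names, and the treatment of "and over" as ">=" are assumptions, and the pattern covers only phrases shaped like the slide's example.

```python
import re

# Assumed phrase shape: "<number> inch and under|over in nominal <characteristic>".
_PATTERN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>inch|in)\s+and\s+"
    r"(?P<dir>under|over)\s+in\s+nominal\s+(?P<char>\w+)",
    re.IGNORECASE,
)

# "under" -> "<=" follows the slide; "over" -> ">=" is an assumption.
_OPERATORS = {"under": "<=", "over": ">="}

# Controlled terms that imply an unstated characteristic (entry taken from the slide).
_IMPLIED = {"solution heat treated": "heat treat type"}


def normalize(phrase):
    """Turn a free-text limit into 'Characteristic <op> value in', or None if unrecognized."""
    match = _PATTERN.search(phrase)
    if match is None:
        return None
    characteristic = match.group("char").capitalize()
    operator = _OPERATORS[match.group("dir").lower()]
    return f"{characteristic} {operator} {match.group('value')} in"


def implied_characteristic(term):
    """Look up the characteristic that a controlled term leaves unstated."""
    return _IMPLIED.get(term.strip().lower())


print(normalize("0.1 inch and under in nominal thickness"))
print(implied_characteristic("solution heat treated"))
```

A real specification would need many more patterns (other units, ranges, tolerances), which is one reason the approach described here is semi-automatic rather than fully automatic.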
What?: Granularity of Semantics and Associated Applications
• Lightweight semantics: File and document-level annotation to enable discovery and sharing
• Richer semantics: Data-level annotation and extraction for semantic search and summarization
• Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Materials Science Data
Computer Assisted Document Extraction Tool
Figure: typical view of the tagged spec, alongside the tree/structure view of the spec.
Few More Examples: Procedure Melt Methods
Figure: view of the original spec and of the tagged spec (labels on the slide: TagEditor, The SDL).
Why?: Benefits of Lightweight Semantics
• Ease of use by domain experts
– Faster and wider adoption, promoting evolution
• Low upfront cost to support
• Shallow semantics has wider applicability to a range of documents/data and appeals to a broader community of geoscientists
• Bottom line: “Learn to Walk before we Run”
How?: Using Semantic Web Technologies
Machine-processable semantics is achieved by addressing:
• Syntactic Heterogeneity: Using XML syntax and the RDF data model (labelled graph structure)
• Semantic Heterogeneity:
– Using “common” controlled vocabularies, taxonomies and ontologies
– Using federated data sources, exchanges, querying, and services
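As a rough illustration of the two kinds of heterogeneity, the Python sketch below maps source-specific labels onto a shared controlled term and then emits subject-property-value triples, the shape of an RDF labelled graph. The synonym sets come from the earlier "Synonyms" slide; the canonical term names (`ChemicalComposition`, `BendTest`), the function names, and the sample values are hypothetical.

```python
# Hypothetical controlled vocabulary: source label (lowercased) -> canonical term.
CONTROLLED_VOCAB = {
    "chemistry": "ChemicalComposition",
    "chemical composition": "ChemicalComposition",
    "chemical analysis": "ChemicalComposition",
    "bend test": "BendTest",
    "bending": "BendTest",
}


def to_controlled_term(label):
    """Resolve a source-specific label to its canonical term, or None if unknown."""
    return CONTROLLED_VOCAB.get(label.strip().lower())


def to_triples(record_id, record):
    """Emit (subject, property, value) triples for every label the vocabulary knows."""
    triples = []
    for label, value in record.items():
        term = to_controlled_term(label)
        if term is not None:
            triples.append((record_id, term, value))
    return triples


# Two sources using different labels end up sharing one property name.
print(to_triples("spec-A", {"Chemistry": "see Table 1"}))
print(to_triples("spec-B", {"Chemical Analysis": "see Section 3"}))
```

Labels the vocabulary does not know are silently skipped here; a real pipeline would more likely queue them for manual review.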
How?: Ingredients for Semantics-based Cyber Infrastructure
• Use of community-ratified controlled vocabularies and lightweight ontologies (upper-level, hierarchies)
• Ease registration, publishing, and discovery
• Provide support for provenance and access control
• Track data citation for credit for data sharing
• Semi-automatic annotation of data and documents: Manual + Automatic
How?: Search Continuum
• Keyword-based full-text search
• + Manually provided content and source metadata
– Uses upper-level ontology
• + Automatically extracted metadata
– Map text to concepts/properties/values
– Semantic + faceted search using background knowledge
• + Deeper semi-automatic content annotation and extraction
– Aggregating related pieces of information; conditioning
– Integration and Interoperation
• + Linked Open Materials Science Data
• + Federated and Faceted Querying and Services
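The first steps of this continuum can be contrasted with a toy Python sketch: keyword search looks only at the text, while faceted search filters on metadata attached to each document. The documents, metadata keys, and function names below are invented for illustration only.

```python
# Invented sample documents; "meta" stands in for manually provided or extracted metadata.
DOCS = [
    {"id": "doc-1", "text": "Titanium alloy bars, solution heat treated",
     "meta": {"type": "text", "format": "unstructured"}},
    {"id": "doc-2", "text": "Bend test results for welded tubing",
     "meta": {"type": "excel file", "format": "structured"}},
    {"id": "doc-3", "text": "Heat treated wire, bend test pending",
     "meta": {"type": "text", "format": "structured"}},
]


def keyword_search(docs, query):
    """Full-text step: keep documents whose text contains every query word."""
    words = query.lower().split()
    return [d["id"] for d in docs if all(w in d["text"].lower() for w in words)]


def faceted_search(docs, **facets):
    """Metadata step: keep documents whose metadata matches every requested facet."""
    return [d["id"] for d in docs
            if all(d["meta"].get(k) == v for k, v in facets.items())]


def search(docs, query="", **facets):
    """Combine both steps: keyword hits narrowed by facets."""
    by_facet = set(faceted_search(docs, **facets))
    return [doc_id for doc_id in keyword_search(docs, query) if doc_id in by_facet]


print(keyword_search(DOCS, "heat treated"))
print(search(DOCS, "bend test", type="text"))
```

The later steps (concept mapping, integration, federated querying) need background knowledge and are not captured by this sketch.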
Linked Open Data – Why do we need data?
Linked Open Data – Just data is not enough
• More and more data are available, but …
Isolated islands of data are not enough; they are akin to a web of documents without hyperlinks.
Figure: two panels, each showing datasets A, B, C, D, E, and F.
Need to interlink data over the web to enable content-rich applications.
Linked Data
Linked Open Data – A Realization
Figure: RDF graph. Labels on the diagram:
• Resources: http://dbpedia../John_F._Kennedy, http://dbpedia../politician, http://dbpedia../Massachusetts, http://dbpedia../United_States, http://dbpedia../Boston, http://ex./John_Kennedy, http://ex./A_Nation_of_Immigrants, http://ex./non-fiction
• Properties: owl:sameAs, http://dbpedia../Profession, http://dbpedia../BirthPlace, http://dbpedia../BirthDate, http://dbpedia../Country, http://dbpedia../Capital, http://ex./AuthoredBook, http://ex./publishedIn, http://ex./genre
• Literals: 1917-05-29, 1964
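To show why the owl:sameAs link matters, here is a small Python sketch that stores the graph as triples and answers a question neither dataset can answer alone. The transcript preserves only the labels, not the edges, so the triples below are one plausible reading of the diagram, and the `dbp:` and `ex:` prefixes are hypothetical shorthand for the truncated URIs.

```python
# Assumed edges; "dbp:" and "ex:" are shorthand, not the real namespace URIs.
TRIPLES = {
    ("ex:John_Kennedy", "ex:AuthoredBook", "ex:A_Nation_of_Immigrants"),
    ("ex:A_Nation_of_Immigrants", "ex:publishedIn", "1964"),
    ("ex:A_Nation_of_Immigrants", "ex:genre", "ex:non-fiction"),
    ("ex:John_Kennedy", "owl:sameAs", "dbp:John_F._Kennedy"),
    ("dbp:John_F._Kennedy", "dbp:Profession", "dbp:politician"),
    ("dbp:John_F._Kennedy", "dbp:BirthPlace", "dbp:Massachusetts"),
    ("dbp:John_F._Kennedy", "dbp:BirthDate", "1917-05-29"),
    ("dbp:Massachusetts", "dbp:Country", "dbp:United_States"),
    ("dbp:Massachusetts", "dbp:Capital", "dbp:Boston"),
}


def aliases(node, triples):
    """Return the node plus every name reachable through owl:sameAs, in either direction."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def objects(subject, predicate, triples):
    """Values of a property for a subject, looking across all of its sameAs aliases."""
    names = aliases(subject, triples)
    return sorted(o for s, p, o in triples if s in names and p == predicate)


# Birth date of the book's author: needs the ex: dataset and the dbp: dataset together.
authors = sorted(s for s, p, o in TRIPLES
                 if p == "ex:AuthoredBook" and o == "ex:A_Nation_of_Immigrants")
print(authors, objects(authors[0], "dbp:BirthDate", TRIPLES))
```

Without the sameAs triple the last lookup returns an empty list, which is the "isolated islands" situation in miniature.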
Linked Open Data
“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
Example: Lightweight Semantic Registration of Data
• Title of data
• Keywords: selected from the five-tier vocabulary provided
• Type of data: maps, Excel files, images, text
• Data format: structured or unstructured
• Description of data: brief unstructured description of content
• Contact information of provider(s): name of provider(s), email for verification, lineage
• Spatial extent of data and reference system: location
• Temporal extent of data: date range in time or age range if not recent
• Date and type of Related Publication(s): Journal, Thesis, Agency report, not published
• Host site for publication: Journal, Library, Personal computer
• Access restrictions: copyright regulations
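A registration record of this kind could be captured with a small Python data class. The field names mirror the list above, but the class itself, the choice of which fields are required, and the validation rules are assumptions made for this sketch, not part of the system described.

```python
from dataclasses import dataclass

# From the "Data format" entry above.
ALLOWED_FORMATS = {"structured", "unstructured"}


@dataclass
class DataRegistration:
    # Which fields are mandatory is an assumption of this sketch.
    title: str
    keywords: list
    data_type: str
    data_format: str
    description: str
    provider_name: str
    provider_email: str
    spatial_extent: str = ""
    temporal_extent: str = ""
    related_publication: str = "not published"
    host_site: str = ""
    access_restrictions: str = ""


def validate(reg):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    if not reg.title.strip():
        problems.append("title is required")
    if not reg.keywords:
        problems.append("at least one keyword is required")
    if reg.data_format not in ALLOWED_FORMATS:
        problems.append("data format must be structured or unstructured")
    if "@" not in reg.provider_email:
        problems.append("provider email looks invalid")
    return problems


# Hypothetical record, for illustration only.
record = DataRegistration(
    title="Example dataset",
    keywords=["example keyword"],
    data_type="excel files",
    data_format="structured",
    description="Brief description of content",
    provider_name="A. Provider",
    provider_email="provider@example.org",
)
print(validate(record))
```

Keeping the record flat and mostly free-text fits the lightweight approach: a provider can fill it in without any ontology training.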
System Architecture and Components
Problems and a Practical Approach (“When rubber meets the road”)
Deeper Issues: Semantic Formalization of Tabular Data
Nature of tables
• Compact structures for sharing information
– Minimize duplication
• Types of Tables
– Regular: Dense grid with explicit schema information in terms of column and row headings => Tractable
– Irregular: Sparse grid with implicit schema and ad hoc placement of headings => Hard
Challenges Associated with Typical Spreadsheet/Table
• Meant for human consumption
• Irregular
– Not a simple rectangular grid
• Heterogeneous
– Not all rows are interpreted similarly
• Complex
– Meaning of each row and each column is context dependent
• Footnotes modify meaning of entries (esp. in materials and process specifications)
Practical Semi-Automatic Content Extraction
• DESIGN: Develop regular data structures that can be used to formalize tabular information.
– Provide a natural expression of data
– Provide semantics to data, thereby removing potential ambiguities
– Enable automatic translation
• USE: Manual population of regular tables and automatic translation into LOD
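The USE step can be sketched as follows: once a table is regular (one heading row, one identifying column, one value per cell), translating it into triples is mechanical. This Python sketch assumes CSV input, a hypothetical `ex:` prefix, and made-up sample rows; it is not the translator the authors built.

```python
import csv
import io


def table_to_triples(csv_text, base="ex:"):
    """Translate a regular table into (row subject, column property, cell value) triples.

    Assumes the first column identifies the row and the first line holds the headings.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    key = reader.fieldnames[0]
    triples = []
    for row in reader:
        subject = base + row[key]
        for column in reader.fieldnames[1:]:
            value = row[column]
            if value:  # skip empty or missing cells
                triples.append((subject, base + column, value))
    return triples


# Made-up sample table, for illustration only.
SAMPLE = (
    "id,thickness_in,heat_treat_type\n"
    "bar-1,0.1,solution heat treated\n"
    "bar-2,0.25,\n"
)

for triple in table_to_triples(SAMPLE):
    print(triple)
```

Irregular tables with footnotes or context-dependent rows break the one-cell-one-triple assumption, which is why they are first re-entered manually into a regular structure.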
Thank you, and please visit us at
http://knoesis.org/