Realizing Semantic Web - Light Weight semantics and beyond

Realizing Semantic Web: Lightweight Semantics and Beyond

Krishnaprasad Thirunarayan (T. K. Prasad)

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing

Wright State University, Dayton, OH-45435

1

Outline

• Domain Goals and Challenges

• Cyberinfrastructure Investments in Science

• Utility and Continuum of Machine-Processable Semantics: An Architecture

• What?: Nature of Data and Granularity of Semantics

• Why?: Lightweight semantics and its benefits

• How?: Community-ratified Ontologies

+ Semantic Annotations of Data and Documents

+ Linked Open Materials Data

• Research: Processing Tabular Data

2

Domain Goals and Challenges

• Sharing, discovery, and application of materials science and engineering data and information are possible only if domain scientists are able and willing to share.

• Technological challenges

– Computational tools and repositories conducive to easy exchange, curation, attribution, and analysis of data

• Cultural challenges

– Proper protection, control, and credit for sharing data

3

| Category of Geoscience Data | Characteristics | Strategy for Reuse | CI Strategy |
|---|---|---|---|
| Short tail science data created by large organizations and projects | Few, large (TB+), structured, spatially rich (e.g., remote sensing), largely homogeneous, highly visible, curated | Planned integration strategies, could use formal ontologies / domain models and vocabularies, visualization tools and APIs | Data centers / grids generally using relational databases and files, maintained by people with significant IT skills |
| Long tail science data created by individual scientists and small groups | Many, small (GB+), heterogeneous, invisible (except via publications), poorly curated | Multi-domain and broad vocabularies (including community established ones), create semantic metadata (annotations) and optionally publish, search and download legacy data, or use an open data initiative | Web-based easy to learn and use semantic tools for annotation, publication, search and download that can be used by individual scientists without significant IT skills |

4

Our Thesis

Associating machine-processable semantics with materials science and engineering data and documents can help overcome challenges associated with data discovery, integration and interoperability caused by data heterogeneity.

5

What?: Nature of Data and Documents

• Structured Data (e.g., relational)

• Semi-structured, Heterogeneous Documents (e.g., publications and technical specs usually include text, numerics, units of measure, images and equations)

• Tabular data (e.g., ad hoc spreadsheets and complex tables incorporating “irregular” entries)

6

Fragment of Materials and Process spec for Ti Alloy Bars, Wire, Forgings, and Rings.

7

What?: Granularity of Semantics and Applications: Examples

• Synonyms

– Chemistry, Chemical Composition, Chemical Analysis, ...

– Bend Test, Bending, ...

– Delivery Condition, Process/Surface Finish, Temper, "as received by purchaser", ...

• Coreference vs broadening/narrowing

– Tubing vs welded tubing vs flash-welded part

• Capturing characteristic-value pairs

– Recognize and Normalize: “0.1 inch and under in nominal thickness” is translated to “Thickness <= 0.1 in”.

– Glean elided characteristic: controlled term “solution heat treated” implies the characteristic “heat treat type”.
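The three kinds of semantics above (synonyms, normalized characteristic-value pairs, elided characteristics) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup tables, the regular expression, and the function names are hypothetical, and reading "and over" as ">=" is an assumption that does not come from the slides.

```python
import re

# Hypothetical synonym table: surface phrases mapped to one controlled term.
SYNONYMS = {
    "chemistry": "Chemical Composition",
    "chemical composition": "Chemical Composition",
    "chemical analysis": "Chemical Composition",
    "bend test": "Bend Test",
    "bending": "Bend Test",
}

# Hypothetical map from a controlled value to the characteristic it implies.
IMPLIED_CHARACTERISTIC = {"solution heat treated": "heat treat type"}

# Matches phrases such as "0.1 inch and under in nominal thickness".
_BOUND = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>inch|in)\b\.?\s+and\s+"
    r"(?P<dir>under|over)\s+in\s+(?:nominal\s+)?(?P<char>\w+)",
    re.IGNORECASE,
)


def canonical_term(phrase):
    """Return the controlled term for a phrase, or the phrase itself if unknown."""
    cleaned = phrase.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def normalize_bound(text):
    """Rewrite '<value> inch and under/over in [nominal] <characteristic>'
    as a constraint string, or return None when the pattern is absent."""
    m = _BOUND.search(text)
    if m is None:
        return None
    # Assumption: "under" means <= and "over" means >=.
    op = "<=" if m.group("dir").lower() == "under" else ">="
    return f"{m.group('char').capitalize()} {op} {m.group('value')} in"


def implied_pair(term):
    """Return (characteristic, value) when a controlled term implies its characteristic."""
    key = term.strip().lower()
    if key in IMPLIED_CHARACTERISTIC:
        return (IMPLIED_CHARACTERISTIC[key], key)
    return None
```

A real extractor would need many more patterns and a curated vocabulary; the point here is only that each bullet corresponds to a small, separable normalization step.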

8

What?: Granularity of Semantics and Associated Applications

• Lightweight semantics: File and document-level annotation to enable discovery and sharing

• Richer semantics: Data-level annotation and extraction for semantic search and summarization

• Fine-grained semantics: Data integration, interoperability and reasoning in Linked Open Materials Science Data

9

Computer Assisted Document Extraction Tool

10

Typical view of the tagged Spec

Tree/Structure view of the Spec


11

Few More Examples: Procedure Melt Methods

View of the Original Spec

Tagged Spec

TagEditor


12

Few More Examples: Procedure Melt Methods

TagEditor

The SDL

Why?: Benefits of Lightweight Semantics

• Ease of use by domain experts

– Faster and wider adoption, promoting evolution

• Low upfront cost to support

• Shallow semantics has wider applicability to a range of documents/data and appeal to a broader community of geoscientists

• Bottom-line: “Learn to Walk before we Run”

13

How?: Using Semantic Web Technologies

Machine-processable semantics achieved by addressing

• Syntactic Heterogeneity: Using XML syntax and RDF data model (labelled graph structure)

• Semantic Heterogeneity:

– Using “common” controlled vocabularies, taxonomies and ontologies

– Using federated data sources, exchanges, querying, and services
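As a rough illustration of the two kinds of heterogeneity, the sketch below flattens records from two sources into one labelled-graph form (subject, property, value triples) and renames local field names to a shared vocabulary. It uses plain Python tuples rather than an RDF library; the vocabulary, the `ex:` identifiers, and the sample values are made up for the example.

```python
# Hypothetical common vocabulary: local field names mapped to shared property names.
COMMON_VOCAB = {
    "chemistry": "chemicalComposition",
    "chemical analysis": "chemicalComposition",
    "temper": "deliveryCondition",
    "delivery condition": "deliveryCondition",
}


def to_triples(subject, record):
    """Flatten one record (a dict) into (subject, property, value) triples,
    renaming each local field to its common-vocabulary property when known."""
    triples = set()
    for field_name, value in record.items():
        prop = COMMON_VOCAB.get(field_name.strip().lower(), field_name)
        triples.add((subject, prop, value))
    return triples


def query(graph, prop):
    """Return sorted (subject, value) pairs for one property across all sources."""
    return sorted((s, o) for (s, p, o) in graph if p == prop)


# Two made-up sources stating the same kind of fact under different field names.
source_a = {"Chemistry": "Ti-6Al-4V", "Temper": "annealed"}
source_b = {"Chemical Analysis": "Ti-6Al-4V", "Delivery Condition": "annealed"}

# One uniform graph: the syntactic differences are gone once everything is a triple,
# and the semantic differences are gone once field names share a vocabulary.
graph = to_triples("ex:spec1", source_a) | to_triples("ex:spec2", source_b)
```

Once both sources share one property name, a single call such as `query(graph, "chemicalComposition")` reaches both of them, which is the practical payoff of agreeing on a vocabulary.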

14

How?: Ingredients for Semantics-based Cyber Infrastructure

• Use of community-ratified controlled vocabularies and lightweight ontologies (upper-level, hierarchies)

• Ease registration, publishing, and discovery

• Provide support for provenance and access control

• Track data citation for credit for data sharing

• Semi-automatic annotation of data and documents: Manual + Automatic

15

How?: Search Continuum

• Keyword-based full-text search

• + Manually provided content and source metadata

– Uses upper-level ontology

• + Automatically extracted metadata

– Map text to concepts/properties/values

– Semantic + faceted search using background knowledge

• + Deeper semi-automatic content annotation and extraction

– Aggregating related pieces of information; conditioning

– Integration and Interoperation

• + Linked Open Material Science Data

• + Federated and Faceted Querying and Services
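The first steps of this continuum, from keyword search to faceted search over metadata, might look like the following sketch. The records, facet names, and function names are all hypothetical stand-ins for annotated documents and are not taken from any system described in the talk.

```python
from collections import Counter

# Made-up metadata records standing in for annotated documents.
RECORDS = [
    {"id": "doc1", "type": "spec", "material": "titanium alloy",
     "text": "bend test for welded tubing"},
    {"id": "doc2", "type": "spec", "material": "steel",
     "text": "chemical composition limits"},
    {"id": "doc3", "type": "publication", "material": "titanium alloy",
     "text": "heat treat study of forgings"},
]


def search(records, keyword=None, **facets):
    """Return records whose text contains the keyword (if given) and whose
    metadata equals every requested facet value."""
    hits = []
    for rec in records:
        if keyword and keyword.lower() not in rec["text"].lower():
            continue
        if all(rec.get(name) == value for name, value in facets.items()):
            hits.append(rec)
    return hits


def facet_counts(records, facet):
    """Count how many records fall under each value of one facet."""
    return Counter(rec[facet] for rec in records if facet in rec)
```

Keyword-only search is the call with no facets; each added facet corresponds to one more piece of manually provided or automatically extracted metadata.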

16

Linked Open Data – Why do we need data?

17

Linked Open Data – Just data is not enough

• More and more data are available, but …

18

Isolated islands of data are not enough, akin to the web of documents without hyperlinks.

Figure: datasets A to F as isolated islands

Need to interlink data over the web to enable content-rich applications.

Figure: datasets A to F interlinked as Linked Data

Linked Open Data – A Realization

19

Figure: example RDF graph. Labels recoverable from the figure:

– Resources: http://ex./John_Kennedy, http://dbpedia../John_F._Kennedy, http://dbpedia../politician, http://ex./A_Nation_of_Immigrants, http://ex./non-fiction, http://dbpedia../Massachusetts, http://dbpedia../Boston, http://dbpedia../United_States

– Properties: owl:sameAs, http://ex./AuthoredBook, http://ex./publishedIn, http://ex./genre, http://dbpedia../Profession, http://dbpedia../BirthPlace, http://dbpedia../BirthDate, http://dbpedia../Country, http://dbpedia../Capital

– Literal values: 1917-05-29, 1964
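To show why the owl:sameAs link matters, here is a minimal sketch that merges what two datasets say about the same person. The triples are one plausible reading of the figure's labels, not a verified copy of it, and the abbreviated URIs are kept exactly as they appear; the function names are hypothetical.

```python
SAME_AS = "owl:sameAs"

# One plausible reading of the figure; abbreviated URIs are kept verbatim.
TRIPLES = [
    ("http://ex./John_Kennedy", "http://ex./AuthoredBook",
     "http://ex./A_Nation_of_Immigrants"),
    ("http://ex./John_Kennedy", SAME_AS, "http://dbpedia../John_F._Kennedy"),
    ("http://dbpedia../John_F._Kennedy", "http://dbpedia../BirthDate", "1917-05-29"),
    ("http://dbpedia../John_F._Kennedy", "http://dbpedia../Profession",
     "http://dbpedia../politician"),
]


def canonical_ids(triples):
    """Build a union-find over sameAs links and return its find() function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)
    return find


def describe(resource, triples):
    """Collect (property, value) pairs for a resource and for every
    resource linked to it through sameAs."""
    find = canonical_ids(triples)
    root = find(resource)
    return sorted((p, o) for s, p, o in triples
                  if p != SAME_AS and find(s) == root)
```

Asking about either URI then yields the same merged description: the book from one dataset together with the birth date and profession from the other.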

Linked Open Data

20

“Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”

Example: Lightweight Semantic Registration of Data

21

• Title of data

• Keywords: selected from five tier vocabulary provided

• Type of data: maps, Excel files, images, text

• Data format: structured or unstructured

• Description of data: brief unstructured description of content

• Contact information of provider(s): name of provider(s), email for verification, lineage

• Spatial extent of data and reference system: location

• Temporal extent of data: date range in time or age range if not recent

• Date and type of Related Publication(s): Journal, Thesis, Agency report, not published

• Host site for publication: Journal, Library, Personal computer

• Access restrictions: copyright regulations
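A registration form of this kind maps naturally onto a small record type with a few checks. The sketch below is an assumption-laden illustration: the field names paraphrase the form, while the choice of which fields are mandatory, the class and method names, and the sample vocabulary are all hypothetical.

```python
from dataclasses import dataclass

# Values allowed for the "Data format" field, as listed on the form.
DATA_FORMATS = {"structured", "unstructured"}


@dataclass
class DataRegistration:
    """One lightweight registration record; optional fields default to empty."""
    title: str
    keywords: list
    data_type: str
    data_format: str
    description: str
    contact: str
    spatial_extent: str = ""
    temporal_extent: str = ""
    related_publication: str = ""
    host_site: str = ""
    access_restrictions: str = ""

    def problems(self, vocabulary):
        """Return a list of human-readable issues; an empty list means the record passes.
        Assumption: title, vocabulary-checked keywords, and data format are mandatory."""
        issues = []
        if not self.title.strip():
            issues.append("title is empty")
        unknown = [k for k in self.keywords if k not in vocabulary]
        if unknown:
            issues.append("keywords not in vocabulary: " + ", ".join(unknown))
        if self.data_format not in DATA_FORMATS:
            issues.append("data_format must be structured or unstructured")
        return issues


# Made-up vocabulary and record for illustration only.
VOCABULARY = {"titanium", "forging", "heat treatment"}
record = DataRegistration(
    title="Forging trial results",
    keywords=["titanium", "forging"],
    data_type="excel files",
    data_format="structured",
    description="Brief description of the content",
    contact="provider@example.org",
)
```

Keeping the check to a handful of fields is what makes the registration lightweight: a provider can fill it in quickly, yet the keywords remain tied to a shared vocabulary.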

System Architecture and Components

22

Problems and A Practical Approach

(“When rubber meets the road”)

Deeper Issues: Semantic Formalization of Tabular Data

23


Nature of tables

• Compact structures for sharing information

– Minimize duplication

• Types of Tables

– Regular: Dense Grid with explicit schema information in terms of column and row headings => Tractable

– Irregular: Sparse Grid with implicit schema and ad hoc placement of heading => Hard

24

Challenges Associated with Typical Spreadsheet/Table

• Meant for human consumption

• Irregular

– Not simple rectangular grid

• Heterogeneous

– All rows not interpreted similarly

• Complex

– Meaning of each row and each column is context dependent

• Footnotes modify meaning of entries (esp. in materials and process specifications)

26

Practical Semi-Automatic Content Extraction

• DESIGN: Develop regular data structures that can be used to formalize tabular information.

– Provide a natural expression of data

– Provide semantics to data, thereby removing potential ambiguities

– Enable automatic translation

• USE: Manual population of regular tables and automatic translation into LOD
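The USE step, automatic translation of a manually populated regular table, can be sketched as follows. The table contents, the `ex:` prefix, and the function name are made up for illustration; a real pipeline would emit proper RDF with agreed URIs rather than prefixed strings.

```python
import csv
import io

# Made-up regular table: one header row, one row per specimen, no merged cells.
TABLE = """specimen,thickness_in,heat_treat_type
bar-1,0.10,solution heat treated
bar-2,0.25,annealed
"""


def table_to_triples(text, key_column, base="ex:"):
    """Translate a regular table into (subject, property, value) triples.
    The key column names the subject; every other non-empty cell becomes one triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row[key_column]
        for column, value in row.items():
            # Skip the key itself, overflow cells (column is None), and empty cells.
            if column is None or column == key_column or not value:
                continue
            triples.append((subject, base + column, value))
    return triples
```

The translation is this simple only because the table is regular: the header row is the explicit schema, so each cell's meaning is fixed by its column, which is exactly what irregular tables lack.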

27

28

Thank you, and please visit us at

http://knoesis.org/

Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing

Wright State University, Dayton, Ohio, USA
