KESeDa: Knowledge Extraction from Heterogeneous Semi-Structured Data Sources

WWW.LEDS-PROJEKT.DE

KNOWLEDGE EXTRACTION FROM HETEROGENEOUS SEMI-STRUCTURED DATA SOURCES

MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE

16. September 2016

CURRENT SITUATION

• knowledge on the Web is often available only as weakly interlinked, heterogeneous, semi-structured data

→ no semantic classification
• how to link or merge data?
• how to do semantic queries?

→ not usable in a meaningful way


GOAL

Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data

→ provide an automatic process


THE KESEDA APPROACH

• Especially designed to work on JSON data

• Challenges when working with JSON data

→ no schema, only name-value pairs

→ any structure and depth possible


{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "michael.krug@informatik.tu-chemnitz.de",
  [...]


{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]


Arrays

Objects

• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities


1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results


PROTOTYPE


• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology

• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings



• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing

PROTOTYPE: CONFIGURATION

PROTOTYPE: RESULTS

EVALUATION


Algorithm applied to datasets of

1) JSON array of people

2) JSON array of publications

a) Without custom pre-configuration

b) With custom pre-configuration


Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations

Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns

1A) PEOPLE W/O CONFIG

1B) PEOPLE W/ CONFIG

2A) PUBLICATIONS W/O CONFIG

2B) PUBLICATIONS W/ CONFIG

SUMMARY


➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples

OPEN CHALLENGES

• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning


THANK YOU!

MICHAEL.KRUG@INFORMATIK.TU-CHEMNITZ.DE

VSR.INFORMATIK.TU-CHEMNITZ.DE


1. Differentiation of input sources / formats

• text, file, URL, API
• check for format

• optional conversion of XML to JSON
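Step 1 can be sketched as follows. This is a minimal illustration in Python, not the Node.js prototype's code; the function names `detect_format` and `xml_to_dict` are hypothetical, and the XML conversion is deliberately lossy (attributes and repeated tags are ignored).

```python
import json
import xml.etree.ElementTree as ET


def detect_format(text: str) -> str:
    """Return 'json', 'xml', or 'unknown' for a raw input string."""
    stripped = text.strip()
    try:
        json.loads(stripped)
        return "json"
    except ValueError:
        pass
    try:
        ET.fromstring(stripped)
        return "xml"
    except ET.ParseError:
        return "unknown"


def xml_to_dict(element):
    """Simplified XML-to-JSON conversion (assumption: no attributes, no repeated tags)."""
    children = list(element)
    if not children:
        return element.text
    return {child.tag: xml_to_dict(child) for child in children}


print(detect_format('{"id": "krug"}'))
print(xml_to_dict(ET.fromstring("<person><id>krug</id></person>")))
```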

2. Preparation of data structure

• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing

• clean up: remove empty entries
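One possible way to prepare the tree, sketched in Python under assumptions: each node is wrapped in a small record with an empty `matches` list for later steps, the original nesting is kept, and empty entries are dropped. The record layout (`kind`, `matches`, `properties`, `items`, `value`) is hypothetical.

```python
def is_empty(value):
    """Treat None, empty strings, and empty containers as empty entries."""
    return value is None or value == "" or value == [] or value == {}


def prepare(node):
    """Recursively drop empty entries and wrap every node with a slot for matches."""
    if isinstance(node, dict):
        props = {}
        for key, value in node.items():
            child = prepare(value)
            if child is not None:
                props[key] = child
        if not props:
            return None
        return {"kind": "object", "matches": [], "properties": props}
    if isinstance(node, list):
        items = [c for c in (prepare(v) for v in node) if c is not None]
        if not items:
            return None
        return {"kind": "array", "matches": [], "items": items}
    if is_empty(node):
        return None
    return {"kind": "value", "matches": [], "value": node}


tree = prepare({"id": "krug", "fax": "", "tags": [], "author": ["Michael Krug", None]})
print(sorted(tree["properties"]))
```

Objects and arrays get their own `kind`, so later steps can process them separately while the hierarchy stays intact.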

3. Analysis of property labels

• string matching (substrings, prefixes, …)

• synonyms

• pre-defined mappings
• use metadata from API description, if available
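The label analysis could look roughly like this. The predicate list is a tiny excerpt of schema.org properties, the synonym table is made up for the example, and the weights (1.0, 0.9, 0.8, 0.5) are illustrative assumptions, not values from the prototype.

```python
PREDICATES = [
    "schema:givenName",
    "schema:familyName",
    "schema:email",
    "schema:telephone",
    "schema:name",
]
# Hypothetical synonym list: normalized label -> predicate local name.
SYNONYMS = {"firstname": "givenName", "lastname": "familyName", "phone": "telephone"}


def normalize(text):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_label(label, predefined=None):
    """Return (predicate, weight, source) candidates, best first."""
    predefined = predefined or {}
    matches = []
    if label in predefined:
        matches.append((predefined[label], 1.0, "predefined"))
    key = normalize(label)
    synonym = normalize(SYNONYMS.get(key, ""))
    for predicate in PREDICATES:
        local = normalize(predicate.split(":", 1)[1])
        if key == local:
            matches.append((predicate, 0.9, "exact"))
        elif synonym and synonym == local:
            matches.append((predicate, 0.8, "synonym"))
        elif key in local or local in key:
            matches.append((predicate, 0.5, "substring"))
    return sorted(matches, key=lambda m: m[1], reverse=True)


print(match_label("firstName"))
print(match_label("title", {"title": "schema:honorificPrefix"}))
```

All candidates are kept with their weights instead of picking one immediately, so the class mapping step can still choose among them.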

4. Analysis of property values

• dictionaries

• structure patterns (uri, date, address, color…)

• data types (date, time, number, boolean…)
  • (lower weighted)
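A sketch of the value analysis, assuming a tiny sample dictionary and three simplified regular expressions. The pattern set, the result labels, and the weights are hypothetical; data type matches get the lowest weight, as noted above.

```python
import re

FIRST_NAMES = {"michael", "martin", "frank"}  # tiny sample dictionary

# Simplified structure patterns; real-world URIs, emails, and dates are more varied.
PATTERNS = [
    ("uri", re.compile(r"^https?://\S+$")),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")),
    ("date", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def match_value(value):
    """Return (kind, weight, source) candidates for a single property value."""
    matches = []
    if isinstance(value, bool):
        matches.append(("boolean", 0.2, "datatype"))
    elif isinstance(value, (int, float)):
        matches.append(("number", 0.2, "datatype"))
    elif isinstance(value, str):
        if value.lower() in FIRST_NAMES:
            matches.append(("firstName", 0.7, "dictionary"))
        for name, pattern in PATTERNS:
            if pattern.match(value):
                matches.append((name, 0.6, "pattern"))
        if value.isdigit():
            matches.append(("number", 0.2, "datatype"))
    return matches


print(match_value("Michael"))
print(match_value("http://www.www2015.it/"))
```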

5. Mapping of classes

• find class by number of matched properties

• select match that is most appropriate for chosen class

• take different weights into account
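The class mapping can be illustrated as a weighted vote. In this sketch, each class is scored by the best-fitting candidate per property, and the highest-scoring class determines which candidate is finally used. The `CLASS_PROPERTIES` table is a small hand-picked excerpt and the scoring rule (sum of weights) is an assumption.

```python
CLASS_PROPERTIES = {
    "schema:Person": {
        "schema:givenName", "schema:familyName", "schema:email", "schema:name",
    },
    "schema:CreativeWork": {
        "schema:name", "schema:author", "schema:datePublished",
    },
}


def map_class(candidates):
    """candidates: {label: [(predicate, weight), ...]} -> (class, {label: predicate})."""
    best_class, best_score, best_mapping = None, 0.0, {}
    for cls, allowed in CLASS_PROPERTIES.items():
        mapping = {}
        for label, matches in candidates.items():
            # Keep only candidates that are valid properties of this class.
            fitting = [m for m in matches if m[0] in allowed]
            if fitting:
                mapping[label] = max(fitting, key=lambda m: m[1])
        score = sum(weight for _, weight in mapping.values())
        if score > best_score:
            best_class, best_score, best_mapping = cls, score, mapping
    return best_class, {label: m[0] for label, m in best_mapping.items()}


candidates = {
    "firstName": [("schema:givenName", 0.8), ("schema:name", 0.5)],
    "lastName": [("schema:familyName", 0.8), ("schema:name", 0.5)],
    "title": [("schema:name", 0.5)],
}
print(map_class(candidates))
```

Here `schema:Person` wins because more properties match with higher weights, and `firstName` resolves to `schema:givenName` rather than the weaker `schema:name` candidate.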

6. Generate JSON-LD document

• use matches and mappings

• link entities depending on JSON tree structure

• validation of output
• optional conversion to various RDF formats
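A minimal sketch of the JSON-LD generation, assuming the class and the label-to-predicate mapping have already been chosen. The function `to_jsonld`, the `ex:` namespace, and the `ex:event` predicate are hypothetical placeholders; validation and conversion to other RDF formats are not shown.

```python
import json

CONTEXT = {"schema": "http://schema.org/", "ex": "http://example.org/"}


def to_jsonld(obj, cls, mapping, nested=None):
    """Build a JSON-LD node; nested maps a label to (class, mapping) for child objects."""
    nested = nested or {}
    doc = {"@context": CONTEXT, "@type": cls}
    for label, value in obj.items():
        predicate = mapping.get(label)
        if predicate is None:
            continue  # unmapped properties are skipped in this sketch
        if isinstance(value, dict) and label in nested:
            child_cls, child_mapping = nested[label]
            child = to_jsonld(value, child_cls, child_mapping)
            child.pop("@context")  # the context is only needed on the root node
            doc[predicate] = child
        else:
            doc[predicate] = value
    return doc


publication = {
    "title": "SmartComposition: ...",
    "year": "2015",
    "event": {
        "name": "24th International World Wide Web Conference",
        "url": "http://www.www2015.it/",
    },
}
doc = to_jsonld(
    publication,
    "schema:ScholarlyArticle",
    {"title": "schema:name", "event": "ex:event"},
    {"event": ("schema:Event", {"name": "schema:name", "url": "schema:url"})},
)
print(json.dumps(doc, indent=2))
```

The nested `event` object becomes its own typed node, so the link between the two entities follows the JSON tree structure.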

7. Evaluation of results

• manual or automatic comparison of actual vs. desired result to reweight matching components

• store correctly applied mappings for later reuse
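The comparison of actual and desired results could be automated along these lines. Precision and recall are used here as an assumption; the slides do not state which measures were applied, and the reweighting itself is not shown.

```python
def evaluate(actual, desired):
    """Compare {label: predicate} mappings; return (precision, recall, correct mappings)."""
    correct = {
        label: predicate
        for label, predicate in actual.items()
        if desired.get(label) == predicate
    }
    precision = len(correct) / len(actual) if actual else 0.0
    recall = len(correct) / len(desired) if desired else 0.0
    return precision, recall, correct


actual = {
    "firstName": "schema:givenName",
    "title": "schema:name",
    "phone": "schema:telephone",
}
desired = {
    "firstName": "schema:givenName",
    "title": "schema:honorificPrefix",
    "phone": "schema:telephone",
    "email": "schema:email",
}
precision, recall, correct = evaluate(actual, desired)
print(precision, recall, correct)
```

The `correct` dictionary is the part worth storing: it can be fed back in as pre-defined mappings on the next run.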
