WWW.LEDS-PROJEKT.DE
LEDS
KNOWLEDGE EXTRACTION FROM HETEROGENEOUS
SEMI-STRUCTURED DATA SOURCES
MARTIN SEIDEL, MICHAEL KRUG, FRANK BURIAN, MARTIN GAEDKE
16. September 2016
CURRENT SITUATION
• knowledge on the Web is often available only as weakly interlinked, heterogeneous, semi-structured data
→ no semantic classification
  • how to link or merge data?
  • how to do semantic queries?
→ not usable in a meaningful way
GOAL
Extraction of knowledge from semi-structured data
• knowledge in terms of semantic metadata
• semantically enriched data can then utilize the potential of Linked Data
→ provide an automatic process
THE KESEDA APPROACH
• Especially designed to work on JSON data
• Challenges when working with JSON data
  → no schema, only name-value pairs
  → any structure and depth possible
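A minimal sketch of what "any structure and depth" means in practice: without a schema, a processor can only discover the name-value pairs by walking the parsed tree. The function name `walk` and the path representation are illustrative assumptions, not part of KESeDa (the prototype itself is written in Node.js; Python is used here only for brevity).

```python
import json

def walk(node, path=()):
    """Yield (path, value) for every leaf of a parsed JSON document.

    Objects contribute their property names to the path,
    arrays contribute the element index.
    """
    if isinstance(node, dict):
        for name, value in node.items():
            yield from walk(value, path + (name,))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from walk(item, path + (index,))
    else:
        yield path, node

# Shortened version of the publication record shown on the slides.
doc = json.loads(
    '{"title": "SmartComposition: ...",'
    ' "author": ["Michael Krug", "Martin Gaedke"],'
    ' "event": {"name": "24th International World Wide Web Conference"}}'
)
leaves = list(walk(doc))
for path, value in leaves:
    print(path, "=", value)
```

The recursion makes no assumption about nesting depth, so the same function handles flat records and deeply nested ones.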
THE KESEDA APPROACH
{
  "id": "krug",
  "firstName": "Michael",
  "lastName": "Krug",
  "title": "Dipl.-Inf.",
  "phone": "+49 371 531 39929",
  "email": "michael.krug@informatik.tu-chemnitz.de",
  [...]
}
THE KESEDA APPROACH
{
  "id": "2015-007",
  "title": "SmartComposition: ...",
  "author": [ "Michael Krug", "Martin Gaedke" ],
  "year": "2015",
  "type": "Conference Paper",
  "event": {
    "name": "24th International World Wide Web Conference",
    "url": "http://www.www2015.it/"
  },
  [...]
}
Slide callouts: "author" is an array; "event" is a nested object.
THE KESEDA APPROACH
• multi-step algorithm
• work in existing JSON structure
• find and store various matches with different weights
• use additional information sources like API descriptions
• assign classes to objects with multiple properties
• link detected entities
THE KESEDA APPROACH
1. Differentiation of input sources / formats
2. Preparation of data structure
3. Analysis of property labels
4. Analysis of property values
5. Mapping of classes
6. Generate JSON-LD document
7. Evaluation of results
PROTOTYPE
• prototype implemented in Node.js
• working with properties and classes from:
  • schema.org
  • foaf
  • dublincore
  • goodrelations
  • music ontology
• dictionaries for: first & last names, cities, streets, languages
• list of manually curated synonyms
• option to provide pre-defined mappings
PROTOTYPE
• Web interface for
  • pre-configuration
    • mappings, synonyms, dictionaries
  • data upload
  • result analysis
    • statistics and browsing
EVALUATION
Algorithm applied to datasets of
1) JSON array of people
2) JSON array of publications
a) Without custom pre-configuration
b) With custom pre-configuration
EVALUATION
Initial Setup
• dictionary and structure pattern matching
• label → predicate string matching
• classes and properties: schema.org, foaf, dublincore, goodrelations
Custom Pre-Configuration
• set of label → predicate mappings (hand-picked for data context)
• list of known synonyms
• more structure patterns
SUMMARY
➙ Approach for extracting knowledge from semi-structured data
➙ by applying a multi-step algorithm
➙ to convert JSON data to RDF
➙ that assigns known classes to objects and maps their properties to S-P-O triples
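To make "S-P-O triples" concrete, the sketch below flattens an already mapped entity into (subject, predicate, object) tuples. It assumes the entity is a dictionary with an optional "@type" key and predicate names as the remaining keys; the function name `to_triples` and the blank-node naming scheme are illustrative assumptions, not the authors' implementation.

```python
from itertools import count

def to_triples(entity, ids=None):
    """Flatten a mapped entity into (subject, predicate, object) triples.

    Every object gets a fresh blank-node identifier; nested objects are
    linked to their parent through the predicate under which they appear.
    Returns (subject_id, triples).
    """
    if ids is None:
        ids = (f"_:b{i}" for i in count())
    subject = next(ids)
    triples = []
    for predicate, value in entity.items():
        if predicate == "@type":
            triples.append((subject, "rdf:type", value))
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                child_subject, child_triples = to_triples(item, ids)
                triples.append((subject, predicate, child_subject))
                triples.extend(child_triples)
            else:
                triples.append((subject, predicate, item))
    return subject, triples

# Hypothetical mapping result for the person record from the slides.
person = {"@type": "foaf:Person", "foaf:firstName": "Michael", "foaf:lastName": "Krug"}
subject, triples = to_triples(person)
for triple in triples:
    print(triple)
```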
OPEN CHALLENGES
• detect and reuse JSON structure patterns
• disambiguate values
• apply quality control to results
• improve scalability for large datasets
• research application of machine learning
THANK YOU!
MICHAEL.KRUG@INFORMATIK.TU-CHEMNITZ.DE
VSR.INFORMATIK.TU-CHEMNITZ.DE
WWW.LEDS-PROJEKT.DE
THE KESEDA APPROACH
1. Differentiation of input sources / formats
• text, file, URL, API
• check for format
• optional conversion of XML to JSON
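The format check and the optional XML conversion could look roughly like the sketch below for raw text input. The function names `load_text` and `element_to_dict` and the lossy conversion rules (attributes ignored, repeated tags collected into lists) are assumptions made for illustration; the slides do not specify how the prototype converts XML.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Hypothetical, lossy XML-to-JSON mapping: attributes are ignored,
    repeated child tags become lists, leaf elements become their text."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Second and later occurrences of a tag turn the entry into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

def load_text(text):
    """Return (format, data) for raw text; JSON is tried first, then XML."""
    try:
        return "json", json.loads(text)
    except json.JSONDecodeError:
        pass
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        raise ValueError("input is neither valid JSON nor well-formed XML")
    return "xml", {root.tag: element_to_dict(root)}

print(load_text('{"id": "krug"}'))
print(load_text("<person><id>krug</id><phone>1</phone><phone>2</phone></person>"))
```

File, URL and API sources would first be read into text and then passed through the same check.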
THE KESEDA APPROACH
2. Preparation of data structure
• pre-process JSON tree to store matches and mappings
• keep original structure to preserve hierarchy for later relations
• detect arrays and objects for separate processing
• clean up: remove empty entries
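One possible shape for the prepared tree, as a sketch: each node is wrapped in a record that keeps the original nesting, marks whether it is an object, array or plain value, and carries an empty list for matches found in later steps. The record layout ("kind", "properties", "items", "matches") and the definition of "empty" (null, empty string, empty array or object) are assumptions, not taken from the prototype.

```python
def prepare(node):
    """Return an annotated copy of a parsed JSON tree with empty entries removed.

    Returns None when the node itself is empty, so parents can drop it.
    """
    if isinstance(node, dict):
        properties = {}
        for name, value in node.items():
            child = prepare(value)
            if child is not None:
                properties[name] = child
        if not properties:
            return None
        return {"kind": "object", "properties": properties, "matches": []}
    if isinstance(node, list):
        items = [child for child in (prepare(value) for value in node) if child is not None]
        if not items:
            return None
        return {"kind": "array", "items": items, "matches": []}
    if node is None or node == "":
        return None
    return {"kind": "value", "value": node, "matches": []}

tree = prepare({
    "id": "krug",
    "title": "",                       # empty string: removed
    "tags": [],                        # empty array: removed
    "event": {"name": "WWW", "url": None},
})
print(sorted(tree["properties"]))      # remaining top-level properties
```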
THE KESEDA APPROACH
3. Analysis of property labels
• string matching (substrings, prefixes, …)
• synonyms
• pre-defined mappings
• use metadata from API description, if available
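A sketch of how label analysis might rank candidate predicates. The vocabulary excerpt, the synonym table, the pre-defined mapping and all weight values are invented for illustration; the slides only state that matches of different kinds carry different weights.

```python
import re

# Hypothetical configuration; the real prototype loads full vocabularies.
VOCABULARY = ["schema:givenName", "schema:familyName", "schema:email",
              "schema:telephone", "schema:name"]
SYNONYMS = {"firstname": "givenname", "lastname": "familyname", "phone": "telephone"}
PREDEFINED = {"id": "schema:identifier"}
WEIGHTS = {"predefined": 1.0, "exact": 0.9, "synonym": 0.8, "substring": 0.5}

def normalize(label):
    """Lower-case a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())

def match_label(label):
    """Return candidate (predicate, weight, reason) tuples, best first."""
    key = normalize(label)
    if not key:
        return []
    candidates = []
    if label in PREDEFINED:
        candidates.append((PREDEFINED[label], WEIGHTS["predefined"], "predefined"))
    synonym = SYNONYMS.get(key)
    for predicate in VOCABULARY:
        local = normalize(predicate.split(":", 1)[1])
        if key == local:
            candidates.append((predicate, WEIGHTS["exact"], "exact"))
        elif synonym == local:
            candidates.append((predicate, WEIGHTS["synonym"], "synonym"))
        elif key in local or local in key:
            candidates.append((predicate, WEIGHTS["substring"], "substring"))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

for label in ("id", "firstName", "email", "phone"):
    print(label, "->", match_label(label))
```

Keeping every candidate rather than only the best one leaves the final choice to the class-mapping step, which can prefer the predicate that fits the selected class.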
THE KESEDA APPROACH
4. Analysis of property values
• dictionaries
• structure patterns (uri, date, address, color…)
• data types (date, time, number, boolean…)
  • (lower weighted)
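Value analysis could combine the three sources of evidence as sketched below: dictionary lookups, regular-expression structure patterns, and plain data-type checks with a lower weight. The dictionary excerpt, the patterns and the weight values are assumptions chosen for the example.

```python
import re
from datetime import datetime

# Hypothetical excerpts; real dictionaries and patterns would be far larger.
FIRST_NAMES = {"michael", "martin", "frank"}
PATTERNS = [
    ("uri", re.compile(r"^https?://\S+$"), 0.7),
    ("email", re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"), 0.7),
    ("color", re.compile(r"^#[0-9a-fA-F]{6}$"), 0.7),
]
DICTIONARY_WEIGHT = 0.8
TYPE_WEIGHT = 0.3  # data-type evidence is weighted lower

def is_iso_date(text):
    """True if text parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def analyze_value(value):
    """Return (hint, weight) pairs for a single JSON value, best first."""
    matches = []
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        matches.append(("boolean", TYPE_WEIGHT))
    elif isinstance(value, (int, float)):
        matches.append(("number", TYPE_WEIGHT))
    elif isinstance(value, str):
        if value.lower() in FIRST_NAMES:
            matches.append(("firstName", DICTIONARY_WEIGHT))
        for name, pattern, weight in PATTERNS:
            if pattern.match(value):
                matches.append((name, weight))
        if is_iso_date(value):
            matches.append(("date", TYPE_WEIGHT))
        elif value.isdigit():
            matches.append(("number", TYPE_WEIGHT))
    return sorted(matches, key=lambda m: m[1], reverse=True)

for sample in ("Michael", "http://www.www2015.it/", "2015", True):
    print(repr(sample), "->", analyze_value(sample))
```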
THE KESEDA APPROACH
5. Mapping of classes
• find class by number of matched properties
• select match that is most appropriate for chosen class
• take different weights into account
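A sketch of class selection under simple assumptions: every class lists the predicates it accepts, each label keeps its best candidate that fits the class, and the class with the highest summed weight wins. The class table is a small hypothetical excerpt, and summing weights is only one plausible way to "take different weights into account".

```python
# Hypothetical excerpt of class definitions (class -> accepted predicates).
CLASS_PROPERTIES = {
    "schema:Person": {"schema:givenName", "schema:familyName", "schema:email",
                      "schema:telephone", "schema:name"},
    "schema:Organization": {"schema:name", "schema:email", "schema:telephone",
                            "schema:url"},
    "schema:Event": {"schema:name", "schema:url", "schema:startDate"},
}

def map_class(candidates):
    """Pick the best class for one JSON object.

    candidates maps each property label to a list of (predicate, weight)
    pairs from the label and value analysis.
    Returns (class_name, {label: predicate}, score).
    """
    best = (None, {}, 0.0)
    for class_name, allowed in CLASS_PROPERTIES.items():
        mapping, score = {}, 0.0
        for label, options in candidates.items():
            fitting = [option for option in options if option[0] in allowed]
            if fitting:
                # Select the match most appropriate for this class.
                predicate, weight = max(fitting, key=lambda option: option[1])
                mapping[label] = predicate
                score += weight
        if score > best[2]:
            best = (class_name, mapping, score)
    return best

candidates = {
    "firstName": [("schema:givenName", 0.8), ("schema:name", 0.5)],
    "lastName": [("schema:familyName", 0.8), ("schema:name", 0.5)],
    "email": [("schema:email", 0.9)],
}
class_name, mapping, score = map_class(candidates)
print(class_name, mapping, round(score, 2))
```

In this example the weaker "schema:name" candidates still let Organization and Event score, but Person wins because it accepts the higher-weighted predicates for all three labels.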
THE KESEDA APPROACH
6. Generate JSON-LD document
• use matches and mappings
• link entities depending on JSON tree structure
• validation of output
• optional conversion to various RDF formats
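The generation step can be sketched as a recursive serialization in which nested objects become linked entities under the predicate chosen for their parent property. The input layout ("type" plus "properties"), the function name `to_jsonld` and the linking predicate `ex:presentedAt` in an example namespace are assumptions for illustration; validation and conversion to other RDF formats are not covered here.

```python
import json

def to_jsonld(node, context):
    """Serialize a mapped tree as a JSON-LD document.

    node is {"type": class name or None, "properties": {predicate: value}};
    values may be literals, lists, or nested nodes of the same shape.
    """
    def convert(value):
        if isinstance(value, dict):
            return build(value)  # nested object becomes a linked entity
        if isinstance(value, list):
            return [convert(item) for item in value]
        return value

    def build(current):
        out = {}
        if current.get("type"):
            out["@type"] = current["type"]
        for predicate, value in current["properties"].items():
            out[predicate] = convert(value)
        return out

    return {"@context": context, **build(node)}

# Hypothetical mapping result for the publication record from the slides.
mapped = {
    "type": "schema:ScholarlyArticle",
    "properties": {
        "schema:name": "SmartComposition: ...",
        "schema:author": ["Michael Krug", "Martin Gaedke"],
        "ex:presentedAt": {  # placeholder predicate, not a real vocabulary term
            "type": "schema:Event",
            "properties": {
                "schema:name": "24th International World Wide Web Conference",
                "schema:url": "http://www.www2015.it/",
            },
        },
    },
}
context = {"schema": "http://schema.org/", "ex": "http://example.org/vocab#"}
document = to_jsonld(mapped, context)
print(json.dumps(document, indent=2))
```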