Data Curation @ SpazioDati - NEXA Lunch Seminar
Transcript of Data Curation @ SpazioDati - NEXA Lunch Seminar
Matteo Brunati @dagoneye
22/07/2015
Data Curation @SpazioDati
33rd Nexa Lunch Seminar: http://nexa.polito.it/lunch-33
a lot of European projects: http://www.spaziodati.eu/en/#research
Data Curation? https://www.google.com/search?q=data+curation&ie=utf-8&oe=utf-8
Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.
http://strataconf.com/stratany2014/public/schedule/detail/36021
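A minimal sketch of that definition, assuming two made-up sources and a field mapping that a domain expert might provide. All names, fields and values below are hypothetical and only illustrate the idea of turning independent sources into one unified data set.

```python
# Two hypothetical, independently created sources describing the same companies.
registry_rows = [
    {"vat": "IT00000000001", "name": "ACME S.R.L. "},
    {"vat": "IT00000000002", "name": "Example SpA"},
]
web_rows = [
    {"partita_iva": "it00000000001", "homepage": "http://acme.example"},
]

# Mapping rules a domain expert might supply: source field -> unified field.
FIELD_MAPS = {
    "registry": {"vat": "vat_number", "name": "name"},
    "web": {"partita_iva": "vat_number", "homepage": "website"},
}


def normalise(source, row):
    """Rename fields to the unified schema and tidy string values."""
    out = {}
    for field, value in row.items():
        target = FIELD_MAPS[source].get(field)
        if target is None:
            continue  # field not covered by the expert's mapping
        value = value.strip()
        if target == "vat_number":
            value = value.upper()  # same key format across sources
        out[target] = value
    return out


def unify(sources):
    """Merge rows from all sources into one record per VAT number."""
    merged = {}
    for source, rows in sources.items():
        for row in rows:
            record = normalise(source, row)
            merged.setdefault(record["vat_number"], {}).update(record)
    return merged


unified = unify({"registry": registry_rows, "web": web_rows})
```

Here the VAT number acts as the join key; in real data the expert's guidance is usually needed exactly where such a clean key does not exist.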
a lot of things involved
* ETL (Extract-Transform-Load) tools
* Data Science tools
* Linked Data tools
* Big Data tools
* Domain Knowledge
why do we need a data curation process?
it’s our mantra: ALL YOU NEED IS DATA
accessible for everyone :)
lat 00° 00’ 00” -> GPS -> Smartphones -> UI iPhone / Android
we are building two products
<Powered by SpazioDati> codename
2014
2015
data platform
Why a knowledge graph?
Our Entity Extraction API is based on a graph
[Figure: weighted graph connecting the entities Brussels, Paris, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium and Champ de Mars; edge weights shown: 0.42, 0.80, 0.43, 0.53, 0.53, 0.53, 0.63, 0.59, 0.44, 0.44]
https://dandelion.eu/docs/api/datatxt/nex/v1/
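A rough sketch of how a weighted entity graph could help pick the right entity for an ambiguous mention. This is not the Dandelion implementation: the node names come from the slide, but which weight belongs to which edge cannot be recovered from it, so the pairings below are illustrative assumptions, and `relatedness` and `best_candidate` are hypothetical helpers.

```python
# Undirected relatedness graph. Edge/weight pairings are ASSUMED for
# illustration; only the node names are taken from the slide.
EDGES = {
    ("Paris", "Eiffel Tower"): 0.80,
    ("Paris", "Champ de Mars"): 0.63,
    ("Eiffel Tower", "Champ de Mars"): 0.59,
    ("Brussels", "King Baudouin Stadium"): 0.53,
    ("Berlin", "2009 World Championships in Athletics"): 0.53,
    ("Paris", "Brussels"): 0.44,
    ("Paris", "Berlin"): 0.43,
}


def relatedness(a, b):
    """Weight of the edge between a and b, or 0.0 if they are not linked."""
    return EDGES.get((a, b), EDGES.get((b, a), 0.0))


def best_candidate(candidates, context):
    """Pick the candidate with the highest total relatedness to the context."""
    return max(candidates, key=lambda c: sum(relatedness(c, e) for e in context))


# A text that also mentions these two entities points towards one city.
choice = best_candidate(
    ["Paris", "Brussels", "Berlin"],
    ["Eiffel Tower", "Champ de Mars"],
)
```

The candidate that is best connected to the other entities found in the text wins, which is the intuition behind using a graph for extraction.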
CONTEXTUAL DATA
* different sources
* different semantics
* companies, people, Wikipedia topics, POI…
* simple to query on traversals
* global statistics
why a knowledge graph
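To make "simple to query on traversals" and "global statistics" concrete, here is a small in-memory sketch. The node identifiers, edge labels and types are hypothetical; a real deployment would run such queries inside a graph database rather than over Python lists.

```python
from collections import Counter, deque

# Hypothetical nodes of different kinds living in the same graph.
NODES = {
    "company:acme": "Company",
    "person:rossi": "Person",
    "topic:Robotics": "WikipediaTopic",
    "poi:trento_station": "POI",
}
# Hypothetical labelled edges: (source, label, target).
EDGES = [
    ("person:rossi", "works_for", "company:acme"),
    ("company:acme", "mentions", "topic:Robotics"),
    ("company:acme", "located_near", "poi:trento_station"),
]


def neighbours(node):
    """Yield nodes linked to `node`, ignoring edge direction."""
    for source, _label, target in EDGES:
        if source == node:
            yield target
        elif target == node:
            yield source


def traverse(start, max_hops):
    """Breadth-first traversal; returns {node: number of hops from start}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


reach = traverse("person:rossi", max_hops=2)  # what is near this person?
type_counts = Counter(NODES.values())         # a trivial global statistic
```

One traversal crosses from a person to a company, a topic and a point of interest without any per-source join logic.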
let’s start with some details on the “Powered by SpazioDati” data platform…
http://blog.spaziodati.eu/en/2014/10/21/spaziodati-at-iswc-2014-visit-our-booth-research-plans-available/
“Powered by SpazioDati” data platform backstage
PWR-BY-SD
Tools involved
OpenRefine
Azkaban Open Source Workflow Manager: https://azkaban.github.io/
Apache Silk: The Linked Data Integration Framework
Titan graph db
Apache Cassandra
http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/
starting from OpenRefine to clean up the data easily, for example:
* reconcile and clean up the data
* align the data model to our internal ontologies, using RDF skeletons
* export the RDF modelled using our rules
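The steps above can be sketched in plain Python, outside OpenRefine. This is only an illustration under stated assumptions: the `http://example.org/` namespace, the column names and the choice of predicates are hypothetical and are not SpazioDati's internal ontology or rules, and the reconciliation step is left out.

```python
# Hypothetical base namespace for generated subjects.
BASE = "http://example.org/"

# A stand-in for an "RDF skeleton": which column maps to which predicate.
SKELETON = {
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "website": "http://xmlns.com/foaf/0.1/homepage",
}
IRI_COLUMNS = {"website"}  # columns whose values are IRIs, not literals


def clean(row):
    """Drop empty cells and collapse stray whitespace."""
    return {k: " ".join(v.split()) for k, v in row.items() if v and v.strip()}


def escape(literal):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(row):
    """Turn one cleaned row into N-Triples lines following the skeleton."""
    row = clean(row)
    subject = f"<{BASE}company/{row['id']}>"
    lines = []
    for column, predicate in SKELETON.items():
        if column not in row:
            continue  # nothing to export for this column
        value = row[column]
        obj = f"<{value}>" if column in IRI_COLUMNS else f'"{escape(value)}"'
        lines.append(f"{subject} <{predicate}> {obj} .")
    return lines


triples = to_ntriples({"id": "42", "name": "  ACME   S.r.l. ", "website": ""})
```

Keeping the column-to-predicate mapping in one place is what lets the same export rules be reused across data sets.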
in other words…
Rexster: JSON-based REST interface to Titan
Our internal ontology: a sample
★ ~5.9 million companies
★ >10 million persons
900k
updated weekly
★ Weekly web crawl of the Italian corporate
★ Real-time data collection from company social accounts
★ ~1600 online & offline newspapers (updated daily)
updated weekly
Search: how it works
1. Direct search of one particular company through its name or “partita IVA” (VAT number)
2. Content search within company websites [*]
3. Keyword search among extracted and refined entities from company resources [*]
[*] Dandelion API is the extraction engine
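The three search modes can be pictured with a toy in-memory index. The records, field names and matching rules below are hypothetical; the real product presumably relies on a search engine and on entities extracted by the Dandelion API rather than on hand-written lists.

```python
import re

# Hypothetical company records; "entities" stands in for extracted entities.
COMPANIES = [
    {
        "name": "ACME S.r.l.",
        "vat": "00000000001",
        "website_text": "We build industrial robots in Trento.",
        "entities": ["Robotics", "Trento"],
    },
    {
        "name": "Example SpA",
        "vat": "00000000002",
        "website_text": "Organic wine from Tuscany.",
        "entities": ["Wine", "Tuscany"],
    },
]


def direct_search(query):
    """Mode 1: exact VAT number match, or substring match on the name."""
    q = query.strip().lower()
    return [c for c in COMPANIES if q == c["vat"] or q in c["name"].lower()]


def content_search(term):
    """Mode 2: whole-word match inside the text crawled from the website."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [c for c in COMPANIES if pattern.search(c["website_text"])]


def entity_search(entity):
    """Mode 3: match against entities extracted from company resources."""
    return [c for c in COMPANIES if entity in c["entities"]]
```

Mode 3 is the one that benefits from entity extraction: a query for "Robotics" finds a company whose site only ever says "robots".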
Corporate page
atoka.io
Some details on
• Five main “types”:
– Company
– Person
– Site
– Administrative Division
– Website
our infrastructure to crawl the Web for ATOKA
other details
Cerved • Company • People • Site • Position+Share
ISTAT • AdminDiv
ES
DBPedia • Company
cluster computing
something really interesting on OpenRefine
OpenRefine as usual
OpenRefine on Spark
it rocks! :)
more background details on http://blog.spaziodati.eu/wp-content/uploads/2015/07/RefineOnSpark.pdf
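The general idea of running row-wise cleanup operations in parallel can be sketched as follows. This is neither Spark nor the RefineOnSpark code: it uses a local thread pool as a stand-in for a cluster, and the operations and field names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def trim(row):
    """Strip surrounding whitespace from every cell."""
    return {k: v.strip() for k, v in row.items()}


def upper_vat(row):
    """Normalise the (hypothetical) 'vat' column to upper case."""
    return {**row, "vat": row["vat"].upper()}


# An ordered list of row-wise operations, similar in spirit to a
# recorded cleanup history that is replayed on new data.
OPERATIONS = [trim, upper_vat]


def apply_operations(partition):
    """Replay every operation, in order, on each row of one partition."""
    out = []
    for row in partition:
        for operation in OPERATIONS:
            row = operation(row)
        out.append(row)
    return out


def split(rows, n):
    """Cut rows into at most n contiguous chunks, preserving order."""
    size = max(1, -(-len(rows) // n))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_parallel(rows, n_partitions=2):
    """Process partitions concurrently and stitch the results back together."""
    partitions = split(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(apply_operations, partitions))
    return [row for part in results for row in part]


cleaned = run_parallel([{"vat": " it1 "}, {"vat": "it2"}, {"vat": " it3"}])
```

Because each operation touches one row at a time, partitions never need to talk to each other, which is what makes this kind of workload a natural fit for cluster computing.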
References
1) From raw data to dataGEMs for developers - http://ceur-ws.org/Vol-1268/paper1.pdf
2) Knowledge Graph ovunque - http://www.slideshare.net/dagoneye/knowledge-graphs-ovunque-un-quadro-di-insieme-e-le-implicazioni-per-uno-sviluppo-condiviso-del-web-of-data
3) Linking Enterprise Data - https://www.springer.com/it/book/9781441976642
4) Using OpenRefine - https://www.packtpub.com/big-data-and-business-intelligence/using-openrefine
5) Why Your Business Needs A Customer Data Knowledge Graph - http://www.dataversity.net/business-needs-customer-data-knowledge-graph/
6) Enabling parallel processing for OpenRefine: Spark vs Akka - http://refinepro.com/blog/enabling-parallel-processing-for-openrefine-spark-vs-akka/