Data Curation @ SpazioDati - NEXA Lunch Seminar

Post on 16-Aug-2015

Matteo Brunati @dagoneye

22/07/2015

Data Curation @SpazioDati

33rd Nexa Lunch Seminar: http://nexa.polito.it/lunch-33

Big Data - Linked Data - Machine Learning

spaziodati.eu

a lot of European projects: http://www.spaziodati.eu/en/#research

Data Curation? https://www.google.com/search?q=data+curation&ie=utf-8&oe=utf-8


Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.

http://strataconf.com/stratany2014/public/schedule/detail/36021
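As a rough illustration of the definition above, the sketch below merges two independently created sources into one unified record set. The source names, field names and sample rows are hypothetical, and the field mappings stand in for the rules a domain expert would supply.

```python
# Hypothetical rows from two independently created sources.
registry_rows = [{"vat": "01234567890", "ragione_sociale": "ACME S.R.L. "}]
web_rows = [{"vat_number": "01234567890", "name": "Acme srl", "website": "acme.example"}]

# Field mappings a domain expert would provide (assumed for illustration).
FIELD_MAPS = {
    "registry": {"vat": "vat_id", "ragione_sociale": "name"},
    "web": {"vat_number": "vat_id", "name": "name", "website": "website"},
}


def normalize(source, row):
    """Rename a row's fields to the unified schema and trim whitespace."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: str(v).strip() for k, v in row.items() if k in mapping}


def unify(*labelled_sources):
    """Merge rows sharing the same vat_id; the first source to set a field wins."""
    unified = {}
    for source, rows in labelled_sources:
        for row in rows:
            record = normalize(source, row)
            merged = unified.setdefault(record["vat_id"], {})
            for field, value in record.items():
                merged.setdefault(field, value)
    return list(unified.values())


companies = unify(("registry", registry_rows), ("web", web_rows))
print(companies)
```

Letting the first source win is only one possible conflict policy; a real curation process would typically make that choice per field.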

a lot of things involved

* ETL (Extract-Transform-Load) tools
* Data Science tools
* Linked Data tools
* Big Data tools
* Domain Knowledge

why do we need a data curation process?

it’s our mantra: ALL YOU NEED IS DATA

accessible for everyone :)

lat 00° 00’ 00” -> GPS -> Smartphones -> UI iPhone / Android

we are building two products

Dandelion API: Text Analytics as a service

www.dandelion.eu

Sales Intelligence

www.atoka.io

<Powered by SpazioDati> codename

2014

2015

data platform

Why a knowledge graph?

Our Entity Extraction API is based on a graph

[Figure: entity graph with the nodes Brussels, Paris, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium and Champ de Mars, connected by weighted edges (weights between 0.42 and 0.80)]

https://dandelion.eu/docs/api/datatxt/nex/v1/
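To give an idea of why a weighted entity graph helps entity extraction, here is a minimal sketch that picks, among candidate entities for an ambiguous mention, the one most related to the other entities found in the text. This is not Dandelion's actual algorithm. The weights are made up for illustration (they are not the edge weights of the slide's figure), and "Paris (mythology)" is a hypothetical competing candidate.

```python
# Relatedness between pairs of entities. Values are illustrative only.
RELATEDNESS = {
    frozenset({"Paris", "Eiffel Tower"}): 0.80,
    frozenset({"Paris", "Champ de Mars"}): 0.63,
    frozenset({"Paris (mythology)", "Eiffel Tower"}): 0.05,
}


def relatedness(a, b):
    """Return the edge weight between two entities, or 0.0 if they are not linked."""
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def disambiguate(candidates, context):
    """Pick the candidate with the highest total relatedness to the context entities."""
    return max(candidates, key=lambda c: sum(relatedness(c, e) for e in context))


best = disambiguate(
    candidates=["Paris", "Paris (mythology)"],
    context=["Eiffel Tower", "Champ de Mars"],
)
print(best)
```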

CONTEXTUAL DATA

* different sources, different semantics
* companies, people, Wikipedia topics, POI…
* simple to query on traversals
* global statistics

why a knowledge graph

let’s start with some details on the “Powered by SpazioDati” data platform…

http://blog.spaziodati.eu/en/2014/10/21/spaziodati-at-iswc-2014-visit-our-booth-research-plans-available/

“Powered by SpazioDati” data platform backstage

PWR-BY-SD

Tools involved

* OpenRefine
* Azkaban: Open Source Workflow Manager (https://azkaban.github.io/)
* Apache Silk: The Linked Data Integration Framework
* Titan graph db
* Apache Cassandra

http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/

starting from OpenRefine to clean up the data easily, for example

* reconcile and clean up the data
* align the data model to our internal ontologies, using RDF skeletons
* export the RDF modelled using our rules
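The three steps above can be sketched in plain Python as a stand-in for what OpenRefine does interactively. The namespaces, the predicate names and the column-to-predicate "skeleton" below are hypothetical and not SpazioDati's internal ontology. The output is a list of N-Triples lines.

```python
BASE = "http://example.org/"            # hypothetical resource namespace
ONT = "http://example.org/ontology#"    # hypothetical ontology namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# "Skeleton": which column feeds which predicate (assumed for illustration).
SKELETON = {"name": ONT + "legalName", "city": ONT + "city"}


def clean(value):
    """Collapse runs of whitespace and trim the value."""
    return " ".join(value.split())


def literal(value):
    """Quote a string as an N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_ntriples(row):
    """Turn one cleaned row into N-Triples lines following the skeleton."""
    subject = f"<{BASE}company/{row['id']}>"
    lines = [f"{subject} <{RDF_TYPE}> <{ONT}Company> ."]
    for column, predicate in SKELETON.items():
        value = clean(row.get(column, ""))
        if value:  # skip empty cells instead of emitting empty literals
            lines.append(f"{subject} <{predicate}> {literal(value)} .")
    return lines


triples = row_to_ntriples({"id": "42", "name": "  Acme   srl ", "city": "Trento"})
print("\n".join(triples))
```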

in other words…

Rexster: JSON-based REST interface to Titan
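A minimal sketch of how a client might address vertices through Rexster's REST interface. The host and graph name are assumptions (8182 is Rexster's usual default port), and the URL layout follows the Rexster REST conventions as I recall them, so check it against the Rexster documentation before relying on it. The `fetch` helper is not called here because it needs a running server.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

REXSTER = "http://localhost:8182"  # assumed host and port
GRAPH = "graph"                    # hypothetical graph name


def vertex_url(vertex_id, direction=None):
    """Build the URL of a vertex, or of its neighbours when a direction is given."""
    url = f"{REXSTER}/graphs/{quote(GRAPH)}/vertices/{quote(str(vertex_id))}"
    if direction in ("out", "in", "both"):
        url += f"/{direction}"
    return url


def fetch(url):
    """GET a URL and decode the JSON body (requires a reachable Rexster server)."""
    with urlopen(url) as response:
        return json.load(response)


print(vertex_url(1))
print(vertex_url(1, "out"))
```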

Our internal ontology: a sample

and now

www.atoka.io

★ ~5.9 MLN companies

★ >10 MLN persons

900k

updated weekly

★ Weekly web crawl of the Italian corporate

★ Real-time data collection from company social accounts

★ ~1600 online & offline newspapers (updated daily)

updated weekly

23

www.atoka.io

Search: how it works

1. Direct search of one particular company through its name or “partita IVA” (VAT number)

2. Content search into company websites [*]

3. Keyword search among entities extracted and refined from company resources [*]

[*] Dandelion API is the extraction engine
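The three search modes can be pictured with a toy in-memory index. This is only a sketch: the companies, field names and matching rules are hypothetical, and the `entities` sets stand in for what an extraction engine would produce from company resources.

```python
# Toy index. All records are invented for illustration.
COMPANIES = [
    {"name": "Acme Srl", "vat": "01234567890",
     "site_text": "We build industrial robots in Trento.",
     "entities": {"Robotics", "Trento"}},
    {"name": "Beta Spa", "vat": "09876543210",
     "site_text": "Organic wine from Tuscany.",
     "entities": {"Wine", "Tuscany"}},
]


def direct_search(query):
    """Mode 1: match a company by exact VAT number or by part of its name."""
    q = query.strip().lower()
    return [c["name"] for c in COMPANIES if q == c["vat"] or q in c["name"].lower()]


def content_search(term):
    """Mode 2: match text found on the company website."""
    return [c["name"] for c in COMPANIES if term.lower() in c["site_text"].lower()]


def entity_search(entity):
    """Mode 3: match entities previously extracted from company resources."""
    return [c["name"] for c in COMPANIES if entity in c["entities"]]


print(direct_search("01234567890"))
print(content_search("wine"))
print(entity_search("Robotics"))
```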

Corporate page

atoka.io

Some details on

• Five main “types”:
– Company
– Person
– Site
– Administrative Division
– Website

our infrastructure to crawl the Web for ATOKA

other details

Cerved • Company • People • Site • Position+Share

ISTAT • AdminDiv

ES

DBpedia • Company
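The source-to-type mapping listed above can be inverted to see which types are fed by more than one source and therefore need reconciliation. This is a small sketch using only the labels from the slide; the unexplained "ES" entry is left out.

```python
# Which source provides which types, as listed on the slide ("ES" omitted).
SOURCES = {
    "Cerved": ["Company", "People", "Site", "Position+Share"],
    "ISTAT": ["AdminDiv"],
    "DBpedia": ["Company"],
}


def sources_by_type(sources):
    """Invert the mapping: for each type, list the sources that provide it."""
    index = {}
    for source, types in sources.items():
        for type_name in types:
            index.setdefault(type_name, []).append(source)
    return index


by_type = sources_by_type(SOURCES)
# Types provided by more than one source are candidates for reconciliation.
shared = [t for t, srcs in by_type.items() if len(srcs) > 1]
print(by_type)
print(shared)
```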

cluster computing

something really interesting on OpenRefine

OpenRefine as usual

OpenRefine on Spark

it rocks! :)

more background details on http://blog.spaziodati.eu/wp-content/uploads/2015/07/RefineOnSpark.pdf
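The slides do not describe how OpenRefine was ported to Spark. As a loose illustration of the general idea only, the sketch below applies a row-wise cleanup to partitions of a dataset in parallel. It uses a thread pool from the Python standard library as a stand-in for a cluster; the cleanup rule and the sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_row(row):
    """Row-wise cleanup: collapse whitespace and normalise capitalisation."""
    return {key: " ".join(value.split()).title() for key, value in row.items()}


def partition(rows, size):
    """Split the rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def clean_partition(rows):
    """Apply the row-wise cleanup to every row of one partition."""
    return [clean_row(row) for row in rows]


def parallel_clean(rows, partition_size=2, workers=4):
    """Clean partitions concurrently, keeping the original row order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = list(pool.map(clean_partition, partition(rows, partition_size)))
    return [row for part in cleaned for row in part]


rows = [{"name": "  acme   srl"}, {"name": "BETA spa "}, {"name": "gamma  snc"}]
result = parallel_clean(rows)
print(result)
```

Because each row is cleaned independently of the others, the work splits cleanly across partitions, which is what makes this kind of operation a natural fit for cluster computing.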

Thanks :)

@spaziodati

brunati@spaziodati.eu

References

1) From raw data to dataGEMs for developers - http://ceur-ws.org/Vol-1268/paper1.pdf
2) Knowledge Graph ovunque - http://www.slideshare.net/dagoneye/knowledge-graphs-ovunque-un-quadro-di-insieme-e-le-implicazioni-per-uno-sviluppo-condiviso-del-web-of-data
3) Linking Enterprise Data - https://www.springer.com/it/book/9781441976642
4) Using OpenRefine - https://www.packtpub.com/big-data-and-business-intelligence/using-openrefine
5) Why Your Business Needs A Customer Data Knowledge Graph - http://www.dataversity.net/business-needs-customer-data-knowledge-graph/
6) Enabling parallel processing for OpenRefine: Spark vs Akka - http://refinepro.com/blog/enabling-parallel-processing-for-openrefine-spark-vs-akka/