Data Curation @ SpazioDati - NEXA Lunch Seminar

Matteo Brunati @dagoneye

22/07/2015

Data Curation @SpazioDati

33rd Nexa Lunch Seminar

http://nexa.polito.it/lunch-33

Big Data - Linked Data - Machine Learning

spaziodati.eu

a lot of European projects

http://www.spaziodati.eu/en/#research

Data Curation?

https://www.google.com/search?q=data+curation&ie=utf-8&oe=utf-8

Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.

http://strataconf.com/stratany2014/public/schedule/detail/36021
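A minimal sketch of what that definition can look like in practice, assuming two made-up sources that describe the same companies with different field names. The records, the field names and the `normalize_vat` helper are illustrative assumptions, not SpazioDati's actual pipeline. The normalization rule stands in for the domain knowledge that guides the process.

```python
# Two independently created sources (hypothetical records and field names).
registry = [
    {"vat": "IT 01234567890", "name": "Acme S.r.l."},
    {"vat": "09876543210", "name": "Beta SpA"},
]
web_crawl = [
    {"partita_iva": "01234567890", "website": "http://acme.example"},
]


def normalize_vat(raw):
    """Keep only the digits, so 'IT 01234567890' and '01234567890' match."""
    return "".join(ch for ch in raw if ch.isdigit())


def unify(registry_rows, crawl_rows):
    """Merge both sources into one record per normalized VAT number."""
    unified = {}
    for row in registry_rows:
        key = normalize_vat(row["vat"])
        unified[key] = {"vat": key, "name": row["name"], "website": None}
    for row in crawl_rows:
        key = normalize_vat(row["partita_iva"])
        # setdefault keeps crawl-only companies instead of dropping them.
        record = unified.setdefault(key, {"vat": key, "name": None, "website": None})
        record["website"] = row["website"]
    return sorted(unified.values(), key=lambda r: r["vat"])


dataset = unify(registry, web_crawl)
```

Here `dataset` holds one record per company, ready for analysis, even though neither source had all the fields on its own.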

a lot of things involved

* ETL (Extract-Transform-Load) tools
* Data Science tools
* Linked Data tools
* Big Data tools
* Domain Knowledge

why do we need a data curation process?

it’s our mantra: ALL YOU NEED IS DATA

accessible for everyone :)

lat 00° 00’ 00” -> GPS -> Smartphones -> UI iPhone / Android

it’s our mantra: ALL YOU NEED IS DATA

we are building two products

Dandelion API Text Analytics as a service

www.dandelion.eu

Sales Intelligence

www.atoka.io

<Powered by SpazioDati> codename

data platform

Why a knowledge graph?

Our Entity Extraction API is based on a graph

[Figure: graph linking the entities Brussels, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium and Champ de Mars; the value 0.44 appears twice as a label]

https://dandelion.eu/docs/api/datatxt/nex/v1/
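A hedged sketch of how a client might prepare a call to the entity extraction API and filter what comes back. The endpoint, the parameter names (`text`, `token`, `min_confidence`) and the response fields (`annotations`, `spot`, `title`, `confidence`) are assumptions to be checked against the documentation page linked above. The sample response is hand-written, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint assumed from the documentation page linked above.
NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1"


def build_nex_url(text, token, min_confidence=0.6):
    """Build a GET URL for the entity extraction call (no request is sent)."""
    params = {"text": text, "min_confidence": min_confidence, "token": token}
    return NEX_ENDPOINT + "?" + urlencode(params)


def confident_entities(response, threshold):
    """Return (spot, title) pairs whose confidence is at least `threshold`."""
    return [
        (a["spot"], a["title"])
        for a in response.get("annotations", [])
        if a["confidence"] >= threshold
    ]


# Hand-written sample shaped like an API response; the values are made up.
sample_response = {
    "annotations": [
        {"spot": "Brussels", "title": "Brussels", "confidence": 0.81},
        {"spot": "stadium", "title": "King Baudouin Stadium", "confidence": 0.35},
    ]
}

url = build_nex_url("The race was held in Brussels", token="YOUR_TOKEN")
entities = confident_entities(sample_response, threshold=0.5)
```

Filtering on confidence keeps only the mentions the extractor is reasonably sure about, which matters when the results feed a search index.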

CONTEXTUAL DATA

* different sources; different semantics;
* companies, people, Wikipedia topics, POI…
* simple to query on traversals
* global statistics
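To make the traversal and global statistics points concrete, here is a small sketch of a typed graph held in plain Python structures. The nodes, the types and the edges are hypothetical and chosen only for illustration; a real deployment would use a graph database rather than dictionaries.

```python
from collections import Counter, deque

# Hypothetical typed nodes and edges; none of this is real data.
node_type = {
    "Acme S.r.l.": "company",
    "Mario Rossi": "person",
    "Trento": "poi",
    "Software": "topic",
    "Beta SpA": "company",
}
edges = [
    ("Mario Rossi", "Acme S.r.l."),
    ("Acme S.r.l.", "Trento"),
    ("Acme S.r.l.", "Software"),
    ("Beta SpA", "Software"),
]

# Undirected adjacency list built from the edge list.
adjacency = {name: set() for name in node_type}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)


def within_hops(start, max_hops):
    """Breadth-first traversal: nodes reachable in at most `max_hops` steps."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return {n for n, d in seen.items() if d > 0}


# Traversal: companies within two hops of Acme (here, via a shared topic).
related = {n for n in within_hops("Acme S.r.l.", 2) if node_type[n] == "company"}

# Global statistic: how many nodes of each type the graph holds.
type_counts = Counter(node_type.values())
```

Because companies, people, places and topics live in one structure, the same traversal works across sources that started out with different semantics.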

why a knowledge graph

let’s start with some details on the “Powered by SpazioDati” data platform…

http://blog.spaziodati.eu/en/2014/10/21/spaziodati-at-iswc-2014-visit-our-booth-research-plans-available/

“Powered by SpazioDati” data platform backstage

PWR-BY-SD

Tools involved

* OpenRefine
* Azkaban Open Source Workflow Manager (https://azkaban.github.io/)
* Apache Silk: The Linked Data Integration Framework
* Titan graph db
* Apache Cassandra

http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/

starting from OpenRefine to clean up the data easily, for example

* reconcile and clean up the data
* align the data model to our internal ontologies, using RDF skeletons
* export the RDF modelled using our rules
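The OpenRefine steps above happen in its user interface, but the idea can be sketched in plain Python: clean each value, map columns to ontology properties through a skeleton, and export triples. The namespaces, the property names and the column names below are hypothetical, since the internal ontology is not shown here, and reconciliation against an external service is left out.

```python
# Hypothetical namespaces; the internal ontology is not shown in the slides.
ONTOLOGY = "http://example.org/ontology/"
RESOURCE = "http://example.org/company/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The "skeleton": which column feeds which ontology property.
SKELETON = {"name": ONTOLOGY + "legalName", "city": ONTOLOGY + "city"}


def clean(value):
    """Collapse repeated whitespace, a typical clean-up step."""
    return " ".join(value.split())


def escape(literal):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(row):
    """Emit one rdf:type triple plus one triple per mapped, non-empty column."""
    subject = "<" + RESOURCE + row["vat"] + ">"
    lines = ["%s <%s> <%sCompany> ." % (subject, RDF_TYPE, ONTOLOGY)]
    for column, prop in SKELETON.items():
        value = clean(row.get(column, ""))
        if value:
            lines.append('%s <%s> "%s" .' % (subject, prop, escape(value)))
    return lines


triples = row_to_ntriples({"vat": "01234567890", "name": "  Acme   S.r.l. ", "city": ""})
```

Keeping the column-to-property mapping in one dictionary means the same export rule can be reused for every dataset that is aligned to the same model.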

in other words…

Rexster: JSON-based REST interface to Titan
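A rough sketch of how a client could address graph resources over such a REST interface. The host, the port and the graph name are placeholders, and the path layout (`/graphs/<graph>/vertices/<id>` plus a direction suffix) and the `results` wrapper follow the Rexster conventions as an assumption that should be verified against the Rexster documentation. Nothing is sent over the network.

```python
from urllib.parse import quote

# Assumed deployment details: host, port and graph name are placeholders.
BASE_URL = "http://localhost:8182"
GRAPH = "graph"


def vertex_url(vertex_id):
    """URL of a single vertex resource."""
    return "%s/graphs/%s/vertices/%s" % (BASE_URL, GRAPH, quote(str(vertex_id), safe=""))


def neighbours_url(vertex_id, direction="out"):
    """URL listing adjacent vertices; direction is 'out', 'in' or 'both'."""
    if direction not in ("out", "in", "both"):
        raise ValueError("direction must be 'out', 'in' or 'both'")
    return vertex_url(vertex_id) + "/" + direction


def results_of(payload):
    """Return the 'results' part of a decoded JSON answer, or an empty list."""
    return payload.get("results", [])
```

For example, `neighbours_url(42, "both")` gives the address that would list every vertex adjacent to vertex 42, which is a one-step traversal expressed as a URL.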

Our internal ontology: a sample

and now

www.atoka.io

★ ~5.9 million companies

★ >10 million people

updated weekly

★ Weekly web crawl of the Italian corporate

★ Real-time data collection from company social accounts

★ ~1600 online & offline newspapers (updated daily)

updated weekly

www.atoka.io

Search: how it works

1. Direct search of one particular company through its name or “partita IVA” (VAT number)

2. Content search into company websites [*]

3. Keyword search among extracted and refined entities from company resources [*]

[*] Dandelion API is the extraction engine
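The three search modes above can be sketched over a tiny in-memory list. The company records are hypothetical, the matching is deliberately naive, and the `entities` field stands for topics already extracted from company resources; none of this reflects how Atoka is actually implemented.

```python
# Hypothetical company records; `entities` stands for topics extracted
# from company resources (the slides name Dandelion API as the extractor).
companies = [
    {"name": "Acme S.r.l.", "vat": "01234567890",
     "website_text": "We build industrial robots", "entities": {"Robotics"}},
    {"name": "Beta SpA", "vat": "09876543210",
     "website_text": "Organic wine from Trentino", "entities": {"Wine", "Trentino"}},
]


def direct_search(query):
    """1. Match a company by exact VAT number or by name substring."""
    q = query.strip().lower()
    return [c["name"] for c in companies
            if c["vat"] == q or q in c["name"].lower()]


def content_search(query):
    """2. Match the text collected from company websites."""
    q = query.strip().lower()
    return [c["name"] for c in companies if q in c["website_text"].lower()]


def entity_search(entity):
    """3. Match the extracted and refined entities, not the raw text."""
    return [c["name"] for c in companies if entity in c["entities"]]
```

The difference between the last two modes is the point: `content_search("robots")` matches a literal word on the website, while `entity_search("Robotics")` matches a concept the extractor attached to the company.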

Corporate page

atoka.io

Some details on

• Five main “types”:
– Company
– Person
– Site
– Administrative Division
– Website

our infrastructure to crawl the Web for ATOKA

other details

Cerved • Company • People • Site • Position+Share

ISTAT • AdminDiv

DBPedia • Company

cluster computing

something really interesting on OpenRefine

OpenRefine as usual

OpenRefine on Spark

it rocks! :)

more background details on http://blog.spaziodati.eu/wp-content/uploads/2015/07/RefineOnSpark.pdf

Thanks :)

@spaziodati

brunati@spaziodati.eu

