Using entity extraction extension with OpenRefine and dataTXT APIs
Uploaded by: spaziodati
Category: Technology
food for thought
What we are talking about
OpenRefine (www.openrefine.org)
NER extension integrated with the dataTXT-NEX API (dandelion.eu)
http://freeyourmetadata.org/named-entity-extraction/
What industries are using OpenRefine?
https://groups.google.com/d/msg/openrefine/vA75Ac_XODo/AfG8IRlEfSAJ
data journalists
metadata curators
museums
libraries
research labs
SEO folks
data scientists
enterprises
universities
patent attorneys
Open Data hackers
Social Media specialists
civil servants
What does OpenRefine offer that other data-parsing tools don't?
http://opendata.stackexchange.com/questions/515/what-does-openrefine-offer-that-other-data-parsing-tools-dont
reconciliation of text data against reference data services containing strong identifiers (Freebase, OpenCorporates, any SPARQL or RDF, etc.)
simple linking of reconciled entities to other info sources like Wikipedia, MusicBrainz, IMDB, etc.
[…]
[…]
How are we using it at SpazioDati?
normalize, clean and extract data from different sources
reconcile against internal reconciliation services (administrative regions, names and telephone numbers…)
apply rules and transformations to data, and align it with our internal ontologies
A look at OpenRefine & reconciliation
Why is reconciliation useful?
Instruments
bla bla bla
bla bla bla bla
…
what kind of instruments?
reconciliation identifies keywords in flowing text and gives them a URL
from strings to things
instruments data column
musical instruments → URL
measuring instruments → URL
aeronautical instruments → URL
Instruments
bla bla bla
bla bla bla bla
reconciliation works great for those fields in your dataset that contain single terms
names of people, countries, works of art […]
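A minimal sketch of the "strings to things" step, assuming a reconciliation service that accepts a batch of queries keyed `q0`, `q1`, … and returns scored candidates for each, in the style of the OpenRefine reconciliation protocol. The helper names, the score threshold, the `example.org` identifiers and the sample response are all hypothetical and hand-written for illustration; no real service is called.

```python
import json


def build_queries(cells, type_id=None, limit=3):
    """Build one reconciliation query per cell value, keyed q0, q1, ..."""
    queries = {}
    for i, value in enumerate(cells):
        query = {"query": value, "limit": limit}
        if type_id is not None:
            query["type"] = type_id
        queries[f"q{i}"] = query
    return queries


def best_matches(cells, response, min_score=50.0):
    """Map each cell to the id of its top-scoring candidate, or None."""
    matches = {}
    for i, value in enumerate(cells):
        candidates = response.get(f"q{i}", {}).get("result", [])
        best = max(candidates, key=lambda c: c["score"], default=None)
        if best is not None and best["score"] >= min_score:
            matches[value] = best["id"]
        else:
            matches[value] = None
    return matches


cells = ["musical instruments", "measuring instruments"]

# Form payload a client would send; the queries are serialized as JSON.
payload = {"queries": json.dumps(build_queries(cells))}

# Hypothetical response, not the output of any real service.
sample_response = {
    "q0": {"result": [
        {"id": "http://example.org/thing/musical-instrument",
         "name": "Musical instrument", "score": 92.0, "match": True},
        {"id": "http://example.org/thing/instrument",
         "name": "Instrument", "score": 40.0, "match": False},
    ]},
    "q1": {"result": []},
}

links = best_matches(cells, sample_response)
```

Cells with no candidate above the threshold stay unlinked (`None`), which mirrors the point above: single, well-defined terms reconcile well, while anything ambiguous is left for manual review.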
and what if we have a column with unstructured texts, like this one?
we need a new step in the data curation workflow…
a new data column, labelled “dataTXT”
extract named entities using the NER extension + dataTXT API
data column with some texts
in this column, there are named concepts, linked to Wikipedia
label + URI
“Collective action” + http://en.wikipedia.org/wiki/Collective_action
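To make the label + URI pairing concrete, here is a hedged sketch of how a client might build a dataTXT-NEX request and read the annotations back. The endpoint URL, the `text` and `token` parameter names and the `annotations`, `label` and `uri` response fields are assumptions to verify against the Dandelion documentation; the sample response is hand-written and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the Dandelion API documentation before use.
NEX_ENDPOINT = "https://api.dandelion.eu/datatxt/nex/v1"


def build_nex_url(text, token):
    """Build a GET URL for an entity extraction request (assumed parameters)."""
    return NEX_ENDPOINT + "?" + urlencode({"text": text, "token": token})


def extract_entities(response):
    """Return (label, uri) pairs from a parsed JSON response."""
    return [(a["label"], a["uri"]) for a in response.get("annotations", [])]


# Hypothetical response shape, written by hand for illustration.
sample_response = {
    "annotations": [
        {
            "spot": "collective action",
            "label": "Collective action",
            "uri": "http://en.wikipedia.org/wiki/Collective_action",
        },
    ]
}

url = build_nex_url("Farmers organised a collective action.", token="YOUR_TOKEN")
entities = extract_entities(sample_response)
```

In OpenRefine the NER extension performs this call for every row and writes the result into the new column, so a script like this is only needed outside the tool.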
make a text filter
looking for a concept
classify and categorize the content …
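Once every row carries its extracted concepts, filtering and categorizing reduce to simple lookups. The sketch below assumes each row is a dictionary with an `entities` list of concept labels; the row layout and both function names are hypothetical, chosen only to illustrate the idea outside OpenRefine.

```python
from collections import defaultdict

# Hypothetical rows: an id plus the concept labels extracted from the text.
rows = [
    {"id": 1, "entities": ["Collective action", "Trade union"]},
    {"id": 2, "entities": ["Open data"]},
    {"id": 3, "entities": ["Collective action"]},
]


def filter_by_concept(rows, concept):
    """Keep the rows whose entities include the concept (case-insensitive)."""
    wanted = concept.casefold()
    return [r for r in rows if any(e.casefold() == wanted for e in r["entities"])]


def group_by_concept(rows):
    """Index row ids by concept, so each concept acts as a category."""
    groups = defaultdict(list)
    for row in rows:
        for entity in row["entities"]:
            groups[entity].append(row["id"])
    return dict(groups)


matching = filter_by_concept(rows, "collective action")
categories = group_by_concept(rows)
```

Because the filter compares concept labels rather than raw text, it also finds rows where the concept was mentioned with different wording but linked to the same entity.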
things, not strings
some scenarios
Open Data community real issues
Using OpenRefine + NER extension with dataTXT-NEX
extract meaningful information from CVs, such as names, organizations, skills, …
http://opendata.stackexchange.com/search?page=3&tab=relevance&q=extraction
normalize organization names cited in texts
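One way to approach name normalization is key-based clustering, in the spirit of the "fingerprint" method that OpenRefine offers for clustering cell values. The sketch below is a simplified approximation, not OpenRefine's own implementation, and the organization names are invented for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalization key: ASCII, lowercase, no punctuation, sorted unique tokens."""
    # Decompose accented characters and drop what is not ASCII.
    ascii_value = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = ascii_value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(names):
    """Group names sharing a fingerprint; only groups with variants are returned."""
    clusters = defaultdict(list)
    for name in names:
        clusters[fingerprint(name)].append(name)
    return [group for group in clusters.values() if len(group) > 1]


# Invented examples of the same organization written in different ways.
names = ["Acme Corp.", "acme corp", "Corp ACME", "Open Knowledge"]
variants = cluster(names)
```

Each cluster can then be collapsed to a single preferred spelling, or reconciled against a reference service to obtain an identifier.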
Data journalists
extract news relevant to a specific topic (a person, a brand or a company)
write a summary of a politician's speech, starting from the main concepts extracted from the text
mine specific information in judicial decisions (judge's name, court, area of law and neutral citation number)
Social media specialists
Text mining on tweets: extract brands, places and concepts easily from a Twitter stream related to an event
Text mining on website content: extract concepts and places easily from a webpage, to improve website SEO ranking
“Quantified Self” movement
Analytics on Personal Data
Understand your own bank account statements: extract useful information, such as brands and places, to categorize and classify your own expenses
#dataTXT #refine #ner
do you know other use cases? tell us on Twitter!
@spaziodati
dandelion.eu